Detecting if you are running in a container

Linux Container Development

* Detecting if you are running in a container
       [not found]       ` <20111010163140.GA22191@tango.0pointer.de>
@ 2011-10-10 20:59         ` Eric W. Biederman
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  1:32           ` Ted Ts'o
  0 siblings, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-10 20:59 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage


Cc's and subject updated so hopefully we get the correct people
on this discussion to make progress.

Lennart Poettering <mzxreary@0pointer.de> writes:

> To make a standard distribution run nicely in a Linux container you
> usually have to make quite a number of modifications to it and disable
> certain things from the boot process. Ideally however, one could simply
> boot the same image on a real machine and in a container and would just
> do the right thing, fully stateless. And for that you need to be able to
> detect containers, and currently you can't.

I agree getting to the point where we can run a standard distribution
unmodified in a container sounds like a reasonable goal.

> Quite a few kernel subsystems are
> currently not virtualized, for example SELinux, VTs, most of sysfs, most
> of /proc/sys, audit, udev or file systems (by which I mean that for a
> container you probably don't want to fsck the root fs, and so on), and
> containers tend to be much more lightweight than real systems.

That is an interesting viewpoint on what is not complete.  But as a
listing of the tasks that distribution startup needs to do differently in
a container the list seems more or less reasonable.

There are two questions:
- How in the general case do we detect if we are running in a container.
- How do we make reasonable tests during bootup to see if it makes sense
  to perform certain actions.

For the general detection if we are running in a linux container I can
see two reasonable possibilities.

- Put a file in / that lets you know by convention that you are in a
  linux container.  I am inclined to do this because this is something
  we can support on all kernels old and new.

- Allow modification to the output of uname(2).  The uts namespace
  already covers uname(2) and uname is the standard method to
  communicate to userspace the vagaries about the OS level environment
  they are running in.


My list of things that still have work left to do looks like:
- cgroups.  It is not safe to create new hierarchies with groups
  that are in existing hierarchies.  So cgroups don't work.

- user namespace.  We are very close to having something workable
  on this one, but until we do all of the users inside and outside
  of a container are the same, and pass the same permission checks.

  As a result we have to drop most of root's privileges, and we have
  to be a bit careful about which binaries that can gain privileges
  (think suid root) are in the container filesystem.

- Reboot.  I know Daniel was working on something not long ago
  but I am not certain where he wound up.

- device namespaces.  We periodically think about having a separate
  set of devices and to support things like losetup in a container
  that seems necessary.  Most of the time getting all of the way
  to device namespaces seems unnecessary.


As for tests on what to startup.

- udev.  All of the kernel interfaces for udev should be supported in
  current kernels.  However I believe udev is useless because container
  start drops CAP_MKNOD so we can't do evil things.  So I would
  recommend basing the startup of udev on presence of CAP_MKNOD.

- VTs.  Ptys should be well supported at this point.  For the rest
  they are physical hardware that a container should not be playing with
  so I would base which gettys to start up based on which device nodes
  are present in /dev.

- sysctls (aka /proc/sys).  That is a tricky one.  Until the user namespace
  is fleshed out a little more sysctls are going to be a problem,
  because root can write to most of them.  My gut feel says you probably
  want to base the decision to poke at sysctls on CAP_SYS_ADMIN.  At least
  that test will become true when the user namespaces are rolled out, and
  at that point you will want to set all of the sysctls you have permission
  to.

- audit.  My memory is very fuzzy on this one.  The issue in question is
  should we start auditd?  I believe the audit calls actually fail in a
  container so we should be able to trigger starting auditd on if audit
  works at all.  If we can't do it that way certainly the work should be
  put in so that it can be done that way.

- fsck.  A rw filesystem check like you mentioned earlier seems like a
  reasonable place to be.  I know the OpenVZ folks were talking about
  putting containers in their own block devices for their next round of
  supporting containers.  At which point a filesystem check on container
  startup might not be a bad idea at all.

- cgroups hierarchies.  I don't know at which point in the system
  startup we care.  The appropriate solution would seem to be to try
  it and if the operation fails figure it isn't supported.

- selinux.  It really should be in the same category.  You should be
  able to attempt to load a policy and have it fail in a way that
  indicates that selinux is not currently supported.  I don't know if
  we can make that work right until we get the user namespace into
  a usable shape.

In general things in a container should work or the kernel feature
should fail in a way that indicates that the feature is not supported.
That currently works well for the networking stack, and with the
pending usability of the user namespace it should work just about
everywhere else as well.  For things that don't fit that model we
need to fix the kernel.
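
The "try it and see whether it fails" approach described above can be sketched roughly as follows, in Python for brevity. The helper name feature_works and the particular set of errno values treated as "not supported here" are assumptions made for illustration; a real init system would pick the errors that matter for each feature.

```python
import errno

# Assumption: these errors mean "the feature is not available in this
# environment" rather than "something is genuinely broken".
UNSUPPORTED = {errno.EPERM, errno.EACCES, errno.ENOSYS, errno.EROFS, errno.ENOENT}


def feature_works(attempt):
    """Run attempt(); return False if it fails in a 'not supported' way.

    attempt is any callable that exercises the kernel feature, for example
    one that opens the audit socket or writes a sysctl. Unexpected errors
    are re-raised so that real failures are not silently hidden.
    """
    try:
        attempt()
    except OSError as exc:
        if exc.errno in UNSUPPORTED:
            return False
        raise
    return True
```

A boot script could then start auditd only if feature_works() returns True for a small audit probe, without ever asking whether it is in a container.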

So while I agree a check to see if something is a container seems
reasonable, I do not agree that the pid namespace is the place to put
that information.  I see no natural way to put that information in the
pid namespace.

I further think there are a lot of reasonable checks for if a
kernel feature is supported in the current environment I would
rather pursue over hacks based on the fact we are in a container.

Eric


* Re: Detecting if you are running in a container
  2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
@ 2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
                               ` (2 more replies)
  2011-10-11  1:32           ` Ted Ts'o
  1 sibling, 3 replies; 28+ messages in thread
From: Lennart Poettering @ 2011-10-10 21:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

> > Quite a few kernel subsystems are
> > currently not virtualized, for example SELinux, VTs, most of sysfs, most
> > of /proc/sys, audit, udev or file systems (by which I mean that for a
> > container you probably don't want to fsck the root fs, and so on), and
> > containers tend to be much more lightweight than real systems.
> 
> That is an interesting viewpoint on what is not complete.  But as a
> listing of the tasks that distribution startup needs to do differently in
> a container the list seems more or less reasonable.

Note that this is just what came to my mind while I was typing this, I
am quite sure there's actually more like this.

> There are two questions:
> - How in the general case do we detect if we are running in a container.
> - How do we make reasonable tests during bootup to see if it makes sense
>   to perform certain actions.
> 
> For the general detection if we are running in a linux container I can
> see two reasonable possibilities.
> 
> - Put a file in / that lets you know by convention that you are in a
>   linux container.  I am inclined to do this because this is something
>   we can support on all kernels old and new.

Hmpf. That would break the stateless read-only-ness of the root dir.

After pointing the issue out to the LXC folks they are now setting
"container=lxc" as env var when spawning a container. In systemd-nspawn
I have then adopted a similar scheme. Not sure though that that is
particularly nice however, since env vars are inherited further down the
tree where we probably don't want them.
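
A minimal sketch of the environment-variable convention just described, in Python for brevity. The variable name "container" and the value "lxc" come from this message; the helper name detect_container is made up for illustration, and this does not reproduce the logic in systemd's virt.c.

```python
import os


def detect_container(environ=None):
    """Return the container manager named by the `container` env var, or None.

    environ defaults to this process's environment; passing a mapping makes
    the function easy to test.
    """
    env = os.environ if environ is None else environ
    value = env.get("container", "")
    # A missing or empty variable is treated as "not in a container".
    return value or None
```

Because environment variables are inherited by child processes, a caller that only wants PID 1 to see the hint would have to remove the variable before spawning services.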

In case you are curious: this is the code we use in systemd:

http://cgit.freedesktop.org/systemd/tree/src/virt.c

What matters to me though is that we can generically detect Linux
containers instead of specific implementations.

> - Allow modification to the output of uname(2).  The uts namespace
>   already covers uname(2) and uname is the standard method to
>   communicate to userspace the vagaries about the OS level environment
>   they are running in.

Well, I am not a particular fan of having userspace tell userspace about
containers. I would prefer if userspace could get that info from the
kernel without any explicit agreement to set some specific variable.

That said detecting CLONE_NEWUTS by looking at the output of uname(2)
would be a workable solution for us. CLONE_NEWPID and CLONE_NEWUTS are
probably equally defining for what a container is, so I'd be happy if
we could detect either.

For example, if the kernel would append "(container)" or so to
utsname.machine[] after CLONE_NEWUTS is used I'd be quite happy.
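
To make the proposal concrete, here is a small Python sketch of how userspace might consume such a marker. Note that the "(container)" suffix is only the idea floated in this message, not something any kernel is known to report, and the helper name is hypothetical.

```python
import os

# Hypothetical marker taken from the proposal in this message.
SUFFIX = "(container)"


def machine_says_container(machine=None):
    """Return True if the uname machine field ends with the proposed marker."""
    if machine is None:
        machine = os.uname().machine
    return machine.rstrip().endswith(SUFFIX)
```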

> My list of things that still have work left to do looks like:
> - cgroups.  It is not safe to create new hierarchies with groups
>   that are in existing hierarchies.  So cgroups don't work.

Well, for systemd they actually work quite fine since systemd will
always place its own cgroups below the cgroup it is started in. cgroups
hence make these things nicely stackable.

In fact, most folks involved in cgroups userspace have agreed to these
rules now:

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Among other things they ask all userspace code to only create subgroups
below the group they are started in, so not only systemd should work
fine in a container environment but everything else following these
rules.

In other words: so far one gets away quite nicely with the fact that the
cgroup tree is not virtualized.
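
The rule "only create subgroups below the group you are started in" could be followed along these lines. This is a sketch in Python, assuming the /proc/self/cgroup line format "hierarchy-ID:controller-list:path"; the function names and the example mount point /sys/fs/cgroup/cpu are illustrative assumptions, not taken from systemd.

```python
import posixpath


def own_cgroups(text):
    """Map each controller list to our cgroup path, given /proc/self/cgroup text."""
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        result[controllers] = path
    return result


def subgroup_path(mount_point, own_path, name):
    """Directory for a new subgroup placed below the group we were started in."""
    return posixpath.join(mount_point, own_path.lstrip("/"), name)
```

For example, a service manager started in /lxc/box1 of the cpu hierarchy would create its groups under subgroup_path("/sys/fs/cgroup/cpu", "/lxc/box1", "svc") rather than at the top of the tree.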

> - device namespaces.  We periodically think about having a separate
>   set of devices and to support things like losetup in a container
>   that seems necessary.  Most of the time getting all of the way
>   to device namespaces seems unnecessary.

Well, I am sure people use containers in all kinds of weird ways, but
for me personally I am quite sure that containers should live in a
fully virtualized world and never get access to real devices.

> As for tests on what to startup.

Note again that my list above is not complete at all and the point I was
trying to make is that while you can find nice hooks for this for many
cases at the end of the day you actually do want to detect containers
for a few specific cases.

> - udev.  All of the kernel interfaces for udev should be supported in
>   current kernels.  However I believe udev is useless because container
>   start drops CAP_MKNOD so we can't do evil things.  So I would
>   recommend basing the startup of udev on presence of CAP_MKNOD.

Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
in a systemd world makes use of that.
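
One way such a capability test might look, sketched in Python: it reads the CapEff bitmask from /proc/self/status and checks a single bit. The capability numbers are the ones defined in <linux/capability.h>; the function names are made up for illustration and this is not the code udev or systemd uses.

```python
CAP_SYS_ADMIN = 21  # values from <linux/capability.h>
CAP_MKNOD = 27


def effective_caps(status_text):
    """Parse the CapEff hex bitmask out of /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def has_cap(status_text, cap):
    """Return True if capability number cap is in the effective set."""
    return bool((effective_caps(status_text) >> cap) & 1)


def should_start_udev(status_text):
    """Start udev only if we are still allowed to create device nodes."""
    return has_cap(status_text, CAP_MKNOD)
```

Typical use would be should_start_udev(open("/proc/self/status").read()); the same has_cap() helper covers the CAP_SYS_ADMIN test suggested for sysctls.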

> - VTs.  Ptys should be well supported at this point.  For the rest
>   they are physical hardware that a container should not be playing with
>   so I would base which gettys to start up based on which device nodes
>   are present in /dev.

Well, I am not sure it's that easy since device nodes tend to show up
dynamically in bare systems. So if you just check whether /dev/tty0 is
there you might end up thinking you are in a container when you actually
aren't simply because you did that check before udev loaded the DRI
driver or so.

> - sysctls (aka /proc/sys).  That is a tricky one.  Until the user namespace
>   is fleshed out a little more sysctls are going to be a problem,
>   because root can write to most of them.  My gut feel says you probably
>   want to base the decision to poke at sysctls on CAP_SYS_ADMIN.  At least
>   that test will become true when the user namespaces are rolled out, and
>   at that point you will want to set all of the sysctls you have permission
>   to.

So what we did right now in systemd-nspawn is that the container
supervisor premounts /proc/sys read-only into the container. That way
writes to it will fail in the container, and while you get a number of
warnings things will work as they should (though not necessarily safely
since the container can still remount the fs unless you take
CAP_SYS_ADMIN away).

> - selinux.  It really should be in the same category.  You should be
>   able to attempt to load a policy and have it fail in a way that
>   indicates that selinux is not currently supported.  I don't know if
>   we can make that work right until we get the user namespace into
>   a usable shape.

The SELinux folks modified libselinux on my request to consider selinux
off if /sys/fs/selinux is already mounted and read-only. That means with
a new container userspace this problem is mostly worked around too. It
is crucial to make libselinux know that selinux is off because otherwise
it will continue to muck with the xattr labels where it shouldn't. In
fact, if you want to fully virtualize this you probably should hide selinux
xattrs entirely in the container.
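
Both workarounds above (a read-only /proc/sys and a read-only /sys/fs/selinux) boil down to asking whether a given mount point is mounted read-only. A rough Python sketch that answers this from /proc/mounts-style text follows; the function name is hypothetical and this is not how libselinux itself implements the check.

```python
def mount_is_read_only(mounts_text, mount_point):
    """Return True/False for the last mount on mount_point, None if not mounted.

    mounts_text uses the /proc/mounts layout:
    device mount-point fstype options dump pass
    """
    state = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Later entries shadow earlier ones on the same mount point.
        state = "ro" in fields[3].split(",")
    return state
```

With this, a boot script could skip writing sysctls when mount_is_read_only(text, "/proc/sys") is True, where text is the contents of /proc/mounts.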

> So while I agree a check to see if something is a container seems
> reasonable, I do not agree that the pid namespace is the place to put
> that information.  I see no natural way to put that information in the
> pid namespace.

Well, a simple way would be to have a line in /proc/1/status called
"PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
increased for each namespace nested in it. Then, processes could simply
read that and be happy.

> I further think there are a lot of reasonable checks for if a
> kernel feature is supported in the current environment I would
> rather pursue over hacks based on the fact we are in a container.

Well, believe me we have been trying to find nicer hooks than explicit
checks for containers, but I am quite sure that at the end of the day
you won't be able to go without it entirely.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


* Re: Detecting if you are running in a container
  2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
  2011-10-10 21:41           ` Lennart Poettering
@ 2011-10-11  1:32           ` Ted Ts'o
       [not found]             ` <20111011020530.GG16723@count0.beaverton.ibm.com>
  1 sibling, 1 reply; 28+ messages in thread
From: Ted Ts'o @ 2011-10-11  1:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Lennart Poettering, Matt Helsley, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
> Lennart Poettering <mzxreary@0pointer.de> writes:
> 
> > To make a standard distribution run nicely in a Linux container you
> > usually have to make quite a number of modifications to it and disable
> > certain things from the boot process. Ideally however, one could simply
> > boot the same image on a real machine and in a container and would just
> > do the right thing, fully stateless. And for that you need to be able to
> > detect containers, and currently you can't.
> 
> I agree getting to the point where we can run a standard distribution
> unmodified in a container sounds like a reasonable goal.

Hmm, interesting.  It's not clear to me that running a full standard
distribution in a container is always going to be what everyone wants
to do.

The whole point of containers versus VM's is that containers are
lighter weight.  And one of the ways that containers can be lighter
weight is if you don't have to have N copies of udev, dbus, running in
each container/VM.

If you end up with so much overhead to provide the desired security and/or
performance isolation, then it becomes fair to ask the question
whether you might as well pay a tad bit more and get even better
security and isolation by using a VM solution....

	     	       	  	     - Ted


* Re: Detecting if you are running in a container
       [not found]             ` <20111011020530.GG16723@count0.beaverton.ibm.com>
@ 2011-10-11  3:25               ` Ted Ts'o
  2011-10-11  6:42                 ` Eric W. Biederman
  2011-10-11 22:25               ` david
  1 sibling, 1 reply; 28+ messages in thread
From: Ted Ts'o @ 2011-10-11  3:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Eric W. Biederman, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote:
> Yes, it does detract from the unique advantages of using a container.
> However, I think the value here is not the efficiency of the initial
> system configuration but the fact that it gives users a better place to
> start.
> 
> Right now we're effectively asking users to start with non-working
> and/or unfamiliar systems and repair them until they work.

If things are not working with containers, I would submit to you that
we're doing something wrong(tm).  Things should just work, except that
processes in one container can't use more than their fair share (as
dictated by policy) of memory, CPU, networking, and I/O bandwidth.

Something which is baked in my world view of containers (which I
suspect is not shared by other people who are interested in using
containers) is that given that kernel is shared, trying to use
containers to provide better security isolation between mutually
suspicious users is hopeless.  That is, it's pretty much impossible to
prevent a user from finding one or more zero day local privilege
escalation bugs that will allow a user to break root.  And at that
point, they will be able to penetrate the kernel, and from there,
break security of other processes.

So if you want that kind of security isolation, you shouldn't be using
containers in the first place.  You should be using KVM or Xen, and
then only after spending a huge amount of effort fuzz testing the
KVM/Xen paravirtualization interfaces.  So at least in my mind, adding
vast amounts of complexities to try to provide security isolation via
containers is really not worth it.  And if that's the model, then it's
a lot easier to run jobs in containers that don't require changes to
the distro plus a huge increase of complexity for containers in the
kernel....

						- Ted


* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
@ 2011-10-11  5:40             ` Eric W. Biederman
  2011-10-11  6:54             ` Eric W. Biederman
  2011-10-12 16:59             ` Kay Sievers
  2 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-11  5:40 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
>
>> > Quite a few kernel subsystems are
>> > currently not virtualized, for example SELinux, VTs, most of sysfs, most
>> > of /proc/sys, audit, udev or file systems (by which I mean that for a
>> > container you probably don't want to fsck the root fs, and so on), and
>> > containers tend to be much more lightweight than real systems.
>> 
>> That is an interesting viewpoint on what is not complete.  But as a
>> listing of the tasks that distribution startup needs to do differently in
>> a container the list seems more or less reasonable.
>
> Note that this is just what came to my mind while I was typing this, I
> am quite sure there's actually more like this.
>
>> There are two questions:
>> - How in the general case do we detect if we are running in a container.
>> - How do we make reasonable tests during bootup to see if it makes sense
>>   to perform certain actions.
>> 
>> For the general detection if we are running in a linux container I can
>> see two reasonable possibilities.
>> 
>> - Put a file in / that lets you know by convention that you are in a
>>   linux container.  I am inclined to do this because this is something
>>   we can support on all kernels old and new.
>
> Hmpf. That would break the stateless read-only-ness of the root dir.
>
> After pointing the issue out to the LXC folks they are now setting
> "container=lxc" as env var when spawning a container. In systemd-nspawn
> I have then adopted a similar scheme. Not sure though that that is
> particularly nice however, since env vars are inherited further down the
> tree where we probably don't want them.

Interesting.  That seems like a reasonable enough thing to require
of the programs that create containers.

> In case you are curious: this is the code we use in systemd:
>
> http://cgit.freedesktop.org/systemd/tree/src/virt.c
>
> What matters to me though is that we can generically detect Linux
> containers instead of specific implementations.

>> - Allow modification to the output of uname(2).  The uts namespace
>>   already covers uname(2) and uname is the standard method to
>>   communicate to userspace the vagaries about the OS level environment
>>   they are running in.
>
> Well, I am not a particular fan of having userspace tell userspace about
> containers. I would prefer if userspace could get that info from the
> kernel without any explicit agreement to set some specific variable.

Well userspace tells userspace about stdin and it works reliably.

Containers are a userspace construct built with kernel facilities.
I don't see why asking userspace to implement a convention is any more
important than the other things that have to be done.

We do need to document the conventions, just like we document the
standard device names, but beyond that we should be fine.

>> My list of things that still have work left to do looks like:
>> - cgroups.  It is not safe to create new hierarchies with groups
>>   that are in existing hierarchies.  So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> Among other things they ask all userspace code to only create subgroups
> below the group they are started in, so not only systemd should work
> fine in a container environment but everything else following these
> rules.
>
> In other words: so far one gets away quite nicely with the fact that the
> cgroup tree is not virtualized.

Assuming you bind mount the cgroups inside and generally don't allow
people in a container to create cgroup hierarchies.  At the very least
that is nasty information leakage.

But I am glad there is a solution for right now.

For my uses I have yet to find control groups anything but borked.

>> - VTs.  Ptys should be well supported at this point.  For the rest
>>   they are physical hardware that a container should not be playing with
>>   so I would base which gettys to start up based on which device nodes
>>   are present in /dev.
>
> Well, I am not sure it's that easy since device nodes tend to show up
> dynamically in bare systems. So if you just check whether /dev/tty0 is
> there you might end up thinking you are in a container when you actually
> aren't simply because you did that check before udev loaded the DRI
> driver or so.

But the point isn't to detect a container the point is to decide if
a getty needs to be spawned.  Even with the configuration for a getty
you need to wait for the device node to exist before spawning one.
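
The device-node test argued for here might be sketched like this in Python. The function name is made up, and this is a one-shot check; as noted above, a real init would also have to react when a node appears later.

```python
import os
import stat


def ttys_needing_getty(names, dev_dir="/dev"):
    """Return the tty names that currently exist as character devices."""
    present = []
    for name in names:
        try:
            mode = os.stat(os.path.join(dev_dir, name)).st_mode
        except OSError:
            continue  # node absent (or unreadable): no getty for it
        if stat.S_ISCHR(mode):
            present.append(name)
    return present
```

Nothing in this check asks whether we are in a container; a container with no /dev/tty1 simply gets no getty on it.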

>> - sysctls (aka /proc/sys).  That is a tricky one.  Until the user namespace
>>   is fleshed out a little more sysctls are going to be a problem,
>>   because root can write to most of them.  My gut feel says you probably
>>   want to base the decision to poke at sysctls on CAP_SYS_ADMIN.  At least
>>   that test will become true when the user namespaces are rolled out, and
>>   at that point you will want to set all of the sysctls you have permission
>>   to.
>
> So what we did right now in systemd-nspawn is that the container
> supervisor premounts /proc/sys read-only into the container. That way
> writes to it will fail in the container, and while you get a number of
> warnings things will work as they should (though not necessarily safely
> since the container can still remount the fs unless you take
> CAP_SYS_ADMIN away).

That sort of works.  In practice it means you can't setup interesting
things like forwarding in the networking stack.  But it certainly gets
things going.

>> So while I agree a check to see if something is a container seems
>> reasonable, I do not agree that the pid namespace is the place to put
>> that information.  I see no natural way to put that information in the
>> pid namespace.
>
> Well, a simple way would be to have a line in /proc/1/status called
> "PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
> increased for each namespace nested in it. Then, processes could simply
> read that and be happy.

Not a chance.  PIDNamespaceLevel is exposing an implementation
detail that may well change in the lifetime of a process.  It is true
we don't have migration merged in the kernel yet but one of these days
I expect we will.

>> I further think there are a lot of reasonable checks for if a
>> kernel feature is supported in the current environment I would
>> rather pursue over hacks based on the fact we are in a container.
>
> Well, believe me we have been trying to find nicer hooks than explicit
> checks for containers, but I am quite sure that at the end of the day
> you won't be able to go without it entirely.

And you have explicit information you are in a container at this point.

It looks like all that is left is Documentation of the conventions.

Eric


* Re: Detecting if you are running in a container
  2011-10-11  3:25               ` Ted Ts'o
@ 2011-10-11  6:42                 ` Eric W. Biederman
  2011-10-11 12:53                   ` Theodore Tso
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-11  6:42 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

Ted Ts'o <tytso@mit.edu> writes:

> On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote:
>> Yes, it does detract from the unique advantages of using a container.
>> However, I think the value here is not the efficiency of the initial
>> system configuration but the fact that it gives users a better place to
>> start.
>> 
>> Right now we're effectively asking users to start with non-working
>> and/or unfamiliar systems and repair them until they work.
>
> If things are not working with containers, I would submit to you that
> we're doing something wrong(tm). 

That is what this discussion is about.  What we are doing wrong(tm).
Mostly it is about the bits that have not yet been namespacified but
need to be.

I am totally in favor of not starting the entire world.  But just
like I find it convenient to loopback mount an iso image to see
what is on a disk image, it would be handy to be able to just
download a distro image and play with it, without doing anything
special.

We can pare things down further for the people who are running 1000
copies of apache, but not requiring detailed distro surgery before
starting up the binaries on a livecd sounds handy.

> Things should just work, except that
> processes in one container can't use more than their fair share (as
> dictated by policy) of memory, CPU, networking, and I/O bandwidth.

You have to be careful with the limiters.  The fundamental reason
why containers are more efficient than hardware virtualization is
that with containers we can do over commit of resources, especially
memory.  I keep seeing implementations of resource limiters that want
to do things in a heavy handed way that break resource over commit.

> Something which is baked in my world view of containers (which I
> suspect is not shared by other people who are interested in using
> containers) is that given that kernel is shared, trying to use
> containers to provide better security isolation between mutually
> suspicious users is hopeless.  That is, it's pretty much impossible to
> prevent a user from finding one or more zero day local privilege
> escalation bugs that will allow a user to break root.  And at that
> point, they will be able to penetrate the kernel, and from there,
> break security of other processes.

You don't even have to get to security problems to have that concern.
There are enough crazy timing and side channel attacks.

I don't know what concern you have security wise, but the problem that
wants to be solved with user namespaces is something you hit much
earlier than when you worry about sharing a kernel between mutually
distrusting users.  Right now root inside a container is root
outside of a container just like in a chroot jail.  Where this becomes a
problem is that people change things like
/proc/sys/kernel/print-fatal-signals expecting it to be a setting local
to their sandbox when in fact it is the global setting, and things start
behaving weirdly for other users.  Running sysctl -a during bootup
has that problem in spades.

With user namespaces what we get is that the global root user is not the
container root user and we have been working our way through the
permission checks in the kernel to ensure we get them right in the
context of the user namespace.  This trivially means that the things
that we allow the global root user to do in /proc and /sys and
the like simply won't be allowed as a container root user.  Which
makes doing something stupid and affecting other people much more
difficult.

What the user namespace also allows is an escape hatch from the
bonds of suid.  Right now anything that could confuse an existing
app that is suid root we have to only allow to root, or risk
adding a security hole.  With the user namespaces we can relax
that check and allow it also for container root users as well
as global root users.  When we are brave enough and certain
enough of our code we can allow non-root users to create their
own user namespaces.

There is the third use for containers where for some reason
we have uid assignment overlap.  Perhaps one distro assigns
uid 22 to sshd and another to the nobody user.  Or perhaps there
are two departments that have done the silly thing
of assigning overlapping uids to their users and we want to
access filesystems created by both departments at the same
time without a chance of confusion and conflict.

With my sysadmin hat on I would not want to touch two untrusting groups
of users on the same machine, because of the probability that there is at
least one security hole that can be found and exploited to allow
privilege escalation.

With my kernel developer hat on I can't just surrender to the
idea that there will in fact be a privilege escalation bug that
is easy to exploit.  The code has to be built and designed so that
privilege escalation is difficult.  Otherwise we might as well
assume that if you visit a website a stealthy worm has taken over your
computer.

It is my hope at the end of the day that the user namespaces will be one
more line of defense in messing up and slowing down the evil omniscient
worms that seem to unerringly go for every privilege exploit there is.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
@ 2011-10-11  6:54             ` Eric W. Biederman
  2011-10-12 16:59             ` Kay Sievers
  2 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-11  6:54 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg,
	Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

>> My list of things that still have work left to do looks like:
>> - cgroups.  It is not safe to create new hierarchies with groups
>>   that are in existing hierarchies.  So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Wow.   Are cgroups really that complicated to use?  A list of rules
a page long on what you have to do to make them useful and non-conflicting.
Something seems off.  Perhaps we need a rule: don't mount multiple
controllers in the same hierarchy.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11  6:42                 ` Eric W. Biederman
@ 2011-10-11 12:53                   ` Theodore Tso
  2011-10-11 21:16                     ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Theodore Tso @ 2011-10-11 12:53 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:

> I am totally in favor of not starting the entire world.  But just
> like I find it convenient to loopback mount an iso image to see
> what is on a disk image.  It would be handy to be able to just
> download a distro image and play with it, without doing anything
> special.

Agreed, but what's wrong with firing up KVM to play with a distro image?  Personally, I don't consider that "doing something special".

> 
>> Things should just work, except that
>> processes in one container can't use more than their fair share (as
>> dictated by policy) of memory, CPU, networking, and I/O bandwidth.
> 
> You have to be careful with the limiters.  The fundamental reason
> why containers are more efficient than hardware virtualization is
> that with containers we can do over commit of resources, especially
> memory.  I keep seeing implementations of resource limiters that want
> to do things in a heavy handed way that break resource over commit.

Oh, sure.   Resource limiting is something that should be done only when there are other demands on the resource in question.   Put another way, it should be considered more of a resource guarantee than a resource limit.   (You will have at least 10% of the CPU, not at most 10% of the CPU.)
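
To make the "at least" versus "at most" distinction concrete, here is a minimal sketch in Python.  It is a toy model only, not how any kernel scheduler or cgroup controller works, and the names allocate, demand and guarantee are invented for the illustration.  Each group first receives the smaller of its demand and its guaranteed share; whatever is left over goes to groups that still want more, so a guarantee only constrains anyone when the resource is actually contended.

```python
def allocate(capacity, demand, guarantee):
    """Toy work-conserving allocator (illustrative only).

    capacity  -- total amount of the resource (e.g. 100 for 100% CPU)
    demand    -- dict: group -> amount the group would like to use
    guarantee -- dict: group -> fraction of capacity the group is promised
    """
    # Step 1: every group gets up to its guaranteed minimum.
    alloc = {g: min(demand[g], guarantee.get(g, 0.0) * capacity)
             for g in demand}
    spare = capacity - sum(alloc.values())

    # Step 2: hand the spare capacity to groups that still want more,
    # splitting it evenly and repeating until nothing is left to give.
    hungry = [g for g in demand if demand[g] > alloc[g]]
    while spare > 1e-9 and hungry:
        share = spare / len(hungry)
        for g in hungry:
            extra = min(share, demand[g] - alloc[g])
            alloc[g] += extra
            spare -= extra
        hungry = [g for g in hungry if demand[g] - alloc[g] > 1e-9]
    return alloc
```

With a 10% guarantee and no competing demand, a busy group ends up with the whole resource; with a competitor that is guaranteed 90% and wants all of it, the same group is held to its 10%.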

> 
> I don't know what concern you have security wise, but the problem that
> wants to be solved with user namespaces is something you hit much
> earlier than when you worry about sharing a kernel between mutually
> distrusting users.  Right now root inside a container is root
> outside of a container, just like in a chroot jail.  Where this becomes a
> problem is that people change things like
> /proc/sys/kernel/print-fatal-signals expecting it to be a setting local
> to their sandbox when in fact it is the global setting, and things start
> behaving weirdly for other users.  Running sysctl -a during bootup
> has that problem in spades.

The moment you start caring about global sysctl settings is the moment I start wondering whether VMs and separate kernel images are the better solution.   Do we really want to add so much complexity that we are multiplexing different sysctl settings across containers?   To my mind, that way lies madness, and in some cases, it simply can't be done from a semantics perspective.

> 
> With my sysadmin hat on I would not want to touch two untrusting groups
> of users on the same machine.  Because of the probability there is at
> least one security hole that can be found and exploited to allow
> privilege escalation.
> 
> With my kernel developer hat on I can't just say surrender to the
> idea that there will in fact be a privilege escalation bug that
> is easy to exploit.  The code has to be built and designed so that
> privilege escalation is difficult.  Otherwise we might as well
> assume that if you visit a website a stealthy worm has taken over your
> computer.

Oh, I agree that we should try to stop privilege escalation attacks.  And it will be a grand and glorious fight, like Leonidas and his 300 men at the pass at Thermopylae.   :-)   Or it will be like Steve Jobs struggling against cancer.  It's a fight that you know that you're going to lose, but it's not about winning or losing but how much you accomplish and how you fight that counts.

Personally, though, if the issue is worries about visiting a website, the primary protection against that has got to be done at the browser level (i.e., the process level sandboxing done by Chrome).

-- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11 12:53                   ` Theodore Tso
@ 2011-10-11 21:16                     ` Eric W. Biederman
  2011-10-11 22:30                       ` david
  2011-10-12 17:57                       ` J. Bruce Fields
  0 siblings, 2 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-11 21:16 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel,
	harald, david, greg, Linux Containers, Linux Containers,
	Serge E. Hallyn, Daniel Lezcano, Paul Menage

Theodore Tso <tytso@MIT.EDU> writes:

> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>
>> I am totally in favor of not starting the entire world.  But just
>> like I find it convenient to loopback mount an iso image to see
>> what is on a disk image.  It would be handy to be able to just
>> download a distro image and play with it, without doing anything
>> special.
>
> Agreed, but what's wrong with firing up KVM to play with a distro
> image?  Personally, I don't consider that "doing something special".

Then let me flip this around and give a much more practical use case.
Testing.  A very interesting number of cases involve how multiple
machines interact.  You can test a lot more logical machines interacting
with containers than you can with VMs.  And you can test on all the
architectures and platforms Linux supports, not just the handful that are
well supported by hardware virtualization.

I admit for a lot of test cases that it makes sense not to use a full
set of userspace daemons.  At the same time there is no particularly
good reason to have a design that doesn't allow you to run a full
userspace.

>>> Things should just work, except that
>>> processes in one container can't use more than their fair share (as
>>> dictated by policy) of memory, CPU, networking, and I/O bandwidth.
>> 
>> You have to be careful with the limiters.  The fundamental reason
>> why containers are more efficient than hardware virtualization is
>> that with containers we can do over commit of resources, especially
>> memory.  I keep seeing implementations of resource limiters that want
>> to do things in a heavy handed way that break resource over commit.
>
> Oh, sure.   Resource limiting is something that should be done only
> when there are other demands on the resource in question.   Put
> another way, it should be considered more of a resource guarantee than
> a resource limit.   (You will have at least 10% of the CPU, not at
> most 10% of the CPU.)

Resource guarantees I suspect may be worse.  But all of this is to say
that the problem control groups are tackling is a hard one.  Resource
control and resource limits across multiple processes are a challenging
problem, and in some contexts a hard one.

My observations have been that when you want any kind of strong resource
guarantee or resource limit, it is currently a lot easier to implement
that with hardware virtualization than with control groups (at least for
memory).  I think the cpu scheduling has been solved but until you also
at least solve user space memory there are going to be issues.

At the same time getting better resource controls is an area where
there is a strong interest from all over the place.

>> I don't know what concern you have security wise, but the problem that
>> wants to be solved with user namespaces is something you hit much
>> earlier than when you worry about sharing a kernel between mutually
>> distrusting users.  Right now root inside a container is root
>> outside of a container, just like in a chroot jail.  Where this becomes a
>> problem is that people change things like
>> /proc/sys/kernel/print-fatal-signals expecting it to be a setting local
>> to their sandbox when in fact it is the global setting, and things start
>> behaving weirdly for other users.  Running sysctl -a during bootup
>> has that problem in spades.
>
> The moment you start caring about global sysctl settings is the moment
> I start wondering whether or not VM and separate kernel images is the
> better solution.   Do we really want to add so much complexity that we
> are multiplexing different sysctl settings across containers?   To my
> mind, that way lies madness, and in some cases, it simply can't be
> done from a semantics perspective.

It actually isn't much complexity and for the most part the code that
I care about in that area is already merged.  In principle all I care
about is having the identity checks go from:
(uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2))
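
As an illustration only, the change in the check can be modelled in a few lines of Python.  This is not kernel code; the Cred tuple and both function names are hypothetical stand-ins invented for the sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for a credential: a uid plus the user namespace
# that the uid belongs to.
Cred = namedtuple("Cred", ["user_ns", "uid"])


def same_user_old(a, b):
    # Check without user namespaces: only the numeric uid is compared.
    return a.uid == b.uid


def same_user_ns_aware(a, b):
    # Namespace-aware check: uids are only comparable inside one namespace.
    return a.user_ns == b.user_ns and a.uid == b.uid


global_root = Cred(user_ns="init", uid=0)
container_root = Cred(user_ns="container-1", uid=0)

# The old check treats container root as global root; the new one does not.
print(same_user_old(global_root, container_root))
print(same_user_ns_aware(global_root, container_root))
```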

There are some per subsystem sysctls that do make sense to make per
namespace and that work is mostly done.  I expect there are a few
more in the networking stack that are interesting to make per network
namespace.

The only real world issue right now that I am aware of is that user
namespaces aren't quite ready for prime-time and so people run into
issues where something like sysctl -a during bootup sets a bunch of
sysctls and they change sysctls they didn't mean to.  Once the
user namespaces are in place accessing a truly global sysctl will
result in EPERM when you are in a container and everyone will be
happy. ;)
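
From the userspace side, the intended behaviour could be handled roughly as in the sketch below.  It is an assumption-laden illustration: try_write_sysctl is a made-up helper, and which errno a given kernel returns for a given file is not something this thread specifies, so both EPERM and EACCES are treated as "not permitted here".

```python
import errno


def try_write_sysctl(path, value):
    """Try to write one sysctl file.

    Returns (True, None) on success, or (False, "EPERM"/"EACCES") when the
    kernel refuses the write.  Any other error is re-raised.
    """
    try:
        with open(path, "w") as f:
            f.write(str(value) + "\n")
        return True, None
    except OSError as e:
        if e.errno in (errno.EPERM, errno.EACCES):
            # In a container this would mean "global setting, not ours".
            return False, errno.errorcode[e.errno]
        raise
```

A boot script built this way could log the refused settings and carry on instead of silently changing a global value.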

Where all of this winds up interesting for upcoming kernel
work is that uids are persistent and are stored in file systems.  So
once we have all of the permission checks in the kernel tweaked to care
about user namespaces we next look at the filesystems.   The easy
initial implementation is going to be just associating a user namespace
with a super block.  But farther out being able to store uids from
different user namespaces on the same filesystem becomes an interesting
problem.

We already have things like user mapping in 9p and nfsv4 so it isn't
wholly uncharted territory.  But it could get interesting.   Just
a heads up.

>> With my sysadmin hat on I would not want to touch two untrusting groups
>> of users on the same machine.  Because of the probability there is at
>> least one security hole that can be found and exploited to allow
>> privilege escalation.
>> 
>> With my kernel developer hat on I can't just say surrender to the
>> idea that there will in fact be a privilege escalation bug that
>> is easy to exploit.  The code has to be built and designed so that
>> privilege escalation is difficult.  Otherwise we might as well
>> assume that if you visit a website a stealthy worm has taken over your
>> computer.
>
> Oh, I agree that we should try to stop privilege escalation attacks.
> And it will be a grand and glorious fight, like Leonidas and his 300
> men at the pass at Thermopylae.  :-) Or it will be like Steve Jobs
> struggling against cancer.  It's a fight that you know that you're
> going to lose, but it's not about winning or losing but how much you
> accomplish and how you fight that counts.
>
> Personally, though, if the issue is worries about visiting a website,
> the primary protection against that has got to be done at the browser
> level (i.e., the process level sandboxing done by Chrome).

My concern is any externally implemented service, but in general 
browsers and web sites are your most likely candidates.  Both because
there is more complexity there and because http is used far more often
than other protocols.

And yes I agree that the first line of defense needs to be in the
browser source code, and then the application level sandboxing
features that the browser takes advantage of.  Last I paid attention
one of the layers of defense that Chrome uses was to set up different
namespaces to make the sandbox tight even at the syscall level.   When
it is complete I would not be at all surprised if the user namespace
wound up being used in Chrome as well.  Just as one more thing that
helps.

I have found it very surprising how many of the namespaces are
used for what you can't do with them.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
       [not found]             ` <20111011020530.GG16723@count0.beaverton.ibm.com>
  2011-10-11  3:25               ` Ted Ts'o
@ 2011-10-11 22:25               ` david
  1 sibling, 0 replies; 28+ messages in thread
From: david @ 2011-10-11 22:25 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Ted Ts'o, Eric W. Biederman, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, 10 Oct 2011, Matt Helsley wrote:

> On Mon, Oct 10, 2011 at 09:32:01PM -0400, Ted Ts'o wrote:
>> On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
>>> Lennart Poettering <mzxreary@0pointer.de> writes:
>>>
>>>> To make a standard distribution run nicely in a Linux container you
>>>> usually have to make quite a number of modifications to it and disable
>>>> certain things from the boot process. Ideally however, one could simply
>>>> boot the same image on a real machine and in a container and would just
>>>> do the right thing, fully stateless. And for that you need to be able to
>>>> detect containers, and currently you can't.
>>>
>>> I agree getting to the point where we can run a standard distribution
>>> unmodified in a container sounds like a reasonable goal.
>>
>> Hmm, interesting.  It's not clear to me that running a full standard
>> distribution in a container is always going to be what everyone wants
>> to do.
>>
>> The whole point of containers versus VM's is that containers are
>> lighter weight.  And one of the ways that containers can be lighter
>> weight is if you don't have to have N copies of udev, dbus, running in
>> each container/VM.
>>
>> If you end up so much overhead to provide the desired security and/or
>> performance isolation, then it becomes fair to ask the question
>> whether you might as well pay a tad bit more and get even better
>> security and isolation by using a VM solution....
>>
>> 	     	       	  	     - Ted
>
> Yes, it does detract from the unique advantages of using a container.
> However, I think the value here is not the efficiency of the initial
> system configuration but the fact that it gives users a better place to
> start.
>
> Right now we're effectively asking users to start with non-working
> and/or unfamiliar systems and repair them until they work.
>
> By enabling unmodified distro installs in a container we're starting
> at the other end. The choices may not be the most efficient but the
> user may begin tuning from a working configuration. They can learn
> about and tune those parts that prove significant for their workload.
> This is better because in the end it's not just about how efficient the
> user can make their containers but how much effort they will spend
> achieving and maintaining that efficiency over time.

What's needed isn't a way to run all the daemons, processes and startup
scripts that a distro uses in a container without conflicting with the
parent, but instead an easy way to create the appropriate config changes in
the parent, bind mounts, cgroups, etc. for the container and start up the
apps that are wanted in the container.

This needs to be something with a lot of knowledge and hooks in the 
parent, so it's not just a matter of adding a way to detect "am I in a 
container" or not.

When I run things in containers, I want to bind mount some things from the
parent, I want to configure syslog to listen on /dev/log inside the
container, and then I want to start up just the processes I am planning to
use inside the container, not all the daemons and other processes that I
need to run the service the container is built for.

David Lang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11 21:16                     ` Eric W. Biederman
@ 2011-10-11 22:30                       ` david
  2011-10-12  4:26                         ` Eric W. Biederman
  2011-10-12 17:57                       ` J. Bruce Fields
  1 sibling, 1 reply; 28+ messages in thread
From: david @ 2011-10-11 22:30 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, 11 Oct 2011, Eric W. Biederman wrote:

> Theodore Tso <tytso@MIT.EDU> writes:
>
>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>
>>> I am totally in favor of not starting the entire world.  But just
>>> like I find it convenient to loopback mount an iso image to see
>>> what is on a disk image.  It would be handy to be able to just
>>> download a distro image and play with it, without doing anything
>>> special.
>>
>> Agreed, but what's wrong with firing up KVM to play with a distro
>> image?  Personally, I don't consider that "doing something special".
>
> Then let me flip this around and give a much more practical use case.
> Testing.  A very interesting number of cases involve how multiple
> machines interact.  You can test a lot more logical machines interacting
> with containers than you can with vms.  And you can test on all the
> architectures and platforms linux supports not just the handful that are
> well supported by hardware virtualization.

But in containers, you are not really testing lots of machines, you are
testing lots of processes on the same machine (they share the same kernel).

> I admit for a lot of test cases that it makes sense not to use a full
> set of userspace daemons.  At the same time there is not particularly
> good reason to have a design that doesn't allow you to run a full
> userspace.

How do you share the display between all the different containers if they
are trying to run the X server?

How do you avoid all the containers binding to the same port on the
default IP address?

How do you arbitrate dbus across the containers?

When a new USB device gets plugged in, which container gets control of it?

There are a LOT of hard questions when you start talking about running a
full system inside a container that do not apply to other uses of
containers.

David Lang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11 22:30                       ` david
@ 2011-10-12  4:26                         ` Eric W. Biederman
  2011-10-12  5:10                           ` david
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-12  4:26 UTC (permalink / raw)
  To: david
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

david@lang.hm writes:

> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
>
>> Theodore Tso <tytso@MIT.EDU> writes:
>>
>>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>>
>>>> I am totally in favor of not starting the entire world.  But just
>>>> like I find it convenient to loopback mount an iso image to see
>>>> what is on a disk image.  It would be handy to be able to just
>>>> download a distro image and play with it, without doing anything
>>>> special.
>>>
>>> Agreed, but what's wrong with firing up KVM to play with a distro
>>> image?  Personally, I don't consider that "doing something special".
>>
>> Then let me flip this around and give a much more practical use case.
>> Testing.  A very interesting number of cases involve how multiple
>> machines interact.  You can test a lot more logical machines interacting
>> with containers than you can with vms.  And you can test on all the
>> architectures and platforms linux supports not just the handful that are
>> well supported by hardware virtualization.
>
> but in containers, you are not really testing lots of machines, you are testing
> lots of processes on the same machine (they share the same kernel)

True.  But usually that is the interesting part.

>> I admit for a lot of test cases that it makes sense not to use a full
>> set of userspace daemons.  At the same time there is not particularly
>> good reason to have a design that doesn't allow you to run a full
>> userspace.
>
> how do you share the display between all the different containers if they are
> trying to run the X server?

Either X does not start because the hardware it needs is not present or
Xnest or similar gets started.

> how do you avoid all the containers binding to the same port on the default IP
> address?

Network namespaces.
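
Each network namespace has its own interfaces, addresses and port space, so two containers can bind the same port without colliding.  As a hedged aside, one way to see whether two processes share a network namespace is sketched below in Python.  It assumes a kernel that exposes /proc/<pid>/ns/net as a symlink of the form net:[inode], which is not something this thread relies on; the helper names are made up for the illustration.

```python
import os
import re


def parse_ns_link(target):
    """Extract the inode from a link target such as 'net:[4026531956]'."""
    m = re.fullmatch(r"net:\[(\d+)\]", target)
    return int(m.group(1)) if m else None


def netns_of(pid="self"):
    """Return the network-namespace inode of a process, or None if unknown."""
    try:
        return parse_ns_link(os.readlink("/proc/%s/ns/net" % pid))
    except OSError:
        # No such process, no permission, or no /proc/<pid>/ns support.
        return None


def same_netns(pid_a, pid_b):
    """True only if both namespaces could be read and they are identical."""
    a, b = netns_of(pid_a), netns_of(pid_b)
    return a is not None and a == b
```

Processes for which same_netns() is false each have their own set of ports on their own addresses.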

> how do you arbitrate dbus across the containers.

Why should you?

> when a new USB device gets plugged in, which container gets control of
> it?

None of them.  Although today they may all get the uevent.  None of the
containers should have permission to call mknod to mess with it.

> there are a LOT of hard questions when you start talking about running a full
> system inside a container that do not apply for other use of
> containers.

Not really.  Mostly the answer is that you say no.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12  4:26                         ` Eric W. Biederman
@ 2011-10-12  5:10                           ` david
  2011-10-12 15:08                             ` Serge E. Hallyn
  0 siblings, 1 reply; 28+ messages in thread
From: david @ 2011-10-12  5:10 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, 11 Oct 2011, Eric W. Biederman wrote:

> david@lang.hm writes:
>
>> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
>>
>>> Theodore Tso <tytso@MIT.EDU> writes:
>>>
>>>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
>>>>
>>> I admit for a lot of test cases that it makes sense not to use a full
>>> set of userspace daemons.  At the same time there is not particularly
>>> good reason to have a design that doesn't allow you to run a full
>>> userspace.
>>
>> how do you share the display between all the different containers if they are
>> trying to run the X server?
>
> Either X does not start because the hardware it needs is not present or
> Xnest or similar gets started.
>
>> how do you avoid all the containers binding to the same port on the default IP
>> address?
>
> Network namespaces.
>
>> how do you arbitrate dbus across the containers.
>
> Why should you?

Because the containers are simulating different machines, and dbus doesn't
work across different machines.

>> when a new USB device gets plugged in, which container gets control of
>> it?
>
> None of them.  Although today they may all get the uevent.  None of the
> containers should have permission to call mknod to mess with it.

Why would the software inside a container not have the rights to do a
mknod inside the container?

>> there are a LOT of hard questions when you start talking about running a full
>> system inside a container that do not apply for other use of
>> containers.
>
> Not really mostly the answer is that you say no.
>
> Eric
>

David Lang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12  5:10                           ` david
@ 2011-10-12 15:08                             ` Serge E. Hallyn
  0 siblings, 0 replies; 28+ messages in thread
From: Serge E. Hallyn @ 2011-10-12 15:08 UTC (permalink / raw)
  To: david
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Daniel Lezcano, Paul Menage

Quoting david@lang.hm (david@lang.hm):
> On Tue, 11 Oct 2011, Eric W. Biederman wrote:
> 
> >david@lang.hm writes:
> >
> >>On Tue, 11 Oct 2011, Eric W. Biederman wrote:
> >>
> >>>Theodore Tso <tytso@MIT.EDU> writes:
> >>>
> >>>>On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote:
> >>>>
> >>>I admit for a lot of test cases that it makes sense not to use a full
> >>>set of userspace daemons.  At the same time there is not particularly
> >>>good reason to have a design that doesn't allow you to run a full
> >>>userspace.
> >>
> >>how do you share the display between all the different containers if they are
> >>trying to run the X server?
> >
> >Either X does not start because the hardware it needs is not present or
> >Xnest or similar gets started.
> >
> >>how do you avoid all the containers binding to the same port on the default IP
> >>address?
> >
> >Network namespaces.
> >
> >>how do you arbitrate dbus across the containers.
> >
> >Why should you?
> 
> because the containers are simulating different machines, and dbus
> doesn't work across different machines.

Exactly - Eric is saying dbus should not be (and is not) shared among
containers.

> >>when a new USB device gets plugged in, which container gets control of
> >>it?
> >
> >None of them.  Although today they may all get the uevent.  None of the
> >containers should have permission to call mknod to mess with it.
> 
> why would the software inside a container not have the rights to do
> a mknod inside the container?

Why shouldn't an unprivileged user be allowed to mknod on the host?

-serge

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-10 21:41           ` Lennart Poettering
  2011-10-11  5:40             ` Eric W. Biederman
  2011-10-11  6:54             ` Eric W. Biederman
@ 2011-10-12 16:59             ` Kay Sievers
  2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
  2 siblings, 1 reply; 28+ messages in thread
From: Kay Sievers @ 2011-10-12 16:59 UTC (permalink / raw)
  To: Lennart Poettering
  Cc: Eric W. Biederman, Matt Helsley, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

>> - udev.  All of the kernel interfaces for udev should be supported in
>>   current kernels.  However I believe udev is useless because container
>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>
> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
> in a systemd world makes use of that.

Done.

http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb
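
For readers who want to see what such a test can look like, here is a rough Python sketch.  It is not taken from the udev commit above, whose contents are not reproduced here; it simply reads the effective capability mask from /proc/self/status and tests the CAP_MKNOD bit (capability number 27 in <linux/capability.h>).  The function names are invented for the example.

```python
CAP_MKNOD = 27  # capability number from <linux/capability.h>


def effective_caps(status_text):
    """Parse the CapEff line of /proc/<pid>/status into an integer bitmask."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    return 0  # no CapEff line found: assume no capabilities


def has_cap_mknod(status_text):
    """True if the CAP_MKNOD bit is set in the effective set."""
    return bool(effective_caps(status_text) & (1 << CAP_MKNOD))


def should_start_udev(status_path="/proc/self/status"):
    """Hypothetical startup decision: only run udev if we may create nodes."""
    with open(status_path) as f:
        return has_cap_mknod(f.read())
```

A container manager that drops CAP_MKNOD before starting init would make should_start_udev() return False inside the container, with no explicit "am I in a container" check needed.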

Thanks,
Kay

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-11 21:16                     ` Eric W. Biederman
  2011-10-11 22:30                       ` david
@ 2011-10-12 17:57                       ` J. Bruce Fields
  2011-10-12 18:25                         ` Kyle Moffett
  1 sibling, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2011-10-12 17:57 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers,
	linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
> It actually isn't much complexity and for the most part the code that
> I care about in that area is already merged.  In principle all I care
> about is having the identity checks go from:
> (uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2))
> 
> There are some per subsystem sysctls that do make sense to make per
> subsystem and that work is mostly done.  I expect there are a few
> more in the networking stack that interesting to make per network
> namespace.
> 
> The only real world issue right now that I am aware of is the user
> namespace aren't quite ready for prime-time and so people run into
> issues where something like sysctl -a during bootup sets a bunch of
> sysctls and they change sysctls they didn't mean to.  Once the
> user namespaces are in place accessing a truly global sysctl will
> result in EPERM when you are in a container and everyone will be
> happy. ;)
> 
> 
> Where all of this winds up interesting in the field of oncoming kernel
> work is that uids are persistent and are stored in file systems.  So
> once we have all of the permission checks in the kernel tweaked to care
> about user namespaces we next look at the filesystems.   The easy
> initial implementation is going to be just associating a user namespace
> with a super block.  But farther out being able to store uids from
> different user namespaces on the same filesystem becomes an interesting
> problem.

Yipes.  Why would anyone want to do that?

--b.

> We already have things like user mapping in 9p and nfsv4 so it isn't
> wholly uncharted territory.  But it could get interesting.   Just
> a heads up.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 17:57                       ` J. Bruce Fields
@ 2011-10-12 18:25                         ` Kyle Moffett
  2011-10-12 19:04                           ` J. Bruce Fields
  0 siblings, 1 reply; 28+ messages in thread
From: Kyle Moffett @ 2011-10-12 18:25 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> Where all of this winds up interesting in the field of oncoming kernel
>> work is that uids are persistent and are stored in file systems.  So
>> once we have all of the permission checks in the kernel tweaked to care
>> about user namespaces we next look at the filesystems.   The easy
>> initial implementation is going to be just associating a user namespace
>> with a super block.  But farther out being able to store uids from
>> different user namespaces on the same filesystem becomes an interesting
>> problem.
>
> Yipes.  Why would anyone want to do that?

Consider an NFS file server providing collaborative access to multiple
independently managed domains (EG: several different universities),
each with their own LDAP userid database and Kerberos services.

The common server is in its own realm and allows cross-realm
authentication to the other university realms, using the origin realm
to decide what namespace to map each user into.

If you are just doing read-only operations then you don't need any
kind of namespace persistence on the NFS server's storage.  On the
other hand, if you want to allow users to collaborate and create ACLs
then you need something dramatically more involved.

On the wire, the kerberos server would simply identify each NFSv4 ACL
entry with a particular realm ID, but in the backend it would need
some filesystem-level disambiguation (possibly the recently-proposed
RichACL features?)

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 18:25                         ` Kyle Moffett
@ 2011-10-12 19:04                           ` J. Bruce Fields
  2011-10-12 19:12                             ` Kyle Moffett
  0 siblings, 1 reply; 28+ messages in thread
From: J. Bruce Fields @ 2011-10-12 19:04 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
> >> Where all of this winds up interesting in the field of oncoming kernel
> >> work is that uids are persistent and are stored in file systems.  So
> >> once we have all of the permission checks in the kernel tweaked to care
> >> about user namespaces we next look at the filesystems.   The easy
> >> initial implementation is going to be just associating a user namespace
> >> with a super block.  But farther out being able to store uids from
> >> different user namespaces on the same filesystem becomes an interesting
> >> problem.
> >
> > Yipes.  Why would anyone want to do that?
> 
> Consider an NFS file server providing collaborative access to multiple
> independently managed domains (EG: several different universities),
> each with their own LDAP userid database and Kerberos services.
> 
> The common server is in its own realm and allows cross-realm
> authentication to the other university realms, using the origin realm
> to decide what namespace to map each user into.
> 
> If you are just doing read-only operations then you don't need any
> kind of namespace persistence on the NFS server's storage.  On the
> other hand, if you want to allow users to collaborate and create ACLs
> then you need something dramatically more involved.

Yeah, OK, I suppose I'd imagined mapping into the server's id space
somehow for that case, but I suppose this would be cleaner.  Still,
seems like a big pain....

> On the wire, the kerberos server would simply identify each NFSv4 ACL
> entry with a particular realm ID, but in the backend it would need
> some filesystem-level disambiguation (possibly the recently-proposed
> RichACL features?)

That doesn't help with owner and group.

--b.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 19:04                           ` J. Bruce Fields
@ 2011-10-12 19:12                             ` Kyle Moffett
  2011-10-14 15:54                               ` Ted Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Kyle Moffett @ 2011-10-12 19:12 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 15:04, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
>> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> >> Where all of this winds up interesting in the field of oncoming kernel
>> >> work is that uids are persistent and are stored in file systems.  So
>> >> once we have all of the permission checks in the kernel tweaked to care
>> >> about user namespaces we next look at the filesystems.   The easy
>> >> initial implementation is going to be just associating a user namespace
>> >> with a super block.  But farther out being able to store uids from
>> >> different user namespaces on the same filesystem becomes an interesting
>> >> problem.
>> >
>> > Yipes.  Why would anyone want to do that?
>>
>> Consider an NFS file server providing collaborative access to multiple
>> independently managed domains (EG: several different universities),
>> each with their own LDAP userid database and Kerberos services.
>>
>> The common server is in its own realm and allows cross-realm
>> authentication to the other university realms, using the origin realm
>> to decide what namespace to map each user into.
>>
>> If you are just doing read-only operations then you don't need any
>> kind of namespace persistence on the NFS server's storage.  On the
>> other hand, if you want to allow users to collaborate and create ACLs
>> then you need something dramatically more involved.
>
> Yeah, OK, I suppose I'd imagined mapping into the server's id space
> somehow for that case, but I suppose this would be cleaner.  Still,
> seems like a big pain....
>
>> On the wire, the kerberos server would simply identify each NFSv4 ACL
>> entry with a particular realm ID, but in the backend it would need
>> some filesystem-level disambiguation (possibly the recently-proposed
>> RichACL features?)
>
> That doesn't help with owner and group.

Well, you're going to need to introduce a bunch of new xattrs to
handle the namespacing anyways.

As I understand it you can use RichACLs to grant all the same
privileges as owner and group, so you can simply map the real
namespaced owner and group into RichACLs (or another xattr) and force
the inode uid/gid to be root/root (or maybe nobody/nogroup or
something).

I am of course making it sound a million times easier than it's
actually likely to be, but I do think it's possible without too many
odd corner cases.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-12 19:12                             ` Kyle Moffett
@ 2011-10-14 15:54                               ` Ted Ts'o
  2011-10-14 18:04                                 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Ted Ts'o @ 2011-10-14 15:54 UTC (permalink / raw)
  To: Kyle Moffett
  Cc: J. Bruce Fields, Eric W. Biederman, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 03:12:34PM -0400, Kyle Moffett wrote:
> Well, you're going to need to introduce a bunch of new xattrs to
> handle the namespacing anyways.
> 
> As I understand it you can use RichACLs to grant all the same
> privileges as owner and group, so you can simply map the real
> namespaced owner and group into RichACLs (or another xattr) and force
> the inode uid/gid to be root/root (or maybe nobody/nogroup or
> something).

It's going to be all about mapping tables, and whether the mapping is
done in userspace or kernel space.  For example, you might want to
take a Kerberos principal name and map it to a 128-bit identifier
(64-bit realm id + 64-bit user id), and that in turn might require
mapping to some 32-bit Linux uid namespace.

If people want to support multiple 32-bit Linux uid namespaces, then
it's a question of how you name these uid namespaces, how to
manage the mapping tables outside of the kernel, and how the mapping
tables get loaded into the kernel.
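
To make the mapping-table idea concrete, here is a minimal Python sketch.
It assumes the 128-bit identifier is simply the 64-bit realm id in the
high bits and the 64-bit user id in the low bits, and that each uid
namespace hands out 32-bit uids on first use.  The names
make_principal_id and UidMap and the starting uid are hypothetical,
chosen only for illustration; nothing here reflects an existing kernel
or userspace interface.

```python
MAX_64 = 2**64
INVALID_UID = 2**32 - 1  # (uid_t)-1 is reserved, so never hand it out


def make_principal_id(realm_id, user_id):
    """Pack a 64-bit realm id and a 64-bit user id into one 128-bit int."""
    if not (0 <= realm_id < MAX_64 and 0 <= user_id < MAX_64):
        raise ValueError("realm_id and user_id must fit in 64 bits")
    return (realm_id << 64) | user_id


class UidMap:
    """Hypothetical mapping table for a single 32-bit uid namespace."""

    def __init__(self, first_uid=100000):
        self._table = {}          # 128-bit principal id -> 32-bit uid
        self._next_uid = first_uid

    def lookup(self, principal_id):
        """Return the uid for a principal, allocating one on first use."""
        uid = self._table.get(principal_id)
        if uid is None:
            if self._next_uid >= INVALID_UID:
                raise OverflowError("32-bit uid space exhausted")
            uid = self._next_uid
            self._next_uid += 1
            self._table[principal_id] = uid
        return uid
```

Whether such a table lives in userspace or in the kernel, and how it
gets loaded, is exactly the open question above; the sketch only shows
the data shape.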

> I am of course making it sound a million times easier than it's
> actually likely to be, but I do think it's possible without too many
> odd corner cases.

It's not the corner cases, it's all of the different name spaces that
different system administrators and their sites are going to want to
use, and how to support them all....

And of course, once we start naming uid name spaces, eventually
someone will want to virtualize containers, and then we will have
namespaces for namespaces.  (It's turtles all the way down!  :-)

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 15:54                               ` Ted Ts'o
@ 2011-10-14 18:04                                 ` Eric W. Biederman
  2011-10-14 21:58                                   ` H. Peter Anvin
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-14 18:04 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering,
	Kay Sievers, linux-kernel, harald, david, greg, Linux Containers,
	Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

Ted Ts'o <tytso@mit.edu> writes:

>> I am of course making it sound a million times easier than it's
>> actually likely to be, but I do think it's possible without too many
>> odd corner cases.
>
> It's not the corner cases, it's all of the different name spaces that
> different system administrators and their sites are going to want to
> use, and how to support them all....
>
> And of course, once we start naming uid name spaces, eventually
> someone will want to virtualize containers, and then we will have
> namespaces for namespaces.  (It's turtles all the way down!  :-)

I have found and merged a solution that allows us to name namespaces
without needing a namespace for namespaces.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 18:04                                 ` Eric W. Biederman
@ 2011-10-14 21:58                                   ` H. Peter Anvin
  2011-10-16  9:42                                     ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: H. Peter Anvin @ 2011-10-14 21:58 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Linux Containers, Serge E. Hallyn,
	Daniel Lezcano, Paul Menage

On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
> 
> I have found and merged a solution that allows us to name namespaces
> without needing a namespace for namespaces.
> 

Something based on UUIDs, perhaps?

UUIDs are kind of exactly this, after all... a single namespace designed
to be large and random enough to be globally unique without a central
registration authority.

	-hpa

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-14 21:58                                   ` H. Peter Anvin
@ 2011-10-16  9:42                                     ` Eric W. Biederman
  2011-10-30 20:11                                       ` H. Peter Anvin
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-16  9:42 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
>> 
>> I have found and merged a solution that allows us to name namespaces
>> without needing a namespace for namespaces.
>> 
>
> Something based on UUIDs, perhaps?
>
> UUIDs are kind of exactly this, after all... a single namespace designed
> to be large and random enough to be globally unique without a central
> registration authority.

mount --bind /proc/self/ns/net /var/run/netns/<name>

When we want to refer to the namespace in syscalls we pass a file
descriptor we received from opening the namespace reference object.

That moves the entire naming problem into the file namespace.
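
A minimal Python sketch of that flow, assuming the syscall in question
is setns(2) and that the namespace was bind-mounted under
/var/run/netns as in the command above.  The helper names are
hypothetical, and setns is reached through ctypes because not every
Python version exposes it directly.  Actually switching namespaces
needs privilege (typically CAP_SYS_ADMIN), so treat this as an
illustration rather than a tested tool.

```python
import ctypes
import os

CLONE_NEWNET = 0x40000000  # network namespace flag from <linux/sched.h>


def netns_path(name, base="/var/run/netns"):
    """Path where a named network namespace is expected to be bind-mounted."""
    return os.path.join(base, name)


def open_namespace(path):
    """Open the namespace reference object; the returned fd is the handle."""
    return os.open(path, os.O_RDONLY)


def enter_namespace(fd, nstype=CLONE_NEWNET):
    """Call setns(2) via libc so the calling thread joins the namespace."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.setns(fd, nstype) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

A caller would do something like
fd = open_namespace(netns_path("blue")) followed by
enter_namespace(fd).  The name "blue" only has meaning as a file name,
which is the point: no separate registry of namespace names is needed.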

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-16  9:42                                     ` Eric W. Biederman
@ 2011-10-30 20:11                                       ` H. Peter Anvin
  2011-11-01 13:38                                         ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: H. Peter Anvin @ 2011-10-30 20:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>
>> Something based on UUIDs, perhaps?
>>
>> UUIDs are kind of exactly this, after all... a single namespace designed
>> to be large and random enough to be globally unique without a central
>> registration authority.
> 
> mount --bind /proc/self/ns/net /var/run/netns/<name>
> 
> When we want to refer to the namespace in syscalls we pass a file
> descriptor we received from opening the namespace reference object.
> 
> That moves the entire naming problem into the file namespace.
> 

That doesn't solve what I think of as the *real* problem.

The real problem is just another instance of what I sometimes refer to
as the "alien metadata problem": the alien metadata problem (which crops
up in *all kinds* of contexts, including containers, namespaces, virtual
machines, building distribution disk images, and backups) is the fact
that you would like to be able to store, manipulate and preserve, on
disk and in a mounted filesystem, a set of metadata which may not be the
"currently active" metadata.

There are two forms of "solutions" to this: one where the filesystem
still only contains one set of metadata, but it is not currently active,
and one where the filesystem contains multiple sets of metadata for the
same files at the same time, any one of which can be active (and
different ones may be active for different namespaces.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Detecting if you are running in a container
  2011-10-30 20:11                                       ` H. Peter Anvin
@ 2011-11-01 13:38                                         ` Eric W. Biederman
  0 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-11-01 13:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley,
	Lennart Poettering, Kay Sievers, linux-kernel, harald, david,
	greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano,
	Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>>
>>> Something based on UUIDs, perhaps?
>>>
>>> UUIDs are kind of exactly this, after all... a single namespace designed
>>> to be large and random enough to be globally unique without a central
>>> registration authority.
>> 
>> mount --bind /proc/self/ns/net /var/run/netns/<name>
>> 
>> When we want to refer to the namespace in syscalls we pass a file
>> descriptor we received from opening the namespace reference object.
>> 
>> That moves the entire naming problem into the file namespace.
>> 
>
> That doesn't solve what I think of as the *real* problem.

It removes the need for a namespace of namespaces, and it does not
require universal agreement between all filesystems on all operating
systems on how things should look.

In not precluding different solutions it makes a large stride forward.

> The real problem is just another instance of what I sometimes refer to
> as the "alien metadata problem": the alien metadata problem (which crops
> up in *all kinds* of contexts, including containers, namespaces, virtual
> machines, building distribution disk images, and backups) is the fact
> that you would like to be able to store, manipulate and preserve, on
> disk and in a mounted filesystem, a set of metadata which may not be the
> "currently active" metadata.

When you throw in network filesystems with different notions of
metadata, things get even more interesting.

> There are two forms of "solutions" to this: one where the filesystem
> still only contains one set of metadata, but it is not currently active,
> and one where the filesystem contains multiple sets of metadata for the
> same files at the same time, any one of which can be active (and
> different ones may be active for different namespaces.)

There is an important tool that seems to be missing from your toolbox.
- Mapping the metadata on the file into different contexts.

The way I see it classic unix filesystems have exactly one context
that their meta-data is expected to work in.  The context in which
the filesystem is mounted.

However it is very easy to conceive of that context being specified
at a per inode granularity.  In which case at least the backup and
the distribution disk image problem can be solved by trivially
specifying a different context, and associating a user namespace with
that context.  Then you switch into the user namespace to manipulate
"alien metadata".

Where mapping comes in is when those files are accessed from
another context besides the one where all of their metadata
falls.  At that point you can map all of the files to be owned
by the user who is responsible for making backups.  The mapping
is a bit like the root squash setting.
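
The squash-like mapping can be sketched in a few lines of Python.  This
assumes each file carries a context label next to its uid and that the
viewer has a context of its own; map_owner and the string labels are
hypothetical names used only to illustrate the rule.

```python
def map_owner(file_uid, file_context, viewer_context, squash_uid):
    """Return the owner uid that a viewer in viewer_context should see.

    Inside the file's own context the stored uid is meaningful and is
    returned unchanged.  From any other context every file is presented
    as owned by squash_uid, much like root squash on an NFS export.
    """
    if file_context == viewer_context:
        return file_uid
    return squash_uid
```

For example, a file stored with uid 1000 in context "hpa-backups"
appears with uid 1000 from inside that context, and as the backup
user's uid from the host.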

For the common case I expect we will settle on a well defined acl across
the native unix filesystems that allows us to make this persistent.  For
network filesystems with their broader interoperability requirements how
to specify this gets a little more interesting.

For purposes of implementation it doesn't matter to me if that acl is
a uuid or a unique string.  For management of the data it might.

How I expect a native Linux filesystem to work when it encounters a
filesystem with a user namespace acl is that it will work like nfsv4
and do an upcall into userspace, to ask userspace how to understand
this acl.  Then the userspace mapping agent will say: Oh.  You want
the user namespace for "hpa-backups"?  Let's see:
/var/run/userns/hpa-backups exists, so let me just tell the kernel about
that mapping.  Or perhaps the user namespace does not exist, so the
mapping daemon would go out and create it by consulting configuration
files in /etc to know that everything in "hpa-backups" should be a child
user namespace, with the user "hpa" being able to switch into that
user namespace without root permission.
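
The decision the mapping agent makes can be sketched as follows.  This
is a hypothetical Python illustration: the function name
resolve_userns, the returned tuples, and the config dictionary
(standing in for whatever configuration files would exist) are all
assumptions, and no such daemon or upcall interface is implied to
exist.

```python
import os


def resolve_userns(name, run_dir="/var/run/userns", config=None):
    """Decide how to answer an upcall asking about a named user namespace.

    Returns ("existing", path) if the namespace reference is already
    present, ("create", owner) if configuration says who may own a new
    child user namespace, and ("unknown", None) otherwise.
    """
    path = os.path.join(run_dir, name)
    if os.path.exists(path):
        return ("existing", path)
    owner = (config or {}).get(name)
    if owner is None:
        return ("unknown", None)
    return ("create", owner)
```

With config = {"hpa-backups": "hpa"} and no existing reference, the
agent would answer ("create", "hpa"), that is, create a child user
namespace that user "hpa" may enter.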

Files with metadata for more than one user namespace/context I expect
to work similarly.  Care needs to be taken that it doesn't drive the
administrator crazy.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-10-12 16:59             ` Kay Sievers
@ 2011-11-01 22:05               ` Michael Tokarev
  2011-11-01 23:51                 ` Eric W. Biederman
  0 siblings, 1 reply; 28+ messages in thread
From: Michael Tokarev @ 2011-11-01 22:05 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Lennart Poettering, greg, Paul Menage, linux-kernel, david,
	Eric W. Biederman, Linux Containers, Linux Containers,
	Serge E. Hallyn, harald

[Replying to an oldish email...]

On 12.10.2011 20:59, Kay Sievers wrote:
> On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
>> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
> 
>>> - udev.  All of the kernel interfaces for udev should be supported in
>>>   current kernels.  However I believe udev is useless because container
>>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>>
>> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
>> in a systemd world makes use of that.
> 
> Done.
> 
> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb

Maybe CAP_MKNOD isn't actually a good idea, keeping devtmpfs in mind?

Without CAP_MKNOD, is devtmpfs still being populated internally by
the kernel, so that udev only needs to change ownership/permissions
and maintain symlinks in response to device changes, and perform
other duties (reacting to other types of events) normally?

In other words, provided devtmpfs works even without CAP_MKNOD,
I can easily imagine a whole system running without this capability
from the very early boot, with all functionality in place, including
udev and what not...

And having CAP_MKNOD in a container may not be that bad either, as long
as cgroup device.permission is set correctly - some nodes may still need
to be created, even in unprivileged containers.  Who filters
out CAP_MKNOD during container startup (I don't see it in the code,
which only removes CAP_SYS_BOOT, and even that due to a current
limitation), and which evil things can be done if it is not filtered?

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
@ 2011-11-01 23:51                 ` Eric W. Biederman
  2011-11-02  8:08                   ` Michael Tokarev
  0 siblings, 1 reply; 28+ messages in thread
From: Eric W. Biederman @ 2011-11-01 23:51 UTC (permalink / raw)
  To: Michael Tokarev
  Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel,
	david, Linux Containers, Linux Containers, Serge E. Hallyn,
	harald

Michael Tokarev <mjt@tls.msk.ru> writes:

> [Replying to an oldish email...]
>
> On 12.10.2011 20:59, Kay Sievers wrote:
>> On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote:
>>> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
>> 
>>>> - udev.  All of the kernel interfaces for udev should be supported in
>>>>   current kernels.  However I believe udev is useless because container
>>>>   start drops CAP_MKNOD so we can't do evil things.  So I would
>>>>   recommend basing the startup of udev on presence of CAP_MKNOD.
>>>
>>> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev
>>> in a systemd world makes use of that.
>> 
>> Done.
>> 
>> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb
>
> Maybe CAP_MKNOD isn't actually a good idea, having in mind devtmpfs?
>
> Without CAP_MKNOD, is devtmpfs still being populated internally by
> the kernel, so that udev only needs to change ownership/permissions
> and maintain symlinks in response to device changes, and perform
> other duties (reacting to other types of events) normally?
>
> In other words, provided devtmpfs works even without CAP_MKNOD,
> I can easily imagine a whole system running without this capability
> from the very early boot, with all functionality in place, including
> udev and what not...

Agreed, devtmpfs does pretty much make dropping CAP_MKNOD useless.  I
expect we should verify that whoever mounts devtmpfs has CAP_MKNOD.
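
For reference, the userspace side of a CAP_MKNOD test can be sketched
in Python by reading the effective capability mask from
/proc/self/status.  The function names are hypothetical; the bit number
27 is CAP_MKNOD from <linux/capability.h>.  This is only a sketch of
the check, not what udev does, and as noted above the result says
little when devtmpfs populates /dev anyway.

```python
CAP_MKNOD = 27  # bit number from <linux/capability.h>


def parse_cap_eff(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap_mknod(status_text):
    """True if the CAP_MKNOD bit is set in the effective capability set."""
    return bool(parse_cap_eff(status_text) & (1 << CAP_MKNOD))


def current_process_has_cap_mknod(path="/proc/self/status"):
    """Run the check against the calling process."""
    with open(path) as f:
        return has_cap_mknod(f.read())
```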

> And having CAP_MKNOD in container may not be that bad either, while
> cgroup device.permission is set correctly - some nodes may need to
> be created still, even in an unprivileged containers.  Who filters
> out CAP_MKNOD during container startup (I don't see it in the code,
> which only removes CAP_SYS_BOOT, and even that due to current
> limitation), and which evil things can be done if it is not filtered?

If you don't filter which device nodes a process can read/write then
that process can access any device on the system.  Steal the keyboard,
the X display, access any filesystem, directly access memory.  Basically
the process can escalate that permission to full control of the system
without needing any kernel bugs to help it.

Eric

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [lxc-devel] Detecting if you are running in a container
  2011-11-01 23:51                 ` Eric W. Biederman
@ 2011-11-02  8:08                   ` Michael Tokarev
  0 siblings, 0 replies; 28+ messages in thread
From: Michael Tokarev @ 2011-11-02  8:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel,
	david, Linux Containers, Linux Containers, Serge E. Hallyn,
	harald

On 02.11.2011 03:51, Eric W. Biederman wrote:
[]
>> And having CAP_MKNOD in container may not be that bad either, while
>> cgroup device.permission is set correctly - some nodes may need to
>> be created still, even in an unprivileged containers.  Who filters
>> out CAP_MKNOD during container startup (I don't see it in the code,
>> which only removes CAP_SYS_BOOT, and even that due to current
>> limitation), and which evil things can be done if it is not filtered?
> 
> If you don't filter which device nodes a process can read/write then
> that process can access any device on the system.  Steal the keyboard,
> the X display, access any filesystem, directly access memory.  Basically
> the process can escalate that permission to full control of the system
> without needing any kernel bugs to help it.

There's CAP_MKNOD, and cgroup/devices.{allow,deny}.  Even with CAP_MKNOD,
a container cannot _use_ devices not allowed in the latter.  That's what
I'm talking about - finer-grained control exists than CAP_MKNOD.  And
my question was about this context - with proper cgroup-level device
control in place, what harm does CAP_MKNOD do?
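
To illustrate the kind of filtering meant here, below is a small Python
model of a devices whitelist.  It assumes entries of the form
"type major:minor access" as written to devices.allow, where type is
a, b or c, major and minor may be "*", and access is any of r, w, m
(m being mknod).  It models only an allow list, not deny entries or
cgroup hierarchy, and the function names are hypothetical.

```python
def parse_rule(rule):
    """Split 'c 1:3 rwm' into (type, major, minor, set of access letters)."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def rule_matches(rule, dev_type, major, minor, access):
    """True if one whitelist entry permits the requested access."""
    r_type, r_major, r_minor, r_access = parse_rule(rule)
    return (r_type in ("a", dev_type)
            and r_major in ("*", str(major))
            and r_minor in ("*", str(minor))
            and access in r_access)


def is_allowed(allow_rules, dev_type, major, minor, access):
    """True if any entry in the whitelist permits the requested access."""
    return any(rule_matches(r, dev_type, major, minor, access)
               for r in allow_rules)
```

With the list ["c 1:3 rwm", "c 5:* rw"], mknod of /dev/null (c 1:3) is
permitted, while opening a block device such as b 8:0 is refused no
matter which capabilities the process holds.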

Thanks,

/mjt

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-11-02  8:08 UTC | newest]

Thread overview: 28+ messages
     [not found] <1317943022.1095.25.camel@mop>
     [not found] ` <20111007074904.GC16723@count0.beaverton.ibm.com>
     [not found]   ` <20111007160113.GB14201@tango.0pointer.de>
     [not found]     ` <m17h4g2jqy.fsf@fess.ebiederm.org>
     [not found]       ` <20111010163140.GA22191@tango.0pointer.de>
2011-10-10 20:59         ` Detecting if you are running in a container Eric W. Biederman
2011-10-10 21:41           ` Lennart Poettering
2011-10-11  5:40             ` Eric W. Biederman
2011-10-11  6:54             ` Eric W. Biederman
2011-10-12 16:59             ` Kay Sievers
2011-11-01 22:05               ` [lxc-devel] " Michael Tokarev
2011-11-01 23:51                 ` Eric W. Biederman
2011-11-02  8:08                   ` Michael Tokarev
2011-10-11  1:32           ` Ted Ts'o
     [not found]             ` <20111011020530.GG16723@count0.beaverton.ibm.com>
2011-10-11  3:25               ` Ted Ts'o
2011-10-11  6:42                 ` Eric W. Biederman
2011-10-11 12:53                   ` Theodore Tso
2011-10-11 21:16                     ` Eric W. Biederman
2011-10-11 22:30                       ` david
2011-10-12  4:26                         ` Eric W. Biederman
2011-10-12  5:10                           ` david
2011-10-12 15:08                             ` Serge E. Hallyn
2011-10-12 17:57                       ` J. Bruce Fields
2011-10-12 18:25                         ` Kyle Moffett
2011-10-12 19:04                           ` J. Bruce Fields
2011-10-12 19:12                             ` Kyle Moffett
2011-10-14 15:54                               ` Ted Ts'o
2011-10-14 18:04                                 ` Eric W. Biederman
2011-10-14 21:58                                   ` H. Peter Anvin
2011-10-16  9:42                                     ` Eric W. Biederman
2011-10-30 20:11                                       ` H. Peter Anvin
2011-11-01 13:38                                         ` Eric W. Biederman
2011-10-11 22:25               ` david
