* Detecting if you are running in a container [not found] ` <20111010163140.GA22191@tango.0pointer.de> @ 2011-10-10 20:59 ` Eric W. Biederman 2011-10-10 21:41 ` Lennart Poettering 2011-10-11 1:32 ` Ted Ts'o 0 siblings, 2 replies; 28+ messages in thread From: Eric W. Biederman @ 2011-10-10 20:59 UTC (permalink / raw) To: Lennart Poettering Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage Cc's and subject updated so hopefully we get the correct people on this discussion to make progress. Lennart Poettering <mzxreary@0pointer.de> writes: > To make a standard distribution run nicely in a Linux container you > usually have to make quite a number of modifications to it and disable > certain things from the boot process. Ideally however, one could simply > boot the same image on a real machine and in a container and would just > do the right thing, fully stateless. And for that you need to be able to > detect containers, and currently you can't. I agree getting to the point where we can run a standard distribution unmodified in a container sounds like a reasonable goal. > Quite a few kernel subsystems are > currently not virtualized, for example SELinux, VTs, most of sysfs, most > of /proc/sys, audit, udev or file systems (by which I mean that for a > container you probably don't want to fsck the root fs, and so on), and > containers tend to be much more lightweight than real systems. That is an interesting viewpoint on what is not complete. But as a listing of the tasks that distribution startup needs to do differently in a container the list seems more or less reasonable. There are two questions - How in the general case do we detect if we are running in a container. - How do we make reasonable tests during bootup to see if it makes sense to perform certain actions. For the general detection if we are running in a linux container I can see two reasonable possibilities. 
- Put a file in / that lets you know by convention that you are in a linux container. I am inclined to do this because this is something we can support on all kernels old and new.
- Allow modification to the output of uname(2). The uts namespace already covers uname(2) and uname is the standard method to communicate to userspace the vagaries about the OS level environment they are running in.

My list of things that still have work left to do looks like:
- cgroups. It is not safe to create new hierarchies with groups that are in existing hierarchies. So cgroups don't work.
- user namespace. We are very close to having something workable on this one, but until we do all of the users inside and outside of a container are the same, and pass the same permission checks. As a result we have to drop most of root's privileges, and we have to be a bit careful what binaries that can gain privileges (think suid root) are in the container filesystem.
- Reboot. I know Daniel was working on something not long ago but I am not certain where he wound up.
- device namespaces. We periodically think about having a separate set of devices and to support things like losetup in a container that seems necessary. Most of the time getting all of the way to device namespaces seems unnecessary.

As for tests on what to start up:
- udev. All of the kernel interfaces for udev should be supported in current kernels. However I believe udev is useless because container start drops CAP_MKNOD so we can't do evil things. So I would recommend basing the startup of udev on presence of CAP_MKNOD.
- VTs. Ptys should be well supported at this point. For the rest they are physical hardware that a container should not be playing with so I would decide which gettys to start up based on which device nodes are present in /dev.
- sysctls (aka /proc/sys). That is a tricky one. Until the user namespace is fleshed out a little more sysctls are going to be a problem, because root can write to most of them.
My gut feel says you probably want to base the decision to poke at sysctls on CAP_SYS_ADMIN. At least that test will become true when user namespaces are rolled out, and at that point you will want to set all of the sysctls you have permission to.
- audit. My memory is very fuzzy on this one. The issue in question is should we start auditd? I believe the audit calls actually fail in a container so we should be able to trigger starting auditd on whether audit works at all. If we can't do it that way certainly the work should be put in so that it can be done that way.
- fsck. A rw filesystem check like you mentioned earlier seems like a reasonable place to be. I know the OpenVZ folks were talking about putting containers in their own block devices for their next round of supporting containers. At which point a filesystem check on container startup might not be a bad idea at all.
- cgroups hierarchies. I don't know at which point in the system startup we care. The appropriate solution would seem to be to try it and if the operation fails figure it isn't supported.
- selinux. It really should be in the same category. You should be able to attempt to load a policy and have it fail in a way that indicates that selinux is not currently supported. I don't know if we can make that work right until we get the user namespace into a usable shape.

In general things in a container should work or the kernel feature should fail in a way that indicates that the feature is not supported. That currently works well for the networking stack, and with the pending usability of the user namespace it should work just about everywhere else as well. For things that don't fit that model we need to fix the kernel.

So while I agree a check to see if something is a container seems reasonable, I do not agree that the pid namespace is the place to put that information. I see no natural way to put that information in the pid namespace.
I further think there are a lot of reasonable checks for whether a kernel feature is supported in the current environment that I would rather pursue over hacks based on the fact we are in a container.

Eric

^ permalink raw reply [flat|nested] 28+ messages in thread
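The general rule in the message above (a kernel feature should either work or fail in a way that shows it is not supported) can be sketched as a small probe helper. This is a minimal illustration and not code from the thread: the name `feature_available` and the particular set of errno values treated as "not supported" are assumptions made for the example.

```python
import errno

# errno values taken here to mean "not available in this environment".
# The exact set is an assumption of this sketch, not something the thread specifies.
UNSUPPORTED_ERRNOS = {
    errno.EPERM,
    errno.EACCES,
    errno.EROFS,
    errno.ENOSYS,
    errno.ENOENT,
}


def feature_available(attempt):
    """Run attempt() and report whether the feature looks usable.

    attempt is any zero-argument callable that performs the real operation,
    for example writing a sysctl or creating a cgroup (hypothetical callers).
    """
    try:
        attempt()
    except OSError as exc:
        if exc.errno in UNSUPPORTED_ERRNOS:
            return False  # clean failure: treat the feature as unsupported
        raise  # anything else is a genuine error, not a missing feature
    return True
```

A boot script could wrap each optional step in such a probe and skip the step on `False`, instead of asking whether it is running in a container at all.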
* Re: Detecting if you are running in a container 2011-10-10 20:59 ` Detecting if you are running in a container Eric W. Biederman @ 2011-10-10 21:41 ` Lennart Poettering 2011-10-11 5:40 ` Eric W. Biederman ` (2 more replies) 2011-10-11 1:32 ` Ted Ts'o 1 sibling, 3 replies; 28+ messages in thread From: Lennart Poettering @ 2011-10-10 21:41 UTC (permalink / raw) To: Eric W. Biederman Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote: > > Quite a few kernel subsystems are > > currently not virtualized, for example SELinux, VTs, most of sysfs, most > > of /proc/sys, audit, udev or file systems (by which I mean that for a > > container you probably don't want to fsck the root fs, and so on), and > > containers tend to be much more lightweight than real systems. > > That is an interesting viewpoint on what is not complete. But as a > listing of the tasks that distribution startup needs to do differently in > a container the list seems more or less reasonable. Note that this is just what came to my mind while I was typing this, I am quite sure there's actually more like this. > There are two questions > - How in the general case do we detect if we are running in a container. > - How do we make reasonable tests during bootup to see if it makes sense > to perform certain actions. > > For the general detection if we are running in a linux container I can > see two reasonable possibilities. > > - Put a file in / that let's you know by convention that you are in a > linux container. I am inclined to do this because this is something > we can support on all kernels old and new. Hmpf. That would break the stateless read-only-ness of the root dir. After pointing the issue out to the LXC folks they are now setting "container=lxc" as env var when spawning a container. 
In systemd-nspawn I have then adopted a similar scheme. I am not sure that is particularly nice, though, since env vars are inherited further down the tree where we probably don't want them.

In case you are curious: this is the code we use in systemd:

http://cgit.freedesktop.org/systemd/tree/src/virt.c

What matters to me though is that we can generically detect Linux containers instead of specific implementations.

> - Allow modification to the output of uname(2). The uts namespace
> already covers uname(2) and uname is the standard method to
> communicate to userspace the vagaries about the OS level environment
> they are running in.

Well, I am not a particular fan of having userspace tell userspace about containers. I would prefer if userspace could get that info from the kernel without any explicit agreement to set some specific variable.

That said, detecting CLONE_NEWUTS by looking at the output of uname(2) would be a workable solution for us. CLONE_NEWPID and CLONE_NEWUTS are probably equally defining for what a container is, so I'd be happy if we could detect either. For example, if the kernel would append "(container)" or so to utsname.machine[] after CLONE_NEWUTS is used I'd be quite happy.

> My list of things that still have work left to do looks like:
> - cgroups. It is not safe to create new hierarchies with groups
> that are in existing hierarchies. So cgroups don't work.

Well, for systemd they actually work quite fine since systemd will always place its own cgroups below the cgroup it is started in. cgroups hence make these things nicely stackable.

In fact, most folks involved in cgroups userspace have agreed to these rules now:

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Among other things they ask all userspace code to only create subgroups below the group they are started in, so not only systemd should work fine in a container environment but everything else following these rules.
In other words: so far one gets away quite nicely with the fact that the cgroup tree is not virtualized.

> - device namespaces. We periodically think about having a separate
> set of devices and to support things like losetup in a container
> that seems necessary. Most of the time getting all of the way
> to device namespaces seems unnecessary.

Well, I am sure people use containers in all kinds of weird ways, but for me personally I am quite sure that containers should live in a fully virtualized world and never get access to real devices.

> As for tests on what to start up:

Note again that my list above is not complete at all, and the point I was trying to make is that while you can find nice hooks for this in many cases, at the end of the day you actually do want to detect containers for a few specific cases.

> - udev. All of the kernel interfaces for udev should be supported in
> current kernels. However I believe udev is useless because container
> start drops CAP_MKNOD so we can't do evil things. So I would
> recommend basing the startup of udev on presence of CAP_MKNOD.

Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev in a systemd world makes use of that.

> - VTs. Ptys should be well supported at this point. For the rest
> they are physical hardware that a container should not be playing with
> so I would decide which gettys to start up based on which device nodes
> are present in /dev.

Well, I am not sure it's that easy since device nodes tend to show up dynamically in bare systems. So if you just check whether /dev/tty0 is there you might end up thinking you are in a container when you actually aren't simply because you did that check before udev loaded the DRI driver or so.

> - sysctls (aka /proc/sys). That is a tricky one. Until the user namespace
> is fleshed out a little more sysctls are going to be a problem,
> because root can write to most of them.
> My gut feel says you probably
> want to base the decision to poke at sysctls on CAP_SYS_ADMIN. At least that
> test will become true when user namespaces are rolled out, and at
> that point you will want to set all of the sysctls you have permission
> to.

So what we did right now in systemd-nspawn is that the container supervisor premounts /proc/sys read-only into the container. That way writes to it will fail in the container, and while you get a number of warnings things will work as they should (though not necessarily safely since the container can still remount the fs unless you take CAP_SYS_ADMIN away).

> - selinux. It really should be in the same category. You should be
> able to attempt to load a policy and have it fail in a way that
> indicates that selinux is not currently supported. I don't know if
> we can make that work right until we get the user namespace into
> a usable shape.

The SELinux folks modified libselinux on my request to consider selinux off if /sys/fs/selinux is already mounted and read-only. That means with a new container userspace this problem is mostly worked around too. It is crucial to make libselinux know that selinux is off because otherwise it will continue to muck with the xattr labels where it shouldn't. And if you want to fully virtualize this you probably should hide selinux xattrs entirely in the container.

> So while I agree a check to see if something is a container seems
> reasonable, I do not agree that the pid namespace is the place to put
> that information. I see no natural way to put that information in the
> pid namespace.

Well, a simple way would be to have a line in /proc/1/status called "PIDNamespaceLevel:" or so which would be 0 for the root namespace, and increased for each namespace nested in it. Then, processes could simply read that and be happy.
> I further think there are a lot of reasonable checks for whether a
> kernel feature is supported in the current environment that I would
> rather pursue over hacks based on the fact we are in a container.

Well, believe me, we have been trying to find nicer hooks than explicit checks for containers, but I am quite sure that at the end of the day you won't be able to go without them entirely.

Lennart

--
Lennart Poettering - Red Hat, Inc.

^ permalink raw reply [flat|nested] 28+ messages in thread
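The "container=lxc" environment-variable convention mentioned in the message above can be checked from inside the container roughly as follows. This is a minimal sketch, not the systemd code linked in the message: the function names are hypothetical, and reading /proc/1/environ is an assumption about where the variable would be looked up (it normally requires privileges).

```python
def container_type(environ_bytes):
    """Return the value of the 'container' variable, or None if it is unset.

    environ_bytes is a NUL-separated environment block, in the format
    found in /proc/<pid>/environ.
    """
    for entry in environ_bytes.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep and name == b"container":
            return value.decode("utf-8", "replace")
    return None


def read_init_environ(path="/proc/1/environ"):
    """Read PID 1's environment block; return b"" if it cannot be read."""
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        return b""
```

Looking at PID 1 rather than at the caller's own environment is one way to sidestep the inheritance concern raised above, since variables set for the container's init are otherwise passed further down the process tree.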
* Re: Detecting if you are running in a container 2011-10-10 21:41 ` Lennart Poettering @ 2011-10-11 5:40 ` Eric W. Biederman 2011-10-11 6:54 ` Eric W. Biederman 2011-10-12 16:59 ` Kay Sievers 2 siblings, 0 replies; 28+ messages in thread From: Eric W. Biederman @ 2011-10-11 5:40 UTC (permalink / raw) To: Lennart Poettering Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage Lennart Poettering <mzxreary@0pointer.de> writes: > On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote: > >> > Quite a few kernel subsystems are >> > currently not virtualized, for example SELinux, VTs, most of sysfs, most >> > of /proc/sys, audit, udev or file systems (by which I mean that for a >> > container you probably don't want to fsck the root fs, and so on), and >> > containers tend to be much more lightweight than real systems. >> >> That is an interesting viewpoint on what is not complete. But as a >> listing of the tasks that distribution startup needs to do differently in >> a container the list seems more or less reasonable. > > Note that this is just what came to my mind while I was typing this, I > am quite sure there's actually more like this. > >> There are two questions >> - How in the general case do we detect if we are running in a container. >> - How do we make reasonable tests during bootup to see if it makes sense >> to perform certain actions. >> >> For the general detection if we are running in a linux container I can >> see two reasonable possibilities. >> >> - Put a file in / that let's you know by convention that you are in a >> linux container. I am inclined to do this because this is something >> we can support on all kernels old and new. > > Hmpf. That would break the stateless read-only-ness of the root dir. > > After pointing the issue out to the LXC folks they are now setting > "container=lxc" as env var when spawning a container. 
> In systemd-nspawn I have then adopted a similar scheme. I am not sure
> that is particularly nice, though, since env vars are inherited further
> down the tree where we probably don't want them.

Interesting. That seems like a reasonable enough thing to require of the programs that create containers.

> In case you are curious: this is the code we use in systemd:
>
> http://cgit.freedesktop.org/systemd/tree/src/virt.c
>
> What matters to me though is that we can generically detect Linux
> containers instead of specific implementations.

>> - Allow modification to the output of uname(2). The uts namespace
>> already covers uname(2) and uname is the standard method to
>> communicate to userspace the vagaries about the OS level environment
>> they are running in.
>
> Well, I am not a particular fan of having userspace tell userspace about
> containers. I would prefer if userspace could get that info from the
> kernel without any explicit agreement to set some specific variable.

Well, userspace tells userspace about stdin and it works reliably. Containers are a userspace construct built with kernel facilities. I don't see why asking userspace to implement a convention is any more important than the other things that have to be done.

We do need to document the conventions, just like we document the standard device names, but beyond that we should be fine.

>> My list of things that still have work left to do looks like:
>> - cgroups. It is not safe to create new hierarchies with groups
>> that are in existing hierarchies. So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
> > In fact, most folks involved in cgroups userspace have agreed to these > rules now: > > http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups > > Among other things they ask all userspace code to only create subgroups > below the group they are started in, so not only systemd should work > fine in a container environment but everything else following these > rules. > > In other words: so far one gets away quite nicely with the fact that the > cgroup tree is not virtualized. Assuming you bind mount the cgroups inside and generally don't allow people in a container to create cgroup hierarchies. At the very least that is nasty information leakage. But I am glad there is a solution for right now. For my uses I have yet to find control groups anything but borked. >> - VTs. Ptys should be well supported at this point. For the rest >> they are physical hardware that a container should not be playing with >> so I would base which gettys to start up based on which device nodes >> are present in /dev. > > Well, I am not sure it's that easy since device nodes tend to show up > dynamically in bare systems. So if you just check whether /dev/tty0 is > there you might end up thinking you are in a container when you actually > aren't simply because you did that check before udev loaded the DRI > driver or so. But the point isn't to detect a container the point is to decide if a getty needs to be spawned. Even with the configuration for a getty you need to wait for the device node to exist before spawning one. >> - sysctls (aka /proc/sys) that is a trick one. Until the user namespace >> is fleshed out a little more sysctls are going to be a problem, >> because root can write to most of them. My gut feel says you probably >> want to base that to poke at sysctls on CAP_SYS_ADMIN. At least that >> test will become true when the userspaces are rolled out, and at >> that point you will want to set all of the sysctls you have permission >> to. 
>
> So what we did right now in systemd-nspawn is that the container
> supervisor premounts /proc/sys read-only into the container. That way
> writes to it will fail in the container, and while you get a number of
> warnings things will work as they should (though not necessarily safely
> since the container can still remount the fs unless you take
> CAP_SYS_ADMIN away).

That sort of works. In practice it means you can't set up interesting things like forwarding in the networking stack. But it certainly gets things going.

>> So while I agree a check to see if something is a container seems
>> reasonable, I do not agree that the pid namespace is the place to put
>> that information. I see no natural way to put that information in the
>> pid namespace.
>
> Well, a simple way would be to have a line in /proc/1/status called
> "PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
> increased for each namespace nested in it. Then, processes could simply
> read that and be happy.

Not a chance. PIDNamespaceLevel is exposing an implementation detail that may well change in the lifetime of a process. It is true we don't have migration merged in the kernel yet but one of these days I expect we will.

>> I further think there are a lot of reasonable checks for whether a
>> kernel feature is supported in the current environment that I would
>> rather pursue over hacks based on the fact we are in a container.
>
> Well, believe me, we have been trying to find nicer hooks than explicit
> checks for containers, but I am quite sure that at the end of the day
> you won't be able to go without them entirely.

And you have explicit information you are in a container at this point. It looks like all that is left is documentation of the conventions.

Eric

^ permalink raw reply [flat|nested] 28+ messages in thread
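The read-only /proc/sys premount discussed in this exchange, and the libselinux behaviour Lennart describes for /sys/fs/selinux, both come down to asking whether a given mount point is mounted read-only. A minimal sketch of that check follows, assuming the usual whitespace-separated /proc/self/mounts layout (device, mount point, type, options, ...). The helper name is hypothetical and mount points containing escaped spaces are not handled.

```python
def mount_is_read_only(mounts_text, mount_point):
    """Return True or False for the last mount at mount_point, None if absent.

    mounts_text is the content of /proc/self/mounts. Later lines are mounted
    on top of earlier ones, so the last matching line decides.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        result = "ro" in fields[3].split(",")
    return result
```

Startup code could skip applying sysctl settings when `mount_is_read_only(text, "/proc/sys")` is true. As noted in the thread, this is a convention rather than a security boundary while the container still holds CAP_SYS_ADMIN.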
* Re: Detecting if you are running in a container
  2011-10-10 21:41 ` Lennart Poettering
  2011-10-11 5:40 ` Eric W. Biederman
@ 2011-10-11 6:54 ` Eric W. Biederman
  2011-10-12 16:59 ` Kay Sievers
  2 siblings, 0 replies; 28+ messages in thread
From: Eric W. Biederman @ 2011-10-11 6:54 UTC (permalink / raw)
To: Lennart Poettering
Cc: Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

Lennart Poettering <mzxreary@0pointer.de> writes:

> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:

>> My list of things that still have work left to do looks like:
>> - cgroups. It is not safe to create new hierarchies with groups
>> that are in existing hierarchies. So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

Wow. Are cgroups really that complicated to use? A list of rules a page long on what you have to do to make them useful and non-conflicting. Something seems off.

Perhaps we need a rule: don't mount multiple controllers in the same hierarchy.

Eric

^ permalink raw reply [flat|nested] 28+ messages in thread
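The one rule from the quoted text that matters most for containers (only create subgroups below the group you were started in) can be sketched as follows. The sketch assumes the "hierarchy-ID:controller-list:path" line format of /proc/self/cgroup and uses hypothetical helper names; it only computes paths and does not touch the cgroup filesystem.

```python
def own_cgroups(cgroup_text):
    """Map each controller name to the cgroup path this process lives in.

    cgroup_text is the content of /proc/self/cgroup, one line per hierarchy.
    """
    groups = {}
    for line in cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        _hierarchy_id, controllers, path = parts
        for controller in controllers.split(","):
            groups[controller] = path
    return groups


def subgroup_path(own_path, name):
    """Return the path of a new group placed directly below own_path."""
    return own_path.rstrip("/") + "/" + name
```

A service manager started in `/lxc/web` would then create `/lxc/web/<service>` and never anything above or beside `/lxc/web`, which is what makes the arrangement stackable even though the cgroup tree itself is not virtualized.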
* Re: Detecting if you are running in a container 2011-10-10 21:41 ` Lennart Poettering 2011-10-11 5:40 ` Eric W. Biederman 2011-10-11 6:54 ` Eric W. Biederman @ 2011-10-12 16:59 ` Kay Sievers 2011-11-01 22:05 ` [lxc-devel] " Michael Tokarev 2 siblings, 1 reply; 28+ messages in thread From: Kay Sievers @ 2011-10-12 16:59 UTC (permalink / raw) To: Lennart Poettering Cc: Eric W. Biederman, Matt Helsley, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote: > On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote: >> - udev. All of the kernel interfaces for udev should be supported in >> current kernels. However I believe udev is useless because container >> start drops CAP_MKNOD so we can't do evil things. So I would >> recommend basing the startup of udev on presence of CAP_MKNOD. > > Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev > in a systemd world makes use of that. Done. http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb Thanks, Kay ^ permalink raw reply [flat|nested] 28+ messages in thread
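The CAP_MKNOD test agreed on above can be approximated from userspace by reading the capability masks in /proc/self/status. The following is a minimal sketch and not the udev change referenced in the message: the helper name is hypothetical, and the capability numbers are taken from linux/capability.h.

```python
# Capability bit numbers as defined in linux/capability.h.
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27


def has_capability(status_text, cap, field="CapEff"):
    """Check one capability bit in a /proc/<pid>/status capability field.

    status_text is the file content; field is one of CapInh, CapPrm,
    CapEff or CapBnd, each printed by the kernel as a hexadecimal mask.
    """
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key == field:
            return bool(int(value.strip(), 16) & (1 << cap))
    return False  # field missing: assume the capability is not held
```

A boot script could read `/proc/self/status` once and start udev only when `has_capability(text, CAP_MKNOD)` is true; the same helper with `CAP_SYS_ADMIN` covers the sysctl case raised earlier in the thread.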
* Re: [lxc-devel] Detecting if you are running in a container 2011-10-12 16:59 ` Kay Sievers @ 2011-11-01 22:05 ` Michael Tokarev 2011-11-01 23:51 ` Eric W. Biederman 0 siblings, 1 reply; 28+ messages in thread From: Michael Tokarev @ 2011-11-01 22:05 UTC (permalink / raw) To: Kay Sievers Cc: Lennart Poettering, greg, Paul Menage, linux-kernel, david, Eric W. Biederman, Linux Containers, Linux Containers, Serge E. Hallyn, harald [Replying to an oldish email...] On 12.10.2011 20:59, Kay Sievers wrote: > On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote: >> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote: > >>> - udev. All of the kernel interfaces for udev should be supported in >>> current kernels. However I believe udev is useless because container >>> start drops CAP_MKNOD so we can't do evil things. So I would >>> recommend basing the startup of udev on presence of CAP_MKNOD. >> >> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev >> in a systemd world makes use of that. > > Done. > > http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb Maybe CAP_MKNOD isn't actually a good idea, having in mind devtmpfs? Without CAP_MKNOD, is devtmpfs still being populated internally by the kernel, so that udev only needs to change ownership/permissions and maintain symlinks in response to device changes, and perform other duties (reacting to other types of events) normally? In other words, provided devtmpfs works even without CAP_MKNOD, I can easily imagine a whole system running without this capability from the very early boot, with all functionality in place, including udev and what not... And having CAP_MKNOD in container may not be that bad either, while cgroup device.permission is set correctly - some nodes may need to be created still, even in an unprivileged containers. 
Who filters out CAP_MKNOD during container startup (I don't see it in the code, which only removes CAP_SYS_BOOT, and even that due to current limitation), and which evil things can be done if it is not filtered? Thanks, /mjt ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [lxc-devel] Detecting if you are running in a container 2011-11-01 22:05 ` [lxc-devel] " Michael Tokarev @ 2011-11-01 23:51 ` Eric W. Biederman 2011-11-02 8:08 ` Michael Tokarev 0 siblings, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2011-11-01 23:51 UTC (permalink / raw) To: Michael Tokarev Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel, david, Linux Containers, Linux Containers, Serge E. Hallyn, harald Michael Tokarev <mjt@tls.msk.ru> writes: > [Replying to an oldish email...] > > On 12.10.2011 20:59, Kay Sievers wrote: >> On Mon, Oct 10, 2011 at 23:41, Lennart Poettering <mzxreary@0pointer.de> wrote: >>> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote: >> >>>> - udev. All of the kernel interfaces for udev should be supported in >>>> current kernels. However I believe udev is useless because container >>>> start drops CAP_MKNOD so we can't do evil things. So I would >>>> recommend basing the startup of udev on presence of CAP_MKNOD. >>> >>> Using CAP_MKNOD as test here is indeed a good idea. I'll make sure udev >>> in a systemd world makes use of that. >> >> Done. >> >> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=9371e6f3e04b03692c23e392fdf005a08ccf1edb > > Maybe CAP_MKNOD isn't actually a good idea, having in mind devtmpfs? > > Without CAP_MKNOD, is devtmpfs still being populated internally by > the kernel, so that udev only needs to change ownership/permissions > and maintain symlinks in response to device changes, and perform > other duties (reacting to other types of events) normally? > > In other words, provided devtmpfs works even without CAP_MKNOD, > I can easily imagine a whole system running without this capability > from the very early boot, with all functionality in place, including > udev and what not... Agreed devtmpfs does pretty much make dropping CAP_MKNOD useless. I expect we should verify that whoever mounts devtmpfs has CAP_MKNOD. 
> And having CAP_MKNOD in container may not be that bad either, while
> cgroup device.permission is set correctly - some nodes may need to
> be created still, even in unprivileged containers. Who filters
> out CAP_MKNOD during container startup (I don't see it in the code,
> which only removes CAP_SYS_BOOT, and even that due to current
> limitation), and which evil things can be done if it is not filtered?

If you don't filter which device nodes a process can read/write then that process can access any device on the system. It can steal the keyboard or the X display, access any filesystem, and directly access memory. Basically the process can escalate that permission to full control of the system without needing any kernel bugs to help it.

Eric

^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [lxc-devel] Detecting if you are running in a container
  2011-11-01 23:51 ` Eric W. Biederman
@ 2011-11-02 8:08 ` Michael Tokarev
  0 siblings, 0 replies; 28+ messages in thread
From: Michael Tokarev @ 2011-11-02 8:08 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Kay Sievers, Lennart Poettering, greg, Paul Menage, linux-kernel, david, Linux Containers, Linux Containers, Serge E. Hallyn, harald

On 02.11.2011 03:51, Eric W. Biederman wrote:
[]
>> And having CAP_MKNOD in container may not be that bad either, while
>> cgroup device.permission is set correctly - some nodes may need to
>> be created still, even in unprivileged containers. Who filters
>> out CAP_MKNOD during container startup (I don't see it in the code,
>> which only removes CAP_SYS_BOOT, and even that due to current
>> limitation), and which evil things can be done if it is not filtered?
>
> If you don't filter which device nodes a process can read/write then
> that process can access any device on the system. It can steal the
> keyboard or the X display, access any filesystem, and directly access
> memory. Basically the process can escalate that permission to full
> control of the system without needing any kernel bugs to help it.

There's CAP_MKNOD, and there's cgroup/devices.{allow,deny}. Even with CAP_MKNOD, a container cannot _use_ devices not allowed by the latter. That's what I'm talking about: finer-grained control than CAP_MKNOD exists. And my question was in this context: with proper cgroup-level device control in place, what harm does CAP_MKNOD do?

Thanks,

/mjt

^ permalink raw reply [flat|nested] 28+ messages in thread
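The devices cgroup that Michael refers to is configured with entries of the form "type major:minor access", for example "c 1:3 rwm", where type is a, b or c, either number may be "*", and access is any combination of r (read), w (write) and m (mknod). A simplified model of how such a whitelist constrains a container, even one that holds CAP_MKNOD, might look like this. The function names are hypothetical and the matching is deliberately cruder than the kernel's: deny entries and rule merging are ignored.

```python
def parse_rule(rule):
    """Split a devices.allow style entry into (type, major, minor, access set)."""
    dev_type, numbers, access = rule.split()
    major, minor = numbers.split(":")
    return dev_type, major, minor, set(access)


def access_allowed(rules, dev_type, major, minor, access):
    """Return True if a single whitelist entry grants every requested access.

    dev_type is "b" or "c"; access is a string drawn from "r", "w" and "m".
    """
    wanted = set(access)
    for rule in rules:
        r_type, r_major, r_minor, r_access = parse_rule(rule)
        if r_type not in ("a", dev_type):
            continue
        if r_major not in ("*", str(major)):
            continue
        if r_minor not in ("*", str(minor)):
            continue
        if wanted <= r_access:
            return True
    return False
```

With a whitelist of `["c 1:3 rwm"]` (/dev/null), a request for read access to character device 1:1 (/dev/mem) is refused in this model regardless of who created the node, which is the distinction between creating a device node and being allowed to use it.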
* Re: Detecting if you are running in a container
  2011-10-10 20:59 ` Detecting if you are running in a container Eric W. Biederman
  2011-10-10 21:41 ` Lennart Poettering
@ 2011-10-11 1:32 ` Ted Ts'o
  [not found] ` <20111011020530.GG16723@count0.beaverton.ibm.com>
  1 sibling, 1 reply; 28+ messages in thread
From: Ted Ts'o @ 2011-10-11 1:32 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Lennart Poettering, Matt Helsley, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
> Lennart Poettering <mzxreary@0pointer.de> writes:
>
> > To make a standard distribution run nicely in a Linux container you
> > usually have to make quite a number of modifications to it and disable
> > certain things from the boot process. Ideally however, one could simply
> > boot the same image on a real machine and in a container and would just
> > do the right thing, fully stateless. And for that you need to be able to
> > detect containers, and currently you can't.
>
> I agree getting to the point where we can run a standard distribution
> unmodified in a container sounds like a reasonable goal.

Hmm, interesting. It's not clear to me that running a full standard distribution in a container is always going to be what everyone wants to do.

The whole point of containers versus VMs is that containers are lighter weight. And one of the ways that containers can be lighter weight is if you don't have to have N copies of udev, dbus, and so on running in each container/VM.

If you end up with so much overhead to provide the desired security and/or performance isolation, then it becomes fair to ask the question whether you might as well pay a tad bit more and get even better security and isolation by using a VM solution....

- Ted

^ permalink raw reply [flat|nested] 28+ messages in thread
[parent not found: <20111011020530.GG16723@count0.beaverton.ibm.com>]
* Re: Detecting if you are running in a container [not found] ` <20111011020530.GG16723@count0.beaverton.ibm.com> @ 2011-10-11 3:25 ` Ted Ts'o 2011-10-11 6:42 ` Eric W. Biederman 2011-10-11 22:25 ` david 1 sibling, 1 reply; 28+ messages in thread From: Ted Ts'o @ 2011-10-11 3:25 UTC (permalink / raw) To: Matt Helsley Cc: Eric W. Biederman, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote: > Yes, it does detract from the unique advantages of using a container. > However, I think the value here is not the effeciency of the initial > system configuration but the fact that it gives users a better place to > start. > > Right now we're effectively asking users to start with non-working > and/or unfamiliar systems and repair them until they work. If things are not working with containers, I would submit to you that we're doing something wrong(tm). Things should just work, except that processes in one container can't use more than their fair share (as dictated by policy) of memory, CPU, networking, and I/O bandwidth. Something which is baked in my world view of containers (which I suspect is not shared by other people who are interested in using containers) is that given that kernel is shared, trying to use containers to provide better security isolation between mutually suspicious users is hopeless. That is, it's pretty much impossible to prevent a user from finding one or more zero day local privilege escalation bugs that will allow a user to break root. And at that point, they will be able to penetrate the kernel, and from there, break security of other processes. So if you want that kind of security isolation, you shouldn't be using containers in the first place. You should be using KVM or Xen, and then only after spending a huge amount of effort fuzz testing the KVM/Xen paravirtualization interfaces. 
So at least in my mind, adding vast amounts of complexity to try to provide security isolation via containers is really not worth it. And if that's the model, then it's a lot easier to run jobs in containers in a way that doesn't require changes to the distro plus a huge increase of complexity for containers in the kernel.... - Ted ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-11 3:25 ` Ted Ts'o @ 2011-10-11 6:42 ` Eric W. Biederman 2011-10-11 12:53 ` Theodore Tso 0 siblings, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2011-10-11 6:42 UTC (permalink / raw) To: Ted Ts'o Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage Ted Ts'o <tytso@mit.edu> writes: > On Mon, Oct 10, 2011 at 07:05:30PM -0700, Matt Helsley wrote: >> Yes, it does detract from the unique advantages of using a container. >> However, I think the value here is not the effeciency of the initial >> system configuration but the fact that it gives users a better place to >> start. >> >> Right now we're effectively asking users to start with non-working >> and/or unfamiliar systems and repair them until they work. > > If things are not working with containers, I would submit to you that > we're doing something wrong(tm). That is what this discussion is about. What we are doing wrong(tm). Mostly it is about the bits that have not yet been namespacified but need to be. I am totally in favor of not starting the entire world. But just like I find it convenient to loopback mount an iso image to see what is on a disk image, it would be handy to be able to just download a distro image and play with it, without doing anything special. We can pare things down further for the people who are running 1000 copies of apache, but not requiring detailed distro surgery before starting up the binaries on a livecd sounds handy. > Things should just work, except that > processes in one container can't use more than their fair share (as > dictated by policy) of memory, CPU, networking, and I/O bandwidth. You have to be careful with the limiters. The fundamental reason why containers are more efficient than hardware virtualization is that with containers we can do over commit of resources, especially memory.
I keep seeing implementations of resource limiters that want to do things in a heavy handed way that break resource over commit. > Something which is baked in my world view of containers (which I > suspect is not shared by other people who are interested in using > containers) is that given that kernel is shared, trying to use > containers to provide better security isolation between mutually > suspicious users is hopeless. That is, it's pretty much impossible to > prevent a user from finding one or more zero day local privilege > escalation bugs that will allow a user to break root. And at that > point, they will be able to penetrate the kernel, and from there, > break security of other processes. You don't even have to get to security problems to have that concern. There are enough crazy timing and side channel attacks. I don't know what concern you have security wise, but the problem that wants to be solved with user namespaces is something you hit much earlier than when you worry about sharing a kernel between mutually distrusting users. Right now root inside a container is root outside of a container, just like in a chroot jail. Where this becomes a problem is that people change things like /proc/sys/kernel/print-fatal-signals expecting it to be a setting local to their sandbox when in fact it is the global setting, and things start behaving weirdly for other users. Running sysctl -a during bootup has that problem in spades. With user namespaces what we get is that the global root user is not the container root user and we have been working our way through the permission checks in the kernel to ensure we get them right in the context of the user namespace. This trivially means that the things that we allow the global root user to do in /proc/ and /sys and the like simply won't be allowed as a container root user. Which makes doing something stupid and affecting other people much more difficult.
What the user namespace also allows is an escape hatch from the bonds of suid. Right now anything that could confuse an existing app that is suid root we have to only allow to root, or risk adding a security hole. With the user namespaces we can relax that check and allow it also for container root users as well as global root users. When we are brave enough and certain enough of our code we can allow non-root users to create their own user namespaces. There is the third use for containers where for some reason we have uid assignment overlap. Perhaps one distro assigns uid 22 to sshd and another to the nobody user. Or perhaps there are two departments that have done the silly thing of assigning overlapping uids to their users and we want to access filesystems created by both departments at the same time without a chance of confusion and conflict. With my sysadmin hat on I would not want to touch two untrusting groups of users on the same machine. Because of the probability there is at least one security hole that can be found and exploited to allow privilege escalation. With my kernel developer hat on I can't just surrender to the idea that there will in fact be a privilege escalation bug that is easy to exploit. The code has to be built and designed so that privilege escalation is difficult. Otherwise we might as well assume if you visit a website a stealthy worm has taken over your computer. It is my hope at the end of the day that the user namespaces will be one more line of defense in messing up and slowing down the evil omniscient worms that seem to unerringly go for every privilege exploit there is. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-11 6:42 ` Eric W. Biederman @ 2011-10-11 12:53 ` Theodore Tso 2011-10-11 21:16 ` Eric W. Biederman 0 siblings, 1 reply; 28+ messages in thread From: Theodore Tso @ 2011-10-11 12:53 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: > I am totally in favor of not starting the entire world. But just > like I find it convienient to loopback mount an iso image to see > what is on a disk image. It would be handy to be able to just > download a distro image and play with it, without doing anything > special. Agreed, but what's wrong with firing up KVM to play with a distro image? Personally, I don't consider that "doing something special". > >> Things should just work, except that >> processes in one container can't use more than their fair share (as >> dictated by policy) of memory, CPU, networking, and I/O bandwidth. > > You have to be careful with the limiters. The fundamental reason > why containers are more efficient than hardware virtualization is > that with containers we can do over commit of resources, especially > memory. I keep seeing implementations of resource limiters that want > to do things in a heavy handed way that break resource over commit. Oh, sure. Resource limiting is something that should be done only when there are other demands on the resource in question. Put another way, it should be considered more of a resource guarantee than a resource limit. (You will have at least 10% of the CPU, not at most 10% of the CPU.) > > I don't know what concern you have security wise, but the problem that > wants to be solved with user namespaces is something you hit much > earlier than when you worry about sharing a kernel between mutually > distrusting users. 
Right now root inside a container is root rout > outside of a container just like in a chroot jail. Where this becomes a > problem is that people change things like like > /proc/sys/kernel/print-fatal-signals expecting it to be a setting local > to their sand box when in fact the global setting and things start > behaving weirdly for other users. Running sysctl -a during bootup > has that problem in spades. The moment you start caring about global sysctl settings is the moment I start wondering whether or not VM and separate kernel images is the better solution. Do we really want to add so much complexity that we are multiplexing different sysctl settings across containers? To my mind, that way lies madness, and in some cases, it simply can't be done from a semantics perspective. > > With my sysadmin hat on I would not want to touch two untrusting groups > of users on the same machine. Because of the probability there is at > least one security hole that can be found and exploited to allow > privilege escalation. > > With my kernel developer hat on I can't just say surrender to the > idea that there will in fact be a privilege escalation bug that > is easy to exploit. The code has to be built and designed so that > privilege escalation is difficult. Otherwise we might as well > assume if you visit a website an stealthy worm has taken over your > computer. Oh, I agree that we should try to stop privilege escalation attacks. And it will be a grand and glorious fight, like Leonidas and his 300 men at the pass at Thermopylae. :-) Or it will be like Steve Jobs struggling against cancer. It's a fight that you know that you're going to lose, but it's not about winning or losing but how much you accomplish and how you fight that counts. Personally, though, if the issue is worries about visiting a website, the primary protection against that has got to be done at the browser level (i.e., the process level sandboxing done by Chrome). 
-- Ted ^ permalink raw reply [flat|nested] 28+ messages in thread
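The distinction drawn in the message above between a resource guarantee ("at least 10% of the CPU") and a resource limit ("at most 10% of the CPU") can be illustrated with a toy allocator. This is only a sketch with made-up names and numbers, and it does not describe how the kernel's control group controllers work: each container first receives up to its guaranteed amount, and any capacity nobody is using is then handed to whoever still wants more.

```python
def allocate(capacity, guarantees, demands):
    """Toy work-conserving allocator (hypothetical, for illustration only).

    capacity   -- total units of the resource
    guarantees -- dict: container -> guaranteed units (sum <= capacity)
    demands    -- dict: container -> units the container currently wants
    """
    # Step 1: everyone gets what they ask for, up to their guarantee.
    alloc = {c: min(demands[c], guarantees[c]) for c in demands}

    # Step 2: capacity left idle is shared among containers that want
    # more, in proportion to their unmet demand.
    spare = capacity - sum(alloc.values())
    unmet = {c: demands[c] - alloc[c] for c in demands if demands[c] > alloc[c]}
    total_unmet = sum(unmet.values())
    if spare > 0 and total_unmet > 0:
        scale = min(1.0, spare / total_unmet)
        for c, want in unmet.items():
            alloc[c] += want * scale
    return alloc
```

With guarantees of 10 and 90 units out of 100, a container holding the 10-unit guarantee can still use 50 units while the other one is mostly idle, but is squeezed back to 10 once the other one demands its full share.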
* Re: Detecting if you are running in a container 2011-10-11 12:53 ` Theodore Tso @ 2011-10-11 21:16 ` Eric W. Biederman 2011-10-11 22:30 ` david 2011-10-12 17:57 ` J. Bruce Fields 0 siblings, 2 replies; 28+ messages in thread From: Eric W. Biederman @ 2011-10-11 21:16 UTC (permalink / raw) To: Theodore Tso Cc: Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage Theodore Tso <tytso@MIT.EDU> writes: > On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: > >> I am totally in favor of not starting the entire world. But just >> like I find it convienient to loopback mount an iso image to see >> what is on a disk image. It would be handy to be able to just >> download a distro image and play with it, without doing anything >> special. > > Agreed, but what's wrong with firing up KVM to play with a distro > image? Personally, I don't consider that "doing something special". Then let me flip this around and give a much more practical use case. Testing. A very interesting number of cases involve how multiple machines interact. You can test a lot more logical machines interacting with containers than you can with vms. And you can test on all the architectures and platforms Linux supports, not just the handful that are well supported by hardware virtualization. I admit for a lot of test cases that it makes sense not to use a full set of userspace daemons. At the same time there is no particularly good reason to have a design that doesn't allow you to run a full userspace. >>> Things should just work, except that >>> processes in one container can't use more than their fair share (as >>> dictated by policy) of memory, CPU, networking, and I/O bandwidth. >> >> You have to be careful with the limiters.
The fundamental reason >> why containers are more efficient than hardware virtualization is >> that with containers we can do over commit of resources, especially >> memory. I keep seeing implementations of resource limiters that want >> to do things in a heavy handed way that break resource over commit. > > Oh, sure. Resource limiting is something that should be done only > when there are other demands on the resource in question. Put > another way, it should be considered more of a resource guarantee than > a resource limit. (You will have at least 10% of the CPU, not at > most 10% of the CPU.) Resource guarantees I suspect may be worse. But all of this is to say that the problem control groups are tackling is a hard one. Resource control and resource limits across multiple processes is a challenge problem and in some contexts it is a hard problem. My observations have been that when you want any kind of strong resource guarantee or resource limit, it is currently a lot easier to implement that with hardware virtualization than with control groups (at least for memory). I think the cpu scheduling has been solved but until you also at least solve user space memory there are going to be issues. At the same time getting better resource controls is an area where there is a strong interest from all over the place. >> I don't know what concern you have security wise, but the problem that >> wants to be solved with user namespaces is something you hit much >> earlier than when you worry about sharing a kernel between mutually >> distrusting users. Right now root inside a container is root rout >> outside of a container just like in a chroot jail. Where this becomes a >> problem is that people change things like like >> /proc/sys/kernel/print-fatal-signals expecting it to be a setting local >> to their sand box when in fact the global setting and things start >> behaving weirdly for other users. Running sysctl -a during bootup >> has that problem in spades. 
> > The moment you start caring about global sysctl settings is the moment > I start wondering whether or not VM and separate kernel images is the > better solution. Do we really want to add so much complexity that we > are multiplexing different sysctl settings across containers? To my > mind, that way lies madness, and in some cases, it simply can't be > done from a semantics perspective. It actually isn't much complexity and for the most part the code that I care about in that area is already merged. In principle all I care about are having the identity checks go from: (uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2)) There are some per subsystem sysctls that do make sense to make per namespace and that work is mostly done. I expect there are a few more in the networking stack that are interesting to make per network namespace. The only real world issue right now that I am aware of is that user namespaces aren't quite ready for prime-time and so people run into issues where something like sysctl -a during bootup sets a bunch of sysctls and they change sysctls they didn't mean to. Once the user namespaces are in place accessing a truly global sysctl will result in EPERM when you are in a container and everyone will be happy. ;) Where all of this winds up interesting in the field of oncoming kernel work is that uids are persistent and are stored in file systems. So once we have all of the permission checks in the kernel tweaked to care about user namespaces we next look at the filesystems. The easy initial implementation is going to be just associating a user namespace with a super block. But further out being able to store uids from different user namespaces on the same filesystem becomes an interesting problem. We already have things like user mapping in 9p and nfsv4 so it isn't wholly uncharted territory. But it could get interesting. Just a heads up. >> With my sysadmin hat on I would not want to touch two untrusting groups >> of users on the same machine.
Because of the probability there is at >> least one security hole that can be found and exploited to allow >> privilege escalation. >> >> With my kernel developer hat on I can't just say surrender to the >> idea that there will in fact be a privilege escalation bug that >> is easy to exploit. The code has to be built and designed so that >> privilege escalation is difficult. Otherwise we might as well >> assume if you visit a website an stealthy worm has taken over your >> computer. > > Oh, I agree that we should try to stop privilege escalation attacks. > And it will be a grand and glorious fight, like Leonidas and his 300 > men at the pass at Thermopylae. :-) Or it will be like Steve Jobs > struggling against cancer. It's a fight that you know that you're > going to lose, but it's not about winning or losing but how much you > accomplish and how you fight that counts. > > Personally, though, if the issue is worries about visiting a website, > the primary protection against that has got to be done at the browser > level (i.e., the process level sandboxing done by Chrome). My concern is any externally implemented service, but in general browsers and web sites are your most likely candidates. Both because there is more complexity there and because HTTP is used far more often than other protocols. And yes I agree that the first line of defense needs to be in the browser source code, and then the application level sandboxing features that the browser takes advantage of. Last I paid attention, one of the layers of defense that Chrome used was to set up different namespaces to make the sandbox tight even at the syscall level. When it is complete I would not be at all surprised if the user namespace wound up being used in Chrome as well. Just as one more thing that helps. I have found it very surprising how many of the namespaces are used for what you can't do with them. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
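The change to the identity check described in this message, from (uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2)), can be sketched in a few lines. The Cred type and the integer namespace ids below are hypothetical stand-ins chosen for illustration; they are not the kernel's data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    user_ns: int  # hypothetical id of the user namespace the uid lives in
    uid: int      # numeric uid as seen inside that namespace


def same_user_uid_only(a, b):
    """The older check: two credentials match if the uids are equal."""
    return a.uid == b.uid


def same_user_ns_aware(a, b):
    """The namespace-aware check: the namespaces must match as well."""
    return a.user_ns == b.user_ns and a.uid == b.uid


host_root = Cred(user_ns=0, uid=0)
container_root = Cred(user_ns=1, uid=0)
```

Under the uid-only check container_root is indistinguishable from host_root; under the namespace-aware check the two no longer compare equal, which is the property that would let a write to a truly global sysctl be refused from inside a container.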
* Re: Detecting if you are running in a container 2011-10-11 21:16 ` Eric W. Biederman @ 2011-10-11 22:30 ` david 2011-10-12 4:26 ` Eric W. Biederman 2011-10-12 17:57 ` J. Bruce Fields 1 sibling, 1 reply; 28+ messages in thread From: david @ 2011-10-11 22:30 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Tue, 11 Oct 2011, Eric W. Biederman wrote: > Theodore Tso <tytso@MIT.EDU> writes: > >> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: >> >>> I am totally in favor of not starting the entire world. But just >>> like I find it convienient to loopback mount an iso image to see >>> what is on a disk image. It would be handy to be able to just >>> download a distro image and play with it, without doing anything >>> special. >> >> Agreed, but what's wrong with firing up KVM to play with a distro >> image? Personally, I don't consider that "doing something special". > > Then let me flip this around and give a much more practical use case. > Testing. A very interesting number of cases involve how multiple > machines interact. You can test a lot more logical machines interacting > with containers than you can with vms. And you can test on all the > aritectures and platforms linux supports not just the handful that are > well supported by hardware virtualization. but in containers, you are not really testing lots of machines, you are testing lots of processes on the same machine (they share the same kernel) > I admit for a lot of test cases that it makes sense not to use a full > set of userspace daemons. At the same time there is not particularly > good reason to have a design that doesn't allow you to run a full > userspace. how do you share the display between all the different containers if they are trying to run the X server? 
how do you avoid all the containers binding to the same port on the default IP address? how do you arbitrate dbus across the containers? when a new USB device gets plugged in, which container gets control of it? there are a LOT of hard questions when you start talking about running a full system inside a container that do not apply to other uses of containers. David Lang ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-11 22:30 ` david @ 2011-10-12 4:26 ` Eric W. Biederman 2011-10-12 5:10 ` david 0 siblings, 1 reply; 28+ messages in thread From: Eric W. Biederman @ 2011-10-12 4:26 UTC (permalink / raw) To: david Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage david@lang.hm writes: > On Tue, 11 Oct 2011, Eric W. Biederman wrote: > >> Theodore Tso <tytso@MIT.EDU> writes: >> >>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: >>> >>>> I am totally in favor of not starting the entire world. But just >>>> like I find it convienient to loopback mount an iso image to see >>>> what is on a disk image. It would be handy to be able to just >>>> download a distro image and play with it, without doing anything >>>> special. >>> >>> Agreed, but what's wrong with firing up KVM to play with a distro >>> image? Personally, I don't consider that "doing something special". >> >> Then let me flip this around and give a much more practical use case. >> Testing. A very interesting number of cases involve how multiple >> machines interact. You can test a lot more logical machines interacting >> with containers than you can with vms. And you can test on all the >> aritectures and platforms linux supports not just the handful that are >> well supported by hardware virtualization. > > but in containers, you are not really testing lots of machines, you are testing > lots of processes on the same machine (they share the same kernel) True. But usually that is the interesting part. >> I admit for a lot of test cases that it makes sense not to use a full >> set of userspace daemons. At the same time there is not particularly >> good reason to have a design that doesn't allow you to run a full >> userspace. > > how do you share the display between all the different containers if they are > trying to run the X server? 
Either X does not start because the hardware it needs is not present or Xnest or similar gets started. > how do you avoid all the containers binding to the same port on the default IP > address? Network namespaces. > how do you arbitrate dbus across the containers. Why should you? > when a new USB device gets plugged in, which container gets control of > it? None of them. Although today they may all get the uevent. None of the containers should have permission to call mknod to mess with it. > there are a LOT of hard questions when you start talking about running a full > system inside a container that do not apply for other use of > containers. Not really; mostly the answer is that you say no. Eric ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-12 4:26 ` Eric W. Biederman @ 2011-10-12 5:10 ` david 2011-10-12 15:08 ` Serge E. Hallyn 0 siblings, 1 reply; 28+ messages in thread From: david @ 2011-10-12 5:10 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Tue, 11 Oct 2011, Eric W. Biederman wrote: > david@lang.hm writes: > >> On Tue, 11 Oct 2011, Eric W. Biederman wrote: >> >>> Theodore Tso <tytso@MIT.EDU> writes: >>> >>>> On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: >>>> >>> I admit for a lot of test cases that it makes sense not to use a full >>> set of userspace daemons. At the same time there is not particularly >>> good reason to have a design that doesn't allow you to run a full >>> userspace. >> >> how do you share the display between all the different containers if they are >> trying to run the X server? > > Either X does not start because the hardware it needs is not present or > Xnest or similar gets started. > >> how do you avoid all the containers binding to the same port on the default IP >> address? > > Network namespaces. > >> how do you arbitrate dbus across the containers. > > Why should you? because the containers are simulating different machines, and dbus doesn't work arcross different machines. >> when a new USB device gets plugged in, which container gets control of >> it? > > None of them. Although today they may all get the uevent. None of the > containers should have permission to call mknod to mess with it. why would the software inside a container not have the rights to do a mknod inside the container? >> there are a LOT of hard questions when you start talking about running a full >> system inside a container that do not apply for other use of >> containers. > > Not really mostly the answer is that you say no. 
> > Eric > David Lang ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-12 5:10 ` david @ 2011-10-12 15:08 ` Serge E. Hallyn 0 siblings, 0 replies; 28+ messages in thread From: Serge E. Hallyn @ 2011-10-12 15:08 UTC (permalink / raw) To: david Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Daniel Lezcano, Paul Menage Quoting david@lang.hm (david@lang.hm): > On Tue, 11 Oct 2011, Eric W. Biederman wrote: > > >david@lang.hm writes: > > > >>On Tue, 11 Oct 2011, Eric W. Biederman wrote: > >> > >>>Theodore Tso <tytso@MIT.EDU> writes: > >>> > >>>>On Oct 11, 2011, at 2:42 AM, Eric W. Biederman wrote: > >>>> > >>>I admit for a lot of test cases that it makes sense not to use a full > >>>set of userspace daemons. At the same time there is not particularly > >>>good reason to have a design that doesn't allow you to run a full > >>>userspace. > >> > >>how do you share the display between all the different containers if they are > >>trying to run the X server? > > > >Either X does not start because the hardware it needs is not present or > >Xnest or similar gets started. > > > >>how do you avoid all the containers binding to the same port on the default IP > >>address? > > > >Network namespaces. > > > >>how do you arbitrate dbus across the containers. > > > >Why should you? > > because the containers are simulating different machines, and dbus > doesn't work arcross different machines. Exactly - Eric is saying dbus should not be (and is not) shared among containers. > >>when a new USB device gets plugged in, which container gets control of > >>it? > > > >None of them. Although today they may all get the uevent. None of the > >containers should have permission to call mknod to mess with it. > > why would the software inside a container not have the rights to do > a mknod inside the container? Why shouldn't an unprivileged user be allowed to mknod on the host? 
-serge ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container 2011-10-11 21:16 ` Eric W. Biederman 2011-10-11 22:30 ` david @ 2011-10-12 17:57 ` J. Bruce Fields 2011-10-12 18:25 ` Kyle Moffett 1 sibling, 1 reply; 28+ messages in thread From: J. Bruce Fields @ 2011-10-12 17:57 UTC (permalink / raw) To: Eric W. Biederman Cc: Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote: > It actually isn't much complexity and for the most part the code that > I care about in that area is already merged. In principle all I care > about are having the identiy checks go from: > (uid1 == uid2) to ((user_ns1 == user_ns2) && (uid1 == uid2)) > > There are some per subsystem sysctls that do make sense to make per > subsystem and that work is mostly done. I expect there are a few > more in the networking stack that interesting to make per network > namespace. > > The only real world issue right now that I am aware of is the user > namespace aren't quite ready for prime-time and so people run into > issues where something like sysctl -a during bootup sets a bunch of > sysctls and they change sysctls they didn't mean to. Once the > user namespaces are in place accessing a truly global sysctl will > result in EPERM when you are in a container and everyone will be > happy. ;) > > > Where all of this winds up interesting in the field of oncoming kernel > work is that uids are persistent and are stored in file systems. So > once we have all of the permission checks in the kernel tweaked to care > about user namespaces we next look at the filesystems. The easy > initial implementation is going to be just associating a user namespace > with a super block. But farther out being able to store uids from > different user namespaces on the same filesystem becomes an interesting > problem. Yipes. 
Why would anyone want to do that? --b. > We already have things like user mapping in 9p and nfsv4 so it isn't > wholly uncharted territory. But it could get interesting. Just > a heads up. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Detecting if you are running in a container
From: Kyle Moffett @ 2011-10-12 18:25 UTC
To: J. Bruce Fields
Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> Where all of this winds up interesting in the field of oncoming kernel
>> work is that uids are persistent and are stored in file systems. So
>> once we have all of the permission checks in the kernel tweaked to care
>> about user namespaces we next look at the filesystems. The easy
>> initial implementation is going to be just associating a user namespace
>> with a super block. But farther out being able to store uids from
>> different user namespaces on the same filesystem becomes an interesting
>> problem.
>
> Yipes. Why would anyone want to do that?

Consider an NFS file server providing collaborative access to multiple
independently managed domains (e.g. several different universities),
each with their own LDAP userid database and Kerberos services.

The common server is in its own realm and allows cross-realm
authentication to the other university realms, using the origin realm
to decide what namespace to map each user into.

If you are just doing read-only operations then you don't need any
kind of namespace persistence on the NFS server's storage. On the
other hand, if you want to allow users to collaborate and create ACLs
then you need something dramatically more involved.

On the wire, the Kerberos server would simply identify each NFSv4 ACL
entry with a particular realm ID, but in the backend it would need
some filesystem-level disambiguation (possibly the recently-proposed
RichACL features?)

Cheers,
Kyle Moffett

* Re: Detecting if you are running in a container
From: J. Bruce Fields @ 2011-10-12 19:04 UTC
To: Kyle Moffett
Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
> >> Where all of this winds up interesting in the field of oncoming kernel
> >> work is that uids are persistent and are stored in file systems. So
> >> once we have all of the permission checks in the kernel tweaked to care
> >> about user namespaces we next look at the filesystems. The easy
> >> initial implementation is going to be just associating a user namespace
> >> with a super block. But farther out being able to store uids from
> >> different user namespaces on the same filesystem becomes an interesting
> >> problem.
> >
> > Yipes. Why would anyone want to do that?
>
> Consider an NFS file server providing collaborative access to multiple
> independently managed domains (e.g. several different universities),
> each with their own LDAP userid database and Kerberos services.
>
> The common server is in its own realm and allows cross-realm
> authentication to the other university realms, using the origin realm
> to decide what namespace to map each user into.
>
> If you are just doing read-only operations then you don't need any
> kind of namespace persistence on the NFS server's storage. On the
> other hand, if you want to allow users to collaborate and create ACLs
> then you need something dramatically more involved.

Yeah, OK, I suppose I'd imagined mapping into the server's id space
somehow for that case, but I suppose this would be cleaner. Still,
seems like a big pain....

> On the wire, the Kerberos server would simply identify each NFSv4 ACL
> entry with a particular realm ID, but in the backend it would need
> some filesystem-level disambiguation (possibly the recently-proposed
> RichACL features?)

That doesn't help with owner and group.

--b.

* Re: Detecting if you are running in a container
From: Kyle Moffett @ 2011-10-12 19:12 UTC
To: J. Bruce Fields
Cc: Eric W. Biederman, Theodore Tso, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 15:04, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Wed, Oct 12, 2011 at 02:25:04PM -0400, Kyle Moffett wrote:
>> On Wed, Oct 12, 2011 at 13:57, J. Bruce Fields <bfields@fieldses.org> wrote:
>> > On Tue, Oct 11, 2011 at 02:16:24PM -0700, Eric W. Biederman wrote:
>> >> Where all of this winds up interesting in the field of oncoming kernel
>> >> work is that uids are persistent and are stored in file systems. So
>> >> once we have all of the permission checks in the kernel tweaked to care
>> >> about user namespaces we next look at the filesystems. The easy
>> >> initial implementation is going to be just associating a user namespace
>> >> with a super block. But farther out being able to store uids from
>> >> different user namespaces on the same filesystem becomes an interesting
>> >> problem.
>> >
>> > Yipes. Why would anyone want to do that?
>>
>> Consider an NFS file server providing collaborative access to multiple
>> independently managed domains (e.g. several different universities),
>> each with their own LDAP userid database and Kerberos services.
>>
>> The common server is in its own realm and allows cross-realm
>> authentication to the other university realms, using the origin realm
>> to decide what namespace to map each user into.
>>
>> If you are just doing read-only operations then you don't need any
>> kind of namespace persistence on the NFS server's storage. On the
>> other hand, if you want to allow users to collaborate and create ACLs
>> then you need something dramatically more involved.
>
> Yeah, OK, I suppose I'd imagined mapping into the server's id space
> somehow for that case, but I suppose this would be cleaner. Still,
> seems like a big pain....
>
>> On the wire, the Kerberos server would simply identify each NFSv4 ACL
>> entry with a particular realm ID, but in the backend it would need
>> some filesystem-level disambiguation (possibly the recently-proposed
>> RichACL features?)
>
> That doesn't help with owner and group.

Well, you're going to need to introduce a bunch of new xattrs to
handle the namespacing anyways.

As I understand it you can use RichACLs to grant all the same
privileges as owner and group, so you can simply map the real
namespaced owner and group into RichACLs (or another xattr) and force
the inode uid/gid to be root/root (or maybe nobody/nogroup or
something).

I am of course making it sound a million times easier than it's
actually likely to be, but I do think it's possible without too many
odd corner cases.

Cheers,
Kyle Moffett

* Re: Detecting if you are running in a container
From: Ted Ts'o @ 2011-10-14 15:54 UTC
To: Kyle Moffett
Cc: J. Bruce Fields, Eric W. Biederman, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Wed, Oct 12, 2011 at 03:12:34PM -0400, Kyle Moffett wrote:
> Well, you're going to need to introduce a bunch of new xattrs to
> handle the namespacing anyways.
>
> As I understand it you can use RichACLs to grant all the same
> privileges as owner and group, so you can simply map the real
> namespaced owner and group into RichACLs (or another xattr) and force
> the inode uid/gid to be root/root (or maybe nobody/nogroup or
> something).

It's going to be all about mapping tables, and whether the mapping is
done in userspace or kernel space. For example, you might want to
take a Kerberos principal name and map it to a 128-bit identifier
(64-bit realm id + 64-bit user id), and that in turn might require
mapping to some 32-bit Linux uid namespace. If people want to support
multiple 32-bit Linux uid namespaces, then it's a question of how you
name these uid namespaces, how to manage the mapping tables outside
of the kernel, and how the mapping tables get loaded into the kernel.

> I am of course making it sound a million times easier than it's
> actually likely to be, but I do think it's possible without too many
> odd corner cases.

It's not the corner cases, it's all of the different namespaces that
different system administrators and their sites are going to want to
use, and how to support them all....

And of course, once we start naming uid namespaces, eventually
someone will want to virtualize containers, and then we will have
namespaces for namespaces. (It's turtles all the way down! :-)

						- Ted

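The two-stage mapping Ted describes (principal name to a 128-bit identifier, then to a 32-bit local uid) can be sketched in userspace roughly as follows. This is only an illustration under stated assumptions: the realm-to-id table, the starting uid of 100000, and the names `pack_identity`, `unpack_identity` and `UidMapper` are hypothetical and are not taken from any proposal in this thread.

```python
"""Sketch: principal -> 128-bit (realm id, user id) -> 32-bit local uid."""

ID_BITS = 64          # width of each half of the 128-bit identifier
UID_LIMIT = 2 ** 32   # local Linux uids must fit in 32 bits


def pack_identity(realm_id, user_id):
    """Combine a 64-bit realm id and a 64-bit user id into one 128-bit int."""
    for value in (realm_id, user_id):
        if not 0 <= value < 2 ** ID_BITS:
            raise ValueError("id does not fit in 64 bits")
    return (realm_id << ID_BITS) | user_id


def unpack_identity(identity):
    """Split a 128-bit identifier back into (realm_id, user_id)."""
    return identity >> ID_BITS, identity & (2 ** ID_BITS - 1)


class UidMapper:
    """Hypothetical userspace mapping table; allocation policy is assumed."""

    def __init__(self, realm_ids, first_uid=100000):
        self._realm_ids = dict(realm_ids)  # realm name -> 64-bit realm id
        self._uids = {}                    # 128-bit identity -> local uid
        self._next_uid = first_uid

    def identity_for(self, principal, user_id):
        """Build the 128-bit identity for a principal such as 'alice@A.EDU'."""
        realm = principal.rpartition("@")[2]
        return pack_identity(self._realm_ids[realm], user_id)

    def local_uid(self, identity):
        """Return a stable 32-bit uid, allocating one on first use."""
        if identity not in self._uids:
            if self._next_uid >= UID_LIMIT:
                raise OverflowError("no 32-bit uids left")
            self._uids[identity] = self._next_uid
            self._next_uid += 1
        return self._uids[identity]
```

The same user id in two realms yields two distinct 128-bit identities and therefore two distinct local uids. How such a table would be named, managed, and loaded into the kernel is exactly the open question raised above.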
* Re: Detecting if you are running in a container
From: Eric W. Biederman @ 2011-10-14 18:04 UTC
To: Ted Ts'o
Cc: Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

Ted Ts'o <tytso@mit.edu> writes:

>> I am of course making it sound a million times easier than it's
>> actually likely to be, but I do think it's possible without too many
>> odd corner cases.
>
> It's not the corner cases, it's all of the different name spaces that
> different system administrators and their sites are going to want to
> use, and how to support them all....
>
> And of course, once we start naming uid name spaces, eventually
> someone will want to virtualize containers, and then we will have
> namespaces for namespaces. (It's turtles all the way down! :-)

I have found and merged a solution that allows us to name namespaces
without needing a namespace for namespaces.

Eric

* Re: Detecting if you are running in a container
From: H. Peter Anvin @ 2011-10-14 21:58 UTC
To: Eric W. Biederman
Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
>
> I have found and merged a solution that allows us to name namespaces
> without needing a namespace for namespaces.
>

Something based on UUIDs, perhaps?

UUIDs are kind of exactly this, after all... a single namespace designed
to be large and random enough to be globally unique without a central
registration authority.

	-hpa

* Re: Detecting if you are running in a container
From: Eric W. Biederman @ 2011-10-16 9:42 UTC
To: H. Peter Anvin
Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/14/2011 11:04 AM, Eric W. Biederman wrote:
>>
>> I have found and merged a solution that allows us to name namespaces
>> without needing a namespace for namespaces.
>>
>
> Something based on UUIDs, perhaps?
>
> UUIDs are kind of exactly this, after all... a single namespace designed
> to be large and random enough to be globally unique without a central
> registration authority.

    mount --bind /proc/self/ns/net /var/run/netns/<name>

When we want to refer to the namespace in syscalls we pass a file
descriptor we received from opening the namespace reference object.

That moves the entire naming problem into the file namespace.

Eric

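To illustrate what "moving the naming problem into the file namespace" can look like from userspace, here is a minimal sketch that treats a namespace reference as an ordinary path, opens it, and compares two references by device and inode number. The helper names `namespace_key` and `same_namespace` are made up for this example, and the comparison rule (same `st_dev` and `st_ino` means same namespace) is the convention described in namespaces(7) for current kernels, not something stated in this thread. Actually entering a namespace would pass such a descriptor to setns(2), which this sketch does not do.

```python
import os


def namespace_key(path):
    """Open a namespace reference file and return its (st_dev, st_ino) pair.

    `path` would typically be something like /proc/self/ns/net or a bind
    mount of it under /var/run/netns/<name>; any readable file works here.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        st = os.fstat(fd)
        return (st.st_dev, st.st_ino)
    finally:
        os.close(fd)


def same_namespace(path_a, path_b):
    """Report whether two reference paths point at the same object."""
    return namespace_key(path_a) == namespace_key(path_b)


if __name__ == "__main__":
    ref = "/proc/self/ns/net"
    if os.path.exists(ref):
        # A process is always in the same namespace as itself.
        print(same_namespace(ref, ref))
```

Because the name is just a path, access control, lookup, and lifetime all follow the usual filesystem rules, which is the design point Eric makes above.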
* Re: Detecting if you are running in a container
From: H. Peter Anvin @ 2011-10-30 20:11 UTC
To: Eric W. Biederman
Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>
>> Something based on UUIDs, perhaps?
>>
>> UUIDs are kind of exactly this, after all... a single namespace designed
>> to be large and random enough to be globally unique without a central
>> registration authority.
>
> mount --bind /proc/self/ns/net /var/run/netns/<name>
>
> When we want to refer to the namespace in syscalls we pass a file
> descriptor we received from opening the namespace reference object.
>
> That moves the entire naming problem into the file namespace.
>

That doesn't solve what I think of as the *real* problem.

The real problem is just another instance of what I sometimes refer to
as the "alien metadata problem": the alien metadata problem (which crops
up in *all kinds* of contexts, including containers, namespaces, virtual
machines, building distribution disk images, and backups) is the fact
that you would like to be able to store, manipulate and preserve, on
disk and in a mounted filesystem, a set of metadata which may not be the
"currently active" metadata.

There are two forms of "solutions" to this: one where the filesystem
still only contains one set of metadata, but it is not currently active,
and one where the filesystem contains multiple sets of metadata for the
same files at the same time, any one of which can be active (and
different ones may be active for different namespaces.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

* Re: Detecting if you are running in a container
From: Eric W. Biederman @ 2011-11-01 13:38 UTC
To: H. Peter Anvin
Cc: Ted Ts'o, Kyle Moffett, J. Bruce Fields, Matt Helsley, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

"H. Peter Anvin" <hpa@zytor.com> writes:

> On 10/16/2011 02:42 AM, Eric W. Biederman wrote:
>>>
>>> Something based on UUIDs, perhaps?
>>>
>>> UUIDs are kind of exactly this, after all... a single namespace designed
>>> to be large and random enough to be globally unique without a central
>>> registration authority.
>>
>> mount --bind /proc/self/ns/net /var/run/netns/<name>
>>
>> When we want to refer to the namespace in syscalls we pass a file
>> descriptor we received from opening the namespace reference object.
>>
>> That moves the entire naming problem into the file namespace.
>>
>
> That doesn't solve what I think of as the *real* problem.

It solves the problem of not needing a namespace of namespaces, and it
solves the problem of not requiring universal agreement between all
filesystems on all operating systems on how things should look.

In not precluding different solutions it makes a large stride forward.

> The real problem is just another instance of what I sometimes refer to
> as the "alien metadata problem": the alien metadata problem (which crops
> up in *all kinds* of contexts, including containers, namespaces, virtual
> machines, building distribution disk images, and backups) is the fact
> that you would like to be able to store, manipulate and preserve, on
> disk and in a mounted filesystem, a set of metadata which may not be the
> "currently active" metadata.

When you throw in network filesystems with different notions of
metadata, things get even more interesting.

> There are two forms of "solutions" to this: one where the filesystem
> still only contains one set of metadata, but it is not currently active,
> and one where the filesystem contains multiple sets of metadata for the
> same files at the same time, any one of which can be active (and
> different ones may be active for different namespaces.)

There is an important tool that seems to be missing from your toolbox.

- Mapping the metadata on the file into different contexts.

The way I see it, classic unix filesystems have exactly one context that
their metadata is expected to work in: the context in which the
filesystem is mounted. However, it is very easy to conceive of that
context being specified at a per-inode granularity. In that case at
least the backup and the distribution disk image problems can be solved
by trivially specifying a different context and associating a user
namespace with that context. Then you switch into the user namespace to
manipulate "alien metadata".

Where mapping comes in is when those files are accessed from another
context besides the one where all of their metadata falls. At that
point you can map all of the files to be owned by the user who is
responsible for making backups. The mapping is a bit like the root
squash setting.

For the common case I expect we will settle on a well-defined acl across
the native unix filesystems that allows us to make this persistent. For
network filesystems with their broader interoperability requirements,
how to specify this gets a little more interesting.

For purposes of implementation it doesn't matter to me if that acl is a
uuid or a unique string. For management of the data it might.

How I expect a native linux filesystem to work when it encounters a
filesystem with a user namespace acl is that it will work like nfsv4 and
do an upcall into userspace, to ask the appropriate userspace agent "how
do I understand this acl?". The userspace mapping agent will say: Oh,
you want the user namespace for "hpa-backups"? Let's see:
/var/run/userns/hpa-backups exists, let me just tell the kernel about
that mapping.

Or perhaps the user namespace does not exist, so the mapping daemon
would go out and create it by consulting configuration files in /etc to
know that everything in "hpa-backups" should be a child user namespace,
with the user "hpa" being able to switch into that user namespace
without root permission.

Files with metadata for more than one user namespace/context I expect
to work similarly. Care needs to be taken that it doesn't drive the
administrator crazy.

Eric

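The context mapping Eric outlines can be sketched as two small pieces: a lookup that checks whether a named user namespace already has a reference file (as in the /var/run/userns/hpa-backups example), and a root-squash-like rule that shows the stored uid only to viewers in the file's own context. Everything here is a hypothetical illustration: the `Owner` record, the function names, and the squash uid passed by the caller are assumptions, and no kernel upcall interface of this shape is claimed to exist.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    """Ownership as stored with a file: a context label plus a uid in it."""
    context: str  # name of the user namespace the uid belongs to
    uid: int


def find_userns(name, run_dir="/var/run/userns"):
    """Return the reference path for a named user namespace, or None.

    Mirrors the mapping agent's first step: check whether
    <run_dir>/<name> already exists before creating anything.
    """
    path = os.path.join(run_dir, name)
    return path if os.path.exists(path) else None


def visible_uid(owner, viewer_context, squash_uid):
    """Map a stored owner into the viewer's context.

    Inside the file's own context the stored uid is used as-is; from any
    other context the file appears owned by `squash_uid`, much like the
    NFS root squash setting maps root to an unprivileged user.
    """
    if owner.context == viewer_context:
        return owner.uid
    return squash_uid
```

For example, a file stored as `Owner("hpa-backups", 1000)` would appear as uid 1000 inside "hpa-backups" and as the configured backup user's uid everywhere else.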
* Re: Detecting if you are running in a container
From: david @ 2011-10-11 22:25 UTC
To: Matt Helsley
Cc: Ted Ts'o, Eric W. Biederman, Lennart Poettering, Kay Sievers, linux-kernel, harald, david, greg, Linux Containers, Linux Containers, Serge E. Hallyn, Daniel Lezcano, Paul Menage

On Mon, 10 Oct 2011, Matt Helsley wrote:

> On Mon, Oct 10, 2011 at 09:32:01PM -0400, Ted Ts'o wrote:
>> On Mon, Oct 10, 2011 at 01:59:10PM -0700, Eric W. Biederman wrote:
>>> Lennart Poettering <mzxreary@0pointer.de> writes:
>>>
>>>> To make a standard distribution run nicely in a Linux container you
>>>> usually have to make quite a number of modifications to it and disable
>>>> certain things from the boot process. Ideally however, one could simply
>>>> boot the same image on a real machine and in a container and would just
>>>> do the right thing, fully stateless. And for that you need to be able to
>>>> detect containers, and currently you can't.
>>>
>>> I agree getting to the point where we can run a standard distribution
>>> unmodified in a container sounds like a reasonable goal.
>>
>> Hmm, interesting. It's not clear to me that running a full standard
>> distribution in a container is always going to be what everyone wants
>> to do.
>>
>> The whole point of containers versus VM's is that containers are
>> lighter weight. And one of the ways that containers can be lighter
>> weight is if you don't have to have N copies of udev, dbus, running in
>> each container/VM.
>>
>> If you end up so much overhead to provide the desired security and/or
>> performance isolation, then it becomes fair to ask the question
>> whether you might as well pay a tad bit more and get even better
>> security and isolation by using a VM solution....
>>
>> - Ted
>
> Yes, it does detract from the unique advantages of using a container.
> However, I think the value here is not the efficiency of the initial
> system configuration but the fact that it gives users a better place to
> start.
>
> Right now we're effectively asking users to start with non-working
> and/or unfamiliar systems and repair them until they work.
>
> By enabling unmodified distro installs in a container we're starting
> at the other end. The choices may not be the most efficient but the
> user may begin tuning from a working configuration. They can learn
> about and tune those parts that prove significant for their workload.
> This is better because in the end it's not just about how efficient the
> user can make their containers but how much effort they will spend
> achieving and maintaining that efficiency over time.

What's needed isn't a way to run all the daemons, processes and startup
scripts that a distro uses in a container without conflicting with the
parent, but instead an easy way to create the appropriate config changes
in the parent (bind mounts, cgroups, etc.) for the container and start
up the apps that are wanted in the container. This needs to be something
with a lot of knowledge and hooks in the parent, so it's not just a
matter of adding a way to detect "am I in a container" or not.

When I run things in containers, I want to bind mount some things from
the parent, I want to configure syslog to listen on /dev/log inside the
container, and then I want to start up just the processes I am planning
to use inside the container, not all the daemons and other processes
that I need to run the service the container is built for.

David Lang

Thread overview: 28+ messages
[not found] <1317943022.1095.25.camel@mop>
[not found] ` <20111007074904.GC16723@count0.beaverton.ibm.com>
[not found] ` <20111007160113.GB14201@tango.0pointer.de>
[not found] ` <m17h4g2jqy.fsf@fess.ebiederm.org>
[not found] ` <20111010163140.GA22191@tango.0pointer.de>
2011-10-10 20:59 ` Detecting if you are running in a container Eric W. Biederman
2011-10-10 21:41 ` Lennart Poettering
2011-10-11 5:40 ` Eric W. Biederman
2011-10-11 6:54 ` Eric W. Biederman
2011-10-12 16:59 ` Kay Sievers
2011-11-01 22:05 ` [lxc-devel] " Michael Tokarev
2011-11-01 23:51 ` Eric W. Biederman
2011-11-02 8:08 ` Michael Tokarev
2011-10-11 1:32 ` Ted Ts'o
[not found] ` <20111011020530.GG16723@count0.beaverton.ibm.com>
2011-10-11 3:25 ` Ted Ts'o
2011-10-11 6:42 ` Eric W. Biederman
2011-10-11 12:53 ` Theodore Tso
2011-10-11 21:16 ` Eric W. Biederman
2011-10-11 22:30 ` david
2011-10-12 4:26 ` Eric W. Biederman
2011-10-12 5:10 ` david
2011-10-12 15:08 ` Serge E. Hallyn
2011-10-12 17:57 ` J. Bruce Fields
2011-10-12 18:25 ` Kyle Moffett
2011-10-12 19:04 ` J. Bruce Fields
2011-10-12 19:12 ` Kyle Moffett
2011-10-14 15:54 ` Ted Ts'o
2011-10-14 18:04 ` Eric W. Biederman
2011-10-14 21:58 ` H. Peter Anvin
2011-10-16 9:42 ` Eric W. Biederman
2011-10-30 20:11 ` H. Peter Anvin
2011-11-01 13:38 ` Eric W. Biederman
2011-10-11 22:25 ` david