From: ebiederm@xmission.com (Eric W. Biederman)
To: Lennart Poettering <mzxreary@0pointer.de>
Cc: Matt Helsley <matthltc@us.ibm.com>,
Kay Sievers <kay.sievers@vrfy.org>,
linux-kernel@vger.kernel.org, harald@redhat.com, david@fubar.dk,
greg@kroah.com, Linux Containers <containers@lists.osdl.org>,
Linux Containers <lxc-devel@lists.sourceforge.net>,
"Serge E. Hallyn" <serge@hallyn.com>,
Daniel Lezcano <daniel.lezcano@free.fr>,
Paul Menage <paul@paulmenage.org>
Subject: Re: Detecting if you are running in a container
Date: Mon, 10 Oct 2011 22:40:34 -0700
Message-ID: <m1obxojdbh.fsf@fess.ebiederm.org>
In-Reply-To: <20111010214148.GB26510@tango.0pointer.de> (Lennart Poettering's message of "Mon, 10 Oct 2011 23:41:48 +0200")
Lennart Poettering <mzxreary@0pointer.de> writes:
> On Mon, 10.10.11 13:59, Eric W. Biederman (ebiederm@xmission.com) wrote:
>
>> > Quite a few kernel subsystems are
>> > currently not virtualized, for example SELinux, VTs, most of sysfs, most
>> > of /proc/sys, audit, udev or file systems (by which I mean that for a
>> > container you probably don't want to fsck the root fs, and so on), and
>> > containers tend to be much more lightweight than real systems.
>>
>> That is an interesting viewpoint on what is not complete. But as a
>> listing of the tasks that distribution startup needs to do differently in
>> a container the list seems more or less reasonable.
>
> Note that this is just what came to my mind while I was typing this, I
> am quite sure there's actually more like this.
>
>> There are two questions
>> - How in the general case do we detect if we are running in a container.
>> - How do we make reasonable tests during bootup to see if it makes sense
>> to perform certain actions.
>>
>> For the general detection if we are running in a linux container I can
>> see two reasonable possibilities.
>>
>> - Put a file in / that lets you know by convention that you are in a
>> linux container. I am inclined to do this because this is something
>> we can support on all kernels old and new.
>
> Hmpf. That would break the stateless read-only-ness of the root dir.
>
> After pointing the issue out to the LXC folks they are now setting
> "container=lxc" as env var when spawning a container. In systemd-nspawn
> I have then adopted a similar scheme. Not sure though that that is
> particularly nice however, since env vars are inherited further down the
> tree where we probably don't want them.
Interesting. That seems like a reasonable enough thing to require
of the programs that create containers.
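
A minimal sketch of the convention described above, assuming the
container manager sets "container=<name>" in the environment of the
container's init. The function name and the option of passing the
environment in as a mapping are illustrative assumptions; this is not
the systemd code referenced below.

```python
import os


def container_from_env(environ=None):
    """Return the value of the `container` variable, or None if unset or empty.

    `environ` defaults to this process's environment; passing a mapping
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if environ is None else environ
    value = env.get("container", "")
    return value or None


if __name__ == "__main__":
    print(container_from_env() or "no container variable set")
```

Because environment variables are inherited, a process far down the
tree sees the same value unless init clears it before spawning
services.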
> In case you are curious: this is the code we use in systemd:
>
> http://cgit.freedesktop.org/systemd/tree/src/virt.c
>
> What matters to me though is that we can generically detect Linux
> containers instead of specific implementations.
>> - Allow modification to the output of uname(2). The uts namespace
>> already covers uname(2) and uname is the standard method to
>> communicate to userspace the vagaries about the OS level environment
>> they are running in.
>
> Well, I am not a particular fan of having userspace tell userspace about
> containers. I would prefer if userspace could get that info from the
> kernel without any explicit agreement to set some specific variable.
Well userspace tells userspace about stdin and it works reliably.
Containers are a userspace construct built with kernel facilities.
I don't see why asking userspace to implement a convention is any more
important than the other things that have to be done.
We do need to document the conventions, just as we document the
standard device names, but beyond that we should be fine.
>> My list of things that still have work left to do looks like:
>> - cgroups. It is not safe to create new hierarchies with groups
>> that are in existing hierarchies. So cgroups don't work.
>
> Well, for systemd they actually work quite fine since systemd will
> always place its own cgroups below the cgroup it is started in. cgroups
> hence make these things nicely stackable.
>
> In fact, most folks involved in cgroups userspace have agreed to these
> rules now:
>
> http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups
>
> Among other things they ask all userspace code to only create subgroups
> below the group they are started in, so not only systemd should work
> fine in a container environment but everything else following these
> rules.
>
> In other words: so far one gets away quite nicely with the fact that the
> cgroup tree is not virtualized.
Assuming you bind mount the cgroups inside and generally don't allow
people in a container to create cgroup hierarchies. At the very least
that is nasty information leakage.
But I am glad there is a solution for right now.
For my uses I have yet to find control groups anything but borked.
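
The "only create subgroups below the group you are started in" rule can
be sketched as follows, assuming the process can read its own
/proc/self/cgroup, whose lines have the form
hierarchy-ID:controller-list:cgroup-path. The helper names and the
sample paths are hypothetical.

```python
def parse_proc_cgroup(text):
    """Map each controller list to the cgroup path this process is in.

    Each line of /proc/<pid>/cgroup has the form
    hierarchy-ID:controller-list:cgroup-path.
    """
    result = {}
    for line in text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _hierarchy_id, controllers, path = parts
        result[controllers] = path
    return result


def subgroup_path(own_path, name):
    """Build the path of a subgroup placed directly below own_path."""
    return own_path.rstrip("/") + "/" + name


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as f:
            for controllers, path in parse_proc_cgroup(f.read()).items():
                print(controllers, "->", subgroup_path(path, "example"))
    except OSError:
        print("/proc/self/cgroup is not readable here")
```

The sketch only computes where a subgroup would go. Actually creating
it still requires the hierarchy to be mounted and writable inside the
container.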
>> - VTs. Ptys should be well supported at this point. For the rest
>> they are physical hardware that a container should not be playing with
>> so I would base which gettys to start up based on which device nodes
>> are present in /dev.
>
> Well, I am not sure it's that easy since device nodes tend to show up
> dynamically in bare systems. So if you just check whether /dev/tty0 is
> there you might end up thinking you are in a container when you actually
> aren't simply because you did that check before udev loaded the DRI
> driver or so.
But the point isn't to detect a container; the point is to decide if
a getty needs to be spawned. Even with the configuration for a getty
you need to wait for the device node to exist before spawning one.
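
As a rough sketch of that decision, assuming a list of configured
terminal names and a check for a character device node under /dev. The
function names and the injectable predicate are hypothetical.

```python
import os
import stat


def is_char_device(path):
    """True if path exists and is a character device node."""
    try:
        return stat.S_ISCHR(os.stat(path).st_mode)
    except OSError:
        return False


def ttys_to_serve(candidates, dev_dir="/dev", node_present=is_char_device):
    """Return the configured terminals whose device node is present.

    node_present is injectable so the selection logic can be exercised
    without real device nodes.
    """
    return [name for name in candidates
            if node_present(os.path.join(dev_dir, name))]


if __name__ == "__main__":
    print(ttys_to_serve(["tty1", "tty2", "console"]))
```

This is a one-shot check. An init system would more likely re-run it
when a device node appears than poll at a fixed point during boot.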
>> - sysctls (aka /proc/sys) that is a tricky one. Until the user namespace
>> is fleshed out a little more sysctls are going to be a problem,
>> because root can write to most of them. My gut feel says you probably
>> want to base the decision to poke at sysctls on CAP_SYS_ADMIN. At least
>> that test will become true when user namespaces are rolled out, and at
>> that point you will want to set all of the sysctls you have permission
>> to.
>
> So what we did right now in systemd-nspawn is that the container
> supervisor premounts /proc/sys read-only into the container. That way
> writes to it will fail in the container, and while you get a number of
> warnings things will work as they should (though not necessarily safely
> since the container can still remount the fs unless you take
> CAP_SYS_ADMIN away).
That sort of works. In practice it means you can't set up interesting
things like forwarding in the networking stack. But it certainly gets
things going.
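
The CAP_SYS_ADMIN test suggested above could look roughly like this,
assuming the effective capability mask is read from the "CapEff:" line
of /proc/self/status (a hexadecimal bitmask). The function names are
hypothetical; bit 21 is CAP_SYS_ADMIN in linux/capability.h.

```python
CAP_SYS_ADMIN = 21  # bit number from linux/capability.h


def effective_caps(status_text):
    """Return the effective capability mask from /proc/<pid>/status text.

    Returns 0 if no CapEff line is found.
    """
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    return 0


def has_cap_sys_admin(status_text):
    """True if the CAP_SYS_ADMIN bit is set in the effective mask."""
    return bool(effective_caps(status_text) & (1 << CAP_SYS_ADMIN))


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("apply sysctls:", has_cap_sys_admin(f.read()))
    except OSError:
        print("/proc/self/status is not readable here")
```

Holding the capability does not guarantee a write succeeds, for example
when /proc/sys is mounted read-only as described above, so individual
write failures still need to be tolerated.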
>> So while I agree a check to see if something is a container seems
>> reasonable, I do not agree that the pid namespace is the place to put
>> that information. I see no natural way to put that information in the
>> pid namespace.
>
> Well, a simple way would be to have a line in /proc/1/status called
> "PIDNamespaceLevel:" or so which would be 0 for the root namespace, and
> increased for each namespace nested in it. Then, processes could simply
> read that and be happy.
Not a chance. PIDNamespaceLevel is exposing an implementation
detail that may well change in the lifetime of a process. It is true
we don't have migration merged in the kernel yet but one of these days
I expect we will.
>> I further think there are a lot of reasonable checks for if a
>> kernel feature is supported in the current environment that I would
>> rather pursue over hacks based on the fact we are in a container.
>
> Well, believe me we have been trying to find nicer hooks than explicit
> checks for containers, but I am quite sure that at the end of the day
> you won't be able to go without it entirely.
And you have explicit information you are in a container at this point.
It looks like all that is left is Documentation of the conventions.
Eric