public inbox for cgroups@vger.kernel.org
From: Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
To: Alban Crequy <alban-973cpzSjLbNWk0Htik3J/w@public.gmane.org>
Cc: "Iago López Galeiras"
	<iago-973cpzSjLbNWk0Htik3J/w@public.gmane.org>,
	cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	"Linux Containers"
	<containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org>
Subject: Re: How we use cgroups in rkt
Date: Thu, 18 Jun 2015 14:40:24 +0000	[thread overview]
Message-ID: <20150618144024.GB18426@ubuntumail> (raw)
In-Reply-To: <CALdWxcsxWCyLyH9H+BrAYW4gY-oAyNrp_Rf726tFtbHPy2_M9Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

Quoting Alban Crequy (alban-973cpzSjLbNWk0Htik3J/w@public.gmane.org):
> On Wed, Jun 17, 2015 at 10:30 PM, Serge Hallyn <serge.hallyn@ubuntu.com> wrote:
> > Quoting Iago López Galeiras (iago-973cpzSjLbNWk0Htik3J/w@public.gmane.org):
> >> Hi everyone,
> >>
> >> We are working on rkt[1] and we want to ask for feedback about the way we use
> >> cgroups to implement isolation in containers. rkt uses systemd-nspawn internally
> >> so I guess the best way to start is explaining how this is handled in
> >> systemd-nspawn.
> >>
> >> The approach taken by nspawn is mounting the cgroup controllers read-only inside
> >> the container except the part that corresponds to it inside the systemd
> >> controller. It is done this way because allowing the container to modify the
> >> other controllers is considered unsafe[2].
> >>
> >> This is what the bind mounts look like:
> >>
> >> /sys/fs/cgroup/devices RO
> >> [...]
> >> /sys/fs/cgroup/memory RO
> >> /sys/fs/cgroup/systemd RO
> >> /sys/fs/cgroup/systemd/machine.slice/machine-a.scope RW
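The layout above could be produced with bind mounts along these lines. This is a hedged sketch, not nspawn's actual code (nspawn does this internally in C); the rootfs path and scope name are assumptions, and the commands are echoed rather than executed so the sketch runs unprivileged:

```shell
root=/var/lib/machines/a                # assumed container rootfs
scope=machine.slice/machine-a.scope     # the container's own scope

n=0
emit() { echo "$1"; n=$((n+1)); }       # print instead of running mount

# Every controller, including the systemd hierarchy, is bind-mounted
# into the container and then remounted read-only...
for ctrl in devices memory systemd; do
    emit "mount --bind /sys/fs/cgroup/$ctrl $root/sys/fs/cgroup/$ctrl"
    emit "mount -o remount,ro,bind $root/sys/fs/cgroup/$ctrl"
done
# ...except the container's own scope in the systemd hierarchy, which
# is bind-mounted again on top and left read-write.
emit "mount --bind /sys/fs/cgroup/systemd/$scope $root/sys/fs/cgroup/systemd/$scope"
```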
> >>
> >> In rkt we have a concept called pod[3] which is a list of apps that run inside a
> >> container, each running in its own chroot. To implement this concept, we start a
> >> systemd-nspawn container with a minimal systemd installation that starts each
> >> app as a service.
> >>
> >> We want to be able to apply different restrictions to each app of a pod using
> >> cgroups and the straightforward way we thought was delegating to systemd inside
> >> the container. Initially, this didn't work because, as mentioned earlier, the
> >> cgroup controllers are mounted read-only.
> >>
> >> The way we solved this problem was mounting the cgroup hierarchy (with the
> >> directories expected by systemd) outside the container. The difference with
> >> systemd-nspawn’s approach is that we don’t mount everything read-only; instead,
> >> we leave the knobs we need in each of the application’s subcgroups read-write.
> >>
> >> For example, if we want to restrict the memory usage of an application we leave
> >> /sys/fs/cgroup/memory/machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx/{memory.limit_in_bytes,cgroup.procs}
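The knob-level setup described above could be sketched as follows. This is an illustrative sketch, not rkt's actual code: it relies on the fact that bind-mounting a single file from the host's writable view on top of the read-only hierarchy leaves just that file writable. The rootfs path is an assumption, the xxxxx components stand in for the real pod/app IDs as in the mail, and commands are echoed so it runs unprivileged:

```shell
app=machine/machine.slice/machine-rkt-xxxxx/system.slice/sha512-xxxx
root=/var/lib/rkt/pods/run/xxxxx/rootfs   # assumed pod rootfs

n=0
emit() { echo "$1"; n=$((n+1)); }         # print instead of running mount

# The memory controller as a whole is read-only in the container...
emit "mount --bind /sys/fs/cgroup/memory $root/sys/fs/cgroup/memory"
emit "mount -o remount,ro,bind $root/sys/fs/cgroup/memory"
# ...but the two knobs for this app are bound back read-write on top.
for knob in memory.limit_in_bytes cgroup.procs; do
    emit "mount --bind /sys/fs/cgroup/memory/$app/$knob $root/sys/fs/cgroup/memory/$app/$knob"
done
```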
> >
> > Who exactly does the writing to those files?
> 
> First, rkt prepares a systemd .service file for each application in
> the container with "CPUQuota=" and "MemoryLimit=". The .service files
> are not used by systemd outside the container. Then, rkt uses
> systemd-nspawn to start systemd as pid 1 in the container. Finally,
> systemd inside the container writes to the cgroup files
> {memory.limit_in_bytes,cgroup.procs}.
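A per-app unit of the kind described might look like this (names and values are illustrative, not taken from rkt):

```ini
[Service]
CPUQuota=50%
MemoryLimit=512M
ExecStart=/opt/myapp/bin/myapp
```

On cgroup v1, systemd inside the container translates CPUQuota= and MemoryLimit= into writes to cpu.cfs_quota_us and memory.limit_in_bytes in the app's subcgroup.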
> 
> We call those limits the "per-app isolators". It's not a security
> boundary because all the apps run in the same container (in the same
> pid/mount/net namespaces). The apps run in different chroots, but
> that's easily escapable.
> 
> > Do the applications want to change them, or only rkt itself?
> 
> At the moment, the limits are statically defined in the app container
> image, so neither rkt nor the apps inside the container change them. I
> don't know of a use case where we would need to change them
> dynamically.
> 
> >  If rkt, then it seems like you should be
> > able to use a systemd api to update the values (over dbus), right?
> > systemctl set-property machine-a.scope MemoryLimit=1G or something.
> 
> In addition to the "per-app isolators" described above, rkt can have
> "pod-level isolators" that are applied on the machine slice (the
> cgroup parent directory) rather than at the leaves of the cgroup tree.
> They are defined when rkt itself is started by a systemd .service
> file, and applied by systemd outside of the container. E.g.
> 
> [Service]
> CPUShares=512
> MemoryLimit=1G
> ExecStart=/usr/bin/rkt run myapp.com/myapp-1.3.4
> 
> Updating the pod-level isolators with systemctl on the host should work.
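On the host, that could look like the following (the scope name is illustrative; the command is echoed so the sketch runs without systemd):

```shell
# Adjust pod-level isolators at runtime from the host; systemctl
# set-property writes the new limits into the scope's cgroup.
pod_unit=machine-rkt-xxxxx.scope      # assumed name of the pod's scope
cmd="systemctl set-property $pod_unit MemoryLimit=2G CPUShares=256"
echo "$cmd"
```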
> 
> But neither systemd inside the container nor the apps have access to
> the required cgroup knob files: they are mounted read-only.
> 
> > Now I'm pretty sure that systemd doesn't yet support being able to do
> > this from inside the container in a delegated way.
> 
> Indeed, by default nspawn/systemd does not support delegating that. It
> only works because rkt prepared the cgroup bind mounts for the
> container.
> 
> > That was cgmanager's
> > reason for being, and I'm interested in working on a proper API for that
> > for systemd.
> 
> Do you mean patching systemd so it does not write to the cgroup
> filesystem directly but talks to the cgmanager/cgproxy socket instead?

More likely, patch it so it can talk to a systemd-owned unix socket
bind-mounted into the container.  So systemd would need to be patched
at both ends.  But that's something I was hoping would happen upstream
anyway.  I would be very happy to help out in that effort.

-serge


Thread overview: 4+ messages
2015-06-17 11:09 How we use cgroups in rkt Iago López Galeiras
2015-06-17 20:30   ` Serge Hallyn
2015-06-18  8:57     ` Alban Crequy
2015-06-18 14:40         ` Serge Hallyn [this message]
