* cgroup management daemon
@ 2013-11-25 22:43 Serge E. Hallyn
From: Serge E. Hallyn @ 2013-11-25 22:43 UTC (permalink / raw)
To: Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin, Stéphane Graber, serge-A9i7LUbDfNHQT0dZR+AlfA
Hi,
As I've mentioned several times, I want to write a standalone cgroup
management daemon. The basic requirements are that it be a standalone
program; that a single instance running on the host be usable from
containers nested at any depth; that it not allow escaping one's
assigned limits; that it not allow subjugating tasks which do not
belong to you; and that, within your limits, you be able to parcel
those limits out to your tasks as you like.
Additionally, Tejun has specified that we do not want users to be
too closely tied to the cgroupfs implementation. Therefore
commands will be just a hair more general than specifying cgroupfs
filenames and values. I may go so far as to avoid specifying
specific controllers, as AFAIK there should be no redundancy in
features. On the other hand, I don't want to get too general.
So I'm basing the API loosely on the lmctfy command line API.
One of the driving goals is to enable nested lxc as simply and safely as
possible. If this project is a success, then a large chunk of code can
be removed from lxc. I'm considering this project a part of the larger
lxc project, but given how central it is to systems management, I won't
consider anyone else's needs as less important than our own.
This document consists of two parts. The first describes how I
intend the daemon (cgmanager) to be structured and how it will
enforce the safety requirements. The second describes the commands
which clients will be able to send to the manager. The list of
controller keys which can be set is very incomplete at this point,
serving mainly to show the approach I was thinking of taking.
Summary
Each 'host' (identified by a separate instance of the Linux kernel) will
have exactly one running daemon to manage control groups. This daemon
will answer cgroup management requests over a dbus socket, located at
/sys/fs/cgroup/manager. This socket can be bind-mounted into various
containers, so that one daemon can support the whole system.
Programs will be able to make cgroup requests using dbus calls, or
indirectly by linking against lmctfy which will be modified to use the
dbus calls if available.
Outline:
. A single manager, cgmanager, is started on the host, very early
during boot. It has very few dependencies, and requires only
/proc, /run, and /sys to be mounted, with /etc ro. It will mount
the cgroup hierarchies in a private namespace and set defaults
(clone_children, use_hierarchy, sane_behavior, release_agent?). It
will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
. A client (requestor 'r') can make cgroup requests over
/sys/fs/cgroup/manager using dbus calls. Detailed privilege
requirements for r are listed below.
. The client request will pertain to an existing or new cgroup A. r's
privilege over the cgroup must be checked. r is said to have
privilege over A if A is owned by r's uid, or if A's owner is mapped
into r's user namespace and r is root in that user namespace.
. The client request may pertain to a victim task v, which may be moved
to a new cgroup. In that case r's privilege over both the cgroup
and v must be checked. r is said to have privilege over v if v
is mapped in r's pid namespace, v's uid is mapped into r's user ns,
and r is root in its userns; or if r and v have the same uid
and v is mapped in r's pid namespace.
. r's credentials will be taken from the socket's peercred, ensuring that
pid and uid are translated.
. r passes PID(v) as an SCM_CREDENTIALS ancillary message, so that cgmanager
receives the translated global pid. It will then read UID(v) from
/proc/PID(v)/status, which is the global uid, and check /proc/PID(r)/uid_map
to see whether that uid is mapped there.
. dbus-send can be enhanced to send a pid as an SCM_CREDENTIALS message so
that the kernel translates it for the reader. Only 'move task v to cgroup
A' will require an SCM_CREDENTIALS message to be sent.
Privilege requirements by action:
* Requestor of an action (r) over a socket may only make
changes to cgroups over which it has privilege.
* Requestors may be limited to a certain #/depth of cgroups
(to limit memory usage) - DEFER?
* Cgroup hierarchy is responsible for resource limits
* A requestor must either be uid 0 in its userns with the victim mapped
into its userns, or have the same uid and be in the same/ancestor pidns
as the victim
* If r requests creation of cgroup '/x', /x will be interpreted
as relative to r's cgroup. r cannot make changes to cgroups not
under its own current cgroup.
* If r is not in the initial user_ns, then it may not change settings
in its own cgroup, only descendants. (Not strictly necessary -
we could require the use of extra cgroups when wanted, as lxc does
currently)
* If r requests creation of cgroup '/x', it must have write access
to its own cgroup (not strictly necessary)
* If r requests chown of cgroup /x to uid Y, Y is passed in a
ucred over the unix socket, and is therefore translated into the init
userns.
* if r requests setting a limit under /x, then
. either r must be root in its own userns, and UID(/x) be mapped
into its userns, or else UID(r) == UID(/x)
. /x must not be / (not strictly necessary, as long as all users know to
ensure an extra cgroup layer above '/')
. setns(UIDNS(r)) would not work, due to in-kernel capable() checks
which won't be satisfied. Therefore we'll need to do privilege
checks ourselves, then perform the write as the host root user
(see devices.allow/deny). Further, we need to support older kernels
which don't support setns for pid.
* If r requests an action on victim V, it passes V's pid in a ucred,
so that it gets translated.
The daemon will verify that V's uid is mapped into r's userns. Since
r is either root or has the same uid as V, it is allowed to classify.
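The uid_map check mentioned above (deciding whether a global uid is visible in r's user namespace) could look like the sketch below. The reading of /proc/PID(r)/uid_map itself is omitted; only the match over the map's text is shown, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Is global uid 'uid' mapped into the user namespace whose uid_map
 * content is 'map'?  Each uid_map line reads:
 *     <id-inside-ns> <id-outside-ns> <count>
 * so a global (outside) uid is mapped iff it falls in some line's
 * [outside, outside + count) range. */
static int uid_is_mapped(const char *map, unsigned long uid)
{
    unsigned long inside, outside, count;
    const char *p = map;

    while (sscanf(p, "%lu %lu %lu", &inside, &outside, &count) == 3) {
        if (uid >= outside && uid < outside + count)
            return 1;
        p = strchr(p, '\n');      /* advance to the next map line */
        if (!p)
            break;
        p++;
    }
    return 0;
}
```

For example, with the typical unprivileged-container map "0 100000 65536", host uid 100123 is mapped (it is container uid 123) while host uid 0 is not.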
The above addresses
* creating cgroups
* chowning cgroups
* setting cgroup limits
* moving tasks into cgroups
. but does not address a 'cgexec <group> -- command' type of behavior.
* To handle that (specifically for upstart), recommend that r do:
pid = fork();
if (!pid) {
        request_reclassify(cgroup, getpid());
        do_execve();
}
. alternatively, the daemon could, if the kernel is new enough, setns to
the requestor's namespaces to execute a command in a new cgroup.
The new command would be daemonized to that pid namespace's pid 1.
Types of requests:
* r requests creating cgroup A'/A
. lmctfy/cli/commands/create.cc
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Verify that UID(R) is mapped into r's userns
. Create R/A'/A
. chown R/A'/A to UID(r)
* r requests to move task x to cgroup A.
. lmctfy/cli/commands/enter.cc
. r must send PID(x) as an ancillary message
. Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
that userns
(is it safe to allow this if UID(x) == UID(r)?)
. R=cgroup_of(r)
. Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
. echo PID(x) >> /R/A/tasks
* r requests chown of cgroup A to uid X
. X is passed in an ancillary message, which:
* ensures it is valid in r's userns
* maps the userid to the host for us
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Chown R/A to X
* r requests setting cgroup A's 'property=value'
. Verify that either
* A != ''
* UID(r) == 0 on host
In other words, r in a userns may not set root cgroup settings.
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Set property=value for R/A
* Expect kernel to guarantee hierarchical constraints
* r requests deletion of cgroup A
. lmctfy/cli/commands/destroy.cc (without -f)
. same requirements as setting 'property=value'
* r requests purge of cgroup A
. lmctfy/cli/commands/destroy.cc (with -f)
. same requirements as setting 'property=value'
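Since '/x' is interpreted relative to r's cgroup and r may not touch cgroups outside its own subtree, every request type above starts by resolving the client-supplied name against R=cgroup_of(r). A minimal sketch of that resolution, with a hypothetical helper name and rejection of upward escapes:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Resolve client-supplied cgroup name 'A' relative to the requestor's
 * own cgroup 'R'.  Refuses a bare '/' (the requestor's root) and any
 * '..' component that could escape the requestor's subtree.
 * Returns 0 and fills 'out' on success, -1 on refusal/overflow. */
static int build_cgroup_path(const char *R, const char *A,
                             char *out, size_t outlen)
{
    const char *p;

    while (*A == '/')             /* '/x' is interpreted relative to R */
        A++;
    if (*A == '\0')               /* bare '/' (or empty name) refused */
        return -1;

    for (p = A; *p; ) {           /* scan each '/'-separated component */
        const char *slash = strchr(p, '/');
        size_t len = slash ? (size_t)(slash - p) : strlen(p);
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return -1;            /* no escaping upward */
        p += len;
        while (*p == '/')
            p++;
    }

    if (snprintf(out, outlen, "%s/%s", R, A) >= (int)outlen)
        return -1;
    return 0;
}
```

So a requestor in /user/c1 asking for '/x' or 'x' gets /user/c1/x, while '../sibling' is rejected outright.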
Long-term we will want the cgroup manager to become more intelligent -
to place its own limits on clients, to address cpu and device hotplug,
etc. Since we will not be doing that in the first prototype, the daemon
will not keep any state about the clients.
Client DBus Message API
<name>: a-zA-Z0-9
<name>: "a-zA-Z0-9 "
<controllerlist>: <controller1>[:controllerlist]
<valueentry>: key:value
<valueentry>: frozen
<valueentry>: thawed
<values>: valueentry[:values]
keys:
{memory,swap}.{limit,soft_limit}
cpus_allowed # set of allowed cpus
cpus_fraction # % of allowed cpus
cpus_number # number of allowed cpus
cpu_share_percent # percent of cpushare
devices_whitelist
devices_blacklist
net_prio_index
net_prio_interface_map
net_classid
hugetlb_limit
blkio_weight
blkio_weight_device
blkio_throttle_{read,write}
readkeys:
devices_list
{memory,swap}.{failcnt,max_usage,limit,numa_stat}
hugetlb_max_usage
hugetlb_usage
hugetlb_failcnt
cpuacct_stat
<etc>
Commands:
ListControllers
Create <name> <controllerlist> <values>
Setvalue <name> <values>
Getvalue <name> <readkeys>
ListChildren <name>
ListTasks <name>
ListControllers <name>
Chown <name> <uid>
Chown <name> <uid>:<gid>
Move <pid> <name> [[ pid is sent as an SCM_CREDENTIALS message ]]
Delete <name>
Delete-force <name>
Kill <name>
* Re: [lxc-devel] cgroup management daemon
From: Marian Marinov @ 2013-11-26 0:03 UTC (permalink / raw)
To: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin, Stéphane Graber
On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
> [...]
>
I really like the idea, but I have a few comments.
I'm not familiar with dbus, but how will you identify a request made over dbus?
I mean, will you get its pid? What if the container has its own PID namespace; how will this be handled?
I know this may sound a bit radical, but I propose that the daemon use simple unix sockets.
The daemon should have an easy way of adding more sockets for newly started containers, and each newly created socket
should know the base cgroup to which it belongs. This way the daemon can clearly identify which request is limited to
which cgroup without many lookups, and it will be easier to enforce the above-mentioned restrictions.
Marian
* Re: cgroup management daemon
From: Stéphane Graber @ 2013-11-26 0:11 UTC (permalink / raw)
To: Marian Marinov
Cc: Tim Hockin, Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
On Tue, Nov 26, 2013 at 02:03:16AM +0200, Marian Marinov wrote:
> On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
> > [...]
>
> I really like the idea, but I have a few comments.
> I'm not familiar with dbus, but how will you identify a request made over dbus?
> I mean, will you get its pid? What if the container has its own PID namespace; how will this be handled?
DBus is essentially just an IPC protocol that can be used over a variety
of media.
In the case of this cgroup manager, we'll be using the DBus protocol on
top of a standard unix socket. One of the properties of unix sockets is
that you can get the uid, gid and pid of your peer. As this information
is provided by the kernel, it'll automatically be translated to match
your view of the pid and user trees.
That's why we're also planning on abusing SCM_CRED a tiny bit, so that
when a container or sub-container asks for a pid to be moved into a
cgroup, instead of passing that pid as a standard integer over dbus,
it'll use the SCM_CRED mechanism, sending a ucred structure which will
then get magically mapped to the right namespace when accessed by the
manager, saving us a whole lot of pid/uid mapping logic in the process.
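That peer-credential lookup is standard SO_PEERCRED, nothing cgmanager-specific; a minimal sketch with a hypothetical helper name:

```c
#define _GNU_SOURCE          /* for struct ucred */
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fetch the kernel-provided pid/uid/gid of the peer on a connected
 * unix socket.  The values are translated by the kernel into the
 * caller's pid and user namespaces. */
static int get_peer_cred(int fd, struct ucred *out)
{
    socklen_t len = sizeof(*out);
    return getsockopt(fd, SOL_SOCKET, SO_PEERCRED, out, &len);
}
```

On a socketpair within one process, the returned pid/uid are simply the caller's own; across a container boundary they arrive already mapped into the manager's view.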
>
> I know that this may sound a bit radical, but I propose that the daemon use simple unix sockets.
> The daemon should have an easy way of adding more sockets for newly started containers, and each newly created socket
> should know the base cgroup to which it belongs. This way the daemon can clearly identify which request is limited to
> which cgroup without many lookups, and it will be easier to enforce the above-mentioned restrictions.
So it looks like our current design already follows your recommendation,
since we're indeed using a standard unix socket; it's just that instead
of re-inventing the wheel, we use a standard IPC protocol on top of it.
>
> Marian
>
> ------------------------------------------------------------------------------
> Shape the Mobile Experience: Free Subscription
> Software experts and developers: Be at the forefront of tech innovation.
> Intel(R) Software Adrenaline delivers strategic insight and game-changing
> conversations that shape the rapidly evolving mobile landscape. Sign up now.
> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> _______________________________________________
> Lxc-devel mailing list
> Lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/lxc-devel
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
* Re: [lxc-devel] cgroup management daemon
2013-11-26 0:11 ` Stéphane Graber
@ 2013-11-26 1:35 ` Marian Marinov
[not found] ` <5293FADA.8070901-NV7Lj0SOnH0@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Marian Marinov @ 2013-11-26 1:35 UTC (permalink / raw)
To: Stéphane Graber
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin
On 11/26/2013 02:11 AM, Stéphane Graber wrote:
> On Tue, Nov 26, 2013 at 02:03:16AM +0200, Marian Marinov wrote:
>> On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
>>> Hi,
>>>
>>> as I've mentioned several times, I want to write a standalone cgroup
>>> management daemon. Basic requirements are that it be a standalone
>>> program; that a single instance running on the host be usable from
>>> containers nested at any depth; that it not allow escaping one's
>>> assigned limits; that it not allow subjugating tasks which do not
>>> belong to you; and that, within your limits, you be able to parcel
>>> those limits to your tasks as you like.
>>>
>>> Additionally, Tejun has specified that we do not want users to be
>>> too closely tied to the cgroupfs implementation. Therefore
>>> commands will be just a hair more general than specifying cgroupfs
>>> filenames and values. I may go so far as to avoid specifying
>>> specific controllers, as AFAIK there should be no redundancy in
>>> features. On the other hand, I don't want to get too general.
>>> So I'm basing the API loosely on the lmctfy command line API.
>>>
>>> One of the driving goals is to enable nested lxc as simply and safely as
>>> possible. If this project is a success, then a large chunk of code can
>>> be removed from lxc. I'm considering this project a part of the larger
>>> lxc project, but given how central it is to systems management that
>>> doesn't mean that I'll consider anyone else's needs as less important
>>> than our own.
>>>
>>> This document consists of two parts. The first describes how I
>>> intend the daemon (cgmanager) to be structured and how it will
>>> enforce the safety requirements. The second describes the commands
>>> which clients will be able to send to the manager. The list of
>>> controller keys which can be set is very incomplete at this point,
>>> serving mainly to show the approach I was thinking of taking.
>>>
>>> Summary
>>>
>>> Each 'host' (identified by a separate instance of the Linux kernel) will
>>> have exactly one running daemon to manage control groups. This daemon
>>> will answer cgroup management requests over a dbus socket, located at
>>> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>>> containers, so that one daemon can support the whole system.
>>>
>>> Programs will be able to make cgroup requests using dbus calls, or
>>> indirectly by linking against lmctfy which will be modified to use the
>>> dbus calls if available.
>>>
>>> Outline:
>>> . A single manager, cgmanager, is started on the host, very early
>>> during boot. It has very few dependencies, and requires only
>>> /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>>> the cgroup hierarchies in a private namespace and set defaults
>>> (clone_children, use_hierarchy, sane_behavior, release_agent?) It
>>> will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>> . A client (requestor 'r') can make cgroup requests over
>>> /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>>> requirements for r are listed below.
>>> . The client request will pertain to an existing or new cgroup A. r's
>>> privilege over the cgroup must be checked. r is said to have
>>> privilege over A if A is owned by r's uid, or if A's owner is mapped
>>> into r's user namespace, and r is root in that user namespace.
>>> . The client request may pertain to a victim task v, which may be moved
>>> to a new cgroup. In that case r's privilege over both the cgroup
>>> and v must be checked. r is said to have privilege over v if v
>>> is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>>> and r is root in its userns. Or if r and v have the same uid
>>> and v is mapped in r's pid namespace.
>>> . r's credentials will be taken from socket's peercred, ensuring that
>>> pid and uid are translated.
>>> . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
>>> translated global pid. It will then read UID(v) from /proc/PID(v)/status,
>>> which is the global uid, and check /proc/PID(r)/uid_map to see whether
>>> UID is mapped there.
>>> . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>>> the kernel translate it for the reader. Only 'move task v to cgroup
>>> A' will require a SCM_CREDENTIAL to be sent.
>>>
>>> Privilege requirements by action:
>>> * Requestor of an action (r) over a socket may only make
>>> changes to cgroups over which it has privilege.
>>> * Requestors may be limited to a certain #/depth of cgroups
>>> (to limit memory usage) - DEFER?
>>> * Cgroup hierarchy is responsible for resource limits
>>> * A requestor must either be uid 0 in its userns with victim mapped
>>> into its userns, or the same uid and in same/ancestor pidns as the
>>> victim
>>> * If r requests creation of cgroup '/x', /x will be interpreted
>>> as relative to r's cgroup. r cannot make changes to cgroups not
>>> under its own current cgroup.
>>> * If r is not in the initial user_ns, then it may not change settings
>>> in its own cgroup, only descendants. (Not strictly necessary -
>>> we could require the use of extra cgroups when wanted, as lxc does
>>> currently)
>>> * If r requests creation of cgroup '/x', it must have write access
>>> to its own cgroup (not strictly necessary)
>>> * If r requests chown of cgroup /x to uid Y, Y is passed in a
>>> ucred over the unix socket, and therefore translated to init
>>> userns.
>>> * if r requests setting a limit under /x, then
>>> . either r must be root in its own userns, and UID(/x) be mapped
>>> into its userns, or else UID(r) == UID(/x)
>>> . /x must not be / (not strictly necessary, all users know to
>>> ensure an extra cgroup layer above '/')
>>> . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>>> which won't be satisfied. Therefore we'll need to do privilege
>>> checks ourselves, then perform the write as the host root user.
>>> (see devices.allow/deny). Further we need to support older kernels
>>> which don't support setns for pid.
>>> * If r requests action on victim V, it passes V's pid in a ucred,
>>> so that gets translated.
>>> Daemon will verify that V's uid is mapped into r's userns. Since
>>> r is either root or the same uid as V, it is allowed to classify.
>>>
>>> The above addresses
>>> * creating cgroups
>>> * chowning cgroups
>>> * setting cgroup limits
>>> * moving tasks into cgroups
>>> . but does not address a 'cgexec <group> -- command' type of behavior.
>>> * To handle that (specifically for upstart), recommend that r do:
>>> pid = fork();
>>> if (!pid) {
>>> request_reclassify(cgroup, getpid());
>>> do_execve();
>>> }
>>> . alternatively, the daemon could, if kernel is new enough, setns to
>>> the requestor's namespaces to execute a command in a new cgroup.
>>> The new command would be daemonized to that pid namespace's pid 1.
>>>
>>> Types of requests:
>>> * r requests creating cgroup A'/A
>>> . lmctfy/cli/commands/create.cc
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Verify that UID(R) is mapped into r's userns
>>> . Create R/A'/A
>>> . chown R/A'/A to UID(r)
>>> * r requests to move task x to cgroup A.
>>> . lmctfy/cli/commands/enter.cc
>>> . r must send PID(x) as ancillary message
>>> . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>>> that userns
>>> (is it safe to allow if UID(x) == UID(r))?
>>> . R=cgroup_of(r)
>>> . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
>>> . echo PID(x) >> /R/A/tasks
>>> * r requests chown of cgroup A to uid X
>>> . X is passed in ancillary message
>>> * ensures it is valid in r's userns
>>> * maps the userid to host for us
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Chown R/A to X
>>> * r requests cgroup A's 'property=value'
>>> . Verify that either
>>> * A != ''
>>> * UID(r) == 0 on host
>>> In other words, r in a userns may not set root cgroup settings.
>>> . Verify that UID(r) mapped to 0 in r's userns
>>> . R=cgroup_of(r)
>>> . Set property=value for R/A
>>> * Expect kernel to guarantee hierarchical constraints
>>> * r requests deletion of cgroup A
>>> . lmctfy/cli/commands/destroy.cc (without -f)
>>> . same requirements as setting 'property=value'
>>> * r requests purge of cgroup A
>>> . lmctfy/cli/commands/destroy.cc (with -f)
>>> . same requirements as setting 'property=value'
>>>
>>> Long-term we will want the cgroup manager to become more intelligent -
>>> to place its own limits on clients, to address cpu and device hotplug,
>>> etc. Since we will not be doing that in the first prototype, the daemon
>>> will not keep any state about the clients.
>>>
>>> Client DBus Message API
>>>
>>> <name>: a-zA-Z0-9
>>> <name>: "a-zA-Z0-9 "
>>> <controllerlist>: <controller1>[:controllerlist]
>>> <valueentry>: key:value
>>> <valueentry>: frozen
>>> <valueentry>: thawed
>>> <values>: valueentry[:values]
>>> keys:
>>> {memory,swap}.{limit,soft_limit}
>>> cpus_allowed # set of allowed cpus
>>> cpus_fraction # % of allowed cpus
>>> cpus_number # number of allowed cpus
>>> cpu_share_percent # percent of cpushare
>>> devices_whitelist
>>> devices_blacklist
>>> net_prio_index
>>> net_prio_interface_map
>>> net_classid
>>> hugetlb_limit
>>> blkio_weight
>>> blkio_weight_device
>>> blkio_throttle_{read,write}
>>> readkeys:
>>> devices_list
>>> {memory,swap}.{failcnt,max_use,limitnuma_stat}
>>> hugetlb_max_usage
>>> hugetlb_usage
>>> hugetlb_failcnt
>>> cpuacct_stat
>>> <etc>
>>> Commands:
>>> ListControllers
>>> Create <name> <controllerlist> <values>
>>> Setvalue <name> <values>
>>> Getvalue <name> <readkeys>
>>> ListChildren <name>
>>> ListTasks <name>
>>> ListControllers <name>
>>> Chown <name> <uid>
>>> Chown <name> <uid>:<gid>
>>> Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
>>> Delete <name>
>>> Delete-force <name>
>>> Kill <name>
>>>
>>
> [Marian's questions and Stéphane's explanation of DBus over a unix socket with SCM_CRED snipped; quoted in full above]
Thanks, I was thinking of exactly that SCM_CRED mechanism :)
I was unaware that it could be combined with the DBus protocol, which is why I commented.
Is there any particular language that you want this project started in? I know that most of LXC is C, but I
don't see any guidelines for or against using other languages.
Marian
>
>>
>> Marian
>>
>
* Re: [lxc-devel] cgroup management daemon
[not found] ` <5293FADA.8070901-NV7Lj0SOnH0@public.gmane.org>
@ 2013-11-26 1:46 ` Stéphane Graber
0 siblings, 0 replies; 39+ messages in thread
From: Stéphane Graber @ 2013-11-26 1:46 UTC (permalink / raw)
To: Marian Marinov
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin
On Tue, Nov 26, 2013 at 03:35:22AM +0200, Marian Marinov wrote:
> On 11/26/2013 02:11 AM, Stéphane Graber wrote:
> > [full proposal and earlier DBus/SCM_CRED discussion quoted in the previous messages; snipped]
>
> Thanks, I was thinking of exactly that SCM_CRED mechanism :)
> I was unaware that it could be combined with the DBus protocol, which is why I commented.
>
> Is there any particular language that you want this project started
> in? I know that most of LXC is C, but I don't see any
> guidelines for or against using other languages.
>
> Marian
LXC itself is currently written mostly in C, with some shell scripts and
an even smaller number of scripts in python3 and lua.
For the cgroup manager, I think the assumption was that we'd do it in C,
as it's going to be a long-lasting daemon that should keep a very low
memory and CPU footprint and have as few dependencies as possible, so
that any distro shipping it can start it extremely early in boot (for
distros that care about this, the manager and all its dependencies
should reside in / and not use anything from /usr).
The advantage of using something very close to standard DBus (with
SCM_CRED being the only odd bit we'd add) is that it'll be trivial to
talk to the daemon in a large variety of languages by simply using the
DBus API and the introspection it offers.
That being said, I suspect the initial users of that API will all be in C
(with LXC being the obvious first one).
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
* Re: cgroup management daemon
[not found] ` <20131125224335.GA15481-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 0:03 ` [lxc-devel] " Marian Marinov
@ 2013-11-26 2:18 ` Michael H. Warfield
[not found] ` <1385432284.8590.52.camel-s3/A7Nnwjkf10ug9Blv0m0EOCMrvLtNR@public.gmane.org>
2013-11-26 4:58 ` Tim Hockin
2013-12-03 13:45 ` Tejun Heo
3 siblings, 1 reply; 39+ messages in thread
From: Michael H. Warfield @ 2013-11-26 2:18 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Stéphane Graber, mhw-BetbSzk+GohWk0Htik3J/w, Tim Hockin,
Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
Serge...
You have no idea how much I dread mentioning this (well, after
LinuxPlumbers, maybe you can) but... You do realize that some of this
is EXACTLY what the systemd crowd was talking about there in NOLA back
then. I sat in those sessions grinding my teeth and listening to
comments from some others around me about when systemd might subsume
bash or even vi or quake.
Somehow, you and others have tagged me as a "systemd expert" but I am
far from it and even you noted that Lennart and I were on the edge of a
physical discussion when I made some "off the cuff" remarks there about
systemd design during my talk. I personally rank systemd in the same
category as NetworkMangler (err, NetworkManager) in its propensity for
committing inexplicable random acts of terrorism and changing its
behavior from release to release to release. I'm not a fan and I'm not
an expert, but I have to be involved with it and watch the damned thing
like a trapped rat, like it or not.
Like it or not, we can not go off on divergent designs. As much as they
have delusions of taking over the Linux world, they are still going to
be a major factor and this sort of thing needs to be coordinated. We
are going to need exactly what you are proposing whether we have systemd
in play or not. IF we CAN kick it to the curb, when we need to, we
still need to know how to without tearing shit up and breaking shit that
thinks it's there. Ideally, it shouldn't matter whether systemd were in
play or not.
All I ask is that we not get too far off track that we have a major
architectural divergence here. The risk is there.
Mike
On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
> Hi,
>
> as i've mentioned several times, I want to write a standalone cgroup
> management daemon. Basic requirements are that it be a standalone
> program; that a single instance running on the host be usable from
> containers nested at any depth; that it not allow escaping ones
> assigned limits; that it not allow subjegating tasks which do not
> belong to you; and that, within your limits, you be able to parcel
> those limits to your tasks as you like.
>
> Additionally, Tejun has specified that we do not want users to be
> too closely tied to the cgroupfs implementation. Therefore
> commands will be just a hair more general than specifying cgroupfs
> filenames and values. I may go so far as to avoid specifying
> specific controllers, as AFAIK there should be no redundancy in
> features. On the other hand, I don't want to get too general.
> So I'm basing the API loosely on the lmctfy command line API.
>
> One of the driving goals is to enable nested lxc as simply and safely as
> possible. If this project is a success, then a large chunk of code can
> be removed from lxc. I'm considering this project a part of the larger
> lxc project, but given how central it is to systems management that
> doesn't mean that I'll consider anyone else's needs as less important
> than our own.
>
> This document consists of two parts. The first describes how I
> intend the daemon (cgmanager) to be structured and how it will
> enforce the safety requirements. The second describes the commands
> which clients will be able to send to the manager. The list of
> controller keys which can be set is very incomplete at this point,
> serving mainly to show the approach I was thinking of taking.
>
> Summary
>
> Each 'host' (identified by a separate instance of the Linux kernel) will
> have exactly one running daemon to manage control groups. This daemon
> will answer cgroup management requests over a dbus socket, located at
> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
> containers, so that one daemon can support the whole system.
>
> Programs will be able to make cgroup requests using dbus calls, or
> indirectly by linking against lmctfy which will be modified to use the
> dbus calls if available.
>
> Outline:
> . A single manager, cgmanager, is started on the host, very early
> during boot. It has very few dependencies, and requires only
> /proc, /run, and /sys to be mounted, with /etc ro. It will mount
> the cgroup hierarchies in a private namespace and set defaults
> (clone_children, use_hierarchy, sane_behavior, release_agent?). It
> will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
> . A client (requestor 'r') can make cgroup requests over
> /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> requirements for r are listed below.
> . The client request will pertain to an existing or new cgroup A. r's
> privilege over the cgroup must be checked. r is said to have
> privilege over A if A is owned by r's uid, or if A's owner is mapped
> into r's user namespace, and r is root in that user namespace.
> . The client request may pertain to a victim task v, which may be moved
> to a new cgroup. In that case r's privilege over both the cgroup
> and v must be checked. r is said to have privilege over v if v
> is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> and r is root in its userns. Or if r and v have the same uid
> and v is mapped in r's pid namespace.
> . r's credentials will be taken from the socket's peercred, ensuring that
> pid and uid are translated.
> . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> translated global pid. It will then read UID(v) from /proc/PID(v)/status,
> which is the global uid, and check /proc/PID(r)/uid_map to see whether
> UID is mapped there.
> . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> the kernel translate it for the reader. Only 'move task v to cgroup
> A' will require a SCM_CREDENTIAL to be sent.
>
> Privilege requirements by action:
> * Requestor of an action (r) over a socket may only make
> changes to cgroups over which it has privilege.
> * Requestors may be limited to a certain #/depth of cgroups
> (to limit memory usage) - DEFER?
> * Cgroup hierarchy is responsible for resource limits
> * A requestor must either be uid 0 in its userns with victim mapped
> into its userns, or the same uid and in the same/ancestor pidns as the
> victim
> * If r requests creation of cgroup '/x', /x will be interpreted
> as relative to r's cgroup. r cannot make changes to cgroups not
> under its own current cgroup.
> * If r is not in the initial user_ns, then it may not change settings
> in its own cgroup, only descendants. (Not strictly necessary -
> we could require the use of extra cgroups when wanted, as lxc does
> currently)
> * If r requests creation of cgroup '/x', it must have write access
> to its own cgroup (not strictly necessary)
> * If r requests chown of cgroup /x to uid Y, Y is passed in a
> ucred over the unix socket, and therefore translated to init
> userns.
> * if r requests setting a limit under /x, then
> . either r must be root in its own userns, and UID(/x) be mapped
> into its userns, or else UID(r) == UID(/x)
> . /x must not be / (not strictly necessary, all users know to
> ensure an extra cgroup layer above '/')
> . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
> which won't be satisfied. Therefore we'll need to do privilege
> checks ourselves, then perform the write as the host root user.
> (see devices.allow/deny). Further we need to support older kernels
> which don't support setns for pid.
> * If r requests action on victim V, it passes V's pid in a ucred,
> so that gets translated.
> Daemon will verify that V's uid is mapped into r's userns. Since
> r is either root or the same uid as V, it is allowed to classify.
>
> The above addresses
> * creating cgroups
> * chowning cgroups
> * setting cgroup limits
> * moving tasks into cgroups
> . but does not address a 'cgexec <group> -- command' type of behavior.
> * To handle that (specifically for upstart), recommend that r do:
> pid = fork();
> if (pid == 0) { /* child: join the target cgroup, then exec */
> request_reclassify(cgroup, getpid());
> do_execve();
> }
> . alternatively, the daemon could, if kernel is new enough, setns to
> the requestor's namespaces to execute a command in a new cgroup.
> The new command would be daemonized to that pid namespace's pid 1.
>
> Types of requests:
> * r requests creating cgroup A'/A
> . lmctfy/cli/commands/create.cc
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Verify that UID(R) is mapped into r's userns
> . Create R/A'/A
> . chown R/A'/A to UID(r)
> * r requests to move task x to cgroup A.
> . lmctfy/cli/commands/enter.cc
> . r must send PID(x) as ancillary message
> . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
> that userns
> (is it safe to allow if UID(x) == UID(r))?
> . R=cgroup_of(r)
> . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
> . echo PID(x) >> /R/A/tasks
> * r requests chown of cgroup A to uid X
> . X is passed in ancillary message
> * ensures it is valid in r's userns
> * maps the userid to host for us
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Chown R/A to X
> * r requests cgroup A's 'property=value'
> . Verify that either
> * A != ''
> * UID(r) == 0 on host
> In other words, r in a userns may not set root cgroup settings.
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Set property=value for R/A
> * Expect kernel to guarantee hierarchical constraints
> * r requests deletion of cgroup A
> . lmctfy/cli/commands/destroy.cc (without -f)
> . same requirements as setting 'property=value'
> * r requests purge of cgroup A
> . lmctfy/cli/commands/destroy.cc (with -f)
> . same requirements as setting 'property=value'
>
> Long-term we will want the cgroup manager to become more intelligent -
> to place its own limits on clients, to address cpu and device hotplug,
> etc. Since we will not be doing that in the first prototype, the daemon
> will not keep any state about the clients.
>
> Client DBus Message API
>
> <name>: a-zA-Z0-9
> <name>: "a-zA-Z0-9 "
> <controllerlist>: <controller1>[:controllerlist]
> <valueentry>: key:value
> <valueentry>: frozen
> <valueentry>: thawed
> <values>: valueentry[:values]
> keys:
> {memory,swap}.{limit,soft_limit}
> cpus_allowed # set of allowed cpus
> cpus_fraction # % of allowed cpus
> cpus_number # number of allowed cpus
> cpu_share_percent # percent of cpushare
> devices_whitelist
> devices_blacklist
> net_prio_index
> net_prio_interface_map
> net_classid
> hugetlb_limit
> blkio_weight
> blkio_weight_device
> blkio_throttle_{read,write}
> readkeys:
> devices_list
> {memory,swap}.{failcnt,max_use,limit}
> numa_stat
> hugetlb_max_usage
> hugetlb_usage
> hugetlb_failcnt
> cpuacct_stat
> <etc>
> Commands:
> ListControllers
> Create <name> <controllerlist> <values>
> Setvalue <name> <values>
> Getvalue <name> <readkeys>
> ListChildren <name>
> ListTasks <name>
> ListControllers <name>
> Chown <name> <uid>
> Chown <name> <uid>:<gid>
> Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
> Delete <name>
> Delete-force <name>
> Kill <name>
>
> ------------------------------------------------------------------------------
> Shape the Mobile Experience: Free Subscription
> Software experts and developers: Be at the forefront of tech innovation.
> Intel(R) Software Adrenaline delivers strategic insight and game-changing
> conversations that shape the rapidly evolving mobile landscape. Sign up now.
> http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> _______________________________________________
> Lxc-devel mailing list
> Lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/lxc-devel
>
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <1385432284.8590.52.camel-s3/A7Nnwjkf10ug9Blv0m0EOCMrvLtNR@public.gmane.org>
@ 2013-11-26 2:43 ` Stéphane Graber
2013-11-26 2:55 ` [lxc-devel] " Michael H. Warfield
2013-11-26 4:52 ` Tim Hockin
1 sibling, 1 reply; 39+ messages in thread
From: Stéphane Graber @ 2013-11-26 2:43 UTC (permalink / raw)
To: Michael H. Warfield
Cc: Tim Hockin, Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
Haha,
I was wondering how long it'd take before we got the first comment about
systemd's own cgroup manager :)
To try and keep this short, there are a lot of cases where systemd's
plan of having an in-pid1 manager, as practical as it is for them, just
isn't going to work for us.
I believe our design makes things a bit cleaner by not tying it to
any specific init system or feature, and by offering a relatively
low-level, very simple API that people can use as a building block for
anything that wants to manage cgroups.
At this point in time, there's no hard limitation against having more
than one process write to the cgroup hierarchy, much as some people may
want that to change. I very much doubt it'll happen any time soon, and
until then, even if not perfectly adequate, there won't be any problem
running both systemd's manager and our own.
There's also the possibility, if someone felt sufficiently strongly
about this to contribute patches, of having our manager talk to
systemd's if present and go through their manager instead of accessing
cgroupfs itself. That's assuming systemd offers a sufficiently low-level
API that could be used for that without bringing an unreasonable amount
of dependencies into our code.
I don't want this thread to turn into some kind of flamewar or similarly
overheated discussion about systemd vs everyone else, so I'll just state
that from my point of view (and I suspect that of the group who worked
on this early draft), systemd's manager, while perfect for grouping and
resource allocation for systemd units and user sessions, doesn't quite
fit our bill with regard to supporting multiple levels of fully
distro-agnostic containers using nesting and mixed user namespaces.
It also has what I, as a non-systemd person, consider a big drawback:
being built into an init system which quite a few major distributions
don't use (specifically, those distros that account for the majority of
LXC's users).
I think there's room for two implementations and competition (even if we
have slightly different goals) is a good thing and will undoubtedly help
both projects consider use cases they didn't think of, leading to a better
solution for everyone. And if some day one of the two wins or we can
somehow converge into a solution that works for everyone, that'd be
great. But our discussions at Linux Plumbers and other conferences have
shown that this isn't going to happen now, so it's best to stop arguing
and instead get some stuff done.
On Mon, Nov 25, 2013 at 09:18:04PM -0500, Michael H. Warfield wrote:
> Serge...
>
> [... rest of quoted message (Mike's remarks and the full proposal, both reproduced above) trimmed ...]
--
Stéphane Graber
Ubuntu developer
http://www.ubuntu.com
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [lxc-devel] cgroup management daemon
2013-11-26 2:43 ` Stéphane Graber
@ 2013-11-26 2:55 ` Michael H. Warfield
0 siblings, 0 replies; 39+ messages in thread
From: Michael H. Warfield @ 2013-11-26 2:55 UTC (permalink / raw)
To: Stéphane Graber
Cc: mhw-BetbSzk+GohWk0Htik3J/w, Tim Hockin, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
On Mon, 2013-11-25 at 21:43 -0500, Stéphane Graber wrote:
> Haha,
>
> I was wondering how long it'd take before we got the first comment about
> systemd's own cgroup manager :)
>
> To try and keep this short, there are a lot of cases where systemd's
> plan of having an in-pid1 manager, as practical as it's for them, just
> isn't going to work for us.
>
> I believe our design makes things a bit cleaner by not having it tied to
> any specific init system or feature and have a relatively low level,
> very simple API that people can use as a building block for anything
> that wants to manage cgroups.
>
> At this point in time, there's no hard limitation for having one or more
> processes writing to the cgroup hierarchy, as much as some people may
> want this to change. I very much doubt it'll happen any time soon and
> until then, even if not perfectly adequate, there won't be any problem
> running both systemd's manager and our own.
>
> There's also the possibility if someone felt sufficiently strongly about
> this to contribute patches, to have our manager talk to systemd's if
> present and go through their manager instead of accessing cgroupfs
> itself. That's assuming systemd offers a sufficiently low level API that
> could be used for that without bringing an unreasonable amount of
> dependencies to our code.
>
>
> I don't want this thread to turn into some kind of flamewar or similarly
> overheated discussion about systemd vs everyone else, so I'll just state
> that from my point of view (and I suspect that of the group who worked
> on this early draft), systemd's manager while perfect for grouping and
> resource allocation for systemd units and user sessions doesn't quite
> fit our bill with regard to supporting multiple level of full
> distro-agnostic containers using nesting and mixing user namespaces.
> It also has what as a non-systemd person I consider a big drawback of
> being built into an init system which quite a few major distributions
> don't use (specifically those distros that account for the majority of
> LXC's users).
>
> I think there's room for two implementations and competition (even if we
> have slightly different goals) is a good thing and will undoubtedly help
> both project consider use cases they didn't think of leading to a better
> solution for everyone. And if some day one of the two wins or we can
> somehow converge into a solution that works for everyone, that'd be
> great. But our discussions at Linux Plumbers and other conferences have
> shown that this isn't going to happen now, so it's best to stop arguing
> and instead get some stuff done.
Concur. And, as you know, I'm not a fan or supporter of that camp. I
just want to make sure everyone is aware of all the gorillas in the room
before the fecal flakes hit the rapidly whirling blades.
That being said, I think this is a laudable goal. If we do it right, it
can well become the standard they have to adhere to.
Regards,
Mike
> On Mon, Nov 25, 2013 at 09:18:04PM -0500, Michael H. Warfield wrote:
> > Serge...
> >
> > You have no idea how much I dread mentioning this (well, after
> > LinuxPlumbers, maybe you can) but... You do realize that some of this
> > is EXACTLY what the systemd crowd was talking about there in NOLA back
> > then. I sat in those session grinding my teeth and listening to
> > comments from some others around me about when systemd might subsume
> > bash or even vi or quake.
> >
> > Somehow, you and others have tagged me as a "systemd expert" but I am
> > far from it and even you noted that Lennart and I were on the edge of a
> > physical discussion when I made some "off the cuff" remarks there about
> > systemd design during my talk. I personally rank systemd in the same
> > category as NetworkMangler (err, NetworkManager) in its propensity for
> > committing inexplicable random acts of terrorism and changing its
> > behavior from release to release to release. I'm not a fan and I'm not
> > an expert, but I have to be involved with it and watch the damned thing
> > like a trapped rat, like it or not.
> >
> > Like it or not, we cannot go off on divergent designs. As much as they
> > have delusions of taking over the Linux world, they are still going to
> > be a major factor and this sort of thing needs to be coordinated. We
> > are going to need exactly what you are proposing whether we have systemd
> > in play or not. IF we CAN kick it to the curb, when we need to, we
> > still need to know how to without tearing shit up and breaking shit that
> > thinks it's there. Ideally, it shouldn't matter if systemd were in
> > play or not.
> >
> > All I ask is that we not get so far off track that we end up with a
> > major architectural divergence here. The risk is there.
> >
> > Mike
> >
> >
> > On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
> > > Hi,
> > >
> > > as I've mentioned several times, I want to write a standalone cgroup
> > > management daemon. Basic requirements are that it be a standalone
> > > program; that a single instance running on the host be usable from
> > > containers nested at any depth; that it not allow escaping one's
> > > assigned limits; that it not allow subjugating tasks which do not
> > > belong to you; and that, within your limits, you be able to parcel
> > > those limits to your tasks as you like.
> > >
> > > Additionally, Tejun has specified that we do not want users to be
> > > too closely tied to the cgroupfs implementation. Therefore
> > > commands will be just a hair more general than specifying cgroupfs
> > > filenames and values. I may go so far as to avoid specifying
> > > specific controllers, as AFAIK there should be no redundancy in
> > > features. On the other hand, I don't want to get too general.
> > > So I'm basing the API loosely on the lmctfy command line API.
> > >
> > > One of the driving goals is to enable nested lxc as simply and safely as
> > > possible. If this project is a success, then a large chunk of code can
> > > be removed from lxc. I'm considering this project a part of the larger
> > > lxc project, but given how central it is to systems management that
> > > doesn't mean that I'll consider anyone else's needs as less important
> > > than our own.
> > >
> > > This document consists of two parts. The first describes how I
> > > intend the daemon (cgmanager) to be structured and how it will
> > > enforce the safety requirements. The second describes the commands
> > > which clients will be able to send to the manager. The list of
> > > controller keys which can be set is very incomplete at this point,
> > > serving mainly to show the approach I was thinking of taking.
> > >
> > > Summary
> > >
> > > Each 'host' (identified by a separate instance of the linux kernel) will
> > > have exactly one running daemon to manage control groups. This daemon
> > > will answer cgroup management requests over a dbus socket, located at
> > > /sys/fs/cgroup/manager. This socket can be bind-mounted into various
> > > containers, so that one daemon can support the whole system.
> > >
> > > Programs will be able to make cgroup requests using dbus calls, or
> > > indirectly by linking against lmctfy which will be modified to use the
> > > dbus calls if available.
> > >
> > > Outline:
> > > . A single manager, cgmanager, is started on the host, very early
> > > during boot. It has very few dependencies, and requires only
> > > /proc, /run, and /sys to be mounted, with /etc ro. It will mount
> > > the cgroup hierarchies in a private namespace and set defaults
> > > (clone_children, use_hierarchy, sane_behavior, release_agent?) It
> > > will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
> > > . A client (requestor 'r') can make cgroup requests over
> > > /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> > > requirements for r are listed below.
> > > . The client request will pertain to an existing or new cgroup A. r's
> > > privilege over the cgroup must be checked. r is said to have
> > > privilege over A if A is owned by r's uid, or if A's owner is mapped
> > > into r's user namespace, and r is root in that user namespace.
> > > . The client request may pertain to a victim task v, which may be moved
> > > to a new cgroup. In that case r's privilege over both the cgroup
> > > and v must be checked. r is said to have privilege over v if v
> > > is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> > > and r is root in its userns. Or if r and v have the same uid
> > > and v is mapped in r's pid namespace.
> > > . r's credentials will be taken from the socket's peercred, ensuring that
> > > pid and uid are translated.
> > > . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> > > translated global pid. It will then read UID(v) from /proc/PID(v)/status,
> > > which is the global uid, and check /proc/PID(r)/uid_map to see whether
> > > UID is mapped there.
> > > . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> > > the kernel translate it for the reader. Only 'move task v to cgroup
> > > A' will require a SCM_CREDENTIAL to be sent.
> > >
> > > Privilege requirements by action:
> > > * Requestor of an action (r) over a socket may only make
> > > changes to cgroups over which it has privilege.
> > > * Requestors may be limited to a certain #/depth of cgroups
> > > (to limit memory usage) - DEFER?
> > > * Cgroup hierarchy is responsible for resource limits
> > > * A requestor must either be uid 0 in its userns with victim mapped
> > > into its userns, or the same uid and in same/ancestor pidns as the
> > > victim
> > > * If r requests creation of cgroup '/x', /x will be interpreted
> > > as relative to r's cgroup. r cannot make changes to cgroups not
> > > under its own current cgroup.
> > > * If r is not in the initial user_ns, then it may not change settings
> > > in its own cgroup, only descendants. (Not strictly necessary -
> > > we could require the use of extra cgroups when wanted, as lxc does
> > > currently)
> > > * If r requests creation of cgroup '/x', it must have write access
> > > to its own cgroup (not strictly necessary)
> > > * If r requests chown of cgroup /x to uid Y, Y is passed in a
> > > ucred over the unix socket, and therefore translated to init
> > > userns.
> > > * if r requests setting a limit under /x, then
> > > . either r must be root in its own userns, and UID(/x) be mapped
> > > into its userns, or else UID(r) == UID(/x)
> > > . /x must not be / (not strictly necessary, all users know to
> > > ensure an extra cgroup layer above '/')
> > > . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
> > > which won't be satisfied. Therefore we'll need to do privilege
> > > checks ourselves, then perform the write as the host root user.
> > > (see devices.allow/deny). Further we need to support older kernels
> > > which don't support setns for pid.
> > > * If r requests action on victim V, it passes V's pid in a ucred,
> > > so that gets translated.
> > > Daemon will verify that V's uid is mapped into r's userns. Since
> > > r is either root or the same uid as V, it is allowed to classify.
> > >
> > > The above addresses
> > > * creating cgroups
> > > * chowning cgroups
> > > * setting cgroup limits
> > > * moving tasks into cgroups
> > > . but does not address a 'cgexec <group> -- command' type of behavior.
> > > * To handle that (specifically for upstart), recommend that r do:
> > > if (!pid) {
> > > request_reclassify(cgroup, getpid());
> > > do_execve();
> > > }
> > > . alternatively, the daemon could, if kernel is new enough, setns to
> > > the requestor's namespaces to execute a command in a new cgroup.
> > > The new command would be daemonized to that pid namespace's pid 1.
> > >
> > > Types of requests:
> > > * r requests creating cgroup A'/A
> > > . lmctfy/cli/commands/create.cc
> > > . Verify that UID(r) mapped to 0 in r's userns
> > > . R=cgroup_of(r)
> > > . Verify that UID(R) is mapped into r's userns
> > > . Create R/A'/A
> > > . chown R/A'/A to UID(r)
> > > * r requests to move task x to cgroup A.
> > > . lmctfy/cli/commands/enter.cc
> > > . r must send PID(x) as ancillary message
> > > . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
> > > that userns
> > > (is it safe to allow if UID(x) == UID(r))?
> > > . R=cgroup_of(r)
> > > . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
> > > . echo PID(x) >> /R/A/tasks
> > > * r requests chown of cgroup A to uid X
> > > . X is passed in ancillary message
> > > * ensures it is valid in r's userns
> > > * maps the userid to host for us
> > > . Verify that UID(r) mapped to 0 in r's userns
> > > . R=cgroup_of(r)
> > > . Chown R/A to X
> > > * r requests cgroup A's 'property=value'
> > > . Verify that either
> > > * A != ''
> > > * UID(r) == 0 on host
> > > In other words, r in a userns may not set root cgroup settings.
> > > . Verify that UID(r) mapped to 0 in r's userns
> > > . R=cgroup_of(r)
> > > . Set property=value for R/A
> > > * Expect kernel to guarantee hierarchical constraints
> > > * r requests deletion of cgroup A
> > > . lmctfy/cli/commands/destroy.cc (without -f)
> > > . same requirements as setting 'property=value'
> > > * r requests purge of cgroup A
> > > . lmctfy/cli/commands/destroy.cc (with -f)
> > > . same requirements as setting 'property=value'
> > >
> > > Long-term we will want the cgroup manager to become more intelligent -
> > > to place its own limits on clients, to address cpu and device hotplug,
> > > etc. Since we will not be doing that in the first prototype, the daemon
> > > will not keep any state about the clients.
> > >
> > > Client DBus Message API
> > >
> > > <name>: a-zA-Z0-9
> > > <name>: "a-zA-Z0-9 "
> > > <controllerlist>: <controller1>[:controllerlist]
> > > <valueentry>: key:value
> > > <valueentry>: frozen
> > > <valueentry>: thawed
> > > <values>: valueentry[:values]
> > > keys:
> > > {memory,swap}.{limit,soft_limit}
> > > cpus_allowed # set of allowed cpus
> > > cpus_fraction # % of allowed cpus
> > > cpus_number # number of allowed cpus
> > > cpu_share_percent # percent of cpushare
> > > devices_whitelist
> > > devices_blacklist
> > > net_prio_index
> > > net_prio_interface_map
> > > net_classid
> > > hugetlb_limit
> > > blkio_weight
> > > blkio_weight_device
> > > blkio_throttle_{read,write}
> > > readkeys:
> > > devices_list
> > > {memory,swap}.{failcnt,max_usage,limit,numa_stat}
> > > hugetlb_max_usage
> > > hugetlb_usage
> > > hugetlb_failcnt
> > > cpuacct_stat
> > > <etc>
> > > Commands:
> > > ListControllers
> > > Create <name> <controllerlist> <values>
> > > Setvalue <name> <values>
> > > Getvalue <name> <readkeys>
> > > ListChildren <name>
> > > ListTasks <name>
> > > ListControllers <name>
> > > Chown <name> <uid>
> > > Chown <name> <uid>:<gid>
> > > Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
> > > Delete <name>
> > > Delete-force <name>
> > > Kill <name>
> > >
> > > ------------------------------------------------------------------------------
> > > Shape the Mobile Experience: Free Subscription
> > > Software experts and developers: Be at the forefront of tech innovation.
> > > Intel(R) Software Adrenaline delivers strategic insight and game-changing
> > > conversations that shape the rapidly evolving mobile landscape. Sign up now.
> > > http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> > > _______________________________________________
> > > Lxc-devel mailing list
> > > Lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
> > > https://lists.sourceforge.net/lists/listinfo/lxc-devel
> > >
> >
> > --
> > Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
> > /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
> > NIC whois: MHW9 | An optimist believes we live in the best of all
> > PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
> >
>
>
>
>
>
--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
/\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
NIC whois: MHW9 | An optimist believes we live in the best of all
PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 482 bytes --]
* Re: [lxc-devel] cgroup management daemon
[not found] ` <1385432284.8590.52.camel-s3/A7Nnwjkf10ug9Blv0m0EOCMrvLtNR@public.gmane.org>
2013-11-26 2:43 ` Stéphane Graber
@ 2013-11-26 4:52 ` Tim Hockin
[not found] ` <CAO_RewYmS0fH819BFCr9ozis1132dACgCPwbyb59gM1PafpUkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
1 sibling, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 4:52 UTC (permalink / raw)
To: mhw-UGBql2FAF+1Wk0Htik3J/w
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
At the start of this discussion, some months ago, we offered to
co-devel this with Lennart et al. They did not seem keen on the idea.
If they have an established DBUS protocol spec, we should consider
adopting it instead of a new one, but we CAN'T just play follow the
leader and do whatever they do, changing whenever they feel like
changing.
It would be best if we could get a common DBUS api specc'ed and all
agree to it. Serge, do you feel up to that?
On Mon, Nov 25, 2013 at 6:18 PM, Michael H. Warfield <mhw-UGBql2FAF+1Wk0Htik3J/w@public.gmane.org> wrote:
> Serge...
>
> You have no idea how much I dread mentioning this (well, after
> LinuxPlumbers, maybe you can) but... You do realize that some of this
> is EXACTLY what the systemd crowd was talking about there in NOLA back
> then. I sat in those sessions grinding my teeth and listening to
> comments from some others around me about when systemd might subsume
> bash or even vi or quake.
>
> Somehow, you and others have tagged me as a "systemd expert" but I am
> far from it and even you noted that Lennart and I were on the edge of a
> physical discussion when I made some "off the cuff" remarks there about
> systemd design during my talk. I personally rank systemd in the same
> category as NetworkMangler (err, NetworkManager) in its propensity for
> committing inexplicable random acts of terrorism and changing its
> behavior from release to release to release. I'm not a fan and I'm not
> an expert, but I have to be involved with it and watch the damned thing
> like a trapped rat, like it or not.
>
> Like it or not, we cannot go off on divergent designs. As much as they
> have delusions of taking over the Linux world, they are still going to
> be a major factor and this sort of thing needs to be coordinated. We
> are going to need exactly what you are proposing whether we have systemd
> in play or not. IF we CAN kick it to the curb, when we need to, we
> still need to know how to without tearing shit up and breaking shit that
> thinks it's there. Ideally, it shouldn't matter if systemd were in
> play or not.
>
> All I ask is that we not get so far off track that we end up with a
> major architectural divergence here. The risk is there.
>
> Mike
>
>
> On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
>> Hi,
>>
>> as I've mentioned several times, I want to write a standalone cgroup
>> management daemon. Basic requirements are that it be a standalone
>> program; that a single instance running on the host be usable from
>> containers nested at any depth; that it not allow escaping one's
>> assigned limits; that it not allow subjugating tasks which do not
>> belong to you; and that, within your limits, you be able to parcel
>> those limits to your tasks as you like.
>>
>> Additionally, Tejun has specified that we do not want users to be
>> too closely tied to the cgroupfs implementation. Therefore
>> commands will be just a hair more general than specifying cgroupfs
>> filenames and values. I may go so far as to avoid specifying
>> specific controllers, as AFAIK there should be no redundancy in
>> features. On the other hand, I don't want to get too general.
>> So I'm basing the API loosely on the lmctfy command line API.
>>
>> One of the driving goals is to enable nested lxc as simply and safely as
>> possible. If this project is a success, then a large chunk of code can
>> be removed from lxc. I'm considering this project a part of the larger
>> lxc project, but given how central it is to systems management that
>> doesn't mean that I'll consider anyone else's needs as less important
>> than our own.
>>
>> This document consists of two parts. The first describes how I
>> intend the daemon (cgmanager) to be structured and how it will
>> enforce the safety requirements. The second describes the commands
>> which clients will be able to send to the manager. The list of
>> controller keys which can be set is very incomplete at this point,
>> serving mainly to show the approach I was thinking of taking.
>>
>> Summary
>>
>> Each 'host' (identified by a separate instance of the linux kernel) will
>> have exactly one running daemon to manage control groups. This daemon
>> will answer cgroup management requests over a dbus socket, located at
>> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>> containers, so that one daemon can support the whole system.
>>
>> Programs will be able to make cgroup requests using dbus calls, or
>> indirectly by linking against lmctfy which will be modified to use the
>> dbus calls if available.
>>
>> Outline:
>> . A single manager, cgmanager, is started on the host, very early
>> during boot. It has very few dependencies, and requires only
>> /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>> the cgroup hierarchies in a private namespace and set defaults
>> (clone_children, use_hierarchy, sane_behavior, release_agent?) It
>> will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>> . A client (requestor 'r') can make cgroup requests over
>> /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>> requirements for r are listed below.
>> . The client request will pertain to an existing or new cgroup A. r's
>> privilege over the cgroup must be checked. r is said to have
>> privilege over A if A is owned by r's uid, or if A's owner is mapped
>> into r's user namespace, and r is root in that user namespace.
>> . The client request may pertain to a victim task v, which may be moved
>> to a new cgroup. In that case r's privilege over both the cgroup
>> and v must be checked. r is said to have privilege over v if v
>> is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>> and r is root in its userns. Or if r and v have the same uid
>> and v is mapped in r's pid namespace.
>> . r's credentials will be taken from the socket's peercred, ensuring that
>> pid and uid are translated.
>> . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
>> translated global pid. It will then read UID(v) from /proc/PID(v)/status,
>> which is the global uid, and check /proc/PID(r)/uid_map to see whether
>> UID is mapped there.
>> . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>> the kernel translate it for the reader. Only 'move task v to cgroup
>> A' will require a SCM_CREDENTIAL to be sent.
>>
>> Privilege requirements by action:
>> * Requestor of an action (r) over a socket may only make
>> changes to cgroups over which it has privilege.
>> * Requestors may be limited to a certain #/depth of cgroups
>> (to limit memory usage) - DEFER?
>> * Cgroup hierarchy is responsible for resource limits
>> * A requestor must either be uid 0 in its userns with victim mapped
>> into its userns, or the same uid and in same/ancestor pidns as the
>> victim
>> * If r requests creation of cgroup '/x', /x will be interpreted
>> as relative to r's cgroup. r cannot make changes to cgroups not
>> under its own current cgroup.
>> * If r is not in the initial user_ns, then it may not change settings
>> in its own cgroup, only descendants. (Not strictly necessary -
>> we could require the use of extra cgroups when wanted, as lxc does
>> currently)
>> * If r requests creation of cgroup '/x', it must have write access
>> to its own cgroup (not strictly necessary)
>> * If r requests chown of cgroup /x to uid Y, Y is passed in a
>> ucred over the unix socket, and therefore translated to init
>> userns.
>> * if r requests setting a limit under /x, then
>> . either r must be root in its own userns, and UID(/x) be mapped
>> into its userns, or else UID(r) == UID(/x)
>> . /x must not be / (not strictly necessary, all users know to
>> ensure an extra cgroup layer above '/')
>> . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>> which won't be satisfied. Therefore we'll need to do privilege
>> checks ourselves, then perform the write as the host root user.
>> (see devices.allow/deny). Further we need to support older kernels
>> which don't support setns for pid.
>> * If r requests action on victim V, it passes V's pid in a ucred,
>> so that gets translated.
>> Daemon will verify that V's uid is mapped into r's userns. Since
>> r is either root or the same uid as V, it is allowed to classify.
>>
>> The above addresses
>> * creating cgroups
>> * chowning cgroups
>> * setting cgroup limits
>> * moving tasks into cgroups
>> . but does not address a 'cgexec <group> -- command' type of behavior.
>> * To handle that (specifically for upstart), recommend that r do:
>> if (!pid) {
>> request_reclassify(cgroup, getpid());
>> do_execve();
>> }
>> . alternatively, the daemon could, if kernel is new enough, setns to
>> the requestor's namespaces to execute a command in a new cgroup.
>> The new command would be daemonized to that pid namespace's pid 1.
>>
>> Types of requests:
>> * r requests creating cgroup A'/A
>> . lmctfy/cli/commands/create.cc
>> . Verify that UID(r) mapped to 0 in r's userns
>> . R=cgroup_of(r)
>> . Verify that UID(R) is mapped into r's userns
>> . Create R/A'/A
>> . chown R/A'/A to UID(r)
>> * r requests to move task x to cgroup A.
>> . lmctfy/cli/commands/enter.cc
>> . r must send PID(x) as ancillary message
>> . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>> that userns
>> (is it safe to allow if UID(x) == UID(r))?
>> . R=cgroup_of(r)
>> . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
>> . echo PID(x) >> /R/A/tasks
>> * r requests chown of cgroup A to uid X
>> . X is passed in ancillary message
>> * ensures it is valid in r's userns
>> * maps the userid to host for us
>> . Verify that UID(r) mapped to 0 in r's userns
>> . R=cgroup_of(r)
>> . Chown R/A to X
>> * r requests cgroup A's 'property=value'
>> . Verify that either
>> * A != ''
>> * UID(r) == 0 on host
>> In other words, r in a userns may not set root cgroup settings.
>> . Verify that UID(r) mapped to 0 in r's userns
>> . R=cgroup_of(r)
>> . Set property=value for R/A
>> * Expect kernel to guarantee hierarchical constraints
>> * r requests deletion of cgroup A
>> . lmctfy/cli/commands/destroy.cc (without -f)
>> . same requirements as setting 'property=value'
>> * r requests purge of cgroup A
>> . lmctfy/cli/commands/destroy.cc (with -f)
>> . same requirements as setting 'property=value'
>>
>> Long-term we will want the cgroup manager to become more intelligent -
>> to place its own limits on clients, to address cpu and device hotplug,
>> etc. Since we will not be doing that in the first prototype, the daemon
>> will not keep any state about the clients.
>>
>> Client DBus Message API
>>
>> <name>: a-zA-Z0-9
>> <name>: "a-zA-Z0-9 "
>> <controllerlist>: <controller1>[:controllerlist]
>> <valueentry>: key:value
>> <valueentry>: frozen
>> <valueentry>: thawed
>> <values>: valueentry[:values]
>> keys:
>> {memory,swap}.{limit,soft_limit}
>> cpus_allowed # set of allowed cpus
>> cpus_fraction # % of allowed cpus
>> cpus_number # number of allowed cpus
>> cpu_share_percent # percent of cpushare
>> devices_whitelist
>> devices_blacklist
>> net_prio_index
>> net_prio_interface_map
>> net_classid
>> hugetlb_limit
>> blkio_weight
>> blkio_weight_device
>> blkio_throttle_{read,write}
>> readkeys:
>> devices_list
>> {memory,swap}.{failcnt,max_usage,limit,numa_stat}
>> hugetlb_max_usage
>> hugetlb_usage
>> hugetlb_failcnt
>> cpuacct_stat
>> <etc>
>> Commands:
>> ListControllers
>> Create <name> <controllerlist> <values>
>> Setvalue <name> <values>
>> Getvalue <name> <readkeys>
>> ListChildren <name>
>> ListTasks <name>
>> ListControllers <name>
>> Chown <name> <uid>
>> Chown <name> <uid>:<gid>
>> Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
>> Delete <name>
>> Delete-force <name>
>> Kill <name>
>>
>>
>
> --
> Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw-BetbSzk+GohWk0Htik3J/w@public.gmane.org
> /\/\|=mhw=|\/\/ | (678) 463-0932 | http://www.wittsend.com/mhw/
> NIC whois: MHW9 | An optimist believes we live in the best of all
> PGP Key: 0x674627FF | possible worlds. A pessimist is sure of it!
>
* Re: cgroup management daemon
[not found] ` <20131125224335.GA15481-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 0:03 ` [lxc-devel] " Marian Marinov
2013-11-26 2:18 ` Michael H. Warfield
@ 2013-11-26 4:58 ` Tim Hockin
[not found] ` <CAO_RewZGWARUafKzDc_t3G5OedGtEPTZgB2VYeHHiKSSrja8fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-03 13:45 ` Tejun Heo
3 siblings, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 4:58 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Thanks for this! I think it helps a lot to discuss now, rather than
over nearly-done code.
On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Additionally, Tejun has specified that we do not want users to be
> too closely tied to the cgroupfs implementation. Therefore
> commands will be just a hair more general than specifying cgroupfs
> filenames and values. I may go so far as to avoid specifying
> specific controllers, as AFAIK there should be no redundancy in
> features. On the other hand, I don't want to get too general.
> So I'm basing the API loosely on the lmctfy command line API.
I'm torn here. While I agree in principle with Tejun, I am concerned
that this agent will always lag new kernel features or that the thin
abstraction you want to provide here does not easily accommodate some
of the more ... oddball features of one cgroup interface or another.
This agent is the very bottom of the stack, and should probably not do
much by way of abstraction. I think I'd rather let something like
lmctfy provide the abstraction more holistically, and relegate this
agent to very simple plumbing and policy. It could be as simple as
providing read/write/etc ops to specific control files. It needs to
handle event_fd, too, I guess. This has the nice side-effect of
always being "current" on kernel features :)
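To make "very simple plumbing" concrete, here is a rough sketch of what the
whole abstraction could collapse to: map a (controller, relative cgroup, key)
triple onto a cgroupfs control file and leave the rest to plain read/write.
The mount layout and the function name here are my assumptions, not anything
cgmanager has committed to:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Map (controller, relative cgroup, key) to a cgroupfs control file,
 * e.g. ("memory", "lxc/c1", "limit_in_bytes") ->
 * "/sys/fs/cgroup/memory/lxc/c1/memory.limit_in_bytes".
 * Returns 0 on success, -1 if the path could escape the hierarchy. */
int build_ctrl_path(char *buf, size_t len, const char *root,
                    const char *controller, const char *cgroup,
                    const char *key)
{
    /* The cgroup must stay relative to the requestor's own subtree. */
    if (cgroup[0] == '/' || strstr(cgroup, ".."))
        return -1;
    int n = snprintf(buf, len, "%s/%s/%s/%s.%s",
                     root, controller, cgroup, controller, key);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A daemon that does only this stays current with new kernel control files for
free; the policy (who may write what) remains the only hard part.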
> Summary
>
> Each 'host' (identified by a separate instance of the linux kernel) will
> have exactly one running daemon to manage control groups. This daemon
> will answer cgroup management requests over a dbus socket, located at
> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
> containers, so that one daemon can support the whole system.
>
> Programs will be able to make cgroup requests using dbus calls, or
> indirectly by linking against lmctfy which will be modified to use the
> dbus calls if available.
>
> Outline:
> . A single manager, cgmanager, is started on the host, very early
> during boot. It has very few dependencies, and requires only
> /proc, /run, and /sys to be mounted, with /etc ro. It will mount
> the cgroup hierarchies in a private namespace and set defaults
> (clone_children, use_hierarchy, sane_behavior, release_agent?) It
> will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
Where does the config come from? How do I specify which hierarchies I
want and where, and which flags?
> . A client (requestor 'r') can make cgroup requests over
> /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> requirements for r are listed below.
> . The client request will pertain to an existing or new cgroup A. r's
> privilege over the cgroup must be checked. r is said to have
> privilege over A if A is owned by r's uid, or if A's owner is mapped
> into r's user namespace, and r is root in that user namespace.
Problem with this definition. Being owned-by is not the same as
has-root-in. Specifically, I may choose to give you root in your own
namespace, but you sure as heck can not increase your own memory
limit.
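To pin down the objection: the rule as written is a disjunction, and the
second branch alone can grant too much. A toy model of the check as I read
the proposal (names are mine, purely illustrative):

```c
#include <assert.h>

/* Toy model of the proposed rule: r has privilege over cgroup A if
 * (a) A is owned by r's uid, OR (b) A's owner is mapped into r's user
 * namespace and r is root there.  Branch (b) is the worrying one:
 * being root in your own namespace shouldn't by itself let you raise
 * limits on a cgroup you were merely placed under. */
struct requestor {
    unsigned uid;       /* r's uid, in the host's view   */
    int is_ns_root;     /* is r uid 0 in its own userns? */
};

int has_privilege(const struct requestor *r, unsigned a_owner,
                  int owner_mapped_in_r_ns)
{
    if (a_owner == r->uid)
        return 1;                       /* branch (a): ownership */
    return r->is_ns_root && owner_mapped_in_r_ns;  /* branch (b) */
}
```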
> . The client request may pertain to a victim task v, which may be moved
> to a new cgroup. In that case r's privilege over both the cgroup
> and v must be checked. r is said to have privilege over v if v
> is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> and r is root in its userns. Or if r and v have the same uid
> and v is mapped in r's pid namespace.
> . r's credentials will be taken from the socket's peercred, ensuring that
> pid and uid are translated.
> . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> translated global pid. It will then read UID(v) from /proc/PID(v)/status,
> which is the global uid, and check /proc/PID(r)/uid_map to see whether
> UID is mapped there.
> . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> the kernel translate it for the reader. Only 'move task v to cgroup
> A' will require a SCM_CREDENTIAL to be sent.
>
> Privilege requirements by action:
> * Requestor of an action (r) over a socket may only make
> changes to cgroups over which it has privilege.
> * Requestors may be limited to a certain #/depth of cgroups
> (to limit memory usage) - DEFER?
> * Cgroup hierarchy is responsible for resource limits
> * A requestor must either be uid 0 in its userns with victim mapped
> into its userns, or the same uid and in same/ancestor pidns as the
> victim
> * If r requests creation of cgroup '/x', /x will be interpreted
> as relative to r's cgroup. r cannot make changes to cgroups not
> under its own current cgroup.
Does this imply that r in a lower-level (farther from root) of the
hierarchy can not make requests of higher levels of the hierarchy
(closer to root), even though they have permissions as per the
definition of privilege?
How do we reconcile this pseudo-virtualization with /proc/self/cgroup
which DOES expose raw paths?
> * If r is not in the initial user_ns, then it may not change settings
> in its own cgroup, only descendants. (Not strictly necessary -
> we could require the use of extra cgroups when wanted, as lxc does
> currently)
> * If r requests creation of cgroup '/x', it must have write access
> to its own cgroup (not strictly necessary)
Can you explain what you mean by "not strictly necessary" - is this
part of the requirement space or not?
> * If r requests chown of cgroup /x to uid Y, Y is passed in a
> ucred over the unix socket, and therefore translated to init
> userns.
I thought only UID 0 could specify a UID other than the real UID? Have
I misunderstood that?
> * if r requests setting a limit under /x, then
> . either r must be root in its own userns, and UID(/x) be mapped
> into its userns, or else UID(r) == UID(/x)
> . /x must not be / (not strictly necessary, all users know to
> ensure an extra cgroup layer above '/')
I don't understand this point
> . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
> which won't be satisfied. Therefore we'll need to do privilege
> checks ourselves, then perform the write as the host root user.
> (see devices.allow/deny). Further we need to support older kernels
> which don't support setns for pid.
> * If r requests action on victim V, it passes V's pid in a ucred,
> so that gets translated.
> Daemon will verify that V's uid is mapped into r's userns. Since
> r is either root or the same uid as V, it is allowed to classify.
>
> The above addresses
> * creating cgroups
> * chowning cgroups
> * setting cgroup limits
> * moving tasks into cgroups
> . but does not address a 'cgexec <group> -- command' type of behavior.
> * To handle that (specifically for upstart), recommend that r do:
> if (!pid) {
> request_reclassify(cgroup, getpid());
> do_execve();
> }
If I follow, you're saying that the caller does the fork/exec and all
this daemon does is munge cgroups for the calling PID? If so, I
agree, I think.
> . alternatively, the daemon could, if kernel is new enough, setns to
> the requestor's namespaces to execute a command in a new cgroup.
> The new command would be daemonized to that pid namespace's pid 1.
>
> Types of requests:
> * r requests creating cgroup A'/A
> . lmctfy/cli/commands/create.cc
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Verify that UID(R) is mapped into r's userns
> . Create R/A'/A
> . chown R/A'/A to UID(r)
> * r requests to move task x to cgroup A.
> . lmctfy/cli/commands/enter.cc
> . r must send PID(x) as ancillary message
> . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
> that userns
> (is it safe to allow if UID(x) == UID(r))?
> . R=cgroup_of(r)
> . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
> . echo PID(x) >> /R/A/tasks
> * r requests chown of cgroup A to uid X
> . X is passed in ancillary message
> * ensures it is valid in r's userns
> * maps the userid to host for us
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Chown R/A to X
> * r requests cgroup A's 'property=value'
> . Verify that either
> * A != ''
> * UID(r) == 0 on host
> In other words, r in a userns may not set root cgroup settings.
> . Verify that UID(r) mapped to 0 in r's userns
> . R=cgroup_of(r)
> . Set property=value for R/A
> * Expect kernel to guarantee hierarchical constraints
> * r requests deletion of cgroup A
> . lmctfy/cli/commands/destroy.cc (without -f)
> . same requirements as setting 'property=value'
> * r requests purge of cgroup A
> . lmctfy/cli/commands/destroy.cc (with -f)
> . same requirements as setting 'property=value'
>
> Long-term we will want the cgroup manager to become more intelligent -
> to place its own limits on clients, to address cpu and device hotplug,
> etc. Since we will not be doing that in the first prototype, the daemon
> will not keep any state about the clients.
>
> Client DBus Message API
>
> <name>: a-zA-Z0-9
> <name>: "a-zA-Z0-9 "
> <controllerlist>: <controller1>[:controllerlist]
> <valueentry>: key:value
> <valueentry>: frozen
> <valueentry>: thawed
> <values>: valueentry[:values]
> keys:
> {memory,swap}.{limit,soft_limit}
> cpus_allowed # set of allowed cpus
> cpus_fraction # % of allowed cpus
> cpus_number # number of allowed cpus
> cpu_share_percent # percent of cpushare
> devices_whitelist
> devices_blacklist
> net_prio_index
> net_prio_interface_map
> net_classid
> hugetlb_limit
> blkio_weight
> blkio_weight_device
> blkio_throttle_{read,write}
> readkeys:
> devices_list
> {memory,swap}.{failcnt,max_use,limit,numa_stat}
> hugetlb_max_usage
> hugetlb_usage
> hugetlb_failcnt
> cpuacct_stat
> <etc>
> Commands:
> ListControllers
> Create <name> <controllerlist> <values>
> Setvalue <name> <values>
> Getvalue <name> <readkeys>
> ListChildren <name>
> ListTasks <name>
> ListControllers <name>
> Chown <name> <uid>
> Chown <name> <uid>:<gid>
> Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
> Delete <name>
> Delete-force <name>
> Kill <name>
What are the requirements/goals around performance and concurrency?
Do you expect this to be a single-threaded thing, or can we handle
some number of concurrent operations? Do you expect to use threads or
processes?
Can you talk about logging - what and where?
How will we handle event_fd? Pass a file-descriptor back to the caller?
That's all I can come up with for now.
* Re: cgroup management daemon
[not found] ` <CAO_RewZGWARUafKzDc_t3G5OedGtEPTZgB2VYeHHiKSSrja8fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 5:47 ` Serge E. Hallyn
[not found] ` <20131126054718.GA19134-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 16:12 ` Serge E. Hallyn
2013-12-03 13:54 ` Tejun Heo
2 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 5:47 UTC (permalink / raw)
To: Tim Hockin
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> Thanks for this! I think it helps a lot to discuss now, rather than
> over nearly-done code.
>
> On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Additionally, Tejun has specified that we do not want users to be
> > too closely tied to the cgroupfs implementation. Therefore
> > commands will be just a hair more general than specifying cgroupfs
> > filenames and values. I may go so far as to avoid specifying
> > specific controllers, as AFAIK there should be no redundancy in
> > features. On the other hand, I don't want to get too general.
> > So I'm basing the API loosely on the lmctfy command line API.
>
> I'm torn here. While I agree in principle with Tejun, I am concerned
> that this agent will always lag new kernel features or that the thin
> abstraction you want to provide here does not easily accommodate some
> of the more ... oddball features of one cgroup interface or another.
>
> This agent is the very bottom of the stack, and should probably not do
> much by way of abstraction. I think I'd rather let something like
> lmctfy provide the abstraction more holistically, and relegate this
If lmctfy is an abstraction layer, that should keep Tejun happy, and
it could keep me out of the resource naming game, which makes me happy :)
> agent to very simple plumbing and policy. It could be as simple as
> providing read/write/etc ops to specific control files. It needs to
> handle event_fd, too, I guess. This has the nice side-effect of
> always being "current" on kernel features :)
>
> > Summary
> >
> > Each 'host' (identified by a separate instance of the linux kernel) will
> > have exactly one running daemon to manage control groups. This daemon
> > will answer cgroup management requests over a dbus socket, located at
> > /sys/fs/cgroup/manager. This socket can be bind-mounted into various
> > containers, so that one daemon can support the whole system.
> >
> > Programs will be able to make cgroup requests using dbus calls, or
> > indirectly by linking against lmctfy which will be modified to use the
> > dbus calls if available.
> >
> > Outline:
> > . A single manager, cgmanager, is started on the host, very early
> > during boot. It has very few dependencies, and requires only
> > /proc, /run, and /sys to be mounted, with /etc ro. It will mount
> > the cgroup hierarchies in a private namespace and set defaults
> > (clone_children, use_hierarchy, sane_behavior, release_agent?) It
> > will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>
> Where does the config come from? How do I specify which hierarchies I
> want and where, and which flags?
That'll have to be in a file in /etc (which can be mounted readonly).
There should be no surprises there so I've not thought about the format.
> > . A client (requestor 'r') can make cgroup requests over
> > /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> > requirements for r are listed below.
> > . The client request will pertain to an existing or new cgroup A. r's
> > privilege over the cgroup must be checked. r is said to have
> > privilege over A if A is owned by r's uid, or if A's owner is mapped
> > into r's user namespace, and r is root in that user namespace.
>
> Problem with this definition. Being owned-by is not the same as
> has-root-in. Specifically, I may choose to give you root in your own
> namespace, but you sure as heck can not increase your own memory
> limit.
1. If you don't want me to change the value at all, then just don't map
A's owner into the namespace. I'm uid 100000 which is root in my namespace,
but I only have privilege over other uids mapped into my namespace.
2. I've considered never allowing changes to your own cgroup. So if you're
in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
b's. OTOH, that isn't strictly necessary - if we did allow it, then you
could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
so I can't escape the memory limit you wanted.
3. I've not considered having the daemon track resource limits - i.e. creating
a cgroup and saying "give it 100M swap, and if it asks, let it increase that
to 200M." I'd prefer that be done incidentally through (1) and (2). Do you
feel that would be insufficient?
Or maybe your question is something different and I'm missing it?
> > . The client request may pertain to a victim task v, which may be moved
> > to a new cgroup. In that case r's privilege over both the cgroup
> > and v must be checked. r is said to have privilege over v if v
> > is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> > and r is root in its userns. Or if r and v have the same uid
> > and v is mapped in r's pid namespace.
> > . r's credentials will be taken from socket's peercred, ensuring that
> > pid and uid are translated.
> > . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> > translated global pid. It will then read UID(v) from /proc/PID(v)/status,
> > which is the global uid, and check /proc/PID(r)/uid_map to see whether
> > UID is mapped there.
> > . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> > the kernel translate it for the reader. Only 'move task v to cgroup
> > A' will require a SCM_CREDENTIAL to be sent.
> >
> > Privilege requirements by action:
> > * Requestor of an action (r) over a socket may only make
> > changes to cgroups over which it has privilege.
> > * Requestors may be limited to a certain #/depth of cgroups
> > (to limit memory usage) - DEFER?
> > * Cgroup hierarchy is responsible for resource limits
> > * A requestor must either be uid 0 in its userns with victim mapped
> > into its userns, or the same uid and in same/ancestor pidns as the
> > victim
> > * If r requests creation of cgroup '/x', /x will be interpreted
> > as relative to r's cgroup. r cannot make changes to cgroups not
> > under its own current cgroup.
>
> Does this imply that r in a lower-level (farther from root) of the
> hierarchy can not make requests of higher levels of the hierarchy
> (closer to root), even though they have permissions as per the
> definition of privilege?
Right.
> How do we reconcile this pseudo-virtualization with /proc/self/cgroup
> which DOES expose raw paths?
We <shrug> :)
Just as /proc/cpuinfo isn't updated depending on your cpuset. If you
want to know the true depth, it's not my goal to fool you.
> > * If r is not in the initial user_ns, then it may not change settings
> > in its own cgroup, only descendants. (Not strictly necessary -
> > we could require the use of extra cgroups when wanted, as lxc does
> > currently)
> > * If r requests creation of cgroup '/x', it must have write access
> > to its own cgroup (not strictly necessary)
>
> Can you explain what you mean by "not strictly necessary" - is this
> part of the requirement space or not?
Not sure why I put that there. Let me state it more generally - if r wants
to create /a/b/c (which is relative to his own current cgroup), then r
must have write access under /a/b. Whether he must have write access to his
/, that I'm not sure about.
> > * If r requests chown of cgroup /x to uid Y, Y is passed in a
> > ucred over the unix socket, and therefore translated to init
> > userns.
>
> I thought only UID 0 could specify a UID other than the real UID? Have
> I misunderstood that?
UID 0 in a child user ns should be able to pass in any uid in his own
namespace.
> > * if r requests setting a limit under /x, then
> > . either r must be root in its own userns, and UID(/x) be mapped
> > into its userns, or else UID(r) == UID(/x)
> > . /x must not be / (not strictly necessary, all users know to
> > ensure an extra cgroup layer above '/')
>
> I don't understand this point
The point is to ensure that the in-kernel cgroup hierarchy support enforces
that r can't escape his limits. So if I create a container and I want it
to have memory {limit: 500M}, then either I can create /a/b, put the
memory limit on /a/b, and put r into /a/b/c; or I can put r right into
/a/b and not let r modify /a/b's settings.
> > . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
> > which won't be satisfied. Therefore we'll need to do privilege
> > checks ourselves, then perform the write as the host root user.
> > (see devices.allow/deny). Further we need to support older kernels
> > which don't support setns for pid.
> > * If r requests action on victim V, it passes V's pid in a ucred,
> > so that gets translated.
> > Daemon will verify that V's uid is mapped into r's userns. Since
> > r is either root or the same uid as V, it is allowed to classify.
> >
> > The above addresses
> > * creating cgroups
> > * chowning cgroups
> > * setting cgroup limits
> > * moving tasks into cgroups
> > . but does not address a 'cgexec <group> -- command' type of behavior.
> > * To handle that (specifically for upstart), recommend that r do:
> > if (!pid) {
> > request_reclassify(cgroup, getpid());
> > do_execve();
> > }
>
> If I follow, you're saying that the caller does the fork/exec and all
> this daemon does is munge cgroups for the calling PID? If so, I
> agree, I think.
Right. (Difference with the unfortunate libcgroup race conditions
being that in this case we have the caller's cooperation :)
> > . alternatively, the daemon could, if kernel is new enough, setns to
> > the requestor's namespaces to execute a command in a new cgroup.
> > The new command would be daemonized to that pid namespace's pid 1.
> >
> > Types of requests:
> > * r requests creating cgroup A'/A
> > . lmctfy/cli/commands/create.cc
> > . Verify that UID(r) mapped to 0 in r's userns
> > . R=cgroup_of(r)
> > . Verify that UID(R) is mapped into r's userns
> > . Create R/A'/A
> > . chown R/A'/A to UID(r)
> > * r requests to move task x to cgroup A.
> > . lmctfy/cli/commands/enter.cc
> > . r must send PID(x) as ancillary message
> > . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
> > that userns
> > (is it safe to allow if UID(x) == UID(r))?
> > . R=cgroup_of(r)
> > . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
> > . echo PID(x) >> /R/A/tasks
> > * r requests chown of cgroup A to uid X
> > . X is passed in ancillary message
> > * ensures it is valid in r's userns
> > * maps the userid to host for us
> > . Verify that UID(r) mapped to 0 in r's userns
> > . R=cgroup_of(r)
> > . Chown R/A to X
> > * r requests cgroup A's 'property=value'
> > . Verify that either
> > * A != ''
> > * UID(r) == 0 on host
> > In other words, r in a userns may not set root cgroup settings.
> > . Verify that UID(r) mapped to 0 in r's userns
> > . R=cgroup_of(r)
> > . Set property=value for R/A
> > * Expect kernel to guarantee hierarchical constraints
> > * r requests deletion of cgroup A
> > . lmctfy/cli/commands/destroy.cc (without -f)
> > . same requirements as setting 'property=value'
> > * r requests purge of cgroup A
> > . lmctfy/cli/commands/destroy.cc (with -f)
> > . same requirements as setting 'property=value'
> >
> > Long-term we will want the cgroup manager to become more intelligent -
> > to place its own limits on clients, to address cpu and device hotplug,
> > etc. Since we will not be doing that in the first prototype, the daemon
> > will not keep any state about the clients.
> >
> > Client DBus Message API
> >
> > <name>: a-zA-Z0-9
> > <name>: "a-zA-Z0-9 "
> > <controllerlist>: <controller1>[:controllerlist]
> > <valueentry>: key:value
> > <valueentry>: frozen
> > <valueentry>: thawed
> > <values>: valueentry[:values]
> > keys:
> > {memory,swap}.{limit,soft_limit}
> > cpus_allowed # set of allowed cpus
> > cpus_fraction # % of allowed cpus
> > cpus_number # number of allowed cpus
> > cpu_share_percent # percent of cpushare
> > devices_whitelist
> > devices_blacklist
> > net_prio_index
> > net_prio_interface_map
> > net_classid
> > hugetlb_limit
> > blkio_weight
> > blkio_weight_device
> > blkio_throttle_{read,write}
> > readkeys:
> > devices_list
> > {memory,swap}.{failcnt,max_use,limit,numa_stat}
> > hugetlb_max_usage
> > hugetlb_usage
> > hugetlb_failcnt
> > cpuacct_stat
> > <etc>
> > Commands:
> > ListControllers
> > Create <name> <controllerlist> <values>
> > Setvalue <name> <values>
> > Getvalue <name> <readkeys>
> > ListChildren <name>
> > ListTasks <name>
> > ListControllers <name>
> > Chown <name> <uid>
> > Chown <name> <uid>:<gid>
> > Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
> > Delete <name>
> > Delete-force <name>
> > Kill <name>
Will address the rest tomorrow. Thanks for reviewing!
> What are the requirements/goals around performance and concurrency?
> Do you expect this to be a single-threaded thing, or can we handle
> some number of concurrent operations? Do you expect to use threads or
> processes?
>
> Can you talk about logging - what and where?
>
> How will we handle event_fd? Pass a file-descriptor back to the caller?
>
> That's all I can come up with for now.
* Re: cgroup management daemon
[not found] ` <CAO_RewZGWARUafKzDc_t3G5OedGtEPTZgB2VYeHHiKSSrja8fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 5:47 ` Serge E. Hallyn
@ 2013-11-26 16:12 ` Serge E. Hallyn
[not found] ` <20131126161246.GA23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-12-03 13:54 ` Tejun Heo
2 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 16:12 UTC (permalink / raw)
To: Tim Hockin
Cc: Stéphane Graber, Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> What are the requirements/goals around performance and concurrency?
> Do you expect this to be a single-threaded thing, or can we handle
> some number of concurrent operations? Do you expect to use threads or
> processes?
The cgmanager should be pretty dumb, so I would expect it to be
quite fast. I don't have any specific perf goals though. If you
have requirements I'm very interested to hear them. I should be
able to tell pretty soon how far short I fall.
By default I'd expect to run with a single thread, but I don't
imagine one thread can serve a busy 1024-cpu system very well.
Unless you have guidance right now, I think I'd like to get
started with the basic functionality and see how it measures
up to your requirements. I should add perf counters from the
start so we can figure out where bottlenecks (if any) are and
how to handle them.
Otherwise I could start out with a basic numcpus/10 threadpool
and have the main thread do socket i/o and parcel access
verification and vfs work out to the threadpool, but I'd rather
first know where the problems lie.
> Can you talk about logging - what and where?
When started under upstart, anything we print out goes to
/var/log/upstart/cgmanager.log. Would be nice to keep it
that simple. We could log requests by r to do something
it is not allowed to do, but it seems to me the failed
attempts cause no harm, while the potential for overflowing
logs can.
Did you have anything in mind? Did you want logging to help
detect certain conditions for system optimization, or just
for failure notices and security violations?
> How will we handle event_fd? Pass a file-descriptor back to the caller?
The only thing currently supporting eventfd is memory threshold,
right? I haven't tested whether this will work or not, but
ideally the caller would open the eventfd fd, pass it, the
cgroup name, controller file to be watched, and the args to
cgmanager; cgmanager confirms read access, opens the
controller fd, makes the request over cgroup.event_control,
then passes the controller fd back to the caller and closes
its own copy.
I'm also not sure whether the cgroup interface is going to be
offering a new feature to replace eventfd, since it wants
people to stop using cgroupfs... Tejun?
> That's all I can come up with for now.
* Re: cgroup management daemon
[not found] ` <20131126161246.GA23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 16:22 ` Victor Marmol
[not found] ` <CAD=mX8tCOEO4wP-XGs9YdRufTAay6zPaOxo_wZF=-KoqepH0wg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 20:45 ` Tim Hockin
1 sibling, 1 reply; 39+ messages in thread
From: Victor Marmol @ 2013-11-26 16:22 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Stéphane Graber, Tim Hockin, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > What are the requirements/goals around performance and concurrency?
> > Do you expect this to be a single-threaded thing, or can we handle
> > some number of concurrent operations? Do you expect to use threads or
> > processes?
>
> The cgmanager should be pretty dumb, so I would expect it to be
> quite fast. I don't have any specific perf goals though. If you
> have requirements I'm very interested to hear them. I should be
> able to tell pretty soon how far short I fall.
>
> By default I'd expect to run with a single thread, but I don't
> imagine one thread can serve a busy 1024-cpu system very well.
> Unless you have guidance right now, I think I'd like to get
> started with the basic functionality and see how it measures
> up to your requirements. I should add perf counters from the
> start so we can figure out where bottlenecks (if any) are and
> how to handle them.
>
> Otherwise I could start out with a basic numcpus/10 threadpool
> and have the main thread do socket i/o and parcel access
> verification and vfs work out to the threadpool, but I'd rather
> first know where the problems lie.
>
From Rohit's talk at Linux plumbers:
http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf
The goal is O(1000) reads and O(100) writes per second.
>
> > Can you talk about logging - what and where?
>
> When started under upstart, anything we print out goes to
> /var/log/upstart/cgmanager.log. Would be nice to keep it
> that simple. We could log requests by r to do something
> it is not allowed to do, but it seems to me the failed
> attempts cause no harm, while the potential for overflowing
> logs can.
>
> Did you have anything in mind? Did you want logging to help
> detect certain conditions for system optimization, or just
> for failure notices and security violations?
>
> > How will we handle event_fd? Pass a file-descriptor back to the caller?
>
> The only thing currently supporting eventfd is memory threshold,
> right? I haven't tested whether this will work or not, but
> ideally the caller would open the eventfd fd, pass it, the
> cgroup name, controller file to be watched, and the args to
> cgmanager; cgmanager confirms read access, opens the
> controller fd, makes the request over cgroup.event_control,
> then passes the controller fd back to the caller and closes
> its own copy.
>
> I'm also not sure whether the cgroup interface is going to be
> offering a new feature to replace eventfd, since it wants
> people to stop using cgroupfs... Tejun?
>
From my discussions with Tejun, he wanted to move to using inotify so it
may still be an fd we pass around.
> > That's all I can come up with for now.
>
* Re: [lxc-devel] cgroup management daemon
[not found] ` <CAO_RewYmS0fH819BFCr9ozis1132dACgCPwbyb59gM1PafpUkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 16:37 ` Serge E. Hallyn
[not found] ` <20131126163737.GB23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 16:37 UTC (permalink / raw)
To: Tim Hockin
Cc: mhw-UGBql2FAF+1Wk0Htik3J/w, Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> At the start of this discussion, some months ago, we offered to
> co-devel this with Lennart et al. They did not seem keen on the idea.
>
> If they have an established DBUS protocol spec,
see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html
> we should consider
> adopting it instead of a new one, but we CAN'T just play follow the
> leader and do whatever they do, change whenever they feel like
> changing.
Right. And if we suspect that the APIs will always be at least
subtly different, then keeping them obviously visually different
seems to have some benefit. (i.e.
systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
vs
dbus-send cgmanager set-value http.server "cpushares:500 memorylimit:500M swaplimit:1G"
) rather than have admins try to remember "now why did that not work
here, oh yeah, MemoryLimit over here should be Memorylimit" or whatever.
Then again if lmctfy is the layer which admins will use, then it
doesn't matter as much.
> It would be best if we could get a common DBUS api specc'ed and all
> agree to it. Serge, do you feel up to that?
Not sure what you mean - I'll certainly send the API to these lists as
the code is developed, and will accept all feedback that I get. My only
requirements are that the requirements I've listed in the document
be feasible, and be feasible back to, say, 3.2 kernels. So that is
why we must send an scm-cred for the pid to move into a cgroup. (With
3.12 we may have alternatives, accepting a vpid as a simple dbus message
and setns()ing into the requestor's pidns to echo the pid into the
cgroup.tasks file.)
-serge
* Re: cgroup management daemon
[not found] ` <CAD=mX8tCOEO4wP-XGs9YdRufTAay6zPaOxo_wZF=-KoqepH0wg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 16:41 ` Serge E. Hallyn
[not found] ` <20131126164125.GC23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 16:41 UTC (permalink / raw)
To: Victor Marmol
Cc: Serge E. Hallyn, Tim Hockin, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Rohit Jnagal,
Stéphane Graber
Quoting Victor Marmol (vmarmol-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>
> > Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > > What are the requirements/goals around performance and concurrency?
> > > Do you expect this to be a single-threaded thing, or can we handle
> > > some number of concurrent operations? Do you expect to use threads or
> > > processes?
> >
> > The cgmanager should be pretty dumb, so I would expect it to be
> > quite fast. I don't have any specific perf goals though. If you
> > have requirements I'm very interested to hear them. I should be
> > able to tell pretty soon how far short I fall.
> >
> > By default I'd expect to run with a single thread, but I don't
> > imagine one thread can serve a busy 1024-cpu system very well.
> > Unless you have guidance right now, I think I'd like to get
> > started with the basic functionality and see how it measures
> > up to your requirements. I should add perf counters from the
> > start so we can figure out where bottlenecks (if any) are and
> > how to handle them.
> >
> > Otherwise I could start out with a basic numcpus/10 threadpool
> > and have the main thread do socket i/o and parcel access
> > verification and vfs work out to the threadpool, but I'd rather
> > first know where the problems lie.
> >
>
> From Rohit's talk at Linux plumbers:
>
> http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf
>
> The goal is O(1000) reads and O(100) writes per second.
Cool, thanks. I can try and get a sense next week of how far off the
mark I am for reads.
> > > Can you talk about logging - what and where?
> >
> > When started under upstart, anything we print out goes to
> > /var/log/upstart/cgmanager.log. Would be nice to keep it
> > that simple. We could log requests by r to do something
> > it is not allowed to do, but it seems to me the failed
> > attempts cause no harm, while overflowing the logs can.
> >
> > Did you have anything in mind? Did you want logging to help
> > detect certain conditions for system optimization, or just
> > for failure notices and security violations?
> >
> > > How will we handle event_fd? Pass a file-descriptor back to the caller?
> >
> > The only thing currently supporting eventfd is memory threshold,
> > right? I haven't tested whether this will work or not, but
> > ideally the caller would open the eventfd fd, pass it, the
> > cgroup name, controller file to be watched, and the args to
> > cgmanager; cgmanager confirms read access, opens the
> > controller fd, makes the request over cgroup.event_control,
> > then passes the controller fd back to the caller and closes
> > its own copy.
> >
> > I'm also not sure whether the cgroup interface is going to be
> > offering a new feature to replace eventfd, since it wants
> > people to stop using cgroupfs... Tejun?
> >
>
> From my discussions with Tejun, he wanted to move to using inotify so it
> may still be an fd we pass around.
Hm, would that just be inotify on the memory.max_usage_in_bytes
file, or inotify on a specific fd you've created which is
associated with any threshold you specify? The former seems
less ideal.
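The fd-passing step described above comes down to SCM_RIGHTS over the unix socket. A rough sketch of just that mechanism (a pipe stands in for the controller fd cgmanager would open; the helper names are invented, not cgmanager API):

```python
import array
import os
import socket

def send_fd(sock, fd):
    # Attach the descriptor as SCM_RIGHTS ancillary data; the kernel
    # installs a duplicate of it in the receiving process.
    sock.sendmsg([b"F"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                           array.array("i", [fd]))])

def recv_fd(sock):
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(4))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds = array.array("i")
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no descriptor received")

if __name__ == "__main__":
    daemon, caller = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    r, w = os.pipe()        # stand-in for the controller fd cgmanager opens
    send_fd(daemon, r)      # pass the controller fd back to the caller...
    os.close(r)             # ...and close the daemon's own copy
    fd = recv_fd(caller)
    os.write(w, b"event")
    print(os.read(fd, 5).decode())
```

The caller ends up with its own copy of the descriptor, so it keeps working after the daemon closes its side.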
-serge
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131126164125.GC23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 17:19 ` Victor Marmol
[not found] ` <CAD=mX8v-jfA8F5DueK60Oo4Zfcjj86idKYKnDVc9LxQVs9W_rQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Victor Marmol @ 2013-11-26 17:19 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Stéphane Graber, Tim Hockin, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
On Tue, Nov 26, 2013 at 8:41 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Victor Marmol (vmarmol-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> >
> > > Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > > > What are the requirements/goals around performance and concurrency?
> > > > Do you expect this to be a single-threaded thing, or can we handle
> > > > some number of concurrent operations? Do you expect to use threads or
> > > > processes?
> > >
> > > The cgmanager should be pretty dumb, so I would expect it to be
> > > quite fast. I don't have any specific perf goals though. If you
> > > have requirements I'm very interested to hear them. I should be
> > > able to tell pretty soon how far short I fall.
> > >
> > > By default I'd expect to run with a single thread, but I don't
> > > imagine one thread can serve a busy 1024-cpu system very well.
> > > Unless you have guidance right now, I think I'd like to get
> > > started with the basic functionality and see how it measures
> > > up to your requirements. I should add perf counters from the
> > > start so we can figure out where bottlenecks (if any) are and
> > > how to handle them.
> > >
> > > Otherwise I could start out with a basic numcpus/10 threadpool
> > > and have the main thread do socket i/o and parcel access
> > > verification and vfs work out to the threadpool, but I'd rather
> > > first know where the problems lie.
> > >
> >
> > From Rohit's talk at Linux Plumbers:
> >
> > http://www.linuxplumbersconf.net/2013/ocw//system/presentations/1239/original/lmctfy%20(1).pdf
> >
> > The goal is O(1000) reads and O(100) writes per second.
>
> Cool, thanks. I can try and get a sense next week of how far off the
> mark I am for reads.
>
> > > > Can you talk about logging - what and where?
> > >
> > > When started under upstart, anything we print out goes to
> > > /var/log/upstart/cgmanager.log. Would be nice to keep it
> > > that simple. We could log requests by r to do something
> > > it is not allowed to do, but it seems to me the failed
> > > attempts cause no harm, while overflowing the logs can.
> > >
> > > Did you have anything in mind? Did you want logging to help
> > > detect certain conditions for system optimization, or just
> > > for failure notices and security violations?
> > >
> > > > How will we handle event_fd? Pass a file-descriptor back to the
> > > > caller?
> > >
> > > The only thing currently supporting eventfd is memory threshold,
> > > right? I haven't tested whether this will work or not, but
> > > ideally the caller would open the eventfd fd, pass it, the
> > > cgroup name, controller file to be watched, and the args to
> > > cgmanager; cgmanager confirms read access, opens the
> > > controller fd, makes the request over cgroup.event_control,
> > > then passes the controller fd back to the caller and closes
> > > its own copy.
> > >
> > > I'm also not sure whether the cgroup interface is going to be
> > > offering a new feature to replace eventfd, since it wants
> > > people to stop using cgroupfs... Tejun?
> > >
> >
> > From my discussions with Tejun, he wanted to move to using inotify so it
> > may still be an fd we pass around.
>
> Hm, would that just be inotify on the memory.max_usage_in_bytes
> file, or inotify on a specific fd you've created which is
> associated with any threshold you specify? The former seems
> less ideal.
>
Tejun can comment more, but I think it is still TBD.
>
> -serge
>
_______________________________________________
Lxc-devel mailing list
Lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/lxc-devel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131126054718.GA19134-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 20:38 ` Tim Hockin
[not found] ` <CAO_RewZ8cUn-PdXfQF0yH=V=9UqE7Yo1JX2pt2c71WYDrpYE0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 20:38 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> Thanks for this! I think it helps a lot to discuss now, rather than
>> over nearly-done code.
>>
>> On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Additionally, Tejun has specified that we do not want users to be
>> > too closely tied to the cgroupfs implementation. Therefore
>> > commands will be just a hair more general than specifying cgroupfs
>> > filenames and values. I may go so far as to avoid specifying
>> > specific controllers, as AFAIK there should be no redundancy in
>> > features. On the other hand, I don't want to get too general.
>> > So I'm basing the API loosely on the lmctfy command line API.
>>
>> I'm torn here. While I agree in principle with Tejun, I am concerned
>> that this agent will always lag new kernel features or that the thin
>> abstraction you want to provide here does not easily accommodate some
>> of the more ... oddball features of one cgroup interface or another.
>>
>> This agent is the very bottom of the stack, and should probably not do
>> much by way of abstraction. I think I'd rather let something like
>> lmctfy provide the abstraction more holistically, and relegate this
>
> If lmctfy is an abstraction layer that should keep Tejun happy, and
> it could keep me out of the resource naming game which makes me happy :)
>
>> agent to very simple plumbing and policy. It could be as simple as
>> providing read/write/etc ops to specific control files. It needs to
>> handle event_fd, too, I guess. This has the nice side-effect of
>> always being "current" on kernel features :)
>>
>> > Summary
>> >
>> > Each 'host' (identified by a separate instance of the linux kernel) will
>> > have exactly one running daemon to manage control groups. This daemon
>> > will answer cgroup management requests over a dbus socket, located at
>> > /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>> > containers, so that one daemon can support the whole system.
>> >
>> > Programs will be able to make cgroup requests using dbus calls, or
>> > indirectly by linking against lmctfy which will be modified to use the
>> > dbus calls if available.
>> >
>> > Outline:
>> > . A single manager, cgmanager, is started on the host, very early
>> > during boot. It has very few dependencies, and requires only
>> > /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>> > the cgroup hierarchies in a private namespace and set defaults
>> > (clone_children, use_hierarchy, sane_behavior, release_agent?) It
>> > will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>
>> Where does the config come from? How do I specify which hierarchies I
>> want and where, and which flags?
>
> That'll have to be in a file in /etc (which can be mounted readonly).
> There should be no surprises there so I've not thought about the format.
>
>> > . A client (requestor 'r') can make cgroup requests over
>> > /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>> > requirements for r are listed below.
>> > . The client request will pertain to an existing or new cgroup A. r's
>> > privilege over the cgroup must be checked. r is said to have
>> > privilege over A if A is owned by r's uid, or if A's owner is mapped
>> > into r's user namespace, and r is root in that user namespace.
>>
>> Problem with this definition. Being owned-by is not the same as
>> has-root-in. Specifically, I may choose to give you root in your own
>> namespace, but you sure as heck can not increase your own memory
>> limit.
>
> 1. If you don't want me to change the value at all, then just don't map
> A's owner into the namespace. I'm uid 100000 which is root in my namespace,
> but I only have privilege over other uids mapped into my namespace.
I think I understand this, but it is subtle. Maybe some examples would help?
> 2. I've considered never allowing changes to your own cgroup. So if you're
> in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
> b's. OTOH, that isn't strictly necessary - if we did allow it, then you
> could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
> so I can't escape the memory limit you wanted.
This is different from what we do internally, but it's an interesting
semantic. I'm wary of how much we want to make this API about
enforcement of policy vs simple enactment. In other words, semantics
that diverge from UNIX ownership might be more complicated to
understand than they are worth.
> 3. I've not considered having the daemon track resource limits - i.e. creating
> a cgroup and saying "give it 100M swap, and if it asks, let it increase that
> to 200M." I'd prefer that be done incidentally through (1) and (2). Do you
> feel that would be insufficient?
I think this is a higher-level issue that should not be addressed here.
> Or maybe your question is something different and I'm missing it?
My point was that I, as machine admin, create a memory cgroup of 100
MB for you and put you in it. I also give you root-in-namespace.
You must not be able to change 100 MB to 200 MB. From your (1) you
are saying that system UID 0 owns the cgroup and is NOT mapped into
your namespace. Therefore your definition holds. I think I can buy
that.
>> > . The client request may pertain to a victim task v, which may be moved
>> > to a new cgroup. In that case r's privilege over both the cgroup
>> > and v must be checked. r is said to have privilege over v if v
>> > is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>> > and r is root in its userns. Or if r and v have the same uid
>> > and v is mapped in r's pid namespace.
>> > . r's credentials will be taken from socket's peercred, ensuring that
>> > pid and uid are translated.
>> > . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
>> > translated global pid. It will then read UID(v) from /proc/PID(v)/status,
>> > which is the global uid, and check /proc/PID(r)/uid_map to see whether
>> > UID is mapped there.
>> > . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>> > the kernel translate it for the reader. Only 'move task v to cgroup
>> > A' will require a SCM_CREDENTIAL to be sent.
>> >
>> > Privilege requirements by action:
>> > * Requestor of an action (r) over a socket may only make
>> > changes to cgroups over which it has privilege.
>> > * Requestors may be limited to a certain #/depth of cgroups
>> > (to limit memory usage) - DEFER?
>> > * Cgroup hierarchy is responsible for resource limits
>> > * A requestor must either be uid 0 in its userns with victim mapped
>> > into its userns, or the same uid and in same/ancestor pidns as the
>> > victim
>> > * If r requests creation of cgroup '/x', /x will be interpreted
>> > as relative to r's cgroup. r cannot make changes to cgroups not
>> > under its own current cgroup.
>>
>> Does this imply that r in a lower-level (farther from root) of the
>> hierarchy can not make requests of higher levels of the hierarchy
>> (closer to root), even though they have permissions as per the
>> definition of privilege?
>
> Right.
Is this really a required semantic? We have use cases where
read-access is required to parent cgroups, which means this agent
could never handle reads. It's not clear that we have use cases for
write-access to parents, though we have talked about eventfd - is that
read or write access? Does this daemon want to handle eventfd?
>> How do we reconcile this pseudo-virtualization with /proc/self/cgroup
>> which DOES expose raw paths?
>
> We <shrug> :)
>
> Just as /proc/cpuinfo isn't updated depending on your cpuset. If you
> want to know the true depth, it's not my goal to fool you.
That's a fair answer.
>
>> > * If r is not in the initial user_ns, then it may not change settings
>> > in its own cgroup, only descendants. (Not strictly necessary -
>> > we could require the use of extra cgroups when wanted, as lxc does
>> > currently)
>> > * If r requests creation of cgroup '/x', it must have write access
>> > to its own cgroup (not strictly necessary)
>>
>> Can you explain what you mean by "not strictly necessary" - is this
>> part of the requirement space or not?
>
> Not sure why I put that there. Let me state it more generally - if r wants
> to create /a/b/c (which is relative to his own current cgroup), then r
> must have write access under /a/b. Whether he must have write access to his
> /, that I'm not sure about.
As above, I think following UNIX perms is the most sane thing we can
do. I presume that everywhere you say "is owned by" and "has access
to" in this doc you mean strictly through UNIX perms?
>> > * If r requests chown of cgroup /x to uid Y, Y is passed in a
>> > ucred over the unix socket, and therefore translated to init
>> > userns.
>>
>> I though only UID 0 could specify a UID other than the real UID? Have
>> I misunderstood that?
>
> UID 0 in a child user ns should be able to pass in any uid in his own
> namespace.
And non-0 UIDs in any namespace should not be able to operate across
UIDs. Got it.
>> > * if r requests setting a limit under /x, then
>> > . either r must be root in its own userns, and UID(/x) be mapped
>> > into its userns, or else UID(r) == UID(/x)
>> > . /x must not be / (not strictly necessary, all users know to
>> > ensure an extra cgroup layer above '/')
>>
>> I don't understand this point
>
> The point is to ensure that the in-kernel cgroup hierarchy support enforces
> that r can't escape his limits. So if I create a container and I want it
> to have memory {limit: 500M}, then either I can create /a/b, put the
> memory limit on /a/b, and put r into /a/b/c; or I can put r right into
> /a/b and not let r modify /a/b's settings.
>
>> > . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>> > which won't be satisfied. Therefore we'll need to do privilege
>> > checks ourselves, then perform the write as the host root user.
>> > (see devices.allow/deny). Further we need to support older kernels
>> > which don't support setns for pid.
>> > * If r requests action on victim V, it passes V's pid in a ucred,
>> > so that gets translated.
>> > Daemon will verify that V's uid is mapped into r's userns. Since
>> > r is either root or the same uid as V, it is allowed to classify.
>> >
>> > The above addresses
>> > * creating cgroups
>> > * chowning cgroups
>> > * setting cgroup limits
>> > * moving tasks into cgroups
>> > . but does not address a 'cgexec <group> -- command' type of behavior.
>> > * To handle that (specifically for upstart), recommend that r do:
>> > if (!pid) {
>> > request_reclassify(cgroup, getpid());
>> > do_execve();
>> > }
>>
>> If I follow, you're saying that the caller does the fork/exec and all
>> this daemon does is munge cgroups for the calling PID? If so, I
>> agree, I think.
>
> Right. (Difference with the unfortunate libcgroup race conditions
> being that in this case we have the caller's cooperation :)
>
>> > . alternatively, the daemon could, if kernel is new enough, setns to
>> > the requestor's namespaces to execute a command in a new cgroup.
>> > The new command would be daemonized to that pid namespaces' pid 1.
>> >
>> > Types of requests:
>> > * r requests creating cgroup A'/A
>> > . lmctfy/cli/commands/create.cc
>> > . Verify that UID(r) mapped to 0 in r's userns
>> > . R=cgroup_of(r)
>> > . Verify that UID(R) is mapped into r's userns
>> > . Create R/A'/A
>> > . chown R/A'/A to UID(r)
>> > * r requests to move task x to cgroup A.
>> > . lmctfy/cli/commands/enter.cc
>> > . r must send PID(x) as ancillary message
>> > . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>> > that userns
>> > (is it safe to allow if UID(x) == UID(r))?
>> > . R=cgroup_of(r)
>> > . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
>> > . echo PID(x) >> /R/A/tasks
>> > * r requests chown of cgroup A to uid X
>> > . X is passed in ancillary message
>> > * ensures it is valid in r's userns
>> > * maps the userid to host for us
>> > . Verify that UID(r) mapped to 0 in r's userns
>> > . R=cgroup_of(r)
>> > . Chown R/A to X
>> > * r requests cgroup A's 'property=value'
>> > . Verify that either
>> > * A != ''
>> > * UID(r) == 0 on host
>> > In other words, r in a userns may not set root cgroup settings.
>> > . Verify that UID(r) mapped to 0 in r's userns
>> > . R=cgroup_of(r)
>> > . Set property=value for R/A
>> > * Expect kernel to guarantee hierarchical constraints
>> > * r requests deletion of cgroup A
>> > . lmctfy/cli/commands/destroy.cc (without -f)
>> > . same requirements as setting 'property=value'
>> > * r requests purge of cgroup A
>> > . lmctfy/cli/commands/destroy.cc (with -f)
>> > . same requirements as setting 'property=value'
>> >
>> > Long-term we will want the cgroup manager to become more intelligent -
>> > to place its own limits on clients, to address cpu and device hotplug,
>> > etc. Since we will not be doing that in the first prototype, the daemon
>> > will not keep any state about the clients.
>> >
>> > Client DBus Message API
>> >
>> > <name>: a-zA-Z0-9
>> > <name>: "a-zA-Z0-9 "
>> > <controllerlist>: <controller1>[:controllerlist]
>> > <valueentry>: key:value
>> > <valueentry>: frozen
>> > <valueentry>: thawed
>> > <values>: valueentry[:values]
>> > keys:
>> > {memory,swap}.{limit,soft_limit}
>> > cpus_allowed # set of allowed cpus
>> > cpus_fraction # % of allowed cpus
>> > cpus_number # number of allowed cpus
>> > cpu_share_percent # percent of cpushare
>> > devices_whitelist
>> > devices_blacklist
>> > net_prio_index
>> > net_prio_interface_map
>> > net_classid
>> > hugetlb_limit
>> > blkio_weight
>> > blkio_weight_device
>> > blkio_throttle_{read,write}
>> > readkeys:
>> > devices_list
>> > {memory,swap}.{failcnt,max_usage,limit,numa_stat}
>> > hugetlb_max_usage
>> > hugetlb_usage
>> > hugetlb_failcnt
>> > cpuacct_stat
>> > <etc>
>> > Commands:
>> > ListControllers
>> > Create <name> <controllerlist> <values>
>> > Setvalue <name> <values>
>> > Getvalue <name> <readkeys>
>> > ListChildren <name>
>> > ListTasks <name>
>> > ListControllers <name>
>> > Chown <name> <uid>
>> > Chown <name> <uid>:<gid>
>> > Move <pid> <name> [[ pid is sent as a SCM_CREDENTIAL ]]
>> > Delete <name>
>> > Delete-force <name>
>> > Kill <name>
>
> Will address the rest tomorrow. Thanks for reviewing!
>
>> What are the requirements/goals around performance and concurrency?
>> Do you expect this to be a single-threaded thing, or can we handle
>> some number of concurrent operations? Do you expect to use threads or
>> processes?
>>
>> Can you talk about logging - what and where?
>>
>> How will we handle event_fd? Pass a file-descriptor back to the caller?
>>
>> That's all I can come up with for now.
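The "r requests creating cgroup A'/A" steps quoted above (verify root-in-userns, verify the owner mapping, interpret the path relative to r's own cgroup, then mkdir and chown) can be condensed into a short sketch. This is purely illustrative Python with simplified inputs; none of the names are real cgmanager code, and the actual mkdir/chown against cgroupfs is left commented out:

```python
import posixpath

def safe_join(base, rel):
    # Interpret a requested cgroup path relative to the requestor's own
    # cgroup, refusing any attempt to escape it via '..' components.
    joined = posixpath.normpath(posixpath.join(base, rel.lstrip("/")))
    if joined != base and not joined.startswith(base + "/"):
        raise PermissionError("path escapes requestor's cgroup: %r" % rel)
    return joined

def handle_create(r_cgroup, requested, r_uid, is_root_in_userns, owner_mapped):
    # Sketch of the create checks: r must be root in its userns, the owner
    # of r's own cgroup must be mapped into that userns, and the new path
    # is always taken relative to r's cgroup.
    if not is_root_in_userns:
        raise PermissionError("requestor is not root in its user namespace")
    if not owner_mapped:
        raise PermissionError("cgroup owner not mapped into requestor's userns")
    target = safe_join(r_cgroup, requested)
    # os.makedirs(target)            # real daemon: mkdir under cgroupfs
    # os.chown(target, r_uid, -1)    # then chown R/A'/A to UID(r)
    return target
```

The relative-path rule means a request for '/x' from a requestor in /lxc/c1 resolves to /lxc/c1/x, so a client can never name a cgroup above its own.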
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131126161246.GA23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 16:22 ` Victor Marmol
@ 2013-11-26 20:45 ` Tim Hockin
1 sibling, 0 replies; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 20:45 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
On Tue, Nov 26, 2013 at 8:12 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> What are the requirements/goals around performance and concurrency?
>> Do you expect this to be a single-threaded thing, or can we handle
>> some number of concurrent operations? Do you expect to use threads or
>> processes?
>
> The cgmanager should be pretty dumb, so I would expect it to be
> quite fast. I don't have any specific perf goals though. If you
> have requirements I'm very interested to hear them. I should be
> able to tell pretty soon how far short I fall.
If we're limiting this to write traffic only, I think our perf goals
are fairly relaxed. As long as you don't develop it to preclude
threading or multi-processing, we can adapt later. I would like to
see at least a mention to this effect. We also need to beware DoS
(accidental or otherwise) - perhaps we should force round-robin
service of pending-requests, or something.
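The round-robin idea sketched here is just a per-client queue rotation, so one chatty client can't starve everyone else. Illustrative only (not cgmanager code; names are made up):

```python
from collections import OrderedDict, deque

class RoundRobinQueue:
    """Serve pending requests one per client in turn."""

    def __init__(self):
        self.queues = OrderedDict()   # client id -> deque of requests

    def submit(self, client, request):
        self.queues.setdefault(client, deque()).append(request)

    def next(self):
        # Pop one request from the client at the head of the rotation,
        # then move that client to the back; drop clients with empty queues.
        while self.queues:
            client, q = next(iter(self.queues.items()))
            self.queues.move_to_end(client)
            if q:
                return client, q.popleft()
            del self.queues[client]
        return None
```

With this policy a client that floods the socket only delays its own later requests, not other clients' first ones.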
> By default I'd expect to run with a single thread, but I don't
> imagine one thread can serve a busy 1024-cpu system very well.
> Unless you have guidance right now, I think I'd like to get
> started with the basic functionality and see how it measures
> up to your requirements. I should add perf counters from the
> start so we can figure out where bottlenecks (if any) are and
> how to handle them.
>
> Otherwise I could start out with a basic numcpus/10 threadpool
> and have the main thread do socket i/o and parcel access
> verification and vfs work out to the threadpool, but I'd rather
> first know where the problems lie.
Agree. Correct first, then fast :)
>> Can you talk about logging - what and where?
>
> When started under upstart, anything we print out goes to
> /var/log/upstart/cgmanager.log. Would be nice to keep it
> that simple. We could log requests by r to do something
> it is not allowed to do, but it seems to me the failed
> attempts cause no harm, while overflowing the logs can.
I agree that we don't want to overflow logs.
> Did you have anything in mind? Did you want logging to help
> detect certain conditions for system optimization, or just
> for failure notices and security violations?
When something goes amiss, we have to try to figure out what happened -
how far did a request get? Logging every change is probably
important. Logging failures could be downsampled and rate-limited,
something like 1 failure log per second or something.
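That downsampling policy - log every change, cap failure lines at roughly one per second, and note how many were suppressed - could look like this (hypothetical helper, with an injectable clock so it can be tested without sleeping):

```python
import time

class RateLimitedLogger:
    """Log every state change; rate-limit failure logs to `rate`/second."""

    def __init__(self, rate=1.0, clock=time.monotonic):
        self.min_interval = 1.0 / rate
        self.clock = clock
        self.last_failure = None
        self.suppressed = 0
        self.lines = []               # stand-in for the real log sink

    def change(self, msg):
        # Changes are always logged in full.
        self.lines.append("CHANGE " + msg)

    def failure(self, msg):
        now = self.clock()
        if (self.last_failure is not None
                and now - self.last_failure < self.min_interval):
            self.suppressed += 1      # drop it, but keep count
            return
        if self.suppressed:
            self.lines.append("FAIL (%d similar suppressed)" % self.suppressed)
            self.suppressed = 0
        self.last_failure = now
        self.lines.append("FAIL " + msg)
```

A burst of denied requests then costs one log line per second plus a summary, instead of one line per attempt.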
>> How will we handle event_fd? Pass a file-descriptor back to the caller?
>
> The only thing currently supporting eventfd is memory threshold,
> right? I haven't tested whether this will work or not, but
> ideally the caller would open the eventfd fd, pass it, the
> cgroup name, controller file to be watched, and the args to
> cgmanager; cgmanager confirms read access, opens the
> controller fd, makes the request over cgroup.event_control,
> then passes the controller fd back to the caller and closes
> its own copy.
>
> I'm also not sure whether the cgroup interface is going to be
> offering a new feature to replace eventfd, since it wants
> people to stop using cgroupfs... Tejun?
>
>> That's all I can come up with for now.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [lxc-devel] cgroup management daemon
[not found] ` <20131126163737.GB23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 20:49 ` Tim Hockin
0 siblings, 0 replies; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 20:49 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: mhw-UGBql2FAF+1Wk0Htik3J/w, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
On Tue, Nov 26, 2013 at 8:37 AM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> At the start of this discussion, some months ago, we offered to
>> co-devel this with Lennart et al. They did not seem keen on the idea.
>>
>> If they have an established DBUS protocol spec,
>
> see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/
> and http://man7.org/linux/man-pages/man5/systemd.cgroup.5.html
>
>> we should consider
>> adopting it instead of a new one, but we CAN'T just play follow the
>> leader and do whatever they do, change whenever they feel like
>> changing.
>
> Right. And if we suspect that the APIs will always be at least
> subtly different, then keeping them obviously visually different
> seems to have some benefit. (i.e.
> systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M
> vs
> dbus-send cgmanager set-value http.server "cpushares:500 memorylimit:500M swaplimit:1G"
> ) rather than have admins try to remember "now why did that not work
> here, oh yeah, MemoryLimit over here should be Memorylimit" or whatever.
>
> Then again if lmctfy is the layer which admins will use, then it
> doesn't matter as much.
>
>> It would be best if we could get a common DBUS api specc'ed and all
>> agree to it. Serge, do you feel up to that?
>
> Not sure what you mean - I'll certainly send the API to these lists as
What I meant was whether it is worth opening a discussion with the
systemd folks on a common lowest-level DBUS interface. But it looks
like their work is already a bit higher level, so it's probably moot.
> the code is developed, and will accept all feedback that I get. My only
> requirements are that the requirements I've listed in the document
> be feasible, and be feasible back to, say, 3.2 kernels. So that is
> why we must send an scm-cred for the pid to move into a cgroup. (With
> 3.12 we may have alternatives, accepting a vpid as a simple dbus message
> and setns()ing into the requestor's pidns to echo the pid into the
> cgroup.tasks file.)
>
> -serge
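The scm-cred mechanism referred to above uses the real SCM_CREDENTIALS ancillary message: the sender attaches a pid/uid/gid triple, the kernel verifies it, and the receiver (with SO_PASSCRED set) sees the values translated into its own namespaces. A sketch, with invented helper names:

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t, uid_t, gid_t -- three native ints
UCred = struct.Struct("3i")

def send_pid(sock, pid):
    # Attach pid as SCM_CREDENTIALS; unprivileged senders may only pass
    # their own pid/uid/gid, which the kernel checks at send time.
    cred = UCred.pack(pid, os.getuid(), os.getgid())
    sock.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS, cred)])

def recv_pid(sock):
    # Receiver must opt in with SO_PASSCRED to see the credentials.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_SPACE(UCred.size))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return UCred.unpack(data[:UCred.size])
    raise RuntimeError("no credentials received")
```

Run across pid namespaces, the pid the daemon reads back is the sender's pid as seen from the daemon's namespace, which is exactly the translation the document relies on.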
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAO_RewZ8cUn-PdXfQF0yH=V=9UqE7Yo1JX2pt2c71WYDrpYE0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 20:58 ` Serge E. Hallyn
[not found] ` <20131126205819.GA27266-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 20:58 UTC (permalink / raw)
To: Tim Hockin
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
...
> >> > . A client (requestor 'r') can make cgroup requests over
> >> > /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> >> > requirements for r are listed below.
> >> > . The client request will pertain to an existing or new cgroup A. r's
> >> > privilege over the cgroup must be checked. r is said to have
> >> > privilege over A if A is owned by r's uid, or if A's owner is mapped
> >> > into r's user namespace, and r is root in that user namespace.
> >>
> >> Problem with this definition. Being owned-by is not the same as
> >> has-root-in. Specifically, I may choose to give you root in your own
> >> namespace, but you sure as heck can not increase your own memory
> >> limit.
> >
> > 1. If you don't want me to change the value at all, then just don't map
> > A's owner into the namespace. I'm uid 100000 which is root in my namespace,
> > but I only have privilege over other uids mapped into my namespace.
>
> I think I understand this, but it is subtle. Maybe some examples would help?
When you create a user namespace, at first it is empty, and you are 'nobody'
(-1). Then magically some uids from the host, say 100000-101999, are mapped
into your namespace, to uids 0-1999.
Now assume you're uid 0 inside that namespace. You have privilege over your
uids, 0-1999, which are 100000-101999 on the host.
If cgroup file A is owned by host uid 0, then the owner is not mapped into
the user namespace. uid 0 inside the namespace only gets the world access
rights to that file.
If cgroup file A is owned by host uid 100100, then uid 0 in the
namespace has access to that file by virtue of being root, and uid 100
in the namespace (100100 on the host) has access to the file by virtue
of being the owner.
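The check in this example maps directly onto /proc/&lt;pid&gt;/uid_map, whose lines are "ns-start host-start count". A small sketch of the ownership test (helper names invented; the real daemon reads the map from /proc):

```python
def parse_uid_map(text):
    # Each /proc/<pid>/uid_map line: ns-start host-start count
    return [tuple(int(x) for x in line.split())
            for line in text.splitlines() if line.strip()]

def host_to_ns(uid_map, host_uid):
    # Translate a host uid into the namespace, or None if unmapped.
    for ns_start, host_start, count in uid_map:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None

def has_privilege(r_ns_uid, uid_map, file_host_uid):
    # r (root in its userns) has privilege over a cgroup file iff the
    # file's host owner is mapped into r's user namespace at all.
    return r_ns_uid == 0 and host_to_ns(uid_map, file_host_uid) is not None
```

With the map "0 100000 2000" from the example, host uid 100100 translates to ns uid 100 and is fair game for ns-root, while host uid 0 is unmapped and therefore untouchable.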
> > 2. I've considered never allowing changes to your own cgroup. So if you're
> > in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
> > b's. OTOH, that isn't strictly necessary - if we did allow it, then you
> > could simply clam /a/b's memory to what you want, and stick me in /a/b/c,
> > so I can't escape the memory limit you wanted.
>
> This is different from what we do internally, but it's an interesting
> semantic. I'm wary of how much we want to make this API about
> enforcement of policy vs simple enactment. In other words, semantics
> that diverge from UNIX ownership might be more complicated to
> understand than they are worth.
The semantics I gave are exactly the user namespace semantics. If you're
not using a user namespace then they simply do not apply, and you are back
to strict UNIX ownership semantics that you want. But allowing 'root' in
a user namespace to have privilege over uids, without having any privilege
outside its own namespace, must be honored for this to be usable by lxc.
Like I said, on the bright side, if you don't want to care about user
namespaces, then everything falls back to strict unix semantics - so if
you don't want to care, you don't have to care.
> > 3. I've not considered having the daemon track resource limits - i.e. creating
> > a cgroup and saying "give it 100M swap, and if it asks, let it increase that
> > to 200M." I'd prefer that be done incidentally through (1) and (2). Do you
> > feel that would be insufficient?
>
> I think this is a higher-level issue that should not be addressed here.
>
> > Or maybe your question is something different and I'm missing it?
>
> My point was that I, as machine admin, create a memory cgroup of 100
> MB for you and put you in it. I also give you root-in-namespace.
> You must not be able to change 100 MB to 200 MB. From your (1) you
> are saying that system UID 0 owns the cgroup and is NOT mapped into
> your namespace. Therefore your definition holds. I think I can buy
> that.
>
> >> > . The client request may pertain to a victim task v, which may be moved
> >> > to a new cgroup. In that case r's privilege over both the cgroup
> >> > and v must be checked. r is said to have privilege over v if v
> >> > is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> >> > and r is root in its userns. Or if r and v have the same uid
> >> > and v is mapped in r's pid namespace.
> >> > . r's credentials will be taken from socket's peercred, ensuring that
> >> > pid and uid are translated.
> >> > . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
> >> > translated global pid. It will then read UID(v) from /proc/PID(v)/status,
> >> > which is the global uid, and check /proc/PID(r)/uid_map to see whether
> >> > UID is mapped there.
> >> > . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
> >> > the kernel translate it for the reader. Only 'move task v to cgroup
> >> > A' will require a SCM_CREDENTIAL to be sent.
> >> >
> >> > Privilege requirements by action:
> >> > * Requestor of an action (r) over a socket may only make
> >> > changes to cgroups over which it has privilege.
> >> > * Requestors may be limited to a certain #/depth of cgroups
> >> > (to limit memory usage) - DEFER?
> >> > * Cgroup hierarchy is responsible for resource limits
> >> > * A requestor must either be uid 0 in its userns with victim mapped
> >> > into its userns, or the same uid and in same/ancestor pidns as the
> >> > victim
> >> > * If r requests creation of cgroup '/x', /x will be interpreted
> >> > as relative to r's cgroup. r cannot make changes to cgroups not
> >> > under its own current cgroup.
> >>
> >> Does this imply that r in a lower-level (farther from root) of the
> >> hierarchy can not make requests of higher levels of the hierarchy
> >> (closer to root), even though they have permissions as per the
> >> definition of privilege?
> >
> > Right.
>
> Is this really a required semantic? We have use cases where
> read-access is required to parent cgroups, which means this agent
> could never handle reads. It's not clear that we have use cases for
> write-access to parents, though we have talked about eventfd - is that
> read or write access? Does this daemon want to handle eventfd?
Denying read access to parent cgroups is not strictly necessary to meet
any of my requirements. Eventfd only requires an open read handle to
the file, so that should be ok.
So to support that, I guess I'd want to add a 'get-my-cgroup'
command with a controller argument, which returns the absolute
path. Cgroups which start with a '/' are taken as absolute
cgroup paths, as opposed to the usual relative-to-my-own.
It sounds like you might also want to just use '../'?
I'd refuse write access for now altogether. We can talk later, if
someone finds a need, about a way to support conditional write
access, but that's pretty much completely bypassing the hierarchical
constraints :)
-serge
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131126205819.GA27266-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 21:24 ` Tim Hockin
[not found] ` <CAO_RewZh+dNkUdZdu-R3CKTvYzbPL50v-BsBHvek75ti3V6kZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-11-26 21:24 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
lmctfy literally supports ".." as a container name :)
On Tue, Nov 26, 2013 at 12:58 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> On Mon, Nov 25, 2013 at 9:47 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>> > Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> ...
>> >> > . A client (requestor 'r') can make cgroup requests over
>> >> > /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>> >> > requirements for r are listed below.
>> >> > . The client request will pertain to an existing or new cgroup A. r's
>> >> > privilege over the cgroup must be checked. r is said to have
>> >> > privilege over A if A is owned by r's uid, or if A's owner is mapped
>> >> > into r's user namespace, and r is root in that user namespace.
>> >>
>> >> Problem with this definition. Being owned-by is not the same as
>> >> has-root-in. Specifically, I may choose to give you root in your own
>> >> namespace, but you sure as heck can not increase your own memory
>> >> limit.
>> >
>> > 1. If you don't want me to change the value at all, then just don't map
>> > A's owner into the namespace. I'm uid 100000 which is root in my namespace,
>> > but I only have privilege over other uids mapped into my namespace.
>>
>> I think I understand this, but it is subtle. Maybe some examples would help?
>
> When you create a user namespace, at first it is empty, and you are 'nobody'
> (-1). Then magically some uids from the host, say 100000-101999, are mapped
> into your namespace, to uids 0-1999.
>
> Now assume you're uid 0 inside that namespace. You have privilege over your
> uids, 0-1999, which are 100000-101999 on the host.
>
> If cgroup file A is owned by host uid 0, then the owner is not mapped into
> the user namespace. uid 0 inside the namespace only gets the world access
> rights to that file.
>
> If cgroup file A is owned by host uid 100100, then uid 0 in the
> namespace has access to that file by virtue of being root, and uid 100
> in the namespace (100100 on the host) has access to the file by virtue
> of being the owner.
>
>> > 2. I've considered never allowing changes to your own cgroup. So if you're
>> > in /a/b, you can create /a/b/c and modify c's settings, but you can't modify
>> > b's. OTOH, that isn't strictly necessary - if we did allow it, then you
>> > could simply clamp /a/b's memory to what you want, and stick me in /a/b/c,
>> > so I can't escape the memory limit you wanted.
>>
>> This is different from what we do internally, but it's an interesting
>> semantic. I'm wary of how much we want to make this API about
>> enforcement of policy vs simple enactment. In other words, semantics
>> that diverge from UNIX ownership might be more complicated to
>> understand than they are worth.
>
> The semantics I gave are exactly the user namespace semantics. If you're
> not using a user namespace then they simply do not apply, and you are back
> to strict UNIX ownership semantics that you want. But allowing 'root' in
> a user namespace to have privilege over uids, without having any privilege
> outside its own namespace, must be honored for this to be usable by lxc.
>
> Like I said, on the bright side, if you don't want to care about user
> namespaces, then everything falls back to strict unix semantics - so if
> you don't want to care, you don't have to care.
>
>> > 3. I've not considered having the daemon track resource limits - i.e. creating
>> > a cgroup and saying "give it 100M swap, and if it asks, let it increase that
>> > to 200M." I'd prefer that be done incidentally through (1) and (2). Do you
>> > feel that would be insufficient?
>>
>> I think this is a higher-level issue that should not be addressed here.
>>
>> > Or maybe your question is something different and I'm missing it?
>>
>> My point was that I, as machine admin, create a memory cgroup of 100
>> MB for you and put you in it. I also give you root-in-namespace.
>> You must not be able to change 100 MB to 200 MB. From your (1) you
>> are saying that system UID 0 owns the cgroup and is NOT mapped into
>> your namespace. Therefore your definition holds. I think I can buy
>> that.
>>
>> >> > . The client request may pertain to a victim task v, which may be moved
>> >> > to a new cgroup. In that case r's privilege over both the cgroup
>> >> > and v must be checked. r is said to have privilege over v if v
>> >> > is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>> >> > and r is root in its userns. Or if r and v have the same uid
>> >> > and v is mapped in r's pid namespace.
>> >> > . r's credentials will be taken from socket's peercred, ensuring that
>> >> > pid and uid are translated.
>> >> > . r passes PID(v) as a SCM_CREDENTIAL, so that cgmanager receives the
>> >> > translated global pid. It will then read UID(v) from /proc/PID(v)/status,
>> >> > which is the global uid, and check /proc/PID(r)/uid_map to see whether
>> >> > UID is mapped there.
>> >> > . dbus-send can be enhanced to send a pid as SCM_CREDENTIAL to have
>> >> > the kernel translate it for the reader. Only 'move task v to cgroup
>> >> > A' will require a SCM_CREDENTIAL to be sent.
>> >> >
>> >> > Privilege requirements by action:
>> >> > * Requestor of an action (r) over a socket may only make
>> >> > changes to cgroups over which it has privilege.
>> >> > * Requestors may be limited to a certain #/depth of cgroups
>> >> > (to limit memory usage) - DEFER?
>> >> > * Cgroup hierarchy is responsible for resource limits
>> >> > * A requestor must either be uid 0 in its userns with victim mapped
>> >> > into its userns, or the same uid and in same/ancestor pidns as the
>> >> > victim
>> >> > * If r requests creation of cgroup '/x', /x will be interpreted
>> >> > as relative to r's cgroup. r cannot make changes to cgroups not
>> >> > under its own current cgroup.
>> >>
>> >> Does this imply that r in a lower-level (farther from root) of the
>> >> hierarchy can not make requests of higher levels of the hierarchy
>> >> (closer to root), even though they have permissions as per the
>> >> definition of privilege?
>> >
>> > Right.
>>
>> Is this really a required semantic? We have use cases where
>> read-access is required to parent cgroups, which means this agent
>> could never handle reads. It's not clear that we have use cases for
>> write-access to parents, though we have talked about eventfd - is that
>> read or write access? Does this daemon want to handle eventfd?
>
> Denying read access to parent cgroups is not strictly necessary to meet
> any of my requirements. Eventfd only requires an open read handle to
> the file, so that should be ok.
>
> So to support that, I guess I'd want to add a 'get-my-cgroup'
> command with a controller argument, which returns the absolute
> path. Cgroups which start with a '/' are taken as absolute
> cgroup paths, as opposed to the usual relative-to-my-own.
> It sounds like you might also want to just use '../'?
>
> I'd refuse write access for now altogether. We can talk later, if
> someone finds a need, about a way to support conditional write
> access, but that's pretty much completely bypassing the hierarchical
> constraints :)
>
> -serge
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAO_RewZh+dNkUdZdu-R3CKTvYzbPL50v-BsBHvek75ti3V6kZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-26 21:28 ` Serge E. Hallyn
[not found] ` <20131126212814.GA27602-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-26 21:28 UTC (permalink / raw)
To: Tim Hockin
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> lmctfy literally supports ".." as a container name :)
So is ../.. ever used, or does no one ever do anything beyond ..?
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131126212814.GA27602-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
@ 2013-11-26 21:31 ` Victor Marmol
[not found] ` <CAD=mX8uuAeN7s8ZA6Gc-wsBd6-PHevBRyBL6hMAS9VW15T5eYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Victor Marmol @ 2013-11-26 21:31 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: Stéphane Graber, Tim Hockin, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA
I think most of our usecases have only wanted to know about the parent, but
I can see people wanting to go further. Would it be much different to
support both? I feel like it'll be simpler to support all if we go that
route.
On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > lmctfy literally supports ".." as a container name :)
>
> So is ../.. ever used, or does no one ever do anything beyond ..?
>
_______________________________________________
Lxc-devel mailing list
Lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
https://lists.sourceforge.net/lists/listinfo/lxc-devel
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAD=mX8uuAeN7s8ZA6Gc-wsBd6-PHevBRyBL6hMAS9VW15T5eYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-27 1:49 ` Tim Hockin
[not found] ` <CAO_RewY0eFTgkVqbRJwdW9bgR3nz9h5t6c823wFH5cg1CD0sEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-11-27 1:49 UTC (permalink / raw)
To: Victor Marmol
Cc: Serge E. Hallyn, Tejun Heo,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Rohit Jnagal,
Stéphane Graber
I see three models:
1) Don't "virtualize" the cgroup path. This is what lmctfy does,
though we have discussed changing to:
2) Virtualize to an "administrative root" - I get to tell you where
your root is, and you can't see anything higher than that.
3) Virtualize to CWD root - you can never go up, just down.
#1 seems easy, but exposes a lot. #3 is restrictive and fairly easy -
could we live with that? #2 seems ideal, but it's not clear to me how
to actually implement it.
On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol <vmarmol-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> I think most of our usecases have only wanted to know about the parent, but
> I can see people wanting to go further. Would it be much different to
> support both? I feel like it'll be simpler to support all if we go that
> route.
>
>
> On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
>>
>> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
>> > lmctfy literally supports ".." as a container name :)
>>
>> So is ../.. ever used, or does no one ever do anything beyond ..?
>
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAO_RewY0eFTgkVqbRJwdW9bgR3nz9h5t6c823wFH5cg1CD0sEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-11-27 1:53 ` Serge E. Hallyn
0 siblings, 0 replies; 39+ messages in thread
From: Serge E. Hallyn @ 2013-11-27 1:53 UTC (permalink / raw)
To: Tim Hockin
Cc: Stéphane Graber, Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
I was planning on doing #3, but since you guys need to access .., my
plan is to have 'a/b' refer to $cwd/a/b while /a/b is the absolute
path, and allow read and eventfd but no write to any parent dirs.
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> I see three models:
>
> 1) Don't "virtualize" the cgroup path. This is what lmctfy does,
> though we have discussed changing to:
>
> 2) Virtualize to an "administrative root" - I get to tell you where
> your root is, and you can't see anything higher than that.
>
> 3) Virtualize to CWD root - you can never go up, just down.
>
>
> #1 seems easy, but exposes a lot. #3 is restrictive and fairly easy -
> could we live with that? #2 seems ideal, but it's not clear to me how
> to actually implement it.
>
> On Tue, Nov 26, 2013 at 1:31 PM, Victor Marmol <vmarmol-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > I think most of our usecases have only wanted to know about the parent, but
> > I can see people wanting to go further. Would it be much different to
> > support both? I feel like it'll be simpler to support all if we go that
> > route.
> >
> >
> > On Tue, Nov 26, 2013 at 1:28 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> >>
> >> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> >> > lmctfy literally supports ".." as a container name :)
> >>
> >> So is ../.. ever used, or does no one ever do anything beyond ..?
> >
> >
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131125224335.GA15481-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
` (2 preceding siblings ...)
2013-11-26 4:58 ` Tim Hockin
@ 2013-12-03 13:45 ` Tejun Heo
[not found] ` <20131203134506.GG8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
3 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2013-12-03 13:45 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin, Stéphane Graber
Hello, guys.
Sorry about the delay.
On Mon, Nov 25, 2013 at 10:43:35PM +0000, Serge E. Hallyn wrote:
> Additionally, Tejun has specified that we do not want users to be
> too closely tied to the cgroupfs implementation. Therefore
> commands will be just a hair more general than specifying cgroupfs
> filenames and values. I may go so far as to avoid specifying
> specific controllers, as AFAIK there should be no redundancy in
> features. On the other hand, I don't want to get too general.
> So I'm basing the API loosely on the lmctfy command line API.
One of the reasons for not exposing knobs as-is is that the knobs we
currently have aren't consistent. The weight values have different
ranges, some combinations of values don't make much sense, and so on.
The user can cope with it but it'd probably be better to expose
something which doesn't lead to mistakes too easily.
> The above addresses
> * creating cgroups
> * chowning cgroups
> * setting cgroup limits
> * moving tasks into cgroups
> . but does not address a 'cgexec <group> -- command' type of behavior.
> * To handle that (specifically for upstart), recommend that r do:
> if (!pid) {
> request_reclassify(cgroup, getpid());
> do_execve();
> }
> . alternatively, the daemon could, if kernel is new enough, setns to
> the requestor's namespaces to execute a command in a new cgroup.
> The new command would be daemonized to that pid namespace's pid 1.
So, IIUC, cgroup hierarchy management - creation and removal of
cgroups and assignments of tasks will go through while configuring
control knobs will be delegated to the cgroup owner, right?
Hmmm... the plan is to allow delegating task assignments in the
sub-hierarchy but require CAP_X for writes to knobs (not reads). This
stems from the fact that, especially with unified hierarchy, those
operations will be cgroup-core proper operations which are gonna be
relatively safer and that task organizations in the subhierarchy and
monitoring knobs are likely to be higher frequency operation than
enabling and configuring controllers.
As I communicated multiple times before, delegating write access to
control knobs to untrusted domain has always been a security risk and
is likely to continue to remain so. Also, organizationally, a
cgroup's control knobs belong to the parent not the cgroup itself.
That probably is why you were thinking about putting an extra cgroup
inbetween for isolation, but the root problem there is that those
knobs belong to the parent, not the directory itself.
Security is in most part logistics - it's about getting all the
details right, and we don't either design or implement each knob with
security in mind and DoSing them has always been pretty easy, so I
don't think delegating write accesses to knobs is a good idea.
If you, for whatever reason, can trust the delegatee, which I believe
is the case for google, it's fine. If you're trying to delegate to a
container which you don't have any control over, it isn't a good idea.
Another thing to consider is that, due to both the fundamental characteristics
of hierarchy and implementation issues, things will become expensive
if nesting gets beyond several layers (if controllers are enabled,
that is) and the controllers in general will be implemented and
optimized with limited level of nesting in mind. IOW, building, say,
8 level deep hierarchy in the host and then doing the same thing
inside the container with controllers enabled won't make a very happy
system. It probably is something to keep in mind when laying out how
the whole thing eventually would look like.
> Long-term we will want the cgroup manager to become more intelligent -
> to place its own limits on clients, to address cpu and device hotplug,
> etc. Since we will not be doing that in the first prototype, the daemon
> will not keep any state about the clients.
Isn't the above conflicting with chowning control knobs?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <20131203134506.GG8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-12-03 13:45 ` Tejun Heo
2013-12-04 0:03 ` [lxc-devel] " Serge Hallyn
1 sibling, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-03 13:45 UTC (permalink / raw)
To: Serge E. Hallyn
Cc: lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Tim Hockin, Stéphane Graber
Ooh, can you also please cc Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> when
replying?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAO_RewZGWARUafKzDc_t3G5OedGtEPTZgB2VYeHHiKSSrja8fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 5:47 ` Serge E. Hallyn
2013-11-26 16:12 ` Serge E. Hallyn
@ 2013-12-03 13:54 ` Tejun Heo
2 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-03 13:54 UTC (permalink / raw)
To: Tim Hockin
Cc: Serge E. Hallyn, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Victor Marmol, Rohit Jnagal,
Stéphane Graber
Hello, Tim.
On Mon, Nov 25, 2013 at 08:58:09PM -0800, Tim Hockin wrote:
> Thanks for this! I think it helps a lot to discuss now, rather than
> over nearly-done code.
>
> On Mon, Nov 25, 2013 at 2:43 PM, Serge E. Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> wrote:
> > Additionally, Tejun has specified that we do not want users to be
> > too closely tied to the cgroupfs implementation. Therefore
> > commands will be just a hair more general than specifying cgroupfs
> > filenames and values. I may go so far as to avoid specifying
> > specific controllers, as AFAIK there should be no redundancy in
> > features. On the other hand, I don't want to get too general.
> > So I'm basing the API loosely on the lmctfy command line API.
>
> I'm torn here. While I agree in principle with Tejun, I am concerned
> that this agent will always lag new kernel features or that the thin
> abstraction you want to provide here does not easily accommodate some
> of the more ... oddball features of one cgroup interface or another.
Yeah, that's the trade-off, but cgroupfs is a kernel API. It shouldn't
change or grow rapidly once things settle down. As long as there's a
not-too-crazy way to step aside when such a rare case arises, I think
the pros outweigh the cons.
> This agent is the very bottom of the stack, and should probably not do
> much by way of abstraction. I think I'd rather let something like
> lmctfy provide the abstraction more holistically, and relegate this
> agent to very simple plumbing and policy. It could be as simple as
> providing read/write/etc ops to specific control files. It needs to
> handle event_fd, too, I guess. This has the nice side-effect of
> always being "current" on kernel features :)
The level of abstraction is definitely something debatable. Please
note that the existing event_fd based mechanism won't grow any new
users (BTW, event_control is one of the DoS vectors if you give write
access to it) and all new notifications will be using inotify.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: cgroup management daemon
[not found] ` <CAD=mX8v-jfA8F5DueK60Oo4Zfcjj86idKYKnDVc9LxQVs9W_rQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-12-03 14:00 ` Tejun Heo
0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-03 14:00 UTC (permalink / raw)
To: Victor Marmol
Cc: Serge E. Hallyn, Tim Hockin,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, Rohit Jnagal,
Stéphane Graber
Hello,
On Tue, Nov 26, 2013 at 09:19:18AM -0800, Victor Marmol wrote:
> > > >From my discussions with Tejun, he wanted to move to using inotify so it
> > > may still be an fd we pass around.
> >
> > Hm, would that just be inotify on the memory.max_usage_in_bytes
> > file, or inotify on a specific fd you've created which is
> > associated with any threshold you specify? The former seems
> > less ideal.
> >
>
> Tejun can comment more, but I think it is still TBD.
It's likely the former with configurable cadence or per-knob (not
per-opener) configurable thresholds. max_usage_in_bytes is a special
case here as all other knobs can simply generate an event on each
transition. If event (de)muxing is necessary, it probably should be
done from userland.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [lxc-devel] cgroup management daemon
[not found] ` <20131203134506.GG8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-12-03 13:45 ` Tejun Heo
@ 2013-12-04 0:03 ` Serge Hallyn
2013-12-04 1:24 ` Tejun Heo
1 sibling, 1 reply; 39+ messages in thread
From: Serge Hallyn @ 2013-12-04 0:03 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Stéphane Graber, Tim Hockin, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, guys.
>
> Sorry about the delay.
>
> On Mon, Nov 25, 2013 at 10:43:35PM +0000, Serge E. Hallyn wrote:
> > Additionally, Tejun has specified that we do not want users to be
> > too closely tied to the cgroupfs implementation. Therefore
> > commands will be just a hair more general than specifying cgroupfs
> > filenames and values. I may go so far as to avoid specifying
> > specific controllers, as AFAIK there should be no redundancy in
> > features. On the other hand, I don't want to get too general.
> > So I'm basing the API loosely on the lmctfy command line API.
>
> One of the reasons for not exposing knobs as-is is that the knobs we
> currently have aren't consistent. The weight values have different
> ranges, some combinations of values don't make much sense, and so on.
> The user can cope with it but it'd probably be better to expose
> something which doesn't lead to mistakes too easily.
For the moment, for prototype (github.com/hallyn/cgmanager), I'm just
going with filenames/values.
When the bulk of the work is done, we can (a) introduce
a thin abstraction layer over the key/values, and/or (b) whitelist
some of the filenames and filter some values.
I know the upstart folks don't want to have to wait long for a
specification... I'll hopefully make a final decision on this next
week.
> > The above addresses
> > * creating cgroups
> > * chowning cgroups
> > * setting cgroup limits
> > * moving tasks into cgroups
> > . but does not address a 'cgexec <group> -- command' type of behavior.
> > * To handle that (specifically for upstart), recommend that r do:
> > if (!pid) {
> > request_reclassify(cgroup, getpid());
> > do_execve();
> > }
> > . alternatively, the daemon could, if kernel is new enough, setns to
> > the requestor's namespaces to execute a command in a new cgroup.
> > The new command would be daemonized to that pid namespace's pid 1.
>
> So, IIUC, cgroup hierarchy management - creation and removal of
> cgroups and assignments of tasks will go through while configuring
> control knobs will be delegated to the cgroup owner, right?
Not sure what you mean, but I think the answer is no. Everything
goes through the manager. The manager doesn't try to enforce that,
but by default the cgroup filesystems will only be mounted in the
manager's private mnt_ns, and containers at least will not be
allowed to mount cgroup fstype.
> Hmmm... the plan is to allow delegating task assignments in the
> sub-hierarchy but require CAP_X for writes to knobs (not reads). This
> stems from the fact that, especially with unified hierarchy, those
> operations will be cgroup-core proper operations which are gonna be
> relatively safer and that task organizations in the subhierarchy and
> monitoring knobs are likely to be higher frequency operation than
> enabling and configuring controllers.
Should be ok for this.
> As I communicated multiple times before, delegating write access to
> control knobs to untrusted domain has always been a security risk and
> is likely to continue to remain so. Also, organizationally, a
Then that will need to be addressed with per-key blacklisting and/or
per-value filtering in the manager.
Which is my way of saying: can we please have a list of the security
issues so we can handle them? :) (I've asked several times before
but haven't seen a list or anyone offering to make one)
> cgroup's control knobs belong to the parent not the cgroup itself.
After thinking awhile I think this makes perfect sense. I haven't
implemented set_value yet, and when I do I think I'll implement this
guideline.
> That probably is why you were thinking about putting an extra cgroup
> inbetween for isolation, but the root problem there is that those
> knobs belong to the parent, not the directory itself.
Yup.
> Security is for the most part logistics - it's about getting all the
> details right, and we don't either design or implement each knob with
> security in mind and DoSing them has always been pretty easy, so I
> don't think delegating write accesses to knobs is a good idea.
>
> If you, for whatever reason, can trust the delegatee, which I believe
> is the case for google, it's fine. If you're trying to delegate to a
> container which you don't have any control over, it isn't a good idea.
>
> Another thing to consider is that, due to both the fundamental characteristics
> of hierarchy and implementation issues, things will become expensive
> if nesting gets beyond several layers (if controllers are enabled,
> that is) and the controllers in general will be implemented and
> optimized with limited level of nesting in mind. IOW, building, say,
> 8 level deep hierarchy in the host and then doing the same thing
> inside the container with controllers enabled won't make a very happy
Yes, I very much want to avoid that.
> system. It probably is something to keep in mind when laying out how
> the whole thing eventually would look like.
>
> > Long-term we will want the cgroup manager to become more intelligent -
> > to place its own limits on clients, to address cpu and device hotplug,
> > etc. Since we will not be doing that in the first prototype, the daemon
> > will not keep any state about the clients.
>
> Isn't the above conflicting with chowning control knobs?
Not sure what you mean by this.
To be clear what I'm talking about is having the client be able to say
"grant 50% of cpus", then when more cpus are added, the actual cpuset
gets recalculated. This may well forever stay outside of the cgmanager
scope. It may be more appropriate to put that logic into the lmctfy
layer.
thanks,
-serge
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [lxc-devel] cgroup management daemon
2013-12-04 0:03 ` [lxc-devel] " Serge Hallyn
@ 2013-12-04 1:24 ` Tejun Heo
[not found] ` <20131204012416.GY8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Tejun Heo @ 2013-12-04 1:24 UTC (permalink / raw)
To: Serge Hallyn
Cc: Serge E. Hallyn, Stéphane Graber, Tim Hockin, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Hello, Serge.
On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
> > As I communicated multiple times before, delegating write access to
> > control knobs to untrusted domain has always been a security risk and
> > is likely to continue to remain so. Also, organizationally, a
>
> Then that will need to be addressed with per-key blacklisting and/or
> per-value filtering in the manager.
>
> Which is my way of saying: can we please have a list of the security
> issues so we can handle them? :) (I've asked several times before
> but haven't seen a list or anyone offering to make one)
Unfortunately, for now, please consider everything blacklisted. Yes,
it is true that some knobs should be mostly safe but given the level
of changes we're going through and the difficulty of properly auditing
anything for delegation to untrusted environment, I don't feel
comfortable at all about delegating through chown. It is an
accidental feature which happened just because it uses filesystem as
its interface and it is nowhere near the top of the todo list. It
has never worked properly and won't in any foreseeable future.
> > cgroup's control knobs belong to the parent not the cgroup itself.
>
> After thinking awhile I think this makes perfect sense. I haven't
> implemented set_value yet, and when I do I think I'll implement this
> guideline.
I'm kinda confused here. You say *everything* is gonna go through the
manager and then talk about chowning directories. Don't the two
conflict?
> > > Long-term we will want the cgroup manager to become more intelligent -
> > > to place its own limits on clients, to address cpu and device hotplug,
> > > etc. Since we will not be doing that in the first prototype, the daemon
> > > will not keep any state about the clients.
> >
> > Isn't the above conflicting with chowning control knobs?
>
> Not sure what you mean by this.
>
> To be clear what I'm talking about is having the client be able to say
> "grant 50% of cpus", then when more cpus are added, the actual cpuset
> gets recalculated. This may well forever stay outside of the cgmanager
> scope. It may be more appropriate to put that logic into the lmctfy
> layer.
Yes, something like that would be nice but if you give out raw access
to the control knobs by chowning them, I just don't see how that would
be implementable. What am I missing here?
Thanks.
--
tejun
* Re: [lxc-devel] cgroup management daemon
[not found] ` <20131204012416.GY8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
@ 2013-12-04 1:26 ` Tejun Heo
2013-12-04 2:31 ` Serge Hallyn
1 sibling, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-04 1:26 UTC (permalink / raw)
To: Serge Hallyn
Cc: Serge E. Hallyn, Stéphane Graber, Tim Hockin, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
And can somebody please fix up lxc-devel so that it doesn't generate
"your message awaits moderator approval" notification on *each*
message? :(
--
tejun
* Re: [lxc-devel] cgroup management daemon
[not found] ` <20131204012416.GY8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-12-04 1:26 ` Tejun Heo
@ 2013-12-04 2:31 ` Serge Hallyn
2013-12-04 4:53 ` Tim Hockin
1 sibling, 1 reply; 39+ messages in thread
From: Serge Hallyn @ 2013-12-04 2:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Serge E. Hallyn, Stéphane Graber, Tim Hockin, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> Hello, Serge.
>
> On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
> > > As I communicated multiple times before, delegating write access to
> > > control knobs to untrusted domain has always been a security risk and
> > > is likely to continue to remain so. Also, organizationally, a
> >
> > Then that will need to be addressed with per-key blacklisting and/or
> > per-value filtering in the manager.
> >
> > Which is my way of saying: can we please have a list of the security
> > issues so we can handle them? :) (I've asked several times before
> > but haven't seen a list or anyone offering to make one)
>
> Unfortunately, for now, please consider everything blacklisted. Yes,
> it is true that some knobs should be mostly safe but given the level
> of changes we're going through and the difficulty of properly auditing
> anything for delegation to untrusted environment, I don't feel
> comfortable at all about delegating through chown. It is an
> accidental feature which happened just because it uses filesystem as
> its interface and it is nowhere near the top of the todo list. It
> has never worked properly and won't in any foreseeable future.
>
> > > cgroup's control knobs belong to the parent not the cgroup itself.
> >
> > After thinking awhile I think this makes perfect sense. I haven't
> > implemented set_value yet, and when I do I think I'll implement this
> > guideline.
>
> I'm kinda confused here. You say *everything* is gonna go through the
> manager and then talk about chowning directories. Don't the two
> conflict?
No. I expect the user - except in the google case - to either have
access to no cgroupfs mounts, or readonly mounts.
-serge
* Re: [lxc-devel] cgroup management daemon
2013-12-04 2:31 ` Serge Hallyn
@ 2013-12-04 4:53 ` Tim Hockin
[not found] ` <CAO_RewbZiLCJcO9G7pgxN8ZxkkVdEW1B84nkQ5wX3a9DPq4zfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 39+ messages in thread
From: Tim Hockin @ 2013-12-04 4:53 UTC (permalink / raw)
To: Serge Hallyn
Cc: Tejun Heo, Serge E. Hallyn, Stéphane Graber, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
If this daemon works as advertised, we will explore moving all write
traffic to use it. I still have concerns that this can't handle read
traffic at the scale we need.
Tejun, I am not sure why chown came back into the conversation. This
is a replacement for that.
On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org> wrote:
> Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
>> Hello, Serge.
>>
>> On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
>> > > As I communicated multiple times before, delegating write access to
>> > > control knobs to untrusted domain has always been a security risk and
>> > > is likely to continue to remain so. Also, organizationally, a
>> >
>> > Then that will need to be addressed with per-key blacklisting and/or
>> > per-value filtering in the manager.
>> >
>> > Which is my way of saying: can we please have a list of the security
>> > issues so we can handle them? :) (I've asked several times before
>> > but haven't seen a list or anyone offering to make one)
>>
>> Unfortunately, for now, please consider everything blacklisted. Yes,
>> it is true that some knobs should be mostly safe but given the level
>> of changes we're going through and the difficulty of properly auditing
>> anything for delegation to untrusted environment, I don't feel
>> comfortable at all about delegating through chown. It is an
>> accidental feature which happened just because it uses filesystem as
>> its interface and it is nowhere near the top of the todo list. It
>> has never worked properly and won't in any foreseeable future.
>>
>> > > cgroup's control knobs belong to the parent not the cgroup itself.
>> >
>> > After thinking awhile I think this makes perfect sense. I haven't
>> > implemented set_value yet, and when I do I think I'll implement this
>> > guideline.
>>
>> I'm kinda confused here. You say *everything* is gonna go through the
>> manager and then talk about chowning directories. Don't the two
>> conflict?
>
> No. I expect the user - except in the google case - to either have
> access to no cgroupfs mounts, or readonly mounts.
>
> -serge
* Re: cgroup management daemon
[not found] ` <CAO_RewbZiLCJcO9G7pgxN8ZxkkVdEW1B84nkQ5wX3a9DPq4zfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-12-04 5:09 ` Victor Marmol
[not found] ` <CAD=mX8seoMfM63hOwbmJ_0GdS-fa8H6fB40k8uyqBNbSVqfXrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-04 11:37 ` Tejun Heo
2013-12-04 15:54 ` Serge Hallyn
2 siblings, 1 reply; 39+ messages in thread
From: Victor Marmol @ 2013-12-04 5:09 UTC (permalink / raw)
To: Tim Hockin
Cc: Stéphane Graber, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Tejun Heo,
cgroups-u79uwXL29TY76Z2rM5mHXA, Serge E. Hallyn
I thought we were going to use chown in the initial version to enforce the
ownership/permissions on the hierarchy. Only the cgroup manager has access
to the hierarchy, but it tries to access the hierarchy as the user that
sent the request. It was only meant to be a "for now" solution while the
real one rolls out. It may also have gotten thrown out since last I heard :)
On Tue, Dec 3, 2013 at 8:53 PM, Tim Hockin <thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.
>
> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.
>
> On Tue, Dec 3, 2013 at 6:31 PM, Serge Hallyn <serge.hallyn-GeWIH/nMZzLQT0dZR+AlfA@public.gmane.org>
> wrote:
> > Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org):
> >> Hello, Serge.
> >>
> >> On Tue, Dec 03, 2013 at 06:03:44PM -0600, Serge Hallyn wrote:
> >> > > As I communicated multiple times before, delegating write access to
> >> > > control knobs to untrusted domain has always been a security risk
> and
> >> > > is likely to continue to remain so. Also, organizationally, a
> >> >
> >> > Then that will need to be addressed with per-key blacklisting and/or
> >> > per-value filtering in the manager.
> >> >
> >> > Which is my way of saying: can we please have a list of the security
> >> > issues so we can handle them? :) (I've asked several times before
> >> > but haven't seen a list or anyone offering to make one)
> >>
> >> Unfortunately, for now, please consider everything blacklisted. Yes,
> >> it is true that some knobs should be mostly safe but given the level
> >> of changes we're going through and the difficulty of properly auditing
> >> anything for delegation to untrusted environment, I don't feel
> >> comfortable at all about delegating through chown. It is an
> >> accidental feature which happened just because it uses filesystem as
> >> its interface and it is nowhere near the top of the todo list. It
> >> has never worked properly and won't in any foreseeable future.
> >>
> >> > > cgroup's control knobs belong to the parent not the cgroup itself.
> >> >
> >> > After thinking awhile I think this makes perfect sense. I haven't
> >> > implemented set_value yet, and when I do I think I'll implement this
> >> > guideline.
> >>
> >> I'm kinda confused here. You say *everything* is gonna go through the
> >> manager and then talk about chowning directories. Don't the two
> >> conflict?
> >
> > No. I expect the user - except in the google case - to either have
> > access to no cgroupfs mounts, or readonly mounts.
> >
> > -serge
>
* Re: [lxc-devel] cgroup management daemon
[not found] ` <CAO_RewbZiLCJcO9G7pgxN8ZxkkVdEW1B84nkQ5wX3a9DPq4zfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-04 5:09 ` Victor Marmol
@ 2013-12-04 11:37 ` Tejun Heo
2013-12-04 15:54 ` Serge Hallyn
2 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-04 11:37 UTC (permalink / raw)
To: Tim Hockin
Cc: Serge Hallyn, Serge E. Hallyn, Stéphane Graber,
Victor Marmol, Rohit Jnagal,
lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Hello, Tim.
On Tue, Dec 03, 2013 at 08:53:21PM -0800, Tim Hockin wrote:
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.
At least from the kernel side, cgroup doesn't and won't have any
problem with direct reads.
> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.
I guess I'm just confused because of the mentions of chown. If it
isn't about giving unmoderated write access to untrusted domains,
everything should be fine.
Thanks!
--
tejun
* Re: [lxc-devel] cgroup management daemon
[not found] ` <CAO_RewbZiLCJcO9G7pgxN8ZxkkVdEW1B84nkQ5wX3a9DPq4zfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-04 5:09 ` Victor Marmol
2013-12-04 11:37 ` Tejun Heo
@ 2013-12-04 15:54 ` Serge Hallyn
2013-12-04 23:06 ` Tejun Heo
2 siblings, 1 reply; 39+ messages in thread
From: Serge Hallyn @ 2013-12-04 15:54 UTC (permalink / raw)
To: Tim Hockin
Cc: Tejun Heo, Serge E. Hallyn, Stéphane Graber, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> If this daemon works as advertised, we will explore moving all write
> traffic to use it. I still have concerns that this can't handle read
> traffic at the scale we need.
>
> Tejun, I am not sure why chown came back into the conversation. This
> is a replacement for that.
Because the daemon is chowning directories and files. That's how
the daemon decides whether clients have access.
-serge
* Re: [lxc-devel] cgroup management daemon
[not found] ` <CAD=mX8seoMfM63hOwbmJ_0GdS-fa8H6fB40k8uyqBNbSVqfXrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-12-04 15:56 ` Serge Hallyn
0 siblings, 0 replies; 39+ messages in thread
From: Serge Hallyn @ 2013-12-04 15:56 UTC (permalink / raw)
To: Victor Marmol
Cc: Tim Hockin, Tejun Heo, Serge E. Hallyn, Stéphane Graber,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
Quoting Victor Marmol (vmarmol-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> I thought we were going to use chown in the initial version to enforce the
> ownership/permissions on the hierarchy. Only the cgroup manager has access
> to the hierarchy, but it tries to access the hierarchy as the user that
> sent the request. It was only meant to be a "for now" solution while the
> real one rolls out. It may also have gotten thrown out since last I heard :)
Actually that part wasn't meant as a "for now" solution. It can of
course be thrown away in favor of having the daemon store all this
information, but I'm seeing no advantages to that right now.
There are other things which the daemon can eventually try to keep
track of, if we don't decide they belong in a higher layer.
-serge
* Re: [lxc-devel] cgroup management daemon
2013-12-04 15:54 ` Serge Hallyn
@ 2013-12-04 23:06 ` Tejun Heo
0 siblings, 0 replies; 39+ messages in thread
From: Tejun Heo @ 2013-12-04 23:06 UTC (permalink / raw)
To: Serge Hallyn
Cc: Tim Hockin, Serge E. Hallyn, Stéphane Graber, Victor Marmol,
Rohit Jnagal, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
cgroups-u79uwXL29TY76Z2rM5mHXA, lizefan-hv44wF8Li93QT0dZR+AlfA
On Wed, Dec 04, 2013 at 09:54:37AM -0600, Serge Hallyn wrote:
> Quoting Tim Hockin (thockin-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > If this daemon works as advertised, we will explore moving all write
> > traffic to use it. I still have concerns that this can't handle read
> > traffic at the scale we need.
> >
> > Tejun, I am not sure why chown came back into the conversation. This
> > is a replacement for that.
>
> Because the daemon is chowning directories and files. That's how
> the daemon decides whether clients have access.
Ah, okay, so the manager is just using filesystem metadata for
bookkeeping. That should be fine. Please note that the cgroup filesystem
also supports xattrs, and AFAIK systemd is already making use of them.
Thanks.
--
tejun
end of thread, other threads: [~2013-12-04 23:06 UTC | newest]
Thread overview: 39+ messages
2013-11-25 22:43 cgroup management daemon Serge E. Hallyn
[not found] ` <20131125224335.GA15481-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 0:03 ` [lxc-devel] " Marian Marinov
[not found] ` <5293E544.10805-NV7Lj0SOnH0@public.gmane.org>
2013-11-26 0:11 ` Stéphane Graber
2013-11-26 1:35 ` [lxc-devel] " Marian Marinov
[not found] ` <5293FADA.8070901-NV7Lj0SOnH0@public.gmane.org>
2013-11-26 1:46 ` Stéphane Graber
2013-11-26 2:18 ` Michael H. Warfield
[not found] ` <1385432284.8590.52.camel-s3/A7Nnwjkf10ug9Blv0m0EOCMrvLtNR@public.gmane.org>
2013-11-26 2:43 ` Stéphane Graber
2013-11-26 2:55 ` [lxc-devel] " Michael H. Warfield
2013-11-26 4:52 ` Tim Hockin
[not found] ` <CAO_RewYmS0fH819BFCr9ozis1132dACgCPwbyb59gM1PafpUkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 16:37 ` Serge E. Hallyn
[not found] ` <20131126163737.GB23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 20:49 ` Tim Hockin
2013-11-26 4:58 ` Tim Hockin
[not found] ` <CAO_RewZGWARUafKzDc_t3G5OedGtEPTZgB2VYeHHiKSSrja8fA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 5:47 ` Serge E. Hallyn
[not found] ` <20131126054718.GA19134-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 20:38 ` Tim Hockin
[not found] ` <CAO_RewZ8cUn-PdXfQF0yH=V=9UqE7Yo1JX2pt2c71WYDrpYE0Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 20:58 ` Serge E. Hallyn
[not found] ` <20131126205819.GA27266-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 21:24 ` Tim Hockin
[not found] ` <CAO_RewZh+dNkUdZdu-R3CKTvYzbPL50v-BsBHvek75ti3V6kZQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 21:28 ` Serge E. Hallyn
[not found] ` <20131126212814.GA27602-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 21:31 ` Victor Marmol
[not found] ` <CAD=mX8uuAeN7s8ZA6Gc-wsBd6-PHevBRyBL6hMAS9VW15T5eYA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-27 1:49 ` Tim Hockin
[not found] ` <CAO_RewY0eFTgkVqbRJwdW9bgR3nz9h5t6c823wFH5cg1CD0sEA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-27 1:53 ` Serge E. Hallyn
2013-11-26 16:12 ` Serge E. Hallyn
[not found] ` <20131126161246.GA23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 16:22 ` Victor Marmol
[not found] ` <CAD=mX8tCOEO4wP-XGs9YdRufTAay6zPaOxo_wZF=-KoqepH0wg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-11-26 16:41 ` Serge E. Hallyn
[not found] ` <20131126164125.GC23834-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2013-11-26 17:19 ` Victor Marmol
[not found] ` <CAD=mX8v-jfA8F5DueK60Oo4Zfcjj86idKYKnDVc9LxQVs9W_rQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-03 14:00 ` Tejun Heo
2013-11-26 20:45 ` Tim Hockin
2013-12-03 13:54 ` Tejun Heo
2013-12-03 13:45 ` Tejun Heo
[not found] ` <20131203134506.GG8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-12-03 13:45 ` Tejun Heo
2013-12-04 0:03 ` [lxc-devel] " Serge Hallyn
2013-12-04 1:24 ` Tejun Heo
[not found] ` <20131204012416.GY8277-Gd/HAXX7CRxy/B6EtB590w@public.gmane.org>
2013-12-04 1:26 ` Tejun Heo
2013-12-04 2:31 ` Serge Hallyn
2013-12-04 4:53 ` Tim Hockin
[not found] ` <CAO_RewbZiLCJcO9G7pgxN8ZxkkVdEW1B84nkQ5wX3a9DPq4zfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-04 5:09 ` Victor Marmol
[not found] ` <CAD=mX8seoMfM63hOwbmJ_0GdS-fa8H6fB40k8uyqBNbSVqfXrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-12-04 15:56 ` [lxc-devel] " Serge Hallyn
2013-12-04 11:37 ` Tejun Heo
2013-12-04 15:54 ` Serge Hallyn
2013-12-04 23:06 ` Tejun Heo