From: Marian Marinov
Subject: Re: [lxc-devel] cgroup management daemon
Date: Tue, 26 Nov 2013 03:35:22 +0200
Message-ID: <5293FADA.8070901@yuhu.biz>
References: <20131125224335.GA15481@mail.hallyn.com> <5293E544.10805@yuhu.biz> <20131126001139.GL26027@castiana>
In-Reply-To: <20131126001139.GL26027@castiana>
To: Stéphane Graber
Cc: "Serge E. Hallyn", Tejun Heo, lxc-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org, cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Victor Marmol, Rohit Jnagal, Tim Hockin

On 11/26/2013 02:11 AM, Stéphane Graber wrote:
> On Tue, Nov 26, 2013 at 02:03:16AM +0200, Marian Marinov wrote:
>> On 11/26/2013 12:43 AM, Serge E. Hallyn wrote:
>>> Hi,
>>>
>>> As I've mentioned several times, I want to write a standalone cgroup
>>> management daemon. Basic requirements are that it be a standalone
>>> program; that a single instance running on the host be usable from
>>> containers nested at any depth; that it not allow escaping one's
>>> assigned limits; that it not allow subjugating tasks which do not
>>> belong to you; and that, within your limits, you be able to parcel
>>> those limits to your tasks as you like.
>>>
>>> Additionally, Tejun has specified that we do not want users to be
>>> too closely tied to the cgroupfs implementation.
>>> Therefore commands will be just a hair more general than specifying
>>> cgroupfs filenames and values. I may go so far as to avoid specifying
>>> specific controllers, as AFAIK there should be no redundancy in
>>> features. On the other hand, I don't want to get too general.
>>> So I'm basing the API loosely on the lmctfy command line API.
>>>
>>> One of the driving goals is to enable nested lxc as simply and safely as
>>> possible. If this project is a success, then a large chunk of code can
>>> be removed from lxc. I'm considering this project a part of the larger
>>> lxc project, but given how central it is to systems management that
>>> doesn't mean that I'll consider anyone else's needs as less important
>>> than our own.
>>>
>>> This document consists of two parts. The first describes how I
>>> intend the daemon (cgmanager) to be structured and how it will
>>> enforce the safety requirements. The second describes the commands
>>> which clients will be able to send to the manager. The list of
>>> controller keys which can be set is very incomplete at this point,
>>> serving mainly to show the approach I was thinking of taking.
>>>
>>> Summary
>>>
>>> Each 'host' (identified by a separate instance of the Linux kernel) will
>>> have exactly one running daemon to manage control groups. This daemon
>>> will answer cgroup management requests over a dbus socket, located at
>>> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>>> containers, so that one daemon can support the whole system.
>>>
>>> Programs will be able to make cgroup requests using dbus calls, or
>>> indirectly by linking against lmctfy which will be modified to use the
>>> dbus calls if available.
>>>
>>> Outline:
>>>     . A single manager, cgmanager, is started on the host, very early
>>>       during boot. It has very few dependencies, and requires only
>>>       /proc, /run, and /sys to be mounted, with /etc ro.
>>>       It will mount the cgroup hierarchies in a private namespace and
>>>       set defaults (clone_children, use_hierarchy, sane_behavior,
>>>       release_agent?). It will open a socket at
>>>       /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>>     . A client (requestor 'r') can make cgroup requests over
>>>       /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>>>       requirements for r are listed below.
>>>     . The client request will pertain to an existing or new cgroup A.
>>>       r's privilege over the cgroup must be checked. r is said to have
>>>       privilege over A if A is owned by r's uid, or if A's owner is
>>>       mapped into r's user namespace and r is root in that user
>>>       namespace.
>>>     . The client request may pertain to a victim task v, which may be
>>>       moved to a new cgroup. In that case r's privilege over both the
>>>       cgroup and v must be checked. r is said to have privilege over v
>>>       if v is mapped in r's pid namespace, v's uid is mapped into r's
>>>       user ns, and r is root in its userns; or if r and v have the same
>>>       uid and v is mapped in r's pid namespace.
>>>     . r's credentials will be taken from the socket's peercred, ensuring
>>>       that pid and uid are translated.
>>>     . r passes PID(v) in an SCM_CREDENTIALS message, so that cgmanager
>>>       receives the translated global pid. It will then read UID(v) from
>>>       /proc/PID(v)/status, which is the global uid, and check
>>>       /proc/PID(r)/uid_map to see whether UID(v) is mapped there.
>>>     . dbus-send can be enhanced to send a pid via SCM_CREDENTIALS to have
>>>       the kernel translate it for the reader. Only 'move task v to cgroup
>>>       A' will require an SCM_CREDENTIALS message to be sent.
>>>
>>> Privilege requirements by action:
>>>     * Requestor of an action (r) over a socket may only make
>>>       changes to cgroups over which it has privilege.
>>>     * Requestors may be limited to a certain #/depth of cgroups
>>>       (to limit memory usage) - DEFER?
>>>     * Cgroup hierarchy is responsible for resource limits
>>>     * A requestor must either be uid 0 in its userns with victim mapped
>>>       into its userns, or the same uid and in same/ancestor pidns as the
>>>       victim
>>>     * If r requests creation of cgroup '/x', /x will be interpreted
>>>       as relative to r's cgroup. r cannot make changes to cgroups not
>>>       under its own current cgroup.
>>>     * If r is not in the initial user_ns, then it may not change settings
>>>       in its own cgroup, only descendants. (Not strictly necessary -
>>>       we could require the use of extra cgroups when wanted, as lxc does
>>>       currently)
>>>     * If r requests creation of cgroup '/x', it must have write access
>>>       to its own cgroup (not strictly necessary)
>>>     * If r requests chown of cgroup /x to uid Y, Y is passed in a
>>>       ucred over the unix socket, and therefore translated to init
>>>       userns.
>>>     * If r requests setting a limit under /x, then
>>>       . either r must be root in its own userns, and UID(/x) be mapped
>>>         into its userns, or else UID(r) == UID(/x)
>>>       . /x must not be / (not strictly necessary, all users know to
>>>         ensure an extra cgroup layer above '/')
>>>       . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>>>         which won't be satisfied. Therefore we'll need to do privilege
>>>         checks ourselves, then perform the write as the host root user
>>>         (see devices.allow/deny). Further we need to support older
>>>         kernels which don't support setns for pid.
>>>     * If r requests action on victim V, it passes V's pid in a ucred,
>>>       so that gets translated.
>>>       Daemon will verify that V's uid is mapped into r's userns. Since
>>>       r is either root or the same uid as V, it is allowed to classify.
>>>
>>> The above addresses
>>>     * creating cgroups
>>>     * chowning cgroups
>>>     * setting cgroup limits
>>>     * moving tasks into cgroups
>>>       . but does not address a 'cgexec -- command' type of behavior.
>>>     * To handle that (specifically for upstart), recommend that r do:
>>>         pid = fork();
>>>         if (!pid) {
>>>             /* child: move ourselves into the cgroup, then exec */
>>>             request_reclassify(cgroup, getpid());
>>>             do_execve();
>>>         }
>>>       . alternatively, the daemon could, if kernel is new enough, setns to
>>>         the requestor's namespaces to execute a command in a new cgroup.
>>>         The new command would be daemonized to that pid namespace's pid 1.
>>>
>>> Types of requests:
>>>     * r requests creating cgroup A'/A
>>>       . lmctfy/cli/commands/create.cc
>>>       . Verify that UID(r) mapped to 0 in r's userns
>>>       . R=cgroup_of(r)
>>>       . Verify that UID(R) is mapped into r's userns
>>>       . Create R/A'/A
>>>       . chown R/A'/A to UID(r)
>>>     * r requests to move task x to cgroup A.
>>>       . lmctfy/cli/commands/enter.cc
>>>       . r must send PID(x) as ancillary message
>>>       . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is
>>>         mapped into that userns
>>>         (is it safe to allow if UID(x) == UID(r))?
>>>       . R=cgroup_of(r)
>>>       . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's
>>>         needed)
>>>       . echo PID(x) >> /R/A/tasks
>>>     * r requests chown of cgroup A to uid X
>>>       . X is passed in ancillary message
>>>         * ensures it is valid in r's userns
>>>         * maps the userid to host for us
>>>       . Verify that UID(r) mapped to 0 in r's userns
>>>       . R=cgroup_of(r)
>>>       . Chown R/A to X
>>>     * r requests cgroup A's 'property=value'
>>>       . Verify that either
>>>         * A != ''
>>>         * UID(r) == 0 on host
>>>         In other words, r in a userns may not set root cgroup settings.
>>>       . Verify that UID(r) mapped to 0 in r's userns
>>>       . R=cgroup_of(r)
>>>       . Set property=value for R/A
>>>         * Expect kernel to guarantee hierarchical constraints
>>>     * r requests deletion of cgroup A
>>>       . lmctfy/cli/commands/destroy.cc (without -f)
>>>       . same requirements as setting 'property=value'
>>>     * r requests purge of cgroup A
>>>       . lmctfy/cli/commands/destroy.cc (with -f)
>>>       . same requirements as setting 'property=value'
>>>
>>> Long-term we will want the cgroup manager to become more intelligent -
>>> to place its own limits on clients, to address cpu and device hotplug,
>>> etc. Since we will not be doing that in the first prototype, the daemon
>>> will not keep any state about the clients.
>>>
>>> Client DBus Message API
>>>
>>>     : a-zA-Z0-9
>>>     : "a-zA-Z0-9 "
>>>     : [:controllerlist]
>>>     : key:value
>>>     : frozen
>>>     : thawed
>>>     : valueentry[:values]
>>>     keys:
>>>         {memory,swap}.{limit,soft_limit}
>>>         cpus_allowed       # set of allowed cpus
>>>         cpus_fraction      # % of allowed cpus
>>>         cpus_number        # number of allowed cpus
>>>         cpu_share_percent  # percent of cpushare
>>>         devices_whitelist
>>>         devices_blacklist
>>>         net_prio_index
>>>         net_prio_interface_map
>>>         net_classid
>>>         hugetlb_limit
>>>         blkio_weight
>>>         blkio_weight_device
>>>         blkio_throttle_{read,write}
>>>     readkeys:
>>>         devices_list
>>>         {memory,swap}.{failcnt,max_use,limitnuma_stat}
>>>         hugetlb_max_usage
>>>         hugetlb_usage
>>>         hugetlb_failcnt
>>>         cpuacct_stat
>>>
>>> Commands:
>>>     ListControllers
>>>     Create
>>>     Setvalue
>>>     Getvalue
>>>     ListChildren
>>>     ListTasks
>>>     ListControllers
>>>     Chown
>>>     Chown :
>>>     Move [[ pid is sent via SCM_CREDENTIALS ]]
>>>     Delete
>>>     Delete-force
>>>     Kill
>>>
>>
>> I really like the idea, but I have a few comments.
>> I'm not familiar with dbus, but how will you identify a request made on dbus?
>> I mean, will you get its pid? What if the container has its own PID namespace, how will this be handled?
>
> DBus is essentially just an IPC protocol that can be used over a variety
> of media.
>
> In the case of this cgroup manager, we'll be using the DBus protocol on
> top of a standard UNIX socket. One of the properties of unix sockets is
> that you can get the uid, gid and pid of your peer. As this information
> is provided by the kernel, it'll automatically be translated to match
> your view of the pid and user tree.
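For reference, the peer-credential lookup described in the paragraph above can be sketched in a few lines. This is only an illustration, assuming Linux and Python 3: peer_credentials is a hypothetical helper name (not part of cgmanager or lmctfy), and a socketpair stands in for the manager's real listening socket.

```python
import os
import socket
import struct

# struct ucred on Linux: pid_t pid; uid_t uid; gid_t gid
UCRED_FORMAT = "iII"


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer of a connected AF_UNIX socket.

    Hypothetical helper for illustration. The values are filled in by the
    kernel, not by the peer, so the peer cannot forge them.
    """
    size = struct.calcsize(UCRED_FORMAT)
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, size)
    return struct.unpack(UCRED_FORMAT, raw)


if __name__ == "__main__":
    # Both ends live in this process, so the peer is ourselves.
    manager_end, client_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        pid, uid, gid = peer_credentials(manager_end)
        print(f"peer pid={pid} uid={uid} gid={gid}")
    finally:
        manager_end.close()
        client_end.close()
```

When the two ends live in different pid or user namespaces, the kernel reports the values as seen from the caller's namespaces, which is the translation the quoted text relies on.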
>
> That's why we're also planning on abusing SCM_CRED a tiny bit so that
> when a container or sub-container is asking for a pid to be moved into a
> cgroup, rather than passing that pid as a standard integer over dbus,
> it'll use the SCM_CRED mechanism and send a ucred structure, which then
> gets magically mapped to the right namespace when accessed by the
> manager, saving us a whole lot of pid/uid mapping logic in the process.
>
>>
>> I know that this may sound a bit radical, but I propose that the daemon use simple unix sockets.
>> The daemon should have an easy way of adding more sockets to newly started containers, and each newly created socket
>> should know the base cgroup to which it belongs. This way the daemon can clearly identify which request is limited to
>> what cgroup without many lookups, and it will be easier to enforce the above mentioned restrictions.
>
> So it looks like our current design already follows your recommendation
> since we're indeed using a standard unix socket, it's just that instead
> of re-inventing the wheel, we use a standard IPC protocol on top of it.

Thanks, SCM_CRED is exactly what I was thinking about :)
I was unaware that it can be combined with the dbus protocol, which is why I commented.

Is there any particular language that you want this project started in? I know that most of LXC is written in C, but I
don't see any guidelines on whether other languages may be used.

Marian

>
>>
>> Marian
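The ownership rule from the quoted proposal (r has privilege over A if A is owned by r's uid, or if A's owner is mapped into r's user namespace and r is root there) could be checked roughly as follows. A sketch only, in Python 3: all three function names are hypothetical, the map text is assumed to come from /proc/PID(r)/uid_map as read by a daemon in the initial user namespace (so the second column holds host uids), and the pid-namespace checks are left out.

```python
def parse_uid_map(text):
    """Parse uid_map text into (ns_start, host_start, count) extents."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        ns_start, host_start, count = (int(f) for f in fields)
        extents.append((ns_start, host_start, count))
    return extents


def host_uid_to_ns_uid(host_uid, extents):
    """Translate a host uid into the namespace, or None if it is unmapped."""
    for ns_start, host_start, count in extents:
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def has_privilege(r_host_uid, owner_host_uid, extents):
    """Apply the proposal's rule using host uids and r's uid_map extents."""
    if r_host_uid == owner_host_uid:
        return True
    r_is_ns_root = host_uid_to_ns_uid(r_host_uid, extents) == 0
    owner_is_mapped = host_uid_to_ns_uid(owner_host_uid, extents) is not None
    return r_is_ns_root and owner_is_mapped


if __name__ == "__main__":
    # Hypothetical container map: ns uids 0..65535 <-> host uids 100000..165535.
    extents = parse_uid_map("0 100000 65536\n")
    print(has_privilege(100000, 100500, extents))  # container root, mapped owner
    print(has_privilege(100500, 100600, extents))  # unprivileged, different owner
```

Keeping the comparison in host uids means the daemon never has to enter the requestor's namespace, which matches the proposal's point that setns would not satisfy the in-kernel capable() checks.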