* Why does devices cgroup check for CAP_SYS_ADMIN explicitly?
@ 2012-11-06 2:38 Tejun Heo
[not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>
0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2012-11-06 2:38 UTC (permalink / raw)
To: Aristeu Rozanski, Serge E. Hallyn
Cc: Li Zefan, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
cgroups-u79uwXL29TY76Z2rM5mHXA
Hello, guys.
Why doesn't it follow the usual security enforced by cgroupfs
permissions? Why is the explicit check necessary?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 21+ messages in thread[parent not found: <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 11:58 ` Eric W. Biederman [not found] ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Eric W. Biederman @ 2012-11-06 11:58 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > Hello, guys. > > Why doesn't it follow the usual security enforced by cgroupfs > permissions? Why is the explicit check necessary? An almost more interesting question is why is cgroup one of the last pieces of code not using capabilities and instead lets you attach to any process simply if your uid == 0. I don't know the history but the device cgroup testing for CAP_SYS_ADMIN makes a naive sort of sense to me. Eric ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-11-06 14:48 ` Tejun Heo 2012-11-06 15:01 ` Serge Hallyn 1 sibling, 0 replies; 21+ messages in thread From: Tejun Heo @ 2012-11-06 14:48 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Serge E. Hallyn, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hey, Eric. On Tue, Nov 06, 2012 at 03:58:04AM -0800, Eric W. Biederman wrote: > > Why doesn't it follow the usual security enforced by cgroupfs > > permissions? Why is the explicit check necessary? > > An almost more interesting question is why is cgroup one of the last > pieces of code not using capabilities and instead lets you attach to any > process simply if your uid == 0. Because it has a filesystem as interface and most things w/ file system interface depend on VFS for access policies. > I don't know the history but the device cgroup testing for CAP_SYS_ADMIN > makes a naive sort of sense to me. If some different CAP_* is needed for cgroup (but why?), the right thing to do is enforcing it uniformly from cgroup core instead of doing it from individual controllers. If there's no actual good reason to keep this, I'll write up a patch to remove it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-11-06 14:48 ` Tejun Heo @ 2012-11-06 15:01 ` Serge Hallyn 2012-11-06 15:06 ` Tejun Heo 1 sibling, 1 reply; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 15:01 UTC (permalink / raw) To: Eric W. Biederman Cc: Tejun Heo, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > > > Hello, guys. > > > > Why doesn't it follow the usual security enforced by cgroupfs > > permissions? Why is the explicit check necessary? > > An almost more interesting question is why is cgroup one of the last > pieces of code not using capabilities and instead lets you attach to any > process simply if your uid == 0. > > I don't know the history but the device cgroup testing for CAP_SYS_ADMIN > makes a naive sort of sense to me. Right, it's possible to run a daemon with uid 0 but restricted capabilities (enforced by bounding set) for all its children, but not possible to run a non-uid-0 daemon with some capabilities across execs. (Of course that can be worked around with privileged helpers or to an extent inherited capabilities.) Capabilities make sense for the cgroup controls. In other words, any code which equates uid 0 with posession of a capability is taking away from the usefulness of capabilities everywhere. More practically, lacking user namespaces you can create a full (i.e. ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without root. So this allows you to prevent containers from bypassing devices cgroup restrictions set by the parent. (In reality we are not using that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - but other distros do.) -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? 2012-11-06 15:01 ` Serge Hallyn @ 2012-11-06 15:06 ` Tejun Heo [not found] ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 15:06 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, Serge. On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote: > More practically, lacking user namespaces you can create a full (i.e. > ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without > root. So this allows you to prevent containers from bypassing devices > cgroup restrictions set by the parent. (In reality we are not using > that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - > but other distros do.) Do you even mount cgroupfs in containers? If you just bind-mount cgroupfs verbatim in containers, I don't think that's gonna work very well. If not, all this doesn't make any difference for containers. So, you don't really have any actual use case for the explicit CAP_* checks, right? Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 15:30 ` Serge Hallyn 2012-11-06 15:41 ` Tejun Heo 2012-11-06 15:34 ` Eric W. Biederman 1 sibling, 1 reply; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 15:30 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Hello, Serge. > > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote: > > More practically, lacking user namespaces you can create a full (i.e. > > ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without > > root. So this allows you to prevent containers from bypassing devices > > cgroup restrictions set by the parent. (In reality we are not using > > that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - > > but other distros do.) > > Do you even mount cgroupfs in containers? If you just bind-mount > cgroupfs verbatim in containers, I don't think that's gonna work very > well. If not, all this doesn't make any difference for containers. I don't know if those who restrict CAP_SYS_ADMIN do so or not. We by default do not. (It's not relevant for this discussion as again we use apparmor to deny writes, but we *do* optionally bind mount cgroups into the containers, mounting /sys/fs/cgroup/$cgroup/lxc/$containername/$containername.real on the host to /sys/fs/cgroup/$cgroup in the container for each cgroup.) > So, you don't really have any actual use case for the explicit CAP_* > checks, right? No, especially since we will now have user namespaces. We will want to be able to strictly enforce hierarchical limits - i.e. allow uid 100000 (which is uid 0 in the container) to change cgroup settings, but never exceed limits set on the parent directory. IIUC you are working toward anyway with the general hierarchy work? (thanks for that). -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? 2012-11-06 15:30 ` Serge Hallyn @ 2012-11-06 15:41 ` Tejun Heo [not found] ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 15:41 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote: > > So, you don't really have any actual use case for the explicit CAP_* > > checks, right? > > No, especially since we will now have user namespaces. > > We will want to be able to strictly enforce hierarchical limits - i.e. > allow uid 100000 (which is uid 0 in the container) to change cgroup > settings, but never exceed limits set on the parent directory. IIUC > you are working toward anyway with the general hierarchy work? (thanks > for that). Yeah, I'm working toward that but I'm not sure it would mean that containers would be able to directly bind mount cgroupfs subdirectory and have free reign on it. Maybe such thing can be made to work but I would be much more comfortable having something inbetween for impedance matching (in access policies, root cgroup behavior matching, whatnot). So, the functionality will be there but it probably would need something inbetween if you wanna give containers control over its own cgroup hierarchy. There are some issues tho. As it currently stands, devices cgroup inherits configuration rather than enforcing hierarchy while checking for access permission. This means that changes in an ancestor have to be propagated downwards and *update* configurations of descendants, which is what I'm working on but it can be confusing for someone inside the container. Without breaking compatibility, I don't see any other way out tho. I suppose it's something we'll have to live with. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 16:12 ` Aristeu Rozanski 0 siblings, 0 replies; 21+ messages in thread From: Aristeu Rozanski @ 2012-11-06 16:12 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Eric W. Biederman On Tue, Nov 06, 2012 at 07:41:05AM -0800, Tejun Heo wrote: > On Tue, Nov 06, 2012 at 09:30:32AM -0600, Serge Hallyn wrote: > > > So, you don't really have any actual use case for the explicit CAP_* > > > checks, right? > > > > No, especially since we will now have user namespaces. > > > > We will want to be able to strictly enforce hierarchical limits - i.e. > > allow uid 100000 (which is uid 0 in the container) to change cgroup > > settings, but never exceed limits set on the parent directory. IIUC > > you are working toward anyway with the general hierarchy work? (thanks > > for that). > > Yeah, I'm working toward that but I'm not sure it would mean that > containers would be able to directly bind mount cgroupfs subdirectory > and have free reign on it. Maybe such thing can be made to work but I > would be much more comfortable having something inbetween for > impedance matching (in access policies, root cgroup behavior matching, > whatnot). So, the functionality will be there but it probably would > need something inbetween if you wanna give containers control over its > own cgroup hierarchy. > > There are some issues tho. As it currently stands, devices cgroup > inherits configuration rather than enforcing hierarchy while checking > for access permission. This means that changes in an ancestor have to > be propagated downwards and *update* configurations of descendants, > which is what I'm working on but it can be confusing for someone > inside the container. Without breaking compatibility, I don't see any > other way out tho. I suppose it's something we'll have to live with. yes, but most of the changes will happen before the containers start. I've been working in a patchset (as promised) and it propagates down the changes and also keep 'local' rules. everytime a parent change something, the local rules are re-evaluated and reapplied if they're still valid. -- Aristeu ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-11-06 15:30 ` Serge Hallyn @ 2012-11-06 15:34 ` Eric W. Biederman [not found] ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 1 sibling, 1 reply; 21+ messages in thread From: Eric W. Biederman @ 2012-11-06 15:34 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > Hello, Serge. > > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote: >> More practically, lacking user namespaces you can create a full (i.e. >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without >> root. So this allows you to prevent containers from bypassing devices >> cgroup restrictions set by the parent. (In reality we are not using >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - >> but other distros do.) > > Do you even mount cgroupfs in containers? If you just bind-mount > cgroupfs verbatim in containers, I don't think that's gonna work very > well. If not, all this doesn't make any difference for containers. > > So, you don't really have any actual use case for the explicit CAP_* > checks, right? Having thought about this a little more I can give a definitive answer. Adding a process to the device control group is equivalent to calling mknod, as it allows that process to open device nodes, or equivalently not open device nodes. Therefore a capable check is absolutely required. Without a capability check it would be possible to remove access to /dev/console for a suid root application keeping it from reporting attempts to hack it for example. update_access can allow access to previously unaccessible devices and so is equivalent to mknod and as such requires a capability call. static int devcgroup_update_access(struct dev_cgroup *devcgroup, int filetype, const char *buffer) { .... if (!capable(CAP_SYS_ADMIN)) return -EPERM; Likewise move to a different cgroup can give you a completely different set of devices you can use. And is roughly equivalent to mknod, and needs a capability call. static int devcgroup_can_attach(struct cgroup *new_cgrp, struct cgroup_taskset *set) { struct task_struct *task = cgroup_taskset_first(set); if (current != task && !capable(CAP_SYS_ADMIN)) return -EPERM; The generic cgroup check in attach_task_by_pid to see if you can move another process into a cgroup needs to be a capability call and not a test for uid == 0. static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup) { if (pid) { tsk = find_task_by_vpid(pid); /* * even if we're attaching all tasks in the thread group, we * only need to check permissions on one of them. */ tcred = __task_cred(tsk); if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && ^^^^^^^^^^^^^^^ !uid_eq(cred->euid, tcred->uid) && !uid_eq(cred->euid, tcred->suid)) { rcu_read_unlock(); ret = -EACCES; goto out_unlock_cgroup; Eric ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-11-06 15:43 ` Tejun Heo [not found] ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-11-06 15:45 ` Serge Hallyn 1 sibling, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 15:43 UTC (permalink / raw) To: Eric W. Biederman Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA Hey, Eric. On Tue, Nov 06, 2012 at 07:34:07AM -0800, Eric W. Biederman wrote: > Having thought about this a little more I can give a definitive answer. > > Adding a process to the device control group is equivalent to calling > mknod, as it allows that process to open device nodes, or equivalently > not open device nodes. Therefore a capable check is absolutely > required. > > Without a capability check it would be possible to remove access to > /dev/console for a suid root application keeping it from reporting > attempts to hack it for example. You understand that the whole thing is gated by VFS permission check, right? I'm kinda lost what you're talking about. > The generic cgroup check in attach_task_by_pid to see if you can move > another process into a cgroup needs to be a capability call and not a > test for uid == 0. > > static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup) > { > if (pid) { > tsk = find_task_by_vpid(pid); > > /* > * even if we're attaching all tasks in the thread group, we > * only need to check permissions on one of them. > */ > tcred = __task_cred(tsk); > if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && > ^^^^^^^^^^^^^^^ > !uid_eq(cred->euid, tcred->uid) && > !uid_eq(cred->euid, tcred->suid)) { > rcu_read_unlock(); > ret = -EACCES; > goto out_unlock_cgroup; This one isn't gated by VFS so we need to add CAP check to this function. No? Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 16:10 ` Eric W. Biederman [not found] ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Eric W. Biederman @ 2012-11-06 16:10 UTC (permalink / raw) To: Tejun Heo Cc: Serge Hallyn, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > Hey, Eric. > > On Tue, Nov 06, 2012 at 07:34:07AM -0800, Eric W. Biederman wrote: >> Having thought about this a little more I can give a definitive answer. >> >> Adding a process to the device control group is equivalent to calling >> mknod, as it allows that process to open device nodes, or equivalently >> not open device nodes. Therefore a capable check is absolutely >> required. >> >> Without a capability check it would be possible to remove access to >> /dev/console for a suid root application keeping it from reporting >> attempts to hack it for example. > > You understand that the whole thing is gated by VFS permission check, > right? I'm kinda lost what you're talking about. mknod is gated by the vfs with a capability call. open does not perform the CAP_MKNOD check. Since the device cgroup prevents opening of device nodes adding permission to access a new device node (update_access) is roughly equivalent to mknod when the device cgroup does not exist. To preserve the notion that only a privileged user can grant access to device nodes we need a capability check. Especially since the device cgroup is designed to limit processes with uid == 0. Without a capability check a process with CAP_DAC_OVERRIDE can go shopping for a device control group that happens to have the device it wants to use. Similary without a capability check a process with CAP_DAD_OVERRIDE can add or remove any device node into a device control group. I don't see how the device control group can limit uid == 0 with the device control group without making the operations require a capability you don't give to ever user who has uid == 0. >> The generic cgroup check in attach_task_by_pid to see if you can move >> another process into a cgroup needs to be a capability call and not a >> test for uid == 0. >> >> static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup) >> { >> if (pid) { >> tsk = find_task_by_vpid(pid); >> >> /* >> * even if we're attaching all tasks in the thread group, we >> * only need to check permissions on one of them. >> */ >> tcred = __task_cred(tsk); >> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && >> ^^^^^^^^^^^^^^^ >> !uid_eq(cred->euid, tcred->uid) && >> !uid_eq(cred->euid, tcred->suid)) { >> rcu_read_unlock(); >> ret = -EACCES; >> goto out_unlock_cgroup; > > This one isn't gated by VFS so we need to add CAP check to this > function. No? correct. The check should read something like: tcred = __task_cred(tsk); if (!uid_eq(cred->euid, tcred->uid) && !uid_eq(cred->euid, tcred->suid) && !capable(CAP_SYS_ADMIN)) { rcu_read_unlock(); ret = -EACCES; goto out_unlock_cgroup; Eric ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-11-06 16:52 ` Tejun Heo [not found] ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 16:52 UTC (permalink / raw) To: Eric W. Biederman Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA Hello, Eric. On Tue, Nov 06, 2012 at 08:10:10AM -0800, Eric W. Biederman wrote: > mknod is gated by the vfs with a capability call. > > open does not perform the CAP_MKNOD check. > > Since the device cgroup prevents opening of device nodes adding > permission to access a new device node (update_access) is roughly > equivalent to mknod when the device cgroup does not exist. I think that's a jump. > To preserve the notion that only a privileged user can grant access to > device nodes we need a capability check. Especially since the device > cgroup is designed to limit processes with uid == 0. > > Without a capability check a process with CAP_DAC_OVERRIDE can go > shopping for a device control group that happens to have the device it > wants to use. > > Similary without a capability check a process with CAP_DAD_OVERRIDE can > add or remove any device node into a device control group. > > I don't see how the device control group can limit uid == 0 with the > device control group without making the operations require a capability > you don't give to ever user who has uid == 0. devices cgroup adds to restrictions what a group of tasks can do. Access to cgroup configuration is gated by cgroup core (currently by VFS permissions) and that's it. I really don't want each controller to develop its own permission checks. If a controller can't live with that, it probably shouldn't be a cgroup controller. So, if you think the CAP check is needed for cgroup in general (and can justify it), please feel free to move it to cgroup core; otherwise, the CAP check is going away from devices and if devices can't live with that, it probably shouldn't have been a cgroup controller from the beginning. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 17:31 ` Serge Hallyn 2012-11-06 17:38 ` Tejun Heo 0 siblings, 1 reply; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 17:31 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Hello, Eric. > > On Tue, Nov 06, 2012 at 08:10:10AM -0800, Eric W. Biederman wrote: > > mknod is gated by the vfs with a capability call. > > > > open does not perform the CAP_MKNOD check. > > > > Since the device cgroup prevents opening of device nodes adding > > permission to access a new device node (update_access) is roughly > > equivalent to mknod when the device cgroup does not exist. > > I think that's a jump. > > > To preserve the notion that only a privileged user can grant access to > > device nodes we need a capability check. Especially since the device > > cgroup is designed to limit processes with uid == 0. > > > > Without a capability check a process with CAP_DAC_OVERRIDE can go > > shopping for a device control group that happens to have the device it > > wants to use. > > > > Similary without a capability check a process with CAP_DAD_OVERRIDE can > > add or remove any device node into a device control group. > > > > I don't see how the device control group can limit uid == 0 with the > > device control group without making the operations require a capability > > you don't give to ever user who has uid == 0. > > devices cgroup adds to restrictions what a group of tasks can do. > Access to cgroup configuration is gated by cgroup core (currently by > VFS permissions) and that's it. I really don't want each controller > to develop its own permission checks. If a controller can't live with > that, it probably shouldn't be a cgroup controller. So, if you think > the CAP check is needed for cgroup in general (and can justify it), > please feel free to move it to cgroup core; otherwise, the CAP check > is going away from devices and if devices can't live with that, it > probably shouldn't have been a cgroup controller from the beginning. We can't generally require a capability to move tasks between cgroups, as that will break currently intended uses. I can create two cgroups, chown them to serge, and let serge move between them. -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? 2012-11-06 17:31 ` Serge Hallyn @ 2012-11-06 17:38 ` Tejun Heo [not found] ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 17:38 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > We can't generally require a capability to move tasks between cgroups, > as that will break currently intended uses. I can create two cgroups, > chown them to serge, and let serge move between them. Sure, then just live with the cgroupfs based permission check. What next? Should we add CAP_SYS_RESOURCE check to all resource related controllers? Moreover, We're headed to unified hierarchy, so in the end that means only the user with almost all CAP_* can manipulate cgroups at all making the whole thing meaningless. I don't think applying fine-grained CAP_* to cgroup controllers makes sense or would be useful in any real sense. We can introduce, say, CAP_CGROUP to control access cgroupfs but I think we already have enough access control to cgroupfs, don't we? Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 17:41 ` Tejun Heo [not found] ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-11-06 18:12 ` Serge Hallyn 1 sibling, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 17:41 UTC (permalink / raw) To: Serge Hallyn Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Eric W. Biederman, Aristeu Rozanski Just one more thing. On Tue, Nov 06, 2012 at 09:38:23AM -0800, Tejun Heo wrote: > Hello, > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > > We can't generally require a capability to move tasks between cgroups, > > as that will break currently intended uses. I can create two cgroups, > > chown them to serge, and let serge move between them. > > Sure, then just live with the cgroupfs based permission check. What > next? Should we add CAP_SYS_RESOURCE check to all resource related > controllers? Moreover, We're headed to unified hierarchy, so in the > end that means only the user with almost all CAP_* can manipulate > cgroups at all making the whole thing meaningless. As for using cgroup as !root user, I would advise not doing that. Again, we're moving toward a unified cgroup hierarchy. You wouldn't be creating multiple cgroup hierarchies and assigning different user accesses to them. Also, I would strongly discourage chowning sub directories in cgroupfs and letting non-priviledged users modify them directly. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 18:02 ` Serge Hallyn 2012-11-06 18:08 ` Tejun Heo 0 siblings, 1 reply; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 18:02 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Just one more thing. > > On Tue, Nov 06, 2012 at 09:38:23AM -0800, Tejun Heo wrote: > > Hello, > > > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > > > We can't generally require a capability to move tasks between cgroups, > > > as that will break currently intended uses. I can create two cgroups, > > > chown them to serge, and let serge move between them. > > > > Sure, then just live with the cgroupfs based permission check. What > > next? Should we add CAP_SYS_RESOURCE check to all resource related > > controllers? Moreover, We're headed to unified hierarchy, so in the > > end that means only the user with almost all CAP_* can manipulate > > cgroups at all making the whole thing meaningless. > > As for using cgroup as !root user, I would advise not doing that. > Again, we're moving toward a unified cgroup hierarchy. You wouldn't > be creating multiple cgroup hierarchies and assigning different user > accesses to them. Also, I would strongly discourage chowning sub > directories in cgroupfs and letting non-priviledged users modify them > directly. So to be clear, if I want a user to be able to confine his own compute-intensive tasks and freeze them, the recommended route will be with privileged (setuid-root) helpers? -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? 2012-11-06 18:02 ` Serge Hallyn @ 2012-11-06 18:08 ` Tejun Heo 0 siblings, 0 replies; 21+ messages in thread From: Tejun Heo @ 2012-11-06 18:08 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, Serge. On Tue, Nov 06, 2012 at 12:02:33PM -0600, Serge Hallyn wrote: > So to be clear, if I want a user to be able to confine his own > compute-intensive tasks and freeze them, the recommended route will be > with privileged (setuid-root) helpers? Something like that. I think we'll eventually need a policy manager in userland which controls the whole hierarchy. Not sure how that will look at this point tho, but cgroupfs will be an interface to just expose cgroup configuration directly rather than allow multiplexing userland / namespace / whatnot multiplexing on top of it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 2012-11-06 17:41 ` Tejun Heo @ 2012-11-06 18:12 ` Serge Hallyn 2012-11-06 18:16 ` Tejun Heo 1 sibling, 1 reply; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 18:12 UTC (permalink / raw) To: Tejun Heo Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Hello, > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > > We can't generally require a capability to move tasks between cgroups, > > as that will break currently intended uses. I can create two cgroups, > > chown them to serge, and let serge move between them. > > Sure, then just live with the cgroupfs based permission check. What > next? Should we add CAP_SYS_RESOURCE check to all resource related That would be the next step, yes. > controllers? Moreover, We're headed to unified hierarchy, so in the > end that means only the user with almost all CAP_* can manipulate > cgroups at all making the whole thing meaningless. Why meaningless? Many caps needed to "do everything", but moving a task into the freezer and freezing it, or reducing its allowed memory, would only requiring uid equiv or some limited bit of privilege. > I don't think applying fine-grained CAP_* to cgroup controllers makes > sense or would be useful in any real sense. We can introduce, say, > CAP_CGROUP to control access cgroupfs but I think we already have > enough access control to cgroupfs, don't we? That's the question :) I feel like we need a list of the various uses people have in mind, so we can figure out which ones are supportable... but I know there is the whole long thread I've not had time to keep up with, and many answers are probably there. -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? 2012-11-06 18:12 ` Serge Hallyn @ 2012-11-06 18:16 ` Tejun Heo [not found] ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> 0 siblings, 1 reply; 21+ messages in thread From: Tejun Heo @ 2012-11-06 18:16 UTC (permalink / raw) To: Serge Hallyn Cc: Eric W. Biederman, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Hello, Serge. On Tue, Nov 06, 2012 at 12:12:14PM -0600, Serge Hallyn wrote: > Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > > > We can't generally require a capability to move tasks between cgroups, > > > as that will break currently intended uses. I can create two cgroups, > > > chown them to serge, and let serge move between them. > > > > Sure, then just live with the cgroupfs based permission check. What > > next? Should we add CAP_SYS_RESOURCE check to all resource related > > That would be the next step, yes. Hmmm... I don't know. What does that buy us? Fine grained CAP_* is to be able to give away privliedges in smaller pieces, right? If we're gonna be requiring root to modify cgroup, what difference does it make to have finer grained CAPs specified? > > controllers? Moreover, We're headed to unified hierarchy, so in the > > end that means only the user with almost all CAP_* can manipulate > > cgroups at all making the whole thing meaningless. > > Why meaningless? Many caps needed to "do everything", but moving > a task into the freezer and freezing it, or reducing its allowed > memory, would only requiring uid equiv or some limited bit of > privilege. All controllers will be co-mounted and you'll need all CAPs needed by all controllers to move tasks between cgroups and there won't be an easy way to tell which CAP is missing. Doesn't sound too useful to me. > > I don't think applying fine-grained CAP_* to cgroup controllers makes > > sense or would be useful in any real sense. We can introduce, say, > > CAP_CGROUP to control access cgroupfs but I think we already have > > enough access control to cgroupfs, don't we? > > That's the question :) > > I feel like we need a list of the various uses people have in mind, > so we can figure out which ones are supportable... but I know there > is the whole long thread I've not had time to keep up with, and > many answers are probably there. Care to give a pointer? Thanks. -- tejun ^ permalink raw reply [flat|nested] 21+ messages in thread
[parent not found: <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>]
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org> @ 2012-11-06 18:25 ` Serge Hallyn 0 siblings, 0 replies; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 18:25 UTC (permalink / raw) To: Tejun Heo Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Eric W. Biederman, Aristeu Rozanski Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > Hello, Serge. > > On Tue, Nov 06, 2012 at 12:12:14PM -0600, Serge Hallyn wrote: > > Quoting Tejun Heo (tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org): > > > On Tue, Nov 06, 2012 at 11:31:04AM -0600, Serge Hallyn wrote: > > > > We can't generally require a capability to move tasks between cgroups, > > > > as that will break currently intended uses. I can create two cgroups, > > > > chown them to serge, and let serge move between them. > > > > > > Sure, then just live with the cgroupfs based permission check. What > > > next? Should we add CAP_SYS_RESOURCE check to all resource related > > > > That would be the next step, yes. > > Hmmm... I don't know. What does that buy us? Fine grained CAP_* is > to be able to give away privliedges in smaller pieces, right? If > we're gonna be requiring root to modify cgroup, what difference does > it make to have finer grained CAPs specified? > > > > controllers? Moreover, We're headed to unified hierarchy, so in the > > > end that means only the user with almost all CAP_* can manipulate > > > cgroups at all making the whole thing meaningless. > > > > Why meaningless? Many caps needed to "do everything", but moving > > a task into the freezer and freezing it, or reducing its allowed > > memory, would only requiring uid equiv or some limited bit of > > privilege. > > All controllers will be co-mounted and you'll need all CAPs needed by Oh. I thought each controller could only be mounted once, but not that all must be co-mounted. Jinkeys. > all controllers to move tasks between cgroups and there won't be an > easy way to tell which CAP is missing. Doesn't sound too useful to > me. > > > > I don't think applying fine-grained CAP_* to cgroup controllers makes > > > sense or would be useful in any real sense. We can introduce, say, > > > CAP_CGROUP to control access cgroupfs but I think we already have > > > enough access control to cgroupfs, don't we? > > > > That's the question :) > > > > I feel like we need a list of the various uses people have in mind, > > so we can figure out which ones are supportable... but I know there > > is the whole long thread I've not had time to keep up with, and > > many answers are probably there. > > Care to give a pointer? I mean the recent thread you started on cgroup cleanups :) -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: Why does devices cgroup check for CAP_SYS_ADMIN explicitly? [not found] ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-11-06 15:43 ` Tejun Heo @ 2012-11-06 15:45 ` Serge Hallyn 1 sibling, 0 replies; 21+ messages in thread From: Serge Hallyn @ 2012-11-06 15:45 UTC (permalink / raw) To: Eric W. Biederman Cc: Tejun Heo, Aristeu Rozanski, cgroups-u79uwXL29TY76Z2rM5mHXA, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> writes: > > > Hello, Serge. > > > > On Tue, Nov 06, 2012 at 09:01:32AM -0600, Serge Hallyn wrote: > >> More practically, lacking user namespaces you can create a full (i.e. > >> ubuntu) container that doesn't have CAP_SYS_ADMIN, but not one without > >> root. So this allows you to prevent containers from bypassing devices > >> cgroup restrictions set by the parent. (In reality we are not using > >> that in ubuntu - we grant CAP_SYS_ADMIN and use apparmor to restrict - > >> but other distros do.) > > > > Do you even mount cgroupfs in containers? If you just bind-mount > > cgroupfs verbatim in containers, I don't think that's gonna work very > > well. If not, all this doesn't make any difference for containers. > > > > So, you don't really have any actual use case for the explicit CAP_* > > checks, right? > > Having thought about this a little more I can give a definitive answer. > > Adding a process to the device control group is equivalent to calling > mknod, as it allows that process to open device nodes, or equivalently > not open device nodes. Therefore a capable check is absolutely > required. > > Without a capability check it would be possible to remove access to > /dev/console for a suid root application keeping it from reporting > attempts to hack it for example. > > update_access can allow access to previously unaccessible devices > and so is equivalent to mknod and as such requires a capability call. > > static int devcgroup_update_access(struct dev_cgroup *devcgroup, > int filetype, const char *buffer) > { > .... > if (!capable(CAP_SYS_ADMIN)) > return -EPERM; > > > Likewise move to a different cgroup can give you a completely different > set of devices you can use. And is roughly equivalent to mknod, and > needs a capability call. > > static int devcgroup_can_attach(struct cgroup *new_cgrp, > struct cgroup_taskset *set) > { > struct task_struct *task = cgroup_taskset_first(set); > > if (current != task && !capable(CAP_SYS_ADMIN)) > return -EPERM; > > > The generic cgroup check in attach_task_by_pid to see if you can move > another process into a cgroup needs to be a capability call and not a > test for uid == 0. > > static int attach_task_by_pid(struct cgroup *cgrp, u64 pid, bool threadgroup) > { > if (pid) { > tsk = find_task_by_vpid(pid); > > /* > * even if we're attaching all tasks in the thread group, we > * only need to check permissions on one of them. > */ > tcred = __task_cred(tsk); > if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) && > ^^^^^^^^^^^^^^^ > !uid_eq(cred->euid, tcred->uid) && > !uid_eq(cred->euid, tcred->suid)) { > rcu_read_unlock(); > ret = -EACCES; > goto out_unlock_cgroup; > > Eric (full context kept, though long, bc it's all important) Note that part of the problem is simply that the devices cgroup is serving as a stand-in for the lack of both user and device namespaces. If those both existed, we could get rid of the devices cgroup. Likewise, the presence of the devices cgroup makes a device namespace far less compelling :) We can play games with bind mounts into /dev and devcgroup to do most of what we want a devicens for. -serge ^ permalink raw reply [flat|nested] 21+ messages in thread
end of thread, other threads:[~2012-11-06 18:25 UTC | newest]
Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-11-06 2:38 Why does devices cgroup check for CAP_SYS_ADMIN explicitly? Tejun Heo
[not found] ` <20121106023845.GI19354-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 11:58 ` Eric W. Biederman
[not found] ` <877gpzrlir.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 14:48 ` Tejun Heo
2012-11-06 15:01 ` Serge Hallyn
2012-11-06 15:06 ` Tejun Heo
[not found] ` <20121106150639.GB30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 15:30 ` Serge Hallyn
2012-11-06 15:41 ` Tejun Heo
[not found] ` <20121106154105.GD30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 16:12 ` Aristeu Rozanski
2012-11-06 15:34 ` Eric W. Biederman
[not found] ` <871ug6rbio.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 15:43 ` Tejun Heo
[not found] ` <20121106154320.GE30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 16:10 ` Eric W. Biederman
[not found] ` <87sj8mogpp.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>
2012-11-06 16:52 ` Tejun Heo
[not found] ` <20121106165246.GF30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 17:31 ` Serge Hallyn
2012-11-06 17:38 ` Tejun Heo
[not found] ` <20121106173823.GK30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 17:41 ` Tejun Heo
[not found] ` <20121106174130.GL30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 18:02 ` Serge Hallyn
2012-11-06 18:08 ` Tejun Heo
2012-11-06 18:12 ` Serge Hallyn
2012-11-06 18:16 ` Tejun Heo
[not found] ` <20121106181623.GO30069-9pTldWuhBndy/B6EtB590w@public.gmane.org>
2012-11-06 18:25 ` Serge Hallyn
2012-11-06 15:45 ` Serge Hallyn
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).