* [RFC] cgroup TODOs
@ 2012-09-13 20:58 Tejun Heo
  2012-09-14 11:15 ` Peter Zijlstra
  [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 75+ messages in thread
From: Tejun Heo @ 2012-09-13 20:58 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V,
    Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf,
    Serge E. Hallyn, Paul Turner, Ingo Molnar

Hello, guys.

Here's the write-up I promised last week about what I think are the
problems in cgroup and what the current plans are.

First of all, it's a mess.  Shame on me.  Shame on you.  Shame on all
of us for allowing this mess.  Let's all tremble in shame for a solid
ten seconds before proceeding.

I'll list the issues I currently see with cgroup (easier ones first).
I think I now have at least tentative plans for all of them and will
list them together with the prospective assignees (my wish, mostly).
Unfortunately, some of the plans involve userland-visible changes
which would at least cause some discomfort and require adjustments on
their part.

1. cpu and cpuacct

   They cover the same resources and the scheduler cgroup code ends up
   having to traverse two separate cgroup trees to update the stats.
   With nested cgroups, the overhead isn't insignificant and it
   generally is silly.

   While the use cases for having cpuacct on a separate and likely
   more granular hierarchy are somewhat valid, the consensus seems to
   be that it's just not worth the trouble, and cpuacct should be
   removed in the long term and we shouldn't allow overlapping
   controllers for the same resource, especially accounting ones.

   Solution:

   * Whine if cpuacct is not co-mounted with cpu.

   * Make sure cpu has all the stats of cpuacct.  If cpu and cpuacct
     are comounted, don't really mount cpuacct but tell cpu that the
     user requested it.
     cpu is updated to create aliases for cpuacct.* files in such
     cases.  This involves special-casing cpuacct in cgroup core but I
     much prefer a one-off exception case to adding a generic
     mechanism for this.

   * After a while, we can just remove cpuacct completely.

   * Later on, phase out the aliases too.

   Who:

   Me, working on it.

2. memcg's __DEPRECATED_clear_css_refs

   This is a remnant of another weird design decision of requiring
   synchronous draining of refcnts on cgroup removal and allowing
   subsystems to veto cgroup removal - what's the userspace supposed
   to do afterwards?  Note that this also hinders co-mounting
   different controllers.

   The behavior could be useful for development and debugging but it
   unnecessarily interlocks userland-visible behavior with in-kernel
   implementation details.  To me, it seems outright wrong (either
   implement proper severing semantics in the controller or do full
   refcnting) and disallows, for example, lazy drain of caching refs.
   Also, it complicates the removal path with try / commit / revert
   logic which has never been fully correct since the beginning.

   Currently, the only remaining user is memcg.

   Solution:

   * Update memcg->pre_destroy() such that it never fails.

   * Drop __DEPRECATED_clear_css_refs and all related logic.
     Convert pre_destroy() to return void.

   Who:

   KAMEZAWA, Michal, PLEASE.  I will make __DEPRECATED_clear_css_refs
   trigger WARN sooner or later.  Let's please get this settled.

3. cgroup_mutex usage outside cgroup core

   This is another thing which is simply broken.  Given the way
   cgroup is structured and used, nesting cgroup_mutex inside any
   other commonly used lock simply doesn't work - it's held while
   invoking controller callbacks which then interact and synchronize
   with various core subsystems.

   There are currently three external cgroup_mutex users - cpuset,
   memcontrol and cgroup_freezer.
   Solution:

   Well, we should just stop doing it - use a separate nested lock
   (which seems possible for cgroup_freezer) or track and manage task
   in/egress some other way.

   Who:

   I'll do the cgroup_freezer.  I'm hoping PeterZ or someone who's
   familiar with the code base takes care of cpuset.  Michal, can you
   please take care of memcg?

4. Make disabled controllers cheaper

   Mostly through the use of static_keys, I suppose.  Making this
   easier AFAICS depends on resolving #2.  The lock dependency loop
   from #2 makes using static_keys from cgroup callbacks extremely
   nasty.

   Solution:

   Fix #2 and support the common pattern from cgroup core.

   Who:

   Dunno.  Let's see.

5. I CAN HAZ HIERARCHIES?

   The cpu ones handle nesting correctly - parent's accounting
   includes children's, parent's configuration affects children's
   unless explicitly overridden, and children's limits nest inside
   parent's.

   memcg asked itself the existential question of to be hierarchical
   or not and then got confused and decided to become both.

   When faced with the same question, blkio and cgroup_freezer just
   gave up and decided to allow nesting and then ignore it -
   brilliant.

   And there are others which kinda sorta try to handle hierarchy but
   only go halfway.

   This one is screwed up embarrassingly badly.  We failed to
   establish one of the most basic semantics and can't even define
   what a cgroup hierarchy is - it depends on each controller and
   they're mostly wacky!

   Fortunately, I don't think it will be prohibitively difficult to
   dig ourselves out of this hole.

   Solution:

   * cpu ones seem fine.

   * For broken controllers, cgroup core will be generating warning
     messages if the user tries to nest cgroups so that the user at
     least can know that the behavior may change underneath them
     later on.  For more details,

     http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902

   * memcg can be fully hierarchical but we need to phase out the
     flat hierarchy support.  Unfortunately, this involves flipping
     the behavior for the existing users.
     Upstream will try to nudge users with warning messages.  Most of
     the burden would be on the distros and at least SUSE seems to be
     on board with it.  Needs coordination with other distros.

   * blkio is the most problematic.  It has two sub-controllers - cfq
     and blk-throttle.  Both are utterly broken in terms of hierarchy
     support and the former is known to have a pretty hairy code
     base.  I don't see any other way than just biting the bullet and
     fixing it.

   * cgroup_freezer and others shouldn't be too difficult to fix.

   Who:

   memcg can be handled by memcg people and I can handle
   cgroup_freezer and others with help from the authors.  The
   problematic one is blkio.  If anyone is interested in working on
   blkio, please be my guest.  Vivek?  Glauber?

6. Multiple hierarchies

   Apart from the apparent wheeeeeeeeness of it (I think I talked
   about that enough the last time[1]), there's a basic problem when
   more than one controller interacts - it's impossible to define a
   resource group when more than two controllers are involved because
   the intersection of different controllers is only defined in terms
   of tasks.  IOW, if an entity X is of interest to two controllers,
   there's no way to map X to the cgroups of the two controllers.  X
   may belong to A and B when viewed by one task but A' and B when
   viewed by another.  This already is a head-scratcher in writeback
   where blkcg and memcg have to interact.

   While I am pushing for a unified hierarchy, I think it's necessary
   to have different levels of granularity depending on controllers,
   given that nesting involves significant overhead and noticeable
   controller-dependent behavior changes.

   Solution:

   I think a unified hierarchy with the ability to ignore subtrees
   depending on controllers should work.  For example, let's assume
   the following hierarchy.

           R
          / \
         A   B
        / \
       AA  AB

   All controllers are co-mounted.  There is a per-cgroup knob which
   controls which controllers nest beyond it.
   If blkio doesn't want to distinguish AA and AB, the user can
   specify that blkio doesn't nest beyond A and blkio would see the
   tree as,

           R
          / \
         A   B

   while other controllers keep seeing the original tree.  The exact
   form of the interface, I don't know yet.  It could be a single
   file which the user echoes [-]controller name into, or a
   per-controller boolean file.

   I think this level of flexibility should be enough for most use
   cases.  If someone disagrees, please voice your objections now.

   I *think* this can be achieved by changing where css_set is bound.
   Currently, a css_set is (conceptually) owned by a task.  After the
   change, a cgroup in the unified hierarchy has its own css_set
   which tasks point to and which can also be used to tag resources
   as necessary.  This way, it should be achievable without
   introducing a lot of new code or affecting individual controllers
   too much.

   The headache will be the transition period where we'll probably
   have to support both modes of operation.  Oh well....

   Who:

   Li, Glauber and me, I guess?

7. Misc issues

   * Sort & unique when listing tasks.  Even the documentation says
     it doesn't happen but we have a good hunk of code doing it in
     cgroup.c.  I'm gonna rip it out at some point.  Again, if you
     don't like it, scream.

   * At the PLC, pjt told me that assigning threads of a cgroup to
     different cgroups is useful for some use cases but if we're to
     have a unified hierarchy, I don't think we can continue to do
     that.  Paul, can you please elaborate on the use case?

   * Vivek brought up the issue of distributing resources to tasks
     and groups in the same cgroup.  I don't know.  Need to think
     more about it.

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/857

^ permalink raw reply	[flat|nested] 75+ messages in thread
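[Editorial sketch] The per-cgroup nesting knob described above can be
modeled in a few lines of userspace C.  All names and structures here
are hypothetical illustrations, not kernel code: a controller's
effective cgroup is found by walking toward the root and collapsing
into the highest ancestor that declares the controller doesn't nest
beyond it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposal: each cgroup carries a
 * per-controller flag saying whether that controller distinguishes
 * the children below it. */
enum ctrl { CTRL_CPU, CTRL_BLKIO, CTRL_MAX };

struct cgroup {
	const char *name;
	struct cgroup *parent;
	bool nests_beyond[CTRL_MAX];	/* does @ctrl nest past this node? */
};

/* Walk up from @cg; every ancestor that clamps nesting for @c hides
 * the subtree below it, so the controller's view collapses into the
 * topmost clamping ancestor. */
static struct cgroup *effective_cgroup(struct cgroup *cg, enum ctrl c)
{
	struct cgroup *eff = cg;
	struct cgroup *p;

	for (p = cg->parent; p; p = p->parent)
		if (!p->nests_beyond[c])
			eff = p;
	return eff;
}
```

With the R / A,B / AA,AB tree from the mail and blkio clamped at A,
blkio resolves AA and AB to A while cpu still sees the full tree.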
* Re: [RFC] cgroup TODOs
  2012-09-13 20:58 [RFC] cgroup TODOs Tejun Heo
@ 2012-09-14 11:15 ` Peter Zijlstra
  2012-09-14 12:54   ` Daniel P. Berrange
  2012-09-14 17:53   ` Tejun Heo
  [not found]       ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2012-09-14 11:15 UTC (permalink / raw)
To: Tejun Heo
Cc: containers, cgroups, linux-kernel, Li Zefan, Michal Hocko,
    Glauber Costa, Paul Turner, Johannes Weiner, Thomas Graf,
    Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote:
> The cpu ones handle nesting correctly - parent's accounting includes
> children's, parent's configuration affects children's unless
> explicitly overridden, and children's limits nest inside parent's.

The implementation has some issues with fixed-point math limitations
on deep hierarchies/large cpu counts, but yes.

Doing soft-float/bignum just isn't going to be popular I guess ;-)

People also don't seem to understand that each extra cgroup carries a
cost and that nested cgroups are more expensive still, even if the
intermediate levels are mostly empty (libvirt is a good example of how
not to do things).

Anyway, I guess what I'm saying is that we need to work on awareness
of the cost associated with all this cgroup nonsense; people seem to
think it's all good and free -- or not think at all, which, while
depressing, seems the more likely option.

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 11:15 ` Peter Zijlstra @ 2012-09-14 12:54 ` Daniel P. Berrange [not found] ` <20120914125427.GW6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 17:53 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Daniel P. Berrange @ 2012-09-14 12:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, Serge E. Hallyn, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > The cpu ones handle nesting correctly - parent's accounting includes > > children's, parent's configuration affects children's unless > > explicitly overridden, and children's limits nest inside parent's. > > The implementation has some issues with fixed point math limitations on > deep hierarchies/large cpu count, but yes. > > Doing soft-float/bignum just isn't going to be popular I guess ;-) > > People also don't seem to understand that each extra cgroup carries a > cost and that nested cgroups are more expensive still, even if the > intermediate levels are mostly empty (libvirt is a good example of how > not to do things). > > Anyway, I guess what I'm saying is that we need to work on the awareness > of cost associated with all this cgroup nonsense, people seem to think > its all good and free -- or not think at all, which, while depressing, > seem the more likely option. In defense of what libvirt is doing, I'll point out that the kernel docs on cgroups make little to no mention of these performance / cost implications, and the examples of usage given arguably encourage use of deep hierarchies. 
Given what we've now learnt about the kernel's lack of scalability wrt cgroup hierarchies, we'll be changing the way libvirt deals with cgroups, to flatten it out to only use 1 level by default. If the kernel docs had clearly expressed the limitations & made better recommendations on app usage we would never have picked the approach we originally chose. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914125427.GW6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 8:55 ` Glauber Costa 0 siblings, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-14 8:55 UTC (permalink / raw) To: Daniel P. Berrange Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On 09/14/2012 04:54 PM, Daniel P. Berrange wrote: > On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: >> On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: >>> The cpu ones handle nesting correctly - parent's accounting includes >>> children's, parent's configuration affects children's unless >>> explicitly overridden, and children's limits nest inside parent's. >> >> The implementation has some issues with fixed point math limitations on >> deep hierarchies/large cpu count, but yes. >> >> Doing soft-float/bignum just isn't going to be popular I guess ;-) >> >> People also don't seem to understand that each extra cgroup carries a >> cost and that nested cgroups are more expensive still, even if the >> intermediate levels are mostly empty (libvirt is a good example of how >> not to do things). >> >> Anyway, I guess what I'm saying is that we need to work on the awareness >> of cost associated with all this cgroup nonsense, people seem to think >> its all good and free -- or not think at all, which, while depressing, >> seem the more likely option. > > In defense of what libvirt is doing, I'll point out that the kernel > docs on cgroups make little to no mention of these performance / cost > implications, and the examples of usage given arguably encourage use > of deep hierarchies. 
> > Given what we've now learnt about the kernel's lack of scalability > wrt cgroup hierarchies, we'll be changing the way libvirt deals with > cgroups, to flatten it out to only use 1 level by default. If the > kernel docs had clearly expressed the limitations & made better > recommendations on app usage we would never have picked the approach > we originally chose. > > Regards, > Daniel > I personally don't think this is such a crazy setup. It is perfectly valid to say "all applications managed by libvirt as a whole cannot use more than X". Now of course there are other ways to do it, and we really need to make people more aware of the costs... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 11:15 ` Peter Zijlstra 2012-09-14 12:54 ` Daniel P. Berrange @ 2012-09-14 17:53 ` Tejun Heo 1 sibling, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:53 UTC (permalink / raw) To: Peter Zijlstra Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V Hello, Peter. On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > The cpu ones handle nesting correctly - parent's accounting includes > > children's, parent's configuration affects children's unless > > explicitly overridden, and children's limits nest inside parent's. > > The implementation has some issues with fixed point math limitations on > deep hierarchies/large cpu count, but yes. > > Doing soft-float/bignum just isn't going to be popular I guess ;-) As things currently stand, I think the cpu stuff is high enough bar to aim for. That said, I do have some problems with how it handles tasks vs. groups. Will talk about in another reply. > People also don't seem to understand that each extra cgroup carries a > cost and that nested cgroups are more expensive still, even if the > intermediate levels are mostly empty (libvirt is a good example of how > not to do things). > > Anyway, I guess what I'm saying is that we need to work on the awareness > of cost associated with all this cgroup nonsense, people seem to think > its all good and free -- or not think at all, which, while depressing, > seem the more likely option. The decision may not have been conscious but it seems that we settled on the direction where cgroup does more hierarchy-wise rather than leaving non-scalable operations to each use case - e.g. 
filesystem trees are very scalable but for that they give up a lot of tree-aware things like knowing the size of a given subtree. For what cgroup does, I think the naturally chosen direction is the right one. Its functionality inherently requires more involvement with the tree structure and we of course should try to document the implications clearly and make things scale better where we can (e.g. stat propagation has no reason to happen on every update). Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
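[Editorial sketch] Tejun's parenthetical that stat propagation has no
reason to happen on every update can be illustrated with a small
userspace model (hypothetical names, not the kernel implementation):
updates accumulate in a local delta and are only propagated up the
ancestor chain once they cross a batch threshold, so the common path
stays local instead of walking the whole tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batched hierarchical counter: each group keeps a
 * local unflushed delta; hierarchical totals are only updated when
 * the delta crosses BATCH (or on an explicit flush). */
#define BATCH 32

struct grp {
	struct grp *parent;
	long total;	/* hierarchical total, updated lazily */
	long delta;	/* local updates not yet propagated */
};

/* Propagate the accumulated delta up the ancestor chain. */
static void grp_flush(struct grp *g)
{
	long d = g->delta;
	struct grp *p;

	g->delta = 0;
	for (p = g; p; p = p->parent)
		p->total += d;
}

/* Fast path: purely local unless the delta grows large enough. */
static void grp_charge(struct grp *g, long n)
{
	g->delta += n;
	if (g->delta >= BATCH || g->delta <= -BATCH)
		grp_flush(g);
}
```

Readers of `total` see a value that is stale by at most BATCH per
descendant, which is the usual trade-off such batching makes.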
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 8:16 ` Glauber Costa [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-14 9:04 ` Mike Galbraith ` (8 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-14 8:16 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar First: Can we please keep some key userspace guys CCd? > 1. cpu and cpuacct > > They cover the same resources and the scheduler cgroup code ends up > having to traverse two separate cgroup trees to update the stats. > With nested cgroups, the overhead isn't insignificant and it > generally is silly. > > While the use cases for having cpuacct on a separate and likely more > granular hierarchy, are somewhat valid, the consensus seems that > it's just not worth the trouble and cpuacct should be removed in the > long term and we shouldn't allow overlapping controllers for the > same resource, especially accounting ones. > > Solution: > > * Whine if cpuacct is not co-mounted with cpu. > > * Make sure cpu has all the stats of cpuacct. If cpu and cpuacct > are comounted, don't really mount cpuacct but tell cpu that the > user requested it. cpu is updated to create aliases for cpuacct.* > files in such cases. This involves special casing cpuacct in > cgroup core but I much prefer one-off exception case to adding a > generic mechanism for this. > > * After a while, we can just remove cpuacct completely. > > * Later on, phase out the aliases too. > > Who: > > Me, working on it. I can work on it as well if you want. 
I dealt with it many times in the past, and tried some different approaches, so I am familiar. But if you're already doing it, be my guest... > > 2. memcg's __DEPRECATED_clear_css_refs > > This is a remnant of another weird design decision of requiring > synchronous draining of refcnts on cgroup removal and allowing > subsystems to veto cgroup removal - what's the userspace supposed to > do afterwards? Note that this also hinders co-mounting different > controllers. > > The behavior could be useful for development and debugging but it > unnecessarily interlocks userland visible behavior with in-kernel > implementation details. To me, it seems outright wrong (either > implement proper severing semantics in the controller or do full > refcnting) and disallows, for example, lazy drain of caching refs. > Also, it complicates the removal path with try / commit / revert > logic which has never been fully correct since the beginning. > > Currently, the only left user is memcg. > > Solution: > > * Update memcg->pre_destroy() such that it never fails. > > * Drop __DEPRECATED_clear_css_refs and all related logic. > Convert pre_destroy() to return void. > > Who: > > KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs > trigger WARN sooner or later. Let's please get this settled. > > 3. cgroup_mutex usage outside cgroup core > > This is another thing which is simply broken. Given the way cgroup > is structured and used, nesting cgroup_mutex inside any other > commonly used lock simply doesn't work - it's held while invoking > controller callbacks which then interact and synchronize with > various core subsystems. > > There are currently three external cgroup_mutex users - cpuset, > memcontrol and cgroup_freezer. > > Solution: > > Well, we should just stop doing it - use a separate nested lock > (which seems possible for cgroup_freezer) or track and mange task > in/egress some other way. > > Who: > > I'll do the cgroup_freezer. 
> I'm hoping PeterZ or someone who's
> familiar with the code base takes care of cpuset.  Michal, can you
> please take care of memcg?
>

I think this is a pressing problem, yes, but not the only problem with
cgroup lock. Even if we restrict its usage to cgroup core, we can
still call cgroup functions, which will lock. And then we gain
nothing.

And the problem is that people need to lock. cgroup_lock is needed
because the data you are accessing is protected by it. The way I see
it, it is incredible how we were able to revive the BKL in the form of
cgroup_lock after we finally managed to get rid of it!

We should just start doing more fine-grained locking of data, instead
of "stop the world, cgroup just started!". If we do that, the problem
you are trying to address here will even cease to exist.

> 4. Make disabled controllers cheaper
>
>    Mostly through the use of static_keys, I suppose.  Making this
>    easier AFAICS depends on resolving #2.  The lock dependency loop
>    from #2 makes using static_keys from cgroup callbacks extremely
>    nasty.
>
>    Solution:
>
>    Fix #2 and support common pattern from cgroup core.
>
>    Who:
>
>    Dunno.  Let's see.

I've been doing it for the kmem-related controllers, and by trying to
do it with cpu/cpuacct, I became quite familiar with the corner cases,
etc. I can happily tackle it.

>
> 5. I CAN HAZ HIERARCHIES?
>
>    The cpu ones handle nesting correctly - parent's accounting includes
>    children's, parent's configuration affects children's unless
>    explicitly overridden, and children's limits nest inside parent's.
>
>    memcg asked itself the existential question of to be hierarchical or
>    not and then got confused and decided to become both.
>
>    When faced with the same question, blkio and cgroup_freezer just
>    gave up and decided to allow nesting and then ignore it - brilliant.
>
>    And there are others which kinda sorta try to handle hierarchy but
>    only goes way-half.
>
>    This one is screwed up embarrassingly badly.
>    We failed to establish
>    one of the most basic semantics and can't even define what a cgroup
>    hierarchy is - it depends on each controller and they're mostly
>    wacky!
>
>    Fortunately, I don't think it will be prohibitively difficult to dig
>    ourselves out of this hole.
>
>    Solution:
>
>    * cpu ones seem fine.
>
>    * For broken controllers, cgroup core will be generating warning
>      messages if the user tries to nest cgroups so that the user at
>      least can know that the behavior may change underneath them later
>      on.  For more details,
>
>      http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902
>
>    * memcg can be fully hierarchical but we need to phase out the flat
>      hierarchy support.  Unfortunately, this involves flipping the
>      behavior for the existing users.  Upstream will try to nudge users
>      with warning messages.  Most burden would be on the distros and at
>      least SUSE seems to be on board with it.  Needs coordination with
>      other distros.
>
>    * blkio is the most problematic.  It has two sub-controllers - cfq
>      and blk-throttle.  Both are utterly broken in terms of hierarchy
>      support and the former is known to have pretty hairy code base.  I
>      don't see any other way than just biting the bullet and fixing it.
>
>    * cgroup_freezer and others shouldn't be too difficult to fix.
>
>    Who:
>
>    memcg can be handled by memcg people and I can handle cgroup_freezer
>    and others with help from the authors.  The problematic one is
>    blkio.  If anyone is interested in working on blkio, please be my
>    guest.  Vivek?  Glauber?

I am happy to help where manpower is needed, but I must note I am a
bit ignorant of block in general.

>
> 6.
Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy. > > R > / \ > A B > / \ > AA AB > > All controllers are co-mounted. There is per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as, > > R > / \ > A B > > While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > Do you realize this is the exact same thing I proposed in our last round, and you keep screaming saying you wanted something else, right? 
The only difference is that the discussion at the time started from a
forced-comount patch, but that is not the core of the question. For
what you are proposing to make sense, the controllers need to be
comounted, and at some point we'll have to enforce it. Be it now or in
the future. But as for what to do when they are in fact comounted, I
see no difference between what you are saying and what I said.

> I *think* this can be achieved by changing where css_set is bound.
> Currently, a css_set is (conceptually) owned by a task.  After the
> change, a cgroup in the unified hierarchy has its own css_set which
> tasks point to and can also be used to tag resources as necessary.
> This way, it should be achieveable without introducing a lot of new
> code or affecting individual controllers too much.
>
> The headache will be the transition period where we'll probably have
> to support both modes of operation.  Oh well....
>
> Who:
>
> Li, Glauber and me, I guess?
>
> 7. Misc issues
>
>    * Sort & unique when listing tasks.  Even the documentation says it
>      doesn't happen but we have a good hunk of code doing it in
>      cgroup.c.  I'm gonna rip it out at some point.  Again, if you
>      don't like it, scream.
>

In all honesty, I never noticed that. ugh

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-14 9:12 ` Li Zefan [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2012-09-14 17:43 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Li Zefan @ 2012-09-14 9:12 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner >> >> 2. memcg's __DEPRECATED_clear_css_refs >> >> This is a remnant of another weird design decision of requiring >> synchronous draining of refcnts on cgroup removal and allowing >> subsystems to veto cgroup removal - what's the userspace supposed to >> do afterwards? Note that this also hinders co-mounting different >> controllers. >> >> The behavior could be useful for development and debugging but it >> unnecessarily interlocks userland visible behavior with in-kernel >> implementation details. To me, it seems outright wrong (either >> implement proper severing semantics in the controller or do full >> refcnting) and disallows, for example, lazy drain of caching refs. >> Also, it complicates the removal path with try / commit / revert >> logic which has never been fully correct since the beginning. >> >> Currently, the only left user is memcg. >> >> Solution: >> >> * Update memcg->pre_destroy() such that it never fails. >> >> * Drop __DEPRECATED_clear_css_refs and all related logic. >> Convert pre_destroy() to return void. >> >> Who: >> >> KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs >> trigger WARN sooner or later. Let's please get this settled. >> >> 3. cgroup_mutex usage outside cgroup core >> >> This is another thing which is simply broken. 
Given the way cgroup >> is structured and used, nesting cgroup_mutex inside any other >> commonly used lock simply doesn't work - it's held while invoking >> controller callbacks which then interact and synchronize with >> various core subsystems. >> >> There are currently three external cgroup_mutex users - cpuset, >> memcontrol and cgroup_freezer. >> >> Solution: >> >> Well, we should just stop doing it - use a separate nested lock >> (which seems possible for cgroup_freezer) or track and mange task >> in/egress some other way. >> >> Who: >> >> I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's >> familiar with the code base takes care of cpuset. Michal, can you >> please take care of memcg? >> > > I think this is a pressing problem, yes, but not the only problem with > cgroup lock. Even if we restrict its usage to cgroup core, we still can > call cgroup functions, which will lock. And then we gain nothing. > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist empty the tasks in it will be moved to an ancestor cgroup, which requires holding cgroup lock. We have to either change cpuset's behavior or eliminate the global lock. > And the problem is that people need to lock. cgroup_lock is needed > because the data you are accessing is protected by it. The way I see it, > it is incredible how we were able to revive the BKL in the form of > cgroup_lock after we finally manage to successfully get rid of it! > > We should just start to do a more fine grained locking of data, instead > of "stop the world, cgroup just started!". If we do that, the problem > you are trying to address here will even cease to exist. > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> @ 2012-09-14 11:22 ` Peter Zijlstra 2012-09-14 17:59 ` Tejun Heo 1 sibling, 0 replies; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 11:22 UTC (permalink / raw) To: Li Zefan Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Fri, 2012-09-14 at 17:12 +0800, Li Zefan wrote: > > I think this is a pressing problem, yes, but not the only problem with > > cgroup lock. Even if we restrict its usage to cgroup core, we still can > > call cgroup functions, which will lock. And then we gain nothing. > > > > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > empty the tasks in it will be moved to an ancestor cgroup, which requires > holding cgroup lock. We have to either change cpuset's behavior or eliminate > the global lock. PJ (the original cpuset author) has always been very conservative in changing cpuset semantics/behaviour. Its being used at the big HPC labs and those people simply don't like change. It also ties in with us having to preserve ABI, Linus says you can only do so if nobody notices -- if a tree falls in a forest and there's nobody to hear it, it really didn't fall at all. Which I guess means we're going to have to split locks :-) ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2012-09-14 11:22 ` Peter Zijlstra @ 2012-09-14 17:59 ` Tejun Heo [not found] ` <20120914175944.GF17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:59 UTC (permalink / raw) To: Li Zefan Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner Hello, On Fri, Sep 14, 2012 at 05:12:31PM +0800, Li Zefan wrote: > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > empty the tasks in it will be moved to an ancestor cgroup, which requires > holding cgroup lock. We have to either change cpuset's behavior or eliminate > the global lock. Does that have to happen synchronously? Can't we have a cgroup operation which asynchronously pushes all tasks in a cgroup to its parent from a work item? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914175944.GF17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 18:23 ` Peter Zijlstra 2012-09-14 18:33 ` Tejun Heo 0 siblings, 1 reply; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 18:23 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, 2012-09-14 at 10:59 -0700, Tejun Heo wrote: > Hello, > > On Fri, Sep 14, 2012 at 05:12:31PM +0800, Li Zefan wrote: > > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > > empty the tasks in it will be moved to an ancestor cgroup, which requires > > holding cgroup lock. We have to either change cpuset's behavior or eliminate > > the global lock. > > Does that have to happen synchronously? Can't we have a cgroup > operation which asynchronously pushes all tasks in a cgroup to its > parent from a work item? Its hotplug, all hotplug stuff is synchronous, the last thing hotplug needs is the added complexity of async callbacks. Also pushing stuff out into worklets just to work around locking issues is vile. <handwave as I never can remember all the cgroup stuff/> Can't we play games by pinning both cgroups with a reference and playing games with threadgroup_change / task_lock for the individual tasks being moved about? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 18:23 ` Peter Zijlstra @ 2012-09-14 18:33 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 08:23:41PM +0200, Peter Zijlstra wrote: > Its hotplug, all hotplug stuff is synchronous, the last thing hotplug > needs is the added complexity of async callbacks. Also pushing stuff out > into worklets just to work around locking issues is vile. I was asking whether it *has* to be part of synchronous CPU hotplug operation. IOW, do all tasks in the depleted cgroup have to be moved to its parent before CPU hotunplug can proceed to completion or is it okay to happen afterwards? Making the migration part asynchronous doesn't add much complexity. The only thing you have to make sure is flushing the previously scheduled one from the next CPU_UP_PREPARE. Also note that this can't easily be solved by splitting tree protecting inner lock from the outer lock. We're talking about doing full migration operations which likely require the outer one too. > <handwave as I never can remember all the cgroup stuff/> > > Can't we play games by pinning both cgroups with a reference and playing > games with threadgroup_change / task_lock for the individual tasks being > moved about? I'm lost. Can you please elaborate? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-14 9:12 ` Li Zefan @ 2012-09-14 17:43 ` Tejun Heo [not found] ` <20120914174329.GD17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:43 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Glauber. On Fri, Sep 14, 2012 at 12:16:31PM +0400, Glauber Costa wrote: > Can we please keep some key userspace guys CCd? Yeap, thanks for adding the ccs. > > 1. cpu and cpuacct ... > > Me, working on it. > I can work on it as well if you want. I dealt with it many times in > the past, and tried some different approaches, so I am familiar. But > if you're already doing it, be my guest... I'm trying something minimal which can serve as basis for the actual work. I think I figured it out mostly and will probably post it later today. Will squeak if I get stuck. > > I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's > > familiar with the code base takes care of cpuset. Michal, can you > > please take care of memcg? > > I think this is a pressing problem, yes, but not the only problem with > cgroup lock. Even if we restrict its usage to cgroup core, we still can > call cgroup functions, which will lock. And then we gain nothing. Can you be a bit more specific? > And the problem is that people need to lock. cgroup_lock is needed > because the data you are accessing is protected by it. The way I see it, > it is incredible how we were able to revive the BKL in the form of > cgroup_lock after we finally manage to successfully get rid of it! I wouldn't go as far as comparing it to BKL. 
> We should just start to do a more fine grained locking of data, instead > of "stop the world, cgroup just started!". If we do that, the problem > you are trying to address here will even cease to exist. > I'd much prefer keeping locking as simple and dumb as possible. Let's break it up only as absolutely necessary. > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > I am happy to help where manpower is needed, but I must note I am a bit > ignorant of block in general. I think blkcg can definitely make use of more manpower. ATM, there are two big things to do. * Fix hierarchy support. * Fix handling of writeback. Both are fairly big chunks of work. > > 6. Multiple hierarchies > > Do you realize this is the exact same thing I proposed in our last > round, and you keep screaming saying you wanted something else, right? > > The only difference is that the discussion at the time started by a > forced-comount patch, but that is not the core of the question. For what > you are proposing to make sense, the controllers need to be comounted, > and at some point we'll have to enforce it. Be it now or in the future. > But what to do when they are in fact comounted, I see no difference from > what you are saying, and what I said. Maybe I misunderstood you, or - more likely, since you're still talking about forced co-mounts - you're still misunderstanding. From what you told PeterZ, it seemed like you were thinking that this somehow will get rid of differing hierarchies depending on specific controllers and thus will help, for example, the optimization issues between cpu and cpuacct.
Going back to the above example,

     Unified tree             Controller Y's view
     controller X's view

          R                        R
         / \                      / \
        A   B                    A   B
       / \
      AA  AB

If a task is assigned to, or a resource is tagged with, AA, for controller X it'll map to AA and for controller Y to A, so we would still need css_set, which actually becomes the primary resource tag and may point to different subsystem states depending on the specific controller. If that is the direction we're headed, forcing co-mounts at this point doesn't make any sense. We'll make things which are possible today impossible for quite a while and then restore part of it, which is a terrible transition plan. What we need to do is nudge the current users away from practices which hinder implementation of the final form and then transition to it gradually. If you still don't understand, I don't know what more I can do to help. > > 7. Misc issues > > > > * Sort & unique when listing tasks. Even the documentation says it > > doesn't happen but we have a good hunk of code doing it in > > cgroup.c. I'm gonna rip it out at some point. Again, if you > > don't like it, scream. > > In all honesty, I never noticed that. ugh Yeah, tell me about it. :( Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914174329.GD17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 8:50 ` Glauber Costa [not found] ` <5056E467.2090108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:50 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Daniel P. Berrange, Lennart Poettering, Kay Sievers On 09/14/2012 09:43 PM, Tejun Heo wrote: > Hello, Glauber. > > On Fri, Sep 14, 2012 at 12:16:31PM +0400, Glauber Costa wrote: >> Can we please keep some key userspace guys CCd? > > Yeap, thanks for adding the ccs. > >>> 1. cpu and cpuacct > ... >>> Me, working on it. >> I can work on it as well if you want. I dealt with it many times in >> the past, and tried some different approaches, so I am familiar. But >> if you're already doing it, be my guest... > > I'm trying something minimal which can serve as basis for the actual > work. I think I figured it out mostly and will probably post it later > today. Will squeak if I get stuck. > >>> I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's >>> familiar with the code base takes care of cpuset. Michal, can you >>> please take care of memcg? >> >> I think this is a pressing problem, yes, but not the only problem with >> cgroup lock. Even if we restrict its usage to cgroup core, we still can >> call cgroup functions, which will lock. And then we gain nothing. > > Can you be a bit more specific? > What I mean is that if some operation needs to operate locked, they will have to lock. Whether or not the locking is called from cgroup core or not. 
If the lock is not available outside, people will end up calling a core function that locks. >> And the problem is that people need to lock. cgroup_lock is needed >> because the data you are accessing is protected by it. The way I see it, >> it is incredible how we were able to revive the BKL in the form of >> cgroup_lock after we finally manage to successfully get rid of it! > > I wouldn't go as far as comparing it to BKL. > Of course not, since it is not system-wide. But I think the comparison still holds in spirit... >> Do you realize this is the exact same thing I proposed in our last >> round, and you keep screaming saying you wanted something else, right? >> >> The only difference is that the discussion at the time started by a >> forced-comount patch, but that is not the core of the question. For that >> you are proposing to make sense, the controllers need to be comounted, >> and at some point we'll have to enforce it. Be it now or in the future. >> But what to do when they are in fact comounted, I see no difference from >> what you are saying, and what I said. > > Maybe I misunderstood you or from still talking about forced co-mounts > more likely you're still misunderstanding. From what you told PeterZ, > it seemed like you were thinking that this somehow will get rid of > differing hierarchies depending on specific controllers and thus will > help, for example, the optimization issues between cpu and cpuacct. > Going back to the above example, > > Unified tree Controller Y's view > controller X's view > > R R > / \ / \ > A B A B > / \ > AA AB > > If a task assigned to or resourced tagged with AA, for controller X > it'll map to AA and for controller Y to A, so we would still need > css_set, which actually becomes the primary resource tag and may point > to different subsystem states depending on the specific controller. > > If that is the direction we're headed, forcing co-mounts at this point > doesn't make any sense. 
We'll make things which are possible today > impossible for quite a while and then restore part of it, which is a > terrible transition plan. What we need to do is nudging the current > users away from practices which hinder implementation of the final > form and then transition to it gradually. > > If you still don't understand, I don't know what more I can do to > help. > you seem to hear "comount", and think of unified vision, and that is the reason for this discussion to still be going on. Mounting is all about the root. And if you comount, hierarchies have the same root. In your example, the different controllers are comounted. They have not the same view, but the possible views are restricted to be a subset of the underlying tree - because they are mounted in the same place, forced or not. In a situation like this, it makes all the sense in the world to use the css_id as a primary identifier, because it will be guaranteed to be the same. What makes the tree overly flexible, is that you can have multiple roots, starting in multiple places, with arbitrary topologies downwards. If you still don't understand, I don't know what more I can do to help. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5056E467.2090108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-17 17:21 ` Tejun Heo [not found] ` <20120917172123.GB18677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-17 17:21 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Glauber. On Mon, Sep 17, 2012 at 12:50:47PM +0400, Glauber Costa wrote: > > Can you be a bit more specific? > > What I mean is that if some operation needs to operate locked, they will > have to lock. Whether or not the locking is called from cgroup core or > not. If the lock is not available outside, people will end up calling a > core function that locks. I was asking whether you have certain specific operations on mind. > >> And the problem is that people need to lock. cgroup_lock is needed > >> because the data you are accessing is protected by it. The way I see it, > >> it is incredible how we were able to revive the BKL in the form of > >> cgroup_lock after we finally manage to successfully get rid of it! > > > > I wouldn't go as far as comparing it to BKL. > > Of course not, since it is not system-wide. But I think the comparison > still holds in spirit... Subsystem-wide locks covering non-hot paths aren't evil things. We have a lot of them and they work fine. BKL was a completely different beast initially with implicit locking on kernel entry and unlocking on sleeping and then got morphed into some chimera inbetween afterwards. Simple locking is a good thing. If finer-grained locking is necessary, we sure do that but please stop throwing over-generalized half-arguments at it. It doesn't help anything. 
> you seem to hear "comount", and think of unified vision, and that is the > reason for this discussion to still be going on. Mounting is all about > the root. And if you comount, hierarchies have the same root. > > In your example, the different controllers are comounted. They do not have > the same view, but the possible views are restricted to be a subset of > the underlying tree - because they are mounted in the same place, forced > or not. Heh, I can't really tell whether you understand it or not. Here and in the previous thread too. You seem to understand that there are different views up to this point. > In a situation like this, it makes all the sense in the world to use the > css_id as a primary identifier, because it will be guaranteed to be the And then you say something like this (or that this would remove walking different hierarchies in the previous thread - yes, to a certain point but not completely). css_id is a per-css attribute. How can that be the "primary" identifier when there can be multiple views? For each userland-visible cgroup, there must be a css_set which points to the css's belonging to it, which may not be at the same level - multiple nodes in the userland visible tree may point to the same css. If you mean that css_id would be the primary identifier for that specific controller's css, why even say that? That's true now and won't ever change. > same. What makes the tree overly flexible, is that you can have multiple > roots, starting in multiple places, with arbitrary topologies downwards. And now you seem to be on the same page again. But then again, you're asserting that incorporating forced co-mounts *now* is a gradual step towards the goal, which is utterly bonkers. I don't know. I just can't understand what you're thinking at all. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120917172123.GB18677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-18 8:16 ` Glauber Costa 0 siblings, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-18 8:16 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On 09/17/2012 09:21 PM, Tejun Heo wrote: > Hello, Glauber. > > On Mon, Sep 17, 2012 at 12:50:47PM +0400, Glauber Costa wrote: >>> Can you be a bit more specific? >> >> What I mean is that if some operation needs to operate locked, they will >> have to lock. Whether or not the locking is called from cgroup core or >> not. If the lock is not available outside, people will end up calling a >> core function that locks. > > I was asking whether you have certain specific operations on mind. > >>>> And the problem is that people need to lock. cgroup_lock is needed >>>> because the data you are accessing is protected by it. The way I see it, >>>> it is incredible how we were able to revive the BKL in the form of >>>> cgroup_lock after we finally manage to successfully get rid of it! >>> >>> I wouldn't go as far as comparing it to BKL. >> >> Of course not, since it is not system-wide. But I think the comparison >> still holds in spirit... > > Subsystem-wide locks covering non-hot paths aren't evil things. We > have a lot of them and they work fine. BKL was a completely different > beast initially with implicit locking on kernel entry and unlocking on > sleeping and then got morphed into some chimera inbetween afterwards. > > Simple locking is a good thing. If finer-grained locking is > necessary, we sure do that but please stop throwing over-generalized > half-arguments at it. It doesn't help anything. 
> >> you seem to hear "comount", and think of unified vision, and that is the >> reason for this discussion to still be going on. Mounting is all about >> the root. And if you comount, hierarchies have the same root. >> >> In your example, the different controllers are comounted. They have not >> the same view, but the possible views are restricted to be a subset of >> the underlying tree - because they are mounted in the same place, forced >> or not. > > Heh, I can't really tell whether you understand it or not. Here and > in the previous thread too. You seem to understand that there are > different views upto this point. > >> In a situation like this, it makes all the sense in the world to use the >> css_id as a primary identifier, because it will be guaranteed to be the > > And then you say something like this (or that this would remove > walking different hierarchies in the previous thread - yes, to a > certain point but not completely). css_id is a per-css attribute. > How can that be the "primariy" identifier when there can be multiple > views? For each userland-visible cgroup, there must be a css_set > which points to the css's belonging to it, which may not be at the > same level - multiple nodes in the userland visible tree may point to > the same css. > > If you mean that css_id would be the primary identifier for that > specific controller's css, why even say that? That's true now and > won't ever change. > >> same. What makes the tree overly flexible, is that you can have multiple >> roots, starting in multiple places, with arbitrary topologies downwards. > > And now you seem to be on the same page again. But then again, you're > asserting that incorporating forced co-mounts *now* is a gradual step > towards the goal, which is utterly bonkers. I don't know. I just > can't understand what you're thinking at all. > > Thanks. > I will just stop, because i am not trying to convince you to do anything different than you are proposing now. 
I am just trying to convince you that what I have been saying has the exact same effect as this. So let us focus our energies on the actual work. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-14 8:16 ` Glauber Costa @ 2012-09-14 9:04 ` Mike Galbraith [not found] ` <1347613484.4340.132.camel-YqMYhexLQo31wTEvPJ5Q0F6hYfS7NtTn@public.gmane.org> 2012-09-14 9:10 ` Daniel P. Berrange ` (7 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Mike Galbraith @ 2012-09-14 9:04 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > 7. Misc issues > * Extract synchronize_rcu() from user interface? Exporting grace periods to userspace isn't wonderful for dynamic launchers. -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <1347613484.4340.132.camel-YqMYhexLQo31wTEvPJ5Q0F6hYfS7NtTn@public.gmane.org> @ 2012-09-14 17:17 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:17 UTC (permalink / raw) To: Mike Galbraith Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 11:04:44AM +0200, Mike Galbraith wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > > 7. Misc issues > > > * Extract synchronize_rcu() from user interface? Exporting grace > periods to userspace isn't wonderful for dynamic launchers. Aye aye. Also, * Update doc. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-14 8:16 ` Glauber Costa 2012-09-14 9:04 ` Mike Galbraith @ 2012-09-14 9:10 ` Daniel P. Berrange [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 14:25 ` Vivek Goyal ` (6 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Daniel P. Berrange @ 2012-09-14 9:10 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > 5. I CAN HAZ HIERARCHIES? > > The cpu ones handle nesting correctly - parent's accounting includes > children's, parent's configuration affects children's unless > explicitly overridden, and children's limits nest inside parent's. > > memcg asked itself the existential question of to be hierarchical or > not and then got confused and decided to become both. > > When faced with the same question, blkio and cgroup_freezer just > gave up and decided to allow nesting and then ignore it - brilliant. > > And there are others which kinda sorta try to handle hierarchy but > only goes way-half. > > This one is screwed up embarrassingly badly. We failed to establish > one of the most basic semantics and can't even define what a cgroup > hierarchy is - it depends on each controller and they're mostly > wacky! > > Fortunately, I don't think it will be prohibitively difficult to dig > ourselves out of this hole. > > Solution: > > * cpu ones seem fine. > > * For broken controllers, cgroup core will be generating warning > messages if the user tries to nest cgroups so that the user at > least can know that the behavior may change underneath them later > on. 
For more details, > > http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902 > > * memcg can be fully hierarchical but we need to phase out the flat > hierarchy support. Unfortunately, this involves flipping the > behavior for the existing users. Upstream will try to nudge users > with warning messages. Most burden would be on the distros and at > least SUSE seems to be on board with it. Needs coordination with > other distros. > > * blkio is the most problematic. It has two sub-controllers - cfq > and blk-throttle. Both are utterly broken in terms of hierarchy > support and the former is known to have pretty hairy code base. I > don't see any other way than just biting the bullet and fixing it. > > * cgroup_freezer and others shouldn't be too difficult to fix. > > Who: > > memcg can be handled by memcg people and I can handle cgroup_freezer > and others with help from the authors. The problematic one is > blkio. If anyone is interested in working on blkio, please be my > guest. Vivek? Glauber? > > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. 
> > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy.
>
>        R
>       / \
>      A   B
>     / \
>    AA  AB
>
> All controllers are co-mounted. There is a per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as,
>
>        R
>       / \
>      A   B
>
> While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or a per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > > I *think* this can be achieved by changing where css_set is bound. > Currently, a css_set is (conceptually) owned by a task. After the > change, a cgroup in the unified hierarchy has its own css_set which > tasks point to and can also be used to tag resources as necessary. > This way, it should be achievable without introducing a lot of new > code or affecting individual controllers too much. > > The headache will be the transition period where we'll probably have > to support both modes of operation. Oh well.... > > Who: > > Li, Glauber and me, I guess? FWIW, from the POV of libvirt and its KVM/LXC drivers, I think that co-mounting all controllers is just fine. In our usage model we always want to have exactly the same hierarchy for all of them. It rather complicates life to have to deal with multiple hierarchies, so I'd be happy if they went away. libvirtd will always create its own cgroups starting at the location where libvirtd itself has been placed. This is to co-operate with systemd / initscripts which may place each system service in a dedicated group.
Thus historically we usually end up in a layout:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt
           |
           +- lxc
           |   |
           |   +- container1
           |   +- container2
           |   +- container3
           |   ...
           +- qemu
               |
               +- machine1
               +- machine2
               +- machine3
               ...

Now we know that many controllers don't respect this hierarchy and will flatten it, so all those leaf nodes (container1, container2, machine1, machine2, etc.) are immediately at the root level. While this is clearly sub-optimal, for our current needs it does not actually harm us. While we did intend that a sysadmin could place controls on the 'libvirt', 'lxc' or 'qemu' cgroups, I'm not aware of anyone who actually does this currently. Everyone, so far, only cares about placing controls on individual virtual machines and containers.

Thus, given what we now know about the performance problems wrt hierarchies, we're planning to flatten that significantly to look closer to this:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt-lxc-container1
       +- libvirt-lxc-container2
       +- libvirt-lxc-container3
       +- libvirt-lxc-...
       +- libvirt-qemu-machine1
       +- libvirt-qemu-machine2
       +- libvirt-qemu-machine3
       +- libvirt-qemu-...

(though we'll have a config option to retain the old-style hierarchy too, for backwards compatibility)

Also bear in mind that with containers, the processes inside the containers may want to use cgroups too, e.g. if running systemd inside a container:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt-lxc-container1
       |   |
       |   +- apache.service
       |   +- mysql.service
       |   +- sendmail.service
       |   ...
       +- libvirt-lxc-container2
       +- libvirt-lxc-container3
       +- libvirt-lxc-...
       +- libvirt-qemu-machine1
       +- libvirt-qemu-machine2
       +- libvirt-qemu-machine3
       +- libvirt-qemu-...

Or if each user login session has been given a cgroup and we are running libvirtd as a non-root user, we can end up with something like this:

  $CG_MOUNT_ROOT
   |
   +- fred.user
   +- joe.user
   +- bob.user
       |
       +- libvirtd.service  (if systemd has put us in an isolated group)
           |
           +- libvirt-qemu-machine1
           +- libvirt-qemu-machine2
           +- libvirt-qemu-machine3
           +- libvirt-qemu-...

In essence, what I'm saying is that I'm fine with co-mounting. What we care about is being able to create the kinds of hierarchies outlined above, and have all controllers actually work sensibly with them.

The systemd & libvirt folks came up with the following recommendations to try to get good co-operation between different user space apps that want to use cgroups. Basically, the idea is that if each app follows the guidelines, then no individual app needs a global view of all cgroups.

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

I think everything you describe is compatible with what we've documented there.

Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply [flat|nested] 75+ messages in thread
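The placement rule Daniel describes — start from wherever libvirtd itself has been placed and append per-driver names — can be sketched as a path computation. This is a toy illustration of the naming schemes above, not libvirt's actual code; `machine_cgroup` and its arguments are hypothetical:

```python
def machine_cgroup(own_cgroup, driver, machine, flat=True):
    """Compute the cgroup path for a VM/container relative to libvirtd's
    own placement (e.g. '/system/libvirtd.service' under systemd).

    flat=True gives the new single-level 'libvirt-<driver>-<name>' naming;
    flat=False gives the historical libvirt/<driver>/<name> nesting."""
    if flat:
        return f"{own_cgroup}/libvirt-{driver}-{machine}"
    return f"{own_cgroup}/libvirt/{driver}/{machine}"

print(machine_cgroup("/system/libvirtd.service", "qemu", "machine1"))
# -> /system/libvirtd.service/libvirt-qemu-machine1
print(machine_cgroup("/system/libvirtd.service", "lxc", "container1", flat=False))
# -> /system/libvirtd.service/libvirt/lxc/container1
```

The point of anchoring at `own_cgroup` rather than the mount root is exactly the co-operation Daniel mentions: libvirt never has to know or care how systemd arranged everything above it.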
* Re: [RFC] cgroup TODOs [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 9:08 ` Glauber Costa 2012-09-14 13:58 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-14 9:08 UTC (permalink / raw) To: Daniel P. Berrange Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, Serge E. Hallyn, Paul Turner On 09/14/2012 01:10 PM, Daniel P. Berrange wrote: > libvirtd will always create its own cgroups starting at the location > where libvirtd itself has been placed. This is to co-operate with > systemd / initscripts which may place each system service in a > dedicated group This is more or less what I am doing now for OpenVZ as well. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 9:08 ` Glauber Costa @ 2012-09-14 13:58 ` Vivek Goyal [not found] ` <20120914135830.GB6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 13:58 UTC (permalink / raw) To: Daniel P. Berrange Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Fri, Sep 14, 2012 at 10:10:32AM +0100, Daniel P. Berrange wrote: [..] > > 6. Multiple hierarchies > > > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > > that enough the last time[1]), there's a basic problem when more > > than one controllers interact - it's impossible to define a resource > > group when more than two controllers are involved because the > > intersection of different controllers is only defined in terms of > > tasks. > > > > IOW, if an entity X is of interest to two controllers, there's no > > way to map X to the cgroups of the two controllers. X may belong to > > A and B when viewed by one task but A' and B when viewed by another. > > This already is a head scratcher in writeback where blkcg and memcg > > have to interact. > > > > While I am pushing for unified hierarchy, I think it's necessary to > > have different levels of granularities depending on controllers > > given that nesting involves significant overhead and noticeable > > controller-dependent behavior changes. > > > > Solution: > > > > I think a unified hierarchy with the ability to ignore subtrees > > depending on controllers should work. For example, let's assume the > > following hierarchy. > > > > R > > / \ > > A B > > / \ > > AA AB > > > > All controllers are co-mounted. 
There is a per-cgroup knob which
> > controls which controllers nest beyond it. If blkio doesn't want to
> > distinguish AA and AB, the user can specify that blkio doesn't nest
> > beyond A, and blkio would see the tree as
> >
> >         R
> >        / \
> >       A   B
> >
> > while other controllers keep seeing the original tree. The exact
> > form of the interface I don't know yet. It could be a single file
> > into which the user echoes [-]controller names, or a per-controller
> > boolean file.
> >
> > I think this level of flexibility should be enough for most use
> > cases. If someone disagrees, please voice your objections now.

Tejun, Daniel,

I am a little concerned about the above and wondering how systemd and libvirt will interact and behave out of the box.

Currently systemd does not create its own hierarchy under blkio and libvirt does. So putting it all together means there is no way to avoid the overhead of the systemd-created hierarchy.

  \
   |
   +- system
       |
       +- libvirtd.service
           |
           +- virt-machine1
           +- virt-machine2

So there is no way to avoid the overhead of the two levels of hierarchy created by systemd. I really wish that systemd got rid of the "system" cgroup and put services directly in the top-level group. Creating deeper hierarchies is expensive.

I just want to mention clearly that with the above model it will not be possible for libvirt to avoid hierarchy levels created by systemd. So the solution would be to keep the depth of the hierarchy as low as possible and to keep controller overhead as low as possible.

Now I know that with blkio, idling kills performance. So one solution could be that on anything fast, don't use CFQ. Use deadline, and then the group idling overhead goes away, and tools like systemd and libvirt don't have to worry about keeping track of disks and what scheduler is running. They don't want to do it; they expect the kernel to get it right. But getting that right out of the box does not happen as of today, as CFQ is the default on everything.
Distributions can carry their own patches to do some approximation, but it would be better to have a mechanism in the kernel to select a better IO scheduler out of the box for a storage LUN. It is more important now than ever, since the blkio controller has come into the picture.

The above is the scenario I am most worried about: CFQ shows up by default on all the LUNs, systemd and libvirt create 4-5 level deep hierarchies by default, and IO performance sucks out of the box. CFQ already underperforms on fast storage, and with group creation the problem becomes worse.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914135830.GB6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:29 ` Tejun Heo [not found] ` <20120914192935.GO17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:29 UTC (permalink / raw) To: Vivek Goyal Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner

Hello,

(cc'ing Lennart and Kay)

On Fri, Sep 14, 2012 at 09:58:30AM -0400, Vivek Goyal wrote:
> I am a little concerned about the above and wondering how systemd and
> libvirt will interact and behave out of the box.
>
> Currently systemd does not create its own hierarchy under blkio and
> libvirt does. So putting it all together means there is no way to avoid
> the overhead of the systemd-created hierarchy.
>
>   \
>    |
>    +- system
>        |
>        +- libvirtd.service
>            |
>            +- virt-machine1
>            +- virt-machine2
>
> So there is no way to avoid the overhead of the two levels of hierarchy
> created by systemd. I really wish that systemd got rid of the "system"
> cgroup and put services directly in the top-level group. Creating deeper
> hierarchies is expensive.
>
> I just want to mention clearly that with the above model it will not
> be possible for libvirt to avoid hierarchy levels created by systemd.
> So the solution would be to keep the depth of the hierarchy as low as
> possible and to keep controller overhead as low as possible.

Yes, if we do a full unified hierarchy, nesting should happen iff resource control actually requires the nesting, so that tree depth is kept minimal. Nesting shouldn't be used purely for organizational purposes.

> Now I know that with blkio, idling kills performance. So one solution
> could be that on anything fast, don't use CFQ.
Use deadline and then > group idling overhead goes away and tools like systemd and libvirt don't > have to worry about keeping track of disks and what scheduler is running. > They don't want to do it and expect kernel to get it right. I personally don't think the level of complexity we have in cfq is something useful for the SSDs which are getting ever better. cfq is allowed to use a lot of processing overhead and complexity because disks are *so* slow. The balance already has completely changed with SSDs and we should be doing something a lot simpler most likely based on iops for them - be it deadline or whatever. blkcg support is currently tied to cfq-iosched which sucks but I think that could be the only way to achieve any kind of acceptable blkcg support for rotating disks. I think what we should do is abstract out the common organization part as much as possible so that we don't end up duplicating everything for blk-throttle, cfq and, say, deadline. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
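Tejun's suggestion of "something a lot simpler most likely based on iops" can be illustrated with a toy weight-proportional dispatch loop: charge each IO 1/weight of virtual time and always serve the group with the smallest virtual time. This is a sketch of the general technique, not kernel code, and the group names and weights are made up:

```python
import heapq
from fractions import Fraction

def dispatch(weights, n):
    """Dispatch n IOs among always-backlogged groups in proportion to
    their weights.  Each served IO advances the group's virtual time by
    1/weight (exact fractions, so ties stay deterministic); the group
    with the smallest vtime is served next."""
    heap = [(Fraction(0), name) for name in sorted(weights)]
    heapq.heapify(heap)
    counts = dict.fromkeys(weights, 0)
    for _ in range(n):
        vtime, name = heapq.heappop(heap)
        counts[name] += 1
        heapq.heappush(heap, (vtime + Fraction(1, weights[name]), name))
    return counts

print(dispatch({"A": 500, "B": 250, "C": 250}, 1000))
# -> {'A': 500, 'B': 250, 'C': 250}
```

The appeal for SSDs is that this per-IO accounting needs no idling at all, which is exactly the overhead that makes CFQ's group scheduling expensive on fast devices.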
* Re: [RFC] cgroup TODOs [not found] ` <20120914192935.GO17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 21:51 ` Kay Sievers 0 siblings, 0 replies; 75+ messages in thread From: Kay Sievers @ 2012-09-14 21:51 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Vivek Goyal On Fri, Sep 14, 2012 at 9:29 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > On Fri, Sep 14, 2012 at 09:58:30AM -0400, Vivek Goyal wrote: >> I am little concerned about above and wondering how systemd and libvirt >> will interact and behave out of the box. >> >> Currently systemd does not create its own hierarchy under blkio and >> libvirt does. So putting all together means there is no way to avoid >> the overhead of systemd created hierarchy. >> >> \ >> | >> +- system >> | >> +- libvirtd.service >> | >> +- virt-machine1 >> +- virt-machine2 >> >> So there is now way to avoid the overhead of two levels of hierarchy >> created by systemd. I really wish that systemd gets rid of "system" >> cgroup and puts services directly in top level group. Creating deeper >> hieararchices is expensive. The idea here is to split equally between the "system" and the "user"s at that level. That all can be re-considered and changed if really needed, but it's not an unintentionally created directory. Thanks, Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2012-09-14 9:10 ` Daniel P. Berrange @ 2012-09-14 14:25 ` Vivek Goyal [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 15:03 ` Michal Hocko ` (5 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 14:25 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar

On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote:

[..]
> * blkio is the most problematic. It has two sub-controllers - cfq
>   and blk-throttle. Both are utterly broken in terms of hierarchy
>   support and the former is known to have a pretty hairy code base. I
>   don't see any other way than just biting the bullet and fixing it.

I am still a little concerned about changing the blkio behavior unexpectedly. Can we have some kind of mount-time flag which retains the old flat behavior, while we warn the user that this mode is deprecated and will soon be removed and that they should move over to hierarchical mode? Then after a few releases we can drop the flag and clean up any extra code which supports flat mode in CFQ. This will at least make the transition smooth.

> * cgroup_freezer and others shouldn't be too difficult to fix.
>
> Who:
>
>   memcg can be handled by memcg people and I can handle cgroup_freezer
>   and others with help from the authors. The problematic one is
>   blkio. If anyone is interested in working on blkio, please be my
>   guest. Vivek? Glauber?

I will try to spend some time on this. Doing changes in blk-throttle should be relatively easy. The painful part is CFQ. It does so much that it is not clear whether a particular change will bite us badly or not.
So making changes becomes hard. There are heuristics, preemptions, queue selection logic, and service trees, and bringing it all together for full hierarchy support becomes interesting.

I think the first thing which needs to be done is to merge group scheduling and cfqq scheduling. Because of the flat hierarchy we currently use two scheduling algorithms: the old logic for queue selection and new logic for group scheduling. If we treat tasks and groups at the same level, then we have to merge the two and come up with a single algorithm.

Glauber, feel free to jump into it if you like. We can sort it out together.

[..]
> * Vivek brought up the issue of distributing resources to tasks and
>   groups in the same cgroup. I don't know. Need to think more
>   about it.

This one will require some thought. I have heard arguments for both models.

Treating tasks and groups at the same level seems to have one disadvantage: people can't think of system resources in terms of percentages. People often say, give 20% of disk resources to a particular cgroup. But that is not possible, as there are kernel threads running in the root cgroup, and tasks come and go, which means the % share of a group is variable, not fixed.

To make it fixed, we would need to make sure that the number of entities fighting for resources is not variable. That means only groups fight for resources at a given level, and tasks fight within groups. Now the question is whether the kernel should enforce this or whether it should be left to user space. I think doing it in user space is also messy, as different agents control different parts of the hierarchy. For example, if somebody says give a particular virtual machine x% of system resources, libvirt has no way to do that. At most it can ensure x% of the parent group, but above that the hierarchy is controlled by systemd and libvirtd has no control over it. The only possible way to do this seems to be for systemd to create the libvirt group at the top level with a minimum fixed % quota, and then libvirt can figure out the % share of each virtual machine.
But that is hard to do. So while the % model is more intuitive to users, it is hard to implement. An easier way is to stick to the model of relative weights/shares and let the user specify the relative importance of a virtual machine; the actual quota or % will then vary dynamically depending on other tasks/components in the system.

Thoughts?

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
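The "varies dynamically" behavior Vivek describes is just weight arithmetic: an entity's effective share is its weight over the sum of all weights competing at its level, so it moves every time a peer appears or disappears. A toy illustration (not kernel code; the 1024 default weight mirrors cpu.shares):

```python
def effective_share(weight, peer_weights):
    """Effective % of a contended resource for one entity, given the
    weights of all peers competing at the same level of the hierarchy."""
    return 100.0 * weight / (weight + sum(peer_weights))

# A VM with weight 1024 alone at its level gets everything...
print(effective_share(1024, []))          # 100.0
# ...but its share drops as siblings with the default weight appear.
print(effective_share(1024, [1024]))      # 50.0
print(effective_share(1024, [1024] * 3))  # 25.0
```

This is exactly why a fixed "20% of the disk" guarantee can't be expressed with weights alone: the denominator changes under you.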
* Re: [RFC] cgroup TODOs [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 14:53 ` Peter Zijlstra 2012-09-14 15:14 ` Vivek Goyal 2012-09-14 21:39 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 14:53 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, 2012-09-14 at 10:25 -0400, Vivek Goyal wrote: > So while % model is more intutive to users, it is hard to implement. I don't agree with that. The fixed quota thing is counter-intuitive and hard to use. It begets you questions like: why, if everything is idle except my task, am I not getting the full throughput. It also makes adding entities harder because you're constrained to 100%. This means you have to start each new cgroup with 0% because any !0 value will eventually get you over 100%, it also means you have to do some form of admission control to make sure you never get over that 100%. Starting with 0% is not convenient for people.. they think this is the wrong default, even though as argued above, it is the only possible value. > So > an easier way is to stick to the model of relative weights/share and > let user specify relative importance of a virtual machine and actual > quota or % will vary dynamically depending on other tasks/components > in the system. > > Thoughts? cpu does the relative weight, so 'users' will have to deal with it anyway regardless of blk, its effectively free of learning curve for all subsequent controllers. Now cpu also has an optional upper limit. But its optional for those people who do want it (also its expensive). 
For RT we must use a fixed quota, since variable service completely defeats determinism. RT is 'special' and hard to use anyway, so making it harder is fine. ^ permalink raw reply [flat|nested] 75+ messages in thread
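Peter's objection to absolute quotas can be made concrete: once shares are fixed percentages, every new group needs an admission check against the remaining headroom, and the only always-safe default for a new group is 0%. A hypothetical sketch (names and numbers are illustrative, not any real interface):

```python
def admit(quotas, name, requested_pct):
    """Admission control for absolute quotas: admit the new group only
    if existing reservations leave enough headroom under 100%."""
    if sum(quotas.values()) + requested_pct > 100:
        return False  # over-commit: must reject or renegotiate
    quotas[name] = requested_pct
    return True

quotas = {"system": 40, "vm1": 30, "vm2": 20}
print(admit(quotas, "vm3", 20))  # False: only 10% headroom is left
print(admit(quotas, "vm3", 10))  # True
print(admit(quotas, "vm4", 0))   # True: 0% is the only always-safe default
```

Relative weights avoid all of this: any weight can be added at any time because shares renormalize automatically, which is the design choice cpu made.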
* Re: [RFC] cgroup TODOs 2012-09-14 14:53 ` Peter Zijlstra @ 2012-09-14 15:14 ` Vivek Goyal [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 15:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 04:53:29PM +0200, Peter Zijlstra wrote: > On Fri, 2012-09-14 at 10:25 -0400, Vivek Goyal wrote: > > So while % model is more intutive to users, it is hard to implement. > > I don't agree with that. The fixed quota thing is counter-intuitive and > hard to use. It begets you questions like: why, if everything is idle > except my task, am I not getting the full throughput. Actually by fixed quota I meant minimum fixed %. So if other groups are idle, this group still gets to use 100% bandwidth. When resources are highly contended, this group gets its minimum fixed %. > > It also makes adding entities harder because you're constrained to 100%. > This means you have to start each new cgroup with 0% because any !0 > value will eventually get you over 100%, it also means you have to do > some form of admission control to make sure you never get over that > 100%. > > Starting with 0% is not convenient for people.. they think this is the > wrong default, even though as argued above, it is the only possible > value. We don't have to start with 0%. We can keep a pool with dynamic % and launch all the virtual machines from that single pool. So nobody starts with 0%. If we require certain % for a machine, only then we look at peers and see if we have bandwidth free and create cgroup and move virtual machine there, otherwise we deny resources. 
So I think it is doable; it's just painful and tricky, and I think a lot of it will be in user space.

> > So an easier way is to stick to the model of relative weights/share and
> > let the user specify the relative importance of a virtual machine, and the
> > actual quota or % will vary dynamically depending on other
> > tasks/components in the system.
> >
> > Thoughts?
>
> cpu does the relative weight, so 'users' will have to deal with it
> anyway regardless of blk, its effectively free of learning curve for all
> subsequent controllers.

I am inclined to keep it simple in the kernel and just follow the cpu model of relative weights, treating tasks and groups at the same level in the hierarchy. It makes behavior consistent across the controllers, and I think it might just work for the majority of cases.

Those who really need to implement the % model will have to do the heavy lifting in user space. I am skeptical that will take off, but the kernel does not prohibit somebody from creating a group, moving all tasks there, and making sure tasks and groups are not at the same level, so that the % becomes more predictable. It's just that that's not the default from the kernel.

So yes, doing it the cpu controller way in the block controller should be reasonable.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 21:57 ` Tejun Heo [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 8:55 ` Glauber Costa 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 21:57 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek, Peter. On Fri, Sep 14, 2012 at 11:14:47AM -0400, Vivek Goyal wrote: > We don't have to start with 0%. We can keep a pool with dynamic % and > launch all the virtual machines from that single pool. So nobody starts > with 0%. If we require certain % for a machine, only then we look at > peers and see if we have bandwidth free and create cgroup and move virtual > machine there, otherwise we deny resources. > > So I think it is doable just that it is painful and tricky and I think > lot of it will be in user space. I think the system-wide % thing is rather distracting for the discussion at hand (and I don't think being able to specify X% of the whole system when you're three level down the resource hierarchy makes sense anyway). Let's focus on tasks vs. groups. > > > So > > > an easier way is to stick to the model of relative weights/share and > > > let user specify relative importance of a virtual machine and actual > > > quota or % will vary dynamically depending on other tasks/components > > > in the system. > > > > > > Thoughts? > > > > cpu does the relative weight, so 'users' will have to deal with it > > anyway regardless of blk, its effectively free of learning curve for all > > subsequent controllers. 
> I am inclined to keep it simple in the kernel and just follow the cpu model of
> relative weights, treating tasks and groups at the same level in the
> hierarchy. It makes behavior consistent across the controllers and I
> think it might just work for the majority of cases.

I think we need to stick to one model for all controllers; otherwise, it gets confusing and unified hierarchy can't work. That said, I'm not too happy about how cpu is handling it now.

* As I wrote before, the configuration escapes cgroup proper, and the
  mapping from a per-task value to a group weight is essentially
  arbitrary and may not exist depending on the resource type.

* The proportion of each group fluctuates as tasks fork and exit in
  the parent group, which is confusing.

* cpu deals with tasks, but blkcg deals with iocontexts, and memcg,
  which currently doesn't implement proportional control, deals with
  address spaces (processes). The proportions wouldn't even fluctuate
  the same way across different controllers.

So, I really don't think the current model used by cpu is a good one, and we rather should treat the tasks as a group competing with the rest of the child groups. Whether we can change that at this point, I don't know. Peter, what do you think?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 15:27 ` Vivek Goyal 2012-09-18 18:08 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 15:27 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:57:01PM -0700, Tejun Heo wrote: [..] > > > cpu does the relative weight, so 'users' will have to deal with it > > > anyway regardless of blk, its effectively free of learning curve for all > > > subsequent controllers. > > > > I am inclined to keep it simple in kernel and just follow cpu model of > > relative weights and treating tasks and gropu at same level in the > > hierarchy. It makes behavior consistent across the controllers and I > > think it might just work for majority of cases. > > I think we need to stick to one model for all controllers; otherwise, > it gets confusing and unified hierarchy can't work. That said, I'm > not too happy about how cpu is handling it now. > > * As I wrote before, the configuration esacpes cgroup proper and the > mapping from per-task value to group weight is essentially > arbitrary and may not exist depending on the resource type. If need be, one can create task priority type for those resources too. Or one could even think of being able to directly specify weigths (same thing as groups) for tasks. That should be doable if people think if that kind of interface helps. > > * The proportion of each group fluctuates as tasks fork and exit in > the parent group, which is confusing. Agreed with that. But some people are just happy with varying percentage and don't care about fixed percentage. 
In fact, current deployments of systemd and libvirt don't care about a fixed percentage. They are just happy providing relative priority to things and ensuring some kind of basic isolation.

> * cpu deals with tasks, but blkcg deals with iocontexts, and memcg,
>   which currently doesn't implement proportional control, deals with
>   address spaces (processes). The proportions wouldn't even fluctuate
>   the same way across different controllers.
>
> So, I really don't think the current model used by cpu is a good one,
> and we rather should treat the tasks as a group competing with the
> rest of the child groups. Whether we can change that at this point, I
> don't know. Peter, what do you think?

I am not convinced that by default the kernel should enforce that all the tasks of a group are accounted to a hidden group. People have use cases where they are happy with the currently offered semantics.

I think the auto scheduler group is another example where the system is well protected from workloads like "make -j64". Even in the hidden-group case the system will be protected, but the % share of that group will be much higher (up to 50%).

So IMHO, if users really care about tasks and groups not competing at the same level, they should create the hierarchy that way; the kernel should not enforce it.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
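Vivek's "up to 50%" figure falls straight out of the weight arithmetic. With tasks and groups as peers, each of the 64 compiler tasks competes individually; folded into one implicit group, they collectively get at most a single group's share. A toy calculation assuming equal default weights for every entity (illustrative only):

```python
def share_of_tasks(n_tasks, n_peer_groups):
    """% share n equally-weighted tasks collectively get when each task
    competes individually against n_peer_groups sibling groups."""
    return 100.0 * n_tasks / (n_tasks + n_peer_groups)

def share_as_hidden_group(n_peer_groups):
    """% share the same tasks get when folded into one implicit group
    competing against the sibling groups."""
    return 100.0 / (1 + n_peer_groups)

# "make -j64" next to a single sibling cgroup:
print(share_of_tasks(64, 1))     # ~98.5: the tasks swamp the sibling
print(share_as_hidden_group(1))  # 50.0: the "up to 50%" case
```

So the two models differ most exactly in the many-tasks case, which is why the tasks-vs-groups question matters for out-of-the-box isolation.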
* Re: [RFC] cgroup TODOs [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 15:27 ` Vivek Goyal @ 2012-09-18 18:08 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-18 18:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:57:01PM -0700, Tejun Heo wrote: [..] > I think we need to stick to one model for all controllers; otherwise, > it gets confusing and unified hierarchy can't work. That said, I'm > not too happy about how cpu is handling it now. > > * As I wrote before, the configuration esacpes cgroup proper and the > mapping from per-task value to group weight is essentially > arbitrary and may not exist depending on the resource type. > > * The proportion of each group fluctuates as tasks fork and exit in > the parent group, which is confusing. > > * cpu deals with tasks but blkcg deals with iocontexts and memcg, > which currently doesn't implement proportional control, deals with > address spaces (processes). The proportions wouldn't even fluctuate > the same way across different controllers. > > So, I really don't think the current model used by cpu is a good one > and we rather should treat the tasks as a group competing with the > rest of child groups. Whether we can change that at this point, I > don't know. Peter, what do you think? Peter, do you have thoughts on this? I vaguely remember that similar discussion had happened for cpu controller. We first need to settle this debate of treating tasks at same level as groups before further design points can be discussed. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 21:57 ` Tejun Heo @ 2012-09-17 8:55 ` Glauber Costa 1 sibling, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:55 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On 09/14/2012 07:14 PM, Vivek Goyal wrote: > Those who really need to implement % model, they will have to do heavy > lifting in user space. I am skeptical that will take off but kernel > does not prohibit from somebody creating a group, moving all tasks > there and making sure tasks and groups are not at same level hence > % becomes more predictable. Just that, that's not the default from > kernel. I subscribe to that. I use a % model for memory / kernel memory (give the kernel 20% of userspace memory), but the kernel never knows about it. It only understands megabytes. Of course this is simpler, because it is all inside the same cgroup. But if you want global %'s you need to calculate it from everybody *anyway*, be it in the kernel or in userspace. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 14:53 ` Peter Zijlstra @ 2012-09-14 21:39 ` Tejun Heo [not found] ` <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 21:39 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek. On Fri, Sep 14, 2012 at 10:25:39AM -0400, Vivek Goyal wrote: > On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > > [..] > > * blkio is the most problematic. It has two sub-controllers - cfq > > and blk-throttle. Both are utterly broken in terms of hierarchy > > support and the former is known to have pretty hairy code base. I > > don't see any other way than just biting the bullet and fixing it. > > I am still a little concerned about changing the blkio behavior > unexpectedly. Can we have some kind of mount time flag which retains > the old flat behavior and we warn the user that this mode is deprecated > and will soon be removed. Move over to hierarchical mode. Then after > a few releases we can drop the flag and clean up any extra code which > supports flat mode in CFQ. This will at least make the transition smooth. I don't know. That essentially is what we're doing with memcg now and it doesn't seem any less messy. Given the already scary complexity, do we really want to support both flat and hierarchy models at the same time? > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > I will try to spend some time on this.
Doing changes in blk-throttle > should be relatively easy. The painful part is CFQ. It does so much that > it is not clear whether a particular change will bite us badly or > not. So doing changes becomes hard. There are heuristics, preemptions, > queue selection logic, service trees, and bringing it all together > for full hierarchy becomes interesting. > > I think the first thing which needs to be done is merge group scheduling > and cfqq scheduling. Because of the flat hierarchy we currently use two > scheduling algorithms. Old logic for queue selection and new logic > for group scheduling. If we treat tasks and groups at the same level then > we have to merge the two and come up with a single logic. I think this depends on how we decide to handle tasks vs. groups, right? > [..] > > * Vivek brought up the issue of distributing resources to tasks and > > groups in the same cgroup. I don't know. Need to think more > > about it. > > This one will require some thought. I have heard arguments for both the > models. Treating tasks and groups at the same level seems to have one > disadvantage and that is that people can't think of system resources > in terms of %. People often say, give 20% of disk resources to a > particular cgroup. But it is not possible as there are all the kernel > threads running in the root cgroup and tasks come and go and that means > the % share of a group is variable and not fixed. Another problem is that configuration isn't contained in cgroup proper. We need a way to assign weights to individual tasks which can be somehow directly compared against group weights. cpu cooks priority for this and blkcg may be able to cook ioprio but it's nasty and unobvious. Also, let's say we grow a network bandwidth controller for whatever reason. What value are we gonna use? > To make it fixed, we will need to make sure that the number of entities > fighting for resources is not variable. That means only groups fight > for resources at a level and tasks within groups.
> > Now the question is should kernel enforce it or should it be left to > user space. I think doing it in user space is also messy as different > agents control different parts of the hierarchy. For example, if somebody says > that give a particular virtual machine a x% of system resource, libvirt > has no way to do that. At max it can ensure x% of parent group but above > that hierarchy is controlled by systemd and libvirtd has no control > over that. > > Only possible way to do this will seem to be that systemd creates libvirt > group at top level with a minimum fixed % of quota and then libvirt can > figure out % share of each virtual machine. But it is hard to do. > > So while % model is more intuitive to users, it is hard to implement. So > an easier way is to stick to the model of relative weights/share and > let user specify relative importance of a virtual machine and actual > quota or % will vary dynamically depending on other tasks/components > in the system. Why is it hard to implement? You just need to treat tasks in the current node as another group competing with other cgroups on equal terms. If anything, isn't that simpler than treating scheduling "entities"? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 15:05 ` Vivek Goyal [not found] ` <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 15:05 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:39:38PM -0700, Tejun Heo wrote: [..] > > I am still little concerned about changing the blkio behavior > > unexpectedly. Can we have some kind of mount time flag which retains > > the old flat behavior and we warn user that this mode is deprecated > > and will soon be removed. Move over to hierarchical mode. Then after > > few release we can drop the flag and cleanup any extra code which > > supports flat mode in CFQ. This will atleast make transition smooth. > > I don't know. That essentially is what we're doing with memcg now and > it doesn't seem any less messy. Given the already scary complexity, > do we really want to support both flat and hierarchy models at the > same time? As a developer, I will be happy to support only one model and keep code simple. I am only concerned that for blkcg we have still not charted out a clear migration path. The warning message your patch is giving out will work only if we decide to not treat task and groups at same level. I guess first we need to decide task vs groups issue and then look into this issue again. > > > > memcg can be handled by memcg people and I can handle cgroup_freezer > > > and others with help from the authors. The problematic one is > > > blkio. If anyone is interested in working on blkio, please be my > > > guest. Vivek? Glauber? 
> > > > I will try to spend some time on this. Doing changes in blk-throttle > > should be relatively easy. The painful part is CFQ. It does so much that > > it is not clear whether a particular change will bite us badly or > > not. So doing changes becomes hard. There are heuristics, preemptions, > > queue selection logic, service trees, and bringing it all together > > for full hierarchy becomes interesting. > > > > I think the first thing which needs to be done is merge group scheduling > > and cfqq scheduling. Because of the flat hierarchy we currently use two > > scheduling algorithms. Old logic for queue selection and new logic > > for group scheduling. If we treat tasks and groups at the same level then > > we have to merge the two and come up with a single logic. > > I think this depends on how we decide to handle tasks vs. groups, > right? Yes. If we decide to account all the tasks of a group into a hidden group which competes with the other child groups, then there is no way one can create a hierarchy where tasks and groups are competing at the same level. So we can still continue to retain the existing logic. > > > [..] > > > * Vivek brought up the issue of distributing resources to tasks and > > > groups in the same cgroup. I don't know. Need to think more > > > about it. > > > > This one will require some thought. I have heard arguments for both the > > models. Treating tasks and groups at the same level seems to have one > > disadvantage and that is that people can't think of system resources > > in terms of %. People often say, give 20% of disk resources to a > > particular cgroup. But it is not possible as there are all the kernel > > threads running in the root cgroup and tasks come and go and that means > > the % share of a group is variable and not fixed. > > Another problem is that configuration isn't contained in cgroup > proper. We need a way to assign weights to individual tasks which can > be somehow directly compared against group weights.
cpu cooks > priority for this and blkcg may be able to cook ioprio but it's nasty > and unobvious. Also, let's say we grow a network bandwidth controller > for whatever reason. What value are we gonna use? So if somebody cares about setting SO_PRIORITY for traffic originating from a task, move it into a cgroup. Otherwise they all get the default priority. I think the question here is why you would want to provide a hidden group as the default mechanism in the kernel. If a user does not like the idea of tasks and groups competing at the same level, he can always create a cgroup and move all the tasks there. The only thing we need to provide is reliable ways of migrating a group of tasks into other cgroups at run time. By creating a hidden group for tasks, there also comes an issue of configuration of that hidden group (group weight, stats etc). By forcing user space to create a new group for tasks, it is an explicit cgroup and user space already knows how to handle it. So to me, leaving this decision to userspace based on their requirement makes sense. Also, I think the cpu controller has already discussed this in the past (the possibility of a hidden group for tasks). Peter will have more details about it, I think. > > > To make it fixed, we will need to make sure that the number of entities > > fighting for resources is not variable. That means only groups fight > > for resources at a level and tasks within groups. > > > > Now the question is should kernel enforce it or should it be left to > > user space. I think doing it in user space is also messy as different > > agents control different parts of the hierarchy. For example, if somebody says > > that give a particular virtual machine a x% of system resource, libvirt > > has no way to do that. At max it can ensure x% of parent group but above > > that hierarchy is controlled by systemd and libvirtd has no control > > over that.
> > > > Only possible way to do this will seem to be that systemd creates libvirt > > group at top level with a minimum fixed % of quota and then libvirt can > > figure out % share of each virtual machine. But it is hard to do. > > > > So while % model is more intuitive to users, it is hard to implement. So > > an easier way is to stick to the model of relative weights/share and > > let user specify relative importance of a virtual machine and actual > > quota or % will vary dynamically depending on other tasks/components > > in the system. > > Why is it hard to implement? You just need to treat tasks in the > current node as another group competing with other cgroups on equal > terms. If anything, isn't that simpler than treating scheduling > "entities"? I meant "hard to implement" in the sense of the kernel having to keep track of % and enforce it across the hierarchy, etc. Yes, creating a hidden group for tasks in the current group should not be hard from an implementation point of view. But again, I am concerned about the configuration of the hidden group and I also don't like the idea of taking away the user's flexibility to treat tasks and groups at the same level. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 16:40 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-17 16:40 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Mon, Sep 17, 2012 at 11:05:18AM -0400, Vivek Goyal wrote: > As a developer, I will be happy to support only one model and keep code > simple. I am only concerned that for blkcg we have still not charted > out a clear migration path. The warning message your patch is giving > out will work only if we decide to not treat task and groups at same > level. It may not be enough but it still is in the right direction. > > Another problem is that configuration isn't contained in cgroup > > proper. We need a way to assign weights to individual tasks which can > > be somehow directly compared against group weights. cpu cooks > > priority for this and blkcg may be able to cook ioprio but it's nasty > > and unobvious. Also, let's say we grow network bandwidth controller > > for whatever reason. What value are we gonna use? > > So if somebody cares about settting SO_PRIORITY for traffic originating > from a tasks, move it into a cgroup. Otherwise they all get default > priority. I don't know. Do we wanna add, say, prctl for memory weight too? > So to me, leaving this decision to userspace based on their requirement > makes sense. Leaving too many decisions to userland is one of the reasons that got us into this mess, so I'm not sold on flexibility for flexibility's sake. > Yes, creating a hidden group for tasks in current group should not be > hard from implementation point of view. 
But again, I am concerned about > configuration of hidden group and I also don't like the idea of taking > flexibility away from user to treat tasks and group at same level. I don't know. Create a reserved directory for it? I do like the idea of taking flexibility away from the user unless it's actually useful but am a bit worried we might be too late for that. :( Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (3 preceding siblings ...) 2012-09-14 14:25 ` Vivek Goyal @ 2012-09-14 15:03 ` Michal Hocko [not found] ` <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-14 18:07 ` [RFC] cgroup TODOs Vivek Goyal ` (4 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-14 15:03 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Thu 13-09-12 13:58:27, Tejun Heo wrote: [...] > 2. memcg's __DEPRECATED_clear_css_refs > > This is a remnant of another weird design decision of requiring > synchronous draining of refcnts on cgroup removal and allowing > subsystems to veto cgroup removal - what's the userspace supposed to > do afterwards? Note that this also hinders co-mounting different > controllers. > > The behavior could be useful for development and debugging but it > unnecessarily interlocks userland visible behavior with in-kernel > implementation details. To me, it seems outright wrong (either > implement proper severing semantics in the controller or do full > refcnting) and disallows, for example, lazy drain of caching refs. > Also, it complicates the removal path with try / commit / revert > logic which has never been fully correct since the beginning. > > Currently, the only left user is memcg. > > Solution: > > * Update memcg->pre_destroy() such that it never fails. > > * Drop __DEPRECATED_clear_css_refs and all related logic. > Convert pre_destroy() to return void. > > Who: > > KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs > trigger WARN sooner or later. Let's please get this settled. 
I think we are almost there. One big step was that we no longer charge to the parent and only move statistics but there are still some corner cases when we race with LRU handling. [...] > * memcg can be fully hierarchical but we need to phase out the flat > hierarchy support. Unfortunately, this involves flipping the > behavior for the existing users. Upstream will try to nudge users > with warning messages. Most burden would be on the distros and at > least SUSE seems to be on board with it. Needs coordination with > other distros. I am currently planning to add a warning to most of the currently maintained distributions to get as much coverage as possible. No default switch for obvious reasons but hopefully we will get some feedback at least. Thanks Tejun for doing this. We needed it for a long time. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 14:02 ` Michal Hocko [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:02 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings [CCing Dave, Ben] Just a short summary as you were not on the CC list. This is a sort of follow-up on https://lkml.org/lkml/2012/9/3/211. The end result is slightly different because Tejun did a more generic cgroup solution (see below). I cannot do the same for OpenSUSE so I will stick with the memcg specific patch. On Fri 14-09-12 17:03:06, Michal Hocko wrote: > I am currently planning to add a warning to most of the currently > maintained distributions to get as much coverage as possible. No default > switch for obvious reasons but hopefully we will get some feedback at > least. Just for the record, I will post backports of the patch I ended up using for openSUSE 11.4 and 12.[12] and SLES-SP2 as a reply to this email (and 2.6.32 in case somebody is interested). I hope other distributions can either go with this (which will never be merged but it should help to identify dubious usage of flat hierarchies without a risk of breaking anything) or what Tejun has in his tree[1] 8c7f6edb (cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them) which is more generic but it is also slightly more intrusive.
--- [1] - git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.7-hierarchy -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 14:03 ` Michal Hocko [not found] ` <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 3.0] " Michal Hocko 2012-09-19 14:05 ` [PATCH 3.2+] " Michal Hocko 2 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:03 UTC (permalink / raw) To: Tejun Heo Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar From 34be56e3e7e4f9c31381ce35247e0a0b7f972874 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting. This has improved a lot since it was introduced. This schizophrenia makes the code and integration with other controllers more complicated (e.g. mounting it with a fully hierarchical one could have unexpected side effects) for no good reason so it would be good to make the memory controller behave only hierarchically. It seems that there are no good reasons for deep cgroup hierarchies which are not truly hierarchical so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three A, B, C have some tasks and their memory limits.
The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f99f599..b61c34b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) } else { parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 75+ messages in thread
[parent not found: <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 19:38 ` David Rientjes [not found] ` <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: David Rientjes @ 2012-09-19 19:38 UTC (permalink / raw) To: Michal Hocko Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Wed, 19 Sep 2012, Michal Hocko wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f99f599..b61c34b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > } else { > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, > + "Creating hierarchies with use_hierarchy==0 " > + "(flat hierarchy) is considered deprecated. " > + "If you believe that your setup is correct, " > + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); When I deprecated /proc/pid/oom_adj (now removed), we did a WARN_ONCE() and others complained that this unnecessarily alerts userspace scripts that a serious issue has occurred and Linus agreed that we shouldn't do deprecation in this way. The alternative is to use printk_once() instead. This applies to all three patches for this one, 3.0, and 3.2+. ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2012-09-20 13:24 ` Michal Hocko [not found] ` <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-20 13:24 UTC (permalink / raw) To: David Rientjes Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings On Wed 19-09-12 12:38:18, David Rientjes wrote: > On Wed, 19 Sep 2012, Michal Hocko wrote: > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index f99f599..b61c34b 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > > } else { > > parent = mem_cgroup_from_cont(cont->parent); > > mem->use_hierarchy = parent->use_hierarchy; > > + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, > > + "Creating hierarchies with use_hierarchy==0 " > > + "(flat hierarchy) is considered deprecated. " > > + "If you believe that your setup is correct, " > > + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); > > When I deprecated /proc/pid/oom_adj (now removed), we did a WARN_ONCE() > and others complained that this unnecessarily alters userspace scripts > that a serious issue has occurred and Linus agreed that we shouldn't do > deprecation in this way. The alternative is to use printk_once() instead. Yes printk_once is an alternative but I really wanted to have this as much visible as possible. 
People tend to react to stack traces more and this one will trigger only if somebody is either doing something wrong or the configuration is the one we are looking for. Compared to oom_adj, that one was used much more often so the WARN_ONCE was too verbose, especially when you usually had to wait for a userspace update, which is not the case here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-20 22:33 ` David Rientjes [not found] ` <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: David Rientjes @ 2012-09-20 22:33 UTC (permalink / raw) To: Michal Hocko Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Thu, 20 Sep 2012, Michal Hocko wrote: > Yes printk_once is an alternative but I really wanted to have this as > much visible as possible. People tend to react to stack traceces more > and this one will trigger only if somebody is either doing something > wrong or the configuration is the one we are looking for. > That's the complete opposite of what Linus has said he wants, he said very specifically that he doesn't want WARN_ONCE() or WARN_ON_ONCE() for deprecation of tunables. If you want to have this merged, then please get him to ack it. > Comparing to oom_adj, that one was used much more often so the WARN_ONCE > was too verbose especially when you usually had to wait for an userspace > update which is not the case here. How is WARN_ONCE() too verbose for oom_adj? It's printed once! And how can you claim that userspace doesn't need to change if it's creating a hierarchy while use_hierarchy == 0? ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2012-09-21 7:16 ` Michal Hocko 0 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-21 7:16 UTC (permalink / raw) To: David Rientjes Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings On Thu 20-09-12 15:33:23, David Rientjes wrote: > On Thu, 20 Sep 2012, Michal Hocko wrote: > > > Yes printk_once is an alternative but I really wanted to have this as > > much visible as possible. People tend to react to stack traces more > > and this one will trigger only if somebody is either doing something > > wrong or the configuration is the one we are looking for. > > > > That's the complete opposite of what Linus has said he wants, he said very > specifically that he doesn't want WARN_ONCE() or WARN_ON_ONCE() for > deprecation of tunables. If you want to have this merged, then please get > him to ack it. This is not meant to be merged upstream. I do not think this is stable material and Linus' tree will get the more generic cgroup based patch instead. This is just for distributions so that they can help to find use cases which would prevent use_hierarchy removal. > > Compared to oom_adj, that one was used much more often so the WARN_ONCE > > was too verbose especially when you usually had to wait for a userspace > > update which is not the case here. > > How is WARN_ONCE() too verbose for oom_adj? It's printed once! It prints much more than one line, right?
When I said oom_adj was used much more, I meant that more applications cared about the value (so the probability of the warning was quite high), not that the message would be printed multiple times. And to be honest I didn't mind WARN_ONCE being used for that. > And how can you claim that userspace doesn't need to change if it's > creating a hierarchy while use_hierarchy == 0? It is a code vs. configuration change. You have to wait for an update, or change and recompile, in the first case, while the second one can be done directly. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH 3.0] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 Michal Hocko @ 2012-09-19 14:03 ` Michal Hocko 2012-09-19 14:05 ` [PATCH 3.2+] " Michal Hocko 2 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:03 UTC (permalink / raw) To: Tejun Heo Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar From 9364396ddc0c8843fce3a7fda0255b39ba7e4f31 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior, which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting, which has improved a lot since it was introduced. This schizophrenia makes the code and the integration with other controllers more complicated (e.g. co-mounting it with a fully hierarchical controller could have unexpected side effects) for no good reason, so it would be good to make the memory controller behave only hierarchically. There seems to be no good reason for deep cgroup hierarchies which are not truly hierarchical, so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three of A, B, C have some tasks and their own memory limits. 
The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e013b8e..d8ec0cd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4976,6 +4976,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH 3.2+] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 Michal Hocko 2012-09-19 14:03 ` [PATCH 3.0] " Michal Hocko @ 2012-09-19 14:05 ` Michal Hocko 2 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:05 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings Should apply to 3.4 and later as well. --- From cbfc6f1cdb4d8095003036c84d250a391054f971 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior, which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting, which has improved a lot since it was introduced. This schizophrenia makes the code and the integration with other controllers more complicated (e.g. co-mounting it with a fully hierarchical controller could have unexpected side effects) for no good reason, so it would be good to make the memory controller behave only hierarchically. There seems to be no good reason for deep cgroup hierarchies which are not truly hierarchical, so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. 
For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three A, B, C have some tasks and their memory limits. The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..6fbb0d7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4920,6 +4920,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); memcg->use_hierarchy = parent->use_hierarchy; memcg->oom_kill_disable = parent->oom_kill_disable; + WARN_ONCE(!memcg->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (4 preceding siblings ...) 2012-09-14 15:03 ` Michal Hocko @ 2012-09-14 18:07 ` Vivek Goyal [not found] ` <20120914180754.GF6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 18:36 ` Aristeu Rozanski ` (3 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 18:07 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: [..] > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy. > > R > / \ > A B > / \ > AA AB > > All controllers are co-mounted. 
There is a per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as, > > R > / \ > A B > > While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. Hi Tejun, I am curious why you are planning to provide the capability of a controller-specific view of the hierarchy. To me it sounds pretty close to having separate hierarchies per controller, just in a slightly more restricted configuration. IOW, who is the user of this functionality and who is asking for it? Can we go all out where all controllers have only one hierarchy view? Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914180754.GF6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 18:53 ` Tejun Heo [not found] ` <20120914185324.GI17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:53 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek. On Fri, Sep 14, 2012 at 02:07:54PM -0400, Vivek Goyal wrote: > I am curious that why are you planning to provide capability of controller > specific view of hierarchy. To me it sounds pretty close to having > separate hierarchies per controller. Just that it is a little more > restricted configuration. I think it's a lot less crazy and gives us a way to bind a resource to a set of controller cgroups regardless of which task is looking at it, which is something we're sorely missing now. > IOW, who is is the user of this functionality and who is asking for it. > Can we go all out where all controllers have only one hierarchy view. I think the issue is that controllers inherently have overhead and behavior alterations depending on the tree organization. From the usage I see at Google, which uses nested cgroups extensively, at least that level of flexibility seems necessary. In addition, for some resources, granularity beyond a certain point simply doesn't work. Per-service granularity might make sense for cpu but applying it by default would be silly for blkio. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914185324.GI17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:28 ` Vivek Goyal [not found] ` <20120914192840.GG6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 19:28 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 11:53:24AM -0700, Tejun Heo wrote: [..] > In addition, for some resources, granularity beyond certain point > simply doesn't work. Per-service granularity might make sense for cpu > but applying it by default would be silly for blkio. Hmm, in that case how will libvirt make use of blkio in the proposed scheme? We can't disable blkio nesting at the "system" level, so we will have to disable it at each service level except "libvirtd" so that libvirt can use blkio for its virtual machines. That means blkio will see each service in a cgroup of its own, and if that does not make sense by default, it's a problem. In the existing scheme, at least every service does not show up in its own cgroup from blkio's point of view. Everything is in root and libvirt can create its own cgroups, keeping the number of cgroups small. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914192840.GG6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:44 ` Tejun Heo [not found] ` <20120914194439.GP17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:44 UTC (permalink / raw) To: Vivek Goyal Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Serge Hallyn Hello, Vivek. On Fri, Sep 14, 2012 at 03:28:40PM -0400, Vivek Goyal wrote: > Hmm.., In that case how libvirt will make use of blkio in the proposed > scheme. We can't disable blkio nesting at "system" level. So We will > have to disable it at each service level except "libvirtd" so that > libvirt can use blkio for its virtual machines. > > That means blkio will see each service in a cgroup of its own and if > that does not make sense by default, its a problem. In the existing Yeap, if libvirtd wants to use blkcg, blkcg will be enabled up to libvirtd's root. It might not be optimal but I think it makes sense. If you want to exercise hierarchical control on a resource, the only sane way is sticking to the hierarchy until it reaches root. > scheme, atleast every service does not show up in its cgroup from > blkio point of view. Everthig is in root and libvirt can create its > own cgroups, keeping number of cgroups small. Even a broken clock is right twice a day. I don't think this is a behavior we can keep for the sake of "but if we do this ass-weird thing, we can bypass the overhead for XYZ" when it breaks so many fundamental things. I think there currently is too much (broken) flexibility, and I intend to remove it. That doesn't mean that removing all flexibility is the right direction. 
It inherently is a balancing act and I think the proposed solution is a reasonable tradeoff. There's an important difference between causing full overhead by default for all users and requiring some overhead when the use case at hand calls for the functionality. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914194439.GP17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:49 ` Tejun Heo [not found] ` <20120914194950.GQ17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:49 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 12:44:39PM -0700, Tejun Heo wrote: > I think there currently is too much (broken) flexibility and intent to > remove it. That doesn't mean that removeing all flexibility is the > right direction. It inherently is a balancing act and I think the > proposed solution is a reasonable tradeoff. There's important > difference between causing full overhead by default for all users and > requiring some overhead when the use case at hand calls for the > functionality. That said, if someone can think of a better solution, I'm all ears. One thing that *has* to be maintained is that it should be possible to tag a resource in such a way that its associated controllers are identifiable regardless of which task is looking at it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914194950.GQ17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 20:39 ` Tejun Heo [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 20:39 UTC (permalink / raw) To: Vivek Goyal Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Serge Hallyn Hello, again. On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: > That said, if someone can think of a better solution, I'm all ears. > One thing that *has* to be maintained is that it should be able to tag > a resource in such way that its associated controllers are > identifiable regardless of which task is looking at it. So, I thought about it more. How about we do "consider / ignore this node" instead of "(don't) nest beyond this level"? For example, let's assume a tree like the following. R / | \ A B C / \ AA AB If we want to differentiate between AA and AB, we'll have to consider the whole tree with the previous scheme - A needs to nest, so R needs to nest, and we end up with the whole tree. Instead, if we have honor / ignore this node, we can set the honor bit on A, AA and AB and see the tree as R / A / \ AA AB We still see the intermediate A node but can ignore the other branches. Implementation- and concept-wise, it's fairly simple too. For any given node and controller, you travel upwards until you meet a node which has the controller enabled, and that's the cgroup the controller considers. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 8:40 ` Glauber Costa [not found] ` <5056E1FC.1090508-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-17 14:37 ` Vivek Goyal 1 sibling, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:40 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Vivek Goyal On 09/15/2012 12:39 AM, Tejun Heo wrote: > Hello, again. > > On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: >> That said, if someone can think of a better solution, I'm all ears. >> One thing that *has* to be maintained is that it should be able to tag >> a resource in such way that its associated controllers are >> identifiable regardless of which task is looking at it. > > So, I thought about it more. How about we do "consider / ignore this > node" instead of "(don't) nest beyond this level". For example, let's > assume a tree like the following. > > R > / | \ > A B C > / \ > AA AB > > If we want to differentiate between AA and AB, we'll have to consider > the whole tree with the previous sheme - A needs to nest, so R needs > to nest and we end up with the whole tree. Instead, if we have honor > / ignore this node. We can set the honor bit on A, AA and AB and see > the tree as > > R > / > A > / \ > AA AB > > We still see the intermediate A node but can ignore the other > branches. Implementation and concept-wise, it's fairly simple too. > For any given node and controller, you travel upwards until you meet a > node which has the controller enabled and that's the cgroup the > controller considers. > > Thanks. 
> That is exactly what I proposed in our previous discussions around memcg, with files like "available_controllers", "current_controllers". Names chosen to match what other subsystems already do. If memcg is not in "available_controllers" for a node, it cannot be seen by anyone below that level. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5056E1FC.1090508-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-17 17:30 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-17 17:30 UTC (permalink / raw) To: Glauber Costa Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Vivek Goyal On Mon, Sep 17, 2012 at 12:40:28PM +0400, Glauber Costa wrote: > That is exactly what I proposed in our previous discussions around > memcg, with files like "available_controllers" , "current_controllers". > Name chosen to match what other subsystems already do. > > if memcg is not in "available_controllers" for a node, it cannot be seen > by anyone bellow that level. Glauber, I was talking about making the switch applicable from the current level *INSTEAD OF* anyone below the current level, so that we don't have to apply the same switch on all siblings. I have no idea why this is causing so much miscommunication. :( -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 8:40 ` Glauber Costa @ 2012-09-17 14:37 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 14:37 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 01:39:25PM -0700, Tejun Heo wrote: > Hello, again. > > On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: > > That said, if someone can think of a better solution, I'm all ears. > > One thing that *has* to be maintained is that it should be able to tag > > a resource in such way that its associated controllers are > > identifiable regardless of which task is looking at it. > > So, I thought about it more. How about we do "consider / ignore this > node" instead of "(don't) nest beyond this level". For example, let's > assume a tree like the following. > > R > / | \ > A B C > / \ > AA AB > > If we want to differentiate between AA and AB, we'll have to consider > the whole tree with the previous sheme - A needs to nest, so R needs > to nest and we end up with the whole tree. Instead, if we have honor > / ignore this node. We can set the honor bit on A, AA and AB and see > the tree as > > R > / > A > / \ > AA AB > > We still see the intermediate A node but can ignore the other > branches. Implementation and concept-wise, it's fairly simple too. > For any given node and controller, you travel upwards until you meet a > node which has the controller enabled and that's the cgroup the > controller considers. I think this proposal sounds reasonable. So by default if a new cgroup is created, we can inherit the controller settings of parent. 
And if the user does not want a particular controller enabled on a newly created cgroup, they will have to explicitly disable it. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (4 preceding siblings ...) 2012-09-14 15:03 ` Michal Hocko @ 2012-09-14 18:36 ` Aristeu Rozanski [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 22:03 ` Dhaval Giani ` (2 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Aristeu Rozanski @ 2012-09-14 18:36 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V Tejun, On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > memcg can be handled by memcg people and I can handle cgroup_freezer > and others with help from the authors. The problematic one is > blkio. If anyone is interested in working on blkio, please be my > guest. Vivek? Glauber? If Serge is not planning to do it already, I can take a look at device_cgroup. Also, I heard about the desire to have a device namespace instead, with support for translation ("sda" -> "sdf"). If anyone sees an immediate use for this, please let me know. -- Aristeu ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> @ 2012-09-14 18:54 ` Tejun Heo 2012-09-15 2:20 ` Serge E. Hallyn 2012-09-16 8:19 ` [RFC] cgroup TODOs James Bottomley 2 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:54 UTC (permalink / raw) To: Aristeu Rozanski Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 02:36:41PM -0400, Aristeu Rozanski wrote: > if Serge is not planning to do it already, I can take a look in device_cgroup. Yes please. :) Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 18:54 ` Tejun Heo @ 2012-09-15 2:20 ` Serge E. Hallyn [not found] ` <20120915022037.GA6438-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2012-09-16 8:19 ` [RFC] cgroup TODOs James Bottomley 2 siblings, 1 reply; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-15 2:20 UTC (permalink / raw) To: Aristeu Rozanski Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Eric W. Biederman Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > Tejun, > On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > if Serge is not planning to do it already, I can take a look in device_cgroup. That's fine with me, thanks. > also, heard about the desire of having a device namespace instead with > support for translation ("sda" -> "sdf"). If anyone see immediate use for > this please let me know. Before going down this road, I'd like to discuss this with at least you, me, and Eric Biederman (cc:d) as to how it relates to a device namespace. thanks, -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
* Controlling devices and device namespaces [not found] ` <20120915022037.GA6438-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2012-09-15 9:27 ` Eric W. Biederman [not found] ` <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-15 9:27 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): >> Tejun, >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: >> > memcg can be handled by memcg people and I can handle cgroup_freezer >> > and others with help from the authors. The problematic one is >> > blkio. If anyone is interested in working on blkio, please be my >> > guest. Vivek? Glauber? >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. > > That's fine with me, thanks. > >> also, heard about the desire of having a device namespace instead with >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> this please let me know. > > Before going down this road, I'd like to discuss this with at least you, > me, and Eric Biederman (cc:d) as to how it relates to a device > namespace. The problem with devices. - An unrestricted mknod gives you access to effectively any device in the system. - During process migration, if the device number changes, stat on the same file descriptor can start failing or returning different results. - Devices coming from preexisting filesystems that we mount as unprivileged users are as dangerous as mknod, but show that the problem is not limited to mknod. 
- udev thinks mknod is a system call we can remove from the kernel. --- The use cases seem comparatively simple to enumerate. - Giving unfiltered access to a device to someone not root. - Virtual devices that everyone uses and have no real privilege requirements: /dev/null /dev/tty /dev/zero etc. - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, nbd, iscsi, /dev/ptsN, etc --- There are a couple of solutions to these problems. - The classic solution of creating a /dev for a container before starting it. - The devpts filesystem. This works well for unprivileged access to ptys. Except for the /dev/ptmx silliness I very much like how things are handled today with devpts. - Device control groups. I am not quite certain what to make of them. The only case I see where they are better than a prebuilt static dev is if there is a hotplugged device that I want to push into my container. I think the only problem with device control groups and hierarchies is that removing a device from a whitelist does not recurse down the hierarchy. Can a process inside of a device control group create a child group that has access to a subset of its devices? The actual checks don't need to be hierarchical but the presence of device nodes should be. --- I see a couple of holes in the device control picture. - How do we handle hotplug events? I think we can do this by relaying events through userspace, updating the device control groups etc. - Unprivileged processes interacting with all of this. (possibly with privilege in their user namespace) What I don't know how to do is how to create a couple of different subhierarchies each for different child processes. - Dynamically created devices. My gut feel is that we should replicate the success of devpts and give each type of dynamically created device its own filesystem and mount point under /dev, and just bend the handful of userspace users into that model. 
- Sysfs My gut says for the container use case we should aim to simply not have dynamically created devices in sysfs and then we can simply not care. - Migration Either we need block device numbers that can migrate with us, (possibly a subset of the entire range ala devpts) or we need to send hotplug events to userspace right after a migration so userspace processes that care can invalidate their caches of stat data. --- With the code in my userns development tree I can create a user namespace, create a new mount namespace, and then if I have access to any block devices mount filesystems, all without needing to have any special privileges. What I haven't figured out is what it would take to get the device control group into the middle of that. It feels like it should be possible to get the checks straight and use the device control group hooks to control which devices are usable in a user namespace. Unfortunately when I try and work it out, the independence of the user namespace and the device control group seems to make that impossible. Shrug, there is most definitely something missing from our model of how to handle devices well. I am hoping we can sprinkle some devpts-derived pixie dust at the problem, migrate userspace to some new interfaces, and have life be good. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-15 22:05 ` Serge E. Hallyn [not found] ` <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-15 22:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > >> Tejun, > >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > >> > memcg can be handled by memcg people and I can handle cgroup_freezer > >> > and others with help from the authors. The problematic one is > >> > blkio. If anyone is interested in working on blkio, please be my > >> > guest. Vivek? Glauber? > >> > >> if Serge is not planning to do it already, I can take a look in device_cgroup. > > > > That's fine with me, thanks. > > > >> also, heard about the desire of having a device namespace instead with > >> support for translation ("sda" -> "sdf"). If anyone see immediate use for > >> this please let me know. > > > > Before going down this road, I'd like to discuss this with at least you, > > me, and Eric Biederman (cc:d) as to how it relates to a device > > namespace. > > > The problem with devices. > > - An unrestricted mknod gives you access to effectively any device in > the system. > > - During process migration if the device number changes using > stat to file descriptors can fail on the same file descriptor. 
> > - Devices coming from prexisting filesystems that we mount > as unprivileged users are as dangerous as mknod but show > that the problem is not limited to mknod. > > - udev thinks mknod is a system call we can remove from the kernel. Also, - udevadm trigger --action=add causes all the devices known on the host to be re-sent to everyone (all namespaces). Which floods everyone and causes the host to reset some devices. > --- > > The use cases seem comparitively simple to enumerate. > > - Giving unfiltered access to a device to someone not root. > > - Virtual devices that everyone uses and have no real privilege > requirements: /dev/null /dev/tty /dev/zero etc. > > - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, > nbd, iscsi, /dev/ptsN, etc and - per-namespace uevent filtering. > --- > > There are a couple of solution to these problems. > > - The classic solution of creating a /dev for a container > before starting it. > > - The devpts filesystem. This works well for unprivileged access > to ptys. Except for the /dev/ptmx sillines I very like how > things are handled today with devpts. > > - Device control groups. I am not quite certain what to make > of them. The only case I see where they are better than > a prebuilt static dev is if there is a hotppluged device > that I want to push into my container. > > I think the only problem with device control groups and > hierarchies is that removing a device from a whitelist > does not recurse down the hierarchy. That's going to be fixed soon thanks to Aristeu :) > Can a process inside of a device control group create > a child group that has access to a subset of it's > devices? The actually checks don't need to be hierarchical > but the presence of device nodes should be. If I understand your question right, yes. > --- > > I see a couple of holes in the device control picture. > > - How do we handle hotplug events? 
> > I think we can do this by relaying events trough userspace, > upating the device control groups etc. > > - Unprivileged processess interacting with all of this. > (possibly with privilege in their user namespace) > What I don't know how to do is how to create a couple of different > subhierarchies each for different child processes. > > - Dynamically created devices. > > My gut feel is that we should replicate the success of devpts > and give each type of dynamically created device it's own > filesystem and mount point under /dev, and just bend > the handful of userspace users into that model. Phew. Maybe. Had not considered that. But seems daunting. > - Sysfs > > My gut says for the container use case we should aim to > simply not have dynamically created devices in sysfs > and then we can simply not care. > > - Migration > > Either we need block device numbers that can migrate with us, > (possibly a subset of the entire range ala devpts) or we need to send > hotplug events to userspace right after a migration so userspace > processes that care can invalidate their caches of stat data. > > --- > > With the code in my userns development tree I can create a user > namespace, create a new mount namespace, and then if I have > access to any block devices mount filesystems, all without > needing to have any special privileges. What I haven't > figured out is what it would take to get the the device > control group into the middle that. I'm really not sure that's a question we want to ask. The device control group, like the ns cgroup, was meant as a temporary workaround to not having user and device namespaces. If we can come up with a device cgroup model that works to fill all the requirements we would have for a devices ns, then great. But I don't want us to be constrained by that. > It feels like it should be possible to get the checks straight > and use the device control group hooks to control which devices > are usable in a user namespace. 
Unfortunately when I try and work > it out the independence of the user namespace and the device > control group seem to make that impossible. > > Shrug there is most definitely something missing from our > model on how to handle devices well. I am hoping we can > sprinkling some devpts derived pixie dust at the problem > migrate userspace to some new interfaces and have life > be good. > > Eric Me too! I'm torn between suggesting that we have a session at UDS to discuss this, and not wanting to so that we can focus on the remaining questions with the user namespace. thanks, -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2012-09-16 0:24 ` Eric W. Biederman [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 0:24 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Thinking about this a bit more I think we have been asking the wrong question. I think the correct question should be: How do we safely allow for unprivileged creation of device nodes and devices? One piece of the puzzle is that we should be able to allow unprivileged device node creation and access for any device on any filesystem for which unprivileged access is safe. Something like the current device control group hooks but with the whitelist implemented like:

static bool unpriv_mknod_ok(struct device *dev)
{
	char *tmp, *name;
	umode_t mode = 0;

	name = device_get_devnode(dev, &mode, &tmp);
	if (!name)
		return false;
	kfree(tmp);
	return mode == 0666;
}

Are there current use cases where people actually want arbitrary access to hardware devices? I really want to say no and get udev and sysfs out of the picture as much as possible. "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): >> "Serge E. 
Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: >> >> > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): >> >> Tejun, >> >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: >> >> > memcg can be handled by memcg people and I can handle cgroup_freezer >> >> > and others with help from the authors. The problematic one is >> >> > blkio. If anyone is interested in working on blkio, please be my >> >> > guest. Vivek? Glauber? >> >> >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. >> > >> > That's fine with me, thanks. >> > >> >> also, heard about the desire of having a device namespace instead with >> >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> >> this please let me know. >> > >> > Before going down this road, I'd like to discuss this with at least you, >> > me, and Eric Biederman (cc:d) as to how it relates to a device >> > namespace. >> >> >> The problem with devices. >> >> - An unrestricted mknod gives you access to effectively any device in >> the system. >> >> - During process migration if the device number changes using >> stat to file descriptors can fail on the same file descriptor. >> >> - Devices coming from prexisting filesystems that we mount >> as unprivileged users are as dangerous as mknod but show >> that the problem is not limited to mknod. >> >> - udev thinks mknod is a system call we can remove from the kernel. > > Also, > > - udevadm trigger --action=add > > causes all the devices known on the host to be re-sent to > everyone (all namespaces). Which floods everyone and causes the > host to reset some devices. I think this is all userspace activity, and should be largely fixed by not being root in a container. >> --- >> >> The use cases seem comparitively simple to enumerate. >> >> - Giving unfiltered access to a device to someone not root. 
>> >> - Virtual devices that everyone uses and have no real privilege >> requirements: /dev/null /dev/tty /dev/zero etc. >> >> - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, >> nbd, iscsi, /dev/ptsN, etc > > and > > - per-namespace uevent filtering. One possible solution there is to just send the kernel uevents (except for the network ones) into the initial network namespace. >> --- >> >> There are a couple of solution to these problems. >> >> - The classic solution of creating a /dev for a container >> before starting it. >> >> - The devpts filesystem. This works well for unprivileged access >> to ptys. Except for the /dev/ptmx sillines I very like how >> things are handled today with devpts. >> >> - Device control groups. I am not quite certain what to make >> of them. The only case I see where they are better than >> a prebuilt static dev is if there is a hotppluged device >> that I want to push into my container. >> >> I think the only problem with device control groups and >> hierarchies is that removing a device from a whitelist >> does not recurse down the hierarchy. > > That's going to be fixed soon thanks to Aristeu :) > >> Can a process inside of a device control group create >> a child group that has access to a subset of it's >> devices? The actually checks don't need to be hierarchical >> but the presence of device nodes should be. > > If I understand your question right, yes. I should also have asked can we do this without any capabilities and without our uid being 0? >> --- >> >> I see a couple of holes in the device control picture. >> >> - How do we handle hotplug events? >> >> I think we can do this by relaying events trough userspace, >> upating the device control groups etc. >> >> - Unprivileged processess interacting with all of this. >> (possibly with privilege in their user namespace) >> What I don't know how to do is how to create a couple of different >> subhierarchies each for different child processes. 
>> >> - Dynamically created devices. >> >> My gut feel is that we should replicate the success of devpts >> and give each type of dynamically created device it's own >> filesystem and mount point under /dev, and just bend >> the handful of userspace users into that model. > > Phew. Maybe. Had not considered that. But seems daunting. I think the list of device types that we care about here is pretty small. Please correct me if I am wrong. loop nbd iscsi macvtap And if we want it to be safe to use these devices in a user namespace without global root privileges we need to go through the code anyway. So I think it is the gradual, safe and sane approach, assuming we don't run into something like the devpts /dev/ptmx silliness that stalled devpts. >> - Sysfs >> >> My gut says for the container use case we should aim to >> simply not have dynamically created devices in sysfs >> and then we can simply not care. I guess what I keep thinking for sysfs is that it should be for real hardware backed devices. If we can get away with that like we do with ptys it just makes everyone's life simpler. Primarily sysfs and uevents are for allowing the system to take automatic action when a new device is created. Do we have an actual need for hotplug support in containers? >> - Migration >> >> Either we need block device numbers that can migrate with us, >> (possibly a subset of the entire range ala devpts) or we need to send >> hotplug events to userspace right after a migration so userspace >> processes that care can invalidate their caches of stat data. >> >> --- >> >> With the code in my userns development tree I can create a user >> namespace, create a new mount namespace, and then if I have >> access to any block devices mount filesystems, all without >> needing to have any special privileges. What I haven't >> figured out is what it would take to get the the device >> control group into the middle that. > > I'm really not sure that's a question we want to ask. 
The > device control group, like the ns cgroup, was meant as a > temporary workaround to not having user and device namespaces. > > If we can come up with a device cgroup model that works to > fill all the requirements we would have for a devices ns, then > great. But I don't want us to be constrained by that. > >> It feels like it should be possible to get the checks straight >> and use the device control group hooks to control which devices >> are usable in a user namespace. Unfortunately when I try and work >> it out the independence of the user namespace and the device >> control group seem to make that impossible. >> >> Shrug there is most definitely something missing from our >> model on how to handle devices well. I am hoping we can >> sprinkling some devpts derived pixie dust at the problem >> migrate userspace to some new interfaces and have life >> be good. >> >> Eric > > Me too! > > I'm torn between suggesting that we have a session at UDS to > discuss this, and not wanting to so that we can focus on the > remaining questions with the user namespace. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 3:31 ` Serge E. Hallyn 2012-09-16 11:21 ` Alan Cox 1 sibling, 0 replies; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-16 3:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > > Thinking about this a bit more I think we have been asking the wrong > question. > > I think the correct question should be: How do we safely allow for > unprivileged creation of device nodes and devices? > > One piece of the puzzle is that we should be able to allow unprivileged > device node creation and access for any device on any filesystem > for which it unprivileged access is safe. > > Something like the current device control group hooks but > with the whitelist implemented like: > > static bool unpriv_mknod_ok(struct device *dev) > { > char *tmp, *name; > umode_t mode = 0; > > name = device_get_devnode(dev, &mode, &tmp); > if (!name) > return false; > kfree(tmp); > return mode == 0666; > } > > Are there current use cases where people actually want arbitrary > access to hardware devices? I really want to say no and get > udev and sysfs out of the picture as much as possible. Other devices I'm pretty sure people will be asking for include audio and video devices, input devices, usb drives, LVM volumes and probably volume groups and PVs as well. I do believe people want to dedicate drives to containers. Of course there is also /dev/random, and /dev/kmsg which I think needs to be tied to the also sorely missing syslog namespace. > "Serge E. 
Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > >> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > >> > >> > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > >> >> Tejun, > >> >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > >> >> > memcg can be handled by memcg people and I can handle cgroup_freezer > >> >> > and others with help from the authors. The problematic one is > >> >> > blkio. If anyone is interested in working on blkio, please be my > >> >> > guest. Vivek? Glauber? > >> >> > >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. > >> > > >> > That's fine with me, thanks. > >> > > >> >> also, heard about the desire of having a device namespace instead with > >> >> support for translation ("sda" -> "sdf"). If anyone see immediate use for > >> >> this please let me know. > >> > > >> > Before going down this road, I'd like to discuss this with at least you, > >> > me, and Eric Biederman (cc:d) as to how it relates to a device > >> > namespace. > >> > >> > >> The problem with devices. > >> > >> - An unrestricted mknod gives you access to effectively any device in > >> the system. > >> > >> - During process migration if the device number changes using > >> stat to file descriptors can fail on the same file descriptor. > >> > >> - Devices coming from prexisting filesystems that we mount > >> as unprivileged users are as dangerous as mknod but show > >> that the problem is not limited to mknod. > >> > >> - udev thinks mknod is a system call we can remove from the kernel. > > > > Also, > > > > - udevadm trigger --action=add > > > > causes all the devices known on the host to be re-sent to > > everyone (all namespaces). Which floods everyone and causes the > > host to reset some devices. 
> > I think this is all userspace activity, Well the uevents are sent from the kernel, and cause a flurry of userspace activity. (But not sending uevents to the containers as you suggest below would work) > and should be largely > fixed by not begin root in a container. That doesn't fit with our goal, which is to run the same, unmodified userspace on hardware, virtualization (kvm/vmware), and containers. This is important - the more we have to have different init and userspace in containers (there are a few things we have to special-case still at the moment) the more duplicated testing and otherwise avoidable bugs we'll have. Or did you just mean not being GLOBAL_ROOT_UID in a container? > >> The use cases seem comparitively simple to enumerate. > >> > >> - Giving unfiltered access to a device to someone not root. > >> > >> - Virtual devices that everyone uses and have no real privilege > >> requirements: /dev/null /dev/tty /dev/zero etc. > >> > >> - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, > >> nbd, iscsi, /dev/ptsN, etc > > > > and > > > > - per-namespace uevent filtering. > > One possible solution there is to just send the kernel uevents (except > for the network ones) into the initial network namespace. We'd also want storage (especially usb but not just) passed in, and audio, video and input - but maybe those should be faked from userspace from the host (or parent container)? Also, there *are* containers which are not in private network namespaces. Now I'm not sure how much we worry about those, as they generally need custom init anyway (so as not to reconfigure the host's networking etc). > >> There are a couple of solution to these problems. > >> > >> - The classic solution of creating a /dev for a container > >> before starting it. > >> > >> - The devpts filesystem. This works well for unprivileged access > >> to ptys. Except for the /dev/ptmx sillines I very like how > >> things are handled today with devpts. 
> >> > >> - Device control groups. I am not quite certain what to make > >> of them. The only case I see where they are better than > >> a prebuilt static dev is if there is a hotppluged device > >> that I want to push into my container. > >> > >> I think the only problem with device control groups and > >> hierarchies is that removing a device from a whitelist > >> does not recurse down the hierarchy. > > > > That's going to be fixed soon thanks to Aristeu :) > > > >> Can a process inside of a device control group create > >> a child group that has access to a subset of it's > >> devices? The actually checks don't need to be hierarchical > >> but the presence of device nodes should be. > > > > If I understand your question right, yes. > > I should also have asked can we do this without any capabilities > and without our uid being 0? Currently you need CAP_SYS_ADMIN to update device cgroup permissions. > >> I see a couple of holes in the device control picture. > >> > >> - How do we handle hotplug events? > >> > >> I think we can do this by relaying events trough userspace, > >> upating the device control groups etc. > >> > >> - Unprivileged processess interacting with all of this. > >> (possibly with privilege in their user namespace) > >> What I don't know how to do is how to create a couple of different > >> subhierarchies each for different child processes. > >> > >> - Dynamically created devices. > >> > >> My gut feel is that we should replicate the success of devpts > >> and give each type of dynamically created device it's own > >> filesystem and mount point under /dev, and just bend > >> the handful of userspace users into that model. > > > > Phew. Maybe. Had not considered that. But seems daunting. > > I think the list of device types that we care about here is pretty > small. Please correct me if I am wrong. 
> > loop nbd iscsi macvtap I assume you're asking only about devices that need virtualized instances, with the instances either unique or mapped between namespaces. (and I assume the hope is that we can get away with them being unique, as with devpts, and mappable with bind mounts) I can't think of any others offhand. Common devices used in containers include tty*, rtc, fuse, tun, hpet, kvm. /dev/tty and /dev/console are special anyway. The tty* in containers are always bind mounted with devpts. So I don't think any of those fit the criteria - no work needed. > And if we want it to be safe to use these devices in a user namespace > without global root privileges we need to go through the code anyway. Agreed. > So I think it is the gradual safe and sane approach assume we don't > run into something like the devpts /dev/ptmx silliness that stalled > devpts. Agreed. > >> - Sysfs > >> > >> My gut says for the container use case we should aim to > >> simply not have dynamically created devices in sysfs > >> and then we can simply not care. > > I guess what I keep thinking for sysfs is that it should be for real > hardware backed devices. If we can get away with that like we do with > ptys it just makes everyone's life simpler. You've brought up /sys and /proc, does devtmpfs further complicate things? > Primarily sysfs and uevents are for allowing the system to take > automatic action when a new device is created. Do we have an actual > need for hotplug support in containers? As I argue above, I claim we need them for the event-driven init systems to see NICs and other devices brought up, and to handle passing in usb devices etc. -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-09-16 3:31 ` Serge E. Hallyn @ 2012-09-16 11:21 ` Alan Cox [not found] ` <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Alan Cox @ 2012-09-16 11:21 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar > One piece of the puzzle is that we should be able to allow unprivileged > device node creation and access for any device on any filesystem > for which it unprivileged access is safe. Which devices are "safe" is policy for all interesting and useful cases, as are file permissions, security tags, chroot considerations and the like. It's a complete non starter. Alan ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> @ 2012-09-16 11:56 ` Eric W. Biederman [not found] ` <87sjaiuqp5.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 11:56 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> One piece of the puzzle is that we should be able to allow unprivileged >> device node creation and access for any device on any filesystem >> for which it unprivileged access is safe. > > Which devices are "safe" is policy for all interesting and useful cases, > as are file permissions, security tags, chroot considerations and the > like. > > It's a complete non starter. There are a handful of device nodes that the kernel creates with mode 0666. Essentially it is just /dev/tty /dev/null /dev/zero and a few others. Enormous numbers of programs won't work without them. Making them both interesting and useful. In very peculiar cases I can see not wanting to have access to generally safe devices, like in other peculiar cases we don't want access to the network stack. As for the general case, device nodes for real hardware in a container are, I think, the "interesting" case you were referring to. I personally find that case icky and boring. The sanest way I can think of handling real hardware device nodes is a tmpfs (acting like devtmpfs) mounted on /dev in the container's mount namespace, but also visible outside to the global root mounted somewhere interesting. 
We have a fuse filesystem pretending to be sysfs and relaying file accesses from the real sysfs for just the devices that we want to allow to that container. Then to add a device in a container, the managing daemon makes the devices available in the pretend sysfs, calls mknod on the tmpfs, and fakes the uevents. The only case I don't see that truly covering is keeping the stat data the same for files of migrated applications. Shrug, perhaps that will just have to be handled with another synthesized uevent. Hey userspace, I just hot-unplugged and hot-plugged your kernel; please cope. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87sjaiuqp5.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 12:17 ` Eric W. Biederman [not found] ` <87d31mupp3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 12:17 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: > >>> One piece of the puzzle is that we should be able to allow unprivileged >>> device node creation and access for any device on any filesystem >>> for which it unprivileged access is safe. >> >> Which devices are "safe" is policy for all interesting and useful cases, >> as are file permissions, security tags, chroot considerations and the >> like. >> >> It's a complete non starter. Come to think of it mknod is completely unnecessary. Without mknod. Without being able to mount filesystems containing device nodes. The mount namespace is sufficient to prevent all of the cases that the device control group prevents (open and mknod on device nodes). So I honestly think the device control group is superfluous, and it is probably wise to deprecate it and move to a model where it does not exist. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87d31mupp3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 13:32 ` Serge Hallyn [not found] ` <5055D4D1.3070407-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Serge Hallyn @ 2012-09-16 13:32 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox On 09/16/2012 07:17 AM, Eric W. Biederman wrote: > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > >> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> >>>> One piece of the puzzle is that we should be able to allow unprivileged >>>> device node creation and access for any device on any filesystem >>>> for which it unprivileged access is safe. >>> >>> Which devices are "safe" is policy for all interesting and useful cases, >>> as are file permissions, security tags, chroot considerations and the >>> like. >>> >>> It's a complete non starter. > > Come to think of it mknod is completely unnecessary. > > Without mknod. Without being able to mount filesystems containing > device nodes. Hm? That sounds like it will really upset init/udev/upgrades in the container. Are you saying all filesystems containing device nodes will need to be mounted in advance by the process setting up the container? > The mount namespace is sufficient to prevent all of the > cases that the device control group prevents (open and mknod on device > nodes). > > So I honestly think the device control group is superflous, and it is > probably wise to deprecate it and move to a model where it does not > exist. 
> > Eric > That's what I said a few emails ago :) The device cgroup was meant as a short-term workaround for lack of user (and device) namespaces. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <5055D4D1.3070407-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2012-09-16 14:23 ` Eric W. Biederman [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 14:23 UTC (permalink / raw) To: Serge Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > On 09/16/2012 07:17 AM, Eric W. Biederman wrote: >> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: >> >>> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >>> >>>>> One piece of the puzzle is that we should be able to allow unprivileged >>>>> device node creation and access for any device on any filesystem >>>>> for which it unprivileged access is safe. >>>> >>>> Which devices are "safe" is policy for all interesting and useful cases, >>>> as are file permissions, security tags, chroot considerations and the >>>> like. >>>> >>>> It's a complete non starter. >> >> Come to think of it mknod is completely unnecessary. >> >> Without mknod. Without being able to mount filesystems containing >> device nodes. > > Hm? That sounds like it will really upset init/udev/upgrades in the > container. udev does not create device nodes. For an older udev the worst I can see it doing is having mknod failing with EEXIST because the device node already exists. We should be able to make it look to init like a ramdisk mounted the filesystems. Why should upgrades care? Package installation shouldn't be calling mknod. At least with a recent modern distro I can't imagine this to be an issue. 
I expect we could have a kernel build option that removed the mknod system call and a modern distro wouldn't notice. > Are you saying all filesystems containing device nodes will need to be mounted in advance by the process setting up the container? As a general rule. I think in practice there is wiggle room for special cases like mounting a fresh devpts. devpts, at least in always-create-a-new-instance-on-mount mode, seems safe, as it cannot give you access to any existing devices. You can also do a lot of what would normally be done with mknod with bind mounts to the original device's location. >> The mount namespace is sufficient to prevent all of the >> cases that the device control group prevents (open and mknod on device >> nodes). >> >> So I honestly think the device control group is superflous, and it is >> probably wise to deprecate it and move to a model where it does not >> exist. >> >> Eric >> > > That's what I said a few emails ago :) The device cgroup was meant as > a short-term workaround for lack of user (and device) namespaces. I am saying something stronger. The device cgroup doesn't seem to have a practical function now. That for the general case we don't need any kernel support. That all of this should be a matter of some user space glue code, and just the tiniest bit of sorting out how hotplug events are sent. The only thing I can think we would need a device namespace for is for migration. For migration with direct access to real hardware devices we must treat it as hardware hotunplug. There is nothing else we can do. If there is any other case where we need to preserve device numbers etc we have the example of devpts. So at this point I really don't think we need a device namespace or a device control group. (Just emulate devtmpfs, sysfs and uevents). Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
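The devpts and bind-mount wiggle room Eric mentions looks roughly like the following. This is a hedged sketch: the rootfs path is illustrative, /dev/fuse is just an example of a vetted host node, and the commands require root.

```shell
# Illustrative container rootfs path (an assumption, not from the thread).
CROOT=/var/lib/containers/c1/rootfs

# A fresh devpts instance; "newinstance" means it cannot expose any
# pty that already exists elsewhere on the system.
mount -t devpts -o newinstance,ptmxmode=0666 devpts "$CROOT/dev/pts"
ln -sf pts/ptmx "$CROOT/dev/ptmx"

# Much of what mknod would otherwise do can instead be a bind mount
# of an existing, vetted device node from the host.
touch "$CROOT/dev/fuse"
mount --bind /dev/fuse "$CROOT/dev/fuse"
```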
* Re: Controlling devices and device namespaces [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 16:13 ` Alan Cox [not found] ` <20120916171316.517ad0fd-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> 2012-09-16 16:15 ` Serge Hallyn 1 sibling, 1 reply; 75+ messages in thread From: Alan Cox @ 2012-09-16 16:13 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar > At least with a recent modern distro I can't imagine this to be an > issue. I expect we could have a kernel build option that removed the > mknod system call and a modern distro wouldn't notice. A few things beyond named pipes will break. PCMCIA I believe still depends on ugly mknod hackery of its own. You also need it for some classes of non detectable device. Basically though you could. > For migration with direct access to real hardware devices we must treat > it as hardware hotunplug. There is nothing else we can do. That is demonstrably false for a shared bus or a network linked device. Consider a firewire camera wired to two systems at once. Consider SAN storage. Alan ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <20120916171316.517ad0fd-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> @ 2012-09-16 17:49 ` Eric W. Biederman 0 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 17:49 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> At least with a recent modern distro I can't imagine this to be an >> issue. I expect we could have a kernel build option that removed the >> mknod system call and a modern distro wouldn't notice. > > A few things beyond named pipes will break. PCMCIA I believe still > depends on ugly mknod hackery of its own. You also need it for some > classes of non detectable device. > > Basically though you could. Ah yes fifos. I had forgotten mknod created them. I am half surprised there isn't a mkfifo system call. >> For migration with direct access to real hardware devices we must treat >> it as hardware hotunplug. There is nothing else we can do. > > That is demonstrably false for a shared bus or a network linked device. > Consider a firewire camera wired to two systems at once. Consider SAN > storage. Sort of. If you are talking to the device directly there is usually enough state with the path changing that modelling it as a hotunplug/hotplug is about all that is practical. There is all of that intermediate state for in progress DMAs in the end system controllers etc. Now if you have a logical abstraction like a block device in between the program and the SAN storage, then figuring out how to preserve device names and numbers becomes interesting. 
At least far enough to keep device and inode numbers for stat intact. A fully general solution for preserving device names, and numbers requires rewriting sysfs. I expect a lot of the infrastructure someone needs is there already from my network namespace work but after having done the network namespace I am sick and tired of manhandling that unreasonably conjoined glob of device stuff. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
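The named-pipe point Alan raises is easy to check: on Linux there is indeed no separate mkfifo system call — mkfifo(1) and the mkfifo(3) library function go through the mknod machinery with S_IFIFO, and need no privilege, which is why removing mknod outright would break named pipes:

```shell
# Creating a fifo is an unprivileged mknod under the hood.
tmp=$(mktemp -d)
mkfifo "$tmp/pipe"
stat -c '%F' "$tmp/pipe"    # prints: fifo
rm -r "$tmp"
```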
* Re: Controlling devices and device namespaces [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-09-16 16:13 ` Alan Cox @ 2012-09-16 16:15 ` Serge Hallyn [not found] ` <5055FB2A.1020103-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Serge Hallyn @ 2012-09-16 16:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox On 09/16/2012 09:23 AM, Eric W. Biederman wrote: > Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > >> On 09/16/2012 07:17 AM, Eric W. Biederman wrote: >>> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: >>> >>>> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >>>> >>>>>> One piece of the puzzle is that we should be able to allow unprivileged >>>>>> device node creation and access for any device on any filesystem >>>>>> for which it unprivileged access is safe. >>>>> >>>>> Which devices are "safe" is policy for all interesting and useful cases, >>>>> as are file permissions, security tags, chroot considerations and the >>>>> like. >>>>> >>>>> It's a complete non starter. >>> >>> Come to think of it mknod is completely unnecessary. >>> >>> Without mknod. Without being able to mount filesystems containing >>> device nodes. >> >> Hm? That sounds like it will really upset init/udev/upgrades in the >> container. > > udev does not create device nodes. For an older udev the worst > I can see it doing is having mknod failing with EEXIST because > the device node already exists. > > We should be able to make it look to init like a ramdisk mounted the > filesystems. > > Why should upgrades care? 
Package installation shouldn't be calling > mknod. > > At least with a recent modern distro I can't imagine this to be an > issue. I expect we could have a kernel build option that removed the > mknod system call and a modern distro wouldn't notice. > >> Are you saying all filesystems containing device nodes will need to be >> mounted in advance by the process setting up the container? > > As a general rule. > > I think in practice there is wiggle room for special cases > like mounting a fresh devpts. devpts at least in always create a new > instance on mount mode seems safe, as it can not give you access to > any existing devices. > > You can also do a lot of what would normally be done with mknod > with bind mounts to the original devices location. > >>> The mount namespace is sufficient to prevent all of the >>> cases that the device control group prevents (open and mknod on device >>> nodes). >>> >>> So I honestly think the device control group is superflous, and it is >>> probably wise to deprecate it and move to a model where it does not >>> exist. >>> >>> Eric >>> >> >> That's what I said a few emails ago :) The device cgroup was meant as >> a short-term workaround for lack of user (and device) namespaces. > > I am saying something stronger. The device cgroup doesn't seem to have > a practical function now. "Now" is wrong. The user namespace is not complete and not yet usable for a full system container. We still need the device control group. I'd like us to have a sprint (either a day at UDS in person, or a few days with a virtual sprint) with the focus of getting a full system container working the way you envision it, as cleanly as possible. I can take two or three consecutive days sometime in the next 2-3 weeks; we can sit on irc and share a few instances on which to experiment? > That for the general case we don't need any > kernel support. 
That all of this should be a matter of some user space > glue code, and just the tiniest bit of sorting out how hotplug events are > sent. > > The only thing I can think we would need a device namespace for is > for migration. > > For migration with direct access to real hardware devices we must treat > it as hardware hotunplug. There is nothing else we can do. > > If there is any other case where we need to preserve device numbers > etc we have the example of devpts. > > So at this point I really don't think we need a device namespace or a > device control group. (Just emulate devtmpfs, sysfs and uevents). > > Eric > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <5055FB2A.1020103-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2012-09-16 16:53 ` Eric W. Biederman 0 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 16:53 UTC (permalink / raw) To: Serge Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Alan Cox Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: >>> That's what I said a few emails ago :) The device cgroup was meant as >>> a short-term workaround for lack of user (and device) namespaces. >> >> I am saying something stronger. The device cgroup doesn't seem to have >> a practical function now. > > "Now" is wrong. The user namespace is not complete and not yet usable for a > full system container. We still need the device control group. Dropping CAP_MKNOD, not being able to mount filesystems containing device nodes, plus mount namespace work to only allow you access to proper device nodes should work today. And I admit the user namespace as I have it coded in my tree does make this simpler. But I agree "Now" is too soon until we have actually demonstrated something else. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 18:54 ` Tejun Heo 2012-09-15 2:20 ` Serge E. Hallyn @ 2012-09-16 8:19 ` James Bottomley [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2 siblings, 1 reply; 75+ messages in thread From: James Bottomley @ 2012-09-16 8:19 UTC (permalink / raw) To: Aristeu Rozanski Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: > also, heard about the desire of having a device namespace instead with > support for translation ("sda" -> "sdf"). If anyone see immediate use for > this please let me know. That sounds like a really bad idea to me. We've spent ages training users that the actual sd<x> name of their device doesn't matter and they should use UUIDs or WWNs instead ... why should they now care inside containers? James ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2012-09-16 14:41 ` Eric W. Biederman 2012-09-17 13:21 ` Aristeu Rozanski 1 sibling, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 14:41 UTC (permalink / raw) To: James Bottomley Cc: Aristeu Rozanski, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> writes: > On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: >> also, heard about the desire of having a device namespace instead with >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> this please let me know. > > That sounds like a really bad idea to me. We've spent ages training > users that the actual sd<x> name of their device doesn't matter and they > should use UUIDs or WWNs instead ... why should they now care inside > containers? The goal is not to introduce new cases where people care but to handle cases where people do care. The biggest practical case of interest that I know of is:

  stat /home/myinterestingfile
    Device: 806h   Inode: 7460974

  <migration>

  stat /home/myinterestingfile
    Device: 732h   Inode: 7460974

And an unchanging file looks like it has just become a totally different file on a totally different filesystem. I think even things like git status will care. Although how much git cares about the device number I don't know. I do know rsyncing a git tree to another directory is enough to give git conniption fits. So this is really about device management and handling the horrible things that real user space does. 
There is also the case that there are some very strong ties between the names of device nodes and the names of sysfs files. Strong enough ties that I think you can strongly confuse userspace if you just happen to rename a device node. And ultimately this conversation is about the fact that none of this has been interesting enough in practice to figure out what really needs to be done to manage devices in containers. You can read the other thread if you want details. But right now it looks to me like the right answer is going to be building some userspace software and totally deprecating the device control group. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
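The Device:/Inode: pair in Eric's example is exactly what stat(2) reports; a quick way to see the numbers applications cache (the printed values will of course differ per system, so none are asserted here):

```shell
# st_dev (%D, in hex) and st_ino (%i) are the identity userspace
# relies on; if migration changes the backing device's number, the
# same unchanged inode suddenly looks like a different file.
f=$(mktemp)
stat -c 'Device: %Dh Inode: %i' "$f"
rm -f "$f"
```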
* Re: [RFC] cgroup TODOs [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2012-09-16 14:41 ` Eric W. Biederman @ 2012-09-17 13:21 ` Aristeu Rozanski 1 sibling, 0 replies; 75+ messages in thread From: Aristeu Rozanski @ 2012-09-17 13:21 UTC (permalink / raw) To: James Bottomley Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Sun, Sep 16, 2012 at 09:19:17AM +0100, James Bottomley wrote: > On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: > > also, heard about the desire of having a device namespace instead with > > support for translation ("sda" -> "sdf"). If anyone see immediate use for > > this please let me know. > > That sounds like a really bad idea to me. We've spent ages training > users that the actual sd<x> name of their device doesn't matter and they > should use UUIDs or WWNs instead ... why should they now care inside > containers? True, bad example on my part. The use case I had in mind when I wrote that can be solved by symbolic links. -- Aristeu ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (6 preceding siblings ...) 2012-09-14 18:36 ` Aristeu Rozanski @ 2012-09-14 22:03 ` Dhaval Giani [not found] ` <CAPhKKr8wDLrcWHLTRq1M7gU_6CGNxzzF83zJo2WZ5vrY7h8Qyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2012-09-20 1:33 ` Andy Lutomirski 2012-09-21 21:40 ` Tejun Heo 9 siblings, 1 reply; 75+ messages in thread From: Dhaval Giani @ 2012-09-14 22:03 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V > > * Sort & unique when listing tasks. Even the documentation says it > doesn't happen but we have a good hunk of code doing it in > cgroup.c. I'm gonna rip it out at some point. Again, if you > don't like it, scream. > I think some userspace tools do assume the uniq bit. So if we can preserve that, great! Thanks Dhaval ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <CAPhKKr8wDLrcWHLTRq1M7gU_6CGNxzzF83zJo2WZ5vrY7h8Qyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2012-09-14 22:06 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 22:06 UTC (permalink / raw) To: Dhaval Giani Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 06:03:16PM -0400, Dhaval Giani wrote: > > > > * Sort & unique when listing tasks. Even the documentation says it > > doesn't happen but we have a good hunk of code doing it in > > cgroup.c. I'm gonna rip it out at some point. Again, if you > > don't like it, scream. > > I think some userspace tools do assume the uniq bit. So if we can > preserve that, great! Can you point me to those? If there are users depending on it, I won't break it, at least for now, but I at least wanna know more about them. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (7 preceding siblings ...) 2012-09-14 22:03 ` Dhaval Giani @ 2012-09-20 1:33 ` Andy Lutomirski [not found] ` <505A725B.2080901-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> 2012-09-21 21:40 ` Tejun Heo 9 siblings, 1 reply; 75+ messages in thread From: Andy Lutomirski @ 2012-09-20 1:33 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List, Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner, Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw [grr. why does gmane scramble addresses?] On 09/13/2012 01:58 PM, Tejun Heo wrote: > > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > > ... > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > OK, I'll bite. I have a server that has a whole bunch of cores. 
A small fraction of those cores are general purpose and run whatever they like. The rest are tightly controlled. For simplicity, we have two cpusets that we use. The root allows all cpus. The other one only allows the general purpose cpus. We shove everything into the general-purpose-only cpuset, and then we move special stuff back to root. (We also shove some kernel threads into a non-root cpuset using the 'cset' tool.) Enter systemd, which wants a hierarchy corresponding to services. If we were to use it, we might end up violating its hierarchy. Alternatively, if we started using memcg, then we might want some tasks to have more restrictive memory usage but less restrictive cpu usage. As long as we can still pull this off, I'm happy. --Andy P.S. I'm sure you can guess why based on my email address :) ^ permalink raw reply [flat|nested] 75+ messages in thread
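For concreteness, Andy's two-cpuset layout might be set up as follows. This is a sketch under assumptions: a v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset, an illustrative cpu range, and a hypothetical $SPECIAL_PID; it requires root.

```shell
cd /sys/fs/cgroup/cpuset

# The restricted set: only the general-purpose cores (illustrative range).
mkdir -p general
echo 0-3 > general/cpuset.cpus
echo 0   > general/cpuset.mems

# Shove everything into the general-purpose-only cpuset; some moves
# fail (e.g. bound kernel threads), hence the "|| true".
while read -r pid; do
    echo "$pid" > general/tasks 2>/dev/null || true
done < tasks

# ...then move the special, tightly controlled jobs back to the root
# cpuset, which allows all cpus. $SPECIAL_PID is hypothetical.
echo "$SPECIAL_PID" > tasks
```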
* Re: [RFC] cgroup TODOs [not found] ` <505A725B.2080901-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> @ 2012-09-20 18:26 ` Tejun Heo [not found] ` <20120920182651.GH28934-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-20 18:26 UTC (permalink / raw) To: Andy Lutomirski Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List, Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner, Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw Hello, On Wed, Sep 19, 2012 at 06:33:15PM -0700, Andy Lutomirski wrote: > [grr. why does gmane scramble addresses?] You can append /raw to the message url and see the raw message. http://article.gmane.org/gmane.linux.kernel.containers/23802/raw > > I think this level of flexibility should be enough for most use > > cases. If someone disagrees, please voice your objections now. > > OK, I'll bite. > > I have a server that has a whole bunch of cores. A small fraction of > those cores are general purpose and run whatever they like. The rest > are tightly controlled. > > For simplicity, we have two cpusets that we use. The root allows all > cpus. The other one only allows the general purpose cpus. We shove > everything into the general-purpose-only cpuset, and then we move > special stuff back to root. (We also shove some kernel threads into a > non-root cpuset using the 'cset' tool.) Using root for special stuff probably isn't a good idea and moving bound kthreads into !root cgroups is already disallowed. > Enter systemd, which wants a hierarchy corresponding to services. If we > were to use it, we might end up violating its hierarchy. > > Alternatively, if we started using memcg, then we might have some tasks > to have more restrictive memory usage but less restrictive cpu usage. > > As long as we can still pull this off, I'm happy. 
IIUC, you basically want just two groups w/ cpuset and use it for loose cpu isolation for high priority jobs. Structure-wise, I don't think it's gonna be a problem although using root for special stuff would need to change. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs
  [not found] ` <20120920182651.GH28934-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-09-20 18:39 ` Andy Lutomirski
  0 siblings, 0 replies; 75+ messages in thread

From: Andy Lutomirski @ 2012-09-20 18:39 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List,
    Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V,
    Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner,
    Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw

On Thu, Sep 20, 2012 at 11:26 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Sep 19, 2012 at 06:33:15PM -0700, Andy Lutomirski wrote:
>> [grr. why does gmane scramble addresses?]
>
> You can append /raw to the message url and see the raw message.
>
> http://article.gmane.org/gmane.linux.kernel.containers/23802/raw

Thanks!

>> > I think this level of flexibility should be enough for most use
>> > cases.  If someone disagrees, please voice your objections now.
>>
>> OK, I'll bite.
>>
>> I have a server that has a whole bunch of cores.  A small fraction of
>> those cores are general purpose and run whatever they like.  The rest
>> are tightly controlled.
>>
>> For simplicity, we have two cpusets that we use.  The root allows all
>> cpus.  The other one only allows the general purpose cpus.  We shove
>> everything into the general-purpose-only cpuset, and then we move
>> special stuff back to root.  (We also shove some kernel threads into
>> a non-root cpuset using the 'cset' tool.)
>
> Using root for special stuff probably isn't a good idea and moving
> bound kthreads into !root cgroups is already disallowed.

Agreed.  I do it this way because it's easy and it works.  I can change
it in the future if needed.

>> Enter systemd, which wants a hierarchy corresponding to services.  If
>> we were to use it, we might end up violating its hierarchy.
>>
>> Alternatively, if we started using memcg, then we might want some
>> tasks to have more restrictive memory usage but less restrictive cpu
>> usage.
>>
>> As long as we can still pull this off, I'm happy.
>
> IIUC, you basically want just two groups w/ cpuset and use it for
> loose cpu isolation for high priority jobs.  Structure-wise, I don't
> think it's gonna be a problem although using root for special stuff
> would need to change.

Right.  But what happens when multiple hierarchies go away and I lose
control of the structure?

If systemd or whatever sticks my whole session or my service (or
however I organize it) into cgroup /whatever, then either I can put my
use-all-cpus tasks into /whatever/everything or I can step outside the
hierarchy and put them into /everything.  The former doesn't work,
because:

<quote>
The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
</quote>

The latter might confuse systemd.

My real objection might be to the requirement that a cpuset can't be
less restrictive than its parent.  Currently I can arrange for a task
to simultaneously have a less restrictive cpuset and a more restrictive
memory limit (or to stick it into a container or whatever).  If the
hierarchies have to correspond, this stops working.

--Andy

^ permalink raw reply	[flat|nested] 75+ messages in thread
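The cpuset subset rule Andy quotes is the crux of his objection: in a single hierarchy, a child group can never allow more CPUs than its parent, so a "less restrictive" child is simply rejected. A minimal Python sketch of that invariant follows; it only models the validation logic, not the kernel's actual code, and the function names (`parse_cpulist`, `may_set_cpus`) are made up for illustration:

```python
def parse_cpulist(s):
    """Parse a cpuset-style list string like '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def may_set_cpus(parent, child):
    """The subset rule: a child cpuset's CPUs must be a subset of its parent's."""
    return parse_cpulist(child) <= parse_cpulist(parent)

# Root allows all CPUs; a child restricted to a few CPUs is fine.
print(may_set_cpus("0-15", "0-3"))   # True

# But a child that tries to be *less* restrictive than its parent
# (Andy's /whatever/everything case) violates the rule.
print(may_set_cpus("0-3", "0-15"))   # False
```

This is why stepping outside the systemd-managed subtree (a sibling /everything) is the only way to regain the full CPU set once the task's group has been placed under a restricted parent.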
* Re: [RFC] cgroup TODOs
  [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  (8 preceding siblings ...)
  2012-09-20  1:33 ` Andy Lutomirski
@ 2012-09-21 21:40 ` Tejun Heo
  9 siblings, 0 replies; 75+ messages in thread

From: Tejun Heo @ 2012-09-21 21:40 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner,
    Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras,
    Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote:
> 7. Misc issues
>
> * Sort & unique when listing tasks.  Even the documentation says it
>   doesn't happen but we have a good hunk of code doing it in
>   cgroup.c.  I'm gonna rip it out at some point.  Again, if you
>   don't like it, scream.
>
> * At the LPC, pjt told me that assigning threads of a cgroup to
>   different cgroups is useful for some use cases but if we're to
>   have a unified hierarchy, I don't think we can continue to do
>   that.  Paul, can you please elaborate on the use case?
>
> * Vivek brought up the issue of distributing resources to tasks and
>   groups in the same cgroup.  I don't know.  Need to think more
>   about it.

* Update docs.

* Clean up cftype->read/write*() mess.

* Use sane fs event mechanism.

* Drop userland helper based empty notification.

Argh...

-- 
tejun