* [RFC] cgroup TODOs
@ 2012-09-13 20:58 Tejun Heo
  2012-09-14 11:15 ` Peter Zijlstra
  [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  0 siblings, 2 replies; 75+ messages in thread
From: Tejun Heo @ 2012-09-13 20:58 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V,
    Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf,
    Serge E. Hallyn, Paul Turner, Ingo Molnar

Hello, guys.

Here's the write-up I promised last week about what I think are the
problems in cgroup and what the current plans are.

First of all, it's a mess.  Shame on me.  Shame on you.  Shame on all
of us for allowing this mess.  Let's all tremble in shame for a solid
ten seconds before proceeding.

I'll list the issues I currently see with cgroup (easier ones first).
I think I now have at least tentative plans for all of them and will
list them together with the prospective assignees (my wish, mostly).
Unfortunately, some of the plans involve userland-visible changes
which would at least cause some discomfort and require adjustments on
their part.

1. cpu and cpuacct

   They cover the same resources and the scheduler cgroup code ends up
   having to traverse two separate cgroup trees to update the stats.
   With nested cgroups, the overhead isn't insignificant and it
   generally is silly.

   While the use cases for having cpuacct on a separate and likely
   more granular hierarchy are somewhat valid, the consensus seems to
   be that it's just not worth the trouble, and cpuacct should be
   removed in the long term and we shouldn't allow overlapping
   controllers for the same resource, especially accounting ones.

   Solution:

   * Whine if cpuacct is not co-mounted with cpu.

   * Make sure cpu has all the stats of cpuacct.  If cpu and cpuacct
     are comounted, don't really mount cpuacct but tell cpu that the
     user requested it.
     cpu is updated to create aliases for cpuacct.* files in such
     cases.  This involves special-casing cpuacct in cgroup core but I
     much prefer a one-off exception case to adding a generic
     mechanism for this.

   * After a while, we can just remove cpuacct completely.

   * Later on, phase out the aliases too.

   Who:

   Me, working on it.

2. memcg's __DEPRECATED_clear_css_refs

   This is a remnant of another weird design decision of requiring
   synchronous draining of refcnts on cgroup removal and allowing
   subsystems to veto cgroup removal - what's the userspace supposed
   to do afterwards?  Note that this also hinders co-mounting
   different controllers.

   The behavior could be useful for development and debugging but it
   unnecessarily interlocks userland-visible behavior with in-kernel
   implementation details.  To me, it seems outright wrong (either
   implement proper severing semantics in the controller or do full
   refcnting) and disallows, for example, lazy drain of caching refs.
   Also, it complicates the removal path with try / commit / revert
   logic which has never been fully correct since the beginning.

   Currently, the only remaining user is memcg.

   Solution:

   * Update memcg->pre_destroy() such that it never fails.

   * Drop __DEPRECATED_clear_css_refs and all related logic.
     Convert pre_destroy() to return void.

   Who:

   KAMEZAWA, Michal, PLEASE.  I will make __DEPRECATED_clear_css_refs
   trigger WARN sooner or later.  Let's please get this settled.

3. cgroup_mutex usage outside cgroup core

   This is another thing which is simply broken.  Given the way
   cgroup is structured and used, nesting cgroup_mutex inside any
   other commonly used lock simply doesn't work - it's held while
   invoking controller callbacks which then interact and synchronize
   with various core subsystems.

   There are currently three external cgroup_mutex users - cpuset,
   memcontrol and cgroup_freezer.
   Solution:

   Well, we should just stop doing it - use a separate nested lock
   (which seems possible for cgroup_freezer) or track and manage task
   in/egress some other way.

   Who:

   I'll do the cgroup_freezer.  I'm hoping PeterZ or someone who's
   familiar with the code base takes care of cpuset.  Michal, can you
   please take care of memcg?

4. Make disabled controllers cheaper

   Mostly through the use of static_keys, I suppose.  Making this
   easier AFAICS depends on resolving #2.  The lock dependency loop
   from #2 makes using static_keys from cgroup callbacks extremely
   nasty.

   Solution:

   Fix #2 and support the common pattern from cgroup core.

   Who:

   Dunno.  Let's see.

5. I CAN HAZ HIERARCHIES?

   The cpu ones handle nesting correctly - parent's accounting
   includes children's, parent's configuration affects children's
   unless explicitly overridden, and children's limits nest inside
   parent's.

   memcg asked itself the existential question of to be hierarchical
   or not and then got confused and decided to become both.

   When faced with the same question, blkio and cgroup_freezer just
   gave up and decided to allow nesting and then ignore it -
   brilliant.

   And there are others which kinda sorta try to handle hierarchy but
   only go halfway.

   This one is screwed up embarrassingly badly.  We failed to
   establish one of the most basic semantics and can't even define
   what a cgroup hierarchy is - it depends on each controller and
   they're mostly wacky!

   Fortunately, I don't think it will be prohibitively difficult to
   dig ourselves out of this hole.

   Solution:

   * cpu ones seem fine.

   * For broken controllers, cgroup core will be generating warning
     messages if the user tries to nest cgroups so that the user at
     least can know that the behavior may change underneath them
     later on.  For more details,

     http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902

   * memcg can be fully hierarchical but we need to phase out the
     flat hierarchy support.  Unfortunately, this involves flipping
     the behavior for the existing users.
     Upstream will try to nudge users with warning messages.  Most of
     the burden would be on the distros and at least SUSE seems to be
     on board with it.  Needs coordination with other distros.

   * blkio is the most problematic.  It has two sub-controllers - cfq
     and blk-throttle.  Both are utterly broken in terms of hierarchy
     support and the former is known to have a pretty hairy code
     base.  I don't see any other way than just biting the bullet and
     fixing it.

   * cgroup_freezer and others shouldn't be too difficult to fix.

   Who:

   memcg can be handled by memcg people and I can handle
   cgroup_freezer and others with help from the authors.  The
   problematic one is blkio.  If anyone is interested in working on
   blkio, please be my guest.  Vivek?  Glauber?

6. Multiple hierarchies

   Apart from the apparent wheeeeeeeeness of it (I think I talked
   about that enough the last time[1]), there's a basic problem when
   more than one controller interacts - it's impossible to define a
   resource group when more than two controllers are involved because
   the intersection of different controllers is only defined in terms
   of tasks.  IOW, if an entity X is of interest to two controllers,
   there's no way to map X to the cgroups of the two controllers.  X
   may belong to A and B when viewed by one task but A' and B when
   viewed by another.  This already is a head-scratcher in writeback
   where blkcg and memcg have to interact.

   While I am pushing for a unified hierarchy, I think it's necessary
   to have different levels of granularity depending on controllers,
   given that nesting involves significant overhead and noticeable
   controller-dependent behavior changes.

   Solution:

   I think a unified hierarchy with the ability to ignore subtrees
   depending on controllers should work.  For example, let's assume
   the following hierarchy.

           R
          / \
         A   B
        / \
       AA  AB

   All controllers are co-mounted.  There is a per-cgroup knob which
   controls which controllers nest beyond it.
   If blkio doesn't want to distinguish AA and AB, the user can
   specify that blkio doesn't nest beyond A and blkio would see the
   tree as,

           R
          / \
         A   B

   while other controllers keep seeing the original tree.  The exact
   form of the interface, I don't know yet.  It could be a single
   file which the user echoes [-]controller name into, or a
   per-controller boolean file.

   I think this level of flexibility should be enough for most use
   cases.  If someone disagrees, please voice your objections now.

   I *think* this can be achieved by changing where css_set is bound.
   Currently, a css_set is (conceptually) owned by a task.  After the
   change, a cgroup in the unified hierarchy has its own css_set
   which tasks point to and which can also be used to tag resources
   as necessary.  This way, it should be achievable without
   introducing a lot of new code or affecting individual controllers
   too much.

   The headache will be the transition period where we'll probably
   have to support both modes of operation.  Oh well....

   Who:

   Li, Glauber and me, I guess?

7. Misc issues

   * Sort & unique when listing tasks.  Even the documentation says
     it doesn't happen but we have a good hunk of code doing it in
     cgroup.c.  I'm gonna rip it out at some point.  Again, if you
     don't like it, scream.

   * At the PLC, pjt told me that assigning threads of a cgroup to
     different cgroups is useful for some use cases but if we're to
     have a unified hierarchy, I don't think we can continue to do
     that.  Paul, can you please elaborate on the use case?

   * Vivek brought up the issue of distributing resources to tasks
     and groups in the same cgroup.  I don't know.  Need to think
     more about it.

Thanks.

--
tejun

[1] http://thread.gmane.org/gmane.linux.kernel.cgroups/857

^ permalink raw reply	[flat|nested] 75+ messages in thread
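[Editorial sketch] The per-cgroup nesting knob described above can be
modeled in a few lines of userspace C.  All names and structures here
are hypothetical illustrations, not kernel code: a controller's
effective cgroup is found by walking toward the root and collapsing
into the highest ancestor that declares the controller doesn't nest
beyond it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the proposal: each cgroup carries a
 * per-controller flag saying whether that controller distinguishes
 * the children below it. */
enum ctrl { CTRL_CPU, CTRL_BLKIO, CTRL_MAX };

struct cgroup {
	const char *name;
	struct cgroup *parent;
	bool nests_beyond[CTRL_MAX];	/* does @ctrl nest past this node? */
};

/* Walk up from @cg; every ancestor that clamps nesting for @c hides
 * the subtree below it, so the controller's view collapses into the
 * topmost clamping ancestor. */
static struct cgroup *effective_cgroup(struct cgroup *cg, enum ctrl c)
{
	struct cgroup *eff = cg;
	struct cgroup *p;

	for (p = cg->parent; p; p = p->parent)
		if (!p->nests_beyond[c])
			eff = p;
	return eff;
}
```

With the R / A,B / AA,AB tree from the mail and blkio clamped at A,
blkio resolves AA and AB to A while cpu still sees the full tree.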
* Re: [RFC] cgroup TODOs
  2012-09-13 20:58 [RFC] cgroup TODOs Tejun Heo
@ 2012-09-14 11:15 ` Peter Zijlstra
  2012-09-14 12:54   ` Daniel P. Berrange
  2012-09-14 17:53   ` Tejun Heo
  [not found]       ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  1 sibling, 2 replies; 75+ messages in thread
From: Peter Zijlstra @ 2012-09-14 11:15 UTC (permalink / raw)
To: Tejun Heo
Cc: containers, cgroups, linux-kernel, Li Zefan, Michal Hocko,
    Glauber Costa, Paul Turner, Johannes Weiner, Thomas Graf,
    Serge E. Hallyn, Paul Mackerras, Ingo Molnar,
    Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote:
> The cpu ones handle nesting correctly - parent's accounting includes
> children's, parent's configuration affects children's unless
> explicitly overridden, and children's limits nest inside parent's.

The implementation has some issues with fixed-point math limitations
on deep hierarchies/large cpu counts, but yes.

Doing soft-float/bignum just isn't going to be popular I guess ;-)

People also don't seem to understand that each extra cgroup carries a
cost and that nested cgroups are more expensive still, even if the
intermediate levels are mostly empty (libvirt is a good example of how
not to do things).

Anyway, I guess what I'm saying is that we need to work on awareness
of the cost associated with all this cgroup nonsense; people seem to
think it's all good and free -- or not think at all, which, while
depressing, seems the more likely option.

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 11:15 ` Peter Zijlstra @ 2012-09-14 12:54 ` Daniel P. Berrange [not found] ` <20120914125427.GW6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 17:53 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Daniel P. Berrange @ 2012-09-14 12:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, Serge E. Hallyn, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > The cpu ones handle nesting correctly - parent's accounting includes > > children's, parent's configuration affects children's unless > > explicitly overridden, and children's limits nest inside parent's. > > The implementation has some issues with fixed point math limitations on > deep hierarchies/large cpu count, but yes. > > Doing soft-float/bignum just isn't going to be popular I guess ;-) > > People also don't seem to understand that each extra cgroup carries a > cost and that nested cgroups are more expensive still, even if the > intermediate levels are mostly empty (libvirt is a good example of how > not to do things). > > Anyway, I guess what I'm saying is that we need to work on the awareness > of cost associated with all this cgroup nonsense, people seem to think > its all good and free -- or not think at all, which, while depressing, > seem the more likely option. In defense of what libvirt is doing, I'll point out that the kernel docs on cgroups make little to no mention of these performance / cost implications, and the examples of usage given arguably encourage use of deep hierarchies. 
Given what we've now learnt about the kernel's lack of scalability wrt cgroup hierarchies, we'll be changing the way libvirt deals with cgroups, to flatten it out to only use 1 level by default. If the kernel docs had clearly expressed the limitations & made better recommendations on app usage we would never have picked the approach we originally chose. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914125427.GW6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 8:55 ` Glauber Costa 0 siblings, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-14 8:55 UTC (permalink / raw) To: Daniel P. Berrange Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On 09/14/2012 04:54 PM, Daniel P. Berrange wrote: > On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: >> On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: >>> The cpu ones handle nesting correctly - parent's accounting includes >>> children's, parent's configuration affects children's unless >>> explicitly overridden, and children's limits nest inside parent's. >> >> The implementation has some issues with fixed point math limitations on >> deep hierarchies/large cpu count, but yes. >> >> Doing soft-float/bignum just isn't going to be popular I guess ;-) >> >> People also don't seem to understand that each extra cgroup carries a >> cost and that nested cgroups are more expensive still, even if the >> intermediate levels are mostly empty (libvirt is a good example of how >> not to do things). >> >> Anyway, I guess what I'm saying is that we need to work on the awareness >> of cost associated with all this cgroup nonsense, people seem to think >> its all good and free -- or not think at all, which, while depressing, >> seem the more likely option. > > In defense of what libvirt is doing, I'll point out that the kernel > docs on cgroups make little to no mention of these performance / cost > implications, and the examples of usage given arguably encourage use > of deep hierarchies. 
> > Given what we've now learnt about the kernel's lack of scalability > wrt cgroup hierarchies, we'll be changing the way libvirt deals with > cgroups, to flatten it out to only use 1 level by default. If the > kernel docs had clearly expressed the limitations & made better > recommendations on app usage we would never have picked the approach > we originally chose. > > Regards, > Daniel > I personally don't think this is such a crazy setup. It is perfectly valid to say "all applications managed by libvirt as a whole cannot use more than X". Now of course there are other ways to do it, and we really need to make people more aware of the costs... ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 11:15 ` Peter Zijlstra 2012-09-14 12:54 ` Daniel P. Berrange @ 2012-09-14 17:53 ` Tejun Heo 1 sibling, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:53 UTC (permalink / raw) To: Peter Zijlstra Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V Hello, Peter. On Fri, Sep 14, 2012 at 01:15:02PM +0200, Peter Zijlstra wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > The cpu ones handle nesting correctly - parent's accounting includes > > children's, parent's configuration affects children's unless > > explicitly overridden, and children's limits nest inside parent's. > > The implementation has some issues with fixed point math limitations on > deep hierarchies/large cpu count, but yes. > > Doing soft-float/bignum just isn't going to be popular I guess ;-) As things currently stand, I think the cpu stuff is high enough bar to aim for. That said, I do have some problems with how it handles tasks vs. groups. Will talk about in another reply. > People also don't seem to understand that each extra cgroup carries a > cost and that nested cgroups are more expensive still, even if the > intermediate levels are mostly empty (libvirt is a good example of how > not to do things). > > Anyway, I guess what I'm saying is that we need to work on the awareness > of cost associated with all this cgroup nonsense, people seem to think > its all good and free -- or not think at all, which, while depressing, > seem the more likely option. The decision may not have been conscious but it seems that we settled on the direction where cgroup does more hierarchy-wise rather than leaving non-scalable operations to each use case - e.g. 
filesystem trees are very scalable but for that they give up a lot of tree-aware things like knowing the size of a given subtree. For what cgroup does, I think the naturally chosen direction is the right one. Its functionality inherently requires more involvement with the tree structure and we of course should try to document the implications clearly and make things scale better where we can (e.g. stat propagation has no reason to happen on every update). Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
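[Editorial sketch] Tejun's parenthetical that stat propagation has no
reason to happen on every update can be illustrated with a small
userspace model (hypothetical names, not the kernel implementation):
updates accumulate in a local delta and are only propagated up the
ancestor chain once they cross a batch threshold, so the common path
stays local instead of walking the whole tree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batched hierarchical counter: each group keeps a
 * local unflushed delta; hierarchical totals are only updated when
 * the delta crosses BATCH (or on an explicit flush). */
#define BATCH 32

struct grp {
	struct grp *parent;
	long total;	/* hierarchical total, updated lazily */
	long delta;	/* local updates not yet propagated */
};

/* Propagate the accumulated delta up the ancestor chain. */
static void grp_flush(struct grp *g)
{
	long d = g->delta;
	struct grp *p;

	g->delta = 0;
	for (p = g; p; p = p->parent)
		p->total += d;
}

/* Fast path: purely local unless the delta grows large enough. */
static void grp_charge(struct grp *g, long n)
{
	g->delta += n;
	if (g->delta >= BATCH || g->delta <= -BATCH)
		grp_flush(g);
}
```

Readers of `total` see a value that is stale by at most BATCH per
descendant, which is the usual trade-off such batching makes.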
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 8:16 ` Glauber Costa [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-14 9:04 ` Mike Galbraith ` (8 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-14 8:16 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar First: Can we please keep some key userspace guys CCd? > 1. cpu and cpuacct > > They cover the same resources and the scheduler cgroup code ends up > having to traverse two separate cgroup trees to update the stats. > With nested cgroups, the overhead isn't insignificant and it > generally is silly. > > While the use cases for having cpuacct on a separate and likely more > granular hierarchy, are somewhat valid, the consensus seems that > it's just not worth the trouble and cpuacct should be removed in the > long term and we shouldn't allow overlapping controllers for the > same resource, especially accounting ones. > > Solution: > > * Whine if cpuacct is not co-mounted with cpu. > > * Make sure cpu has all the stats of cpuacct. If cpu and cpuacct > are comounted, don't really mount cpuacct but tell cpu that the > user requested it. cpu is updated to create aliases for cpuacct.* > files in such cases. This involves special casing cpuacct in > cgroup core but I much prefer one-off exception case to adding a > generic mechanism for this. > > * After a while, we can just remove cpuacct completely. > > * Later on, phase out the aliases too. > > Who: > > Me, working on it. I can work on it as well if you want. 
I dealt with it many times in the past, and tried some different approaches, so I am familiar. But if you're already doing it, be my guest... > > 2. memcg's __DEPRECATED_clear_css_refs > > This is a remnant of another weird design decision of requiring > synchronous draining of refcnts on cgroup removal and allowing > subsystems to veto cgroup removal - what's the userspace supposed to > do afterwards? Note that this also hinders co-mounting different > controllers. > > The behavior could be useful for development and debugging but it > unnecessarily interlocks userland visible behavior with in-kernel > implementation details. To me, it seems outright wrong (either > implement proper severing semantics in the controller or do full > refcnting) and disallows, for example, lazy drain of caching refs. > Also, it complicates the removal path with try / commit / revert > logic which has never been fully correct since the beginning. > > Currently, the only left user is memcg. > > Solution: > > * Update memcg->pre_destroy() such that it never fails. > > * Drop __DEPRECATED_clear_css_refs and all related logic. > Convert pre_destroy() to return void. > > Who: > > KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs > trigger WARN sooner or later. Let's please get this settled. > > 3. cgroup_mutex usage outside cgroup core > > This is another thing which is simply broken. Given the way cgroup > is structured and used, nesting cgroup_mutex inside any other > commonly used lock simply doesn't work - it's held while invoking > controller callbacks which then interact and synchronize with > various core subsystems. > > There are currently three external cgroup_mutex users - cpuset, > memcontrol and cgroup_freezer. > > Solution: > > Well, we should just stop doing it - use a separate nested lock > (which seems possible for cgroup_freezer) or track and mange task > in/egress some other way. > > Who: > > I'll do the cgroup_freezer. 
> I'm hoping PeterZ or someone who's
> familiar with the code base takes care of cpuset.  Michal, can you
> please take care of memcg?
>

I think this is a pressing problem, yes, but not the only problem with
cgroup lock. Even if we restrict its usage to cgroup core, we can
still call cgroup functions, which will lock. And then we gain
nothing.

And the problem is that people need to lock. cgroup_lock is needed
because the data you are accessing is protected by it. The way I see
it, it is incredible how we were able to revive the BKL in the form of
cgroup_lock after we finally managed to get rid of it!

We should just start doing more fine-grained locking of data, instead
of "stop the world, cgroup just started!". If we do that, the problem
you are trying to address here will even cease to exist.

> 4. Make disabled controllers cheaper
>
>    Mostly through the use of static_keys, I suppose.  Making this
>    easier AFAICS depends on resolving #2.  The lock dependency loop
>    from #2 makes using static_keys from cgroup callbacks extremely
>    nasty.
>
>    Solution:
>
>    Fix #2 and support common pattern from cgroup core.
>
>    Who:
>
>    Dunno.  Let's see.

I've been doing it for the kmem-related controllers, and by trying to
do it with cpu/cpuacct, I became quite familiar with the corner cases,
etc. I can happily tackle it.

>
> 5. I CAN HAZ HIERARCHIES?
>
>    The cpu ones handle nesting correctly - parent's accounting includes
>    children's, parent's configuration affects children's unless
>    explicitly overridden, and children's limits nest inside parent's.
>
>    memcg asked itself the existential question of to be hierarchical or
>    not and then got confused and decided to become both.
>
>    When faced with the same question, blkio and cgroup_freezer just
>    gave up and decided to allow nesting and then ignore it - brilliant.
>
>    And there are others which kinda sorta try to handle hierarchy but
>    only goes way-half.
>
>    This one is screwed up embarrassingly badly.
>    We failed to establish
>    one of the most basic semantics and can't even define what a cgroup
>    hierarchy is - it depends on each controller and they're mostly
>    wacky!
>
>    Fortunately, I don't think it will be prohibitively difficult to dig
>    ourselves out of this hole.
>
>    Solution:
>
>    * cpu ones seem fine.
>
>    * For broken controllers, cgroup core will be generating warning
>      messages if the user tries to nest cgroups so that the user at
>      least can know that the behavior may change underneath them later
>      on.  For more details,
>
>      http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902
>
>    * memcg can be fully hierarchical but we need to phase out the flat
>      hierarchy support.  Unfortunately, this involves flipping the
>      behavior for the existing users.  Upstream will try to nudge users
>      with warning messages.  Most burden would be on the distros and at
>      least SUSE seems to be on board with it.  Needs coordination with
>      other distros.
>
>    * blkio is the most problematic.  It has two sub-controllers - cfq
>      and blk-throttle.  Both are utterly broken in terms of hierarchy
>      support and the former is known to have pretty hairy code base.  I
>      don't see any other way than just biting the bullet and fixing it.
>
>    * cgroup_freezer and others shouldn't be too difficult to fix.
>
>    Who:
>
>    memcg can be handled by memcg people and I can handle cgroup_freezer
>    and others with help from the authors.  The problematic one is
>    blkio.  If anyone is interested in working on blkio, please be my
>    guest.  Vivek?  Glauber?

I am happy to help where manpower is needed, but I must note I am a
bit ignorant of block in general.

>
> 6.
Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy. > > R > / \ > A B > / \ > AA AB > > All controllers are co-mounted. There is per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as, > > R > / \ > A B > > While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > Do you realize this is the exact same thing I proposed in our last round, and you keep screaming saying you wanted something else, right? 
The only difference is that the discussion at the time started from a
forced-comount patch, but that is not the core of the question. For
what you are proposing to make sense, the controllers need to be
comounted, and at some point we'll have to enforce it. Be it now or in
the future. But as for what to do when they are in fact comounted, I
see no difference between what you are saying and what I said.

> I *think* this can be achieved by changing where css_set is bound.
> Currently, a css_set is (conceptually) owned by a task.  After the
> change, a cgroup in the unified hierarchy has its own css_set which
> tasks point to and can also be used to tag resources as necessary.
> This way, it should be achieveable without introducing a lot of new
> code or affecting individual controllers too much.
>
> The headache will be the transition period where we'll probably have
> to support both modes of operation.  Oh well....
>
> Who:
>
> Li, Glauber and me, I guess?
>
> 7. Misc issues
>
>    * Sort & unique when listing tasks.  Even the documentation says it
>      doesn't happen but we have a good hunk of code doing it in
>      cgroup.c.  I'm gonna rip it out at some point.  Again, if you
>      don't like it, scream.
>

In all honesty, I never noticed that. ugh

^ permalink raw reply	[flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-14 9:12 ` Li Zefan [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2012-09-14 17:43 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Li Zefan @ 2012-09-14 9:12 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner >> >> 2. memcg's __DEPRECATED_clear_css_refs >> >> This is a remnant of another weird design decision of requiring >> synchronous draining of refcnts on cgroup removal and allowing >> subsystems to veto cgroup removal - what's the userspace supposed to >> do afterwards? Note that this also hinders co-mounting different >> controllers. >> >> The behavior could be useful for development and debugging but it >> unnecessarily interlocks userland visible behavior with in-kernel >> implementation details. To me, it seems outright wrong (either >> implement proper severing semantics in the controller or do full >> refcnting) and disallows, for example, lazy drain of caching refs. >> Also, it complicates the removal path with try / commit / revert >> logic which has never been fully correct since the beginning. >> >> Currently, the only left user is memcg. >> >> Solution: >> >> * Update memcg->pre_destroy() such that it never fails. >> >> * Drop __DEPRECATED_clear_css_refs and all related logic. >> Convert pre_destroy() to return void. >> >> Who: >> >> KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs >> trigger WARN sooner or later. Let's please get this settled. >> >> 3. cgroup_mutex usage outside cgroup core >> >> This is another thing which is simply broken. 
Given the way cgroup >> is structured and used, nesting cgroup_mutex inside any other >> commonly used lock simply doesn't work - it's held while invoking >> controller callbacks which then interact and synchronize with >> various core subsystems. >> >> There are currently three external cgroup_mutex users - cpuset, >> memcontrol and cgroup_freezer. >> >> Solution: >> >> Well, we should just stop doing it - use a separate nested lock >> (which seems possible for cgroup_freezer) or track and mange task >> in/egress some other way. >> >> Who: >> >> I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's >> familiar with the code base takes care of cpuset. Michal, can you >> please take care of memcg? >> > > I think this is a pressing problem, yes, but not the only problem with > cgroup lock. Even if we restrict its usage to cgroup core, we still can > call cgroup functions, which will lock. And then we gain nothing. > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist empty the tasks in it will be moved to an ancestor cgroup, which requires holding cgroup lock. We have to either change cpuset's behavior or eliminate the global lock. > And the problem is that people need to lock. cgroup_lock is needed > because the data you are accessing is protected by it. The way I see it, > it is incredible how we were able to revive the BKL in the form of > cgroup_lock after we finally manage to successfully get rid of it! > > We should just start to do a more fine grained locking of data, instead > of "stop the world, cgroup just started!". If we do that, the problem > you are trying to address here will even cease to exist. > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> @ 2012-09-14 11:22 ` Peter Zijlstra 2012-09-14 17:59 ` Tejun Heo 1 sibling, 0 replies; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 11:22 UTC (permalink / raw) To: Li Zefan Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Fri, 2012-09-14 at 17:12 +0800, Li Zefan wrote: > > I think this is a pressing problem, yes, but not the only problem with > > cgroup lock. Even if we restrict its usage to cgroup core, we still can > > call cgroup functions, which will lock. And then we gain nothing. > > > > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > empty the tasks in it will be moved to an ancestor cgroup, which requires > holding cgroup lock. We have to either change cpuset's behavior or eliminate > the global lock. PJ (the original cpuset author) has always been very conservative in changing cpuset semantics/behaviour. Its being used at the big HPC labs and those people simply don't like change. It also ties in with us having to preserve ABI, Linus says you can only do so if nobody notices -- if a tree falls in a forest and there's nobody to hear it, it really didn't fall at all. Which I guess means we're going to have to split locks :-) ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052F4FF.6070508-hv44wF8Li93QT0dZR+AlfA@public.gmane.org> 2012-09-14 11:22 ` Peter Zijlstra @ 2012-09-14 17:59 ` Tejun Heo [not found] ` <20120914175944.GF17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:59 UTC (permalink / raw) To: Li Zefan Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner Hello, On Fri, Sep 14, 2012 at 05:12:31PM +0800, Li Zefan wrote: > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > empty the tasks in it will be moved to an ancestor cgroup, which requires > holding cgroup lock. We have to either change cpuset's behavior or eliminate > the global lock. Does that have to happen synchronously? Can't we have a cgroup operation which asynchronously pushes all tasks in a cgroup to its parent from a work item? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914175944.GF17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 18:23 ` Peter Zijlstra 2012-09-14 18:33 ` Tejun Heo 0 siblings, 1 reply; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 18:23 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, 2012-09-14 at 10:59 -0700, Tejun Heo wrote: > Hello, > > On Fri, Sep 14, 2012 at 05:12:31PM +0800, Li Zefan wrote: > > Agreed. The biggest issue in cpuset is if hotplug makes a cpuset's cpulist > > empty the tasks in it will be moved to an ancestor cgroup, which requires > > holding cgroup lock. We have to either change cpuset's behavior or eliminate > > the global lock. > > Does that have to happen synchronously? Can't we have a cgroup > operation which asynchronously pushes all tasks in a cgroup to its > parent from a work item? Its hotplug, all hotplug stuff is synchronous, the last thing hotplug needs is the added complexity of async callbacks. Also pushing stuff out into worklets just to work around locking issues is vile. <handwave as I never can remember all the cgroup stuff/> Can't we play games by pinning both cgroups with a reference and playing games with threadgroup_change / task_lock for the individual tasks being moved about? ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs 2012-09-14 18:23 ` Peter Zijlstra @ 2012-09-14 18:33 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 08:23:41PM +0200, Peter Zijlstra wrote: > Its hotplug, all hotplug stuff is synchronous, the last thing hotplug > needs is the added complexity of async callbacks. Also pushing stuff out > into worklets just to work around locking issues is vile. I was asking whether it *has* to be part of synchronous CPU hotplug operation. IOW, do all tasks in the depleted cgroup have to be moved to its parent before CPU hotunplug can proceed to completion or is it okay to happen afterwards? Making the migration part asynchronous doesn't add much complexity. The only thing you have to make sure is flushing the previously scheduled one from the next CPU_UP_PREPARE. Also note that this can't easily be solved by splitting tree protecting inner lock from the outer lock. We're talking about doing full migration operations which likely require the outer one too. > <handwave as I never can remember all the cgroup stuff/> > > Can't we play games by pinning both cgroups with a reference and playing > games with threadgroup_change / task_lock for the individual tasks being > moved about? I'm lost. Can you please elaborate? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5052E7DF.7040000-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-14 9:12 ` Li Zefan @ 2012-09-14 17:43 ` Tejun Heo [not found] ` <20120914174329.GD17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:43 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Glauber. On Fri, Sep 14, 2012 at 12:16:31PM +0400, Glauber Costa wrote: > Can we please keep some key userspace guys CCd? Yeap, thanks for adding the ccs. > > 1. cpu and cpuacct ... > > Me, working on it. > I can work on it as well if you want. I dealt with it many times in > the past, and tried some different approaches, so I am familiar. But > if you're already doing it, be my guest... I'm trying something minimal which can serve as basis for the actual work. I think I figured it out mostly and will probably post it later today. Will squeak if I get stuck. > > I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's > > familiar with the code base takes care of cpuset. Michal, can you > > please take care of memcg? > > I think this is a pressing problem, yes, but not the only problem with > cgroup lock. Even if we restrict its usage to cgroup core, we still can > call cgroup functions, which will lock. And then we gain nothing. Can you be a bit more specific? > And the problem is that people need to lock. cgroup_lock is needed > because the data you are accessing is protected by it. The way I see it, > it is incredible how we were able to revive the BKL in the form of > cgroup_lock after we finally manage to successfully get rid of it! I wouldn't go as far as comparing it to BKL. 
> We should just start to do a more fine grained locking of data, instead > of "stop the world, cgroup just started!". If we do that, the problem > you are trying to address here will even cease to exist. > I'd much prefer keeping locking as simple and dumb as possible. Let's break it up only as absolutely necessary. > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > I am happy to help where manpower is needed, but I must note I am a bit > ignorant of block in general. I think blkcg can definitely make use of more manpower. ATM, there are two big things to do. * Fix hierarchy support. * Fix handling of writeback. Both are fairly big chunks of work. > > 6. Multiple hierarchies > > Do you realize this is the exact same thing I proposed in our last > round, and you keep screaming saying you wanted something else, right? > > The only difference is that the discussion at the time started by a > forced-comount patch, but that is not the core of the question. For what > you are proposing to make sense, the controllers need to be comounted, > and at some point we'll have to enforce it. Be it now or in the future. > But what to do when they are in fact comounted, I see no difference from > what you are saying, and what I said. Maybe I misunderstood you, or - more likely, since you're still talking about forced co-mounts - you're still misunderstanding. From what you told PeterZ, it seemed like you were thinking that this somehow will get rid of differing hierarchies depending on specific controllers and thus will help, for example, the optimization issues between cpu and cpuacct.
Going back to the above example,

     Unified tree             Controller Y's view
     controller X's view

          R                        R
         / \                      / \
        A   B                    A   B
       / \
      AA  AB

If a task is assigned to, or a resource is tagged with, AA, for controller X it'll map to AA and for controller Y to A, so we would still need css_set, which actually becomes the primary resource tag and may point to different subsystem states depending on the specific controller. If that is the direction we're headed, forcing co-mounts at this point doesn't make any sense. We'll make things which are possible today impossible for quite a while and then restore part of it, which is a terrible transition plan. What we need to do is nudge the current users away from practices which hinder implementation of the final form and then transition to it gradually. If you still don't understand, I don't know what more I can do to help. > > 7. Misc issues > > > > * Sort & unique when listing tasks. Even the documentation says it > > doesn't happen but we have a good hunk of code doing it in > > cgroup.c. I'm gonna rip it out at some point. Again, if you > > don't like it, scream. > > In all honesty, I never noticed that. ugh Yeah, tell me about it. :( Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914174329.GD17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 8:50 ` Glauber Costa [not found] ` <5056E467.2090108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:50 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Daniel P. Berrange, Lennart Poettering, Kay Sievers On 09/14/2012 09:43 PM, Tejun Heo wrote: > Hello, Glauber. > > On Fri, Sep 14, 2012 at 12:16:31PM +0400, Glauber Costa wrote: >> Can we please keep some key userspace guys CCd? > > Yeap, thanks for adding the ccs. > >>> 1. cpu and cpuacct > ... >>> Me, working on it. >> I can work on it as well if you want. I dealt with it many times in >> the past, and tried some different approaches, so I am familiar. But >> if you're already doing it, be my guest... > > I'm trying something minimal which can serve as basis for the actual > work. I think I figured it out mostly and will probably post it later > today. Will squeak if I get stuck. > >>> I'll do the cgroup_freezer. I'm hoping PeterZ or someone who's >>> familiar with the code base takes care of cpuset. Michal, can you >>> please take care of memcg? >> >> I think this is a pressing problem, yes, but not the only problem with >> cgroup lock. Even if we restrict its usage to cgroup core, we still can >> call cgroup functions, which will lock. And then we gain nothing. > > Can you be a bit more specific? > What I mean is that if some operation needs to operate locked, they will have to lock. Whether or not the locking is called from cgroup core or not. 
If the lock is not available outside, people will end up calling a core function that locks. >> And the problem is that people need to lock. cgroup_lock is needed >> because the data you are accessing is protected by it. The way I see it, >> it is incredible how we were able to revive the BKL in the form of >> cgroup_lock after we finally manage to successfully get rid of it! > > I wouldn't go as far as comparing it to BKL. > Of course not, since it is not system-wide. But I think the comparison still holds in spirit... >> Do you realize this is the exact same thing I proposed in our last >> round, and you keep screaming saying you wanted something else, right? >> >> The only difference is that the discussion at the time started by a >> forced-comount patch, but that is not the core of the question. For that >> you are proposing to make sense, the controllers need to be comounted, >> and at some point we'll have to enforce it. Be it now or in the future. >> But what to do when they are in fact comounted, I see no difference from >> what you are saying, and what I said. > > Maybe I misunderstood you or from still talking about forced co-mounts > more likely you're still misunderstanding. From what you told PeterZ, > it seemed like you were thinking that this somehow will get rid of > differing hierarchies depending on specific controllers and thus will > help, for example, the optimization issues between cpu and cpuacct. > Going back to the above example, > > Unified tree Controller Y's view > controller X's view > > R R > / \ / \ > A B A B > / \ > AA AB > > If a task assigned to or resourced tagged with AA, for controller X > it'll map to AA and for controller Y to A, so we would still need > css_set, which actually becomes the primary resource tag and may point > to different subsystem states depending on the specific controller. > > If that is the direction we're headed, forcing co-mounts at this point > doesn't make any sense. 
We'll make things which are possible today > impossible for quite a while and then restore part of it, which is a > terrible transition plan. What we need to do is nudging the current > users away from practices which hinder implementation of the final > form and then transition to it gradually. > > If you still don't understand, I don't know what more I can do to > help. > you seem to hear "comount", and think of unified vision, and that is the reason for this discussion to still be going on. Mounting is all about the root. And if you comount, hierarchies have the same root. In your example, the different controllers are comounted. They have not the same view, but the possible views are restricted to be a subset of the underlying tree - because they are mounted in the same place, forced or not. In a situation like this, it makes all the sense in the world to use the css_id as a primary identifier, because it will be guaranteed to be the same. What makes the tree overly flexible, is that you can have multiple roots, starting in multiple places, with arbitrary topologies downwards. If you still don't understand, I don't know what more I can do to help. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5056E467.2090108-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-17 17:21 ` Tejun Heo [not found] ` <20120917172123.GB18677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-17 17:21 UTC (permalink / raw) To: Glauber Costa Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Glauber. On Mon, Sep 17, 2012 at 12:50:47PM +0400, Glauber Costa wrote: > > Can you be a bit more specific? > > What I mean is that if some operation needs to operate locked, they will > have to lock. Whether or not the locking is called from cgroup core or > not. If the lock is not available outside, people will end up calling a > core function that locks. I was asking whether you have certain specific operations on mind. > >> And the problem is that people need to lock. cgroup_lock is needed > >> because the data you are accessing is protected by it. The way I see it, > >> it is incredible how we were able to revive the BKL in the form of > >> cgroup_lock after we finally manage to successfully get rid of it! > > > > I wouldn't go as far as comparing it to BKL. > > Of course not, since it is not system-wide. But I think the comparison > still holds in spirit... Subsystem-wide locks covering non-hot paths aren't evil things. We have a lot of them and they work fine. BKL was a completely different beast initially with implicit locking on kernel entry and unlocking on sleeping and then got morphed into some chimera inbetween afterwards. Simple locking is a good thing. If finer-grained locking is necessary, we sure do that but please stop throwing over-generalized half-arguments at it. It doesn't help anything. 
> you seem to hear "comount", and think of unified vision, and that is the > reason for this discussion to still be going on. Mounting is all about > the root. And if you comount, hierarchies have the same root. > > In your example, the different controllers are comounted. They do not have > the same view, but the possible views are restricted to be a subset of > the underlying tree - because they are mounted in the same place, forced > or not. Heh, I can't really tell whether you understand it or not. Here and in the previous thread too. You seem to understand that there are different views up to this point. > In a situation like this, it makes all the sense in the world to use the > css_id as a primary identifier, because it will be guaranteed to be the And then you say something like this (or that this would remove walking different hierarchies in the previous thread - yes, to a certain point but not completely). css_id is a per-css attribute. How can that be the "primary" identifier when there can be multiple views? For each userland-visible cgroup, there must be a css_set which points to the css's belonging to it, which may not be at the same level - multiple nodes in the userland visible tree may point to the same css. If you mean that css_id would be the primary identifier for that specific controller's css, why even say that? That's true now and won't ever change. > same. What makes the tree overly flexible, is that you can have multiple > roots, starting in multiple places, with arbitrary topologies downwards. And now you seem to be on the same page again. But then again, you're asserting that incorporating forced co-mounts *now* is a gradual step towards the goal, which is utterly bonkers. I don't know. I just can't understand what you're thinking at all. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120917172123.GB18677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-18 8:16 ` Glauber Costa 0 siblings, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-18 8:16 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On 09/17/2012 09:21 PM, Tejun Heo wrote: > Hello, Glauber. > > On Mon, Sep 17, 2012 at 12:50:47PM +0400, Glauber Costa wrote: >>> Can you be a bit more specific? >> >> What I mean is that if some operation needs to operate locked, they will >> have to lock. Whether or not the locking is called from cgroup core or >> not. If the lock is not available outside, people will end up calling a >> core function that locks. > > I was asking whether you have certain specific operations on mind. > >>>> And the problem is that people need to lock. cgroup_lock is needed >>>> because the data you are accessing is protected by it. The way I see it, >>>> it is incredible how we were able to revive the BKL in the form of >>>> cgroup_lock after we finally manage to successfully get rid of it! >>> >>> I wouldn't go as far as comparing it to BKL. >> >> Of course not, since it is not system-wide. But I think the comparison >> still holds in spirit... > > Subsystem-wide locks covering non-hot paths aren't evil things. We > have a lot of them and they work fine. BKL was a completely different > beast initially with implicit locking on kernel entry and unlocking on > sleeping and then got morphed into some chimera inbetween afterwards. > > Simple locking is a good thing. If finer-grained locking is > necessary, we sure do that but please stop throwing over-generalized > half-arguments at it. It doesn't help anything. 
> >> you seem to hear "comount", and think of unified vision, and that is the >> reason for this discussion to still be going on. Mounting is all about >> the root. And if you comount, hierarchies have the same root. >> >> In your example, the different controllers are comounted. They have not >> the same view, but the possible views are restricted to be a subset of >> the underlying tree - because they are mounted in the same place, forced >> or not. > > Heh, I can't really tell whether you understand it or not. Here and > in the previous thread too. You seem to understand that there are > different views upto this point. > >> In a situation like this, it makes all the sense in the world to use the >> css_id as a primary identifier, because it will be guaranteed to be the > > And then you say something like this (or that this would remove > walking different hierarchies in the previous thread - yes, to a > certain point but not completely). css_id is a per-css attribute. > How can that be the "primariy" identifier when there can be multiple > views? For each userland-visible cgroup, there must be a css_set > which points to the css's belonging to it, which may not be at the > same level - multiple nodes in the userland visible tree may point to > the same css. > > If you mean that css_id would be the primary identifier for that > specific controller's css, why even say that? That's true now and > won't ever change. > >> same. What makes the tree overly flexible, is that you can have multiple >> roots, starting in multiple places, with arbitrary topologies downwards. > > And now you seem to be on the same page again. But then again, you're > asserting that incorporating forced co-mounts *now* is a gradual step > towards the goal, which is utterly bonkers. I don't know. I just > can't understand what you're thinking at all. > > Thanks. > I will just stop, because i am not trying to convince you to do anything different than you are proposing now. 
I am just trying to convince you that what I have been saying has the exact same effect as this. So let us focus our energies on the actual work. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-14 8:16 ` Glauber Costa @ 2012-09-14 9:04 ` Mike Galbraith [not found] ` <1347613484.4340.132.camel-YqMYhexLQo31wTEvPJ5Q0F6hYfS7NtTn@public.gmane.org> 2012-09-14 9:10 ` Daniel P. Berrange ` (7 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Mike Galbraith @ 2012-09-14 9:04 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > 7. Misc issues > * Extract synchronize_rcu() from user interface? Exporting grace periods to userspace isn't wonderful for dynamic launchers. -Mike ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <1347613484.4340.132.camel-YqMYhexLQo31wTEvPJ5Q0F6hYfS7NtTn@public.gmane.org> @ 2012-09-14 17:17 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 17:17 UTC (permalink / raw) To: Mike Galbraith Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 11:04:44AM +0200, Mike Galbraith wrote: > On Thu, 2012-09-13 at 13:58 -0700, Tejun Heo wrote: > > > 7. Misc issues > > > * Extract synchronize_rcu() from user interface? Exporting grace > periods to userspace isn't wonderful for dynamic launchers. Aye aye. Also, * Update doc. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-14 8:16 ` Glauber Costa 2012-09-14 9:04 ` Mike Galbraith @ 2012-09-14 9:10 ` Daniel P. Berrange [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 14:25 ` Vivek Goyal ` (6 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Daniel P. Berrange @ 2012-09-14 9:10 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > 5. I CAN HAZ HIERARCHIES? > > The cpu ones handle nesting correctly - parent's accounting includes > children's, parent's configuration affects children's unless > explicitly overridden, and children's limits nest inside parent's. > > memcg asked itself the existential question of to be hierarchical or > not and then got confused and decided to become both. > > When faced with the same question, blkio and cgroup_freezer just > gave up and decided to allow nesting and then ignore it - brilliant. > > And there are others which kinda sorta try to handle hierarchy but > only goes way-half. > > This one is screwed up embarrassingly badly. We failed to establish > one of the most basic semantics and can't even define what a cgroup > hierarchy is - it depends on each controller and they're mostly > wacky! > > Fortunately, I don't think it will be prohibitively difficult to dig > ourselves out of this hole. > > Solution: > > * cpu ones seem fine. > > * For broken controllers, cgroup core will be generating warning > messages if the user tries to nest cgroups so that the user at > least can know that the behavior may change underneath them later > on. 
For more details, > > http://thread.gmane.org/gmane.linux.kernel/1356264/focus=3902 > > * memcg can be fully hierarchical but we need to phase out the flat > hierarchy support. Unfortunately, this involves flipping the > behavior for the existing users. Upstream will try to nudge users > with warning messages. Most burden would be on the distros and at > least SUSE seems to be on board with it. Needs coordination with > other distros. > > * blkio is the most problematic. It has two sub-controllers - cfq > and blk-throttle. Both are utterly broken in terms of hierarchy > support and the former is known to have pretty hairy code base. I > don't see any other way than just biting the bullet and fixing it. > > * cgroup_freezer and others shouldn't be too difficult to fix. > > Who: > > memcg can be handled by memcg people and I can handle cgroup_freezer > and others with help from the authors. The problematic one is > blkio. If anyone is interested in working on blkio, please be my > guest. Vivek? Glauber? > > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. 
> > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy.
>
>        R
>       / \
>      A   B
>     / \
>    AA  AB
>
> All controllers are co-mounted. There is a per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as,
>
>        R
>       / \
>      A   B
>
> While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or a per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > > I *think* this can be achieved by changing where css_set is bound. > Currently, a css_set is (conceptually) owned by a task. After the > change, a cgroup in the unified hierarchy has its own css_set which > tasks point to and can also be used to tag resources as necessary. > This way, it should be achievable without introducing a lot of new > code or affecting individual controllers too much. > > The headache will be the transition period where we'll probably have > to support both modes of operation. Oh well.... > > Who: > > Li, Glauber and me, I guess? FWIW, from the POV of libvirt and its KVM/LXC drivers, I think that co-mounting all controllers is just fine. In our usage model we always want to have exactly the same hierarchy for all of them. It rather complicates life to have to deal with multiple hierarchies, so I'd be happy if they went away. libvirtd will always create its own cgroups starting at the location where libvirtd itself has been placed. This is to co-operate with systemd / initscripts which may place each system service in a dedicated group.
Thus historically we usually end up in a layout:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt
           |
           +- lxc
           |   |
           |   +- container1
           |   +- container2
           |   +- container3
           |   ...
           +- qemu
               |
               +- machine1
               +- machine2
               +- machine3
               ...

Now we know that many controllers don't respect this hierarchy and will flatten it, so all those leaf nodes (container1, container2, machine1, machine2, etc.) are immediately at the root level. While this is clearly sub-optimal, for our current needs it does not actually harm us. While we did intend that a sysadmin could place controls on the 'libvirt', 'lxc' or 'qemu' cgroups, I'm not aware of anyone who actually does this currently. Everyone, so far, only cares about placing controls on individual virtual machines and containers.

Thus, given what we now know about the performance problems wrt hierarchies, we're planning to flatten that significantly to look closer to this:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt-lxc-container1
       +- libvirt-lxc-container2
       +- libvirt-lxc-container3
       +- libvirt-lxc-...
       +- libvirt-qemu-machine1
       +- libvirt-qemu-machine2
       +- libvirt-qemu-machine3
       +- libvirt-qemu-...

(though we'll have a config option to retain the old-style hierarchy too, for backwards compatibility)

Also bear in mind that with containers, the processes inside the containers may want to use cgroups too, e.g. if running systemd inside a container:

  $CG_MOUNT_ROOT
   |
   +- apache.service
   +- mysql.service
   +- sendmail.service
   +- ....service
   +- libvirtd.service  (if systemd has put us in an isolated group)
       |
       +- libvirt-lxc-container1
       |   |
       |   +- apache.service
       |   +- mysql.service
       |   +- sendmail.service
       |   ...
       +- libvirt-lxc-container2
       +- libvirt-lxc-container3
       +- libvirt-lxc-...
       +- libvirt-qemu-machine1
       +- libvirt-qemu-machine2
       +- libvirt-qemu-machine3
       +- libvirt-qemu-...

Or if each user login session has been given a cgroup and we are running libvirtd as a non-root user, we can end up with something like this:

  $CG_MOUNT_ROOT
   |
   +- fred.user
   +- joe.user
   +- bob.user
       |
       +- libvirtd.service  (if systemd has put us in an isolated group)
           |
           +- libvirt-qemu-machine1
           +- libvirt-qemu-machine2
           +- libvirt-qemu-machine3
           +- libvirt-qemu-...

In essence, what I'm saying is that I'm fine with co-mounting. What we care about is being able to create the kinds of hierarchies outlined above, and have all controllers actually work sensibly with them.

The systemd & libvirt folks came up with the following recommendations to try to get good co-operation between different user space apps that want to use cgroups. Basically, the idea is that if each app follows the guidelines, then no individual app needs a global view of all cgroups.

http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups

I think everything you describe is compatible with what we've documented there.

Regards,
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

^ permalink raw reply [flat|nested] 75+ messages in thread
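The placement rule Daniel describes — start from wherever libvirtd itself has been placed and append per-driver names — can be sketched as a path computation. This is a toy illustration of the naming schemes above, not libvirt's actual code; `machine_cgroup` and its arguments are hypothetical:

```python
def machine_cgroup(own_cgroup, driver, machine, flat=True):
    """Compute the cgroup path for a VM/container relative to libvirtd's
    own placement (e.g. '/system/libvirtd.service' under systemd).

    flat=True gives the new single-level 'libvirt-<driver>-<name>' naming;
    flat=False gives the historical libvirt/<driver>/<name> nesting."""
    if flat:
        return f"{own_cgroup}/libvirt-{driver}-{machine}"
    return f"{own_cgroup}/libvirt/{driver}/{machine}"

print(machine_cgroup("/system/libvirtd.service", "qemu", "machine1"))
# -> /system/libvirtd.service/libvirt-qemu-machine1
print(machine_cgroup("/system/libvirtd.service", "lxc", "container1", flat=False))
# -> /system/libvirtd.service/libvirt/lxc/container1
```

The point of anchoring at `own_cgroup` rather than the mount root is exactly the co-operation Daniel mentions: libvirt never has to know or care how systemd arranged everything above it.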
* Re: [RFC] cgroup TODOs [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 9:08 ` Glauber Costa 2012-09-14 13:58 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-14 9:08 UTC (permalink / raw) To: Daniel P. Berrange Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, Serge E. Hallyn, Paul Turner On 09/14/2012 01:10 PM, Daniel P. Berrange wrote: > libvirtd will always create its own cgroups starting at the location > where libvirtd itself has been placed. This is to co-operate with > systemd / initscripts which may place each system service in a > dedicated group This is more or less what I am doing now for OpenVZ as well. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914091032.GA6819-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 9:08 ` Glauber Costa @ 2012-09-14 13:58 ` Vivek Goyal [not found] ` <20120914135830.GB6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 13:58 UTC (permalink / raw) To: Daniel P. Berrange Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner On Fri, Sep 14, 2012 at 10:10:32AM +0100, Daniel P. Berrange wrote: [..] > > 6. Multiple hierarchies > > > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > > that enough the last time[1]), there's a basic problem when more > > than one controllers interact - it's impossible to define a resource > > group when more than two controllers are involved because the > > intersection of different controllers is only defined in terms of > > tasks. > > > > IOW, if an entity X is of interest to two controllers, there's no > > way to map X to the cgroups of the two controllers. X may belong to > > A and B when viewed by one task but A' and B when viewed by another. > > This already is a head scratcher in writeback where blkcg and memcg > > have to interact. > > > > While I am pushing for unified hierarchy, I think it's necessary to > > have different levels of granularities depending on controllers > > given that nesting involves significant overhead and noticeable > > controller-dependent behavior changes. > > > > Solution: > > > > I think a unified hierarchy with the ability to ignore subtrees > > depending on controllers should work. For example, let's assume the > > following hierarchy. > > > > R > > / \ > > A B > > / \ > > AA AB > > > > All controllers are co-mounted. 
There is a per-cgroup knob which
> > controls which controllers nest beyond it. If blkio doesn't want to
> > distinguish AA and AB, the user can specify that blkio doesn't nest
> > beyond A, and blkio would see the tree as
> >
> >         R
> >        / \
> >       A   B
> >
> > while other controllers keep seeing the original tree. The exact
> > form of the interface I don't know yet. It could be a single file
> > into which the user echoes [-]controller names, or a per-controller
> > boolean file.
> >
> > I think this level of flexibility should be enough for most use
> > cases. If someone disagrees, please voice your objections now.

Tejun, Daniel,

I am a little concerned about the above and wondering how systemd and libvirt will interact and behave out of the box.

Currently systemd does not create its own hierarchy under blkio and libvirt does. So putting it all together means there is no way to avoid the overhead of the systemd-created hierarchy.

  \
   |
   +- system
       |
       +- libvirtd.service
           |
           +- virt-machine1
           +- virt-machine2

So there is no way to avoid the overhead of the two levels of hierarchy created by systemd. I really wish that systemd got rid of the "system" cgroup and put services directly in the top-level group. Creating deeper hierarchies is expensive.

I just want to mention clearly that with the above model it will not be possible for libvirt to avoid hierarchy levels created by systemd. So the solution would be to keep the depth of the hierarchy as low as possible and to keep controller overhead as low as possible.

Now I know that with blkio, idling kills performance. So one solution could be that on anything fast, don't use CFQ. Use deadline, and then the group idling overhead goes away, and tools like systemd and libvirt don't have to worry about keeping track of disks and what scheduler is running. They don't want to do it; they expect the kernel to get it right. But getting that right out of the box does not happen as of today, as CFQ is the default on everything.
Distributions can carry their own patches to do some approximation, but it would be better to have a mechanism in the kernel to select a better IO scheduler out of the box for a storage LUN. It is more important now than ever, since the blkio controller has come into the picture.

The above is the scenario I am most worried about: CFQ shows up by default on all the LUNs, systemd and libvirt create 4-5 level deep hierarchies by default, and IO performance sucks out of the box. CFQ already underperforms on fast storage, and with group creation the problem becomes worse.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914135830.GB6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:29 ` Tejun Heo [not found] ` <20120914192935.GO17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:29 UTC (permalink / raw) To: Vivek Goyal Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Kay Sievers, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner

Hello,

(cc'ing Lennart and Kay)

On Fri, Sep 14, 2012 at 09:58:30AM -0400, Vivek Goyal wrote:
> I am a little concerned about the above and wondering how systemd and
> libvirt will interact and behave out of the box.
>
> Currently systemd does not create its own hierarchy under blkio and
> libvirt does. So putting it all together means there is no way to avoid
> the overhead of the systemd-created hierarchy.
>
>   \
>    |
>    +- system
>        |
>        +- libvirtd.service
>            |
>            +- virt-machine1
>            +- virt-machine2
>
> So there is no way to avoid the overhead of the two levels of hierarchy
> created by systemd. I really wish that systemd got rid of the "system"
> cgroup and put services directly in the top-level group. Creating deeper
> hierarchies is expensive.
>
> I just want to mention clearly that with the above model it will not
> be possible for libvirt to avoid hierarchy levels created by systemd.
> So the solution would be to keep the depth of the hierarchy as low as
> possible and to keep controller overhead as low as possible.

Yes, if we do a full unified hierarchy, nesting should happen iff resource control actually requires the nesting, so that tree depth is kept minimal. Nesting shouldn't be used purely for organizational purposes.

> Now I know that with blkio, idling kills performance. So one solution
> could be that on anything fast, don't use CFQ.
Use deadline and then > group idling overhead goes away and tools like systemd and libvirt don't > have to worry about keeping track of disks and what scheduler is running. > They don't want to do it and expect kernel to get it right. I personally don't think the level of complexity we have in cfq is something useful for the SSDs which are getting ever better. cfq is allowed to use a lot of processing overhead and complexity because disks are *so* slow. The balance already has completely changed with SSDs and we should be doing something a lot simpler most likely based on iops for them - be it deadline or whatever. blkcg support is currently tied to cfq-iosched which sucks but I think that could be the only way to achieve any kind of acceptable blkcg support for rotating disks. I think what we should do is abstract out the common organization part as much as possible so that we don't end up duplicating everything for blk-throttle, cfq and, say, deadline. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
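Tejun's suggestion of "something a lot simpler most likely based on iops" can be illustrated with a toy weight-proportional dispatch loop: charge each IO 1/weight of virtual time and always serve the group with the smallest virtual time. This is a sketch of the general technique, not kernel code, and the group names and weights are made up:

```python
import heapq
from fractions import Fraction

def dispatch(weights, n):
    """Dispatch n IOs among always-backlogged groups in proportion to
    their weights.  Each served IO advances the group's virtual time by
    1/weight (exact fractions, so ties stay deterministic); the group
    with the smallest vtime is served next."""
    heap = [(Fraction(0), name) for name in sorted(weights)]
    heapq.heapify(heap)
    counts = dict.fromkeys(weights, 0)
    for _ in range(n):
        vtime, name = heapq.heappop(heap)
        counts[name] += 1
        heapq.heappush(heap, (vtime + Fraction(1, weights[name]), name))
    return counts

print(dispatch({"A": 500, "B": 250, "C": 250}, 1000))
# -> {'A': 500, 'B': 250, 'C': 250}
```

The appeal for SSDs is that this per-IO accounting needs no idling at all, which is exactly the overhead that makes CFQ's group scheduling expensive on fast devices.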
* Re: [RFC] cgroup TODOs [not found] ` <20120914192935.GO17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 21:51 ` Kay Sievers 0 siblings, 0 replies; 75+ messages in thread From: Kay Sievers @ 2012-09-14 21:51 UTC (permalink / raw) To: Tejun Heo Cc: Lennart Poettering, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Vivek Goyal On Fri, Sep 14, 2012 at 9:29 PM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote: > On Fri, Sep 14, 2012 at 09:58:30AM -0400, Vivek Goyal wrote: >> I am little concerned about above and wondering how systemd and libvirt >> will interact and behave out of the box. >> >> Currently systemd does not create its own hierarchy under blkio and >> libvirt does. So putting all together means there is no way to avoid >> the overhead of systemd created hierarchy. >> >> \ >> | >> +- system >> | >> +- libvirtd.service >> | >> +- virt-machine1 >> +- virt-machine2 >> >> So there is now way to avoid the overhead of two levels of hierarchy >> created by systemd. I really wish that systemd gets rid of "system" >> cgroup and puts services directly in top level group. Creating deeper >> hieararchices is expensive. The idea here is to split equally between the "system" and the "user"s at that level. That all can be re-considered and changed if really needed, but it's not an unintentionally created directory. Thanks, Kay ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (2 preceding siblings ...) 2012-09-14 9:10 ` Daniel P. Berrange @ 2012-09-14 14:25 ` Vivek Goyal [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 15:03 ` Michal Hocko ` (5 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 14:25 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar

On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote:

[..]
> * blkio is the most problematic. It has two sub-controllers - cfq
>   and blk-throttle. Both are utterly broken in terms of hierarchy
>   support and the former is known to have a pretty hairy code base. I
>   don't see any other way than just biting the bullet and fixing it.

I am still a little concerned about changing the blkio behavior unexpectedly. Can we have some kind of mount-time flag which retains the old flat behavior, while we warn the user that this mode is deprecated and will soon be removed and that they should move over to hierarchical mode? Then after a few releases we can drop the flag and clean up any extra code which supports flat mode in CFQ. This will at least make the transition smooth.

> * cgroup_freezer and others shouldn't be too difficult to fix.
>
> Who:
>
>   memcg can be handled by memcg people and I can handle cgroup_freezer
>   and others with help from the authors. The problematic one is
>   blkio. If anyone is interested in working on blkio, please be my
>   guest. Vivek? Glauber?

I will try to spend some time on this. Doing changes in blk-throttle should be relatively easy. The painful part is CFQ. It does so much that it is not clear whether a particular change will bite us badly or not.
So making changes becomes hard. There are heuristics, preemptions, queue selection logic, and service trees, and bringing it all together for full hierarchy support becomes interesting.

I think the first thing which needs to be done is to merge group scheduling and cfqq scheduling. Because of the flat hierarchy we currently use two scheduling algorithms: the old logic for queue selection and new logic for group scheduling. If we treat tasks and groups at the same level, then we have to merge the two and come up with a single algorithm.

Glauber, feel free to jump into it if you like. We can sort it out together.

[..]
> * Vivek brought up the issue of distributing resources to tasks and
>   groups in the same cgroup. I don't know. Need to think more
>   about it.

This one will require some thought. I have heard arguments for both models.

Treating tasks and groups at the same level seems to have one disadvantage: people can't think of system resources in terms of percentages. People often say, give 20% of disk resources to a particular cgroup. But that is not possible, as there are kernel threads running in the root cgroup, and tasks come and go, which means the % share of a group is variable, not fixed.

To make it fixed, we would need to make sure that the number of entities fighting for resources is not variable. That means only groups fight for resources at a given level, and tasks fight within groups. Now the question is whether the kernel should enforce this or whether it should be left to user space. I think doing it in user space is also messy, as different agents control different parts of the hierarchy. For example, if somebody says give a particular virtual machine x% of system resources, libvirt has no way to do that. At most it can ensure x% of the parent group, but above that the hierarchy is controlled by systemd and libvirtd has no control over it. The only possible way to do this seems to be for systemd to create the libvirt group at the top level with a minimum fixed % quota, and then libvirt can figure out the % share of each virtual machine.
But that is hard to do. So while the % model is more intuitive to users, it is hard to implement. An easier way is to stick to the model of relative weights/shares and let the user specify the relative importance of a virtual machine; the actual quota or % will then vary dynamically depending on other tasks/components in the system.

Thoughts?

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
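The "varies dynamically" behavior Vivek describes is just weight arithmetic: an entity's effective share is its weight over the sum of all weights competing at its level, so it moves every time a peer appears or disappears. A toy illustration (not kernel code; the 1024 default weight mirrors cpu.shares):

```python
def effective_share(weight, peer_weights):
    """Effective % of a contended resource for one entity, given the
    weights of all peers competing at the same level of the hierarchy."""
    return 100.0 * weight / (weight + sum(peer_weights))

# A VM with weight 1024 alone at its level gets everything...
print(effective_share(1024, []))          # 100.0
# ...but its share drops as siblings with the default weight appear.
print(effective_share(1024, [1024]))      # 50.0
print(effective_share(1024, [1024] * 3))  # 25.0
```

This is exactly why a fixed "20% of the disk" guarantee can't be expressed with weights alone: the denominator changes under you.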
* Re: [RFC] cgroup TODOs [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 14:53 ` Peter Zijlstra 2012-09-14 15:14 ` Vivek Goyal 2012-09-14 21:39 ` Tejun Heo 1 sibling, 1 reply; 75+ messages in thread From: Peter Zijlstra @ 2012-09-14 14:53 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, 2012-09-14 at 10:25 -0400, Vivek Goyal wrote: > So while % model is more intutive to users, it is hard to implement. I don't agree with that. The fixed quota thing is counter-intuitive and hard to use. It begets you questions like: why, if everything is idle except my task, am I not getting the full throughput. It also makes adding entities harder because you're constrained to 100%. This means you have to start each new cgroup with 0% because any !0 value will eventually get you over 100%, it also means you have to do some form of admission control to make sure you never get over that 100%. Starting with 0% is not convenient for people.. they think this is the wrong default, even though as argued above, it is the only possible value. > So > an easier way is to stick to the model of relative weights/share and > let user specify relative importance of a virtual machine and actual > quota or % will vary dynamically depending on other tasks/components > in the system. > > Thoughts? cpu does the relative weight, so 'users' will have to deal with it anyway regardless of blk, its effectively free of learning curve for all subsequent controllers. Now cpu also has an optional upper limit. But its optional for those people who do want it (also its expensive). 
For RT we must use a fixed quota, since variable service completely defeats determinism. RT is 'special' and hard to use anyway, so making it harder is fine. ^ permalink raw reply [flat|nested] 75+ messages in thread
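Peter's objection to absolute quotas can be made concrete: once shares are fixed percentages, every new group needs an admission check against the remaining headroom, and the only always-safe default for a new group is 0%. A hypothetical sketch (names and numbers are illustrative, not any real interface):

```python
def admit(quotas, name, requested_pct):
    """Admission control for absolute quotas: admit the new group only
    if existing reservations leave enough headroom under 100%."""
    if sum(quotas.values()) + requested_pct > 100:
        return False  # over-commit: must reject or renegotiate
    quotas[name] = requested_pct
    return True

quotas = {"system": 40, "vm1": 30, "vm2": 20}
print(admit(quotas, "vm3", 20))  # False: only 10% headroom is left
print(admit(quotas, "vm3", 10))  # True
print(admit(quotas, "vm4", 0))   # True: 0% is the only always-safe default
```

Relative weights avoid all of this: any weight can be added at any time because shares renormalize automatically, which is the design choice cpu made.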
* Re: [RFC] cgroup TODOs 2012-09-14 14:53 ` Peter Zijlstra @ 2012-09-14 15:14 ` Vivek Goyal [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 15:14 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 04:53:29PM +0200, Peter Zijlstra wrote: > On Fri, 2012-09-14 at 10:25 -0400, Vivek Goyal wrote: > > So while % model is more intutive to users, it is hard to implement. > > I don't agree with that. The fixed quota thing is counter-intuitive and > hard to use. It begets you questions like: why, if everything is idle > except my task, am I not getting the full throughput. Actually by fixed quota I meant minimum fixed %. So if other groups are idle, this group still gets to use 100% bandwidth. When resources are highly contended, this group gets its minimum fixed %. > > It also makes adding entities harder because you're constrained to 100%. > This means you have to start each new cgroup with 0% because any !0 > value will eventually get you over 100%, it also means you have to do > some form of admission control to make sure you never get over that > 100%. > > Starting with 0% is not convenient for people.. they think this is the > wrong default, even though as argued above, it is the only possible > value. We don't have to start with 0%. We can keep a pool with dynamic % and launch all the virtual machines from that single pool. So nobody starts with 0%. If we require certain % for a machine, only then we look at peers and see if we have bandwidth free and create cgroup and move virtual machine there, otherwise we deny resources. 
So I think it is doable; it's just painful and tricky, and I think a lot of it will be in user space.

> > So an easier way is to stick to the model of relative weights/share and
> > let the user specify the relative importance of a virtual machine, and the
> > actual quota or % will vary dynamically depending on other
> > tasks/components in the system.
> >
> > Thoughts?
>
> cpu does the relative weight, so 'users' will have to deal with it
> anyway regardless of blk, its effectively free of learning curve for all
> subsequent controllers.

I am inclined to keep it simple in the kernel and just follow the cpu model of relative weights, treating tasks and groups at the same level in the hierarchy. It makes behavior consistent across the controllers, and I think it might just work for the majority of cases.

Those who really need to implement the % model will have to do the heavy lifting in user space. I am skeptical that will take off, but the kernel does not prohibit somebody from creating a group, moving all tasks there, and making sure tasks and groups are not at the same level, so that the % becomes more predictable. It's just that that's not the default from the kernel.

So yes, doing it the cpu controller way in the block controller should be reasonable.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 21:57 ` Tejun Heo [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 8:55 ` Glauber Costa 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 21:57 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek, Peter. On Fri, Sep 14, 2012 at 11:14:47AM -0400, Vivek Goyal wrote: > We don't have to start with 0%. We can keep a pool with dynamic % and > launch all the virtual machines from that single pool. So nobody starts > with 0%. If we require certain % for a machine, only then we look at > peers and see if we have bandwidth free and create cgroup and move virtual > machine there, otherwise we deny resources. > > So I think it is doable just that it is painful and tricky and I think > lot of it will be in user space. I think the system-wide % thing is rather distracting for the discussion at hand (and I don't think being able to specify X% of the whole system when you're three level down the resource hierarchy makes sense anyway). Let's focus on tasks vs. groups. > > > So > > > an easier way is to stick to the model of relative weights/share and > > > let user specify relative importance of a virtual machine and actual > > > quota or % will vary dynamically depending on other tasks/components > > > in the system. > > > > > > Thoughts? > > > > cpu does the relative weight, so 'users' will have to deal with it > > anyway regardless of blk, its effectively free of learning curve for all > > subsequent controllers. 
> I am inclined to keep it simple in the kernel and just follow the cpu model of
> relative weights, treating tasks and groups at the same level in the
> hierarchy. It makes behavior consistent across the controllers and I
> think it might just work for the majority of cases.

I think we need to stick to one model for all controllers; otherwise, it gets confusing and unified hierarchy can't work. That said, I'm not too happy about how cpu is handling it now.

* As I wrote before, the configuration escapes cgroup proper, and the
  mapping from a per-task value to a group weight is essentially
  arbitrary and may not exist depending on the resource type.

* The proportion of each group fluctuates as tasks fork and exit in
  the parent group, which is confusing.

* cpu deals with tasks, but blkcg deals with iocontexts, and memcg,
  which currently doesn't implement proportional control, deals with
  address spaces (processes). The proportions wouldn't even fluctuate
  the same way across different controllers.

So, I really don't think the current model used by cpu is a good one, and we rather should treat the tasks as a group competing with the rest of the child groups. Whether we can change that at this point, I don't know. Peter, what do you think?

Thanks.

--
tejun

^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 15:27 ` Vivek Goyal 2012-09-18 18:08 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 15:27 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:57:01PM -0700, Tejun Heo wrote: [..] > > > cpu does the relative weight, so 'users' will have to deal with it > > > anyway regardless of blk, its effectively free of learning curve for all > > > subsequent controllers. > > > > I am inclined to keep it simple in kernel and just follow cpu model of > > relative weights and treating tasks and gropu at same level in the > > hierarchy. It makes behavior consistent across the controllers and I > > think it might just work for majority of cases. > > I think we need to stick to one model for all controllers; otherwise, > it gets confusing and unified hierarchy can't work. That said, I'm > not too happy about how cpu is handling it now. > > * As I wrote before, the configuration esacpes cgroup proper and the > mapping from per-task value to group weight is essentially > arbitrary and may not exist depending on the resource type. If need be, one can create task priority type for those resources too. Or one could even think of being able to directly specify weigths (same thing as groups) for tasks. That should be doable if people think if that kind of interface helps. > > * The proportion of each group fluctuates as tasks fork and exit in > the parent group, which is confusing. Agreed with that. But some people are just happy with varying percentage and don't care about fixed percentage. 
In fact, current deployments of systemd and libvirt don't care about a fixed percentage. They are just happy providing relative priority to things and ensuring some kind of basic isolation.

> * cpu deals with tasks, but blkcg deals with iocontexts, and memcg,
>   which currently doesn't implement proportional control, deals with
>   address spaces (processes). The proportions wouldn't even fluctuate
>   the same way across different controllers.
>
> So, I really don't think the current model used by cpu is a good one,
> and we rather should treat the tasks as a group competing with the
> rest of the child groups. Whether we can change that at this point, I
> don't know. Peter, what do you think?

I am not convinced that by default the kernel should enforce that all the tasks of a group are accounted to a hidden group. People have use cases where they are happy with the currently offered semantics.

I think the auto scheduler group is another example where the system is well protected from workloads like "make -j64". Even in the hidden-group case the system will be protected, but the % share of that group will be much higher (up to 50%).

So IMHO, if users really care about tasks and groups not competing at the same level, they should create the hierarchy that way; the kernel should not enforce it.

Thanks
Vivek

^ permalink raw reply [flat|nested] 75+ messages in thread
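Vivek's "up to 50%" figure falls straight out of the weight arithmetic. With tasks and groups as peers, each of the 64 compiler tasks competes individually; folded into one implicit group, they collectively get at most a single group's share. A toy calculation assuming equal default weights for every entity (illustrative only):

```python
def share_of_tasks(n_tasks, n_peer_groups):
    """% share n equally-weighted tasks collectively get when each task
    competes individually against n_peer_groups sibling groups."""
    return 100.0 * n_tasks / (n_tasks + n_peer_groups)

def share_as_hidden_group(n_peer_groups):
    """% share the same tasks get when folded into one implicit group
    competing against the sibling groups."""
    return 100.0 / (1 + n_peer_groups)

# "make -j64" next to a single sibling cgroup:
print(share_of_tasks(64, 1))     # ~98.5: the tasks swamp the sibling
print(share_as_hidden_group(1))  # 50.0: the "up to 50%" case
```

So the two models differ most exactly in the many-tasks case, which is why the tasks-vs-groups question matters for out-of-the-box isolation.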
* Re: [RFC] cgroup TODOs [not found] ` <20120914215701.GW17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 15:27 ` Vivek Goyal @ 2012-09-18 18:08 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-18 18:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:57:01PM -0700, Tejun Heo wrote: [..] > I think we need to stick to one model for all controllers; otherwise, > it gets confusing and unified hierarchy can't work. That said, I'm > not too happy about how cpu is handling it now. > > * As I wrote before, the configuration esacpes cgroup proper and the > mapping from per-task value to group weight is essentially > arbitrary and may not exist depending on the resource type. > > * The proportion of each group fluctuates as tasks fork and exit in > the parent group, which is confusing. > > * cpu deals with tasks but blkcg deals with iocontexts and memcg, > which currently doesn't implement proportional control, deals with > address spaces (processes). The proportions wouldn't even fluctuate > the same way across different controllers. > > So, I really don't think the current model used by cpu is a good one > and we rather should treat the tasks as a group competing with the > rest of child groups. Whether we can change that at this point, I > don't know. Peter, what do you think? Peter, do you have thoughts on this? I vaguely remember that similar discussion had happened for cpu controller. We first need to settle this debate of treating tasks at same level as groups before further design points can be discussed. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914151447.GD6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 21:57 ` Tejun Heo @ 2012-09-17 8:55 ` Glauber Costa 1 sibling, 0 replies; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:55 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On 09/14/2012 07:14 PM, Vivek Goyal wrote: > Those who really need to implement % model, they will have to do heavy > lifting in user space. I am skeptical that will take off but kernel > does not prohibit from somebody creating a group, moving all tasks > there and making sure tasks and groups are not at same level hence > % becomes more predictable. Just that, that's not the default from > kernel. I subscribe to that. I use a % model for memory / kernel memory (give the kernel 20% of userspace memory), but the kernel never knows about it. It only understands megabytes. Of course this is simpler, because it is all inside the same cgroup. But if you want global %'s you need to calculate it from everybody *anyway*, be it in the kernel or in userspace. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914142539.GC6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 14:53 ` Peter Zijlstra @ 2012-09-14 21:39 ` Tejun Heo [not found] ` <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 21:39 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek. On Fri, Sep 14, 2012 at 10:25:39AM -0400, Vivek Goyal wrote: > On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > > [..] > > * blkio is the most problematic. It has two sub-controllers - cfq > > and blk-throttle. Both are utterly broken in terms of hierarchy > > support and the former is known to have pretty hairy code base. I > > don't see any other way than just biting the bullet and fixing it. > > I am still a little concerned about changing the blkio behavior > unexpectedly. Can we have some kind of mount time flag which retains > the old flat behavior and we warn the user that this mode is deprecated > and will soon be removed. Move over to hierarchical mode. Then after > a few releases we can drop the flag and clean up any extra code which > supports flat mode in CFQ. This will at least make the transition smooth. I don't know. That essentially is what we're doing with memcg now and it doesn't seem any less messy. Given the already scary complexity, do we really want to support both flat and hierarchy models at the same time? > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > I will try to spend some time on this.
Doing changes in blk-throttle > should be relatively easy. The painful part is CFQ. It does so much that > it is not clear whether a particular change will bite us badly or > not. So doing changes becomes hard. There are heuristics, preemptions, > queue selection logic, service trees, and bringing it all together > for full hierarchy becomes interesting. > > I think the first thing which needs to be done is merge group scheduling > and cfqq scheduling. Because of the flat hierarchy we currently use two > scheduling algorithms. Old logic for queue selection and new logic > for group scheduling. If we treat tasks and groups at the same level then > we have to merge the two and come up with a single logic. I think this depends on how we decide to handle tasks vs. groups, right? > [..] > > * Vivek brought up the issue of distributing resources to tasks and > > groups in the same cgroup. I don't know. Need to think more > > about it. > > This one will require some thought. I have heard arguments for both the > models. Treating tasks and groups at the same level seems to have one > disadvantage and that is that people can't think of system resources > in terms of %. People often say, give 20% of disk resources to a > particular cgroup. But it is not possible as there are all the kernel > threads running in the root cgroup and tasks come and go and that means > the % share of a group is variable and not fixed. Another problem is that configuration isn't contained in cgroup proper. We need a way to assign weights to individual tasks which can be somehow directly compared against group weights. cpu cooks priority for this and blkcg may be able to cook ioprio but it's nasty and unobvious. Also, let's say we grow a network bandwidth controller for whatever reason. What value are we gonna use? > To make it fixed, we will need to make sure that the number of entities > fighting for resources is not variable. That means only groups fight > for resources at a level and tasks within groups.
> > Now the question is should kernel enforce it or should it be left to > user space. I think doing it in user space is also messy as different > agents control different parts of the hierarchy. For example, if somebody says > that give a particular virtual machine a x% of system resource, libvirt > has no way to do that. At max it can ensure x% of parent group but above > that hierarchy is controlled by systemd and libvirtd has no control > over that. > > Only possible way to do this will seem to be that systemd creates libvirt > group at top level with a minimum fixed % of quota and then libvirt can > figure out % share of each virtual machine. But it is hard to do. > > So while % model is more intuitive to users, it is hard to implement. So > an easier way is to stick to the model of relative weights/share and > let user specify relative importance of a virtual machine and actual > quota or % will vary dynamically depending on other tasks/components > in the system. Why is it hard to implement? You just need to treat tasks in the current node as another group competing with other cgroups on equal terms. If anything, isn't that simpler than treating scheduling "entities"? Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120914213938.GV17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 15:05 ` Vivek Goyal [not found] ` <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 15:05 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 02:39:38PM -0700, Tejun Heo wrote: [..] > > I am still little concerned about changing the blkio behavior > > unexpectedly. Can we have some kind of mount time flag which retains > > the old flat behavior and we warn user that this mode is deprecated > > and will soon be removed. Move over to hierarchical mode. Then after > > few release we can drop the flag and cleanup any extra code which > > supports flat mode in CFQ. This will atleast make transition smooth. > > I don't know. That essentially is what we're doing with memcg now and > it doesn't seem any less messy. Given the already scary complexity, > do we really want to support both flat and hierarchy models at the > same time? As a developer, I will be happy to support only one model and keep code simple. I am only concerned that for blkcg we have still not charted out a clear migration path. The warning message your patch is giving out will work only if we decide to not treat task and groups at same level. I guess first we need to decide task vs groups issue and then look into this issue again. > > > > memcg can be handled by memcg people and I can handle cgroup_freezer > > > and others with help from the authors. The problematic one is > > > blkio. If anyone is interested in working on blkio, please be my > > > guest. Vivek? Glauber? 
> > > > I will try to spend some time on this. Doing changes in blk-throttle > > should be relatively easy. The painful part is CFQ. It does so much that > > it is not clear whether a particular change will bite us badly or > > not. So doing changes becomes hard. There are heuristics, preemptions, > > queue selection logic, service trees, and bringing it all together > > for full hierarchy becomes interesting. > > > > I think the first thing which needs to be done is merge group scheduling > > and cfqq scheduling. Because of the flat hierarchy we currently use two > > scheduling algorithms. Old logic for queue selection and new logic > > for group scheduling. If we treat tasks and groups at the same level then > > we have to merge the two and come up with a single logic. > > I think this depends on how we decide to handle tasks vs. groups, > right? Yes. If we decide to account all the tasks of a group into a hidden group which competes with the other child groups, then there is no way one can create a hierarchy where tasks and groups are competing at the same level. So we can still continue to retain the existing logic. > > > [..] > > > * Vivek brought up the issue of distributing resources to tasks and > > > groups in the same cgroup. I don't know. Need to think more > > > about it. > > > > This one will require some thought. I have heard arguments for both the > > models. Treating tasks and groups at the same level seems to have one > > disadvantage and that is that people can't think of system resources > > in terms of %. People often say, give 20% of disk resources to a > > particular cgroup. But it is not possible as there are all the kernel > > threads running in the root cgroup and tasks come and go and that means > > the % share of a group is variable and not fixed. > > Another problem is that configuration isn't contained in cgroup > proper. We need a way to assign weights to individual tasks which can > be somehow directly compared against group weights.
cpu cooks > priority for this and blkcg may be able to cook ioprio but it's nasty > and unobvious. Also, let's say we grow a network bandwidth controller > for whatever reason. What value are we gonna use? So if somebody cares about setting SO_PRIORITY for traffic originating from a task, move it into a cgroup. Otherwise they all get the default priority. I think the question here is why you would want to provide a hidden group as the default mechanism in the kernel. If a user does not like the idea of tasks and groups competing at the same level, he can always create a cgroup and move all the tasks there. The only thing we need to provide is reliable ways of migrating a group of tasks into other cgroups at run time. By creating a hidden group for tasks, there also comes an issue of configuration of that hidden group (group weight, stats etc). By forcing user space to create a new group for tasks, it is an explicit cgroup and user space already knows how to handle it. So to me, leaving this decision to userspace based on their requirement makes sense. Also, I think the cpu controller has already discussed this in the past (the possibility of a hidden group for tasks). Peter will have more details about it, I think. > > > To make it fixed, we will need to make sure that the number of entities > > fighting for resources is not variable. That means only groups fight > > for resources at a level and tasks within groups. > > > > Now the question is should kernel enforce it or should it be left to > > user space. I think doing it in user space is also messy as different > > agents control different parts of the hierarchy. For example, if somebody says > > that give a particular virtual machine a x% of system resource, libvirt > > has no way to do that. At max it can ensure x% of parent group but above > > that hierarchy is controlled by systemd and libvirtd has no control > > over that.
> > > > Only possible way to do this will seem to be that systemd creates libvirt > > group at top level with a minimum fixed % of quota and then libvirt can > > figure out % share of each virtual machine. But it is hard to do. > > > > So while % model is more intuitive to users, it is hard to implement. So > > an easier way is to stick to the model of relative weights/share and > > let user specify relative importance of a virtual machine and actual > > quota or % will vary dynamically depending on other tasks/components > > in the system. > > Why is it hard to implement? You just need to treat tasks in the > current node as another group competing with other cgroups on equal > terms. If anything, isn't that simpler than treating scheduling > "entities"? I meant "hard to implement" in the sense of the kernel having to keep track of % and enforce it across the hierarchy, etc. Yes, creating a hidden group for tasks in the current group should not be hard from an implementation point of view. But again, I am concerned about the configuration of the hidden group and I also don't like the idea of taking away the user's flexibility to treat tasks and groups at the same level. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120917150518.GB5094-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 16:40 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-17 16:40 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Mon, Sep 17, 2012 at 11:05:18AM -0400, Vivek Goyal wrote: > As a developer, I will be happy to support only one model and keep code > simple. I am only concerned that for blkcg we have still not charted > out a clear migration path. The warning message your patch is giving > out will work only if we decide to not treat task and groups at same > level. It may not be enough but it still is in the right direction. > > Another problem is that configuration isn't contained in cgroup > > proper. We need a way to assign weights to individual tasks which can > > be somehow directly compared against group weights. cpu cooks > > priority for this and blkcg may be able to cook ioprio but it's nasty > > and unobvious. Also, let's say we grow network bandwidth controller > > for whatever reason. What value are we gonna use? > > So if somebody cares about settting SO_PRIORITY for traffic originating > from a tasks, move it into a cgroup. Otherwise they all get default > priority. I don't know. Do we wanna add, say, prctl for memory weight too? > So to me, leaving this decision to userspace based on their requirement > makes sense. Leaving too many decisions to userland is one of the reasons that got us into this mess, so I'm not sold on flexibility for flexibility's sake. > Yes, creating a hidden group for tasks in current group should not be > hard from implementation point of view. 
But again, I am concerned about > configuration of hidden group and I also don't like the idea of taking > flexibility away from user to treat tasks and group at same level. I don't know. Create a reserved directory for it? I do like the idea of taking flexibility away from the user unless it's actually useful but am a bit worried we might be too late for that. :( Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (3 preceding siblings ...) 2012-09-14 14:25 ` Vivek Goyal @ 2012-09-14 15:03 ` Michal Hocko [not found] ` <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-14 18:07 ` [RFC] cgroup TODOs Vivek Goyal ` (4 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-14 15:03 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Thu 13-09-12 13:58:27, Tejun Heo wrote: [...] > 2. memcg's __DEPRECATED_clear_css_refs > > This is a remnant of another weird design decision of requiring > synchronous draining of refcnts on cgroup removal and allowing > subsystems to veto cgroup removal - what's the userspace supposed to > do afterwards? Note that this also hinders co-mounting different > controllers. > > The behavior could be useful for development and debugging but it > unnecessarily interlocks userland visible behavior with in-kernel > implementation details. To me, it seems outright wrong (either > implement proper severing semantics in the controller or do full > refcnting) and disallows, for example, lazy drain of caching refs. > Also, it complicates the removal path with try / commit / revert > logic which has never been fully correct since the beginning. > > Currently, the only left user is memcg. > > Solution: > > * Update memcg->pre_destroy() such that it never fails. > > * Drop __DEPRECATED_clear_css_refs and all related logic. > Convert pre_destroy() to return void. > > Who: > > KAMEZAWA, Michal, PLEASE. I will make __DEPRECATED_clear_css_refs > trigger WARN sooner or later. Let's please get this settled. 
I think we are almost there. One big step was that we no longer charge to the parent and only move statistics but there are still some corner cases when we race with LRU handling. [...] > * memcg can be fully hierarchical but we need to phase out the flat > hierarchy support. Unfortunately, this involves flipping the > behavior for the existing users. Upstream will try to nudge users > with warning messages. Most burden would be on the distros and at > least SUSE seems to be on board with it. Needs coordination with > other distros. I am currently planning to add a warning to most of the currently maintained distributions to get as much coverage as possible. No default switch for obvious reasons but hopefully we will get some feedback at least. Thanks Tejun for doing this. We needed it for a long time. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [RFC] cgroup TODOs [not found] ` <20120914150306.GQ28039-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 14:02 ` Michal Hocko [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:02 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings [CCing Dave, Ben] Just a short summary as you were not on the CC list. This is a sort of follow-up on https://lkml.org/lkml/2012/9/3/211. The end result is slightly different because Tejun did a more generic cgroup solution (see below). I cannot do the same for OpenSUSE so I will stick with the memcg specific patch. On Fri 14-09-12 17:03:06, Michal Hocko wrote: > I am currently planning to add a warning to most of the currently > maintained distributions to get as much coverage as possible. No default > switch for obvious reasons but hopefully we will get some feedback at > least. Just for the record, I will post backports of the patch I ended up using for openSUSE 11.4 and 12.[12] and SLES-SP2 as a reply to this email (and 2.6.32 in case somebody is interested). I hope other distributions can either go with this (which will never be merged but it should help to identify dubious usage of flat hierarchies without a risk of breaking anything) or what Tejun has in his tree[1] 8c7f6edb (cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them) which is more generic but it is also slightly more intrusive.
--- [1] - git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-3.7-hierarchy -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 14:03 ` Michal Hocko [not found] ` <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 3.0] " Michal Hocko 2012-09-19 14:05 ` [PATCH 3.2+] " Michal Hocko 2 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:03 UTC (permalink / raw) To: Tejun Heo Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar From 34be56e3e7e4f9c31381ce35247e0a0b7f972874 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting. This has improved a lot since it was introduced. This schizophrenia makes the code and integration with other controllers more complicated (e.g. mounting it with a fully hierarchical one could have unexpected side effects) for no good reason so it would be good to make the memory controller behave only hierarchically. It seems that there are no good reasons for deep cgroup hierarchies which are not truly hierarchical so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three A, B, C have some tasks and their memory limits.
The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index f99f599..b61c34b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) } else { parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 ^ permalink raw reply related [flat|nested] 75+ messages in thread
[parent not found: <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140308.GB5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-19 19:38 ` David Rientjes [not found] ` <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: David Rientjes @ 2012-09-19 19:38 UTC (permalink / raw) To: Michal Hocko Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Wed, 19 Sep 2012, Michal Hocko wrote: > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index f99f599..b61c34b 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > } else { > parent = mem_cgroup_from_cont(cont->parent); > mem->use_hierarchy = parent->use_hierarchy; > + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, > + "Creating hierarchies with use_hierarchy==0 " > + "(flat hierarchy) is considered deprecated. " > + "If you believe that your setup is correct, " > + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); When I deprecated /proc/pid/oom_adj (now removed), we did a WARN_ONCE() and others complained that this unnecessarily alerts userspace scripts that a serious issue has occurred and Linus agreed that we shouldn't do deprecation in this way. The alternative is to use printk_once() instead. This applies to all three patches for this one, 3.0, and 3.2+. ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <alpine.DEB.2.00.1209191237020.749-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2012-09-20 13:24 ` Michal Hocko [not found] ` <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Michal Hocko @ 2012-09-20 13:24 UTC (permalink / raw) To: David Rientjes Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings On Wed 19-09-12 12:38:18, David Rientjes wrote: > On Wed, 19 Sep 2012, Michal Hocko wrote: > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index f99f599..b61c34b 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -3106,6 +3106,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) > > } else { > > parent = mem_cgroup_from_cont(cont->parent); > > mem->use_hierarchy = parent->use_hierarchy; > > + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, > > + "Creating hierarchies with use_hierarchy==0 " > > + "(flat hierarchy) is considered deprecated. " > > + "If you believe that your setup is correct, " > > + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); > > When I deprecated /proc/pid/oom_adj (now removed), we did a WARN_ONCE() > and others complained that this unnecessarily alters userspace scripts > that a serious issue has occurred and Linus agreed that we shouldn't do > deprecation in this way. The alternative is to use printk_once() instead. Yes printk_once is an alternative but I really wanted to have this as much visible as possible. 
People tend to react to stack traces more and this one will trigger only if somebody is either doing something wrong or the configuration is the one we are looking for. Compared to oom_adj, that one was used much more often so the WARN_ONCE was too verbose, especially when you usually had to wait for a userspace update, which is not the case here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120920132400.GC23872-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2012-09-20 22:33 ` David Rientjes [not found] ` <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: David Rientjes @ 2012-09-20 22:33 UTC (permalink / raw) To: Michal Hocko Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Thu, 20 Sep 2012, Michal Hocko wrote: > Yes printk_once is an alternative but I really wanted to have this as > much visible as possible. People tend to react to stack traceces more > and this one will trigger only if somebody is either doing something > wrong or the configuration is the one we are looking for. > That's the complete opposite of what Linus has said he wants, he said very specifically that he doesn't want WARN_ONCE() or WARN_ON_ONCE() for deprecation of tunables. If you want to have this merged, then please get him to ack it. > Comparing to oom_adj, that one was used much more often so the WARN_ONCE > was too verbose especially when you usually had to wait for an userspace > update which is not the case here. How is WARN_ONCE() too verbose for oom_adj? It's printed once! And how can you claim that userspace doesn't need to change if it's creating a hierarchy while use_hierarchy == 0? ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org>]
* Re: [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <alpine.DEB.2.00.1209201531250.17455-X6Q0R45D7oAcqpCFd4KODRPsWskHk0ljAL8bYrjMMd8@public.gmane.org> @ 2012-09-21 7:16 ` Michal Hocko 0 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-21 7:16 UTC (permalink / raw) To: David Rientjes Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings On Thu 20-09-12 15:33:23, David Rientjes wrote: > On Thu, 20 Sep 2012, Michal Hocko wrote: > > > Yes printk_once is an alternative but I really wanted to have this as > > much visible as possible. People tend to react to stack traces more > > and this one will trigger only if somebody is either doing something > > wrong or the configuration is the one we are looking for. > > > > That's the complete opposite of what Linus has said he wants, he said very > specifically that he doesn't want WARN_ONCE() or WARN_ON_ONCE() for > deprecation of tunables. If you want to have this merged, then please get > him to ack it. This is not meant to be merged upstream. I do not think this is stable material and Linus' tree will get the more generic cgroup based patch instead. This is just for distributions so that they can help to find use cases which would prevent use_hierarchy removal. > > Compared to oom_adj, that one was used much more often so the WARN_ONCE > > was too verbose especially when you usually had to wait for a userspace > > update which is not the case here. > > How is WARN_ONCE() too verbose for oom_adj? It's printed once! It prints much more than one line, right?
When I said oom_adj was used much more, I meant that more applications cared about the value (so the probability of the warning was quite high), not that the message would be printed multiple times. And to be honest I didn't mind WARN_ONCE being used for that. > And how can you claim that userspace doesn't need to change if it's > creating a hierarchy while use_hierarchy == 0? It is a code vs. configuration change. You have to wait for an update, or change and recompile, in the first case, while the second one can be done directly. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 75+ messages in thread
* [PATCH 3.0] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 Michal Hocko @ 2012-09-19 14:03 ` Michal Hocko 2012-09-19 14:05 ` [PATCH 3.2+] " Michal Hocko 2 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:03 UTC (permalink / raw) To: Tejun Heo Cc: Dave Jones, Neil Horman, Serge E. Hallyn, Ben Hutchings, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar From 9364396ddc0c8843fce3a7fda0255b39ba7e4f31 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior, which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting, which has improved a lot since it was introduced. This schizophrenia makes the code and the integration with other controllers more complicated (e.g. co-mounting it with a fully hierarchical controller could have unexpected side effects) for no good reason, so it would be good to make the memory controller behave only hierarchically. There seems to be no good reason for deep cgroup hierarchies which are not truly hierarchical, so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three of A, B, C have some tasks and their own memory limits. 
The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index e013b8e..d8ec0cd 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4976,6 +4976,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); mem->use_hierarchy = parent->use_hierarchy; mem->oom_kill_disable = parent->oom_kill_disable; + WARN_ONCE(!mem->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 75+ messages in thread
* [PATCH 3.2+] memcg: warn on deeper hierarchies with use_hierarchy==0 [not found] ` <20120919140203.GA5398-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 2012-09-19 14:03 ` [PATCH 2.6.32] memcg: warn on deeper hierarchies with use_hierarchy==0 Michal Hocko 2012-09-19 14:03 ` [PATCH 3.0] " Michal Hocko @ 2012-09-19 14:05 ` Michal Hocko 2 siblings, 0 replies; 75+ messages in thread From: Michal Hocko @ 2012-09-19 14:05 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Dave Jones, Ben Hutchings Should apply to 3.4 and later as well. --- From cbfc6f1cdb4d8095003036c84d250a391054f971 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> Date: Tue, 4 Sep 2012 15:55:03 +0200 Subject: [PATCH] memcg: warn on deeper hierarchies with use_hierarchy==0 The memory controller supports both hierarchical and non-hierarchical behavior, which is controlled by the use_hierarchy knob (0 by default). The primary motivation for this distinction was the ineffectiveness of hierarchical accounting, which has improved a lot since it was introduced. This schizophrenia makes the code and the integration with other controllers more complicated (e.g. co-mounting it with a fully hierarchical controller could have unexpected side effects) for no good reason, so it would be good to make the memory controller behave only hierarchically. There seems to be no good reason for deep cgroup hierarchies which are not truly hierarchical, so we could set the default to 1. This might, however, lead to unexpected regressions when somebody relies on the current default behavior. 
For example, consider the following setup: Root[cpuset,memory] | A (use_hierarchy=0) / \ B C All three A, B, C have some tasks and their memory limits. The hierarchy is created only because of the cpuset and its configuration. Say the default is changed. Then a memory pressure in C could influence both A and B which wouldn't happen before. The problem might be really hard to notice (unexpected slowdown). This configuration could be fixed up easily by reorganization, though: Root | A' (use_hierarchy=1, limit=unlimited, no tasks) /|\ A B C The problem is that we don't know whether somebody has an use case which cannot be transformed like that. Therefore this patch starts the slow transition to hierarchical only memory controller by warning users who are using flat hierarchies. The warning triggers only if a subgroup of non-root group is created with use_hierarchy==0. Signed-off-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- mm/memcontrol.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b63f5f7..6fbb0d7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -4920,6 +4920,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont) parent = mem_cgroup_from_cont(cont->parent); memcg->use_hierarchy = parent->use_hierarchy; memcg->oom_kill_disable = parent->oom_kill_disable; + WARN_ONCE(!memcg->use_hierarchy && parent != root_mem_cgroup, + "Creating hierarchies with use_hierarchy==0 " + "(flat hierarchy) is considered deprecated. " + "If you believe that your setup is correct, " + "we kindly ask you to contact linux-mm-Bw31MaZKKs3YtjvyW6yDsg@public.gmane.org and let us know"); } if (parent && parent->use_hierarchy) { -- 1.7.10.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (4 preceding siblings ...) 2012-09-14 15:03 ` Michal Hocko @ 2012-09-14 18:07 ` Vivek Goyal [not found] ` <20120914180754.GF6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2012-09-14 18:36 ` Aristeu Rozanski ` (3 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 18:07 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: [..] > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > Solution: > > I think a unified hierarchy with the ability to ignore subtrees > depending on controllers should work. For example, let's assume the > following hierarchy. > > R > / \ > A B > / \ > AA AB > > All controllers are co-mounted. 
There is a per-cgroup knob which > controls which controllers nest beyond it. If blkio doesn't want to > distinguish AA and AB, the user can specify that blkio doesn't nest > beyond A and blkio would see the tree as, > > R > / \ > A B > > While other controllers keep seeing the original tree. The exact > form of interface, I don't know yet. It could be a single file > which the user echoes [-]controller name into it or per-controller > boolean file. > > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. Hi Tejun, I am curious why you are planning to provide the capability of a controller-specific view of the hierarchy. To me it sounds pretty close to having separate hierarchies per controller, just in a slightly more restricted configuration. IOW, who is the user of this functionality and who is asking for it? Can we go all out where all controllers have only one hierarchy view? Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914180754.GF6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 18:53 ` Tejun Heo [not found] ` <20120914185324.GI17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:53 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, Vivek. On Fri, Sep 14, 2012 at 02:07:54PM -0400, Vivek Goyal wrote: > I am curious that why are you planning to provide capability of controller > specific view of hierarchy. To me it sounds pretty close to having > separate hierarchies per controller. Just that it is a little more > restricted configuration. I think it's a lot less crazy and gives us a way to bind a resource to a set of controller cgroups regardless of which task is looking at it, which is something we're sorely missing now. > IOW, who is is the user of this functionality and who is asking for it. > Can we go all out where all controllers have only one hierarchy view. I think the issue is that controllers inherently have overhead and behavior alterations depending on the tree organization. From the usage I see at Google, which uses nested cgroups extensively, at least that level of flexibility seems necessary. In addition, for some resources, granularity beyond a certain point simply doesn't work. Per-service granularity might make sense for cpu but applying it by default would be silly for blkio. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914185324.GI17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:28 ` Vivek Goyal [not found] ` <20120914192840.GG6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Vivek Goyal @ 2012-09-14 19:28 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 11:53:24AM -0700, Tejun Heo wrote: [..] > In addition, for some resources, granularity beyond certain point > simply doesn't work. Per-service granularity might make sense for cpu > but applying it by default would be silly for blkio. Hmm, in that case how will libvirt make use of blkio in the proposed scheme? We can't disable blkio nesting at the "system" level, so we will have to disable it at each service level except "libvirtd" so that libvirt can use blkio for its virtual machines. That means blkio will see each service in a cgroup of its own, and if that does not make sense by default, it's a problem. In the existing scheme, at least every service does not show up in its own cgroup from blkio's point of view. Everything is in root and libvirt can create its own cgroups, keeping the number of cgroups small. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914192840.GG6221-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:44 ` Tejun Heo [not found] ` <20120914194439.GP17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:44 UTC (permalink / raw) To: Vivek Goyal Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Serge Hallyn Hello, Vivek. On Fri, Sep 14, 2012 at 03:28:40PM -0400, Vivek Goyal wrote: > Hmm.., In that case how libvirt will make use of blkio in the proposed > scheme. We can't disable blkio nesting at "system" level. So We will > have to disable it at each service level except "libvirtd" so that > libvirt can use blkio for its virtual machines. > > That means blkio will see each service in a cgroup of its own and if > that does not make sense by default, its a problem. In the existing Yeap, if libvirtd wants to use blkcg, blkcg will be enabled up to libvirtd's root. It might not be optimal but I think it makes sense. If you want to exercise hierarchical control on a resource, the only sane way is sticking to the hierarchy until it reaches root. > scheme, atleast every service does not show up in its cgroup from > blkio point of view. Everthig is in root and libvirt can create its > own cgroups, keeping number of cgroups small. Even a broken clock is right twice a day. I don't think this is a behavior we can keep for the sake of "but if we do this ass-weird thing, we can bypass the overhead for XYZ" when it breaks so many fundamental things. I think there currently is too much (broken) flexibility, and I intend to remove it. That doesn't mean that removing all flexibility is the right direction. 
It inherently is a balancing act and I think the proposed solution is a reasonable tradeoff. There's an important difference between causing full overhead by default for all users and requiring some overhead when the use case at hand calls for the functionality. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914194439.GP17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 19:49 ` Tejun Heo [not found] ` <20120914194950.GQ17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 19:49 UTC (permalink / raw) To: Vivek Goyal Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 12:44:39PM -0700, Tejun Heo wrote: > I think there currently is too much (broken) flexibility and intent to > remove it. That doesn't mean that removeing all flexibility is the > right direction. It inherently is a balancing act and I think the > proposed solution is a reasonable tradeoff. There's important > difference between causing full overhead by default for all users and > requiring some overhead when the use case at hand calls for the > functionality. That said, if someone can think of a better solution, I'm all ears. One thing that *has* to be maintained is that it should be possible to tag a resource in such a way that its associated controllers are identifiable regardless of which task is looking at it. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914194950.GQ17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-14 20:39 ` Tejun Heo [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-14 20:39 UTC (permalink / raw) To: Vivek Goyal Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V, Serge Hallyn Hello, again. On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: > That said, if someone can think of a better solution, I'm all ears. > One thing that *has* to be maintained is that it should be able to tag > a resource in such way that its associated controllers are > identifiable regardless of which task is looking at it. So, I thought about it more. How about we do "consider / ignore this node" instead of "(don't) nest beyond this level"? For example, let's assume a tree like the following. R / | \ A B C / \ AA AB If we want to differentiate between AA and AB, we'll have to consider the whole tree with the previous scheme - A needs to nest, so R needs to nest, and we end up with the whole tree. Instead, if we have honor / ignore this node, we can set the honor bit on A, AA and AB and see the tree as R / A / \ AA AB We still see the intermediate A node but can ignore the other branches. Implementation- and concept-wise, it's fairly simple too. For any given node and controller, you travel upwards until you meet a node which has the controller enabled, and that's the cgroup the controller considers. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> @ 2012-09-17 8:40 ` Glauber Costa [not found] ` <5056E1FC.1090508-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> 2012-09-17 14:37 ` Vivek Goyal 1 sibling, 1 reply; 75+ messages in thread From: Glauber Costa @ 2012-09-17 8:40 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Vivek Goyal On 09/15/2012 12:39 AM, Tejun Heo wrote: > Hello, again. > > On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: >> That said, if someone can think of a better solution, I'm all ears. >> One thing that *has* to be maintained is that it should be able to tag >> a resource in such way that its associated controllers are >> identifiable regardless of which task is looking at it. > > So, I thought about it more. How about we do "consider / ignore this > node" instead of "(don't) nest beyond this level". For example, let's > assume a tree like the following. > > R > / | \ > A B C > / \ > AA AB > > If we want to differentiate between AA and AB, we'll have to consider > the whole tree with the previous sheme - A needs to nest, so R needs > to nest and we end up with the whole tree. Instead, if we have honor > / ignore this node. We can set the honor bit on A, AA and AB and see > the tree as > > R > / > A > / \ > AA AB > > We still see the intermediate A node but can ignore the other > branches. Implementation and concept-wise, it's fairly simple too. > For any given node and controller, you travel upwards until you meet a > node which has the controller enabled and that's the cgroup the > controller considers. > > Thanks. 
> That is exactly what I proposed in our previous discussions around memcg, with files like "available_controllers", "current_controllers". Names chosen to match what other subsystems already do. If memcg is not in "available_controllers" for a node, it cannot be seen by anyone below that level. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <5056E1FC.1090508-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org> @ 2012-09-17 17:30 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-17 17:30 UTC (permalink / raw) To: Glauber Costa Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Vivek Goyal On Mon, Sep 17, 2012 at 12:40:28PM +0400, Glauber Costa wrote: > That is exactly what I proposed in our previous discussions around > memcg, with files like "available_controllers" , "current_controllers". > Name chosen to match what other subsystems already do. > > if memcg is not in "available_controllers" for a node, it cannot be seen > by anyone bellow that level. Glauber, I was talking about making the switch applicable from the current level *INSTEAD OF* anyone below the current level, so that we don't have to apply the same switch on all siblings. I have no idea why this is causing so much miscommunication. :( -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914203925.GR17747-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 2012-09-17 8:40 ` Glauber Costa @ 2012-09-17 14:37 ` Vivek Goyal 1 sibling, 0 replies; 75+ messages in thread From: Vivek Goyal @ 2012-09-17 14:37 UTC (permalink / raw) To: Tejun Heo Cc: Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Serge Hallyn, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Fri, Sep 14, 2012 at 01:39:25PM -0700, Tejun Heo wrote: > Hello, again. > > On Fri, Sep 14, 2012 at 12:49:50PM -0700, Tejun Heo wrote: > > That said, if someone can think of a better solution, I'm all ears. > > One thing that *has* to be maintained is that it should be able to tag > > a resource in such way that its associated controllers are > > identifiable regardless of which task is looking at it. > > So, I thought about it more. How about we do "consider / ignore this > node" instead of "(don't) nest beyond this level". For example, let's > assume a tree like the following. > > R > / | \ > A B C > / \ > AA AB > > If we want to differentiate between AA and AB, we'll have to consider > the whole tree with the previous sheme - A needs to nest, so R needs > to nest and we end up with the whole tree. Instead, if we have honor > / ignore this node. We can set the honor bit on A, AA and AB and see > the tree as > > R > / > A > / \ > AA AB > > We still see the intermediate A node but can ignore the other > branches. Implementation and concept-wise, it's fairly simple too. > For any given node and controller, you travel upwards until you meet a > node which has the controller enabled and that's the cgroup the > controller considers. I think this proposal sounds reasonable. So by default if a new cgroup is created, we can inherit the controller settings of parent. 
And if the user does not want a particular controller enabled on a newly created cgroup, they will have to explicitly disable it. Thanks Vivek ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (4 preceding siblings ...) 2012-09-14 15:03 ` Michal Hocko @ 2012-09-14 18:36 ` Aristeu Rozanski [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 22:03 ` Dhaval Giani ` (2 subsequent siblings) 9 siblings, 1 reply; 75+ messages in thread From: Aristeu Rozanski @ 2012-09-14 18:36 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V Tejun, On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > memcg can be handled by memcg people and I can handle cgroup_freezer > and others with help from the authors. The problematic one is > blkio. If anyone is interested in working on blkio, please be my > guest. Vivek? Glauber? If Serge is not planning to do it already, I can take a look at device_cgroup. Also, I heard about the desire to have a device namespace instead, with support for translation ("sda" -> "sdf"). If anyone sees an immediate use for this, please let me know. -- Aristeu ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> @ 2012-09-14 18:54 ` Tejun Heo 2012-09-15 2:20 ` Serge E. Hallyn 2012-09-16 8:19 ` [RFC] cgroup TODOs James Bottomley 2 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 18:54 UTC (permalink / raw) To: Aristeu Rozanski Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 02:36:41PM -0400, Aristeu Rozanski wrote: > if Serge is not planning to do it already, I can take a look in device_cgroup. Yes please. :) Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 18:54 ` Tejun Heo @ 2012-09-15 2:20 ` Serge E. Hallyn [not found] ` <20120915022037.GA6438-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 2012-09-16 8:19 ` [RFC] cgroup TODOs James Bottomley 2 siblings, 1 reply; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-15 2:20 UTC (permalink / raw) To: Aristeu Rozanski Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar, Eric W. Biederman Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > Tejun, > On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > > memcg can be handled by memcg people and I can handle cgroup_freezer > > and others with help from the authors. The problematic one is > > blkio. If anyone is interested in working on blkio, please be my > > guest. Vivek? Glauber? > > if Serge is not planning to do it already, I can take a look in device_cgroup. That's fine with me, thanks. > also, heard about the desire of having a device namespace instead with > support for translation ("sda" -> "sdf"). If anyone see immediate use for > this please let me know. Before going down this road, I'd like to discuss this with at least you, me, and Eric Biederman (cc:d) as to how it relates to a device namespace. thanks, -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
* Controlling devices and device namespaces [not found] ` <20120915022037.GA6438-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2012-09-15 9:27 ` Eric W. Biederman [not found] ` <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-15 9:27 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): >> Tejun, >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: >> > memcg can be handled by memcg people and I can handle cgroup_freezer >> > and others with help from the authors. The problematic one is >> > blkio. If anyone is interested in working on blkio, please be my >> > guest. Vivek? Glauber? >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. > > That's fine with me, thanks. > >> also, heard about the desire of having a device namespace instead with >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> this please let me know. > > Before going down this road, I'd like to discuss this with at least you, > me, and Eric Biederman (cc:d) as to how it relates to a device > namespace. The problem with devices. - An unrestricted mknod gives you access to effectively any device in the system. - During process migration, if the device number changes, stat on the same file descriptor can start failing or returning different results. - Devices coming from preexisting filesystems that we mount as unprivileged users are as dangerous as mknod, but show that the problem is not limited to mknod. 
- udev thinks mknod is a system call we can remove from the kernel. --- The use cases seem comparatively simple to enumerate. - Giving unfiltered access to a device to someone not root. - Virtual devices that everyone uses and have no real privilege requirements: /dev/null /dev/tty /dev/zero etc. - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, nbd, iscsi, /dev/ptsN, etc --- There are a couple of solutions to these problems. - The classic solution of creating a /dev for a container before starting it. - The devpts filesystem. This works well for unprivileged access to ptys. Except for the /dev/ptmx silliness I very much like how things are handled today with devpts. - Device control groups. I am not quite certain what to make of them. The only case I see where they are better than a prebuilt static dev is if there is a hotplugged device that I want to push into my container. I think the only problem with device control groups and hierarchies is that removing a device from a whitelist does not recurse down the hierarchy. Can a process inside of a device control group create a child group that has access to a subset of its devices? The actual checks don't need to be hierarchical but the presence of device nodes should be. --- I see a couple of holes in the device control picture. - How do we handle hotplug events? I think we can do this by relaying events through userspace, updating the device control groups etc. - Unprivileged processes interacting with all of this. (possibly with privilege in their user namespace) What I don't know how to do is how to create a couple of different subhierarchies each for different child processes. - Dynamically created devices. My gut feel is that we should replicate the success of devpts and give each type of dynamically created device its own filesystem and mount point under /dev, and just bend the handful of userspace users into that model. 
- Sysfs My gut says for the container use case we should aim to simply not have dynamically created devices in sysfs and then we can simply not care. - Migration Either we need block device numbers that can migrate with us, (possibly a subset of the entire range ala devpts) or we need to send hotplug events to userspace right after a migration so userspace processes that care can invalidate their caches of stat data. --- With the code in my userns development tree I can create a user namespace, create a new mount namespace, and then if I have access to any block devices mount filesystems, all without needing to have any special privileges. What I haven't figured out is what it would take to get the device control group into the middle of that. It feels like it should be possible to get the checks straight and use the device control group hooks to control which devices are usable in a user namespace. Unfortunately when I try and work it out, the independence of the user namespace and the device control group seems to make that impossible. Shrug, there is most definitely something missing from our model of how to handle devices well. I am hoping we can sprinkle some devpts-derived pixie dust at the problem, migrate userspace to some new interfaces, and have life be good. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <87wqzv7i08.fsf_-_-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-15 22:05 ` Serge E. Hallyn [not found] ` <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-15 22:05 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > >> Tejun, > >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > >> > memcg can be handled by memcg people and I can handle cgroup_freezer > >> > and others with help from the authors. The problematic one is > >> > blkio. If anyone is interested in working on blkio, please be my > >> > guest. Vivek? Glauber? > >> > >> if Serge is not planning to do it already, I can take a look in device_cgroup. > > > > That's fine with me, thanks. > > > >> also, heard about the desire of having a device namespace instead with > >> support for translation ("sda" -> "sdf"). If anyone see immediate use for > >> this please let me know. > > > > Before going down this road, I'd like to discuss this with at least you, > > me, and Eric Biederman (cc:d) as to how it relates to a device > > namespace. > > > The problem with devices. > > - An unrestricted mknod gives you access to effectively any device in > the system. > > - During process migration if the device number changes using > stat to file descriptors can fail on the same file descriptor. 
> > - Devices coming from prexisting filesystems that we mount > as unprivileged users are as dangerous as mknod but show > that the problem is not limited to mknod. > > - udev thinks mknod is a system call we can remove from the kernel. Also, - udevadm trigger --action=add causes all the devices known on the host to be re-sent to everyone (all namespaces). Which floods everyone and causes the host to reset some devices. > --- > > The use cases seem comparitively simple to enumerate. > > - Giving unfiltered access to a device to someone not root. > > - Virtual devices that everyone uses and have no real privilege > requirements: /dev/null /dev/tty /dev/zero etc. > > - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, > nbd, iscsi, /dev/ptsN, etc and - per-namespace uevent filtering. > --- > > There are a couple of solution to these problems. > > - The classic solution of creating a /dev for a container > before starting it. > > - The devpts filesystem. This works well for unprivileged access > to ptys. Except for the /dev/ptmx sillines I very like how > things are handled today with devpts. > > - Device control groups. I am not quite certain what to make > of them. The only case I see where they are better than > a prebuilt static dev is if there is a hotppluged device > that I want to push into my container. > > I think the only problem with device control groups and > hierarchies is that removing a device from a whitelist > does not recurse down the hierarchy. That's going to be fixed soon thanks to Aristeu :) > Can a process inside of a device control group create > a child group that has access to a subset of it's > devices? The actually checks don't need to be hierarchical > but the presence of device nodes should be. If I understand your question right, yes. > --- > > I see a couple of holes in the device control picture. > > - How do we handle hotplug events? 
> > I think we can do this by relaying events trough userspace, > upating the device control groups etc. > > - Unprivileged processess interacting with all of this. > (possibly with privilege in their user namespace) > What I don't know how to do is how to create a couple of different > subhierarchies each for different child processes. > > - Dynamically created devices. > > My gut feel is that we should replicate the success of devpts > and give each type of dynamically created device it's own > filesystem and mount point under /dev, and just bend > the handful of userspace users into that model. Phew. Maybe. Had not considered that. But seems daunting. > - Sysfs > > My gut says for the container use case we should aim to > simply not have dynamically created devices in sysfs > and then we can simply not care. > > - Migration > > Either we need block device numbers that can migrate with us, > (possibly a subset of the entire range ala devpts) or we need to send > hotplug events to userspace right after a migration so userspace > processes that care can invalidate their caches of stat data. > > --- > > With the code in my userns development tree I can create a user > namespace, create a new mount namespace, and then if I have > access to any block devices mount filesystems, all without > needing to have any special privileges. What I haven't > figured out is what it would take to get the the device > control group into the middle that. I'm really not sure that's a question we want to ask. The device control group, like the ns cgroup, was meant as a temporary workaround to not having user and device namespaces. If we can come up with a device cgroup model that works to fill all the requirements we would have for a devices ns, then great. But I don't want us to be constrained by that. > It feels like it should be possible to get the checks straight > and use the device control group hooks to control which devices > are usable in a user namespace. 
Unfortunately when I try and work > it out the independence of the user namespace and the device > control group seem to make that impossible. > > Shrug there is most definitely something missing from our > model on how to handle devices well. I am hoping we can > sprinkling some devpts derived pixie dust at the problem > migrate userspace to some new interfaces and have life > be good. > > Eric Me too! I'm torn between suggesting that we have a session at UDS to discuss this, and not wanting to so that we can focus on the remaining questions with the user namespace. thanks, -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <20120915220520.GA11364-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org> @ 2012-09-16 0:24 ` Eric W. Biederman [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 0:24 UTC (permalink / raw) To: Serge E. Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Thinking about this a bit more I think we have been asking the wrong question. I think the correct question should be: How do we safely allow for unprivileged creation of device nodes and devices? One piece of the puzzle is that we should be able to allow unprivileged device node creation and access for any device on any filesystem for which unprivileged access is safe. Something like the current device control group hooks but with the whitelist implemented like:

static bool unpriv_mknod_ok(struct device *dev)
{
	char *tmp, *name;
	umode_t mode = 0;

	name = device_get_devnode(dev, &mode, &tmp);
	if (!name)
		return false;
	kfree(tmp);
	return mode == 0666;
}

Are there current use cases where people actually want arbitrary access to hardware devices? I really want to say no and get udev and sysfs out of the picture as much as possible. "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): >> "Serge E. 
Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: >> >> > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): >> >> Tejun, >> >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: >> >> > memcg can be handled by memcg people and I can handle cgroup_freezer >> >> > and others with help from the authors. The problematic one is >> >> > blkio. If anyone is interested in working on blkio, please be my >> >> > guest. Vivek? Glauber? >> >> >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. >> > >> > That's fine with me, thanks. >> > >> >> also, heard about the desire of having a device namespace instead with >> >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> >> this please let me know. >> > >> > Before going down this road, I'd like to discuss this with at least you, >> > me, and Eric Biederman (cc:d) as to how it relates to a device >> > namespace. >> >> >> The problem with devices. >> >> - An unrestricted mknod gives you access to effectively any device in >> the system. >> >> - During process migration if the device number changes using >> stat to file descriptors can fail on the same file descriptor. >> >> - Devices coming from prexisting filesystems that we mount >> as unprivileged users are as dangerous as mknod but show >> that the problem is not limited to mknod. >> >> - udev thinks mknod is a system call we can remove from the kernel. > > Also, > > - udevadm trigger --action=add > > causes all the devices known on the host to be re-sent to > everyone (all namespaces). Which floods everyone and causes the > host to reset some devices. I think this is all userspace activity, and should be largely fixed by not being root in a container. >> --- >> >> The use cases seem comparitively simple to enumerate. >> >> - Giving unfiltered access to a device to someone not root. 
>> >> - Virtual devices that everyone uses and have no real privilege >> requirements: /dev/null /dev/tty /dev/zero etc. >> >> - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, >> nbd, iscsi, /dev/ptsN, etc > > and > > - per-namespace uevent filtering. One possible solution there is to just send the kernel uevents (except for the network ones) into the initial network namespace. >> --- >> >> There are a couple of solution to these problems. >> >> - The classic solution of creating a /dev for a container >> before starting it. >> >> - The devpts filesystem. This works well for unprivileged access >> to ptys. Except for the /dev/ptmx sillines I very like how >> things are handled today with devpts. >> >> - Device control groups. I am not quite certain what to make >> of them. The only case I see where they are better than >> a prebuilt static dev is if there is a hotppluged device >> that I want to push into my container. >> >> I think the only problem with device control groups and >> hierarchies is that removing a device from a whitelist >> does not recurse down the hierarchy. > > That's going to be fixed soon thanks to Aristeu :) > >> Can a process inside of a device control group create >> a child group that has access to a subset of it's >> devices? The actually checks don't need to be hierarchical >> but the presence of device nodes should be. > > If I understand your question right, yes. I should also have asked can we do this without any capabilities and without our uid being 0? >> --- >> >> I see a couple of holes in the device control picture. >> >> - How do we handle hotplug events? >> >> I think we can do this by relaying events trough userspace, >> upating the device control groups etc. >> >> - Unprivileged processess interacting with all of this. >> (possibly with privilege in their user namespace) >> What I don't know how to do is how to create a couple of different >> subhierarchies each for different child processes. 
>> >> - Dynamically created devices. >> >> My gut feel is that we should replicate the success of devpts >> and give each type of dynamically created device it's own >> filesystem and mount point under /dev, and just bend >> the handful of userspace users into that model. > > Phew. Maybe. Had not considered that. But seems daunting. I think the list of device types that we care about here is pretty small. Please correct me if I am wrong. loop nbd iscsi macvtap And if we want it to be safe to use these devices in a user namespace without global root privileges we need to go through the code anyway. So I think it is the gradual, safe and sane approach, assuming we don't run into something like the devpts /dev/ptmx silliness that stalled devpts. >> - Sysfs >> >> My gut says for the container use case we should aim to >> simply not have dynamically created devices in sysfs >> and then we can simply not care. I guess what I keep thinking for sysfs is that it should be for real hardware backed devices. If we can get away with that like we do with ptys it just makes everyone's life simpler. Primarily sysfs and uevents are for allowing the system to take automatic action when a new device is created. Do we have an actual need for hotplug support in containers? >> - Migration >> >> Either we need block device numbers that can migrate with us, >> (possibly a subset of the entire range ala devpts) or we need to send >> hotplug events to userspace right after a migration so userspace >> processes that care can invalidate their caches of stat data. >> >> --- >> >> With the code in my userns development tree I can create a user >> namespace, create a new mount namespace, and then if I have >> access to any block devices mount filesystems, all without >> needing to have any special privileges. What I haven't >> figured out is what it would take to get the the device >> control group into the middle that. > > I'm really not sure that's a question we want to ask. 
The > device control group, like the ns cgroup, was meant as a > temporary workaround to not having user and device namespaces. > > If we can come up with a device cgroup model that works to > fill all the requirements we would have for a devices ns, then > great. But I don't want us to be constrained by that. > >> It feels like it should be possible to get the checks straight >> and use the device control group hooks to control which devices >> are usable in a user namespace. Unfortunately when I try and work >> it out the independence of the user namespace and the device >> control group seem to make that impossible. >> >> Shrug there is most definitely something missing from our >> model on how to handle devices well. I am hoping we can >> sprinkling some devpts derived pixie dust at the problem >> migrate userspace to some new interfaces and have life >> be good. >> >> Eric > > Me too! > > I'm torn between suggesting that we have a session at UDS to > discuss this, and not wanting to so that we can focus on the > remaining questions with the user namespace. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 3:31 ` Serge E. Hallyn 2012-09-16 11:21 ` Alan Cox 1 sibling, 0 replies; 75+ messages in thread From: Serge E. Hallyn @ 2012-09-16 3:31 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > > Thinking about this a bit more I think we have been asking the wrong > question. > > I think the correct question should be: How do we safely allow for > unprivileged creation of device nodes and devices? > > One piece of the puzzle is that we should be able to allow unprivileged > device node creation and access for any device on any filesystem > for which it unprivileged access is safe. > > Something like the current device control group hooks but > with the whitelist implemented like: > > static bool unpriv_mknod_ok(struct device *dev) > { > char *tmp, *name; > umode_t mode = 0; > > name = device_get_devnode(dev, &mode, &tmp); > if (!name) > return false; > kfree(tmp); > return mode == 0666; > } > > Are there current use cases where people actually want arbitrary > access to hardware devices? I really want to say no and get > udev and sysfs out of the picture as much as possible. Other devices I'm pretty sure people will be asking for include audio and video devices, input devices, usb drives, LVM volumes and probably volume groups and PVs as well. I do believe people want to dedicate drives to containers. Of course there is also /dev/random, and /dev/kmsg which I think needs to be tied to the also sorely missing syslog namespace. > "Serge E. 
Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > > > Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org): > >> "Serge E. Hallyn" <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > >> > >> > Quoting Aristeu Rozanski (aris-moeOTchvdi7YtjvyW6yDsg@public.gmane.org): > >> >> Tejun, > >> >> On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote: > >> >> > memcg can be handled by memcg people and I can handle cgroup_freezer > >> >> > and others with help from the authors. The problematic one is > >> >> > blkio. If anyone is interested in working on blkio, please be my > >> >> > guest. Vivek? Glauber? > >> >> > >> >> if Serge is not planning to do it already, I can take a look in device_cgroup. > >> > > >> > That's fine with me, thanks. > >> > > >> >> also, heard about the desire of having a device namespace instead with > >> >> support for translation ("sda" -> "sdf"). If anyone see immediate use for > >> >> this please let me know. > >> > > >> > Before going down this road, I'd like to discuss this with at least you, > >> > me, and Eric Biederman (cc:d) as to how it relates to a device > >> > namespace. > >> > >> > >> The problem with devices. > >> > >> - An unrestricted mknod gives you access to effectively any device in > >> the system. > >> > >> - During process migration if the device number changes using > >> stat to file descriptors can fail on the same file descriptor. > >> > >> - Devices coming from prexisting filesystems that we mount > >> as unprivileged users are as dangerous as mknod but show > >> that the problem is not limited to mknod. > >> > >> - udev thinks mknod is a system call we can remove from the kernel. > > > > Also, > > > > - udevadm trigger --action=add > > > > causes all the devices known on the host to be re-sent to > > everyone (all namespaces). Which floods everyone and causes the > > host to reset some devices. 
> > I think this is all userspace activity, Well the uevents are sent from the kernel, and cause a flurry of userspace activity. (But not sending uevents to the containers as you suggest below would work) > and should be largely > fixed by not begin root in a container. That doesn't fit with our goal, which is to run the same, unmodified userspace on hardware, virtualization (kvm/vmware), and containers. This is important - the more we have to have different init and userspace in containers (there are a few things we have to special-case still at the moment) the more duplicated testing and otherwise avoidable bugs we'll have. Or did you just mean not being GLOBAL_ROOT_UID in a container? > >> The use cases seem comparitively simple to enumerate. > >> > >> - Giving unfiltered access to a device to someone not root. > >> > >> - Virtual devices that everyone uses and have no real privilege > >> requirements: /dev/null /dev/tty /dev/zero etc. > >> > >> - Dynamically created devices /dev/loopN /dev/tun /dev/macvtapN, > >> nbd, iscsi, /dev/ptsN, etc > > > > and > > > > - per-namespace uevent filtering. > > One possible solution there is to just send the kernel uevents (except > for the network ones) into the initial network namespace. We'd also want storage (especially usb but not just) passed in, and audio, video and input - but maybe those should be faked from userspace from the host (or parent container)? Also, there *are* containers which are not in private network namespaces. Now I'm not sure how much we worry about those, as they generally need custom init anyway (so as not to reconfigure the host's networking etc). > >> There are a couple of solution to these problems. > >> > >> - The classic solution of creating a /dev for a container > >> before starting it. > >> > >> - The devpts filesystem. This works well for unprivileged access > >> to ptys. Except for the /dev/ptmx sillines I very like how > >> things are handled today with devpts. 
> >> > >> - Device control groups. I am not quite certain what to make > >> of them. The only case I see where they are better than > >> a prebuilt static dev is if there is a hotppluged device > >> that I want to push into my container. > >> > >> I think the only problem with device control groups and > >> hierarchies is that removing a device from a whitelist > >> does not recurse down the hierarchy. > > > > That's going to be fixed soon thanks to Aristeu :) > > > >> Can a process inside of a device control group create > >> a child group that has access to a subset of it's > >> devices? The actually checks don't need to be hierarchical > >> but the presence of device nodes should be. > > > > If I understand your question right, yes. > > I should also have asked can we do this without any capabilities > and without our uid being 0? Currently you need CAP_SYS_ADMIN to update device cgroup permissions. > >> I see a couple of holes in the device control picture. > >> > >> - How do we handle hotplug events? > >> > >> I think we can do this by relaying events trough userspace, > >> upating the device control groups etc. > >> > >> - Unprivileged processess interacting with all of this. > >> (possibly with privilege in their user namespace) > >> What I don't know how to do is how to create a couple of different > >> subhierarchies each for different child processes. > >> > >> - Dynamically created devices. > >> > >> My gut feel is that we should replicate the success of devpts > >> and give each type of dynamically created device it's own > >> filesystem and mount point under /dev, and just bend > >> the handful of userspace users into that model. > > > > Phew. Maybe. Had not considered that. But seems daunting. > > I think the list of device types that we care about here is pretty > small. Please correct me if I am wrong. 
> > loop nbd iscsi macvtap I assume you're asking only about devices that need virtualized instances, with the instances either unique or mapped between namespaces. (and I assume the hope is that we can get away with them being unique, as with devpts, and mappable with bind mounts) I can't think of any others offhand. Common devices used in containers include tty*, rtc, fuse, tun, hpet, kvm. /dev/tty and /dev/console are special anyway. The tty* in containers are always bind mounted with devpts. So I don't think any of those fit the criteria - no work needed. > And if we want it to be safe to use these devices in a user namespace > without global root privileges we need to go through the code anyway. Agreed. > So I think it is the gradual safe and sane approach assume we don't > run into something like the devpts /dev/ptmx silliness that stalled > devpts. Agreed. > >> - Sysfs > >> > >> My gut says for the container use case we should aim to > >> simply not have dynamically created devices in sysfs > >> and then we can simply not care. > > I guess what I keep thinking for sysfs is that it should be for real > hardware backed devices. If we can get away with that like we do with > ptys it just makes everyone's life simpler. You've brought up /sys and /proc, does devtmpfs further complicate things? > Primarily sysfs and uevents are for allowing the system to take > automatic action when a new device is created. Do we have an actual > need for hotplug support in containers? As I argue above, I claim we need them for the event-driven init systems to see NICs and other devices brought up, and to handle passing in usb devices etc. -serge ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87y5kazuez.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-09-16 3:31 ` Serge E. Hallyn @ 2012-09-16 11:21 ` Alan Cox [not found] ` <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Alan Cox @ 2012-09-16 11:21 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar > One piece of the puzzle is that we should be able to allow unprivileged > device node creation and access for any device on any filesystem > for which it unprivileged access is safe. Which devices are "safe" is policy for all interesting and useful cases, as are file permissions, security tags, chroot considerations and the like. It's a complete non starter. Alan ^ permalink raw reply [flat|nested] 75+ messages in thread
[parent not found: <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org>]
* Re: Controlling devices and device namespaces [not found] ` <20120916122112.3f16178d-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> @ 2012-09-16 11:56 ` Eric W. Biederman [not found] ` <87sjaiuqp5.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 11:56 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> One piece of the puzzle is that we should be able to allow unprivileged >> device node creation and access for any device on any filesystem >> for which it unprivileged access is safe. > > Which devices are "safe" is policy for all interesting and useful cases, > as are file permissions, security tags, chroot considerations and the > like. > > It's a complete non starter. There are a handful of device nodes that the kernel creates with mode 0666. Essentially it is just /dev/tty /dev/null /dev/zero and a few others. Enormous numbers of programs won't work without them. Making them both interesting and useful. In very peculiar cases I can see not wanting to have access to generally safe devices, like in other peculiar cases we don't want access to the network stack. As for the general case, device nodes for real hardware in a container are, I think, the "interesting" case you were referring to. I personally find that case icky and boring. The sanest way I can think of handling real hardware device nodes is a tmpfs (acting like devtmpfs) mounted on /dev in the container's mount namespace, but also visible outside to the global root mounted somewhere interesting. 
We have a fuse filesystem pretending to be sysfs and relaying file accesses from the real sysfs for just the devices that we want to allow to that container. Then to add a device in a container, the managing daemon makes the devices available in the pretend sysfs, calls mknod on the tmpfs, and fakes the uevents. The only case I don't see that truly covering is keeping the stat data the same for files of migrated applications. Shrug, perhaps that will just have to be handled with another synthesized uevent. Hey userspace, I just hot-unplugged and hot-plugged your kernel; please cope. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87sjaiuqp5.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 12:17 ` Eric W. Biederman [not found] ` <87d31mupp3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 12:17 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: > >>> One piece of the puzzle is that we should be able to allow unprivileged >>> device node creation and access for any device on any filesystem >>> for which it unprivileged access is safe. >> >> Which devices are "safe" is policy for all interesting and useful cases, >> as are file permissions, security tags, chroot considerations and the >> like. >> >> It's a complete non starter. Come to think of it mknod is completely unnecessary. Without mknod. Without being able to mount filesystems containing device nodes. The mount namespace is sufficient to prevent all of the cases that the device control group prevents (open and mknod on device nodes). So I honestly think the device control group is superfluous, and it is probably wise to deprecate it and move to a model where it does not exist. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <87d31mupp3.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 13:32 ` Serge Hallyn [not found] ` <5055D4D1.3070407-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Serge Hallyn @ 2012-09-16 13:32 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox On 09/16/2012 07:17 AM, Eric W. Biederman wrote: > ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: > >> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> >>>> One piece of the puzzle is that we should be able to allow unprivileged >>>> device node creation and access for any device on any filesystem >>>> for which it unprivileged access is safe. >>> >>> Which devices are "safe" is policy for all interesting and useful cases, >>> as are file permissions, security tags, chroot considerations and the >>> like. >>> >>> It's a complete non starter. > > Come to think of it mknod is completely unnecessary. > > Without mknod. Without being able to mount filesystems containing > device nodes. Hm? That sounds like it will really upset init/udev/upgrades in the container. Are you saying all filesystems containing device nodes will need to be mounted in advance by the process setting up the container? > The mount namespace is sufficient to prevent all of the > cases that the device control group prevents (open and mknod on device > nodes). > > So I honestly think the device control group is superflous, and it is > probably wise to deprecate it and move to a model where it does not > exist. 
> > Eric > That's what I said a few emails ago :) The device cgroup was meant as a short-term workaround for lack of user (and device) namespaces. ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <5055D4D1.3070407-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2012-09-16 14:23 ` Eric W. Biederman [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 14:23 UTC (permalink / raw) To: Serge Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > On 09/16/2012 07:17 AM, Eric W. Biederman wrote: >> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: >> >>> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >>> >>>>> One piece of the puzzle is that we should be able to allow unprivileged >>>>> device node creation and access for any device on any filesystem >>>>> for which it unprivileged access is safe. >>>> >>>> Which devices are "safe" is policy for all interesting and useful cases, >>>> as are file permissions, security tags, chroot considerations and the >>>> like. >>>> >>>> It's a complete non starter. >> >> Come to think of it mknod is completely unnecessary. >> >> Without mknod. Without being able to mount filesystems containing >> device nodes. > > Hm? That sounds like it will really upset init/udev/upgrades in the > container. udev does not create device nodes. For an older udev the worst I can see it doing is having mknod failing with EEXIST because the device node already exists. We should be able to make it look to init like a ramdisk mounted the filesystems. Why should upgrades care? Package installation shouldn't be calling mknod. At least with a recent modern distro I can't imagine this to be an issue. 
I expect we could have a kernel build option that removed the mknod system call and a modern distro wouldn't notice. > Are you saying all filesystems containing device nodes will need to be mounted in advance by the process setting up the container? As a general rule. I think in practice there is wiggle room for special cases like mounting a fresh devpts. devpts, at least in always-create-a-new-instance-on-mount mode, seems safe, as it cannot give you access to any existing devices. You can also do a lot of what would normally be done with mknod with bind mounts to the original device's location. >> The mount namespace is sufficient to prevent all of the >> cases that the device control group prevents (open and mknod on device >> nodes). >> >> So I honestly think the device control group is superflous, and it is >> probably wise to deprecate it and move to a model where it does not >> exist. >> >> Eric >> > > That's what I said a few emails ago :) The device cgroup was meant as > a short-term workaround for lack of user (and device) namespaces. I am saying something stronger. The device cgroup doesn't seem to have a practical function now. That for the general case we don't need any kernel support. That all of this should be a matter of some user space glue code, and just the tiniest bit of sorting out how hotplug events are sent. The only thing I can think we would need a device namespace for is for migration. For migration with direct access to real hardware devices we must treat it as hardware hotunplug. There is nothing else we can do. If there is any other case where we need to preserve device numbers etc we have the example of devpts. So at this point I really don't think we need a device namespace or a device control group. (Just emulate devtmpfs, sysfs and uevents). Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
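The devpts and bind-mount wiggle room Eric mentions looks roughly like the following. This is a hedged sketch: the rootfs path is illustrative, /dev/fuse is just an example of a vetted host node, and the commands require root.

```shell
# Illustrative container rootfs path (an assumption, not from the thread).
CROOT=/var/lib/containers/c1/rootfs

# A fresh devpts instance; "newinstance" means it cannot expose any
# pty that already exists elsewhere on the system.
mount -t devpts -o newinstance,ptmxmode=0666 devpts "$CROOT/dev/pts"
ln -sf pts/ptmx "$CROOT/dev/ptmx"

# Much of what mknod would otherwise do can instead be a bind mount
# of an existing, vetted device node from the host.
touch "$CROOT/dev/fuse"
mount --bind /dev/fuse "$CROOT/dev/fuse"
```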
* Re: Controlling devices and device namespaces [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> @ 2012-09-16 16:13 ` Alan Cox [not found] ` <20120916171316.517ad0fd-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> 2012-09-16 16:15 ` Serge Hallyn 1 sibling, 1 reply; 75+ messages in thread From: Alan Cox @ 2012-09-16 16:13 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar > At least with a recent modern distro I can't imagine this to be an > issue. I expect we could have a kernel build option that removed the > mknod system call and a modern distro wouldn't notice. A few things beyond named pipes will break. PCMCIA I believe still depends on ugly mknod hackery of its own. You also need it for some classes of non detectable device. Basically though you could. > For migration with direct access to real hardware devices we must treat > it as hardware hotunplug. There is nothing else we can do. That is demonstrably false for a shared bus or a network linked device. Consider a firewire camera wired to two systems at once. Consider SAN storage. Alan ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <20120916171316.517ad0fd-38n7/U1jhRXW96NNrWNlrekiAK3p4hvP@public.gmane.org> @ 2012-09-16 17:49 ` Eric W. Biederman 0 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 17:49 UTC (permalink / raw) To: Alan Cox Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >> At least with a recent modern distro I can't imagine this to be an >> issue. I expect we could have a kernel build option that removed the >> mknod system call and a modern distro wouldn't notice. > > A few things beyond named pipes will break. PCMCIA I believe still > depends on ugly mknod hackery of its own. You also need it for some > classes of non detectable device. > > Basically though you could. Ah yes fifos. I had forgotten mknod created them. I am half surprised there isn't a mkfifo system call. >> For migration with direct access to real hardware devices we must treat >> it as hardware hotunplug. There is nothing else we can do. > > That is demonstrably false for a shared bus or a network linked device. > Consider a firewire camera wired to two systems at once. Consider SAN > storage. Sort of. If you are talking to the device directly there is usually enough state with the path changing that modelling it as a hotunplug/hotplug is about all that is practical. There is all of that intermediate state for in progress DMAs in the end system controllers etc. Now if you have a logical abstraction like a block device in between the program and the SAN storage, then figuring out how to preserve device names and numbers becomes interesting. 
At least far enough to keep device and inode numbers for stat intact. A fully general solution for preserving device names, and numbers requires rewriting sysfs. I expect a lot of the infrastructure someone needs is there already from my network namespace work but after having done the network namespace I am sick and tired of manhandling that unreasonably conjoined glob of device stuff. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
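The named-pipe point Alan raises is easy to check: on Linux there is indeed no separate mkfifo system call — mkfifo(1) and the mkfifo(3) library function go through the mknod machinery with S_IFIFO, and need no privilege, which is why removing mknod outright would break named pipes:

```shell
# Creating a fifo is an unprivileged mknod under the hood.
tmp=$(mktemp -d)
mkfifo "$tmp/pipe"
stat -c '%F' "$tmp/pipe"    # prints: fifo
rm -r "$tmp"
```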
* Re: Controlling devices and device namespaces [not found] ` <87k3vuqc5l.fsf-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org> 2012-09-16 16:13 ` Alan Cox @ 2012-09-16 16:15 ` Serge Hallyn [not found] ` <5055FB2A.1020103-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> 1 sibling, 1 reply; 75+ messages in thread From: Serge Hallyn @ 2012-09-16 16:15 UTC (permalink / raw) To: Eric W. Biederman Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Turner, Ingo Molnar, Alan Cox On 09/16/2012 09:23 AM, Eric W. Biederman wrote: > Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: > >> On 09/16/2012 07:17 AM, Eric W. Biederman wrote: >>> ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org (Eric W. Biederman) writes: >>> >>>> Alan Cox <alan-qBU/x9rampVanCEyBjwyrvXRex20P6io@public.gmane.org> writes: >>>> >>>>>> One piece of the puzzle is that we should be able to allow unprivileged >>>>>> device node creation and access for any device on any filesystem >>>>>> for which it unprivileged access is safe. >>>>> >>>>> Which devices are "safe" is policy for all interesting and useful cases, >>>>> as are file permissions, security tags, chroot considerations and the >>>>> like. >>>>> >>>>> It's a complete non starter. >>> >>> Come to think of it mknod is completely unnecessary. >>> >>> Without mknod. Without being able to mount filesystems containing >>> device nodes. >> >> Hm? That sounds like it will really upset init/udev/upgrades in the >> container. > > udev does not create device nodes. For an older udev the worst > I can see it doing is having mknod failing with EEXIST because > the device node already exists. > > We should be able to make it look to init like a ramdisk mounted the > filesystems. > > Why should upgrades care? 
Package installation shouldn't be calling > mknod. > > At least with a recent modern distro I can't imagine this to be an > issue. I expect we could have a kernel build option that removed the > mknod system call and a modern distro wouldn't notice. > >> Are you saying all filesystems containing device nodes will need to be >> mounted in advance by the process setting up the container? > > As a general rule. > > I think in practice there is wiggle room for special cases > like mounting a fresh devpts. devpts at least in always create a new > instance on mount mode seems safe, as it can not give you access to > any existing devices. > > You can also do a lot of what would normally be done with mknod > with bind mounts to the original devices location. > >>> The mount namespace is sufficient to prevent all of the >>> cases that the device control group prevents (open and mknod on device >>> nodes). >>> >>> So I honestly think the device control group is superflous, and it is >>> probably wise to deprecate it and move to a model where it does not >>> exist. >>> >>> Eric >>> >> >> That's what I said a few emails ago :) The device cgroup was meant as >> a short-term workaround for lack of user (and device) namespaces. > > I am saying something stronger. The device cgroup doesn't seem to have > a practical function now. "Now" is wrong. The user namespace is not complete and not yet usable for a full system container. We still need the device control group. I'd like us to have a sprint (either a day at UDS in person, or a few days with a virtual sprint) with the focus of getting a full system container working the way you envision it, as cleanly as possible. I can take two or three consecutive days sometime in the next 2-3 weeks; we can sit on irc and share a few instances on which to experiment? > That for the general case we don't need any > kernel support. 
That all of this should be a matter of some user space > glue code, and just the tiniest bit of sorting out how hotplug events are > sent. > > The only thing I can think we would need a device namespace for is > for migration. > > For migration with direct access to real hardware devices we must treat > it as hardware hotunplug. There is nothing else we can do. > > If there is any other case where we need to preserve device numbers > etc we have the example of devpts. > > So at this point I really don't think we need a device namespace or a > device control group. (Just emulate devtmpfs, sysfs and uevents). > > Eric > ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: Controlling devices and device namespaces [not found] ` <5055FB2A.1020103-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> @ 2012-09-16 16:53 ` Eric W. Biederman 0 siblings, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 16:53 UTC (permalink / raw) To: Serge Hallyn Cc: Aristeu Rozanski, Neil Horman, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Tejun Heo, Ingo Molnar, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Alan Cox Serge Hallyn <serge-A9i7LUbDfNHQT0dZR+AlfA@public.gmane.org> writes: >>> That's what I said a few emails ago :) The device cgroup was meant as >>> a short-term workaround for lack of user (and device) namespaces. >> >> I am saying something stronger. The device cgroup doesn't seem to have >> a practical function now. > > "Now" is wrong. The user namespace is not complete and not yet usable for a > full system container. We still need the device control group. Dropping CAP_MKNOD, not being able to mount filesystems containing device nodes, plus mount namespace work to only allow you access to proper device nodes should work today. And I admit the user namespace as I have it coded in my tree does make this simpler. But I agree "Now" is too soon until we have actually demonstrated something else. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120914183641.GA2191-YqEmrenMroyQb786VAuzj9i2O/JbrIOy@public.gmane.org> 2012-09-14 18:54 ` Tejun Heo 2012-09-15 2:20 ` Serge E. Hallyn @ 2012-09-16 8:19 ` James Bottomley [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2 siblings, 1 reply; 75+ messages in thread From: James Bottomley @ 2012-09-16 8:19 UTC (permalink / raw) To: Aristeu Rozanski Cc: Tejun Heo, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: > also, heard about the desire of having a device namespace instead with > support for translation ("sda" -> "sdf"). If anyone see immediate use for > this please let me know. That sounds like a really bad idea to me. We've spent ages training users that the actual sd<x> name of their device doesn't matter and they should use UUIDs or WWNs instead ... why should they now care inside containers? James ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2012-09-16 14:41 ` Eric W. Biederman 2012-09-17 13:21 ` Aristeu Rozanski 1 sibling, 0 replies; 75+ messages in thread From: Eric W. Biederman @ 2012-09-16 14:41 UTC (permalink / raw) To: James Bottomley Cc: Aristeu Rozanski, Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> writes: > On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: >> also, heard about the desire of having a device namespace instead with >> support for translation ("sda" -> "sdf"). If anyone see immediate use for >> this please let me know. > > That sounds like a really bad idea to me. We've spent ages training > users that the actual sd<x> name of their device doesn't matter and they > should use UUIDs or WWNs instead ... why should they now care inside > containers? The goal is not to introduce new cases where people care but to handle cases where people do care. The biggest practical case of interest that I know of is:

  stat /home/myinterestingfile
    Device: 806h   Inode: 7460974

  <migration>

  stat /home/myinterestingfile
    Device: 732h   Inode: 7460974

And an unchanging file looks like it has just become a totally different file on a totally different filesystem. I think even things like git status will care. Although how much git cares about the device number I don't know. I do know rsyncing a git tree to another directory is enough to give git conniption fits. So this is really about device management and handling the horrible things that real user space does. 
There is also the case that there are some very strong ties between the names of device nodes and the names of sysfs files. Strong enough ties that I think you can strongly confuse userspace if you just happen to rename a device node. And ultimately this conversation is about the fact that none of this has been interesting enough in practice to figure out what really needs to be done to manage devices in containers. You can read the other thread if you want details. But right now it looks to me like the right answer is going to be building some userspace software and totally deprecating the device control group. Eric ^ permalink raw reply [flat|nested] 75+ messages in thread
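The Device:/Inode: pair in Eric's example is exactly what stat(2) reports; a quick way to see the numbers applications cache (the printed values will of course differ per system, so none are asserted here):

```shell
# st_dev (%D, in hex) and st_ino (%i) are the identity userspace
# relies on; if migration changes the backing device's number, the
# same unchanged inode suddenly looks like a different file.
f=$(mktemp)
stat -c 'Device: %Dh Inode: %i' "$f"
rm -f "$f"
```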
* Re: [RFC] cgroup TODOs [not found] ` <1347783557.2463.1.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2012-09-16 14:41 ` Eric W. Biederman @ 2012-09-17 13:21 ` Aristeu Rozanski 1 sibling, 0 replies; 75+ messages in thread From: Aristeu Rozanski @ 2012-09-17 13:21 UTC (permalink / raw) To: James Bottomley Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Thomas Graf, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Tejun Heo, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar On Sun, Sep 16, 2012 at 09:19:17AM +0100, James Bottomley wrote: > On Fri, 2012-09-14 at 14:36 -0400, Aristeu Rozanski wrote: > > also, heard about the desire of having a device namespace instead with > > support for translation ("sda" -> "sdf"). If anyone see immediate use for > > this please let me know. > > That sounds like a really bad idea to me. We've spent ages training > users that the actual sd<x> name of their device doesn't matter and they > should use UUIDs or WWNs instead ... why should they now care inside > containers? True, bad example on my part. The use case I had in mind when I wrote that can be solved by symbolic links. -- Aristeu ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (6 preceding siblings ...) 2012-09-14 18:36 ` Aristeu Rozanski @ 2012-09-14 22:03 ` Dhaval Giani [not found] ` <CAPhKKr8wDLrcWHLTRq1M7gU_6CGNxzzF83zJo2WZ5vrY7h8Qyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2012-09-20 1:33 ` Andy Lutomirski 2012-09-21 21:40 ` Tejun Heo 9 siblings, 1 reply; 75+ messages in thread From: Dhaval Giani @ 2012-09-14 22:03 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner, Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras, Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V > > * Sort & unique when listing tasks. Even the documentation says it > doesn't happen but we have a good hunk of code doing it in > cgroup.c. I'm gonna rip it out at some point. Again, if you > don't like it, scream. > I think some userspace tools do assume the uniq bit. So if we can preserve that, great! Thanks Dhaval ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <CAPhKKr8wDLrcWHLTRq1M7gU_6CGNxzzF83zJo2WZ5vrY7h8Qyw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2012-09-14 22:06 ` Tejun Heo 0 siblings, 0 replies; 75+ messages in thread From: Tejun Heo @ 2012-09-14 22:06 UTC (permalink / raw) To: Dhaval Giani Cc: Neil Horman, Serge E. Hallyn, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, cgroups-u79uwXL29TY76Z2rM5mHXA, Paul Turner, Ingo Molnar Hello, On Fri, Sep 14, 2012 at 06:03:16PM -0400, Dhaval Giani wrote: > > > > * Sort & unique when listing tasks. Even the documentation says it > > doesn't happen but we have a good hunk of code doing it in > > cgroup.c. I'm gonna rip it out at some point. Again, if you > > don't like it, scream. > > I think some userspace tools do assume the uniq bit. So if we can > preserve that, great! Can you point me to those? If there are users depending on it, I won't break it, at least for now, but I at least wanna know more about them. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> ` (7 preceding siblings ...) 2012-09-14 22:03 ` Dhaval Giani @ 2012-09-20 1:33 ` Andy Lutomirski [not found] ` <505A725B.2080901-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> 2012-09-21 21:40 ` Tejun Heo 9 siblings, 1 reply; 75+ messages in thread From: Andy Lutomirski @ 2012-09-20 1:33 UTC (permalink / raw) To: Tejun Heo Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List, Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner, Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw [grr. why does gmane scramble addresses?] On 09/13/2012 01:58 PM, Tejun Heo wrote: > > 6. Multiple hierarchies > > Apart from the apparent wheeeeeeeeness of it (I think I talked about > that enough the last time[1]), there's a basic problem when more > than one controllers interact - it's impossible to define a resource > group when more than two controllers are involved because the > intersection of different controllers is only defined in terms of > tasks. > > IOW, if an entity X is of interest to two controllers, there's no > way to map X to the cgroups of the two controllers. X may belong to > A and B when viewed by one task but A' and B when viewed by another. > This already is a head scratcher in writeback where blkcg and memcg > have to interact. > > While I am pushing for unified hierarchy, I think it's necessary to > have different levels of granularities depending on controllers > given that nesting involves significant overhead and noticeable > controller-dependent behavior changes. > > > ... > I think this level of flexibility should be enough for most use > cases. If someone disagrees, please voice your objections now. > OK, I'll bite. I have a server that has a whole bunch of cores. 
A small fraction of those cores are general purpose and run whatever they like. The rest are tightly controlled. For simplicity, we have two cpusets that we use. The root allows all cpus. The other one only allows the general purpose cpus. We shove everything into the general-purpose-only cpuset, and then we move special stuff back to root. (We also shove some kernel threads into a non-root cpuset using the 'cset' tool.) Enter systemd, which wants a hierarchy corresponding to services. If we were to use it, we might end up violating its hierarchy. Alternatively, if we started using memcg, then we might want some tasks to have more restrictive memory usage but less restrictive cpu usage. As long as we can still pull this off, I'm happy. --Andy P.S. I'm sure you can guess why based on my email address :) ^ permalink raw reply [flat|nested] 75+ messages in thread
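For concreteness, Andy's two-cpuset layout might be set up as follows. This is a sketch under assumptions: a v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset, an illustrative cpu range, and a hypothetical $SPECIAL_PID; it requires root.

```shell
cd /sys/fs/cgroup/cpuset

# The restricted set: only the general-purpose cores (illustrative range).
mkdir -p general
echo 0-3 > general/cpuset.cpus
echo 0   > general/cpuset.mems

# Shove everything into the general-purpose-only cpuset; some moves
# fail (e.g. bound kernel threads), hence the "|| true".
while read -r pid; do
    echo "$pid" > general/tasks 2>/dev/null || true
done < tasks

# ...then move the special, tightly controlled jobs back to the root
# cpuset, which allows all cpus. $SPECIAL_PID is hypothetical.
echo "$SPECIAL_PID" > tasks
```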
* Re: [RFC] cgroup TODOs [not found] ` <505A725B.2080901-kltTT9wpgjJwATOyAt5JVQ@public.gmane.org> @ 2012-09-20 18:26 ` Tejun Heo [not found] ` <20120920182651.GH28934-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 75+ messages in thread From: Tejun Heo @ 2012-09-20 18:26 UTC (permalink / raw) To: Andy Lutomirski Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List, Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V, Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner, Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw Hello, On Wed, Sep 19, 2012 at 06:33:15PM -0700, Andy Lutomirski wrote: > [grr. why does gmane scramble addresses?] You can append /raw to the message url and see the raw message. http://article.gmane.org/gmane.linux.kernel.containers/23802/raw > > I think this level of flexibility should be enough for most use > > cases. If someone disagrees, please voice your objections now. > > OK, I'll bite. > > I have a server that has a whole bunch of cores. A small fraction of > those cores are general purpose and run whatever they like. The rest > are tightly controlled. > > For simplicity, we have two cpusets that we use. The root allows all > cpus. The other one only allows the general purpose cpus. We shove > everything into the general-purpose-only cpuset, and then we move > special stuff back to root. (We also shove some kernel threads into a > non-root cpuset using the 'cset' tool.) Using root for special stuff probably isn't a good idea and moving bound kthreads into !root cgroups is already disallowed. > Enter systemd, which wants a hierarchy corresponding to services. If we > were to use it, we might end up violating its hierarchy. > > Alternatively, if we started using memcg, then we might have some tasks > to have more restrictive memory usage but less restrictive cpu usage. > > As long as we can still pull this off, I'm happy. 
IIUC, you basically want just two groups w/ cpuset and use it for loose cpu isolation for high priority jobs. Structure-wise, I don't think it's gonna be a problem although using root for special stuff would need to change. Thanks. -- tejun ^ permalink raw reply [flat|nested] 75+ messages in thread
* Re: [RFC] cgroup TODOs
  [not found] ` <20120920182651.GH28934-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
@ 2012-09-20 18:39 ` Andy Lutomirski
  0 siblings, 0 replies; 75+ messages in thread

From: Andy Lutomirski @ 2012-09-20 18:39 UTC (permalink / raw)
To: Tejun Heo
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, Linux Kernel Mailing List,
    Neil Horman, Michal Hocko, Paul Mackerras, Aneesh Kumar K.V,
    Arnaldo Carvalho de Melo, Johannes Weiner, Thomas Graf, Paul Turner,
    Ingo Molnar, serge.hallyn-Z7WLFzj8eWMS+FvcfC7Uqw

On Thu, Sep 20, 2012 at 11:26 AM, Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> Hello,
>
> On Wed, Sep 19, 2012 at 06:33:15PM -0700, Andy Lutomirski wrote:
>> [grr. why does gmane scramble addresses?]
>
> You can append /raw to the message url and see the raw message.
>
> http://article.gmane.org/gmane.linux.kernel.containers/23802/raw

Thanks!

>> > I think this level of flexibility should be enough for most use
>> > cases.  If someone disagrees, please voice your objections now.
>>
>> OK, I'll bite.
>>
>> I have a server that has a whole bunch of cores.  A small fraction of
>> those cores are general purpose and run whatever they like.  The rest
>> are tightly controlled.
>>
>> For simplicity, we have two cpusets that we use.  The root allows all
>> cpus.  The other one only allows the general purpose cpus.  We shove
>> everything into the general-purpose-only cpuset, and then we move
>> special stuff back to root.  (We also shove some kernel threads into
>> a non-root cpuset using the 'cset' tool.)
>
> Using root for special stuff probably isn't a good idea and moving
> bound kthreads into !root cgroups is already disallowed.

Agreed.  I do it this way because it's easy and it works.  I can change
it in the future if needed.

>> Enter systemd, which wants a hierarchy corresponding to services.  If
>> we were to use it, we might end up violating its hierarchy.
>>
>> Alternatively, if we started using memcg, then we might want some
>> tasks to have more restrictive memory usage but less restrictive cpu
>> usage.
>>
>> As long as we can still pull this off, I'm happy.
>
> IIUC, you basically want just two groups w/ cpuset and use it for
> loose cpu isolation for high priority jobs.  Structure-wise, I don't
> think it's gonna be a problem although using root for special stuff
> would need to change.

Right.  But what happens when multiple hierarchies go away and I lose
control of the structure?

If systemd or whatever sticks my whole session or my service (or
however I organize it) into cgroup /whatever, then either I can put my
use-all-cpus tasks into /whatever/everything or I can step outside the
hierarchy and put them into /everything.  The former doesn't work,
because:

<quote>
The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
</quote>

The latter might confuse systemd.

My real objection might be to the requirement that a cpuset can't be
less restrictive than its parent.  Currently I can arrange for a task
to simultaneously have a less restrictive cpuset and a more restrictive
memory limit (or to stick it into a container or whatever).  If the
hierarchies have to correspond, this stops working.

--Andy

^ permalink raw reply	[flat|nested] 75+ messages in thread
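The cpuset subset rule Andy quotes is the crux of his objection: in a single hierarchy, a child group can never allow more CPUs than its parent, so a "less restrictive" child is simply rejected. A minimal Python sketch of that invariant follows; it only models the validation logic, not the kernel's actual code, and the function names (`parse_cpulist`, `may_set_cpus`) are made up for illustration:

```python
def parse_cpulist(s):
    """Parse a cpuset-style list string like '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in s.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return cpus

def may_set_cpus(parent, child):
    """The subset rule: a child cpuset's CPUs must be a subset of its parent's."""
    return parse_cpulist(child) <= parse_cpulist(parent)

# Root allows all CPUs; a child restricted to a few CPUs is fine.
print(may_set_cpus("0-15", "0-3"))   # True

# But a child that tries to be *less* restrictive than its parent
# (Andy's /whatever/everything case) violates the rule.
print(may_set_cpus("0-3", "0-15"))   # False
```

This is why stepping outside the systemd-managed subtree (a sibling /everything) is the only way to regain the full CPU set once the task's group has been placed under a restricted parent.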
* Re: [RFC] cgroup TODOs
  [not found] ` <20120913205827.GO7677-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
  (8 preceding siblings ...)
  2012-09-20  1:33 ` Andy Lutomirski
@ 2012-09-21 21:40 ` Tejun Heo
  9 siblings, 0 replies; 75+ messages in thread

From: Tejun Heo @ 2012-09-21 21:40 UTC (permalink / raw)
To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    cgroups-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
Cc: Li Zefan, Michal Hocko, Glauber Costa, Peter Zijlstra, Paul Turner,
    Johannes Weiner, Thomas Graf, Serge E. Hallyn, Paul Mackerras,
    Ingo Molnar, Arnaldo Carvalho de Melo, Neil Horman, Aneesh Kumar K.V

On Thu, Sep 13, 2012 at 01:58:27PM -0700, Tejun Heo wrote:
> 7. Misc issues
>
> * Sort & unique when listing tasks.  Even the documentation says it
>   doesn't happen but we have a good hunk of code doing it in
>   cgroup.c.  I'm gonna rip it out at some point.  Again, if you
>   don't like it, scream.
>
> * At the LPC, pjt told me that assigning threads of a cgroup to
>   different cgroups is useful for some use cases but if we're to
>   have a unified hierarchy, I don't think we can continue to do
>   that.  Paul, can you please elaborate on the use case?
>
> * Vivek brought up the issue of distributing resources to tasks and
>   groups in the same cgroup.  I don't know.  Need to think more
>   about it.

* Update docs.

* Clean up cftype->read/write*() mess.

* Use sane fs event mechanism.

* Drop userland helper based empty notification.

Argh...

-- 
tejun