All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: lizefan@huawei.com, hannes@cmpxchg.org, mingo@redhat.com,
	pjt@google.com, luto@amacapital.net, efault@gmx.de,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, lvenanci@redhat.com,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode
Date: Tue, 14 Feb 2017 11:35:41 +0100	[thread overview]
Message-ID: <20170214103541.GS6515@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20170212050544.GJ29323@mtj.duckdns.org>

On Sun, Feb 12, 2017 at 02:05:44PM +0900, Tejun Heo wrote:
> Hello,
> 
> On Fri, Feb 10, 2017 at 06:51:45PM +0100, Peter Zijlstra wrote:
> > Sure, we're past that. This isn't about what memcg can or cannot do.
> > Previous discussions established that controllers come in two shapes:
> > 
> >  - task based controllers; these are build on per task properties and
> >    groups are aggregates over sets of tasks. Since per definition inter
> >    task competition is already defined on individual tasks, its fairly
> >    trivial to extend the same rules to sets of tasks etc..
> > 
> >    Examples: cpu, cpuset, cpuacct, perf, pid, (freezer)
> >
> >  - system controllers; instead of building from tasks upwards, they
> >    split what previously would be machine wide / global state. For these
> >    there is no natural competition rule vs tasks, and hence your
> >    no-internal-task rule.
> > 
> >    Examples: memcg, io, hugetlb
> 
> This is a bit of delta but as I wrote before, at least cpu (and
> accordingly cpuacct) won't stay purely task-based as we should account
> for resource consumptions which aren't tied to specific tasks to the
> matching domain (e.g. CPU consumption during writeback, disk
> encryption or CPU cycles spent to receive packets).

We should probably do that in another thread, but I'd probably insist on
separate controllers that co-operate to get that done.

> > > And here's another point, currently, all controllers are enabled
> > > consecutively from root.  If we have leaf thread subtrees, this still
> > > works fine.  Resource domain controllers won't be enabled into thread
> > > subtrees.  If we allow switching back and forth, what do we do in the
> > > middle while we're in the thread part?
> > 
> > From what I understand you cannot re-enable a controller once its been
> > disabled, right? If you disable it, its dead for the entire subtree.
> 
> cgroups on creation don't enable controllers by default and users can
> enable and disable controllers dynamically as long as the conditions
> are met.  So, they can be disable and re-enabled.

I was talking in a hierarchical sense, your section 2-4-2. Top-Down
constraint seems to state similar things to what I meant.

Once you disable a controller it cannot be re-enabled in a subtree.

> > > No matter what we do, it's
> > > gonna be more confusing and we lose basic invariants like "parent
> > > always has superset of control knobs that its child has".
> > 
> > No, exactly that. I don't think I ever proposed something different.
> >
> > The "resource domain" flag I proposed violates the no-internal-processes
> > thing, but it doesn't violate that rule afaict.
> 
> If we go to thread mode and back to domain mode, the control knobs for
> domain controllers don't make sense on the thread part of the tree and
> they won't have cgroup_subsys_state to correspond to either.  For
> example,
> 
>  A - T - B
> 
> B's memcg knobs would control memory distribution from A and cgroups
> in T can't have memcg knobs.  It'd be weird to indicate that memcg is
> enabled in those cgroups too.

But memcg _is_ enabled for T. All the tasks are mapped onto A for
purpose of the system controller (memcg) and are subject to its
constraints.

> We can make it work somehow.  It's just weird-ass interface.

You could make these control files (read-only?) symlinks back to A's
actual control files. To more explicitly show this.

> > > As for the runtime overhead, if you get affected by adding a top-level
> > > cgroup in any measureable way, we need to fix that.  That's not a
> > > valid argument for messing up the interface.
> > 
> > I think cgroup tree depth is a more significant issue; because of
> > hierarchy we often do tree walks (uo-to-root or down-to-task).
> > 
> > So creating elaborate trees is something I try not to do.
> 
> So, as long as the depth stays reasonable (single digit or lower),
> what we try to do is keeping tree traversal operations aggregated or
> located on slow paths. 

While at the same time you allowed that BPF cgroup thing to not be
hierarchical because iterating the tree was too expensive; or did I
misunderstand?

Also, I think Mike showed you the pain and hurt are quite visible for
even a few levels.

Batching is tricky, you need to somehow bound the error function in
order to not become too big a factor on behaviour. Esp. for cpu, cpuacct
obviously doesn't care much as it doesn't enforce anything.

> In general, I think it's important to ensure that this in general is
> the case so that users can use the logical layouts matching the actual
> resource hierarchy rather than having to twist the layout for
> optimization.

One does what one can.. But it is important to understand the
constraints, nothing comes for free.

> > > Even if we allow switching back and forth, we can't make the same
> > > cgroup both resource domain && thread root.  Not in a sane way at
> > > least.
> > 
> > The back and forth thing yes, but even with a single level, the one
> > resource domain you tag will be both resource domain and thread root.
> 
> Ah, you're right.

Also, there is the one giant wart in v2 wrt no-internal-processes;
namely the root group is allowed to have them.

Now I understand why this is so; so don't feel compelled to explain that
again, but it does make the model very ugly and has a real problem, see
below. OTOH, since it is there, I would very much like to make use of
this 'feature' and allow a thread-group on the root group.

And since you then _can_ have nested thread groups, it again becomes
very important to be able to find the resource domains, which brings me
back to my proposal; albeit with an addition constraint.

Now on to the problem of the no-internal-processes wart; how does
cgroup-v2 currently implement the whole container invariant? Because by
that invariant, a container's 'root' group must also allow
internal-processes.


  parent reply	other threads:[~2017-02-14 10:35 UTC|newest]

Thread overview: 53+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-02 20:06 [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Tejun Heo
2017-02-02 20:06 ` Tejun Heo
2017-02-02 20:06 ` [PATCH 1/5] cgroup: reorganize cgroup.procs / task write path Tejun Heo
2017-02-02 20:06 ` [PATCH 2/5] cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS Tejun Heo
2017-02-02 20:06 ` [PATCH 4/5] cgroup: implement CSS_TASK_ITER_THREADED Tejun Heo
     [not found] ` <20170202200632.13992-1-tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
2017-02-02 20:06   ` [PATCH 3/5] cgroup: introduce cgroup->proc_cgrp and threaded css_set handling Tejun Heo
2017-02-02 20:06     ` Tejun Heo
2017-02-02 20:06   ` [PATCH 5/5] cgroup: implement cgroup v2 thread support Tejun Heo
2017-02-02 20:06     ` Tejun Heo
2017-02-02 21:32   ` [PATCHSET for-4.11] cgroup: implement cgroup v2 thread mode Andy Lutomirski
2017-02-02 21:32     ` Andy Lutomirski
     [not found]     ` <CALCETrW6Mqj9VLogd0XaLgVAzEqsZ+VnZjN5NROCqr0ssdYaKg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-02 21:52       ` Tejun Heo
2017-02-02 21:52         ` Tejun Heo
     [not found]         ` <20170202215229.GA27231-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-02-03 21:10           ` Andy Lutomirski
2017-02-03 21:10             ` Andy Lutomirski
2017-02-03 21:56             ` Tejun Heo
2017-02-06  9:50           ` Peter Zijlstra
2017-02-06  9:50             ` Peter Zijlstra
2017-02-03 20:20   ` Peter Zijlstra
2017-02-03 20:20     ` Peter Zijlstra
     [not found]     ` <20170203202048.GD6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-02-03 20:59       ` Tejun Heo
2017-02-03 20:59         ` Tejun Heo
     [not found]         ` <20170203205955.GA9886-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2017-02-06 12:49           ` Peter Zijlstra
2017-02-06 12:49             ` Peter Zijlstra
     [not found]             ` <20170206124943.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-02-08 23:08               ` Tejun Heo
2017-02-08 23:08                 ` Tejun Heo
     [not found]                 ` <20170208230819.GD25826-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-02-09 10:29                   ` Peter Zijlstra
2017-02-09 10:29                     ` Peter Zijlstra
     [not found]                     ` <20170209102909.GC6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-02-10 15:45                       ` Tejun Heo
2017-02-10 15:45                         ` Tejun Heo
     [not found]                         ` <20170210154508.GA16097-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2017-02-10 17:51                           ` Peter Zijlstra
2017-02-10 17:51                             ` Peter Zijlstra
     [not found]                             ` <20170210175145.GJ6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-02-12  5:05                               ` Tejun Heo
2017-02-12  5:05                                 ` Tejun Heo
     [not found]                                 ` <20170212050544.GJ29323-qYNAdHglDFBN0TnZuCh8vA@public.gmane.org>
2017-02-12  6:59                                   ` Mike Galbraith
2017-02-12  6:59                                     ` Mike Galbraith
2017-02-13  5:45                                     ` Mike Galbraith
     [not found]                                       ` <1486964707.5912.93.camel-Mmb7MZpHnFY@public.gmane.org>
2017-03-13 19:26                                         ` Tejun Heo
2017-03-13 19:26                                           ` Tejun Heo
2017-03-14 14:45                                           ` Mike Galbraith
2017-02-14 10:35                                 ` Peter Zijlstra [this message]
     [not found]                                   ` <20170214103541.GS6515-ndre7Fmf5hadTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-03-13 20:05                                     ` Tejun Heo
2017-03-13 20:05                                       ` Tejun Heo
     [not found]                                       ` <20170313200544.GE15709-piEFEHQLUPpN0TnZuCh8vA@public.gmane.org>
2017-03-21 12:39                                         ` Peter Zijlstra
2017-03-21 12:39                                           ` Peter Zijlstra
     [not found]                                           ` <20170321123958.af7mcvcovexxzahu-Nxj+rRp3nVydTX5a5knrm8zTDFooKrT+cvkQGrU6aU0@public.gmane.org>
2017-03-22 14:52                                             ` Peter Zijlstra
2017-03-22 14:52                                               ` Peter Zijlstra
2017-02-09 13:07 ` Paul Turner
2017-02-09 14:47   ` Peter Zijlstra
2017-02-09 15:08     ` Mike Galbraith
     [not found]     ` <CAPM31RJaJjFwenC36Abij+EdzO3KBm-DEjQ_crSmzrtrrn2N2A@mail.gmail.com>
2017-02-13  5:28       ` Mike Galbraith
     [not found]   ` <CAPM31RKHsM1-iWb5B6jkOtLomLhiOARtgMOTWZ8p1yjEn-ZK0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-02-10 15:46     ` Tejun Heo
2017-02-10 15:46       ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170214103541.GS6515@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=efault@gmx.de \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=luto@amacapital.net \
    --cc=lvenanci@redhat.com \
    --cc=mingo@redhat.com \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.