Re: [Documentation] State of CPU controller in cgroup v2

From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Tejun Heo <tj@kernel.org>, Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>,
	Mike Galbraith <umgwanakikbuti@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	kernel-team@fb.com,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Paul Turner <pjt@google.com>, Li Zefan <lizefan@huawei.com>,
	Linux API <linux-api@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [Documentation] State of CPU controller in cgroup v2
Date: Mon, 12 Sep 2016 11:20:03 -0400
Message-ID: <ab6f3376-4c09-a339-f984-937f537ddc17@gmail.com>
In-Reply-To: <20160909225747.GA30105@mtj.duckdns.org>

On 2016-09-09 18:57, Tejun Heo wrote:
> Hello, again.
>
> On Mon, Sep 05, 2016 at 10:37:55AM -0700, Andy Lutomirski wrote:
>>> * It doesn't bring any practical benefits in terms of capability.
>>>   Userland can trivially handle the system-root and namespace-roots in
>>>   a symmetrical manner.
>>
>> Your idea of "trivially" doesn't match mine.  You gave a use case in
>
> I suppose I wasn't clear enough.  It is trivial in the sense that if
> the userland implements something which works for namespace-root, it
> would work the same in system-root without further modifications.
>
>> which userspace might take advantage of root being special.  If
>
> I was emphasizing the cases where userspace would have to deal with
> the inherent differences, and, when they don't, they can behave
> exactly the same way.
>
>> userspace does that, then that userspace cannot be run in a container.
>> This could be a problem for real users.  Sure, "don't do that" is a
>> *valid* answer, but it's not a very helpful answer.
>
> Great, now we agree that what's currently implemented is valid.  I
> think you're still failing to recognize the inherent specialness of
> the system-root and how much unnecessary pain the removal of the
> exemption would cause at virtually no practical gain.  I won't repeat
> the same backing points here.
>
>>> * It's an unnecessary inconvenience, especially for cases where the
>>>   cgroup agent isn't in control of boot, for partial usage cases, or
>>>   just for playing with it.
>>>
>>> You say that I'm ignoring the same use case for namespace-scope but
>>> namespace-roots don't have the same hybrid function for partial and
>>> uncontrolled systems, so it's not clear why there even NEEDS to be
>>> strict symmetry.
>>
>> I think their functions are much closer than you think they are.  I
>> want a whole Linux distro to be able to run in a container.  This
>> means that useful things people do in a distro or initramfs or
>> whatever should just work if containerized.
>
> There isn't much which is getting in the way of doing that.  Again,
> something which follows the no-internal-tasks rule would behave the same no
> matter where it is.  The system-root is different in that it is exempt
> from the rule and thus is more flexible but that difference is serving
> the purpose of handling the inherent specialness of the system-root.
> AFAICS, it is the solution which causes the least amount of contortion
> and unnecessary inconvenience to userland.
>
>>> It's easy and understandable to get hangups on asymmetries or
>>> exemptions like this, but they also often are acceptable trade-offs.
>>> It's really frustrating to see you first getting hung up on "this must
>>> be wrong" and even after explanations repeating the same thing just in
>>> different ways.
>>>
>>> If there is something fundamentally wrong with it, sure, let's fix it,
>>> but what's actually broken?
>>
>> I'm not saying it's fundamentally wrong.  I'm saying it's a design
>
> You were.
>
>> that has a big wart, and that wart is unfortunate, and after thinking
>> a bit, I'm starting to agree with PeterZ that this is problematic.  It
>> also seems fixable: the constraint could be relaxed.
>
> You've been pushing for enforcing the restriction on the system-root
> too and now are jumping to the opposite end.  It's really frustrating
> that this is such a whack-a-mole game where you throw ideas without
> really thinking through them and only concede the bare minimum when
> all other logical avenues are closed off.  Here, again, you seem to be
> stating a strong opinion when you haven't fully thought about it or
> tried to understand the reasons behind it.
>
> But, whatever, let's go there: Given the arguments that I laid out for
> the no-internal-tasks rule, how does the problem seem fixable through
> relaxing the constraint?
>
>>>>>> Also, here's an idea to maybe make PeterZ happier: relax the
>>>>>> restriction a bit per-controller.  Currently (except for /), if you
>>>>>> have subtree control enabled you can't have any processes in the
>>>>>> cgroup.  Could you change this so it only applies to certain
>>>>>> controllers?  If the cpu controller is entirely happy to have
>>>>>> processes and cgroups as siblings, then maybe a cgroup with only cpu
>>>>>> subtree control enabled could allow processes to exist.
>>>>>
>>>>> The document lists several reasons for not doing this and also that
>>>>> there is no known real world use case for such configuration.
>>>
>>> So, up until this point, we were talking about no-internal-tasks
>>> constraint.
>>
>> Isn't this the same thing?  IIUC the constraint in question is that,
>> if a non-root cgroup has subtree control on, then it can't have
>> processes in it.  This is the no-internal-tasks constraint, right?
>
> Yes, that is what the no-internal-tasks rule is but I don't understand how
> that is the same thing as process granularity.  Am I completely
> misunderstanding what you are trying to say here?
>
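For reference, the constraint as stated in the quoted text can be sketched
as a small check.  This is only an illustration under assumptions: it reads
the cgroup v2 interface files cgroup.subtree_control and cgroup.procs from a
directory, the helper names are hypothetical, and the demo runs against fake
directories rather than a real cgroupfs (where the kernel itself refuses to
create the violating state).

```python
import os
import tempfile


def violates_no_internal_tasks(cgroup_dir, is_system_root=False):
    """Return True if a cgroup both enables controllers for its children
    and holds processes itself.  The system-root is treated as exempt."""
    if is_system_root:
        return False
    with open(os.path.join(cgroup_dir, "cgroup.subtree_control")) as f:
        controllers = f.read().split()
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        pids = f.read().split()
    return bool(controllers) and bool(pids)


def make_fake_cgroup(root, name, subtree_control, procs):
    """Hypothetical test helper: build a directory that mimics the two
    interface files used above.  Not a real cgroup."""
    path = os.path.join(root, name)
    os.makedirs(path)
    with open(os.path.join(path, "cgroup.subtree_control"), "w") as f:
        f.write(subtree_control)
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(procs)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        mixed = make_fake_cgroup(root, "mixed", "cpu\n", "103\n")
        print(violates_no_internal_tasks(mixed))
        print(violates_no_internal_tasks(mixed, is_system_root=True))
```

The is_system_root flag is the asymmetry under discussion: the same layout
is accepted at the system-root and rejected everywhere else, including at
the root of a cgroup namespace.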
>> And I still think that, at least for cpu, nothing at all goes wrong if
>> you allow processes to exist in cgroups that have cpu set in
>> subtree-control.
>
> If you confine it to the cpu controller, ignore anonymous
> consumptions, the rather ugly mapping between nice and weight values
> and the fact that nobody could come up with a practical usefulness for
> such setup, yes.  My point was never that the cpu controller can't do
> it but that we should find a better way of coordinating it with other
> controllers and exposing it to individual applications.
So a container in which not everything is split further into subgroups
is not a practically useful situation?  That is exactly what systemd and
every other cgroup management tool currently expect to work.  The root
cgroup within a cgroup namespace has to behave exactly like the
system-root; otherwise nothing can rely on the system-root's special
cases, because the same code might be run inside a cgroup namespace,
where those assumptions no longer hold.  That in turn means no current
distro can run unmodified in a cgroup namespace under a v2 hierarchy,
which is a Very Bad Thing.
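
On the nice/weight mapping mentioned in the quoted text: one reading is
that tasks sitting next to child cgroups compete with them by weight, so a
task's nice value and a child's cpu.weight have to be put on one scale.  A
rough sketch, assuming the usual approximation that load weight is 1024 at
nice 0 and changes by about 1.25x per nice step, and assuming the default
cpu.weight of 100 corresponds to 1024.  Both the scaling and the function
names are illustrative assumptions, not the kernel's exact tables.

```python
def nice_to_weight(nice):
    """Approximate load weight for a nice value (assumption: 1024 at
    nice 0, roughly 1.25x per step; the kernel uses a fixed table)."""
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in [-20, 19]")
    return 1024 / (1.25 ** nice)


def cgroup_weight_to_load(cpu_weight, default_weight=100):
    """Assumed scaling: a cpu.weight equal to the default maps to 1024."""
    return cpu_weight * 1024 / default_weight


def cpu_fractions(task_nices, child_cpu_weights):
    """Fraction of the parent's CPU each competing entity would get if
    internal tasks and child cgroups were scheduled as siblings.
    Tasks come first in the result, then child cgroups."""
    loads = [nice_to_weight(n) for n in task_nices]
    loads += [cgroup_weight_to_load(w) for w in child_cpu_weights]
    total = sum(loads)
    return [load / total for load in loads]


if __name__ == "__main__":
    # One internal task at nice 0 next to one default-weight child cgroup.
    print(cpu_fractions([0], [100]))
    # Three internal tasks at nice 0 next to the same child cgroup.
    print(cpu_fractions([0, 0, 0], [100]))
```

Under these assumptions a single nice-0 task splits the CPU evenly with a
default-weight child, while three such tasks leave the child with a
quarter, so the child's share depends on how many sibling tasks exist.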
>
>> ----- begin talking about process granularity -----
> ...
>>> I do.  It's a horrible userland API to expose to individual
>>> applications if the organization that a given application expects can
>>> be disturbed by system operations.  Imagine how this would be
>>> documented - "if this operation races with system operation, it may
>>> return -ENOENT.  Repeating the path lookup might make the operation
>>> succeed again."
>>
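The retry behaviour described in that hypothetical documentation could look
roughly like the sketch below.  It assumes a caller-supplied function that
re-resolves the cgroup directory on every attempt; that function, the knob
name in the demo, and the retry count are all illustrative assumptions.

```python
import os
import tempfile


def write_knob_with_retry(resolve_cgroup_dir, name, value, attempts=3):
    """Write value to an interface file, repeating the path lookup when
    the path disappears between lookup and open (ENOENT)."""
    for _ in range(attempts):
        path = os.path.join(resolve_cgroup_dir(), name)
        try:
            # No O_CREAT: a missing file or directory raises ENOENT.
            fd = os.open(path, os.O_WRONLY)
        except FileNotFoundError:
            continue  # lost the race; look the path up again
        try:
            os.write(fd, value.encode())
        finally:
            os.close(fd)
        return path
    raise FileNotFoundError(f"{name}: still missing after {attempts} attempts")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        new_dir = os.path.join(root, "new")
        os.makedirs(new_dir)
        open(os.path.join(new_dir, "cpu.weight"), "w").close()
        # Simulate a stale first lookup followed by a fresh one.
        answers = iter([os.path.join(root, "old"), new_dir])
        print(write_knob_with_retry(lambda: next(answers), "cpu.weight", "200\n"))
```

Even with the retry, nothing makes the lookup and the write atomic, which
is the objection being raised.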
>> It could be made to work without races, though, with minimal (or even
>> no) ABI change.  The managed program could grab an fd pointing to its
>> cgroup.  Then it would use openat, etc for all operations.  As long as
>> 'mv /cgroup/a/b /cgroup/c/' didn't cause that fd to stop working,
>> we're fine.
>
> After a migration, the cgroup and its interface knobs are a different
> directory and files.  Semantically, during migration, we aren't moving
> the directory or files and it'd be bizarre to overlay the semantics
> you're describing on top of the existing cgroupfs.  We will have to
> break away from the very basic vfs rules such as a fd, once opened,
> always corresponding to the same file.  The only thing openat(2) does
> is abstracting away prefix handling and that is only a small part of
> the problem.
>
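The fd-plus-openat pattern proposed above can be sketched on an ordinary
filesystem, where a directory fd does keep working across a rename.  This
is only an illustration of the proposal: the paths and the file name are
made up, and, per the objection just quoted, a cgroup migration is not a
rename, so this says nothing about how cgroupfs would behave.

```python
import os
import tempfile


def open_dir(path):
    """Open a directory and return an fd usable as an openat() anchor."""
    return os.open(path, os.O_RDONLY | os.O_DIRECTORY)


def read_knob(dir_fd, name):
    """Read a file relative to dir_fd (openat semantics via dir_fd=)."""
    fd = os.open(name, os.O_RDONLY, dir_fd=dir_fd)
    try:
        return os.read(fd, 4096).decode()
    finally:
        os.close(fd)


def demo():
    """Read a file through a directory fd before and after the directory
    is renamed elsewhere, on a regular temporary filesystem."""
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "a", "b"))
        os.makedirs(os.path.join(root, "c"))
        with open(os.path.join(root, "a", "b", "cpu.weight"), "w") as f:
            f.write("100\n")
        dir_fd = open_dir(os.path.join(root, "a", "b"))
        try:
            before = read_knob(dir_fd, "cpu.weight")
            # Equivalent of: mv a/b c/
            os.rename(os.path.join(root, "a", "b"), os.path.join(root, "c", "b"))
            after = read_knob(dir_fd, "cpu.weight")
        finally:
            os.close(dir_fd)
        return before, after


if __name__ == "__main__":
    print(demo())
```

As the quoted reply notes, openat(2) only removes the dependence on the
path prefix; whether the fd still refers to the "same" cgroup after a
migration is a separate question.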
> A more acceptable way could be implementing, say, per-task filesystem
> which always appears at the fixed location and proxies the operations;
> however, even this wouldn't be able to handle issues stemming from
> lack of actual atomicity.  Think about two tasks accessing the same
> interface file.  If they race against outside agent migrating them
> one-by-one, they may or may not be accessing the same file.  If they
> perform operations with side effects such as config changes, creation
> of sub-cgroups and migrations, what would be the end result?
>
> In addition, a per-task filesystem is a much worse interface to
> program against than a system-call based API, especially when the same
> API which is used to do the exact same operations on threads can be
> reused for resource groups.
>
>> Note that this pretty much has to work if cgroup namespaces are to
>> allow rearrangement of the hierarchy -- '/cgroup/' from inside the
>> namespace has to remain valid at all times.
>
> If I'm not mistaken, namespaces don't allow this type of dynamic
> migration.
>
>> Obviously this only works if the cgroup in question doesn't itself get
>> destroyed, but having an internal hierarchy is a bit nonsensical if
>> the application shares a cgroup with another application, so that
>> shouldn't be a problem in practice.
>>
>> In fact, ISTM that allowing applications to manage cgroup
>> sub-hierarchies has almost exactly the same set of constraints as
>> allowing namespaced cgroup managers to work.  In a container, the
>> outer manager manages where the container lives and the container
>> manages its own hierarchy.  Why can't fancy cgroup-aware applications
>> work exactly the same way?
>
> System agents and individual applications are different.  This is the
> same argument that you brought up earlier in this thread where you
> said that userland can just set up namespaces for individual
> applications.  In purely mathematical terms, they can be mapped to
> each other but that grossly ignores practical differences between
> them.
>
> Most applications should and want to keep their assumptions
> conservative, robust and portable, and not dependent on some crazy
> fragile and custom-built namespace setup that nobody in the stack is
> really responsible for.  How many would ever program against something
> like that?
>
> A system agent has a large part of the system configuration under its
> control (it's the system agent after all) and thus is way more
> flexible in what assumptions it can dictate and depend on.
>
>>> Yeah, systemd has delegation feature for cases like that which we
>>> depend on too.
>>>
>>> As for your example, who performs the cgroup setup and configuration,
>>> the application itself or an external entity?  If an external entity,
>>> how does it know which thread is what?
>>
>> In my case, it would be a little script that reads a config file that
>> knows all kinds of internal information about the application and its
>> threads.
>
> I see.  One-of-a-kind custom setup.  This is a completely valid usage;
> however, please also recognize that it's an extremely specific one
> which is niche by definition.  If we're going to support
> in-application hierarchical resource control, I think it's very
> important to make sure that it's something which is easy to use and
> widely accessible so that any lay application can make use of it.
> I'll come back to this point later.
>
>>> And, as for rgroup not covering it, would extending rgroup to cover
>>> multi-process cases be enough or are there more fundamental issues?
>>
>> Maybe, as long as the configuration could actually be created -- IIUC
>> the current rgroup proposal requires that the hierarchy of groups
>> matches the hierarchy implied by clone(), which isn't going to happen
>> in my case.
>
> We can make that dynamic as long as the subtree is properly scoped;
> however, there is an important design decision to make here.  If we
> open up full-on dynamic migrations to individual applications, we
> commit ourselves to supporting arbitrarily high frequency migration
> operations, which we've never supported before and will restrict what
> we can do in terms of optimizing hot paths over migration.
>
> We haven't had to face this decision because cgroup has never properly
> supported delegating to applications and the in-use setups where this
> happens are custom configurations where there is no boundary between
> system and applications and ad hoc trial-and-error is good enough a way
> to find a working solution.  That wiggle room goes away once we
> officially open this up to individual applications.
>
> So, if we decide to open up dynamic assignment, we need to weigh what
> we gain in terms of capabilities against reduction of implementation
> maneuvering room.  I guess there can be a middle ground where, for
> example, only initial assignment is allowed.
>
> It is really difficult to understand your position without
> understanding where the requirements are coming from.  Can you please
> elaborate more on the workload?  Why is the specific configuration
> useful?  What is it trying to achieve?
>
>> But, given that this fancy-cgroup-aware-multiprocess-application case
>> looks so much like cgroup-using container, ISTM you could solve the
>> problem completely by just allowing tasks to be split out by users who
>> want to do it.  (Obviously those users will get funny results if they
>> try to do this to memcg.  "Don't do that" seems fine here.)  I don't
>> expect the race condition issues you're worried about to happen in
>> practice.  Certainly not in my case, since I control the entire
>> system.
>
> What people do now with cgroup inside an application is extremely
> limited.  Because there is no proper support for it, each use case has
> to craft up a dedicated custom setup which is all but guaranteed to be
> incompatible with what someone else would come up with for another
> application.  Everybody is in "this is mine, I control the entire
> system" mindset, which is fine for those specific setups but
> detrimental to making it widely available and useful.
>
> Accepting some measured restrictions and building a common ground for
> everyone can make in-application cgroup usages vastly more accessible
> and useful than now.  Certain things would need to be done differently
> and maybe some scenarios won't be supported as well but those are
> trade-offs that we'd need to weigh against what we gain.  Another
> point is that, for very specific use cases where none of these generic
> concerns matter, continuing to use cgroup v1 is fine.  The lack of common
> resource domains has never been an issue for those use cases anyway.
>
> Thanks.
>

