From: Andrew Morton <akpm@linux-foundation.org>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>,
    Vladimir Davydov <vdavydov@parallels.com>,
    Greg Thelen <gthelen@google.com>,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org
Subject: Re: [patch 2/2] mm: memcontrol: default hierarchy interface for memory
Date: Mon, 12 Jan 2015 15:37:16 -0800
Message-ID: <20150112153716.d54e90c634b70d49e8bb8688@linux-foundation.org>
In-Reply-To: <1420776904-8559-2-git-send-email-hannes@cmpxchg.org>

On Thu, 8 Jan 2015 23:15:04 -0500 Johannes Weiner <hannes@cmpxchg.org> wrote:

> Introduce the basic control files to account, partition, and limit
> memory using cgroups in default hierarchy mode.
>
> This interface versioning allows us to address fundamental design
> issues in the existing memory cgroup interface, further explained
> below. The old interface will be maintained indefinitely, but a
> clearer model and improved workload performance should encourage
> existing users to switch over to the new one eventually.
>
> The control files are thus:
>
> - memory.current shows the current consumption of the cgroup and its
>   descendants, in bytes.
>
> - memory.low configures the lower end of the cgroup's expected
>   memory consumption range. The kernel considers memory below that
>   boundary to be a reserve - the minimum that the workload needs in
>   order to make forward progress - and generally avoids reclaiming
>   it, unless there is an imminent risk of entering an OOM situation.

The code appears to be ascribing a special meaning to low==0: you can
write "none" to set this. But I'm not seeing any description of this?

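A minimal sketch of the behaviour in question, assuming the reading
above is right. This is a Python model, not the patch's C code: the
helper name parse_low and the suffix table are hypothetical, and the
"none" == 0 mapping is inferred from the code rather than documented
in the changelog.

```python
# Hypothetical model of a write to memory.low; not the kernel's parser.
_SUFFIXES = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}


def parse_low(buf):
    """Return the low boundary in bytes.

    Assumption: the string "none" selects the special value 0,
    i.e. no reserve at all.
    """
    text = buf.strip()
    if text == "none":
        return 0
    suffix = text[-1:].lower()
    if suffix in _SUFFIXES:
        # Size with a binary suffix, e.g. "2M".
        return int(text[:-1]) * _SUFFIXES[suffix]
    # Plain byte count.
    return int(text)
```

If that reading holds, "none" and an explicit 0 are indistinguishable
once written, which is the part that would want a sentence in the
description.
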
> - memory.high configures the upper end of the cgroup's expected
>   memory consumption range. A cgroup whose consumption grows beyond
>   this threshold is forced into direct reclaim, to work off the
>   excess and to throttle new allocations heavily, but is generally
>   allowed to continue and the OOM killer is not invoked.
>
> - memory.max configures the hard maximum amount of memory that the
>   cgroup is allowed to consume before the OOM killer is invoked.
>
> - memory.events shows event counters that indicate how often the
>   cgroup was reclaimed while below memory.low, how often it was
>   forced to reclaim excess beyond memory.high, how often it hit
>   memory.max, and how often it entered OOM due to memory.max. This
>   allows users to identify configuration problems when observing a
>   degradation in workload performance. An overcommitted system will
>   have an increased rate of low boundary breaches, whereas increased
>   rates of high limit breaches, maximum hits, or even OOM situations
>   will indicate internally overcommitted cgroups.
>
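
The diagnosis described for memory.events could be sketched roughly
as below. This is a minimal Python sketch under assumptions: the file
is taken to hold one "name count" pair per line with the keys low,
high, max and oom (the changelog does not spell out the format), and
parse_events and diagnose are hypothetical helper names.

```python
def parse_events(text):
    """Parse assumed "name count" lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def diagnose(before, after):
    """Compare two snapshots and return hints per the changelog's rule.

    Rising "low" events point at an overcommitted system; rising
    "high", "max" or "oom" events point at a cgroup that is
    overcommitted internally.
    """
    delta = {key: after[key] - before.get(key, 0) for key in after}
    hints = []
    if delta.get("low", 0) > 0:
        hints.append("system overcommitted")
    if any(delta.get(key, 0) > 0 for key in ("high", "max", "oom")):
        hints.append("cgroup internally overcommitted")
    return hints
```

Two snapshots taken some interval apart are compared, since the
changelog talks about rates rather than absolute counts.
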
> For existing users of memory cgroups, the following deviations from
> the current interface are worth pointing out and explaining:
>
> - The original lower boundary, the soft limit, is defined as a limit
>   that is per default unset. As a result, the set of cgroups that
>   global reclaim prefers is opt-in, rather than opt-out. The costs
>   for optimizing these mostly negative lookups are so high that the
>   implementation, despite its enormous size, does not even provide
>   the basic desirable behavior. First off, the soft limit has no
>   hierarchical meaning. All configured groups are organized in a
>   global rbtree and treated like equal peers, regardless where they
>   are located in the hierarchy. This makes subtree delegation
>   impossible. Second, the soft limit reclaim pass is so aggressive
>   that it not just introduces high allocation latencies into the
>   system, but also impacts system performance due to overreclaim, to
>   the point where the feature becomes self-defeating.
>
>   The memory.low boundary on the other hand is a top-down allocated
>   reserve. A cgroup enjoys reclaim protection when it and all its
>   ancestors are below their low boundaries, which makes delegation
>   of subtrees possible. Secondly, new cgroups have no reserve per
>   default and in the common case most cgroups are eligible for the
>   preferred reclaim pass. This allows the new low boundary to be
>   efficiently implemented with just a minor addition to the generic
>   reclaim code, without the need for out-of-band data structures and
>   reclaim passes. Because the generic reclaim code considers all
>   cgroups except for the ones running low in the preferred first
>   reclaim pass, overreclaim of individual groups is eliminated as
>   well, resulting in much better overall workload performance.
>
> - The original high boundary, the hard limit, is defined as a strict
>   limit that can not budge, even if the OOM killer has to be called.
>   But this generally goes against the goal of making the most out of
>   the available memory. The memory consumption of workloads varies
>   during runtime, and that requires users to overcommit. But doing
>   that with a strict upper limit requires either a fairly accurate
>   prediction of the working set size or adding slack to the limit.
>   Since working set size estimation is hard and error prone, and
>   getting it wrong results in OOM kills, most users tend to err on
>   the side of a looser limit and end up wasting precious resources.
>
>   The memory.high boundary on the other hand can be set much more
>   conservatively. When hit, it throttles allocations by forcing
>   them into direct reclaim to work off the excess, but it never
>   invokes the OOM killer. As a result, a high boundary that is
>   chosen too aggressively will not terminate the processes, but
>   instead it will lead to gradual performance degradation. The user
>   can monitor this and make corrections until the minimal memory
>   footprint that still gives acceptable performance is found.
>
>   In extreme cases, with many concurrent allocations and a complete
>   breakdown of reclaim progress within the group, the high boundary
>   can be exceeded. But even then it's mostly better to satisfy the
>   allocation from the slack available in other groups or the rest of
>   the system than killing the group. Otherwise, memory.max is there
>   to limit this type of spillover and ultimately contain buggy or
>   even malicious applications.
>
> - The existing control file names are unwieldy and inconsistent in
>   many different ways. For example, the upper boundary hit count is
>   exported in the memory.failcnt file, but an OOM event count has to
>   be manually counted by listening to memory.oom_control events, and
>   lower boundary / soft limit events have to be counted by first
>   setting a threshold for that value and then counting those events.
>   Also, usage and limit files encode their units in the filename.
>   That makes the filenames very long, even though this is not
>   information that a user needs to be reminded of every time they
>   type out those names.
>
>   To address these naming issues, as well as to signal clearly that
>   the new interface carries a new configuration model, the naming
>   conventions in it necessarily differ from the old interface.

This all sounds pretty major. How much trouble is this change likely to
cause existing memcg users?

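As a reference point for what existing soft limit users would be
moving to, the hierarchical memory.low rule quoted above ("it and all
its ancestors are below their low boundaries") might be modelled like
this. A Python sketch only: the Cgroup class and is_protected are
hypothetical names, the strict "below" comparison follows the
changelog wording rather than the patch, and any special handling of
the root group is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    usage: int                          # memory.current, bytes, incl. descendants
    low: int = 0                        # memory.low; no reserve by default
    parent: Optional["Cgroup"] = None   # None for the topmost modelled group


def is_protected(cgroup):
    """True if the group and every modelled ancestor are below their low."""
    node = cgroup
    while node is not None:
        if node.usage >= node.low:
            # This level is not below its boundary, so it offers no
            # protection to anything underneath it.
            return False
        node = node.parent
    return True
```

A group with the default low of 0 is never protected in this model,
which matches the statement that new cgroups have no reserve and are
eligible for the preferred reclaim pass.
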
>  include/linux/memcontrol.h |  32 ++++++
>  mm/memcontrol.c            | 247 +++++++++++++++++++++++++++++++++++++++++++--
>  mm/vmscan.c                |  22 +++-

No Documentation/cgroups/memory.txt?