linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Andrew Morton <akpm@linux-foundation.org>
To: Roman Gushchin <guro@fb.com>
Cc: David Rientjes <rientjes@google.com>,
	linux-mm@vger.kernel.org, Michal Hocko <mhocko@suse.com>,
	Vladimir Davydov <vdavydov.dev@gmail.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>,
	Tejun Heo <tj@kernel.org>,
	kernel-team@fb.com, cgroups@vger.kernel.org,
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v13 0/7] cgroup-aware OOM killer
Date: Wed, 10 Jan 2018 11:33:45 -0800	[thread overview]
Message-ID: <20180110113345.54dd571967fd6e70bfba68c3@linux-foundation.org> (raw)
In-Reply-To: <20180110131143.GB26913@castle.DHCP.thefacebook.com>

On Wed, 10 Jan 2018 05:11:44 -0800 Roman Gushchin <guro@fb.com> wrote:

> Hello, David!
> 
> On Tue, Jan 09, 2018 at 04:57:53PM -0800, David Rientjes wrote:
> > On Thu, 30 Nov 2017, Andrew Morton wrote:
> > 
> > > > This patchset makes the OOM killer cgroup-aware.
> > > 
> > > Thanks, I'll grab these.
> > > 
> > > There has been controversy over this patchset, to say the least.  I
> > > can't say that I followed it closely!  Could those who still have
> > > reservations please summarise their concerns and hopefully suggest a
> > > way forward?
> > > 
> > 
> > Yes, I'll summarize what my concerns have been in the past and what they 
> > are wrt the patchset as it stands in -mm.  None of them originate from my 
> > current usecase or anticipated future usecase of the oom killer for 
> > system-wide or memcg-constrained oom conditions.  They are based purely on 
> > the patchset's use of an incomplete and unfair heuristic for deciding 
> > which cgroup to target.
> > 
> > I'll also suggest simple changes to the patchset, which I have in the 
> > past, that can be made to address all of these concerns.
> > 
> > 1. The unfair comparison of the root mem cgroup vs leaf mem cgroups
> > 
> > The patchset uses two different heuristics to compare root and leaf mem 
> > cgroups and scores them based on number of pages.  For the root mem 
> > cgroup, it totals the /proc/pid/oom_score of all processes attached: 
> > that's based on rss, swap, pgtables, and, most importantly, oom_score_adj.  
> > For leaf mem cgroups, it's based on that memcg's anonymous, unevictable, 
> > unreclaimable slab, kernel stack, and swap counters.  These can be wildly 
> > different independent of /proc/pid/oom_score_adj, but the most obvious 
> > unfairness comes from users who tune oom_score_adj.
> > 
> > An example: start a process that faults 1GB of anonymous memory and leave 
> > it attached to the root mem cgroup.  Start six more processes that each 
> > fault 1GB of anonymous memory and attached them to a leaf mem cgroup.  Set 
> > all processes to have /proc/pid/oom_score_adj of 1000.  System oom kill 
> > will always kill the 1GB process attached to the root mem cgroup.  It's 
> > because oom_badness() relies on /proc/pid/oom_score_adj, which is used to 
> > evaluate the root mem cgroup, and leaf mem cgroups completely disregard 
> > it.
> > 
> > In this example, the leaf mem cgroup's score is 1,573,044, the number of 
> > pages for the 6GB of faulted memory.  The root mem cgroup's score is 
> > 12,652,907, eight times larger even though its usage is six times smaller.
> > 
> > This is caused by the patchset disregarding oom_score_adj entirely for 
> > leaf mem cgroups and relying on it heavily for the root mem cgroup.  It's 
> > the complete opposite result of what the cgroup aware oom killer 
> > advertises.
> > 
> > It also works the other way, if a large memory hog is attached to the root 
> > mem cgroup but has a negative oom_score_adj it is never killed and random 
> > processes are nuked solely because they happened to be attached to a leaf 
> > mem cgroup.  This behavior wrt oom_score_adj is completely undocumented, 
> > so I can't presume that it is either known nor tested.
> > 
> > Solution: compare the root mem cgroup and leaf mem cgroups equally with 
> > the same criteria by doing hierarchical accounting of usage and 
> > subtracting from total system usage to find root usage.
> 
> I find this problem quite minor, because I haven't seen any practical problems
> caused by accounting of the root cgroup memory.
> If it's a serious problem for you, it can be solved without switching to the
> hierarchical accounting: it's possible to sum up all leaf cgroup stats and
> substract them from global values. So, it can be a relatively small enhancement
> on top of the current mm tree. This has nothing to do with global victim selection
> approach.

It sounds like a significant shortcoming to me - the oom-killing
decisions which David describes are clearly incorrect?

If this can be fixed against the -mm patchset with a "relatively small
enhancement" then please let's get that done so it can be reviewed and
tested.

> > 
> > 2. Evading the oom killer by attaching processes to child cgroups
> > 
> > Any cgroup on the system can attach all their processes to individual 
> > child cgroups.  This is functionally the same as doing
> > 
> > 	for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done
> > 
> > without the no internal process constraint introduced with cgroup v2.  All 
> > child cgroups are evaluated based on their own usage: all anon, 
> > unevictable, and unreclaimable slab as described previously.  It requires 
> > an individual cgroup to be the single largest consumer to be targeted by 
> > the oom killer.
> > 
> > An example: allow users to manage two different mem cgroup hierarchies 
> > limited to 100GB each.  User A uses 10GB of memory and user B uses 90GB of 
> > memory in their respective hierarchies.  On a system oom condition, we'd 
> > expect at least one process from user B's hierarchy would always be oom 
> > killed with the cgroup aware oom killer.  In fact, the changelog 
> > explicitly states it solves an issue where "1) There is no fairness 
> > between containers. A small container with few large processes will be 
> > chosen over a large one with huge number of small processes."
> > 
> > The opposite becomes true, however, if user B creates child cgroups and 
> > distributes its processes such that each child cgroup's usage never 
> > exceeds 10GB of memory.  This can either be done intentionally to 
> > purposefully have a low cgroup memory footprint to evade the oom killer or 
> > unintentionally with cgroup v2 to allow those individual processes to be 
> > constrained by other cgroups in a single hierarchy model.  User A, using 
> > 10% of his memory limit, is always oom killed instead of user B, using 90% 
> > of his memory limit.
> > 
> > Others have commented its still possible to do this with a per-process 
> > model if users split their processes into many subprocesses with small 
> > memory footprints.
> > 
> > Solution: comparing cgroups must be done hierarchically.  Neither user A 
> > nor user B can evade the oom killer because targeting is done based on the 
> > total hierarchical usage rather than individual cgroups in their 
> > hierarchies.
> 
> We've discussed this a lot.
> Hierarchical approach has their own issues, which we've discussed during
> previous iterations of the patchset. If you know how to address them
> (I've no idea), please, go on and suggest your version.

Well, if a hierarchical approach isn't a workable fix for the problem
which David has identified then what *is* the fix?

> > 
> > 3. Userspace has zero control over oom kill selection in leaf mem cgroups
> > 
> > Unlike using /proc/pid/oom_score_adj to bias or prefer certain processes 
> > from the oom killer, the cgroup aware oom killer does not provide any 
> > solution for the user to protect leaf mem cgroups.  This is a result of 
> > leaf mem cgroups being evaluated based on their anon, unevictable, and 
> > unreclaimable slab usage and disregarding any user tunable.
> > 
> > Absent the cgroup aware oom killer, users have the ability to strongly 
> > prefer a process is oom killed (/proc/pid/oom_score_adj = 1000) or 
> > strongly bias against a process (/proc/pid/oom_score_adj = -999).
> > 
> > An example: a process knows its going to use a lot of memory, so it sets 
> > /proc/self/oom_score_adj to 1000.  It wants to be killed first to avoid 
> > distrupting any other process.  If it's attached to the root mem cgroup, 
> > it will be oom killed.  If it's attached to a leaf mem cgroup by an admin 
> > outside its control, it will never be oom killed unless that cgroup's 
> > usage is the largest single cgroup usage on the system.  The reverse also 
> > is true for processes that the admin does not want to be oom killed: set 
> > /proc/pid/oom_score_adj to -999, but it will *always* be oom killed if its 
> > cgroup has the highest usage on the system.
> > 
> > The result is that both admins and users have lost all control over which 
> > processes are oom killed.  They are left with only one alternative: set 
> > /proc/pid/oom_score_adj to -1000 to completely disable a process from oom 
> > kill.  It doesn't address the issue at all for memcg-constrained oom 
> > conditions since no processes are killable anymore, and risks panicking 
> > the system if it is the only process left on the system.  A process 
> > preferring that it is first in line for oom kill simply cannot volunteer 
> > anymore.
> > 
> > Solution: allow users and admins to control oom kill selection by 
> > introducing a memory.oom_score_adj to affect the oom score of that mem 
> > cgroup, exactly the same as /proc/pid/oom_score_adj affects the oom score 
> > of a process.
> 
> The per-process oom_score_adj interface is not the nicest one, and I'm not
> sure we want to replicate it on cgroup level as is. If you have an idea of how
> it should look like, please, propose a patch; otherwise it's hard to discuss
> it without the code.

It does make sense to have some form of per-cgroup tunability.  Why is
the oom_score_adj approach inappropriate and what would be better?  How
hard is it to graft such a thing onto the -mm patchset?

> > 
> > 
> > I proposed a solution in 
> > https://marc.info/?l=linux-kernel&m=150956897302725, which was never 
> > responded to, for all of these issues.  The idea is to do hierarchical 
> > accounting of mem cgroup hierarchies so that the hierarchy is traversed 
> > comparing total usage at each level to select target cgroups.  Admins and 
> > users can use memory.oom_score_adj to influence that decisionmaking at 
> > each level.
> > 
> > This solves #1 because mem cgroups can be compared based on the same 
> > classes of memory and the root mem cgroup's usage can be fairly compared 
> > by subtracting top-level mem cgroup usage from system usage.  All of the 
> > criteria used to evaluate a leaf mem cgroup has a reasonable system-wide 
> > counterpart that can be used to do the simple subtraction.
> > 
> > This solves #2 because evaluation is done hierarchically so that 
> > distributing processes over a set of child cgroups either intentionally 
> > or unintentionally no longer evades the oom killer.  Total usage is always 
> > accounted to the parent and there is no escaping this criteria for users.
> > 
> > This solves #3 because it allows admins to protect important processes in 
> > cgroups that are supposed to use, for example, 75% of system memory 
> > without it unconditionally being selected for oom kill but still oom kill 
> > if it exceeds a certain threshold.  In this sense, the cgroup aware oom 
> > killer, as currently implemented, is selling mem cgroups short by 
> > requiring the user to accept that the important process will be oom killed 
> > iff it uses mem cgroups and isn't attached to root.  It also allows users 
> > to actually volunteer to be oom killed first without majority usage.
> > 
> > It has come up time and time again that this support can be introduced on 
> > top of the cgroup oom killer as implemented.  It simply cannot.  For 
> > admins and users to have control over decisionmaking, it needs a 
> > oom_score_adj type tunable that cannot change semantics from kernel 
> > version to kernel version and without polluting the mem cgroup filesystem.  
> > That, in my suggestion, is an adjustment on the amount of total 
> > hierarchical usage of each mem cgroup at each level of the hierarchy.  
> > That requires that the heuristic uses hierarchical usage rather than 
> > considering each cgroup as independent consumers as it does today.  We 
> > need to implement that heuristic and introduce userspace influence over 
> > oom kill selection now rather than later because its implementation 
> > changes how this patchset is implemented.
> > 
> > I can implement these changes, if preferred, on top of the current 
> > patchset, but I do not believe we want inconsistencies between kernel 
> > versions that introduce user visible changes for the sole reason that this 
> > current implementation is incomplete and unfair.  We can implement and 
> > introduce it once without behavior changing later because the core 
> > heuristic has necessarily changed.
> 
> David, I _had_ hierarchical accounting implemented in one of the previous
> versions of this patchset. And there were _reasons_, why we went away from it.

Can you please summarize those issues for my understanding?

> You can't just ignore them and say that "there is a simple solution, which
> Roman is not responding". If you know how to address these issues and
> convince everybody that hierarchical approach is a way to go, please,
> go on and send your version of the patchset.
> 
> Thanks!
> 
> Roman

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2018-01-10 19:33 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-30 15:28 [PATCH v13 0/7] cgroup-aware OOM killer Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 1/7] mm, oom: refactor the oom_kill_process() function Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 2/7] mm: implement mem_cgroup_scan_tasks() for the root memory cgroup Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 3/7] mm, oom: cgroup-aware OOM killer Roman Gushchin
2017-12-01  8:35   ` Michal Hocko
2017-12-07  1:24   ` Andrew Morton
2017-12-07 13:39     ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 4/7] mm, oom: introduce memory.oom_group Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 5/7] mm, oom: add cgroup v2 mount option for cgroup-aware OOM killer Roman Gushchin
2017-12-01  8:41   ` Michal Hocko
2017-12-01 13:15     ` Roman Gushchin
2017-12-01 13:31       ` Michal Hocko
2017-12-01 17:00         ` Roman Gushchin
2017-11-30 15:28 ` [PATCH v13 6/7] mm, oom, docs: describe the " Roman Gushchin
2017-12-01  8:41   ` Michal Hocko
2017-12-01 17:01     ` Roman Gushchin
2017-12-01 17:13       ` Michal Hocko
2017-11-30 15:28 ` [PATCH v13 7/7] cgroup: list groupoom in cgroup features Roman Gushchin
2017-11-30 20:39 ` [PATCH v13 0/7] cgroup-aware OOM killer Andrew Morton
2018-01-10  0:57   ` David Rientjes
2018-01-10 13:11     ` Roman Gushchin
2018-01-10 19:33       ` Andrew Morton [this message]
2018-01-11  9:08         ` Michal Hocko
2018-01-11 13:18           ` Roman Gushchin
2018-01-12 22:03             ` David Rientjes
2018-01-15 11:54               ` Michal Hocko
2018-01-16 21:36                 ` David Rientjes
2018-01-16 22:09                   ` Michal Hocko
2018-01-11 21:57           ` David Rientjes
2018-01-13 17:14         ` Johannes Weiner
2018-01-14 23:44           ` David Rientjes
2018-01-15 16:25             ` Johannes Weiner
2018-01-16 21:21               ` David Rientjes
2018-01-10 20:50       ` David Rientjes
2017-12-01  9:14 ` [PATCH] mm, oom: simplify alloc_pages_before_oomkill handling Michal Hocko
2017-12-01 13:26   ` Tetsuo Handa
2017-12-01 13:32   ` Roman Gushchin
2017-12-01 13:54     ` Michal Hocko
2018-06-05 11:47 ` [PATCH v13 0/7] cgroup-aware OOM killer Michal Hocko
2018-06-05 12:13   ` Michal Hocko
2018-07-13 21:59   ` David Rientjes
2018-07-14  1:55     ` Tetsuo Handa
2018-07-16 21:13       ` Tetsuo Handa
2018-07-16 22:09         ` Roman Gushchin
2018-07-17  0:55           ` Tetsuo Handa
2018-07-31 14:14             ` Tetsuo Handa
2018-08-01 16:37               ` Roman Gushchin
2018-08-01 22:01                 ` Tetsuo Handa
2018-08-01 22:55                   ` Roman Gushchin
2018-07-16  9:36     ` Michal Hocko
2018-07-17  3:59       ` David Rientjes

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180110113345.54dd571967fd6e70bfba68c3@linux-foundation.org \
    --to=akpm@linux-foundation.org \
    --cc=cgroups@vger.kernel.org \
    --cc=guro@fb.com \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-mm@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=penguin-kernel@i-love.sakura.ne.jp \
    --cc=rientjes@google.com \
    --cc=tj@kernel.org \
    --cc=vdavydov.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).