linux-mm.kvack.org archive mirror
From: Johannes Weiner <hannes@cmpxchg.org>
To: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org
Subject: Re: [patch 1/2] mm, memcg: avoid oom notification when current needs access to memory reserves
Date: Tue, 3 Dec 2013 15:17:13 -0500	[thread overview]
Message-ID: <20131203201713.GR3556@cmpxchg.org> (raw)
In-Reply-To: <20131203120454.GA12758@dhcp22.suse.cz>

On Tue, Dec 03, 2013 at 01:04:54PM +0100, Michal Hocko wrote:
> On Mon 02-12-13 16:25:00, Johannes Weiner wrote:
> > On Mon, Dec 02, 2013 at 09:02:21PM +0100, Michal Hocko wrote:
> [...]
> > > But we are not talking just about races here. What if the OOM is a
> > > result of an OOM action itself. E.g. a killed task faults memory in
> > > while exiting and it hasn't freed its memory yet. Should we notify in
> > > such a case? What would an userspace OOM handler do (the in-kernel
> > > implementation has an advantage because it can check the tasks flags)?
> > 
> > We don't notify in such a case.  Every charge from a TIF_MEMDIE or
> > exiting task is bypassing the limit immediately.  Not even reclaim.
> 
> Not really. Assume a memcg is under OOM. A task is killed by
> userspace so we get into signal delivery code which clears
> fatal_signal_pending and the code goes on to exit but then it faults in.
> __mem_cgroup_try_charge will not see signal pending and TIF_MEMDIE is
> not set yet. OOM is still not resolved so we are back to square one.

Ah, that's a completely separate problem, though.  One issue I have
with these checks is that I never know which ones are shortcuts or
micro-optimizations and which ones are functionally necessary.

> > > > So again, I don't see this patch is doing anything but blur the
> > > > current line and make notification less predictable. And, as someone
> > > else in this thread already said, it's a user-visible change in
> > > > behavior and would break known tuning usecases.
> > > 
> > > I would like to understand how such a tuning usecase would work and how
> > > it would break with this change.
> > 
> > I would do test runs and with every run increase the size of the
> > workload until I get OOM notifications to know when the kernel has
> > been pushed beyond its limits and available memory + reclaim
> > capability can't keep up with the workload anymore.
> > 
> > Not informing me just because, due to timing variance, a random process
> > exits at the last moment would be flat out lying.  The machine is OOM.
> > Many reclaim cycles failing is a good predictor.  Last minute exit of
> > random task is not, it's happenstance and I don't want to rely on a
> > fluke like this to size my workload.
> 
> Such a metric would be inherently racy for the same reason. You simply
> cannot rely on not seeing OOMs because an exiting task managed to leave
> in time (after MEM_CGROUP_RECLAIM_RETRIES direct reclaim loops and
> before mem_cgroup_oom). The difference between in time and a little
> bit too late is just too fragile to be useful IMO.

Are we saying the same thing?  Or did I misunderstand you?

> > > Consider the above example. You would get 2 notification for the very
> > > same OOM condition.
> > > On the other hand if the encountered exiting task was just a race then
> > > we have two options basically. Either there are more tasks racing (and
> > > not all of them are exiting) or there is only one (all are exiting).
> > > We will not lose any notification in the first case because the flags
> > > are checked before mem_cgroup_oom_trylock and so one of tasks would lock
> > > and notify.
> > > The second case is more interesting. Userspace won't get notification
> > > but we also know that no action is required as the OOM will be resolved
> > > by itself. And now we should consider whether notification would do more
> > > good than harm. The tuning usecase would lose one event. Would such a
> > > rare situation skew the statistics so much? On the other hand a real OOM
> > > killer would do something which means something will be killed. I find
> > > the latter much worse.
> > 
> > We already check in various places (sigh) for whether reclaim and
> > killing is still necessary.  What is the end game here?  An endless
> > loop right before the kill where we check if the kill is still
> > necessary?
> 
> The patch as is doesn't cover all the cases and ideally we should check
> that for OOM_SCAN_ABORT and later in oom_kill_process because they can
> back out as well if we want to have only-on-action notification. Such a
> solution would be too messy though.

There is never an only-on-action notification and there is no check to
cover all cases.  This is the fallacy I'm trying to point out.  All we
can do is pick a fairly predictable line and stick to it.

> But as I've said. The primary reason I liked this change is because it
> solves the above mentioned OOM during exit issue and it also prevents
> from a pointless notification. I am perfectly fine with moving the
> check+set TIF_MEMDIE down to solve only issue #1 and not mess
> with notifications.

The notification is not pointless, we are OOM at the time we make the
decision.

> > You're not fixing this problem, so why make the notifications less
> > reliable?
> 
> I am still not seeing why it is less reliable. The notification is
> inherently racy so you cannot rely on any simple metrics based on their
> count (at least not in general).

It should be based on reclaim failing, which is more predictable and
reproducible across multiple runs.  Anything else is just random
coincidence, I can't fathom why you think that it's at all meaningful.

You could do one more reclaim cycle and a charge attempt, or even just
add an msleep() between the last reclaim cycle and the last charge
attempt for exactly the same outcome.

> > > So all in all. I do agree with you that this path will never be race
> > > free and without pointless OOM actions. I also agree that drawing the
> > > line is hard. But I am more inclined to prevent from notification when
> > > we already know that _no action_ is required because IMHO the vast
> > > majority of oom listeners are there to _do_ an action which is mostly
> > > deadly.
> > 
> > If you want to push the machine so hard that active measures like
> > reclaim can't keep up and you rely on stupid timing like this to save
> > your sorry butt, then you'll just have to live with the
> > unpredictability of it.  You're going to eat kills that might have
> > been avoided last minute either way.  It's no excuse to plaster the MM
> > with TIF_MEMDIE checks and last-minute cgroup margin checks in the
> > weirdest locations.
> 
> Yes I do not agree with putting TIF_MEMDIE checks all over the place and
> we should reduce their number to minimum. It is fair to say that the
> patch didn't add a new check. It just has moved it to cover both
> in-kernel and user space oom paths. That was a bonus I liked. To be
> honest I do not see the notification side effect as a big deal as those
> are racy anyway and I would rather see fewer of them than more
> (especially when it is clear that nothing is to be done).

I don't know why we have this check there in the first place, it
should be part of the OOM victim selection process and not a hacky
shortcut in memcg.  It's a blatant layering violation for no obvious
reason.

> > Again, how likely is it anyway that the kill was truly skipped and not
> > just deferred?  Reclaim failing is a good indicator that you're in
> > trouble, a random task exiting in an ongoing workload does not say
> > much.  The machine could still be in trouble, so you just deferred the
> > inevitable, you didn't really avoid a kill.
> > 
> > At this point we are talking about OOM kill frequency and statistical
> > probability during apparently normal operations.  The OOM killer was
> > never written for that, it was supposed to be a last minute resort
> > that should not occur during normal operations and only if all SANE
> > measures to avoid it have failed.  99% of all users have no interest
> > in these micro-optimizations and we shouldn't clutter the code and
> > have unpredictable behavior without even a trace of data to show that
> > this is anything more than a placebo measure for one use case.
> 
> OK, as it seems that the notification part is too controversial, how
> would you like the following? It reverts the notification part and still
> solves the fault on exit path. I will prepare the full patch with the
> changelog if this looks reasonable:
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 28c9221b74ea..f44fe7e65a98 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1783,6 +1783,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	unsigned int points = 0;
>  	struct task_struct *chosen = NULL;
>  
> +	/*
> +	 * If current has a pending SIGKILL or is exiting, then automatically
> +	 * select it.  The goal is to allow it to allocate so that it may
> +	 * quickly exit and free its memory.
> +	 */
> +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> +		set_thread_flag(TIF_MEMDIE);
> +		goto cleanup;

		return;

Anyway, I wish this were not here at all and were instead part of the
thread scanner.  As David said, this is a very cold path, why the shortcut?

> @@ -2233,16 +2243,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  	if (!handle)
>  		goto cleanup;
>  
> -	/*
> -	 * If current has a pending SIGKILL or is exiting, then automatically
> -	 * select it.  The goal is to allow it to allocate so that it may
> -	 * quickly exit and free its memory.
> -	 */
> -	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> -		set_thread_flag(TIF_MEMDIE);
> -		goto cleanup;
> -	}
> -
>  	owait.memcg = memcg;
>  	owait.wait.flags = 0;
>  	owait.wait.func = memcg_oom_wake_function;
> @@ -2266,6 +2266,13 @@ bool mem_cgroup_oom_synchronize(bool handle)
>  		schedule();
>  		mem_cgroup_unmark_under_oom(memcg);
>  		finish_wait(&memcg_oom_waitq, &owait.wait);
> +
> +		/* Userspace OOM handler cannot set TIF_MEMDIE to a target */
> +		if (memcg->oom_kill_disable) {
> +			if ((fatal_signal_pending(current) ||
> +						current->flags & PF_EXITING))
> +				set_thread_flag(TIF_MEMDIE);
> +		}

This is an entirely different change that I think makes sense.

