Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
       [not found]               ` <20131128035218.GM3556@cmpxchg.org>
@ 2013-11-30  0:00                 ` David Rientjes
  2013-11-30  0:51                   ` Greg KH
  2013-11-30  3:35                   ` Johannes Weiner
  0 siblings, 2 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30  0:00 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Wed, 27 Nov 2013, Johannes Weiner wrote:

> > None that I am currently aware of, I'll continue to try them out.  I'd 
> > suggest just dropping the stable@kernel.org from the whole series though 
> > unless there is another report of such a problem that people are running 
> > into.
> 
> The series has long been merged, how do we drop stable@kernel.org from
> it?
> 

You said you have informed stable to not merge these patches until further 
notice, I'd suggest simply avoid ever merging the whole series into a 
stable kernel since the problem isn't serious enough.  Marking changes 
that do "goto nomem" seem fine to mark for stable, though.

> > We've had this patch internally since we started using memcg, it has 
> > avoided some unnecessary oom killing.
> 
> Do you have quantified data that OOM kills are reduced over a longer
> sampling period?  How many kills are skipped?  How many of them are
> deferred temporarily but the VM ended up having to kill something
> anyway?

On the scale that we run memcg, we would see it daily in automated testing 
primarily because we panic the machine for memcg oom conditions where 
there are no killable processes.  It would typically manifest by two 
processes that are allocating memory in a memcg; one is oom killed, is 
allowed to allocate, handles its SIGKILL, exits and frees its memory and 
the second process which is oom disabled races with the uncharge and is 
oom disabled so the machine panics.

The upstream kernel of course doesn't panic in such a condition but if the 
same scenario were to have happened, the second process would be 
unnecessarily oom killed because it raced with the uncharge of the first 
victim and it had exited before the scan of processes in the memcg oom 
killer could detect it and defer.  So this patch definitely does prevent 
unnecessary oom killing when run at such a large scale that we do.

I'll send a formal patch.

> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  	if (!chosen)
> >  		return;
> >  	points = chosen_points * 1000 / totalpages;
> > +
> > +	/* One last chance to see if we really need to kill something */
> > +	if (mem_cgroup_margin(memcg) >= (1 << order)) {
> > +		put_task_struct(chosen);
> > +		return;
> > +	}
> > +
> >  	oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
> >  			 NULL, "Memory cgroup out of memory");
> >  }
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:00                 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
@ 2013-11-30  0:51                   ` Greg KH
  2013-11-30 10:25                     ` David Rientjes
  2013-11-30  3:35                   ` Johannes Weiner
  1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2013-11-30  0:51 UTC (permalink / raw)
  To: David Rientjes
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
	stable, linux-kernel, linux-mm

On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > suggest just dropping the stable@kernel.org from the whole series though 
> > > unless there is another report of such a problem that people are running 
> > > into.
> > 
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> > 
> 
> You said you have informed stable to not merge these patches until further 
> notice, I'd suggest simply avoid ever merging the whole series into a 
> stable kernel since the problem isn't serious enough.  Marking changes 
> that do "goto nomem" seem fine to mark for stable, though.

I'm lost.  These patches are in 3.12, so how can they not be "in
stable"?

What exactly do you want me to do here?

totally confused,

greg k-h

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:00                 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
  2013-11-30  0:51                   ` Greg KH
@ 2013-11-30  3:35                   ` Johannes Weiner
  2013-11-30 10:32                     ` David Rientjes
  1 sibling, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-30  3:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
> 
> > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > suggest just dropping the stable@kernel.org from the whole series though 
> > > unless there is another report of such a problem that people are running 
> > > into.
> > 
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> > 
> 
> You said you have informed stable to not merge these patches until further 
> notice, I'd suggest simply avoid ever merging the whole series into a 
> stable kernel since the problem isn't serious enough.  Marking changes 
> that do "goto nomem" seem fine to mark for stable, though.

These are followup fixes for the series that is upstream but didn't go
to stable.  I truly have no idea what you are talking about.

> > > We've had this patch internally since we started using memcg, it has 
> > > avoided some unnecessary oom killing.
> > 
> > Do you have quantified data that OOM kills are reduced over a longer
> > sampling period?  How many kills are skipped?  How many of them are
> > deferred temporarily but the VM ended up having to kill something
> > anyway?
> 
> On the scale that we run memcg, we would see it daily in automated testing 
> primarily because we panic the machine for memcg oom conditions where 
> there are no killable processes.  It would typically manifest by two 
> processes that are allocating memory in a memcg; one is oom killed, is 
> allowed to allocate, handles its SIGKILL, exits and frees its memory and 
> the second process which is oom disabled races with the uncharge and is 
> oom disabled so the machine panics.

So why don't you implement proper synchronization instead of putting
these random checks all over the map to make the race window just
small enough to not matter most of the time?

> The upstream kernel of course doesn't panic in such a condition but if the 
> same scenario were to have happened, the second process would be 
> unnecessarily oom killed because it raced with the uncharge of the first 
> victim and it had exited before the scan of processes in the memcg oom 
> killer could detect it and defer.  So this patch definitely does prevent 
> unnecessary oom killing when run at such a large scale that we do.

If you are really bothered by this race, then please have OOM kill
invocations wait for any outstanding TIF_MEMDIE tasks in the same
context.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  0:51                   ` Greg KH
@ 2013-11-30 10:25                     ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30 10:25 UTC (permalink / raw)
  To: Greg KH
  Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
	stable, linux-kernel, linux-mm

On Fri, 29 Nov 2013, Greg KH wrote:

> > > > None that I am currently aware of, I'll continue to try them out.  I'd 
> > > > suggest just dropping the stable@kernel.org from the whole series though 
> > > > unless there is another report of such a problem that people are running 
> > > > into.
> > > 
> > > The series has long been merged, how do we drop stable@kernel.org from
> > > it?
> > > 
> > 
> > You said you have informed stable to not merge these patches until further 
> > notice, I'd suggest simply avoid ever merging the whole series into a 
> > stable kernel since the problem isn't serious enough.  Marking changes 
> > that do "goto nomem" seem fine to mark for stable, though.
> 
> I'm lost.  These patches are in 3.12, so how can they not be "in
> stable"?
> 
> What exactly do you want me to do here?
> 

Sorry, I sympathize with your confusion since the handling of this patch 
series has been strange and confusing from the beginning.

I'm referring to the comment in this thread from Johannes: "[t]his patch 
series was not supposed to go into the last merge window, I already told 
stable to hold off on these until further notice" from 
http://marc.info/?l=linux-mm&m=138559524422298 and his intention to send 
the entire series to stable in 
http://marc.info/?l=linux-kernel&m=138539243412073.

>From that, I had thought you were already aware of this series and were 
waiting to merge it into previous stable kernels; I'm suggesting that 
stable doesn't touch this series with a ten-foot pole and only backport 
the fixes that have gone into 3.12.

That series is:

759496ba640c ("arch: mm: pass userspace fault flag to generic fault handler")
3a13c4d761b4 ("x86: finish user fault error path with fatal signal")
519e52473ebe ("mm: memcg: enable memcg OOM killer only for user faults")
fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and wakeup")
3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")

And then there's the mystery of 4942642080ea ("mm: memcg: handle non-error 
OOM situations more gracefully").  This patch went into 3.12-rc6 and is 
marked for stable@vger.kernel.org.  Its changelog indicates it fixes the 
last patch in the above series, but that patch isn't marked for 
stable@vger.kernel.org.

And then you have 84235de394d9 ("fs: buffer: move allocation failure loop 
into the allocator") which is marked for stable@vger.kernel.org, but 
3168ecbe1c04 ("mm: memcg: use proper memcg in limit bypass") which fixes 
the memory corruption in that commit isn't marked for stable.

I disagreed with the entire series being marked for stable in 
http://marc.info/?l=linux-kernel&m=137107020216528 since it violates the 
rules in Documentation/stable_kernel_rules.txt when it was originally 
proposed for stable.  The memcg oom waitqueue that this series avoids has 
been in memcg for 3 1/2 years since 2.6.34 and to date one user, cc'd, has 
reported any issues with it.

And then Johannes commented that he is asking stable to hold off on the 
series until further notice.  I'm suggesting the series not be merged into 
stable at all, and that's what's in the email you're responding to.

Further, since this is confusing enough as it is, I suggest if you do take 
84235de394d9 ("fs: buffer: move allocation failure loop into the 
allocator") that you certainly must also consider 3168ecbe1c04 ("mm: 
memcg: use proper memcg in limit bypass").  The former was backported to 
3.5 and they required an emergency release of 3.5.7.25 to take it back 
out.

There's also another patch that Johannes will shortly be sending to fix 
the leakage to the root memcg in oom conditions that can trivially cause 
large amounts of memory to be charged to the root memcg when it should 
have been isolated to the oom memcg.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30  3:35                   ` Johannes Weiner
@ 2013-11-30 10:32                     ` David Rientjes
  2013-11-30 15:55                       ` Johannes Weiner
  0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2013-11-30 10:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Fri, 29 Nov 2013, Johannes Weiner wrote:

> > You said you have informed stable to not merge these patches until further 
> > notice, I'd suggest simply avoid ever merging the whole series into a 
> > stable kernel since the problem isn't serious enough.  Marking changes 
> > that do "goto nomem" seem fine to mark for stable, though.
> 
> These are followup fixes for the series that is upstream but didn't go
> to stable.  I truly have no idea what you are talking about.
> 

I'm responding to your comments[*] that indicate you were going to 
eventually be sending it to stable.

> > On the scale that we run memcg, we would see it daily in automated testing 
> > primarily because we panic the machine for memcg oom conditions where 
> > there are no killable processes.  It would typically manifest by two 
> > processes that are allocating memory in a memcg; one is oom killed, is 
> > allowed to allocate, handles its SIGKILL, exits and frees its memory and 
> > the second process which is oom disabled races with the uncharge and is 
> > oom disabled so the machine panics.
> 
> So why don't you implement proper synchronization instead of putting
> these random checks all over the map to make the race window just
> small enough to not matter most of the time?
> 

The oom killer can be quite expensive, so we have found that is 
advantageous after doing all that work that the memcg is still oom for 
the charge order before needlessly killing a process.  I am not suggesting 
that we add synchronization to the uncharge path for such a race, but 
merely a simple check as illustrated as due diligence.  I think a simple 
conditional in the oom killer to avoid needlessly killing a user job is 
beneficial and avoids questions from customers who have a kernel log 
showing an oom kill occurring in a memcg that is not oom.  We could even 
do the check in oom_kill_process() after dump_header() if you want to 
reduce any chance of that to avoid getting bug reports about such cases.

> If you are really bothered by this race, then please have OOM kill
> invocations wait for any outstanding TIF_MEMDIE tasks in the same
> context.
> 

The oom killer requires a tasklist scan, or an iteration over the set of 
processes attached to the memcg for the memcg case, to find a victim.  It 
already defers if it finds eligible threads with TIF_MEMDIE set.

Thanks.

[*] http://marc.info/?l=linux-mm&m=138559524422298
    http://marc.info/?l=linux-kernel&m=138539243412073

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30 10:32                     ` David Rientjes
@ 2013-11-30 15:55                       ` Johannes Weiner
  2013-11-30 22:12                         ` David Rientjes
  0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-30 15:55 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Sat, Nov 30, 2013 at 02:32:43AM -0800, David Rientjes wrote:
> On Fri, 29 Nov 2013, Johannes Weiner wrote:
> 
> > > You said you have informed stable to not merge these patches until further 
> > > notice, I'd suggest simply avoid ever merging the whole series into a 
> > > stable kernel since the problem isn't serious enough.  Marking changes 
> > > that do "goto nomem" seem fine to mark for stable, though.
> > 
> > These are followup fixes for the series that is upstream but didn't go
> > to stable.  I truly have no idea what you are talking about.
> > 
> 
> I'm responding to your comments[*] that indicate you were going to 
> eventually be sending it to stable.
> 
> > > On the scale that we run memcg, we would see it daily in automated testing 
> > > primarily because we panic the machine for memcg oom conditions where 
> > > there are no killable processes.  It would typically manifest by two 
> > > processes that are allocating memory in a memcg; one is oom killed, is 
> > > allowed to allocate, handles its SIGKILL, exits and frees its memory and 
> > > the second process which is oom disabled races with the uncharge and is 
> > > oom disabled so the machine panics.
> > 
> > So why don't you implement proper synchronization instead of putting
> > these random checks all over the map to make the race window just
> > small enough to not matter most of the time?
> > 
> 
> The oom killer can be quite expensive, so we have found that is 
> advantageous after doing all that work that the memcg is still oom for 
> the charge order before needlessly killing a process.  I am not suggesting 
> that we add synchronization to the uncharge path for such a race, but 
> merely a simple check as illustrated as due diligence.  I think a simple 
> conditional in the oom killer to avoid needlessly killing a user job is 
> beneficial and avoids questions from customers who have a kernel log 
> showing an oom kill occurring in a memcg that is not oom.  We could even 
> do the check in oom_kill_process() after dump_header() if you want to 
> reduce any chance of that to avoid getting bug reports about such cases.

I asked about quantified data of this last-minute check, you replied
with a race condition between an OOM kill victim and a subsequent OOM
kill invocation.

> > If you are really bothered by this race, then please have OOM kill
> > invocations wait for any outstanding TIF_MEMDIE tasks in the same
> > context.
> > 
> 
> The oom killer requires a tasklist scan, or an iteration over the set of 
> processes attached to the memcg for the memcg case, to find a victim.  It 
> already defers if it finds eligible threads with TIF_MEMDIE set.

And now you say that this race does not really exist and repeat the
same ramblings about last-minute checks to avoid unnecessary kills
again.  And again without any supporting data that I already asked
for.

The more I talk to you, the less sense this all makes.  Why do you
insist we merge this patch when you have apparently no idea why and
how it works, and can't demonstrate that it works in the first place?

I only followed you around in circles because I'm afraid that my
shutting up would be interpreted as agreement again and Andrew would
merge this anyway.  But this is unsustainable, the burden of proof
should be on you, not me.  I'm going to stop replying until you
provide the information I asked for.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
  2013-11-30 15:55                       ` Johannes Weiner
@ 2013-11-30 22:12                         ` David Rientjes
  0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30 22:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
	linux-kernel, linux-mm

On Sat, 30 Nov 2013, Johannes Weiner wrote:

> > The oom killer requires a tasklist scan, or an iteration over the set of 
> > processes attached to the memcg for the memcg case, to find a victim.  It 
> > already defers if it finds eligible threads with TIF_MEMDIE set.
> 
> And now you say that this race does not really exist and repeat the
> same ramblings about last-minute checks to avoid unnecessary kills
> again.  And again without any supporting data that I already asked
> for.
> 

The race does exist, perhaps you don't understand what the race is?  This 
race occurs when process (A) declares oom and enters the oom killer, 
meanwhile an already oom killed victim (B) frees its memory and exits, and 
the process (A) oom kills another process even though the memcg is below 
its limit because of process (B).

When doing something expensive in the kernel like oom killing, it usually 
doesn't cause so much hassle when the suggestion is:

	<declare an action is necessary>
	<do something expensive>
	<select an action>
	if (!<action is still necessary>)
		abort
	<perform the action>

That type of check is fairly straight forward and makes sense.  It 
prevents unnecessary oom killing (although it can't guarantee it in all 
conditions) and prevents customers from reporting oom kills when the log 
shows there is memory available for their memcg.

When using memcg on a large scale to enforce memory isolation for user 
jobs, these types of scenarios happen often and there is no downside to 
adding such a check.  The oom killer is not a hotpath, it's not 
performance sensitive to the degree that we cannot add a simple 
conditional that checks the current limit, it prevents unnecessary oom 
kills, and prevents user confusion.

Without more invasive synchronization that would touch hotpaths, this is 
the best we can do: check if the oom kill is really necessary just before 
issuing the kill.  Having the kernel actually kill a user process is a 
serious matter and we should strive to ensure it is prevented whenever 
possible.

> The more I talk to you, the less sense this all makes.  Why do you
> insist we merge this patch when you have apparently no idea why and
> how it works, and can't demonstrate that it works in the first place?
> 

I'm not insisting anything, I don't make demands of others or maintainers 
like you do to merge or not merge anything.  I also haven't even formally 
proposed the patch with a changelog that would explain the motivation.

> I only followed you around in circles because I'm afraid that my
> shutting up would be interpreted as agreement again and Andrew would
> merge this anyway.  But this is unsustainable, the burden of proof
> should be on you, not me.  I'm going to stop replying until you
> provide the information I asked for.
> 

Andrew can't merge a patch that hasn't been proposed for merge.

Have a nice weekend.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-11-30 22:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
     [not found] ` <alpine.DEB.2.02.1311271453270.13682@chino.kir.corp.google.com>
     [not found]   ` <20131127233353.GH3556@cmpxchg.org>
     [not found]     ` <alpine.DEB.2.02.1311271622330.10617@chino.kir.corp.google.com>
     [not found]       ` <20131128021809.GI3556@cmpxchg.org>
     [not found]         ` <alpine.DEB.2.02.1311271826001.5120@chino.kir.corp.google.com>
     [not found]           ` <20131128031313.GK3556@cmpxchg.org>
     [not found]             ` <alpine.DEB.2.02.1311271914460.5120@chino.kir.corp.google.com>
     [not found]               ` <20131128035218.GM3556@cmpxchg.org>
2013-11-30  0:00                 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
2013-11-30  0:51                   ` Greg KH
2013-11-30 10:25                     ` David Rientjes
2013-11-30  3:35                   ` Johannes Weiner
2013-11-30 10:32                     ` David Rientjes
2013-11-30 15:55                       ` Johannes Weiner
2013-11-30 22:12                         ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox