* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
[not found] ` <20131128035218.GM3556@cmpxchg.org>
@ 2013-11-30 0:00 ` David Rientjes
2013-11-30 0:51 ` Greg KH
2013-11-30 3:35 ` Johannes Weiner
0 siblings, 2 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30 0:00 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
linux-kernel, linux-mm
On Wed, 27 Nov 2013, Johannes Weiner wrote:
> > None that I am currently aware of, I'll continue to try them out. I'd
> > suggest just dropping the stable@kernel.org from the whole series though
> > unless there is another report of such a problem that people are running
> > into.
>
> The series has long been merged, how do we drop stable@kernel.org from
> it?
>
You said you have informed stable to not merge these patches until further
notice, I'd suggest simply avoid ever merging the whole series into a
stable kernel since the problem isn't serious enough. Marking changes
that do "goto nomem" seem fine to mark for stable, though.
> > We've had this patch internally since we started using memcg, it has
> > avoided some unnecessary oom killing.
>
> Do you have quantified data that OOM kills are reduced over a longer
> sampling period? How many kills are skipped? How many of them are
> deferred temporarily but the VM ended up having to kill something
> anyway?
On the scale that we run memcg, we would see it daily in automated testing
primarily because we panic the machine for memcg oom conditions where
there are no killable processes. It would typically manifest by two
processes that are allocating memory in a memcg; one is oom killed, is
allowed to allocate, handles its SIGKILL, exits and frees its memory and
the second process which is oom disabled races with the uncharge and is
oom disabled so the machine panics.
The upstream kernel of course doesn't panic in such a condition but if the
same scenario were to have happened, the second process would be
unnecessarily oom killed because it raced with the uncharge of the first
victim and it had exited before the scan of processes in the memcg oom
killer could detect it and defer. So this patch definitely does prevent
unnecessary oom killing when run at such a large scale that we do.
I'll send a formal patch.
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > if (!chosen)
> > return;
> > points = chosen_points * 1000 / totalpages;
> > +
> > + /* One last chance to see if we really need to kill something */
> > + if (mem_cgroup_margin(memcg) >= (1 << order)) {
> > + put_task_struct(chosen);
> > + return;
> > + }
> > +
> > oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
> > NULL, "Memory cgroup out of memory");
> > }
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
@ 2013-11-30 0:51 ` Greg KH
2013-11-30 10:25 ` David Rientjes
2013-11-30 3:35 ` Johannes Weiner
1 sibling, 1 reply; 7+ messages in thread
From: Greg KH @ 2013-11-30 0:51 UTC (permalink / raw)
To: David Rientjes
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
stable, linux-kernel, linux-mm
On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
>
> > > None that I am currently aware of, I'll continue to try them out. I'd
> > > suggest just dropping the stable@kernel.org from the whole series though
> > > unless there is another report of such a problem that people are running
> > > into.
> >
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> >
>
> You said you have informed stable to not merge these patches until further
> notice, I'd suggest simply avoid ever merging the whole series into a
> stable kernel since the problem isn't serious enough. Marking changes
> that do "goto nomem" seem fine to mark for stable, though.
I'm lost. These patches are in 3.12, so how can they not be "in
stable"?
What exactly do you want me to do here?
totally confused,
greg k-h
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
2013-11-30 0:51 ` Greg KH
@ 2013-11-30 3:35 ` Johannes Weiner
2013-11-30 10:32 ` David Rientjes
1 sibling, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-30 3:35 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
linux-kernel, linux-mm
On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
>
> > > None that I am currently aware of, I'll continue to try them out. I'd
> > > suggest just dropping the stable@kernel.org from the whole series though
> > > unless there is another report of such a problem that people are running
> > > into.
> >
> > The series has long been merged, how do we drop stable@kernel.org from
> > it?
> >
>
> You said you have informed stable to not merge these patches until further
> notice, I'd suggest simply avoid ever merging the whole series into a
> stable kernel since the problem isn't serious enough. Marking changes
> that do "goto nomem" seem fine to mark for stable, though.
These are followup fixes for the series that is upstream but didn't go
to stable. I truly have no idea what you are talking about.
> > > We've had this patch internally since we started using memcg, it has
> > > avoided some unnecessary oom killing.
> >
> > Do you have quantified data that OOM kills are reduced over a longer
> > sampling period? How many kills are skipped? How many of them are
> > deferred temporarily but the VM ended up having to kill something
> > anyway?
>
> On the scale that we run memcg, we would see it daily in automated testing
> primarily because we panic the machine for memcg oom conditions where
> there are no killable processes. It would typically manifest by two
> processes that are allocating memory in a memcg; one is oom killed, is
> allowed to allocate, handles its SIGKILL, exits and frees its memory and
> the second process which is oom disabled races with the uncharge and is
> oom disabled so the machine panics.
So why don't you implement proper synchronization instead of putting
these random checks all over the map to make the race window just
small enough to not matter most of the time?
> The upstream kernel of course doesn't panic in such a condition but if the
> same scenario were to have happened, the second process would be
> unnecessarily oom killed because it raced with the uncharge of the first
> victim and it had exited before the scan of processes in the memcg oom
> killer could detect it and defer. So this patch definitely does prevent
> unnecessary oom killing when run at such a large scale that we do.
If you are really bothered by this race, then please have OOM kill
invocations wait for any outstanding TIF_MEMDIE tasks in the same
context.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 0:51 ` Greg KH
@ 2013-11-30 10:25 ` David Rientjes
0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30 10:25 UTC (permalink / raw)
To: Greg KH
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits,
stable, linux-kernel, linux-mm
On Fri, 29 Nov 2013, Greg KH wrote:
> > > > None that I am currently aware of, I'll continue to try them out. I'd
> > > > suggest just dropping the stable@kernel.org from the whole series though
> > > > unless there is another report of such a problem that people are running
> > > > into.
> > >
> > > The series has long been merged, how do we drop stable@kernel.org from
> > > it?
> > >
> >
> > You said you have informed stable to not merge these patches until further
> > notice, I'd suggest simply avoid ever merging the whole series into a
> > stable kernel since the problem isn't serious enough. Marking changes
> > that do "goto nomem" seem fine to mark for stable, though.
>
> I'm lost. These patches are in 3.12, so how can they not be "in
> stable"?
>
> What exactly do you want me to do here?
>
Sorry, I sympathize with your confusion since the handling of this patch
series has been strange and confusing from the beginning.
I'm referring to the comment in this thread from Johannes: "[t]his patch
series was not supposed to go into the last merge window, I already told
stable to hold off on these until further notice" from
http://marc.info/?l=linux-mm&m=138559524422298 and his intention to send
the entire series to stable in
http://marc.info/?l=linux-kernel&m=138539243412073.
>From that, I had thought you were already aware of this series and were
waiting to merge it into previous stable kernels; I'm suggesting that
stable doesn't touch this series with a ten-foot pole and only backport
the fixes that have gone into 3.12.
That series is:
759496ba640c ("arch: mm: pass userspace fault flag to generic fault handler")
3a13c4d761b4 ("x86: finish user fault error path with fatal signal")
519e52473ebe ("mm: memcg: enable memcg OOM killer only for user faults")
fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and wakeup")
3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
And then there's the mystery of 4942642080ea ("mm: memcg: handle non-error
OOM situations more gracefully"). This patch went into 3.12-rc6 and is
marked for stable@vger.kernel.org. Its changelog indicates it fixes the
last patch in the above series, but that patch isn't marked for
stable@vger.kernel.org.
And then you have 84235de394d9 ("fs: buffer: move allocation failure loop
into the allocator") which is marked for stable@vger.kernel.org, but
3168ecbe1c04 ("mm: memcg: use proper memcg in limit bypass") which fixes
the memory corruption in that commit isn't marked for stable.
I disagreed with the entire series being marked for stable in
http://marc.info/?l=linux-kernel&m=137107020216528 since it violates the
rules in Documentation/stable_kernel_rules.txt when it was originally
proposed for stable. The memcg oom waitqueue that this series avoids has
been in memcg for 3 1/2 years since 2.6.34 and to date one user, cc'd, has
reported any issues with it.
And then Johannes commented that he is asking stable to hold off on the
series until further notice. I'm suggesting the series not be merged into
stable at all, and that's what's in the email you're responding to.
Further, since this is confusing enough as it is, I suggest if you do take
84235de394d9 ("fs: buffer: move allocation failure loop into the
allocator") that you certainly must also consider 3168ecbe1c04 ("mm:
memcg: use proper memcg in limit bypass"). The former was backported to
3.5 and they required an emergency release of 3.5.7.25 to take it back
out.
There's also another patch that Johannes will shortly be sending to fix
the leakage to the root memcg in oom conditions that can trivially cause
large amounts of memory to be charged to the root memcg when it should
have been isolated to the oom memcg.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 3:35 ` Johannes Weiner
@ 2013-11-30 10:32 ` David Rientjes
2013-11-30 15:55 ` Johannes Weiner
0 siblings, 1 reply; 7+ messages in thread
From: David Rientjes @ 2013-11-30 10:32 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
linux-kernel, linux-mm
On Fri, 29 Nov 2013, Johannes Weiner wrote:
> > You said you have informed stable to not merge these patches until further
> > notice, I'd suggest simply avoid ever merging the whole series into a
> > stable kernel since the problem isn't serious enough. Marking changes
> > that do "goto nomem" seem fine to mark for stable, though.
>
> These are followup fixes for the series that is upstream but didn't go
> to stable. I truly have no idea what you are talking about.
>
I'm responding to your comments[*] that indicate you were going to
eventually be sending it to stable.
> > On the scale that we run memcg, we would see it daily in automated testing
> > primarily because we panic the machine for memcg oom conditions where
> > there are no killable processes. It would typically manifest by two
> > processes that are allocating memory in a memcg; one is oom killed, is
> > allowed to allocate, handles its SIGKILL, exits and frees its memory and
> > the second process which is oom disabled races with the uncharge and is
> > oom disabled so the machine panics.
>
> So why don't you implement proper synchronization instead of putting
> these random checks all over the map to make the race window just
> small enough to not matter most of the time?
>
The oom killer can be quite expensive, so we have found that is
advantageous after doing all that work that the memcg is still oom for
the charge order before needlessly killing a process. I am not suggesting
that we add synchronization to the uncharge path for such a race, but
merely a simple check as illustrated as due diligence. I think a simple
conditional in the oom killer to avoid needlessly killing a user job is
beneficial and avoids questions from customers who have a kernel log
showing an oom kill occurring in a memcg that is not oom. We could even
do the check in oom_kill_process() after dump_header() if you want to
reduce any chance of that to avoid getting bug reports about such cases.
> If you are really bothered by this race, then please have OOM kill
> invocations wait for any outstanding TIF_MEMDIE tasks in the same
> context.
>
The oom killer requires a tasklist scan, or an iteration over the set of
processes attached to the memcg for the memcg case, to find a victim. It
already defers if it finds eligible threads with TIF_MEMDIE set.
Thanks.
[*] http://marc.info/?l=linux-mm&m=138559524422298
http://marc.info/?l=linux-kernel&m=138539243412073
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 10:32 ` David Rientjes
@ 2013-11-30 15:55 ` Johannes Weiner
2013-11-30 22:12 ` David Rientjes
0 siblings, 1 reply; 7+ messages in thread
From: Johannes Weiner @ 2013-11-30 15:55 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
linux-kernel, linux-mm
On Sat, Nov 30, 2013 at 02:32:43AM -0800, David Rientjes wrote:
> On Fri, 29 Nov 2013, Johannes Weiner wrote:
>
> > > You said you have informed stable to not merge these patches until further
> > > notice, I'd suggest simply avoid ever merging the whole series into a
> > > stable kernel since the problem isn't serious enough. Marking changes
> > > that do "goto nomem" seem fine to mark for stable, though.
> >
> > These are followup fixes for the series that is upstream but didn't go
> > to stable. I truly have no idea what you are talking about.
> >
>
> I'm responding to your comments[*] that indicate you were going to
> eventually be sending it to stable.
>
> > > On the scale that we run memcg, we would see it daily in automated testing
> > > primarily because we panic the machine for memcg oom conditions where
> > > there are no killable processes. It would typically manifest by two
> > > processes that are allocating memory in a memcg; one is oom killed, is
> > > allowed to allocate, handles its SIGKILL, exits and frees its memory and
> > > the second process which is oom disabled races with the uncharge and is
> > > oom disabled so the machine panics.
> >
> > So why don't you implement proper synchronization instead of putting
> > these random checks all over the map to make the race window just
> > small enough to not matter most of the time?
> >
>
> The oom killer can be quite expensive, so we have found that is
> advantageous after doing all that work that the memcg is still oom for
> the charge order before needlessly killing a process. I am not suggesting
> that we add synchronization to the uncharge path for such a race, but
> merely a simple check as illustrated as due diligence. I think a simple
> conditional in the oom killer to avoid needlessly killing a user job is
> beneficial and avoids questions from customers who have a kernel log
> showing an oom kill occurring in a memcg that is not oom. We could even
> do the check in oom_kill_process() after dump_header() if you want to
> reduce any chance of that to avoid getting bug reports about such cases.
I asked about quantified data of this last-minute check, you replied
with a race condition between an OOM kill victim and a subsequent OOM
kill invocation.
> > If you are really bothered by this race, then please have OOM kill
> > invocations wait for any outstanding TIF_MEMDIE tasks in the same
> > context.
> >
>
> The oom killer requires a tasklist scan, or an iteration over the set of
> processes attached to the memcg for the memcg case, to find a victim. It
> already defers if it finds eligible threads with TIF_MEMDIE set.
And now you say that this race does not really exist and repeat the
same ramblings about last-minute checks to avoid unnecessary kills
again. And again without any supporting data that I already asked
for.
The more I talk to you, the less sense this all makes. Why do you
insist we merge this patch when you have apparently no idea why and
how it works, and can't demonstrate that it works in the first place?
I only followed you around in circles because I'm afraid that my
shutting up would be interpreted as agreement again and Andrew would
merge this anyway. But this is unsustainable, the burden of proof
should be on you, not me. I'm going to stop replying until you
provide the information I asked for.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree
2013-11-30 15:55 ` Johannes Weiner
@ 2013-11-30 22:12 ` David Rientjes
0 siblings, 0 replies; 7+ messages in thread
From: David Rientjes @ 2013-11-30 22:12 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable,
linux-kernel, linux-mm
On Sat, 30 Nov 2013, Johannes Weiner wrote:
> > The oom killer requires a tasklist scan, or an iteration over the set of
> > processes attached to the memcg for the memcg case, to find a victim. It
> > already defers if it finds eligible threads with TIF_MEMDIE set.
>
> And now you say that this race does not really exist and repeat the
> same ramblings about last-minute checks to avoid unnecessary kills
> again. And again without any supporting data that I already asked
> for.
>
The race does exist, perhaps you don't understand what the race is? This
race occurs when process (A) declares oom and enters the oom killer,
meanwhile an already oom killed victim (B) frees its memory and exits, and
the process (A) oom kills another process even though the memcg is below
its limit because of process (B).
When doing something expensive in the kernel like oom killing, it usually
doesn't cause so much hassle when the suggestion is:
<declare an action is necessary>
<do something expensive>
<select an action>
if (!<action is still necessary>)
abort
<perform the action>
That type of check is fairly straight forward and makes sense. It
prevents unnecessary oom killing (although it can't guarantee it in all
conditions) and prevents customers from reporting oom kills when the log
shows there is memory available for their memcg.
When using memcg on a large scale to enforce memory isolation for user
jobs, these types of scenarios happen often and there is no downside to
adding such a check. The oom killer is not a hotpath, it's not
performance sensitive to the degree that we cannot add a simple
conditional that checks the current limit, it prevents unnecessary oom
kills, and prevents user confusion.
Without more invasive synchronization that would touch hotpaths, this is
the best we can do: check if the oom kill is really necessary just before
issuing the kill. Having the kernel actually kill a user process is a
serious matter and we should strive to ensure it is prevented whenever
possible.
> The more I talk to you, the less sense this all makes. Why do you
> insist we merge this patch when you have apparently no idea why and
> how it works, and can't demonstrate that it works in the first place?
>
I'm not insisting anything, I don't make demands of others or maintainers
like you do to merge or not merge anything. I also haven't even formally
proposed the patch with a changelog that would explain the motivation.
> I only followed you around in circles because I'm afraid that my
> shutting up would be interpreted as agreement again and Andrew would
> merge this anyway. But this is unsustainable, the burden of proof
> should be on you, not me. I'm going to stop replying until you
> provide the information I asked for.
>
Andrew can't merge a patch that hasn't been proposed for merge.
Have a nice weekend.
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2013-11-30 22:12 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
[not found] ` <alpine.DEB.2.02.1311271453270.13682@chino.kir.corp.google.com>
[not found] ` <20131127233353.GH3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271622330.10617@chino.kir.corp.google.com>
[not found] ` <20131128021809.GI3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271826001.5120@chino.kir.corp.google.com>
[not found] ` <20131128031313.GK3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271914460.5120@chino.kir.corp.google.com>
[not found] ` <20131128035218.GM3556@cmpxchg.org>
2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
2013-11-30 0:51 ` Greg KH
2013-11-30 10:25 ` David Rientjes
2013-11-30 3:35 ` Johannes Weiner
2013-11-30 10:32 ` David Rientjes
2013-11-30 15:55 ` Johannes Weiner
2013-11-30 22:12 ` David Rientjes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox