* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree [not found] ` <20131128035218.GM3556@cmpxchg.org> @ 2013-11-30 0:00 ` David Rientjes 2013-11-30 0:51 ` Greg KH 2013-11-30 3:35 ` Johannes Weiner 0 siblings, 2 replies; 7+ messages in thread From: David Rientjes @ 2013-11-30 0:00 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Wed, 27 Nov 2013, Johannes Weiner wrote: > > None that I am currently aware of, I'll continue to try them out. I'd > > suggest just dropping the stable@kernel.org from the whole series though > > unless there is another report of such a problem that people are running > > into. > > The series has long been merged, how do we drop stable@kernel.org from > it? > You said you have informed stable to not merge these patches until further notice, I'd suggest simply avoid ever merging the whole series into a stable kernel since the problem isn't serious enough. Marking changes that do "goto nomem" seem fine to mark for stable, though. > > We've had this patch internally since we started using memcg, it has > > avoided some unnecessary oom killing. > > Do you have quantified data that OOM kills are reduced over a longer > sampling period? How many kills are skipped? How many of them are > deferred temporarily but the VM ended up having to kill something > anyway? On the scale that we run memcg, we would see it daily in automated testing primarily because we panic the machine for memcg oom conditions where there are no killable processes. It would typically manifest by two processes that are allocating memory in a memcg; one is oom killed, is allowed to allocate, handles its SIGKILL, exits and frees its memory and the second process which is oom disabled races with the uncharge and is oom disabled so the machine panics. The upstream kernel of course doesn't panic in such a condition but if the same scenario were to have happened, the second process would be unnecessarily oom killed because it raced with the uncharge of the first victim and it had exited before the scan of processes in the memcg oom killer could detect it and defer. So this patch definitely does prevent unnecessary oom killing when run at such a large scale that we do. I'll send a formal patch. > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1836,6 +1836,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, > > if (!chosen) > > return; > > points = chosen_points * 1000 / totalpages; > > + > > + /* One last chance to see if we really need to kill something */ > > + if (mem_cgroup_margin(memcg) >= (1 << order)) { > > + put_task_struct(chosen); > > + return; > > + } > > + > > oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg, > > NULL, "Memory cgroup out of memory"); > > } > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes @ 2013-11-30 0:51 ` Greg KH 2013-11-30 10:25 ` David Rientjes 2013-11-30 3:35 ` Johannes Weiner 1 sibling, 1 reply; 7+ messages in thread From: Greg KH @ 2013-11-30 0:51 UTC (permalink / raw) To: David Rientjes Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote: > On Wed, 27 Nov 2013, Johannes Weiner wrote: > > > > None that I am currently aware of, I'll continue to try them out. I'd > > > suggest just dropping the stable@kernel.org from the whole series though > > > unless there is another report of such a problem that people are running > > > into. > > > > The series has long been merged, how do we drop stable@kernel.org from > > it? > > > > You said you have informed stable to not merge these patches until further > notice, I'd suggest simply avoid ever merging the whole series into a > stable kernel since the problem isn't serious enough. Marking changes > that do "goto nomem" seem fine to mark for stable, though. I'm lost. These patches are in 3.12, so how can they not be "in stable"? What exactly do you want me to do here? totally confused, greg k-h ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 0:51 ` Greg KH @ 2013-11-30 10:25 ` David Rientjes 0 siblings, 0 replies; 7+ messages in thread From: David Rientjes @ 2013-11-30 10:25 UTC (permalink / raw) To: Greg KH Cc: Johannes Weiner, Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Fri, 29 Nov 2013, Greg KH wrote: > > > > None that I am currently aware of, I'll continue to try them out. I'd > > > > suggest just dropping the stable@kernel.org from the whole series though > > > > unless there is another report of such a problem that people are running > > > > into. > > > > > > The series has long been merged, how do we drop stable@kernel.org from > > > it? > > > > > > > You said you have informed stable to not merge these patches until further > > notice, I'd suggest simply avoid ever merging the whole series into a > > stable kernel since the problem isn't serious enough. Marking changes > > that do "goto nomem" seem fine to mark for stable, though. > > I'm lost. These patches are in 3.12, so how can they not be "in > stable"? > > What exactly do you want me to do here? > Sorry, I sympathize with your confusion since the handling of this patch series has been strange and confusing from the beginning. I'm referring to the comment in this thread from Johannes: "[t]his patch series was not supposed to go into the last merge window, I already told stable to hold off on these until further notice" from http://marc.info/?l=linux-mm&m=138559524422298 and his intention to send the entire series to stable in http://marc.info/?l=linux-kernel&m=138539243412073. >From that, I had thought you were already aware of this series and were waiting to merge it into previous stable kernels; I'm suggesting that stable doesn't touch this series with a ten-foot pole and only backport the fixes that have gone into 3.12. That series is: 759496ba640c ("arch: mm: pass userspace fault flag to generic fault handler") 3a13c4d761b4 ("x86: finish user fault error path with fatal signal") 519e52473ebe ("mm: memcg: enable memcg OOM killer only for user faults") fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and wakeup") 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") And then there's the mystery of 4942642080ea ("mm: memcg: handle non-error OOM situations more gracefully"). This patch went into 3.12-rc6 and is marked for stable@vger.kernel.org. Its changelog indicates it fixes the last patch in the above series, but that patch isn't marked for stable@vger.kernel.org. And then you have 84235de394d9 ("fs: buffer: move allocation failure loop into the allocator") which is marked for stable@vger.kernel.org, but 3168ecbe1c04 ("mm: memcg: use proper memcg in limit bypass") which fixes the memory corruption in that commit isn't marked for stable. I disagreed with the entire series being marked for stable in http://marc.info/?l=linux-kernel&m=137107020216528 since it violates the rules in Documentation/stable_kernel_rules.txt when it was originally proposed for stable. The memcg oom waitqueue that this series avoids has been in memcg for 3 1/2 years since 2.6.34 and to date one user, cc'd, has reported any issues with it. And then Johannes commented that he is asking stable to hold off on the series until further notice. I'm suggesting the series not be merged into stable at all, and that's what's in the email you're responding to. Further, since this is confusing enough as it is, I suggest if you do take 84235de394d9 ("fs: buffer: move allocation failure loop into the allocator") that you certainly must also consider 3168ecbe1c04 ("mm: memcg: use proper memcg in limit bypass"). The former was backported to 3.5 and they required an emergency release of 3.5.7.25 to take it back out. There's also another patch that Johannes will shortly be sending to fix the leakage to the root memcg in oom conditions that can trivially cause large amounts of memory to be charged to the root memcg when it should have been isolated to the oom memcg. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes 2013-11-30 0:51 ` Greg KH @ 2013-11-30 3:35 ` Johannes Weiner 2013-11-30 10:32 ` David Rientjes 1 sibling, 1 reply; 7+ messages in thread From: Johannes Weiner @ 2013-11-30 3:35 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Fri, Nov 29, 2013 at 04:00:09PM -0800, David Rientjes wrote: > On Wed, 27 Nov 2013, Johannes Weiner wrote: > > > > None that I am currently aware of, I'll continue to try them out. I'd > > > suggest just dropping the stable@kernel.org from the whole series though > > > unless there is another report of such a problem that people are running > > > into. > > > > The series has long been merged, how do we drop stable@kernel.org from > > it? > > > > You said you have informed stable to not merge these patches until further > notice, I'd suggest simply avoid ever merging the whole series into a > stable kernel since the problem isn't serious enough. Marking changes > that do "goto nomem" seem fine to mark for stable, though. These are followup fixes for the series that is upstream but didn't go to stable. I truly have no idea what you are talking about. > > > We've had this patch internally since we started using memcg, it has > > > avoided some unnecessary oom killing. > > > > Do you have quantified data that OOM kills are reduced over a longer > > sampling period? How many kills are skipped? How many of them are > > deferred temporarily but the VM ended up having to kill something > > anyway? > > On the scale that we run memcg, we would see it daily in automated testing > primarily because we panic the machine for memcg oom conditions where > there are no killable processes. It would typically manifest by two > processes that are allocating memory in a memcg; one is oom killed, is > allowed to allocate, handles its SIGKILL, exits and frees its memory and > the second process which is oom disabled races with the uncharge and is > oom disabled so the machine panics. So why don't you implement proper synchronization instead of putting these random checks all over the map to make the race window just small enough to not matter most of the time? > The upstream kernel of course doesn't panic in such a condition but if the > same scenario were to have happened, the second process would be > unnecessarily oom killed because it raced with the uncharge of the first > victim and it had exited before the scan of processes in the memcg oom > killer could detect it and defer. So this patch definitely does prevent > unnecessary oom killing when run at such a large scale that we do. If you are really bothered by this race, then please have OOM kill invocations wait for any outstanding TIF_MEMDIE tasks in the same context. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 3:35 ` Johannes Weiner @ 2013-11-30 10:32 ` David Rientjes 2013-11-30 15:55 ` Johannes Weiner 0 siblings, 1 reply; 7+ messages in thread From: David Rientjes @ 2013-11-30 10:32 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Fri, 29 Nov 2013, Johannes Weiner wrote: > > You said you have informed stable to not merge these patches until further > > notice, I'd suggest simply avoid ever merging the whole series into a > > stable kernel since the problem isn't serious enough. Marking changes > > that do "goto nomem" seem fine to mark for stable, though. > > These are followup fixes for the series that is upstream but didn't go > to stable. I truly have no idea what you are talking about. > I'm responding to your comments[*] that indicate you were going to eventually be sending it to stable. > > On the scale that we run memcg, we would see it daily in automated testing > > primarily because we panic the machine for memcg oom conditions where > > there are no killable processes. It would typically manifest by two > > processes that are allocating memory in a memcg; one is oom killed, is > > allowed to allocate, handles its SIGKILL, exits and frees its memory and > > the second process which is oom disabled races with the uncharge and is > > oom disabled so the machine panics. > > So why don't you implement proper synchronization instead of putting > these random checks all over the map to make the race window just > small enough to not matter most of the time? > The oom killer can be quite expensive, so we have found that is advantageous after doing all that work that the memcg is still oom for the charge order before needlessly killing a process. I am not suggesting that we add synchronization to the uncharge path for such a race, but merely a simple check as illustrated as due diligence. I think a simple conditional in the oom killer to avoid needlessly killing a user job is beneficial and avoids questions from customers who have a kernel log showing an oom kill occurring in a memcg that is not oom. We could even do the check in oom_kill_process() after dump_header() if you want to reduce any chance of that to avoid getting bug reports about such cases. > If you are really bothered by this race, then please have OOM kill > invocations wait for any outstanding TIF_MEMDIE tasks in the same > context. > The oom killer requires a tasklist scan, or an iteration over the set of processes attached to the memcg for the memcg case, to find a victim. It already defers if it finds eligible threads with TIF_MEMDIE set. Thanks. [*] http://marc.info/?l=linux-mm&m=138559524422298 http://marc.info/?l=linux-kernel&m=138539243412073 ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 10:32 ` David Rientjes @ 2013-11-30 15:55 ` Johannes Weiner 2013-11-30 22:12 ` David Rientjes 0 siblings, 1 reply; 7+ messages in thread From: Johannes Weiner @ 2013-11-30 15:55 UTC (permalink / raw) To: David Rientjes Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Sat, Nov 30, 2013 at 02:32:43AM -0800, David Rientjes wrote: > On Fri, 29 Nov 2013, Johannes Weiner wrote: > > > > You said you have informed stable to not merge these patches until further > > > notice, I'd suggest simply avoid ever merging the whole series into a > > > stable kernel since the problem isn't serious enough. Marking changes > > > that do "goto nomem" seem fine to mark for stable, though. > > > > These are followup fixes for the series that is upstream but didn't go > > to stable. I truly have no idea what you are talking about. > > > > I'm responding to your comments[*] that indicate you were going to > eventually be sending it to stable. > > > > On the scale that we run memcg, we would see it daily in automated testing > > > primarily because we panic the machine for memcg oom conditions where > > > there are no killable processes. It would typically manifest by two > > > processes that are allocating memory in a memcg; one is oom killed, is > > > allowed to allocate, handles its SIGKILL, exits and frees its memory and > > > the second process which is oom disabled races with the uncharge and is > > > oom disabled so the machine panics. > > > > So why don't you implement proper synchronization instead of putting > > these random checks all over the map to make the race window just > > small enough to not matter most of the time? > > > > The oom killer can be quite expensive, so we have found that is > advantageous after doing all that work that the memcg is still oom for > the charge order before needlessly killing a process. I am not suggesting > that we add synchronization to the uncharge path for such a race, but > merely a simple check as illustrated as due diligence. I think a simple > conditional in the oom killer to avoid needlessly killing a user job is > beneficial and avoids questions from customers who have a kernel log > showing an oom kill occurring in a memcg that is not oom. We could even > do the check in oom_kill_process() after dump_header() if you want to > reduce any chance of that to avoid getting bug reports about such cases. I asked about quantified data of this last-minute check, you replied with a race condition between an OOM kill victim and a subsequent OOM kill invocation. > > If you are really bothered by this race, then please have OOM kill > > invocations wait for any outstanding TIF_MEMDIE tasks in the same > > context. > > > > The oom killer requires a tasklist scan, or an iteration over the set of > processes attached to the memcg for the memcg case, to find a victim. It > already defers if it finds eligible threads with TIF_MEMDIE set. And now you say that this race does not really exist and repeat the same ramblings about last-minute checks to avoid unnecessary kills again. And again without any supporting data that I already asked for. The more I talk to you, the less sense this all makes. Why do you insist we merge this patch when you have apparently no idea why and how it works, and can't demonstrate that it works in the first place? I only followed you around in circles because I'm afraid that my shutting up would be interpreted as agreement again and Andrew would merge this anyway. But this is unsustainable, the burden of proof should be on you, not me. I'm going to stop replying until you provide the information I asked for. ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree 2013-11-30 15:55 ` Johannes Weiner @ 2013-11-30 22:12 ` David Rientjes 0 siblings, 0 replies; 7+ messages in thread From: David Rientjes @ 2013-11-30 22:12 UTC (permalink / raw) To: Johannes Weiner Cc: Andrew Morton, Michal Hocko, azurit, mm-commits, stable, linux-kernel, linux-mm On Sat, 30 Nov 2013, Johannes Weiner wrote: > > The oom killer requires a tasklist scan, or an iteration over the set of > > processes attached to the memcg for the memcg case, to find a victim. It > > already defers if it finds eligible threads with TIF_MEMDIE set. > > And now you say that this race does not really exist and repeat the > same ramblings about last-minute checks to avoid unnecessary kills > again. And again without any supporting data that I already asked > for. > The race does exist, perhaps you don't understand what the race is? This race occurs when process (A) declares oom and enters the oom killer, meanwhile an already oom killed victim (B) frees its memory and exits, and the process (A) oom kills another process even though the memcg is below its limit because of process (B). When doing something expensive in the kernel like oom killing, it usually doesn't cause so much hassle when the suggestion is: <declare an action is necessary> <do something expensive> <select an action> if (!<action is still necessary>) abort <perform the action> That type of check is fairly straight forward and makes sense. It prevents unnecessary oom killing (although it can't guarantee it in all conditions) and prevents customers from reporting oom kills when the log shows there is memory available for their memcg. When using memcg on a large scale to enforce memory isolation for user jobs, these types of scenarios happen often and there is no downside to adding such a check. The oom killer is not a hotpath, it's not performance sensitive to the degree that we cannot add a simple conditional that checks the current limit, it prevents unnecessary oom kills, and prevents user confusion. Without more invasive synchronization that would touch hotpaths, this is the best we can do: check if the oom kill is really necessary just before issuing the kill. Having the kernel actually kill a user process is a serious matter and we should strive to ensure it is prevented whenever possible. > The more I talk to you, the less sense this all makes. Why do you > insist we merge this patch when you have apparently no idea why and > how it works, and can't demonstrate that it works in the first place? > I'm not insisting anything, I don't make demands of others or maintainers like you do to merge or not merge anything. I also haven't even formally proposed the patch with a changelog that would explain the motivation. > I only followed you around in circles because I'm afraid that my > shutting up would be interpreted as agreement again and Andrew would > merge this anyway. But this is unsustainable, the burden of proof > should be on you, not me. I'm going to stop replying until you > provide the information I asked for. > Andrew can't merge a patch that hasn't been proposed for merge. Have a nice weekend. ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2013-11-30 22:12 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <526028bd.k5qPj2+MDOK1o6ii%akpm@linux-foundation.org>
[not found] ` <alpine.DEB.2.02.1311271453270.13682@chino.kir.corp.google.com>
[not found] ` <20131127233353.GH3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271622330.10617@chino.kir.corp.google.com>
[not found] ` <20131128021809.GI3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271826001.5120@chino.kir.corp.google.com>
[not found] ` <20131128031313.GK3556@cmpxchg.org>
[not found] ` <alpine.DEB.2.02.1311271914460.5120@chino.kir.corp.google.com>
[not found] ` <20131128035218.GM3556@cmpxchg.org>
2013-11-30 0:00 ` [merged] mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from -mm tree David Rientjes
2013-11-30 0:51 ` Greg KH
2013-11-30 10:25 ` David Rientjes
2013-11-30 3:35 ` Johannes Weiner
2013-11-30 10:32 ` David Rientjes
2013-11-30 15:55 ` Johannes Weiner
2013-11-30 22:12 ` David Rientjes
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox