linux-mm.kvack.org archive mirror
* [patch] mm, coredump: fail allocations when coredumping instead of oom killing
@ 2012-03-15  2:15 David Rientjes
  2012-03-15 10:20 ` Mel Gorman
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2012-03-15  2:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

The size of coredump files is limited by RLIMIT_CORE; however, allocating
large amounts of memory results in three negative consequences:

 - the coredumping process may be chosen for oom kill and quickly deplete
   all memory reserves in oom conditions preventing further progress from
   being made or tasks from exiting,

 - the coredumping process may cause other processes to be oom killed
   without fault of their own as the result of a SIGSEGV, for example, in
   the coredumping process, or

 - the coredumping process may livelock while writing to the dump file
   if it needs to allocate memory while other threads are in the exit
   path waiting on the coredumper to complete.

This is fixed by implying __GFP_NORETRY in the page allocator for
coredumping processes when reclaim has failed so the allocations fail and
the process continues to exit.

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2306,6 +2306,10 @@ rebalance:
 		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
 			if (oom_killer_disabled)
 				goto nopage;
+			/* Coredumps can quickly deplete all memory reserves */
+			if ((current->flags & PF_DUMPCORE) &&
+			    !(gfp_mask & __GFP_NOFAIL))
+				goto nopage;
 			page = __alloc_pages_may_oom(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask, preferred_zone,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/


* Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
  2012-03-15  2:15 [patch] mm, coredump: fail allocations when coredumping instead of oom killing David Rientjes
@ 2012-03-15 10:20 ` Mel Gorman
  2012-03-15 21:47   ` David Rientjes
  0 siblings, 1 reply; 6+ messages in thread
From: Mel Gorman @ 2012-03-15 10:20 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

On Wed, Mar 14, 2012 at 07:15:10PM -0700, David Rientjes wrote:
> The size of coredump files is limited by RLIMIT_CORE; however, allocating
> large amounts of memory results in three negative consequences:
> 
>  - the coredumping process may be chosen for oom kill and quickly deplete
>    all memory reserves in oom conditions preventing further progress from
>    being made or tasks from exiting,
> 

Where is all the memory going?  A brief look at elf_core_dump() looks
fairly innocent.

o kmalloc for a header note
o kmalloc potentially for a short header
o dump_write() verifies access and calls f_op->write. I guess this could
  be doing a lot of allocations underneath, is this where all the memory
  is going?
o get_dump_page() underneath is pinning pages but it should not be
  allocating for the zero page and instead leaving a hole in the dump
  file. The pages it does allocate should be freed quickly but I
  recognise that it could cause a lot of paging activity if the
  information has to be retrieved from disk

I recognise that core dumping is potentially very heavy on the system so
is bringing all the data in from backing storage causing the problem?

>  - the coredumping process may cause other processes to be oom killed
>    without fault of their own as the result of a SIGSEGV, for example, in
>    the coredumping process, or
> 

Which is related to point 1

>  - the coredumping process may livelock while writing to the dump file
>    if it needs to allocate memory while other threads are in the exit
>    path waiting on the coredumper to complete.
> 

I can see how this could happen within the filesystem doing block allocations
and the like. It looks from exec.c that the file is not opened O_DIRECT
so is it the case that the page cache backing the core file until the IO
is complete is really what is pushing the system OOM?

> This is fixed by implying __GFP_NORETRY in the page allocator for
> coredumping processes when reclaim has failed so the allocations fail and
> the process continues to exit.
> 

From a page allocator perspective, this change looks fine but I'm concerned
about the functional change.

Does the change mean that core dumps may fail where previously they would
have succeeded even if the system churns a bit trying to write them out?
If so, should it be a tunable like /proc/sys/kernel/core_mayoom that
defaults to 1? Alternatively, would it be better if there was an option
to synchronously write the core file and discard the page cache pages as
the dump is written? It would be slower but it might stress the system less.

> Signed-off-by: David Rientjes <rientjes@google.com>
> ---
>  mm/page_alloc.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2306,6 +2306,10 @@ rebalance:
>  		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
>  			if (oom_killer_disabled)
>  				goto nopage;
> +			/* Coredumps can quickly deplete all memory reserves */
> +			if ((current->flags & PF_DUMPCORE) &&
> +			    !(gfp_mask & __GFP_NOFAIL))
> +				goto nopage;
>  			page = __alloc_pages_may_oom(gfp_mask, order,
>  					zonelist, high_zoneidx,
>  					nodemask, preferred_zone,

-- 
Mel Gorman
SUSE Labs


* Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
  2012-03-15 10:20 ` Mel Gorman
@ 2012-03-15 21:47   ` David Rientjes
  2012-03-19 21:52     ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2012-03-15 21:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

On Thu, 15 Mar 2012, Mel Gorman wrote:

> Where is all the memory going?  A brief look at elf_core_dump() looks
> fairly innocent.
> 
> o kmalloc for a header note
> o kmalloc potentially for a short header
> o dump_write() verifies access and calls f_op->write. I guess this could
>   be doing a lot of allocations underneath, is this where all the memory
>   is going?

Yup, this is the one.  We only currently see this when a memcg is at its 
limit and there are other threads that are trying to exit that are blocked 
on a coredumper that can no longer get memory.  dump_write() calling 
->write() (ext4 in this case) causes a livelock when 
add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache 
to the coredumper's memcg that is oom and calls 
mem_cgroup_charge_common().  That allows the oom, but the oom killer will 
find the other threads that are exiting and choose to be a no-op to avoid 
needlessly killing threads.  The coredumper only has PF_DUMPCORE and not 
PF_EXITING so it doesn't get immediately killed.

So we have a decision to either immediately oom kill the coredumper or 
just fail its allocations and exit since this code does seem to have good 
error handling.  If RLIMIT_CORE is relatively small, it's not a problem to 
kill the coredumper and give it access to memory reserves.  We don't want 
to take that chance, however, since memcg allows all charges to be 
bypassed for threads that have been oom killed and have access to memory 
reserves with their TIF_MEMDIE bit set.  In the worst case, when 
RLIMIT_CORE is high or even unlimited, this could quickly cause a system 
oom condition and then we'd be stuck again because the oom killer finds an 
eligible thread with TIF_MEMDIE and all memory reserves have been 
depleted.

> Does the change mean that core dumps may fail where previously they would
> have succeeded even if the system churns a bit trying to write them out?

We haven't seen this in system-wide oom conditions but it shouldn't be 
unlike the memcg case where we completely livelock because all threads are 
waiting for the coredumper to exit and no memory is being freed.  Unless 
the hard limit is increased (or memory hotplugged in the system-wide oom 
condition), this consistently results in a livelock.  With the system-wide 
oom condition it's more likely that another thread can exit that is not 
going to block on waiting for the coredumper to exit and free memory, but 
it's not guaranteed and this patch fixes the memcg case as well.

> If so, should it be a tunable like /proc/sys/kernel/core_mayoom that
> defaults to 1? Alternatively, would it be better if there was an option
> to synchronously write the core file and discard the page cache pages as
> the dump is written? It would be slower but it might stress the system less.
> 

I didn't think of adding a sysctl because all of its allocations are 
already GFP_KERNEL and in a killable context where the oom killer could 
already decide to kill something; the problem is that it chooses not to 
because it sees threads that are PF_EXITING and defers.  In this case that 
deferral will never help, because for those threads to fully exit and free 
their memory would require (perhaps a ton of) memory for the coredumper.


* Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
  2012-03-15 21:47   ` David Rientjes
@ 2012-03-19 21:52     ` Andrew Morton
  2012-03-20  0:46       ` David Rientjes
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2012-03-19 21:52 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

On Thu, 15 Mar 2012 14:47:50 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Thu, 15 Mar 2012, Mel Gorman wrote:
> 
> > Where is all the memory going?  A brief look at elf_core_dump() looks
> > fairly innocent.
> > 
> > o kmalloc for a header note
> > o kmalloc potentially for a short header
> > o dump_write() verifies access and calls f_op->write. I guess this could
> >   be doing a lot of allocations underneath, is this where all the memory
> >   is going?
> 
> Yup, this is the one.  We only currently see this when a memcg is at its 
> limit and there are other threads that are trying to exit that are blocked 
> on a coredumper that can no longer get memory.  dump_write() calling 
> ->write() (ext4 in this case) causes a livelock when 
> add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache 
> to the coredumper's memcg that is oom and calls 
> mem_cgroup_charge_common().  That allows the oom, but the oom killer will 
> find the other threads that are exiting and choose to be a no-op to avoid 
> needlessly killing threads.  The coredumper only has PF_DUMPCORE and not 
> PF_EXITING so it doesn't get immediately killed.

I don't understand the description of the livelock.  Does
add_to_page_cache_locked() succeed, or fail?  What does "allows the
oom" mean?

IOW, please have another attempt at explaining the livelock?

> So we have a decision to either immediately oom kill the coredumper or 
> just fail its allocations and exit since this code does seem to have good 
> error handling.  If RLIMIT_CORE is relatively small, it's not a problem to 
> kill the coredumper and give it access to memory reserves.  We don't want 
> to take that chance, however, since memcg allows all charges to be 
> bypassed for threads that have been oom killed and have access to memory 
> reserves with their TIF_MEMDIE bit set.  In the worst case, when 
> RLIMIT_CORE is high or even unlimited, this could quickly cause a system 
> oom condition and then we'd be stuck again because the oom killer finds an 
> eligible thread with TIF_MEMDIE and all memory reserves have been 
> depleted.

AFAICT, dumping core should only require the allocation of 2-3
unreclaimable pages at any one time.  That's if reclaim is working
properly.  So I'd have thought that permitting the core-dumper to
allocate those pages would cause everything to run to completion
nicely.

Relatedly, RLIMIT_CORE shouldn't affect this?  The core dumper only
really needs to pin a single pagecache page: the one into which it is
presently copying data.


My vague take on this patch is that we should instead try to let
everything run to completion, rather than failing allocations or
oom-killing anything.  But I don't yet understand the problem which the
patch is addressing...


* Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
  2012-03-19 21:52     ` Andrew Morton
@ 2012-03-20  0:46       ` David Rientjes
  2012-03-22 23:07         ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: David Rientjes @ 2012-03-20  0:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

On Mon, 19 Mar 2012, Andrew Morton wrote:

> > Yup, this is the one.  We only currently see this when a memcg is at its 
> > limit and there are other threads that are trying to exit that are blocked 
> > on a coredumper that can no longer get memory.  dump_write() calling 
> > ->write() (ext4 in this case) causes a livelock when 
> > add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache 
> > to the coredumper's memcg that is oom and calls 
> > mem_cgroup_charge_common().  That allows the oom, but the oom killer will 
> > find the other threads that are exiting and choose to be a no-op to avoid 
> > needlessly killing threads.  The coredumper only has PF_DUMPCORE and not 
> > PF_EXITING so it doesn't get immediately killed.
> 
> I don't understand the description of the livelock.  Does
> add_to_page_cache_locked() succeed, or fail?  What does "allows the
> oom" mean?
> 

Sorry if it wasn't clear.  The coredumper calling into 
add_to_page_cache_locked() calls the oom killer because the memcg is oom 
(and would call the global oom killer if the entire system were oom).  The 
oom killer, both memcg and global, doesn't do anything because it sees 
eligible threads with PF_EXITING set.  This logic has existed for several 
years to avoid needlessly oom killing additional threads when others are 
already in the process of exiting and freeing their memory.  Those 
PF_EXITING threads, however, are blocked on the coredumper to exit in 
exit_mm(), so they'll never actually exit.  Thus, the coredumper must make 
forward progress for anything to actually exit and the oom killer is 
useless.

In this condition, there are a few options:

 - give the coredumper access to memory reserves and allow it to allocate,
   essentially oom killing it,

 - fail coredumper memory allocations because of the oom condition and 
   allow the threads blocked on it to exit, or

 - implement an oom killer timeout that would kill additional threads if 
   we repeatedly call into it without making forward progress over a small 
   period of time.

The first and last, in my opinion, are non-starters because they allow a 
complete depletion of memory reserves if the coredumper is chosen, after 
which nothing is guaranteed to be able to ever exit.  This patch implements the 
middle option where we do our best effort to allow the coredump to be 
successful (we even try direct reclaim before failing) but choose to fail 
before calling into the oom killer and causing a livelock.

> AFAICT, dumping core should only require the allocation of 2-3
> unreclaimable pages at any one time.  That's if reclaim is working
> properly.  So I'd have thought that permitting the core-dumper to
> allocate those pages would cause everything to run to completion
> nicely.
> 

If there's nothing to reclaim (more obvious when running in a memcg), the 
allocation cannot succeed and we livelock in the presence of PF_EXITING 
threads that are waiting on the coredump; nothing in the kernel currently 
allows those allocations to succeed.  If we can guarantee 
that the call to ->write() allocates 2-3 pages at most then we could 
perhaps get away with doing something like

	if (current->flags & PF_DUMPCORE) {
		set_thread_flag(TIF_MEMDIE);
		return 0;
	}

in the oom killer like we allow for fatal_signal_pending() right now.  I 
chose to be more conservative, however, because the amount of memory it 
allocates is filesystem dependent and may deplete all memory reserves.

> Relatedly, RLIMIT_CORE shouldn't affect this?  The core dumper only
> really needs to pin a single pagecache page: the one into which it is
> presently copying data.
> 

It's filesystem dependent; without this patch the VM doesn't safeguard the 
memcg or the system against a livelock.  Even if only one page were 
pinned, the vulnerability would still exist.


* Re: [patch] mm, coredump: fail allocations when coredumping instead of oom killing
  2012-03-20  0:46       ` David Rientjes
@ 2012-03-22 23:07         ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2012-03-22 23:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, KAMEZAWA Hiroyuki, Minchan Kim, Oleg Nesterov,
	linux-mm

On Mon, 19 Mar 2012 17:46:47 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> On Mon, 19 Mar 2012, Andrew Morton wrote:
> 
> > > Yup, this is the one.  We only currently see this when a memcg is at its 
> > > limit and there are other threads that are trying to exit that are blocked 
> > > on a coredumper that can no longer get memory.  dump_write() calling 
> > > ->write() (ext4 in this case) causes a livelock when 
> > > add_to_page_cache_locked() tries to charge the soon-to-be-added pagecache 
> > > to the coredumper's memcg that is oom and calls 
> > > mem_cgroup_charge_common().  That allows the oom, but the oom killer will 
> > > find the other threads that are exiting and choose to be a no-op to avoid 
> > > needlessly killing threads.  The coredumper only has PF_DUMPCORE and not 
> > > PF_EXITING so it doesn't get immediately killed.
> > 
> > I don't understand the description of the livelock.  Does
> > add_to_page_cache_locked() succeed, or fail?  What does "allows the
> > oom" mean?
> > 
> 
> Sorry if it wasn't clear.  The coredumper calling into 
> add_to_page_cache_locked() calls the oom killer because the memcg is oom 
> (and would call the global oom killer if the entire system were oom).  The 
> oom killer, both memcg and global, doesn't do anything because it sees 
> eligible threads with PF_EXITING set.  This logic has existed for several 
> years to avoid needlessly oom killing additional threads when others are 
> already in the process of exiting and freeing their memory.  Those 
> PF_EXITING threads, however, are blocked on the coredumper to exit in 
> exit_mm(), so they'll never actually exit.  Thus, the coredumper must make 
> forward progress for anything to actually exit and the oom killer is 
> useless.
> 
> In this condition, there are a few options:
> 
>  - give the coredumper access to memory reserves and allow it to allocate,
>    essentially oom killing it,
> 
>  - fail coredumper memory allocations because of the oom condition and 
>    allow the threads blocked on it to exit, or
> 
>  - implement an oom killer timeout that would kill additional threads if 
>    we repeatedly call into it without making forward progress over a small 
>    period of time.
> 
> The first and last, in my opinion, are non-starters because they allow a 
> complete depletion of memory reserves if the coredumper is chosen, after 
> which nothing is guaranteed to be able to ever exit.

Why does option 1 lead to reserve exhaustion?  If we have a zillion
simultaneous core dumps?

>  This patch implements the 
> middle option where we do our best effort to allow the coredump to be 
> successful (we even try direct reclaim before failing) but choose to fail 
> before calling into the oom killer and causing a livelock.

