From: Mel Gorman <mgorman@suse.de>
To: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrew Barry <abarry@cray.com>, linux-mm <linux-mm@kvack.org>,
	Rik van Riel <riel@redhat.com>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: Unending loop in __alloc_pages_slowpath following OOM-kill; rfc: patch.
Date: Tue, 17 May 2011 12:34:30 +0100
Message-ID: <20110517113430.GM5279@suse.de>
In-Reply-To: <BANLkTikiXUzbsUkzaKZsZg+5ugruA2JdMA@mail.gmail.com>

On Tue, May 17, 2011 at 07:34:47PM +0900, Minchan Kim wrote:
> On Sat, May 14, 2011 at 6:31 AM, Andrew Barry <abarry@cray.com> wrote:
> > I believe I found a problem in __alloc_pages_slowpath, which allows a process to
> > get stuck endlessly looping, even when lots of memory is available.
> >
> > Running an I/O and memory intensive stress-test I see a 0-order page allocation
> > with __GFP_IO and __GFP_WAIT, running on a system with very little free memory.
> > Right about the same time that the stress-test gets killed by the OOM-killer,
> > the utility trying to allocate memory gets stuck in __alloc_pages_slowpath even
> > though most of the system's memory was freed by the oom-kill of the stress-test.
> >
> > The utility ends up looping from the rebalance label down through the
> > wait_iff_congested continuously. Because order=0, __alloc_pages_direct_compact
> > skips the call to get_page_from_freelist. Because all of the reclaimable memory
> > on the system has already been reclaimed, __alloc_pages_direct_reclaim skips the
> > call to get_page_from_freelist. Since there is no __GFP_FS flag, the block with
> > __alloc_pages_may_oom is skipped. The loop hits the wait_iff_congested, then
> > jumps back to rebalance without ever trying to get_page_from_freelist. This loop
> > repeats infinitely.
> >
> > Is there a reason that this loop is set up this way for 0 order allocations? I
> > applied the below patch, and the problem corrects itself. Does anyone have any
> > thoughts on the patch, or on a better way to address this situation?
> >
> > The test case is pretty pathological. Running a mix of I/O stress-tests that do
> > a lot of fork() and consume all of the system memory, I can pretty reliably hit
> > this on 600 nodes, in about 12 hours. 32GB/node.
> >
> 
> It's amazing.
> I think it's _very_ rare, but it's possible if the test program killed by
> the OOM killer has mostly anonymous pages and the allocating tasks try to
> allocate order-0 pages with GFP_NOFS.
> 
> When the [in]active lists suddenly become empty (I am not sure how
> that situation comes about)

Maybe because the stress test consumed almost all, if not all, of the
LRU and then got OOM-killed, emptying the lists.
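
For reference, the retry loop as Andrew describes it can be modelled in
userspace roughly as below. This is only a sketch of the description in
the report, not the kernel code: the MODEL_GFP_* values, struct
model_result and model_slowpath() are made-up names, and the model only
counts how often the free lists would be consulted instead of modelling
a successful allocation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
#define MODEL_GFP_WAIT 0x1u
#define MODEL_GFP_IO   0x2u
#define MODEL_GFP_FS   0x4u

struct model_result {
	unsigned int iterations;     /* passes through the rebalance label */
	unsigned int freelist_tries; /* get_page_from_freelist attempts */
};

/*
 * Walk the retry loop for a bounded number of passes and count how
 * often the free lists would be looked at. reclaim_progress stands in
 * for did_some_progress as reported by direct reclaim.
 */
static struct model_result model_slowpath(unsigned int order, unsigned int gfp,
					  bool reclaim_progress,
					  unsigned int max_passes)
{
	struct model_result r = { 0, 0 };

	/* The report concerns allocations that are allowed to sleep. */
	assert(gfp & MODEL_GFP_WAIT);

	while (r.iterations < max_passes) {
		r.iterations++;

		/* Direct compaction returns early for order-0. */
		if (order > 0)
			r.freelist_tries++;

		/* Direct reclaim retries the free lists only after progress. */
		if (reclaim_progress)
			r.freelist_tries++;

		/* The OOM path runs only without progress and with __GFP_FS. */
		if (!reclaim_progress && (gfp & MODEL_GFP_FS))
			r.freelist_tries++;

		/* wait_iff_congested(); goto rebalance; */
	}
	return r;
}
```

With order 0, no __GFP_FS and no reclaim progress, freelist_tries stays
at zero however many passes are made, which is the reported loop. Adding
MODEL_GFP_FS gives one attempt per pass through the OOM path.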

> and we are reclaiming order-0 pages,
> neither compaction nor __alloc_pages_direct_reclaim helps. Compaction
> does nothing because this is an order-0 allocation. As for
> __alloc_pages_direct_reclaim, it only works if there are LRU pages
> on the [in]active lists, but unfortunately the LRU lists are empty.
> So the last resort is the following code in do_try_to_free_pages.
> 
>         /* top priority shrink_zones still had more to do? don't OOM, then */
>         if (scanning_global_lru(sc) && !all_unreclaimable(zonelist, sc))
>                 return 1;
> 
> But it has a problem, too. all_unreclaimable checks zone->all_unreclaimable.
> zone->all_unreclaimable is set only when the condition below
> (zone_reclaimable) is false.
> 
> zone->pages_scanned < zone_reclaimable_pages(zone) * 6
> 
> If the LRU lists are completely empty, shrink_zone has nothing to scan, so
> zone->pages_scanned stays at zero. But as we know, zone_page_state
> isn't exact because of per_cpu_pageset, so zone_reclaimable_pages might
> still return a positive value. As a result, zone_reclaimable always returns
> true, which means kswapd never sets zone->all_unreclaimable. So the last
> resort becomes a nop.
> 
> In this case, the current allocation never gets a chance to call
> get_page_from_freelist, as Andrew Barry said.
> 
> Does it make sense?
> If so, how about this?
> 

This looks like a better fix. The alternative fix continually wakes
kswapd and takes additional unnecessary steps.
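
To make the arithmetic in the quoted comparison concrete, here is a
small userspace sketch. struct model_zone and model_zone_reclaimable()
are stand-ins invented for illustration; only the comparison itself is
taken from the text above. The point is that with nothing scanned, any
stale non-zero reclaimable count keeps the check true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the two per-zone values that matter here. */
struct model_zone {
	unsigned long pages_scanned;     /* stays 0 if nothing is scanned */
	unsigned long reclaimable_pages; /* possibly stale LRU page count */
};

/* Same comparison as quoted above: scanned < reclaimable * 6. */
static bool model_zone_reclaimable(const struct model_zone *z)
{
	assert(z != NULL);
	return z->pages_scanned < z->reclaimable_pages * 6;
}
```

For example, a zone with pages_scanned == 0 and a stale count of 1 still
looks reclaimable (0 < 6), whereas an exact count of 0 would not
(0 < 0 is false).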

Thanks.

-- 
Mel Gorman
SUSE Labs

