From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: linux-mm@kvack.org
Cc: cl@linux.com, rientjes@google.com, linuxppc-dev@lists.ozlabs.org,
anton@samba.org, mgorman@suse.de
Subject: Re: Bug in reclaim logic with exhausted nodes?
Date: Thu, 13 Mar 2014 10:01:27 -0700
Message-ID: <20140313170127.GE22247@linux.vnet.ibm.com>
In-Reply-To: <20140311210614.GB946@linux.vnet.ibm.com>
There might have been an error in my original mail, so resending...
On 11.03.2014 [14:06:14 -0700], Nishanth Aravamudan wrote:
> We have seen the following situation on a test system:
>
> 2-node system, each node has 32GB of memory.
>
> 2 gigantic (16GB) pages reserved at boot-time, both of which are
> allocated from node 1.
>
> SLUB notices this:
>
> [ 0.000000] SLUB: Unable to allocate memory from node 1
> [ 0.000000] SLUB: Allocating a useless per node structure in order to
> be able to continue
>
> After boot, user then did:
>
> echo 24 > /proc/sys/vm/nr_hugepages
>
> And tasks are stuck:
>
> [<c0000000010980b8>] kexec_stack+0xb8/0x8000
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
> [<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
>
> [<c00000004f9334b0>] 0xc00000004f9334b0
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
> [<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
>
> [<c00000004f91f440>] 0xc00000004f91f440
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
> [<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
> [<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
>
> kswapd1 is also pegged at 100% cpu at this point.
>
> If we go in and manually:
>
> echo 24 >
> /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
>
> rather than relying on the interleaving allocator from the sysctl, the
> allocation succeeds (and the echo returns immediately).
>
> I think we are hitting the following:
>
> mm/hugetlb.c::alloc_fresh_huge_page_node():
>
> page = alloc_pages_exact_node(nid,
> htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
> __GFP_REPEAT|__GFP_NOWARN,
> huge_page_order(h));
>
> include/linux/gfp.h:
>
> #define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
>
> and mm/page_alloc.c::__alloc_pages_slowpath():
>
> /*
> * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> * __GFP_NOWARN set) should not cause reclaim since the subsystem
> * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
> * using a larger set of nodes after it has established that the
> * allowed per node queues are empty and that nodes are
> * over allocated.
> */
> if (IS_ENABLED(CONFIG_NUMA) &&
> (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> goto nopage;
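>
> (Expanding the flags by hand to spell out why that check does not fire
> here -- assuming htlb_alloc_mask() does not itself add __GFP_NORETRY,
> which as far as I can see it does not: the hugetlb mask above sets
> __GFP_THISNODE and __GFP_NOWARN but __GFP_REPEAT rather than
> __GFP_NORETRY, so
>
>     gfp_mask & GFP_THISNODE
>         == __GFP_THISNODE | __GFP_NOWARN    /* no __GFP_NORETRY */
>         != GFP_THISNODE
>
> and the early "goto nopage" is not taken.)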
>
> so we *do* reclaim in this callpath. As I read it, since node1 is
> exhausted, no matter how much work kswapd1 does it can never free
> enough memory on node1 to satisfy a 16M hugepage allocation request
> (or much of anything else, for that matter).
>
> I see the following possible changes/fixes, but am unsure
> a) whether my analysis is right, and
> b) which fix, if any, is best.
>
> 1) Since we did notice early in boot that (in this case) node 1 was
> exhausted, perhaps we should mark it as such there somehow, and if a
> __GFP_THISNODE allocation request comes through on such a node, we
> immediately fall through to nopage?
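>
> Very roughly, I am picturing something like the below (pure sketch:
> "nodes_exhausted" is a hypothetical nodemask, not an existing kernel
> interface, and it would have to be filled in wherever we detect the
> empty node at boot, e.g. where SLUB prints its warning):
>
>     /* somewhere in mm/page_alloc.c */
>     static nodemask_t nodes_exhausted;
>
>     void mark_node_exhausted(int nid)
>     {
>             node_set(nid, nodes_exhausted);
>     }
>
>     /* ...and early in __alloc_pages_slowpath(), next to the
>      * GFP_THISNODE check quoted above: */
>     if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) &&
>         node_isset(zone_to_nid(preferred_zone), nodes_exhausted))
>             goto nopage;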
>
> 2) There is the following check
> /*
> * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
> * specified, then we retry until we no longer reclaim any pages
> * (above), or we've reclaimed an order of pages at least as
> * large as the allocation's order. In both cases, if the
> * allocation still fails, we stop retrying.
> */
> if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
> return 1;
>
> I wonder if, when __GFP_THISNODE is set, we should also check that the
> pages we are reclaiming actually come from the requested node, i.e.
> something like:
>
> if (gfp_mask & __GFP_THISNODE && the progress we have made is on
> the node requested?)
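>
> As a more concrete (but again purely hypothetical) sketch --
> "pages_reclaimed_on_node" does not exist today; direct reclaim would
> have to be taught to report a per-node count for this to be possible:
>
>     if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order)) {
>             /* for __GFP_THISNODE, only keep retrying if reclaim is
>              * actually making progress on the node we need */
>             if (!(gfp_mask & __GFP_THISNODE) ||
>                 pages_reclaimed_on_node > 0)
>                     return 1;
>     }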
>
> 3) did_some_progress could be updated to track where the progress is
> occurring, and if we are in a __GFP_THISNODE allocation request and
> didn't make any progress on the correct node, we fail the allocation?
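>
> Sketching that one too (equally hypothetical: "nodes_progressed" would
> be a nodemask that try_to_free_pages() fills in with the nodes it
> actually freed pages on, which it does not report today):
>
>     /* in __alloc_pages_slowpath(), after direct reclaim returns */
>     if ((gfp_mask & __GFP_THISNODE) && did_some_progress &&
>         !node_isset(zone_to_nid(preferred_zone), nodes_progressed))
>             goto nopage;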
>
> I think this situation could be reproduced (and I am working on it) by
> exhausting a NUMA node with 16M hugepages and then using the generic
> round-robin allocator (the sysctl path) to ask for more. Other node
> exhaustion cases probably exist, but since hugepages cannot be swapped,
> this seems like the most straightforward way to try to reproduce it.
>
> Any thoughts on this? Am I way off base?
>
> Thanks,
> Nish
>