From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: linux-mm@kvack.org
Cc: anton@samba.org, linuxppc-dev@lists.ozlabs.org, mgorman@suse.de,
cl@linux.com, rientjes@google.com
Subject: Bug in reclaim logic with exhausted nodes?
Date: Tue, 11 Mar 2014 14:06:14 -0700
Message-ID: <20140311210614.GB946@linux.vnet.ibm.com>
We have seen the following situation on a test system:
2-node system, each node has 32GB of memory.
2 gigantic (16GB) pages reserved at boot-time, both of which are
allocated from node 1.
SLUB notices this:
[ 0.000000] SLUB: Unable to allocate memory from node 1
[ 0.000000] SLUB: Allocating a useless per node structure in order to be able to continue
After boot, the user then did:
echo 24 > /proc/sys/vm/nr_hugepages
And tasks are stuck:
[<c0000000010980b8>] kexec_stack+0xb8/0x8000
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c
[<c00000004f9334b0>] 0xc00000004f9334b0
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
[<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c
[<c00000004f91f440>] 0xc00000004f91f440
[<c0000000000144d0>] .__switch_to+0x1c0/0x390
[<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
[<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
[<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
[<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
[<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
[<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
[<c00000000021dcc0>] .vfs_write+0xe0/0x260
[<c00000000021e8c8>] .SyS_write+0x58/0xd0
[<c000000000009e7c>] syscall_exit+0x0/0x7c
kswapd1 is also pegged at 100% CPU at this point.
If we instead go in and manually do:
echo 24 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
rather than relying on the interleaving allocator behind the sysctl, the
allocation succeeds (and the echo returns immediately).
I think we are hitting the following:
mm/hugetlb.c::alloc_fresh_huge_page_node():
	page = alloc_pages_exact_node(nid,
		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
						__GFP_REPEAT|__GFP_NOWARN,
		huge_page_order(h));
include/linux/gfp.h:
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
and mm/page_alloc.c::__alloc_pages_slowpath():
	/*
	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
	 * using a larger set of nodes after it has established that the
	 * allowed per node queues are empty and that nodes are
	 * over allocated.
	 */
	if (IS_ENABLED(CONFIG_NUMA) &&
		(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;
so we *do* reclaim in this callpath: the hugetlb mask sets __GFP_THISNODE
and __GFP_NOWARN but not __GFP_NORETRY, so the (gfp_mask & GFP_THISNODE) ==
GFP_THISNODE test above never matches and we fall through into reclaim.
Under my reading, since node 1 is exhausted, no matter how much work kswapd1
does, it will never reclaim enough memory from node 1 to satisfy a 16M
hugepage allocation request (or any other, for that matter).
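To convince myself of the flag arithmetic, here is a trivial userspace
sketch (the bit values are made up for illustration; only the relationships
between the masks matter):

#include <stdio.h>

/* illustrative values only; the real ones live in include/linux/gfp.h */
#define __GFP_NOWARN	0x1u
#define __GFP_NORETRY	0x2u
#define __GFP_REPEAT	0x4u
#define __GFP_THISNODE	0x8u

#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

int main(void)
{
	/* what alloc_fresh_huge_page_node() passes, minus zone/comp bits */
	unsigned int hugetlb_mask = __GFP_THISNODE | __GFP_REPEAT | __GFP_NOWARN;

	/* __GFP_NORETRY is absent, so this prints "no": the nopage shortcut
	 * in __alloc_pages_slowpath() is skipped and we go on to reclaim */
	printf("shortcut taken: %s\n",
	       (hugetlb_mask & GFP_THISNODE) == GFP_THISNODE ? "yes" : "no");
	return 0;
}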
I see the following possible changes/fixes, but am unsure (a) whether my
analysis is right and (b) which of them is best.
1) Since we did notice early in boot that (in this case) node 1 was
exhausted, perhaps we should mark it as such there somehow, and if a
__GFP_THISNODE allocation request then comes in for such a node, immediately
fall through to nopage?
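A very rough (and completely untested) sketch of what I mean; note that
nodes_exhausted is made up and would have to be populated wherever we detect
the exhaustion at boot:

	/* hypothetical mask of nodes known to have no usable memory */
	static nodemask_t nodes_exhausted;

	/* in __alloc_pages_slowpath(), before we attempt reclaim: */
	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE)) {
		int nid = zone_to_nid(preferred_zone);

		if (node_isset(nid, nodes_exhausted))
			goto nopage;
	}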
2) There is the following check in mm/page_alloc.c::should_alloc_retry():
	/*
	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
	 * specified, then we retry until we no longer reclaim any pages
	 * (above), or we've reclaimed an order of pages at least as
	 * large as the allocation's order. In both cases, if the
	 * allocation still fails, we stop retrying.
	 */
	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
		return 1;
I wonder if we should also check that the pages we are reclaiming, when
__GFP_THISNODE is set, actually come from the requested node? Roughly:
	if (gfp_mask & __GFP_THISNODE && <progress was made on the requested node>)
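Fleshed out a little more (still hypothetical: pages_reclaimed_nid doesn't
exist today; it would have to be the subset of pages_reclaimed that actually
came from the requested node):

	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order)) {
		if (!(gfp_mask & __GFP_THISNODE) || pages_reclaimed_nid)
			return 1;
		/* __GFP_THISNODE and no progress on that node: stop retrying */
	}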
3) did_some_progress could be updated to track where the progress is
occurring, and if we are in a __GFP_THISNODE allocation request and we
didn't make any progress on the requested node, we fail the allocation?
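Again purely hypothetical (nr_reclaimed_on_node() is made up and would need
reclaim to export per-node progress somehow), but roughly:

	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);

	/* for __GFP_THISNODE, only count progress made on the preferred node */
	if ((gfp_mask & __GFP_THISNODE) &&
	    !nr_reclaimed_on_node(zone_to_nid(preferred_zone)))
		did_some_progress = 0;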
I think this situation could be reproduced (and I am working on it) by
exhausting a NUMA node with 16M hugepages and then using the generic
round-robin allocator to ask for more. Other node-exhaustion cases probably
exist, but since we can't swap out hugepages, this seems like the most
straightforward way to try and reproduce it.
Any thoughts on this? Am I way off base?
Thanks,
Nish