Re: Bug in reclaim logic with exhausted nodes?


From: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
To: linux-mm@kvack.org
Cc: mgorman@suse.de, cl@linux.com, linuxppc-dev@lists.ozlabs.org,
	anton@samba.org, rientjes@google.com
Subject: Re: Bug in reclaim logic with exhausted nodes?
Date: Thu, 13 Mar 2014 10:01:27 -0700
Message-ID: <20140313170127.GE22247@linux.vnet.ibm.com>
In-Reply-To: <20140311210614.GB946@linux.vnet.ibm.com>

There might have been an error in my original mail, so resending...

On 11.03.2014 [14:06:14 -0700], Nishanth Aravamudan wrote:
> We have seen the following situation on a test system:
> 
> 2-node system, each node has 32GB of memory.
> 
> 2 gigantic (16GB) pages reserved at boot-time, both of which are
> allocated from node 1.
> 
> SLUB notices this:
> 
> [    0.000000] SLUB: Unable to allocate memory from node 1
> [    0.000000] SLUB: Allocating a useless per node structure in order to be able to continue
> 
> After boot, user then did:
> 
> echo 24 > /proc/sys/vm/nr_hugepages
> 
> And tasks are stuck:
> 
> [<c0000000010980b8>] kexec_stack+0xb8/0x8000
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
> [<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
> 
> [<c00000004f9334b0>] 0xc00000004f9334b0
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb7c8>] .hugetlb_sysctl_handler_common+0x168/0x180
> [<c0000000002ae21c>] .proc_sys_call_handler+0xfc/0x120
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
> 
> [<c00000004f91f440>] 0xc00000004f91f440
> [<c0000000000144d0>] .__switch_to+0x1c0/0x390
> [<c0000000001ac708>] .throttle_direct_reclaim.isra.31+0x238/0x2c0
> [<c0000000001b0b34>] .try_to_free_pages+0xb4/0x210
> [<c0000000001a2f1c>] .__alloc_pages_nodemask+0x75c/0xb00
> [<c0000000001eafb0>] .alloc_fresh_huge_page+0x70/0x150
> [<c0000000001eb2d0>] .set_max_huge_pages.part.37+0x130/0x2f0
> [<c0000000001eb54c>] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
> [<c0000000003662cc>] .kobj_attr_store+0x2c/0x50
> [<c0000000002b2c2c>] .sysfs_write_file+0xec/0x1c0
> [<c00000000021dcc0>] .vfs_write+0xe0/0x260
> [<c00000000021e8c8>] .SyS_write+0x58/0xd0
> [<c000000000009e7c>] syscall_exit+0x0/0x7c
> 
> kswapd1 is also pegged at this point at 100% cpu.
> 
> If we go in and manually:
> 
> echo 24 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages
> 
> rather than relying on the interleaving allocator from the sysctl, the
> allocation succeeds (and the echo returns immediately).
> 
> I think we are hitting the following:
> 
> mm/hugetlb.c::alloc_fresh_huge_page_node():
> 
>         page = alloc_pages_exact_node(nid,
>                 htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
>                                                 __GFP_REPEAT|__GFP_NOWARN,
>                 huge_page_order(h));
> 
> include/linux/gfp.h:
> 
> #define GFP_THISNODE    (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
> 
> and mm/page_alloc.c::__alloc_pages_slowpath():
> 
>         /*
>          * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
>          * __GFP_NOWARN set) should not cause reclaim since the subsystem
>          * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
>          * using a larger set of nodes after it has established that the
>          * allowed per node queues are empty and that nodes are
>          * over allocated.
>          */
>         if (IS_ENABLED(CONFIG_NUMA) &&
>                         (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
>                 goto nopage;
> 
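> so we *do* reclaim in this callpath: hugetlb passes __GFP_REPEAT rather
> than __GFP_NORETRY, so its mask never equals GFP_THISNODE and the early
> exit is skipped. Under my reading, since node1 is exhausted, no matter
> how much work kswapd1 does, it will never reclaim memory from node1 to
> satisfy a 16M page allocation request (or any other, for that matter).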
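The mask comparison described above can be checked in isolation. The following is a minimal userspace sketch, not kernel code: the DEMO_* bit values and the helper names are assumptions made up for illustration (the real flag definitions are in include/linux/gfp.h), and only the shape of the test is taken from the quoted __alloc_pages_slowpath() excerpt.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; the real definitions are in include/linux/gfp.h. */
#define DEMO_NORETRY  0x01u
#define DEMO_REPEAT   0x02u
#define DEMO_NOWARN   0x04u
#define DEMO_THISNODE 0x08u
#define DEMO_COMP     0x10u

/* Mirrors the composite GFP_THISNODE quoted above. */
#define DEMO_GFP_THISNODE (DEMO_THISNODE | DEMO_NOWARN | DEMO_NORETRY)

/* True when every bit of the composite is set, i.e. the quoted check would
 * jump to nopage without reclaiming. */
static inline bool skips_reclaim(unsigned int gfp_mask)
{
    return (gfp_mask & DEMO_GFP_THISNODE) == DEMO_GFP_THISNODE;
}

/* Slab-style caller: passes the composite as-is. */
static inline unsigned int slab_style_mask(void)
{
    return DEMO_GFP_THISNODE;
}

/* hugetlb-style caller: THISNODE and NOWARN, but REPEAT instead of NORETRY. */
static inline unsigned int hugetlb_style_mask(void)
{
    return DEMO_COMP | DEMO_THISNODE | DEMO_REPEAT | DEMO_NOWARN;
}

/* Self-check of the two cases discussed in the mail. */
static inline void check_examples(void)
{
    assert(skips_reclaim(slab_style_mask()));
    assert(!skips_reclaim(hugetlb_style_mask()));
}
```

Calling check_examples() from a small main() passes silently: the slab-style mask takes the early exit, while the hugetlb-style mask lacks the NORETRY bit and so proceeds to reclaim.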
> 
> I see the following possible changes/fixes, but am unsure if
> a) my analysis is right
> b) which is best.
> 
> 1) Since we did notice early in boot that (in this case) node 1 was
> exhausted, perhaps we should mark it as such there somehow, and if a
> __GFP_THISNODE allocation request comes through on such a node, we
> immediately fall through to nopage?
> 
> 2) There is the following check
>         /*
>          * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
>          * specified, then we retry until we no longer reclaim any pages
>          * (above), or we've reclaimed an order of pages at least as
>          * large as the allocation's order. In both cases, if the
>          * allocation still fails, we stop retrying.
>          */
>         if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
>                 return 1;
> 
> I wonder if we should add a check to also be sure that the pages we are
> reclaiming, if __GFP_THISNODE is set, are from the right node?
> 
>        if (gfp_mask & __GFP_THISNODE && the progress we have made is on
>        		the node requested?)
> 
> 3) did_some_progress could be updated to track where the progress is
> occurring, and if we are in a __GFP_THISNODE allocation request and we
> didn't make any progress on the correct node, we fail the allocation?
> 
> I think this situation could be reproduced (and am working on it) by
> exhausting a NUMA node with 16M hugepages and then using the generic
> RR allocator to ask for more. Other node exhaustion cases probably
> exist, but since we can't swap the hugepages, it seems like the most
> straightforward way to try and reproduce it.
> 
> Any thoughts on this? Am I way off base?
> 
> Thanks,
> Nish
