Date: Tue, 11 Mar 2014 14:06:14 -0700
From: Nishanth Aravamudan
Subject: Bug in reclaim logic with exhausted nodes?
Message-ID: <20140311210614.GB946@linux.vnet.ibm.com>
To: linux-mm@kvack.org
Cc: anton@samba.org, linuxppc-dev@lists.ozlabs.org, mgorman@suse.de, cl@linux.com, rientjes@google.com

We have seen the following situation on a test system: a 2-node system,
where each node has 32GB of memory. 2 gigantic (16GB) pages are reserved
at boot-time, both of which are allocated from node 1, so node 1's memory
is entirely consumed by the gigantic pages. SLUB notices this:

[    0.000000] SLUB: Unable to allocate memory from node 1
[    0.000000] SLUB: Allocating a useless per node structure in order to be able to continue

After boot, the user then did:

echo 24 > /proc/sys/vm/nr_hugepages

and the writing tasks are stuck:

[] kexec_stack+0xb8/0x8000
[] .__switch_to+0x1c0/0x390
[] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[] .try_to_free_pages+0xb4/0x210
[] .__alloc_pages_nodemask+0x75c/0xb00
[] .alloc_fresh_huge_page+0x70/0x150
[] .set_max_huge_pages.part.37+0x130/0x2f0
[] .hugetlb_sysctl_handler_common+0x168/0x180
[] .proc_sys_call_handler+0xfc/0x120
[] .vfs_write+0xe0/0x260
[] .SyS_write+0x58/0xd0
[] syscall_exit+0x0/0x7c

[] 0xc00000004f9334b0
[] .__switch_to+0x1c0/0x390
[] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[] .try_to_free_pages+0xb4/0x210
[] .__alloc_pages_nodemask+0x75c/0xb00
[] .alloc_fresh_huge_page+0x70/0x150
[] .set_max_huge_pages.part.37+0x130/0x2f0
[] .hugetlb_sysctl_handler_common+0x168/0x180
[] .proc_sys_call_handler+0xfc/0x120
[] .vfs_write+0xe0/0x260
[] .SyS_write+0x58/0xd0
[] syscall_exit+0x0/0x7c

[] 0xc00000004f91f440
[] .__switch_to+0x1c0/0x390
[] .throttle_direct_reclaim.isra.31+0x238/0x2c0
[] .try_to_free_pages+0xb4/0x210
[] .__alloc_pages_nodemask+0x75c/0xb00
[] .alloc_fresh_huge_page+0x70/0x150
[] .set_max_huge_pages.part.37+0x130/0x2f0
[] .nr_hugepages_store_common.isra.39+0xbc/0x1b0
[] .kobj_attr_store+0x2c/0x50
[] .sysfs_write_file+0xec/0x1c0
[] .vfs_write+0xe0/0x260
[] .SyS_write+0x58/0xd0
[] syscall_exit+0x0/0x7c

kswapd1 is also pegged at 100% CPU at this point.
If we instead go in and manually do:

echo 24 > /sys/devices/system/node/node0/hugepages/hugepages-16384kB/nr_hugepages

rather than relying on the interleaving allocator behind the sysctl, the
allocation succeeds (and the echo returns immediately).

I think we are hitting the following:

mm/hugetlb.c::alloc_fresh_huge_page_node():

	page = alloc_pages_exact_node(nid,
		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
			__GFP_REPEAT|__GFP_NOWARN,
		huge_page_order(h));

include/linux/gfp.h:

#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

and mm/page_alloc.c::__alloc_pages_slowpath():

	/*
	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
	 * using a larger set of nodes after it has established that the
	 * allowed per node queues are empty and that nodes are
	 * over allocated.
	 */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;

Since the hugetlb mask sets __GFP_THISNODE and __GFP_NOWARN but not
__GFP_NORETRY, (gfp_mask & GFP_THISNODE) != GFP_THISNODE, the early
nopage exit is skipped, and so we *do* reclaim in this callpath. Under
my reading, since node 1 is exhausted, no matter how much work kswapd1
does, it will never reclaim memory from node 1 to satisfy a 16M page
allocation request (or any other, for that matter).

I see the following possible changes/fixes, but am unsure a) whether my
analysis is right and b) which of these is best:

1) Since we did notice early in boot that (in this case) node 1 was
exhausted, perhaps we should mark it as such there somehow, and if a
__GFP_THISNODE allocation request comes through on such a node, we
immediately fall through to nopage?

2) There is the following check:

	/*
	 * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
	 * specified, then we retry until we no longer reclaim any pages
	 * (above), or we've reclaimed an order of pages at least as
	 * large as the allocation's order. In both cases, if the
	 * allocation still fails, we stop retrying.
	 */
	if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
		return 1;

I wonder if we should add a check to also be sure that the pages we are
reclaiming, when __GFP_THISNODE is set, are actually coming from the
requested node, i.e. something like:

	if (gfp_mask & __GFP_THISNODE &&
	    <the progress we have made is on the node requested?>)

(a rough, untested sketch of what I mean is in the P.S. below)

3) did_some_progress could be updated to track where the progress is
occurring, and if we are in a __GFP_THISNODE allocation request and we
didn't make any progress on the requested node, we fail the allocation?

I think this situation could be reproduced (and I am working on it) by
exhausting a NUMA node with 16M hugepages and then using the generic
round-robin allocator to ask for more. Other node-exhaustion cases
probably exist, but since we can't swap the hugepages out, this seems
like the most straightforward way to try and reproduce it.

Any thoughts on this? Am I way off base?

Thanks,
Nish
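
P.S. To make options 2/3 slightly more concrete, below is a rough,
untested (not even compile-tested) sketch of the kind of node-aware
retry check I have in mind. The "progress_nodes" nodemask and the
"preferred_nid" argument are hypothetical: they assume we plumb a
nodemask out of try_to_free_pages() recording which nodes reclaim
actually freed pages on, which is completely hand-waved here. This is
only meant to illustrate the shape of the check, not a real patch
against should_alloc_retry():

	/*
	 * Hypothetical variant of should_alloc_retry(): in addition to
	 * did_some_progress/pages_reclaimed, reclaim would hand back a
	 * nodemask of the nodes it actually freed pages on.
	 */
	static inline int
	should_alloc_retry_node(gfp_t gfp_mask, unsigned int order,
				unsigned long did_some_progress,
				unsigned long pages_reclaimed,
				int preferred_nid,
				const nodemask_t *progress_nodes)
	{
		/* Do not loop if specifically requested */
		if (gfp_mask & __GFP_NORETRY)
			return 0;

		/* Always retry if specifically requested */
		if (gfp_mask & __GFP_NOFAIL)
			return 1;

		/*
		 * If the caller pinned the allocation to one node and
		 * reclaim made no progress on that node, retrying cannot
		 * help: any progress made elsewhere (e.g. kswapd running
		 * on other nodes) is irrelevant to this request, which is
		 * exactly the exhausted-node case above.
		 */
		if ((gfp_mask & __GFP_THISNODE) &&
		    !node_isset(preferred_nid, *progress_nodes))
			return 0;

		/*
		 * Existing behaviour: for __GFP_REPEAT and a costly order,
		 * retry until we have reclaimed at least as many pages as
		 * the request needs.
		 */
		if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
			return 1;

		return 0;
	}

An alternative to the nodemask would be a per-node pages_reclaimed
counter, so that the (1 << order) comparison could also be made
node-local when __GFP_THISNODE is set; that would be closer to option 3.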