Subject: Re: libnuma interleaving oddness
From: Adam Litke
To: Andi Kleen
Cc: linux-mm@kvack.org, Nishanth Aravamudan, lnxninja@us.ibm.com, linuxppc-dev@ozlabs.org, Christoph Lameter
Date: Wed, 30 Aug 2006 12:44:10 -0500
Message-Id: <1156959851.7185.8647.camel@localhost.localdomain>
In-Reply-To: <200608300919.13125.ak@suse.de>
References: <20060829231545.GY5195@us.ibm.com> <20060830002110.GZ5195@us.ibm.com> <200608300919.13125.ak@suse.de>
List-Id: Linux on PowerPC Developers Mail List

On Wed, 2006-08-30 at 09:19 +0200, Andi Kleen wrote:
> mous pages.
> >
> > The order is (with necessary params filled in):
> >
> > p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
> >
> > numa_interleave_memory(p, newsize);
> >
> > mlock(p, newsize); /* causes all the hugepages to be faulted in */
> >
> > munlock(p,newsize);
> >
> > From what I gathered from the numa manpages, the interleave policy
> > should take effect on the mlock, as that is "fault-time" in this
> > context. We're forcing the fault, that is.
>
> mlock shouldn't be needed at all here. the new hugetlbfs is supposed
> to reserve at mmap time and numa_interleave_memory() sets a VMA
> policy which will should do the right thing no matter when the fault
> occurs.

mmap-time reservation of huge pages is done only for shared mappings.
MAP_PRIVATE mappings have full-overcommit semantics. We use the mlock
call to "guarantee" the MAP_PRIVATE memory to the process. If mlock
fails, we simply unmap the hugetlb region and tell glibc to revert to
its normal allocation method (mmap normal pages).

> Hmm, maybe mlock() policy() is broken.

The policy decision is made further down than mlock. As each huge page
is allocated from the static pool, the policy is consulted to see from
which node to pop a huge page. The function huge_zonelist() seems to
encapsulate the numa policy logic and after sniffing the code, it looks
right to me.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
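
For readers following the thread, here is a minimal sketch of the "mlock to guarantee, otherwise unmap and fall back" sequence described above. It is an illustration under stated assumptions, not the actual glibc/libhugetlbfs code: the helper name map_and_prefault() is hypothetical, an anonymous mapping stands in for the unlinked hugetlbfs file when fd is negative, and the libnuma call is left as a comment so the example builds without libnuma.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the original mail): map `size` bytes
 * privately from `fd`, or from anonymous memory when fd < 0, then use
 * mlock() to force every page to be faulted in right away.
 *
 * Returns the mapping on success, or NULL so that the caller can fall
 * back to its normal allocation method.
 */
static void *map_and_prefault(int fd, size_t size)
{
    int flags = MAP_PRIVATE | (fd < 0 ? MAP_ANONYMOUS : 0);
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, fd, 0);

    if (p == MAP_FAILED)
        return NULL;

    /*
     * A VMA policy would be installed here, before any page is touched,
     * e.g. with libnuma's numa_interleave_memory(), which also takes a
     * node mask. Omitted to keep this sketch dependency-free.
     */

    /* MAP_PRIVATE reserves nothing at mmap time, so fault it all in now. */
    if (mlock(p, size) != 0) {
        /* Not enough pages (or locking not permitted): undo the mapping. */
        munmap(p, size);
        return NULL;
    }

    /* The pages stay populated after unlocking; only the lock is dropped. */
    munlock(p, size);
    return p;
}
```

A caller would treat a NULL return as the signal to use ordinary pages instead. Whether mlock() succeeds depends on the locked-memory limit and on how many pages are free at that moment, so both outcomes have to be handled.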