* libnuma interleaving oddness @ 2006-08-29 23:15 UTC
From: Nishanth Aravamudan
To: clameter, ak
Cc: linux-mm, lnxninja, linuxppc-dev

[Sorry for the double-post, correcting Christoph's address]

Hi,

While trying to add NUMA-awareness to libhugetlbfs' morecore functionality
(hugepage-backed malloc), I ran into an issue on a ppc64 box with 8 memory
nodes, running SLES10. I am using two functions from libnuma:
numa_available() and numa_interleave_memory().

When I ask numa_interleave_memory() to interleave over all nodes
(numa_all_nodes is the nodemask from libnuma), it exhausts node 0, then
moves to node 1, then node 2, etc., until the allocations are satisfied.
If I custom-generate a nodemask such that bits 1 through 7 are set, but
bit 0 is not, then I get proper interleaving, where the first hugepage is
on node 1, the second is on node 2, etc. Similarly, if I set bits 0
through 6 in a custom nodemask, interleaving works across the requested 7
nodes. But it has yet to work across all 8.

I don't know if this is a libnuma bug (I extracted the code from libnuma,
it looked sane, and even reimplemented it in libhugetlbfs for testing
purposes, but got the same results), a NUMA kernel bug (mbind is some
hairy code...), a ppc64 bug, or maybe not a bug at all. Regardless, I'm
getting somewhat inconsistent behavior. I can provide more debugging
output, or whatever is requested, but I wasn't sure what to include. I'm
hoping someone has heard of or seen something similar?

The test application I'm using makes some mallopt calls, then just
mallocs large chunks in a loop (4096 * 100 bytes). libhugetlbfs is
LD_PRELOAD'd so that we can override malloc.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
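For reference, a minimal test program along the lines described above
might look like the following; the mallopt() tuning values, the iteration
count, and the preload mechanics are assumptions for illustration, not
the actual test case:

	/* hypothetical reconstruction of the malloc test described above */
	#include <malloc.h>
	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		int i;

		/* assumed tuning: keep glibc from servicing large requests
		 * with its own mmap() so the overridden morecore path from
		 * the LD_PRELOAD'd library is exercised */
		mallopt(M_MMAP_MAX, 0);
		mallopt(M_TRIM_THRESHOLD, -1);

		for (i = 0; i < 1000; i++) {
			char *p = malloc(4096 * 100);
			if (!p)
				break;
			memset(p, 1, 4096 * 100);	/* touch the memory */
		}
		return 0;
	}

Run with libhugetlbfs LD_PRELOAD'd and its hugepage-backed morecore
enabled, then inspect /proc/<pid>/numa_maps while the process is alive.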
* Re: libnuma interleaving oddness @ 2006-08-29 23:57 UTC
From: Christoph Lameter
To: Nishanth Aravamudan
Cc: linux-mm, lnxninja, ak, linuxppc-dev

On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:

> I don't know if this is a libnuma bug (I extracted out the code from
> libnuma, it looked sane; and even reimplemented it in libhugetlbfs for
> testing purposes, but got the same results) or a NUMA kernel bug (mbind
> is some hairy code...) or a ppc64 bug or maybe not a bug at all.
> Regardless, I'm getting somewhat inconsistent behavior. I can provide
> more debugging output, or whatever is requested, but I wasn't sure what
> to include. I'm hoping someone has heard of or seen something similar?

Are you setting the task's allocation policy before the allocation, or do
you set a VMA-based policy? The VMA-based policies will only work for
anonymous pages.
* Re: libnuma interleaving oddness @ 2006-08-30  0:21 UTC
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: linux-mm, lnxninja, ak, linuxppc-dev

On 29.08.2006 [16:57:35 -0700], Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > I don't know if this is a libnuma bug (I extracted out the code from
> > libnuma, it looked sane; and even reimplemented it in libhugetlbfs
> > for testing purposes, but got the same results) or a NUMA kernel bug
> > (mbind is some hairy code...) or a ppc64 bug or maybe not a bug at
> > all. Regardless, I'm getting somewhat inconsistent behavior. I can
> > provide more debugging output, or whatever is requested, but I
> > wasn't sure what to include. I'm hoping someone has heard of or seen
> > something similar?
>
> Are you setting the tasks allocation policy before the allocation or
> do you set a vma based policy? The vma based policies will only work
> for anonymous pages.

The order is (with necessary params filled in):

	p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );

	numa_interleave_memory(p, newsize);

	mlock(p, newsize); /* causes all the hugepages to be faulted in */

	munlock(p, newsize);

From what I gathered from the numa manpages, the interleave policy should
take effect on the mlock, as that is "fault-time" in this context. We're
forcing the fault, that is.

Does that answer your question? Sorry if I'm unclear, I'm a bit of a
newbie to the VM.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
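A self-contained sketch of that sequence (outside libhugetlbfs, link with
-lnuma) looks roughly like the following; the hugetlbfs mount point, the
16MB hugepage size, and the mapping size are assumptions, and the
2006-era libnuma v1 nodemask API is used since that is what the thread is
using:

	#include <numa.h>
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <stdio.h>

	#define HPAGE_SIZE (16UL * 1024 * 1024)	/* 16MB hugepages on this ppc64 box */

	int main(void)
	{
		size_t newsize = 8 * HPAGE_SIZE;
		void *p;
		int fd;

		if (numa_available() < 0)
			return 1;

		/* an unlinked file on a hugetlbfs mount backs the "heap" */
		fd = open("/hugetlbfs/interleave-test", O_CREAT | O_RDWR, 0600);
		if (fd < 0)
			return 1;
		unlink("/hugetlbfs/interleave-test");

		p = mmap(NULL, newsize, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* VMA interleave policy over all nodes, then force the faults */
		numa_interleave_memory(p, newsize, &numa_all_nodes);
		mlock(p, newsize);
		munlock(p, newsize);

		pause();	/* inspect /proc/<pid>/numa_maps while this sleeps */
		return 0;
	}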
* Re: libnuma interleaving oddness @ 2006-08-30  2:26 UTC
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: linux-mm, lnxninja, ak, linuxppc-dev

On 29.08.2006 [17:21:10 -0700], Nishanth Aravamudan wrote:
> On 29.08.2006 [16:57:35 -0700], Christoph Lameter wrote:
> > Are you setting the tasks allocation policy before the allocation or
> > do you set a vma based policy? The vma based policies will only work
> > for anonymous pages.
>
> The order is (with necessary params filled in):
>
> 	p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
>
> 	numa_interleave_memory(p, newsize);
>
> 	mlock(p, newsize); /* causes all the hugepages to be faulted in */
>
> 	munlock(p, newsize);
>
> From what I gathered from the numa manpages, the interleave policy
> should take effect on the mlock, as that is "fault-time" in this
> context. We're forcing the fault, that is.

For some more data, I did some manipulations of libhugetlbfs and came up
with the following:

If I use the default hugepage-aligned hugepage-backed malloc replacement,
I get the following in /proc/pid/numa_maps (excerpt):

20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
...
37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1

If I change the nodemask to 1-7, I get:

20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
...
35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1

If I then change our malloc implementation to (unnecessarily) mmap a size
aligned to 4 hugepages, rather than aligned to a single hugepage, but
using a nodemask of 0-7, I get:

20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1

It seems rather odd that it's this inconsistent, and that I'm the only
one seeing it as such :)

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness @ 2006-08-30  4:26 UTC
From: Christoph Lameter
To: Nishanth Aravamudan
Cc: linux-mm, lnxninja, ak, linuxppc-dev

On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:

> If I use the default hugepage-aligned hugepage-backed malloc
> replacement, I get the following in /proc/pid/numa_maps (excerpt):
>
> 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> ...
> 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1

Is this with nodemask set to [0]?

> If I change the nodemask to 1-7, I get:
>
> 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> ...
> 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1

So interleave has an effect.

Are you using cpusets? Or are you only using memory policies? What is the
default policy of the task you are running?

> If I then change our malloc implementation to (unnecessarily) mmap a
> size aligned to 4 hugepages, rather than aligned to a single hugepage,
> but using a nodemask of 0-7, I get:
>
> 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1

Hmm... Strange. Interleaving should continue after the last one....
* Re: libnuma interleaving oddness @ 2006-08-30  5:31 UTC
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: linux-mm, lnxninja, ak, linuxppc-dev

On 29.08.2006 [21:26:58 -0700], Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > If I use the default hugepage-aligned hugepage-backed malloc
> > replacement, I get the following in /proc/pid/numa_maps (excerpt):
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > ...
> > 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
>
> Is this with nodemask set to [0]?

nodemask was set to 0xFF, effectively, bits 0-7 set, all others cleared.

Just to make sure that I'm not misunderstanding, that's what the
interleave=0-7 also indicates, right? That the particular memory area was
specified to interleave over those nodes, if possible, and then at the
end of each line are the nodes that it actually was placed on?

> > If I change the nodemask to 1-7, I get:
> >
> > 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> > 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> > 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> > 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> > ...
> > 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
>
> So interleave has an effect.

Yup, exactly -- and that's the confusing part. I was willing to write it
off as being some sort of mistake on my part, but all I have to do is
clear any one bit between 0 and 7, and I get the interleaving I expect.
That's what leads me to conclude there is a bug, but after a lot of
looking at libnuma and the mbind() system call, I couldn't see the
problem.

> Are you using cpusets? Or are you only using memory policies? What is
> the default policy of the task you are running?

No cpusets, only memory policies. The test application that is exhibiting
this behavior is *really* simple, and doesn't specifically set a memory
policy, so I assume it's MPOL_DEFAULT?

> > If I then change our malloc implementation to (unnecessarily) mmap a
> > size aligned to 4 hugepages, rather than aligned to a single
> > hugepage, but using a nodemask of 0-7, I get:
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1
>
> Hmm... Strange. Interleaving should continue after the last one....

"last one" being the last allocation, or the last node? My understanding
of what is happening in this case is that interleave is working, but in a
way different from the immediately previous example. Here we're
interleaving within the allocation, so each of the 4 hugepages goes on a
different node. When the next allocation comes through, we start back
over at node 0 (given the previous results, I would have thought it would
have gone N0,N1,N2,N3 then N4,N5,N6,N7 then back to N0,N1,N2,N3).

Also, note that in this last case, in case I wasn't clear before, I was
artificially inflating our consumption of hugepages per allocation, just
to see what happened.

I should also mention this is the SuSE kernel, too, so 2.6.16-ish. If
there are sufficient changes in this area between there and mainline, I
can try and get the box rebooted into 2.6.18-rc5.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness @ 2006-08-30  5:40 UTC
From: Tim Pepper
To: Christoph Lameter
Cc: linuxppc-dev, Nishanth Aravamudan, ak, linux-mm

On 8/29/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > If I use the default hugepage-aligned hugepage-backed malloc
> > replacement, I get the following in /proc/pid/numa_maps (excerpt):
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > ...
> > 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
>
> Is this with nodemask set to [0]?

The above is with a nodemask of 0-7. Just removing node 0 from the mask
causes interleaving to start, as below:

> > If I change the nodemask to 1-7, I get:
> >
> > 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> > 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> > 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> > 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> > ...
> > 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
>
> So interleave has an effect.
>
> Are you using cpusets? Or are you only using memory policies? What is
> the default policy of the task you are running?

Just memory policies with the default task policy...really simple code.
The current incantation basically does setup in the form of:

	numa_available();
	nodemask_zero(&nodemask);
	for (i = 0; i <= maxnode; i++)
		nodemask_set(&nodemask, i);

and then creates mmaps followed by:

	numa_interleave_memory(p, size, &nodemask);
	mlock(p, size);
	munlock(p, size);

to get the pages faulted in.

> Hmm... Strange. Interleaving should continue after the last one....

That's what we thought...good to know we're not crazy. We've spent a lot
of time looking at libnuma and the userspace side of things trying to
figure out if we were somehow passing an invalid nodemask into the
kernel, but we've pretty well convinced ourselves that is not the case.
The kernel side of things (eg: the sys_mbind() codepath) isn't exactly
obvious...code inspection's been a bit gruelling...we need to do
kernel-side probing to see what codepaths we're actually hitting.

An interesting additional point: Nish's code originally wasn't using
libnuma, and I wrote a simple little mmapping test program using libnuma
to compare results (thinking it was a userspace issue). My code worked
fine. He rewrote his to use libnuma and I rewrote mine to not use
libnuma, thinking we'd find the problem in between. Yet my code still
gets interleaving and his does not. The only real difference between our
code is that mine basically does:

	mmap(...many hugepages...)

and Nish's effectively is doing:

	foreach(1..n) { mmap(...many/n hugepages...) }

if that pseudocode makes sense. As above, when he changes his mmap to
grab more than one hugepage of memory at a time, he starts seeing
interleaving.

Tim
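The difference between the two test programs, reduced to its essentials,
is roughly the following (illustrative fragments only; fd is an open
hugetlbfs file descriptor, hpage the hugepage size, and error handling is
omitted; the libnuma v1 nodemask_t API is assumed):

	#include <numa.h>
	#include <sys/mman.h>
	#include <sys/types.h>

	/* one large mapping: the faults interleave across the mask */
	static void *map_all_at_once(int fd, size_t hpage, int n, nodemask_t *nm)
	{
		void *p = mmap(NULL, n * hpage, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, 0);
		if (p != MAP_FAILED)
			numa_interleave_memory(p, n * hpage, nm);
		return p;
	}

	/* morecore-style: one hugepage-sized VMA per expansion, each at its
	 * own hugepage-aligned file offset; in this thread, with an 8-node
	 * mask, every such VMA was observed to land on node 0 */
	static void *map_one_hugepage(int fd, size_t hpage, int i, nodemask_t *nm)
	{
		void *p = mmap(NULL, hpage, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, (off_t)i * hpage);
		if (p != MAP_FAILED)
			numa_interleave_memory(p, hpage, nm);
		return p;
	}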
* Re: libnuma interleaving oddness @ 2006-08-30  7:19 UTC
From: Andi Kleen
To: Nishanth Aravamudan
Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter

> The order is (with necessary params filled in):
>
> 	p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
>
> 	numa_interleave_memory(p, newsize);
>
> 	mlock(p, newsize); /* causes all the hugepages to be faulted in */
>
> 	munlock(p, newsize);
>
> From what I gathered from the numa manpages, the interleave policy
> should take effect on the mlock, as that is "fault-time" in this
> context. We're forcing the fault, that is.

mlock shouldn't be needed at all here. The new hugetlbfs is supposed to
reserve at mmap time, and numa_interleave_memory() sets a VMA policy,
which should do the right thing no matter when the fault occurs.

Hmm, maybe mlock()'s policy handling is broken.

-Andi
* Re: libnuma interleaving oddness @ 2006-08-30  7:29 UTC
From: Nishanth Aravamudan
To: Andi Kleen
Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter

On 30.08.2006 [09:19:13 +0200], Andi Kleen wrote:
> > The order is (with necessary params filled in):
> >
> > 	p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
> >
> > 	numa_interleave_memory(p, newsize);
> >
> > 	mlock(p, newsize); /* causes all the hugepages to be faulted in */
> >
> > 	munlock(p, newsize);
> >
> > From what I gathered from the numa manpages, the interleave policy
> > should take effect on the mlock, as that is "fault-time" in this
> > context. We're forcing the fault, that is.
>
> mlock shouldn't be needed at all here. The new hugetlbfs is supposed
> to reserve at mmap time, and numa_interleave_memory() sets a VMA
> policy, which should do the right thing no matter when the fault
> occurs.

Ok.

> Hmm, maybe mlock()'s policy handling is broken.

I took out the mlock() call, and I get the same results, FWIW.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness @ 2006-08-30  7:32 UTC
From: Andi Kleen
To: Nishanth Aravamudan
Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter

On Wednesday 30 August 2006 09:29, Nishanth Aravamudan wrote:
> > Hmm, maybe mlock()'s policy handling is broken.
>
> I took out the mlock() call, and I get the same results, FWIW.

Then it's probably some new problem in hugetlbfs.

Does it work with shmfs?

The regression test for hugetlbfs in numactl is unfortunately still
disabled. I need to enable it at some point for hugetlbfs, now that it
has reached mainline.

-Andi
* Re: libnuma interleaving oddness @ 2006-08-30 18:01 UTC
From: Tim Pepper
To: Andi Kleen
Cc: linux-mm, Nishanth Aravamudan, linuxppc-dev, Christoph Lameter

On 8/30/06, Andi Kleen <ak@suse.de> wrote:
> Then it's probably some new problem in hugetlbfs.

It's something subtle though, because I _am_ able to get interleaving on
hugetlbfs with a slightly simplified test case (see previous email)
compared to Nish's.

> Does it work with shmfs?

Haven't tried shmfs, but the following correctly does the expected
interleaving with hugepages (although not hugetlbfs backed):

	shmid = shmget( 0, NR_HUGE_PAGES, IPC_CREAT | SHM_HUGETLB | 0666 );
	shmat_addr = shmat( shmid, NULL, 0 );
	...
	numa_interleave_memory( shmat_addr, SHM_SIZE, &nm );

I'd expect it works fine with non-huge pages, shmfs.

> The regression test for hugetlbfs in numactl is unfortunately still
> disabled. I need to enable it at some point for hugetlbfs, now that it
> has reached mainline.

On my list of random things to do is trying to improve the test coverage
in this area. We keep running into bugs or possible bugs or confusion on
expected behaviour. I'm going through the code trying to understand it
and writing little programs to confirm my understanding here and there
anyway.

Tim
* Re: libnuma interleaving oddness @ 2006-08-30 18:12 UTC
From: Andi Kleen
To: Tim Pepper
Cc: linux-mm, Nishanth Aravamudan, linuxppc-dev, Christoph Lameter

> On my list of random things to do is trying to improve the test
> coverage in this area. We keep running into bugs or possible bugs or
> confusion on expected behaviour. I'm going through the code trying to
> understand it and writing little programs to confirm my understanding
> here and there anyway.

numactl has a little regression test suite in test/* that tests a lot of
stuff, but not all. Feel free to extend it.

-Andi
* Re: libnuma interleaving oddness @ 2006-08-30 18:13 UTC
From: Adam Litke
To: Tim Pepper
Cc: linux-mm, Nishanth Aravamudan, Andi Kleen, linuxppc-dev, Christoph Lameter

On Wed, 2006-08-30 at 11:01 -0700, Tim Pepper wrote:
> On 8/30/06, Andi Kleen <ak@suse.de> wrote:
> > Then it's probably some new problem in hugetlbfs.
>
> It's something subtle though, because I _am_ able to get interleaving
> on hugetlbfs with a slightly simplified test case (see previous email)
> compared to Nish's.
>
> > Does it work with shmfs?
>
> Haven't tried shmfs, but the following correctly does the expected
> interleaving with hugepages (although not hugetlbfs backed):
>
> 	shmid = shmget( 0, NR_HUGE_PAGES, IPC_CREAT | SHM_HUGETLB | 0666 );
> 	shmat_addr = shmat( shmid, NULL, 0 );
> 	...
> 	numa_interleave_memory( shmat_addr, SHM_SIZE, &nm );
>
> I'd expect it works fine with non-huge pages, shmfs.

Actually, the above call will yield hugetlbfs-backed huge pages. The
kernel just prepares the hugetlbfs file for you. See
hugetlb_zero_setup().

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: libnuma interleaving oddness @ 2006-08-30 21:04 UTC
From: Christoph Lameter
To: Nishanth Aravamudan
Cc: linux-mm, lnxninja, Andi Kleen, linuxppc-dev

> I took out the mlock() call, and I get the same results, FWIW.

What zones are available on your box? Any with HIGHMEM?

Also, what kernel version are we talking about? Before 2.6.18?
* Re: libnuma interleaving oddness @ 2006-08-31  6:00 UTC
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: linux-mm, lnxninja, Andi Kleen, linuxppc-dev

On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > I took out the mlock() call, and I get the same results, FWIW.
>
> What zones are available on your box? Any with HIGHMEM?

How do I tell the available zones from userspace? This is ppc64 with
about 64GB of memory total, it looks like. So, none of the nodes
(according to /sys/devices/system/node/*/meminfo) have highmem.

> Also what kernel version are we talking about? Before 2.6.18?

The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
replies, sorry.

Tim and I spent most of this afternoon debugging the huge_zonelist()
callpath with kprobes and jprobes. We found the following via a jprobe to
offset_il_node():

jprobe: vma=0xc000000006dc2d78, pol->policy=0x3, pol->v.nodes=0xff, off=0x0
jprobe: vma=0xc00000000f247e30, pol->policy=0x3, pol->v.nodes=0xff, off=0x1000
jprobe: vma=0xc000000006dbf648, pol->policy=0x3, pol->v.nodes=0xff, off=0x2000
...
jprobe: vma=0xc00000000f298870, pol->policy=0x3, pol->v.nodes=0xff, off=0x17000
jprobe: vma=0xc00000000f298368, pol->policy=0x3, pol->v.nodes=0xff, off=0x18000

So, it's quite clear that the nodemask is set appropriately, and so is
the policy. The problem, in fact, is the "offset" being passed into
offset_il_node(). The problem, I think, is from interleave_nid():

	off = vma->vm_pgoff;
	off += (addr - vma->vm_start) >> shift;
	return offset_il_node(pol, vma, off);

For hugetlbfs VMAs, since vm_pgoff is in units of small pages, the lower
(HPAGE_SHIFT - PAGE_SHIFT) bits of vma->vm_pgoff and off will always be
zero (12 bits in this case). Thus, when we get into offset_il_node():

	unsigned nnodes = nodes_weight(pol->v.nodes);
	unsigned target = (unsigned)off % nnodes;
	int c;
	int nid = -1;

	c = 0;
	do {
		nid = next_node(nid, pol->v.nodes);
		c++;
	} while (c <= target);
	return nid;

nnodes is 8 (the number of nodes). Our offset (some multiple of 4096) is
always going to be evenly divided by 8. So, our target node is always
node 0! Note that when we took out a bit in our nodemask, nnodes changed
accordingly, 7 did not evenly divide the offset, and we got interleaving
as expected.

To test my hypothesis (my analysis may be a bit hand-wavy, sorry), I
changed interleave_nid() to shift off right by (HPAGE_SHIFT - PAGE_SHIFT)
only #if CONFIG_HUGETLBFS. This fixes the behavior for the page-by-page
case, but I'm not sure it is an acceptable mainline change, so I've
included my signed-off-but-not-for-inclusion patch.

Note that when I try this with my testcase that makes each allocation 4
hugepages large, I get 4 hugepages on node 0, then 4 on node 4, then 4 on
node 0, and so on. I believe this is because the offset ends up being the
same for all of the 4 hugepages in each set, so they go to the same node.

Many thanks to Tim for his help debugging.

---

Once again, not for inclusion!

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c	2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c	2006-08-30 22:56:43.000000000 -0700
@@ -1169,6 +1169,7 @@ static unsigned offset_il_node(struct me
 	return nid;
 }
 
+#ifndef CONFIG_HUGETLBFS
 /* Determine a node number for interleave */
 static inline unsigned interleave_nid(struct mempolicy *pol,
 	 struct vm_area_struct *vma, unsigned long addr, int shift)
@@ -1182,8 +1183,22 @@ static inline unsigned interleave_nid(st
 	} else
 		return interleave_nodes(pol);
 }
+#else
+/* Determine a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+	 struct vm_area_struct *vma, unsigned long addr, int shift)
+{
+	if (vma) {
+		unsigned long off;
+
+		off = vma->vm_pgoff;
+		off += (addr - vma->vm_start) >> shift;
+		off >>= (HPAGE_SHIFT - PAGE_SHIFT);
+		return offset_il_node(pol, vma, off);
+	} else
+		return interleave_nodes(pol);
+}
 
-#ifdef CONFIG_HUGETLBFS
 /* Return a zonelist suitable for a huge page allocation. */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
 {

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
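The arithmetic in that analysis can be checked entirely in userspace; the
following throwaway program (assuming HPAGE_SHIFT=24 and PAGE_SHIFT=12,
as on the box above) just prints what "off % nnodes" does for
hugepage-aligned offsets:

	#include <stdio.h>

	int main(void)
	{
		const unsigned hpage_in_small_pages = 1u << (24 - 12);	/* 4096 */
		unsigned i;

		for (i = 0; i < 10; i++) {
			/* vm_pgoff of the i-th hugepage, in small-page units */
			unsigned off = i * hpage_in_small_pages;
			printf("hugepage %2u: off=%#8x  off %% 8 = %u  off %% 7 = %u\n",
			       i, off, off % 8, off % 7);
		}
		return 0;
	}

Since 4096 is a multiple of 8, "off % 8" is 0 for every hugepage (always
node 0), while "off % 7" advances by one per hugepage, matching the
round-robin seen as soon as any one node is dropped from the mask.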
* Re: libnuma interleaving oddness @ 2006-08-31  7:47 UTC
From: Andi Kleen
To: Nishanth Aravamudan
Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter

On Thursday 31 August 2006 08:00, Nishanth Aravamudan wrote:
> On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > I took out the mlock() call, and I get the same results, FWIW.
> >
> > What zones are available on your box? Any with HIGHMEM?
>
> How do I tell the available zones from userspace? This is ppc64 with
> about 64GB of memory total, it looks like. So, none of the nodes
> (according to /sys/devices/system/node/*/meminfo) have highmem.

The zones are listed at the beginning of dmesg:

	"On node X totalpages: ...
	 DMA zone: ...
	 ..."

-Andi
* Re: libnuma interleaving oddness @ 2006-08-31 15:49 UTC
From: Nishanth Aravamudan
To: Andi Kleen
Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter

On 31.08.2006 [09:47:30 +0200], Andi Kleen wrote:
> On Thursday 31 August 2006 08:00, Nishanth Aravamudan wrote:
> > How do I tell the available zones from userspace? This is ppc64 with
> > about 64GB of memory total, it looks like. So, none of the nodes
> > (according to /sys/devices/system/node/*/meminfo) have highmem.
>
> The zones are listed at the beginning of dmesg

Page orders: linear mapping = 24, others = 12
<snip>
[boot]0100 MM Init
[boot]0100 MM Init Done
Linux version 2.6.16.21-0.8-ppc64 (geeko@buildhost) (gcc version 4.1.0 (SUSE Linux)) #1 SMP Mon Jul 3 18:25:39 UTC 2006
[boot]0012 Setup Arch
Node 0 Memory: 0x0-0x1b0000000
Node 1 Memory: 0x1b0000000-0x3b0000000
Node 2 Memory: 0x3b0000000-0x5b0000000
Node 3 Memory: 0x5b0000000-0x7b0000000
Node 4 Memory: 0x7b0000000-0x9a0000000
Node 5 Memory: 0x9a0000000-0xba0000000
Node 6 Memory: 0xba0000000-0xda0000000
Node 7 Memory: 0xda0000000-0xf90000000
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Using dedicated idle loop
On node 0 totalpages: 1769472
  DMA zone: 1769472 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 1 totalpages: 2097152
  DMA zone: 2097152 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 2 totalpages: 2097152
  DMA zone: 2097152 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 3 totalpages: 2097152
  DMA zone: 2097152 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 4 totalpages: 2031616
  DMA zone: 2031616 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 5 totalpages: 2097152
  DMA zone: 2097152 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 6 totalpages: 2097152
  DMA zone: 2097152 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
On node 7 totalpages: 2031616
  DMA zone: 2031616 pages, LIFO batch:31
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 0 pages, LIFO batch:0
  HighMem zone: 0 pages, LIFO batch:0
[boot]0015 Setup Done
Built 8 zonelists

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) @ 2006-08-31 16:00 UTC
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: akpm, linuxppc-dev, Andi Kleen, linux-mm, lnxninja

On 30.08.2006 [23:00:36 -0700], Nishanth Aravamudan wrote:
> On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > I took out the mlock() call, and I get the same results, FWIW.
> >
> > What zones are available on your box? Any with HIGHMEM?
>
> How do I tell the available zones from userspace? This is ppc64 with
> about 64GB of memory total, it looks like. So, none of the nodes
> (according to /sys/devices/system/node/*/meminfo) have highmem.
>
> > Also what kernel version are we talking about? Before 2.6.18?
>
> The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
> replies, sorry.
>
> Tim and I spent most of this afternoon debugging the huge_zonelist()
> callpath with kprobes and jprobes. We found the following via a jprobe
> to offset_il_node():

<snip lengthy previous discussion>

Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
bad offsets to the interleave functions. Take this difference from small
pages into account when calculating the offset. This does add a 0-bit
shift into the small-page path (via alloc_page_vma()), but I think that
is negligible. Also add a BUG_ON to prevent the offset from growing due
to a negative right-shift, which probably shouldn't be allowed anyway.

Tested on an 8-memory-node ppc64 NUMA box and got the interleaving I
expected.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

---

Results with this patch applied, which shouldn't go into the changelog, I
don't think:

for the 4-hugepages-at-a-time case:

20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N4=1 N5=1 N6=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1

for the 1-hugepage-at-a-time case:

20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1
21000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N1=1
22000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N2=1
23000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N4=1
25000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N5=1
26000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N6=1
27000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1

Andrew, can we get this into 2.6.18?

diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c	2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c	2006-08-31 08:46:22.000000000 -0700
@@ -1176,7 +1176,15 @@ static inline unsigned interleave_nid(st
 	if (vma) {
 		unsigned long off;
 
-		off = vma->vm_pgoff;
+		/*
+		 * for small pages, there is no difference between
+		 * shift and PAGE_SHIFT, so the bit-shift is safe.
+		 * for huge pages, since vm_pgoff is in units of small
+		 * pages, we need to shift off the always 0 bits to get
+		 * a useful offset.
+		 */
+		BUG_ON(shift < PAGE_SHIFT);
+		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
 		off += (addr - vma->vm_start) >> shift;
 		return offset_il_node(pol, vma, off);
 	} else

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
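As a cross-check against the 4-hugepages-per-mmap output above, the
patched calculation can be replayed in userspace (assuming HPAGE_SHIFT=24,
PAGE_SHIFT=12, 8 nodes; the mapping index and page index stand in for
vm_pgoff and (addr - vm_start)):

	#include <stdio.h>

	#define PAGE_SHIFT	12
	#define HPAGE_SHIFT	24
	#define NNODES		8

	int main(void)
	{
		int vma, page;

		for (vma = 0; vma < 3; vma++) {
			for (page = 0; page < 4; page++) {
				/* vm_pgoff of this mapping, in small-page units */
				unsigned long pgoff = (unsigned long)(vma * 4)
						<< (HPAGE_SHIFT - PAGE_SHIFT);
				/* patched interleave_nid(): both terms in hugepages */
				unsigned long off = (pgoff >> (HPAGE_SHIFT - PAGE_SHIFT))
						+ page;
				printf("mapping %d, hugepage %d -> node %lu\n",
				       vma, page, off % NNODES);
			}
		}
		return 0;
	}

This prints nodes 0-3 for the first mapping, 4-7 for the second, and 0-3
again for the third, which is exactly the dirty=4 N0..N3 / N4..N7 pattern
in the quoted numa_maps.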
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) @ 2006-08-31 16:08 UTC
From: Adam Litke
To: Nishanth Aravamudan
Cc: akpm, linux-mm, Andi Kleen, linuxppc-dev, lnxninja, Christoph Lameter

On Thu, 2006-08-31 at 09:00 -0700, Nishanth Aravamudan wrote:
> Since vma->vm_pgoff is in units of small pages, VMAs for huge pages
> have the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which
> results in bad offsets to the interleave functions. Take this
> difference from small pages into account when calculating the offset.
> This does add a 0-bit shift into the small-page path (via
> alloc_page_vma()), but I think that is negligible. Also add a BUG_ON
> to prevent the offset from growing due to a negative right-shift,
> which probably shouldn't be allowed anyway.
>
> Tested on an 8-memory-node ppc64 NUMA box and got the interleaving I
> expected.
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

Acked-by: Adam Litke <agl@us.ibm.com>

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) @ 2006-08-31 16:19 UTC
From: Tim Pepper
To: Nishanth Aravamudan
Cc: akpm, linux-mm, Andi Kleen, linuxppc-dev, Christoph Lameter

On 8/31/06, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>

Acked-by: Tim Pepper <lnxninja@us.ibm.com>
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) @ 2006-08-31 16:37 UTC
From: Christoph Lameter
To: Nishanth Aravamudan
Cc: akpm, linuxppc-dev, Andi Kleen, linux-mm, lnxninja

On Thu, 31 Aug 2006, Nishanth Aravamudan wrote:

> Andrew, can we get this into 2.6.18?

Acked-by: Christoph Lameter <clameter@sgi.com>
* Re: libnuma interleaving oddness @ 2006-08-30 17:44 UTC
From: Adam Litke
To: Andi Kleen
Cc: linux-mm, Nishanth Aravamudan, lnxninja, linuxppc-dev, Christoph Lameter

On Wed, 2006-08-30 at 09:19 +0200, Andi Kleen wrote:
> > The order is (with necessary params filled in):
> >
> > 	p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
> >
> > 	numa_interleave_memory(p, newsize);
> >
> > 	mlock(p, newsize); /* causes all the hugepages to be faulted in */
> >
> > 	munlock(p, newsize);
> >
> > From what I gathered from the numa manpages, the interleave policy
> > should take effect on the mlock, as that is "fault-time" in this
> > context. We're forcing the fault, that is.
>
> mlock shouldn't be needed at all here. The new hugetlbfs is supposed
> to reserve at mmap time, and numa_interleave_memory() sets a VMA
> policy, which should do the right thing no matter when the fault
> occurs.

mmap-time reservation of huge pages is done only for shared mappings.
MAP_PRIVATE mappings have full-overcommit semantics. We use the mlock
call to "guarantee" the MAP_PRIVATE memory to the process. If mlock
fails, we simply unmap the hugetlb region and tell glibc to revert to its
normal allocation method (mmap normal pages).

> Hmm, maybe mlock()'s policy handling is broken.

The policy decision is made further down than mlock. As each huge page is
allocated from the static pool, the policy is consulted to see from which
node to pop a huge page. The function huge_zonelist() seems to
encapsulate the NUMA policy logic, and after sniffing the code it looks
right to me.

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
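A sketch of the guarantee-or-fall-back behaviour described above (not the
actual libhugetlbfs code; the function name and fallback signal are made
up for illustration):

	#include <stddef.h>
	#include <sys/types.h>
	#include <sys/mman.h>

	/* map one morecore chunk from the unlinked hugetlbfs heap file */
	static void *hugetlb_morecore_chunk(int fd, off_t off, size_t len)
	{
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE, fd, off);

		if (p == MAP_FAILED)
			return NULL;

		/* MAP_PRIVATE hugetlb pages are not reserved at mmap() time,
		 * so touch them now; if the pool is exhausted, hand the VMA
		 * back and let glibc revert to ordinary pages */
		if (mlock(p, len) != 0) {
			munmap(p, len);
			return NULL;
		}
		munlock(p, len);
		return p;
	}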
* Re: libnuma interleaving oddness @ 2006-08-30  7:16 UTC
From: Andi Kleen
To: Christoph Lameter
Cc: linuxppc-dev, Nishanth Aravamudan, lnxninja, linux-mm

On Wednesday 30 August 2006 01:57, Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > I don't know if this is a libnuma bug (I extracted out the code from
> > libnuma, it looked sane; and even reimplemented it in libhugetlbfs
> > for testing purposes, but got the same results) or a NUMA kernel bug
> > (mbind is some hairy code...) or a ppc64 bug or maybe not a bug at
> > all. Regardless, I'm getting somewhat inconsistent behavior. I can
> > provide more debugging output, or whatever is requested, but I
> > wasn't sure what to include. I'm hoping someone has heard of or seen
> > something similar?
>
> Are you setting the tasks allocation policy before the allocation or
> do you set a vma based policy? The vma based policies will only work
> for anonymous pages.

They should work for hugetlb/shmfs too. At least they did when I
originally wrote it. But the original patch I did for hugetlbfs for that
was never merged, and I admit I have never rechecked whether it works
with the patchkit that was merged later.

The problem originally was that hugetlbfs needed to be changed to do
allocate-on-demand instead of allocate-on-mmap, because mbind() comes
after mmap(), and when mmap() already allocates, it can't work.

-Andi