* libnuma interleaving oddness
@ 2006-08-29 23:15 Nishanth Aravamudan
2006-08-29 23:57 ` Christoph Lameter
0 siblings, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-29 23:15 UTC (permalink / raw)
To: clameter, ak; +Cc: linux-mm, lnxninja, linuxppc-dev
[Sorry for the double-post, correcting Christoph's address]
Hi,
While trying to add NUMA-awareness to libhugetlbfs' morecore
functionality (hugepage-backed malloc), I ran into an issue on a
ppc64-box with 8 memory nodes, running SLES10. I am using two functions
from libnuma: numa_available() and numa_interleave_memory(). When I ask
numa_interleave_memory() to interleave over all nodes (numa_all_nodes is
the nodemask from libnuma), it exhausts node 0, then moves to node 1,
then node 2, etc, until the allocations are satisfied. If I custom
generate a nodemask, such that bits 1 through 7 are set, but bit 0 is
not, then I get proper interleaving, where the first hugepage is on node
1, the second is on node 2, etc. Similarly, if I set bits 0 through 6 in
a custom nodemask, interleaving works across the requested 7 nodes. But
it has yet to work across all 8.
I don't know if this is a libnuma bug (I extracted out the code from
libnuma, it looked sane; and even reimplemented it in libhugetlbfs for
testing purposes, but got the same results) or a NUMA kernel bug (mbind
is some hairy code...) or a ppc64 bug or maybe not a bug at all.
Regardless, I'm getting somewhat inconsistent behavior. I can provide
more debugging output, or whatever is requested, but I wasn't sure what
to include. I'm hoping someone has heard of or seen something similar?
The test application I'm using makes some mallopt calls, then just
mallocs large chunks in a loop (4096 * 100 bytes each). libhugetlbfs is
LD_PRELOAD'd so that we can override malloc.
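Roughly, the test looks like this (the specific mallopt parameters and
loop count here are illustrative, not the exact ones used):

    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
            int i;

            /* illustrative mallopt calls: steer glibc's large
             * allocations through morecore instead of mmap */
            mallopt(M_MMAP_MAX, 0);
            mallopt(M_TRIM_THRESHOLD, -1);

            for (i = 0; i < 100; i++) {
                    if (!malloc(4096 * 100))
                            abort();
            }
            return 0;
    }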
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-29 23:15 libnuma interleaving oddness Nishanth Aravamudan
@ 2006-08-29 23:57 ` Christoph Lameter
2006-08-30 0:21 ` Nishanth Aravamudan
2006-08-30 7:16 ` Andi Kleen
0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-08-29 23:57 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linux-mm, lnxninja, ak, linuxppc-dev
On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
> I don't know if this is a libnuma bug (I extracted out the code from
> libnuma, it looked sane; and even reimplemented it in libhugetlbfs for
> testing purposes, but got the same results) or a NUMA kernel bug (mbind
> is some hairy code...) or a ppc64 bug or maybe not a bug at all.
> Regardless, I'm getting somewhat inconsistent behavior. I can provide
> more debugging output, or whatever is requested, but I wasn't sure what
> to include. I'm hoping someone has heard of or seen something similar?
Are you setting the task's allocation policy before the allocation or do
you set a vma based policy? The vma based policies will only work for
anonymous pages.
* Re: libnuma interleaving oddness
2006-08-29 23:57 ` Christoph Lameter
@ 2006-08-30 0:21 ` Nishanth Aravamudan
2006-08-30 2:26 ` Nishanth Aravamudan
2006-08-30 7:19 ` Andi Kleen
2006-08-30 7:16 ` Andi Kleen
1 sibling, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-30 0:21 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, lnxninja, ak, linuxppc-dev
On 29.08.2006 [16:57:35 -0700], Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > I don't know if this is a libnuma bug (I extracted out the code from
> > libnuma, it looked sane; and even reimplemented it in libhugetlbfs
> > for testing purposes, but got the same results) or a NUMA kernel bug
> > (mbind is some hairy code...) or a ppc64 bug or maybe not a bug at
> > all. Regardless, I'm getting somewhat inconsistent behavior. I can
> > provide more debugging output, or whatever is requested, but I
> > wasn't sure what to include. I'm hoping someone has heard of or seen
> > something similar?
>
> Are you setting the task's allocation policy before the allocation or
> do you set a vma based policy? The vma based policies will only work
> for anonymous pages.
The order is (with necessary params filled in):
p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
numa_interleave_memory(p, newsize, &numa_all_nodes);
mlock(p, newsize); /* causes all the hugepages to be faulted in */
munlock(p, newsize);
From what I gathered from the numa manpages, the interleave policy
should take effect on the mlock, as that is "fault-time" in this
context. We're forcing the fault, that is.
Does that answer your question? Sorry if I'm unclear, I'm a bit of a
newbie to the VM.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-30 0:21 ` Nishanth Aravamudan
@ 2006-08-30 2:26 ` Nishanth Aravamudan
2006-08-30 4:26 ` Christoph Lameter
2006-08-30 7:19 ` Andi Kleen
1 sibling, 1 reply; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-30 2:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, lnxninja, ak, linuxppc-dev
On 29.08.2006 [17:21:10 -0700], Nishanth Aravamudan wrote:
> On 29.08.2006 [16:57:35 -0700], Christoph Lameter wrote:
> > On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
> >
> > > I don't know if this is a libnuma bug (I extracted out the code from
> > > libnuma, it looked sane; and even reimplemented it in libhugetlbfs
> > > for testing purposes, but got the same results) or a NUMA kernel bug
> > > (mbind is some hairy code...) or a ppc64 bug or maybe not a bug at
> > > all. Regardless, I'm getting somewhat inconsistent behavior. I can
> > > provide more debugging output, or whatever is requested, but I
> > > wasn't sure what to include. I'm hoping someone has heard of or seen
> > > something similar?
> >
> > Are you setting the task's allocation policy before the allocation or
> > do you set a vma based policy? The vma based policies will only work
> > for anonymous pages.
>
> The order is (with necessary params filled in):
>
> p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
>
> numa_interleave_memory(p, newsize, &numa_all_nodes);
>
> mlock(p, newsize); /* causes all the hugepages to be faulted in */
>
> munlock(p,newsize);
>
> From what I gathered from the numa manpages, the interleave policy
> should take effect on the mlock, as that is "fault-time" in this
> context. We're forcing the fault, that is.
For some more data, I did some manipulations of libhugetlbfs and came up
with the following:
If I use the default hugepage-aligned hugepage-backed malloc
replacement, I get the following in /proc/pid/numa_maps (excerpt):
20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
...
37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
If I change the nodemask to 1-7, I get:
20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
...
35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
If I then change our malloc implementation to (unnecessarily) mmap a
size aligned to 4 hugepages, rather than aligned to a single hugepage, but
using a nodemask of 0-7, I get:
20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1
It seems rather odd that it's this inconsistent, and that I'm the only
one seeing it as such :)
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-30 2:26 ` Nishanth Aravamudan
@ 2006-08-30 4:26 ` Christoph Lameter
2006-08-30 5:31 ` Nishanth Aravamudan
2006-08-30 5:40 ` Tim Pepper
0 siblings, 2 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-08-30 4:26 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linux-mm, lnxninja, ak, linuxppc-dev
On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
> If I use the default hugepage-aligned hugepage-backed malloc
> replacement, I get the following in /proc/pid/numa_maps (excerpt):
>
> 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> ...
> 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
Is this with nodemask set to [0]?
> If I change the nodemask to 1-7, I get:
>
> 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> ...
> 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
So interleave has an effect.
Are you using cpusets? Or are you only using memory policies? What is the
default policy of the task you are running?
> If I then change our malloc implementation to (unnecessarily) mmap a
> size aligned to 4 hugepages, rather than aligned to a single hugepage, but
> using a nodemask of 0-7, I get:
>
> 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1
Hmm... Strange. Interleaving should continue after the last one....
* Re: libnuma interleaving oddness
2006-08-30 4:26 ` Christoph Lameter
@ 2006-08-30 5:31 ` Nishanth Aravamudan
2006-08-30 5:40 ` Tim Pepper
1 sibling, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-30 5:31 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, lnxninja, ak, linuxppc-dev
On 29.08.2006 [21:26:58 -0700], Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > If I use the default hugepage-aligned hugepage-backed malloc
> > replacement, I get the following in /proc/pid/numa_maps (excerpt):
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > ...
> > 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
>
> Is this with nodemask set to [0]?
nodemask was set to 0xFF, effectively: bits 0-7 set, all others cleared.
Just to make sure that I'm not misunderstanding, that's what the
interleave=0-7 also indicates, right? That the particular memory area
was specified to interleave over those nodes, if possible, and then at
the end of each line are the nodes that it actually was placed on?
> > If I change the nodemask to 1-7, I get:
> >
> > 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> > 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> > 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> > 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> > ...
> > 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
>
> So interleave has an effect.
Yup, exactly -- and that's the confusing part. I was willing to write it
off as being some sort of mistake on my part, but all I have to do is
clear any one bit between 0 and 7, and I get the interleaving I expect.
That's what leads me to conclude there is a bug, but after a lot of
looking at libnuma and the mbind() system call, I couldn't see the
problem.
> Are you using cpusets? Or are you only using memory policies? What is
> the default policy of the task you are running?
No cpusets, only memory policies. The test application that is
exhibiting this behavior is *really* simple, and doesn't specifically
set a memory policy, so I assume it's MPOL_DEFAULT?
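(One way to double-check the task policy, if it matters, is the raw
get_mempolicy() call from <numaif.h>; a sketch:)

    #include <numaif.h>
    #include <stdio.h>

    int mode = -1;
    /* flags == 0: query the calling task's policy; addr is ignored */
    if (get_mempolicy(&mode, NULL, 0, NULL, 0) == 0)
            printf("task policy is %s\n",
                   mode == MPOL_DEFAULT ? "MPOL_DEFAULT" : "something else");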
> > If I then change our malloc implementation to (unnecessarily) mmap a
> > size aligned to 4 hugepages, rather than aligned to a single hugepage,
> > but using a nodemask of 0-7, I get:
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1
>
> Hmm... Strange. Interleaving should continue after the last one....
"last one" being the last allocation, or the last node? My understanding
of what is happening in this case is that interleave is working, but in
a way different from the immediately previous example. Here we're
interleaving within the allocation, so each of the 4 hugepages goes on a
different node. When the next allocation comes through, we start back
over at node 0 (given the previous results, I would have thought it
would have gone N0,N1,N2,N3 then N4,N5,N6,N7 then back to N0,N1,N2,N3).
Also, note that in this last case, in case I wasn't clear before, I was
artificially inflating our consumption of hugepages per allocation, just
to see what happened.
I should also mention this is the SuSE kernel, so 2.6.16-ish. If there
are significant changes in this area between that and mainline, I can
try to get the box rebooted into 2.6.18-rc5.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-30 4:26 ` Christoph Lameter
2006-08-30 5:31 ` Nishanth Aravamudan
@ 2006-08-30 5:40 ` Tim Pepper
1 sibling, 0 replies; 24+ messages in thread
From: Tim Pepper @ 2006-08-30 5:40 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linuxppc-dev, Nishanth Aravamudan, ak, linux-mm
On 8/29/06, Christoph Lameter <clameter@sgi.com> wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > If I use the default hugepage-aligned hugepage-backed malloc
> > replacement, I get the following in /proc/pid/numa_maps (excerpt):
> >
> > 20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > ...
> > 37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
> > 38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
>
> Is this with nodemask set to [0]?
The above is with a nodemask of 0-7. Just removing node 0 from the mask causes
interleaving to start as below:
> > If I change the nodemask to 1-7, I get:
> >
> > 20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
> > 24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
> > 25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
> > 26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
> > ...
> > 35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
> > 36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
> > 37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
> > 38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
>
> So interleave has an effect.
>
> Are you using cpusets? Or are you only using memory policies? What is the
> default policy of the task you are running?
Just memory policies with the default task policy...really simple
code. The current incantation basically does setup in the form of:
    numa_available();
    nodemask_zero(&nodemask);
    for (i = 0; i <= maxnode; i++)
            nodemask_set(&nodemask, i);
and then creates mmaps followed by:
    numa_interleave_memory(p, size, &nodemask);
    mlock(p, size);
    munlock(p, size);
to get the pages faulted in.
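Filled out a bit, the shape of that setup is roughly the following (a
sketch, not the exact test source; maxnode here comes from
numa_max_node()):

    #include <numa.h>
    #include <sys/mman.h>

    /* interleave a mapping p of length size (p comes from the
     * hugetlbfs mmap, not shown here) across all nodes */
    static void interleave_region(void *p, size_t size)
    {
            nodemask_t nodemask;
            int i, maxnode;

            if (numa_available() < 0)
                    return;                 /* no NUMA support */
            maxnode = numa_max_node();
            nodemask_zero(&nodemask);
            for (i = 0; i <= maxnode; i++)
                    nodemask_set(&nodemask, i);
            numa_interleave_memory(p, size, &nodemask);
            if (mlock(p, size) == 0)        /* fault the huge pages in */
                    munlock(p, size);
    }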
> Hmm... Strange. Interleaving should continue after the last one....
That's what we thought...good to know we're not crazy. We've spent a
lot of time looking at libnuma and the userspace side of things trying
to figure out if we were somehow passing an invalid nodemask into the
kernel, but we've pretty well convinced ourselves that is not the
case. The kernel side of things (e.g. the sys_mbind() codepath) isn't
exactly obvious... code inspection has been a bit gruelling... we need
to do kernel-side probing to see what codepaths we're actually hitting.
An interesting additional point: Nish's code originally wasn't using
libnuma and I wrote a simple little mmapping test program using
libnuma to compare results (thinking userspace issue). My code worked
fine. He rewrote to use libnuma and I rewrote to not use libnuma
thinking we'd find the problem in between. Yet my code still gets
interleaving and his does not. The only real difference between our
code is that mine basically does:
mmap(...many hugepages...)
and Nish's is effectively doing:
foreach(1..n) { mmap(...many/n hugepages...)}
if that pseudocode makes sense. As above, when he changes his mmap to
grab more than one hugepage of memory at a time he starts seeing
interleaving.
Tim
* Re: libnuma interleaving oddness
2006-08-30 0:21 ` Nishanth Aravamudan
2006-08-30 2:26 ` Nishanth Aravamudan
@ 2006-08-30 7:19 ` Andi Kleen
2006-08-30 7:29 ` Nishanth Aravamudan
2006-08-30 17:44 ` libnuma interleaving oddness Adam Litke
1 sibling, 2 replies; 24+ messages in thread
From: Andi Kleen @ 2006-08-30 7:19 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter
> > [...] The vma based policies will only work for anonymous pages.
>
> The order is (with necessary params filled in):
>
> p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
>
> numa_interleave_memory(p, newsize, &numa_all_nodes);
>
> mlock(p, newsize); /* causes all the hugepages to be faulted in */
>
> munlock(p,newsize);
>
> From what I gathered from the numa manpages, the interleave policy
> should take effect on the mlock, as that is "fault-time" in this
> context. We're forcing the fault, that is.
mlock shouldn't be needed at all here. The new hugetlbfs is supposed
to reserve at mmap time, and numa_interleave_memory() sets a VMA
policy which should do the right thing no matter when the fault
occurs.
Hmm, maybe the mlock() policy handling is broken.
-Andi
* Re: libnuma interleaving oddness
2006-08-30 7:19 ` Andi Kleen
@ 2006-08-30 7:29 ` Nishanth Aravamudan
2006-08-30 7:32 ` Andi Kleen
2006-08-30 21:04 ` Christoph Lameter
2006-08-30 17:44 ` libnuma interleaving oddness Adam Litke
1 sibling, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-30 7:29 UTC (permalink / raw)
To: Andi Kleen; +Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter
On 30.08.2006 [09:19:13 +0200], Andi Kleen wrote:
> > > [...] The vma based policies will only work for anonymous pages.
> >
> > The order is (with necessary params filled in):
> >
> > p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
> >
> > numa_interleave_memory(p, newsize, &numa_all_nodes);
> >
> > mlock(p, newsize); /* causes all the hugepages to be faulted in */
> >
> > munlock(p,newsize);
> >
> > From what I gathered from the numa manpages, the interleave policy
> > should take effect on the mlock, as that is "fault-time" in this
> > context. We're forcing the fault, that is.
>
> mlock shouldn't be needed at all here. The new hugetlbfs is supposed
> to reserve at mmap time, and numa_interleave_memory() sets a VMA policy
> which should do the right thing no matter when the fault occurs.
Ok.
> Hmm, maybe the mlock() policy handling is broken.
I took out the mlock() call, and I get the same results, FWIW.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-30 7:29 ` Nishanth Aravamudan
@ 2006-08-30 7:32 ` Andi Kleen
2006-08-30 18:01 ` Tim Pepper
2006-08-30 21:04 ` Christoph Lameter
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2006-08-30 7:32 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter
On Wednesday 30 August 2006 09:29, Nishanth Aravamudan wrote:
>
> > Hmm, maybe the mlock() policy handling is broken.
>
> I took out the mlock() call, and I get the same results, FWIW.
Then it's probably some new problem in hugetlbfs. Does it work with shmfs?
The regression test for hugetlbfs in numactl is unfortunately still disabled.
I need to enable it at some point for hugetlbfs now that it has reached mainline.
-Andi
* Re: libnuma interleaving oddness
2006-08-30 7:32 ` Andi Kleen
@ 2006-08-30 18:01 ` Tim Pepper
2006-08-30 18:12 ` Andi Kleen
2006-08-30 18:13 ` Adam Litke
0 siblings, 2 replies; 24+ messages in thread
From: Tim Pepper @ 2006-08-30 18:01 UTC (permalink / raw)
To: Andi Kleen; +Cc: linux-mm, Nishanth Aravamudan, linuxppc-dev, Christoph Lameter
On 8/30/06, Andi Kleen <ak@suse.de> wrote:
> Then it's probably some new problem in hugetlbfs.
It's something subtle though, because I _am_ able to get interleaving
on hugetlbfs with a slightly simplified test case (see previous email)
compared to Nish's.
> Does it work with shmfs?
Haven't tried shmfs, but the following correctly does the expected
interleaving with hugepages (although not hugetlbfs backed):
shmid = shmget( 0, NR_HUGE_PAGES, IPC_CREAT | SHM_HUGETLB | 0666 );
shmat_addr = shmat( shmid, NULL, 0 );
...
numa_interleave_memory( shmat_addr, SHM_SIZE, &nm );
I'd expect it works fine with non-huge pages, shmfs.
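For plain shmfs, the equivalent check would be roughly (sizes are
illustrative; nm is the same nodemask as above):

    /* non-huge SysV shm variant of the same check -- needs
     * <sys/ipc.h>, <sys/shm.h>, <string.h>, <numa.h> */
    size_t sz = 64 * 1024 * 1024;
    int shmid = shmget(IPC_PRIVATE, sz, IPC_CREAT | 0600);
    char *q = shmat(shmid, NULL, 0);
    numa_interleave_memory(q, sz, &nm);
    memset(q, 0, sz);    /* touch, so the pages fault in under the policy */
    /* then check the N<node>=<count> spread in /proc/<pid>/numa_maps */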
> The regression test for hugetlbfs in numactl is unfortunately still disabled.
> I need to enable it at some point for hugetlbfs now that it has reached mainline.
On my list of random things to do is trying to improve the test
coverage in this area. We keep running into bugs or possible bugs or
confusion on expected behaviour. I'm going through the code trying to
understand it and writing little programs to confirm my understanding
here and there anyway.
Tim
* Re: libnuma interleaving oddness
2006-08-30 18:01 ` Tim Pepper
@ 2006-08-30 18:12 ` Andi Kleen
2006-08-30 18:13 ` Adam Litke
1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2006-08-30 18:12 UTC (permalink / raw)
To: Tim Pepper; +Cc: linux-mm, Nishanth Aravamudan, linuxppc-dev, Christoph Lameter
> On my list of random things to do is trying to improve the test
> coverage in this area. We keep running into bugs or possible bugs or
> confusion on expected behaviour. I'm going through the code trying to
> understand it and writing little programs to confirm my understanding
> here and there anyway.
numactl has a little regression test suite in test/* that tests a lot of stuff,
but not all. Feel free to extend it.
-Andi
* Re: libnuma interleaving oddness
2006-08-30 18:01 ` Tim Pepper
2006-08-30 18:12 ` Andi Kleen
@ 2006-08-30 18:13 ` Adam Litke
1 sibling, 0 replies; 24+ messages in thread
From: Adam Litke @ 2006-08-30 18:13 UTC (permalink / raw)
To: Tim Pepper
Cc: linux-mm, Nishanth Aravamudan, Andi Kleen, linuxppc-dev,
Christoph Lameter
On Wed, 2006-08-30 at 11:01 -0700, Tim Pepper wrote:
> On 8/30/06, Andi Kleen <ak@suse.de> wrote:
> > Then it's probably some new problem in hugetlbfs.
>
> It's something subtle though, because I _am_ able to get interleaving
> on hugetlbfs with a slightly simplified test case (see previous email)
> compared to Nish's.
>
> > Does it work with shmfs?
>
> Haven't tried shmfs, but the following correctly does the expected
> interleaving with hugepages (although not hugetlbfs backed):
> shmid = shmget( 0, NR_HUGE_PAGES, IPC_CREAT | SHM_HUGETLB | 0666 );
> shmat_addr = shmat( shmid, NULL, 0 );
> ...
> numa_interleave_memory( shmat_addr, SHM_SIZE, &nm );
> I'd expect it works fine with non-huge pages, shmfs.
Actually, the above call will yield hugetlbfs backed huge pages. The
kernel just prepares the hugetlbfs file for you. See
hugetlb_zero_setup().
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-30 7:29 ` Nishanth Aravamudan
2006-08-30 7:32 ` Andi Kleen
@ 2006-08-30 21:04 ` Christoph Lameter
2006-08-31 6:00 ` Nishanth Aravamudan
1 sibling, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2006-08-30 21:04 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linux-mm, lnxninja, Andi Kleen, linuxppc-dev
> I took out the mlock() call, and I get the same results, FWIW.
What zones are available on your box? Any with HIGHMEM?
Also what kernel version are we talking about? Before 2.6.18?
* Re: libnuma interleaving oddness
2006-08-30 21:04 ` Christoph Lameter
@ 2006-08-31 6:00 ` Nishanth Aravamudan
2006-08-31 7:47 ` Andi Kleen
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
0 siblings, 2 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-31 6:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, lnxninja, Andi Kleen, linuxppc-dev
On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > I took out the mlock() call, and I get the same results, FWIW.
>
> What zones are available on your box? Any with HIGHMEM?
How do I tell the available zones from userspace? This is ppc64 with
about 64GB of memory total, it looks like. So, none of the nodes
(according to /sys/devices/system/node/*/meminfo) have highmem.
> Also what kernel version are we talking about? Before 2.6.18?
The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
replies, sorry.
Tim and I spent most of this afternoon debugging the huge_zonelist()
callpath with kprobes and jprobes. We found the following via a jprobe
to offset_il_node():
jprobe: vma=0xc000000006dc2d78, pol->policy=0x3, pol->v.nodes=0xff, off=0x0
jprobe: vma=0xc00000000f247e30, pol->policy=0x3, pol->v.nodes=0xff, off=0x1000
jprobe: vma=0xc000000006dbf648, pol->policy=0x3, pol->v.nodes=0xff, off=0x2000
...
jprobe: vma=0xc00000000f298870, pol->policy=0x3, pol->v.nodes=0xff, off=0x17000
jprobe: vma=0xc00000000f298368, pol->policy=0x3, pol->v.nodes=0xff, off=0x18000
So it's quite clear that the nodemask is set appropriately and so is
the policy. The problem, in fact, is the "offset" being passed into
offset_il_node(). That offset comes from interleave_nid():
off = vma->vm_pgoff;
off += (addr - vma->vm_start) >> shift;
return offset_il_node(pol, vma, off);
For hugetlbfs VMAs, since vm_pgoff is in units of small pages, the lower
(HPAGE_SHIFT - PAGE_SHIFT) bits (12 in this case) of vma->vm_pgoff, and
hence of off, will always be zero. Thus, when we get into offset_il_node():
        unsigned nnodes = nodes_weight(pol->v.nodes);
        unsigned target = (unsigned)off % nnodes;
        int c;
        int nid = -1;

        c = 0;
        do {
                nid = next_node(nid, pol->v.nodes);
                c++;
        } while (c <= target);
        return nid;
nnodes is 8 (the number of nodes). Our offset (always a multiple of 4096)
is always evenly divisible by 8, so our target node is always node 0!
Note that when we took a bit out of our nodemask, nnodes changed to 7,
which does not evenly divide the offset, and we got interleaving as
expected.
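Concretely: the off values from the jprobe are 0x0, 0x1000, 0x2000, ...,
all multiples of 4096, and 4096 % 8 == 0, so target is 0 on every fault
and the first node in the mask wins every time. With a 7-node mask the
same offsets give 4096 % 7 == 1, 8192 % 7 == 2, and so on, so the target
walks through the mask and interleaving appears to work.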
To test my hypothesis (my analysis may be a bit hand-wavy, sorry), I
changed interleave_nid() to shift off right by (HPAGE_SHIFT -
PAGE_SHIFT) only when CONFIG_HUGETLBFS is set. This fixes the behavior
for the page-by-page case. I'm not sure this is an acceptable mainline
change, but I've included my signed-off-but-not-for-inclusion patch.
Note that when I try this with my testcase that makes each allocation
4 hugepages large, I get 4 hugepages on node 0, then 4 on node 4, then
4 on node 0, and so on. I believe this is because the offset ends up
being the same for all 4 hugepages in each set, so they all go to the
same node.
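(In numbers: with the shift applied after adding the in-VMA hugepage
index, a mapping's four hugepages all collapse to the same off --
presumably vm_pgoff advances by 0x4000 small pages per mmap, so off
comes out 0, 4, 8, 12, ... across mappings, and modulo 8 that
alternates between node 0 and node 4, which is what I see.)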
Many thanks to Tim for his help debugging.
---
Once again, not for inclusion!
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c 2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c 2006-08-30 22:56:43.000000000 -0700
@@ -1169,6 +1169,7 @@ static unsigned offset_il_node(struct me
return nid;
}
+#ifndef CONFIG_HUGETLBFS
/* Determine a node number for interleave */
static inline unsigned interleave_nid(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long addr, int shift)
@@ -1182,8 +1183,22 @@ static inline unsigned interleave_nid(st
} else
return interleave_nodes(pol);
}
+#else
+/* Determine a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+ struct vm_area_struct *vma, unsigned long addr, int shift)
+{
+ if (vma) {
+ unsigned long off;
+
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> shift;
+ off >>= (HPAGE_SHIFT - PAGE_SHIFT);
+ return offset_il_node(pol, vma, off);
+ } else
+ return interleave_nodes(pol);
+}
-#ifdef CONFIG_HUGETLBFS
/* Return a zonelist suitable for a huge page allocation. */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
{
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-31 6:00 ` Nishanth Aravamudan
@ 2006-08-31 7:47 ` Andi Kleen
2006-08-31 15:49 ` Nishanth Aravamudan
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
1 sibling, 1 reply; 24+ messages in thread
From: Andi Kleen @ 2006-08-31 7:47 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter
On Thursday 31 August 2006 08:00, Nishanth Aravamudan wrote:
> On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > I took out the mlock() call, and I get the same results, FWIW.
> >
> > What zones are available on your box? Any with HIGHMEM?
>
> How do I tell the available zones from userspace? This is ppc64 with
> about 64GB of memory total, it looks like. So, none of the nodes
> (according to /sys/devices/system/node/*/meminfo) have highmem.
The zones are listed at the beginning of dmesg
"On node X total pages ...
DMA zone ...
..."
-Andi
* Re: libnuma interleaving oddness
2006-08-31 7:47 ` Andi Kleen
@ 2006-08-31 15:49 ` Nishanth Aravamudan
0 siblings, 0 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-31 15:49 UTC (permalink / raw)
To: Andi Kleen; +Cc: linuxppc-dev, linux-mm, lnxninja, Christoph Lameter
On 31.08.2006 [09:47:30 +0200], Andi Kleen wrote:
> On Thursday 31 August 2006 08:00, Nishanth Aravamudan wrote:
> > On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > > I took out the mlock() call, and I get the same results, FWIW.
> > >
> > > What zones are available on your box? Any with HIGHMEM?
> >
> > How do I tell the available zones from userspace? This is ppc64 with
> > about 64GB of memory total, it looks like. So, none of the nodes
> > (according to /sys/devices/system/node/*/meminfo) have highmem.
>
> The zones are listed at the beginning of dmesg
>
> "On node X total pages ...
> DMA zone ...
> ..."
Page orders: linear mapping = 24, others = 12
<snip>
[boot]0100 MM Init
[boot]0100 MM Init Done
Linux version 2.6.16.21-0.8-ppc64 (geeko@buildhost) (gcc version 4.1.0 (SUSE Linux)) #1 SMP Mon Jul 3 18:25:39 UTC 2006
[boot]0012 Setup Arch
Node 0 Memory: 0x0-0x1b0000000
Node 1 Memory: 0x1b0000000-0x3b0000000
Node 2 Memory: 0x3b0000000-0x5b0000000
Node 3 Memory: 0x5b0000000-0x7b0000000
Node 4 Memory: 0x7b0000000-0x9a0000000
Node 5 Memory: 0x9a0000000-0xba0000000
Node 6 Memory: 0xba0000000-0xda0000000
Node 7 Memory: 0xda0000000-0xf90000000
EEH: PCI Enhanced I/O Error Handling Enabled
PPC64 nvram contains 7168 bytes
Using dedicated idle loop
On node 0 totalpages: 1769472
DMA zone: 1769472 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 1 totalpages: 2097152
DMA zone: 2097152 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 2 totalpages: 2097152
DMA zone: 2097152 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 3 totalpages: 2097152
DMA zone: 2097152 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 4 totalpages: 2031616
DMA zone: 2031616 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 5 totalpages: 2097152
DMA zone: 2097152 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 6 totalpages: 2097152
DMA zone: 2097152 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
On node 7 totalpages: 2031616
DMA zone: 2031616 pages, LIFO batch:31
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 0 pages, LIFO batch:0
HighMem zone: 0 pages, LIFO batch:0
[boot]0015 Setup Done
Built 8 zonelists
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness)
2006-08-31 6:00 ` Nishanth Aravamudan
2006-08-31 7:47 ` Andi Kleen
@ 2006-08-31 16:00 ` Nishanth Aravamudan
2006-08-31 16:08 ` Adam Litke
` (2 more replies)
1 sibling, 3 replies; 24+ messages in thread
From: Nishanth Aravamudan @ 2006-08-31 16:00 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linuxppc-dev, Andi Kleen, linux-mm, lnxninja
On 30.08.2006 [23:00:36 -0700], Nishanth Aravamudan wrote:
> On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > I took out the mlock() call, and I get the same results, FWIW.
> >
> > What zones are available on your box? Any with HIGHMEM?
>
> How do I tell the available zones from userspace? This is ppc64 with
> about 64GB of memory total, it looks like. So, none of the nodes
> (according to /sys/devices/system/node/*/meminfo) have highmem.
>
> > Also what kernel version are we talking about? Before 2.6.18?
>
> The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
> replies, sorry.
>
> Tim and I spent most of this afternoon debugging the huge_zonelist()
> callpath with kprobes and jprobes. We found the following via a jprobe
> to offset_il_node():
<snip lengthy previous discussion>
Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
bad offsets to the interleave functions. Take this difference from
small pages into account when calculating the offset. This does add a
0-bit shift into the small-page path (via alloc_page_vma()), but I think
that is negligible. Also add a BUG_ON to prevent the offset from growing
due to a negative right-shift, which probably shouldn't be allowed
anyway.
Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
expected.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
Results with this patch applied, which I don't think should go into the
changelog:
For the 4-hugepages-at-a-time case:
20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N4=1 N5=1 N6=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1
For the 1-hugepage-at-a-time case:
20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1
21000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N1=1
22000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N2=1
23000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N4=1
25000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N5=1
26000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N6=1
27000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1
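(The arithmetic works out now: off becomes the hugepage index of the
file offset plus the hugepage index within the mapping, so consecutive
hugepages get consecutive offsets -- 0, 1, 2, ... in the one-at-a-time
case and 0-3, 4-7, 8-11, ... in the four-at-a-time case -- and
off % 8 walks the nodes in order, which is exactly the placement shown
above.)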
Andrew, can we get this into 2.6.18?
diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c 2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c 2006-08-31 08:46:22.000000000 -0700
@@ -1176,7 +1176,15 @@ static inline unsigned interleave_nid(st
if (vma) {
unsigned long off;
- off = vma->vm_pgoff;
+ /*
+ * for small pages, there is no difference between
+ * shift and PAGE_SHIFT, so the bit-shift is safe.
+ * for huge pages, since vm_pgoff is in units of small
+ * pages, we need to shift off the always 0 bits to get
+ * a useful offset.
+ */
+ BUG_ON(shift < PAGE_SHIFT);
+ off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
off += (addr - vma->vm_start) >> shift;
return offset_il_node(pol, vma, off);
} else
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness)
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
@ 2006-08-31 16:08 ` Adam Litke
2006-08-31 16:19 ` Tim Pepper
2006-08-31 16:37 ` Christoph Lameter
2 siblings, 0 replies; 24+ messages in thread
From: Adam Litke @ 2006-08-31 16:08 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: akpm, linux-mm, Andi Kleen, linuxppc-dev, lnxninja,
Christoph Lameter
On Thu, 2006-08-31 at 09:00 -0700, Nishanth Aravamudan wrote:
> Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
> the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
> bad offsets to the interleave functions. Take this difference from
> small pages into account when calculating the offset. This does add a
> 0-bit shift into the small-page path (via alloc_page_vma()), but I think
> that is negligible. Also add a BUG_ON to prevent the offset from growing
> due to a negative right-shift, which probably shouldn't be allowed
> anyway.
>
> Tested on an 8-memory node ppc64 NUMA box and got the interleaving I
> expected.
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness)
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
2006-08-31 16:08 ` Adam Litke
@ 2006-08-31 16:19 ` Tim Pepper
2006-08-31 16:37 ` Christoph Lameter
2 siblings, 0 replies; 24+ messages in thread
From: Tim Pepper @ 2006-08-31 16:19 UTC (permalink / raw)
To: Nishanth Aravamudan
Cc: akpm, linux-mm, Andi Kleen, linuxppc-dev, Christoph Lameter
On 8/31/06, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
>
> Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Tim Pepper <lnxninja@us.ibm.com>
* Re: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness)
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
2006-08-31 16:08 ` Adam Litke
2006-08-31 16:19 ` Tim Pepper
@ 2006-08-31 16:37 ` Christoph Lameter
2 siblings, 0 replies; 24+ messages in thread
From: Christoph Lameter @ 2006-08-31 16:37 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: akpm, linuxppc-dev, Andi Kleen, linux-mm, lnxninja
On Thu, 31 Aug 2006, Nishanth Aravamudan wrote:
> Andrew, can we get this into 2.6.18?
Acked-by: Christoph Lameter <clameter@sgi.com>
* Re: libnuma interleaving oddness
2006-08-30 7:19 ` Andi Kleen
2006-08-30 7:29 ` Nishanth Aravamudan
@ 2006-08-30 17:44 ` Adam Litke
1 sibling, 0 replies; 24+ messages in thread
From: Adam Litke @ 2006-08-30 17:44 UTC (permalink / raw)
To: Andi Kleen
Cc: linux-mm, Nishanth Aravamudan, lnxninja, linuxppc-dev,
Christoph Lameter
On Wed, 2006-08-30 at 09:19 +0200, Andi Kleen wrote:
> > > [...] The vma based policies will only work for anonymous pages.
> >
> > The order is (with necessary params filled in):
> >
> > p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
> >
> > numa_interleave_memory(p, newsize, &numa_all_nodes);
> >
> > mlock(p, newsize); /* causes all the hugepages to be faulted in */
> >
> > munlock(p,newsize);
> >
> > From what I gathered from the numa manpages, the interleave policy
> > should take effect on the mlock, as that is "fault-time" in this
> > context. We're forcing the fault, that is.
>
> mlock shouldn't be needed at all here. The new hugetlbfs is supposed
> to reserve at mmap time, and numa_interleave_memory() sets a VMA
> policy which should do the right thing no matter when the fault
> occurs.
mmap-time reservation of huge pages is done only for shared mappings.
MAP_PRIVATE mappings have full-overcommit semantics. We use the mlock
call to "guarantee" the MAP_PRIVATE memory to the process. If mlock
fails, we simply unmap the hugetlb region and tell glibc to revert to
its normal allocation method (mmap normal pages).
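In other words, the logic is roughly (a sketch, not the actual
libhugetlbfs source; the names are illustrative):

    /* sketch of the MAP_PRIVATE guarantee-or-fall-back described above
     * (heap_fd/off/len are illustrative) -- needs <sys/mman.h> */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                   heap_fd, off);
    if (p != MAP_FAILED && mlock(p, len) != 0) {
            munmap(p, len);         /* give the huge pages back */
            p = MAP_FAILED;         /* caller reverts to normal pages */
    }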
> Hmm, maybe the mlock() policy handling is broken.
The policy decision is made further down than mlock. As each huge page
is allocated from the static pool, the policy is consulted to see from
which node to pop a huge page.
The function huge_zonelist() seems to encapsulate the numa policy logic
and after sniffing the code, it looks right to me.
--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
* Re: libnuma interleaving oddness
2006-08-29 23:57 ` Christoph Lameter
2006-08-30 0:21 ` Nishanth Aravamudan
@ 2006-08-30 7:16 ` Andi Kleen
1 sibling, 0 replies; 24+ messages in thread
From: Andi Kleen @ 2006-08-30 7:16 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linuxppc-dev, Nishanth Aravamudan, lnxninja, linux-mm
On Wednesday 30 August 2006 01:57, Christoph Lameter wrote:
> On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
>
> > I don't know if this is a libnuma bug (I extracted out the code from
> > libnuma, it looked sane; and even reimplemented it in libhugetlbfs for
> > testing purposes, but got the same results) or a NUMA kernel bug (mbind
> > is some hairy code...) or a ppc64 bug or maybe not a bug at all.
> > Regardless, I'm getting somewhat inconsistent behavior. I can provide
> > more debugging output, or whatever is requested, but I wasn't sure what
> > to include. I'm hoping someone has heard of or seen something similar?
>
> Are you setting the task's allocation policy before the allocation or do
> you set a vma based policy? The vma based policies will only work for
> anonymous pages.
They should work for hugetlb/shmfs too -- at least they did when I
originally wrote it. But the original patch I did for hugetlbfs to
support that was never merged, and I admit I have never rechecked
whether it works with the patchkit that was merged later. The problem
originally was that hugetlbfs needed to be changed to do
allocate-on-demand instead of allocate-on-mmap, because mbind() comes
after mmap(), and if mmap() has already allocated the memory the policy
can't take effect.
-Andi
Thread overview: 24+ messages
2006-08-29 23:15 libnuma interleaving oddness Nishanth Aravamudan
2006-08-29 23:57 ` Christoph Lameter
2006-08-30 0:21 ` Nishanth Aravamudan
2006-08-30 2:26 ` Nishanth Aravamudan
2006-08-30 4:26 ` Christoph Lameter
2006-08-30 5:31 ` Nishanth Aravamudan
2006-08-30 5:40 ` Tim Pepper
2006-08-30 7:19 ` Andi Kleen
2006-08-30 7:29 ` Nishanth Aravamudan
2006-08-30 7:32 ` Andi Kleen
2006-08-30 18:01 ` Tim Pepper
2006-08-30 18:12 ` Andi Kleen
2006-08-30 18:13 ` Adam Litke
2006-08-30 21:04 ` Christoph Lameter
2006-08-31 6:00 ` Nishanth Aravamudan
2006-08-31 7:47 ` Andi Kleen
2006-08-31 15:49 ` Nishanth Aravamudan
2006-08-31 16:00 ` [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness) Nishanth Aravamudan
2006-08-31 16:08 ` Adam Litke
2006-08-31 16:19 ` Tim Pepper
2006-08-31 16:37 ` Christoph Lameter
2006-08-30 17:44 ` libnuma interleaving oddness Adam Litke
2006-08-30 7:16 ` Andi Kleen