From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm@kvack.org, lnxninja@us.ibm.com, ak@suse.de,
linuxppc-dev@ozlabs.org
Subject: Re: libnuma interleaving oddness
Date: Tue, 29 Aug 2006 19:26:21 -0700
Message-ID: <20060830022621.GA5195@us.ibm.com>
In-Reply-To: <20060830002110.GZ5195@us.ibm.com>
On 29.08.2006 [17:21:10 -0700], Nishanth Aravamudan wrote:
> On 29.08.2006 [16:57:35 -0700], Christoph Lameter wrote:
> > On Tue, 29 Aug 2006, Nishanth Aravamudan wrote:
> >
> > > I don't know if this is a libnuma bug (I extracted out the code from
> > > libnuma, it looked sane; and even reimplemented it in libhugetlbfs
> > > for testing purposes, but got the same results) or a NUMA kernel bug
> > > (mbind is some hairy code...) or a ppc64 bug or maybe not a bug at
> > > all. Regardless, I'm getting somewhat inconsistent behavior. I can
> > > provide more debugging output, or whatever is requested, but I
> > > wasn't sure what to include. I'm hoping someone has heard of or seen
> > > something similar?
> >
> > Are you setting the tasks allocation policy before the allocation or
> > do you set a vma based policy? The vma based policies will only work
> > for anonymous pages.
>
> The order is (with necessary params filled in):
>
> p = mmap( , newsize, RW, PRIVATE, unlinked_hugetlbfs_heap_fd, );
>
> numa_interleave_memory(p, newsize);
>
> mlock(p, newsize); /* causes all the hugepages to be faulted in */
>
> munlock(p,newsize);
>
> From what I gathered from the numa manpages, the interleave policy
> should take effect on the mlock, as that is "fault-time" in this
> context. We're forcing the fault, that is.
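
For readers who want to try the quoted sequence, here is a minimal sketch in C. It is not the libhugetlbfs code: it maps ordinary anonymous memory instead of the unlinked hugetlbfs file, and it issues the mbind system call directly with the interleave policy, which is assumed here to be roughly what numa_interleave_memory() ends up doing. The names range_mask(), fault_interleaved() and SKETCH_MPOL_INTERLEAVE are made up for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Numeric value of MPOL_INTERLEAVE in <linux/mempolicy.h>; defined locally
 * so the sketch does not need the libnuma headers. */
#define SKETCH_MPOL_INTERLEAVE 3

/* Hypothetical helper: build a one-word nodemask for nodes lo..hi. */
static unsigned long range_mask(unsigned lo, unsigned hi)
{
    unsigned long mask = 0;
    for (unsigned n = lo; n <= hi && n < 8 * sizeof(mask); n++)
        mask |= 1UL << n;
    return mask;
}

/* Hypothetical helper: mmap, set an interleave policy on the range, then
 * force the faults with mlock/munlock. Returns 0 on success, otherwise the
 * number of the step that failed (1 = mmap, 2 = mbind, 3 = mlock,
 * 4 = munlock); errno describes the failure. */
static int fault_interleaved(size_t len, unsigned long nodemask)
{
    /* Assumption: anonymous memory stands in for the hugetlbfs heap fd. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    int step = 0;
    /* Policy is set before any page is touched, as in the quoted order. */
    if (syscall(SYS_mbind, p, (unsigned long)len,
                (long)SKETCH_MPOL_INTERLEAVE, &nodemask,
                (unsigned long)(8 * sizeof(nodemask)), 0UL) != 0)
        step = 2;
    else if (mlock(p, len) != 0)    /* faults every page in */
        step = 3;
    else if (munlock(p, len) != 0)
        step = 4;

    munmap(p, len);
    return step;
}
```

Calling fault_interleaved(newsize, range_mask(0, 7)) corresponds to the 0-7 nodemask case and range_mask(1, 7) to the 1-7 case. To inspect placement, the munmap would have to be deferred until /proc/pid/numa_maps has been read. The mbind step can fail where the kernel or a sandbox does not permit it.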
For some more data, I did some manipulations of libhugetlbfs and came up
with the following:

If I use the default hugepage-aligned hugepage-backed malloc
replacement, I get the following in /proc/pid/numa_maps (excerpt):
20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
21000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
...
37000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.3JbO7R\040(deleted) huge dirty=1 N0=1
If I change the nodemask to 1-7, I get:
20000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
21000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
22000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
23000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
24000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N5=1
25000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N6=1
26000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N7=1
...
35000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N1=1
36000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N2=1
37000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N3=1
38000000 interleave=1-7 file=/libhugetlbfs/libhugetlbfs.tmp.Eh9Bmp\040(deleted) huge dirty=1 N4=1
If I then change our malloc implementation to (unnecessarily) mmap a
size aligned to 4 hugepages, rather than aligned to a single hugepage,
while still using a nodemask of 0-7, I get:
20000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
28000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
2c000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
30000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
34000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=4 N0=1 N1=1 N2=1 N3=1
38000000 interleave=0-7 file=/libhugetlbfs/libhugetlbfs.tmp.PFt0xt\040(deleted) huge dirty=1 mapped=4 N0=1 N1=1 N2=1 N3=1
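
To compare excerpts like the three above, it can help to sum the N<node>=<pages> fields over all lines. The following is a small sketch in C; the function name tally_numa_maps_line() and the MAX_NODES limit are assumptions made for illustration, and only the N fields are interpreted.

```c
#include <stdio.h>

#define MAX_NODES 64

/* Hypothetical helper: scan one numa_maps line and add each N<node>=<pages>
 * field to pages[node]. Returns the number of such fields found. */
static int tally_numa_maps_line(const char *line,
                                unsigned long pages[MAX_NODES])
{
    int found = 0;
    const char *p = line;

    while (*p) {
        /* Skip the whitespace separating fields. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        if (*p == '\0')
            break;

        unsigned node;
        unsigned long count;
        /* Only tokens of the form N<node>=<pages> match. */
        if (sscanf(p, "N%u=%lu", &node, &count) == 2 && node < MAX_NODES) {
            pages[node] += count;
            found++;
        }

        /* Advance to the end of the current field. */
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
    }
    return found;
}
```

Feeding every line of an excerpt through this function with a zero-initialised pages[] array gives the per-node totals directly: the first excerpt puts everything on node 0, while the 4-hugepage mappings only ever touch nodes 0 through 3.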
It seems rather odd that it's this inconsistent, and that I'm the only
one seeing it as such :)
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center