From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: akpm@osdl.org, linuxppc-dev@ozlabs.org, Andi Kleen <ak@suse.de>,
linux-mm@kvack.org, lnxninja@us.ibm.com
Subject: [PATCH] fix NUMA interleaving for huge pages (was RE: libnuma interleaving oddness)
Date: Thu, 31 Aug 2006 09:00:52 -0700
Message-ID: <20060831160052.GB23990@us.ibm.com>
In-Reply-To: <20060831060036.GA18661@us.ibm.com>
On 30.08.2006 [23:00:36 -0700], Nishanth Aravamudan wrote:
> On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > > I took out the mlock() call, and I get the same results, FWIW.
> >
> > What zones are available on your box? Any with HIGHMEM?
>
> How do I tell the available zones from userspace? This is ppc64 with
> about 64GB of memory total, it looks like. So, none of the nodes
> (according to /sys/devices/system/node/*/meminfo) have highmem.
>
> > Also what kernel version are we talking about? Before 2.6.18?
>
> The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
> replies, sorry.
>
> Tim and I spent most of this afternoon debugging the huge_zonelist()
> callpath with kprobes and jprobes. We found the following via a jprobe
> to offset_il_node():
<snip lengthy previous discussion>
Since vma->vm_pgoff is in units of small pages, VMAs for huge pages have
the lower HPAGE_SHIFT - PAGE_SHIFT bits always cleared, which results in
bad offsets to the interleave functions. Take this difference from
small pages into account when calculating the offset. This does add a
0-bit shift into the small-page path (via alloc_page_vma()), but I think
that is negligible. Also add a BUG_ON to prevent the offset from growing
due to a negative right-shift, which probably shouldn't be allowed
anyway.
Tested on a ppc64 NUMA box with 8 memory nodes and got the interleaving I
expected.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
Results with this patch applied, which shouldn't go into the changelog,
I don't think:
for the 4-hugepages at a time case:
20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N4=1 N5=1 N6=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.r1YKfL huge dirty=4 N0=1 N1=1 N2=1 N3=1
for the 1-hugepage at a time case:
20000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1
21000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N1=1
22000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N2=1
23000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N3=1
24000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N4=1
25000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N5=1
26000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N6=1
27000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N7=1
28000000 interleave=0-7 file=/hugetlbfs/libhugetlbfs.tmp.LeSnPN huge dirty=1 N0=1
Andrew, can we get this into 2.6.18?
diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c 2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c 2006-08-31 08:46:22.000000000 -0700
@@ -1176,7 +1176,15 @@ static inline unsigned interleave_nid(st
 	if (vma) {
 		unsigned long off;
 
-		off = vma->vm_pgoff;
+		/*
+		 * for small pages, there is no difference between
+		 * shift and PAGE_SHIFT, so the bit-shift is safe.
+		 * for huge pages, since vm_pgoff is in units of small
+		 * pages, we need to shift off the always 0 bits to get
+		 * a useful offset.
+		 */
+		BUG_ON(shift < PAGE_SHIFT);
+		off = vma->vm_pgoff >> (shift - PAGE_SHIFT);
 		off += (addr - vma->vm_start) >> shift;
 		return offset_il_node(pol, vma, off);
 	} else
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center