From: Nishanth Aravamudan <nacc@us.ibm.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm@kvack.org, lnxninja@us.ibm.com, Andi Kleen <ak@suse.de>,
linuxppc-dev@ozlabs.org
Subject: Re: libnuma interleaving oddness
Date: Wed, 30 Aug 2006 23:00:36 -0700
Message-ID: <20060831060036.GA18661@us.ibm.com>
In-Reply-To: <Pine.LNX.4.64.0608301401290.4217@schroedinger.engr.sgi.com>
On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > I took out the mlock() call, and I get the same results, FWIW.
>
> What zones are available on your box? Any with HIGHMEM?
How do I tell the available zones from userspace? This is ppc64 with
about 64GB of memory total, it looks like. So, none of the nodes
(according to /sys/devices/system/node/*/meminfo) have highmem.
> Also what kernel version are we talking about? Before 2.6.18?
The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
replies, sorry.
Tim and I spent most of this afternoon debugging the huge_zonelist()
callpath with kprobes and jprobes. We found the following via a jprobe
on offset_il_node():
jprobe: vma=0xc000000006dc2d78, pol->policy=0x3, pol->v.nodes=0xff, off=0x0
jprobe: vma=0xc00000000f247e30, pol->policy=0x3, pol->v.nodes=0xff, off=0x1000
jprobe: vma=0xc000000006dbf648, pol->policy=0x3, pol->v.nodes=0xff, off=0x2000
...
jprobe: vma=0xc00000000f298870, pol->policy=0x3, pol->v.nodes=0xff, off=0x17000
jprobe: vma=0xc00000000f298368, pol->policy=0x3, pol->v.nodes=0xff, off=0x18000
So it's quite clear that the nodemask and the policy are both set
appropriately. The problem, in fact, is the "offset" being passed into
offset_il_node(), which I think comes from interleave_nid():
	off = vma->vm_pgoff;
	off += (addr - vma->vm_start) >> shift;
	return offset_il_node(pol, vma, off);
For hugetlbfs VMAs, vm_pgoff is in units of small pages, so the lower
(HPAGE_SHIFT - PAGE_SHIFT) bits (12 in this case) of vma->vm_pgoff and
off will always be zero. Thus, when we get into offset_il_node():
	unsigned nnodes = nodes_weight(pol->v.nodes);
	unsigned target = (unsigned)off % nnodes;
	int c;
	int nid = -1;

	c = 0;
	do {
		nid = next_node(nid, pol->v.nodes);
		c++;
	} while (c <= target);
	return nid;
nnodes is 8 (the number of nodes). Our offset (some multiple of 4096) is
always evenly divisible by 8, so our target node is always node 0! Note
that when we took a bit out of our nodemask, nnodes dropped to 7
accordingly; 7 does not evenly divide the offset, and we got
interleaving as expected.
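To make the arithmetic concrete, here is a rough userspace sketch of the
node selection (not the kernel code itself). The names mask_weight() and
il_node() are made up for illustration, a plain unsigned long stands in
for nodemask_t, and the 7-node mask 0x7f is only a guess at which bit one
might remove:

```c
/* Userspace stand-in for nodes_weight(): count the set bits in the mask. */
static unsigned mask_weight(unsigned long mask)
{
	unsigned n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/*
 * Userspace stand-in for offset_il_node(): return the index of the
 * (off % nnodes)-th set bit of the mask, or -1 for an empty mask.
 */
static int il_node(unsigned long mask, unsigned long off)
{
	unsigned nnodes = mask_weight(mask);
	unsigned target;
	unsigned c = 0;
	int nid = -1;

	if (!nnodes)
		return -1;
	target = (unsigned)off % nnodes;
	do {
		/* Emulates next_node(): advance to the next set bit above nid. */
		do {
			nid++;
		} while (!(mask & (1UL << nid)));
		c++;
	} while (c <= target);
	return nid;
}
```

With mask 0xff, every offset from the jprobe trace (0x0, 0x1000, 0x2000,
..., 0x18000) is a multiple of 8, so il_node() returns 0 each time. With
the hypothetical mask 0x7f, 0x1000 % 7 is 1 and 0x2000 % 7 is 2, so the
same offsets spread across nodes.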
To test my hypothesis (my analysis may be a bit hand-wavy, sorry), I
changed interleave_nid() to shift off right by (HPAGE_SHIFT -
PAGE_SHIFT), but only when CONFIG_HUGETLBFS is set. This fixes the
behavior for the page-by-page case. I'm not sure this is an acceptable
mainline change, so the patch below is signed off but not for inclusion.
Note that when I try this with my testcase that makes each allocation
4 hugepages large, I get 4 hugepages on node 0, then 4 on node 4,
then 4 on node 0, and so on. I believe this is because the offset ends
up being the same for all of the 4 hugepages in each set, so they go to
the same node.
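A small userspace sketch of why that happens, under some assumptions that
are not confirmed above: huge pages are 16MB (shift of 24),
huge_zonelist() passes that shift into interleave_nid(), and each
4-hugepage mapping starts 0x4000 small pages further into the file than
the previous one. struct sketch_vma, patched_off() and patched_node() are
names invented for this illustration:

```c
/* From the mail: HPAGE_SHIFT - PAGE_SHIFT is 12 on this box. */
#define HPAGE_PAGE_DELTA	12
/* Assumption: 16MB huge pages, i.e. a huge page shift of 24. */
#define SKETCH_HPAGE_SHIFT	24

/* Hypothetical stand-in for the two vm_area_struct fields used here. */
struct sketch_vma {
	unsigned long vm_start;	/* first address of the mapping */
	unsigned long vm_pgoff;	/* file offset in small-page units */
};

/* Offset as computed by the patched interleave_nid() in the diff below. */
static unsigned long patched_off(const struct sketch_vma *vma,
				 unsigned long addr, int shift)
{
	unsigned long off = vma->vm_pgoff;

	off += (addr - vma->vm_start) >> shift;
	off >>= HPAGE_PAGE_DELTA;
	return off;
}

/* With a contiguous 8-node mask, the chosen node is simply off % 8. */
static unsigned patched_node(const struct sketch_vma *vma, unsigned long addr)
{
	return (unsigned)(patched_off(vma, addr, SKETCH_HPAGE_SHIFT) % 8);
}
```

Within one mapping the hugepage index (0..3) is added below bit 12 and is
then shifted away, so all four pages share one offset. Under these
assumptions, mappings with vm_pgoff of 0x0, 0x4000 and 0x8000 end up with
offsets 0, 4 and 8, i.e. nodes 0, 4 and 0, matching what I see.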
Many thanks to Tim for his help debugging.
---
Once again, not for inclusion!
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c 2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c 2006-08-30 22:56:43.000000000 -0700
@@ -1169,6 +1169,7 @@ static unsigned offset_il_node(struct me
return nid;
}
+#ifndef CONFIG_HUGETLBFS
/* Determine a node number for interleave */
static inline unsigned interleave_nid(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long addr, int shift)
@@ -1182,8 +1183,22 @@ static inline unsigned interleave_nid(st
} else
return interleave_nodes(pol);
}
+#else
+/* Determine a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+ struct vm_area_struct *vma, unsigned long addr, int shift)
+{
+ if (vma) {
+ unsigned long off;
+
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> shift;
+ off >>= (HPAGE_SHIFT - PAGE_SHIFT);
+ return offset_il_node(pol, vma, off);
+ } else
+ return interleave_nodes(pol);
+}
-#ifdef CONFIG_HUGETLBFS
/* Return a zonelist suitable for a huge page allocation. */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
{
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center