Date: Wed, 30 Aug 2006 23:00:36 -0700
From: Nishanth Aravamudan
To: Christoph Lameter
Cc: linux-mm@kvack.org, lnxninja@us.ibm.com, Andi Kleen, linuxppc-dev@ozlabs.org
Subject: Re: libnuma interleaving oddness
Message-ID: <20060831060036.GA18661@us.ibm.com>
References: <20060829231545.GY5195@us.ibm.com> <20060830002110.GZ5195@us.ibm.com> <200608300919.13125.ak@suse.de> <20060830072948.GE5195@us.ibm.com>
List-Id: Linux on PowerPC Developers Mail List

On 30.08.2006 [14:04:40 -0700], Christoph Lameter wrote:
> > I took out the mlock() call, and I get the same results, FWIW.
>
> What zones are available on your box? Any with HIGHMEM?

How do I tell the available zones from userspace? This is ppc64 with
about 64GB of memory total, it looks like. So none of the nodes
(according to /sys/devices/system/node/*/meminfo) have highmem.

> Also what kernel version are we talking about? Before 2.6.18?

The SuSE default, 2.6.16.21 -- I thought I mentioned that in one of my
replies, sorry.

Tim and I spent most of this afternoon debugging the huge_zonelist()
callpath with kprobes and jprobes. We found the following via a jprobe
on offset_il_node():

jprobe: vma=0xc000000006dc2d78, pol->policy=0x3, pol->v.nodes=0xff, off=0x0
jprobe: vma=0xc00000000f247e30, pol->policy=0x3, pol->v.nodes=0xff, off=0x1000
jprobe: vma=0xc000000006dbf648, pol->policy=0x3, pol->v.nodes=0xff, off=0x2000
...
jprobe: vma=0xc00000000f298870, pol->policy=0x3, pol->v.nodes=0xff, off=0x17000
jprobe: vma=0xc00000000f298368, pol->policy=0x3, pol->v.nodes=0xff, off=0x18000

So it's quite clear that the nodemask and the policy are both set
appropriately (0x3 is MPOL_INTERLEAVE, and 0xff covers all eight
nodes). The problem, in fact, is the offset being passed into
offset_il_node(). It comes, I think, from interleave_nid():

	off = vma->vm_pgoff;
	off += (addr - vma->vm_start) >> shift;
	return offset_il_node(pol, vma, off);

For hugetlbfs VMAs, since vm_pgoff is in units of small pages, the
lower (HPAGE_SHIFT - PAGE_SHIFT) bits of vma->vm_pgoff and off will
always be zero (12 bits in this case). Thus, when we get into
offset_il_node():

	unsigned nnodes = nodes_weight(pol->v.nodes);
	unsigned target = (unsigned)off % nnodes;
	int c;
	int nid = -1;

	c = 0;
	do {
		nid = next_node(nid, pol->v.nodes);
		c++;
	} while (c <= target);
	return nid;

nnodes is 8 (the number of nodes), and our offset (always a multiple
of 4096) is always evenly divisible by 8. So our target node is always
node 0!
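To make the wrap-around concrete, the modulo math can be pulled out
into a trivial userspace program (just a sketch of the arithmetic in
offset_il_node(), not kernel code; the eight-node count and the
0x1000-aligned offsets are taken from the jprobe trace above):

	#include <stdio.h>

	int main(void)
	{
		unsigned nnodes = 8;	/* nodes_weight(0xff) */
		unsigned long off;

		/* the offsets observed in the trace: 0x0, 0x1000, ... */
		for (off = 0x0; off <= 0x18000; off += 0x1000)
			printf("off=%#lx -> target %lu\n", off, off % nnodes);
		return 0;
	}

Every line prints target 0, matching the all-node-0 placement we see.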
Note that when we took out a bit in our nodemask, nnodes changed to 7,
which does not evenly divide those offsets, and we got interleaving as
expected.

To test my hypothesis (my analysis may be a bit hand-wavy, sorry), I
changed interleave_nid() to shift off right by (HPAGE_SHIFT -
PAGE_SHIFT), but only when CONFIG_HUGETLBFS is defined. This fixes the
behavior for the page-by-page case. I'm not sure it's an acceptable
mainline change, but I've included my signed-off-but-not-for-inclusion
patch below.

Note that when I try this with my testcase that makes each allocation
4 hugepages large, I get 4 hugepages on node 0, then 4 on node 4, then
4 on node 0, and so on. I believe this is because the offset ends up
being the same for all 4 hugepages in each set, so they go to the same
node.

Many thanks to Tim for his help debugging.

---

Once again, not for inclusion!

Signed-off-by: Nishanth Aravamudan

diff -urpN 2.6.18-rc5/mm/mempolicy.c 2.6.18-rc5-dev/mm/mempolicy.c
--- 2.6.18-rc5/mm/mempolicy.c	2006-08-30 22:55:33.000000000 -0700
+++ 2.6.18-rc5-dev/mm/mempolicy.c	2006-08-30 22:56:43.000000000 -0700
@@ -1169,6 +1169,7 @@ static unsigned offset_il_node(struct me
 	return nid;
 }
 
+#ifndef CONFIG_HUGETLBFS
 /* Determine a node number for interleave */
 static inline unsigned interleave_nid(struct mempolicy *pol,
 		struct vm_area_struct *vma, unsigned long addr, int shift)
@@ -1182,8 +1183,22 @@ static inline unsigned interleave_nid(st
 	} else
 		return interleave_nodes(pol);
 }
+#else
+/* Determine a node number for interleave */
+static inline unsigned interleave_nid(struct mempolicy *pol,
+		struct vm_area_struct *vma, unsigned long addr, int shift)
+{
+	if (vma) {
+		unsigned long off;
+
+		off = vma->vm_pgoff;
+		off += (addr - vma->vm_start) >> shift;
+		off >>= (HPAGE_SHIFT - PAGE_SHIFT);
+		return offset_il_node(pol, vma, off);
+	} else
+		return interleave_nodes(pol);
+}
 
-#ifdef CONFIG_HUGETLBFS
 /* Return a zonelist suitable for a huge page allocation. */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
 {

-- 
Nishanth Aravamudan
IBM Linux Technology Center
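P.S. The effect of the extra shift is easy to sanity-check in
userspace, too (again just a sketch; the shift of 12 is the
(HPAGE_SHIFT - PAGE_SHIFT) value on this box, and the node count and
offsets are from the trace):

	#include <stdio.h>

	int main(void)
	{
		unsigned nnodes = 8;
		int shift = 12;		/* HPAGE_SHIFT - PAGE_SHIFT here */
		unsigned long off;

		/* same offsets as the trace; with the patch applied,
		 * successive hugepages now hit successive nodes */
		for (off = 0x0; off <= 0x18000; off += 0x1000)
			printf("off=%#lx -> node %lu\n",
			       off, (off >> shift) % nnodes);
		return 0;
	}

This cycles through nodes 0-7 and wraps, i.e. the round-robin
interleaving we wanted in the first place.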