From: Geoff Levand <geoffrey.levand@am.sony.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Linuxppc-dev@ozlabs.org, Jon Tollefson <kniht@linux.vnet.ibm.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Dave Hansen <dave@linux.vnet.ibm.com>
Subject: [patch] powerpc: fix numa reserve bootmem page selection
Date: Thu, 12 Feb 2009 14:36:04 -0800 [thread overview]
Message-ID: <4994A454.4080207@am.sony.com> (raw)
In-Reply-To: <49924348.5050702@am.sony.com>
From: Dave Hansen <dave@linux.vnet.ibm.com>
Fix the powerpc NUMA reserve bootmem page selection logic.
commit 8f64e1f2d1e09267ac926e15090fd505c1c0cbcb (powerpc: Reserve
in bootmem lmb reserved regions that cross NUMA nodes) changed
the logic for how the powerpc LMB reserved regions were converted
to bootmen reserved regions. As the folowing discussion reports,
the new logic was not correct.
mark_reserved_regions_for_nid() goes through each LMB on the
system that specifies a reserved area. It searches for
active regions that intersect with that LMB and are on the
specified node. It attempts to bootmem-reserve only the area
where the active region and the reserved LMB intersect. We
can not reserve things on other nodes as they may not have
bootmem structures allocated, yet.
We base the size of the bootmem reservation on two possible
things. Normally, we just make the reservation start and
stop exactly at the start and end of the LMB.
However, the LMB reservations are not aware of NUMA nodes and
on occasion a single LMB may cross into several adjacent
active regions. Those may even be on different NUMA nodes
and will require separate calls to the bootmem reserve
functions. So, the bootmem reservation must be trimmed to
fit inside the current active region.
That's all fine and dandy, but we trim the reservation
in a page-aligned fashion. That's bad because we start the
reservation at a non-page-aligned address: physbase.
The reservation may only span 2 bytes, but that those bytes
may span two pfns and cause a reserve_size of 2*PAGE_SIZE.
Take the case where you reserve 0x2 bytes at 0x0fff and
where the active region ends at 0x1000. You'll jump into
that if() statment, but node_ar.end_pfn=0x1 and
start_pfn=0x0. You'll end up with a reserve_size=0x1000,
and then call
reserve_bootmem_node(node, physbase=0xfff, size=0x1000);
0x1000 may not be on the same node as 0xfff. Oops.
In almost all the vm code, end_<anything> is not inclusive.
If you have an end_pfn of 0x1234, page 0x1234 is not
included in the range. Using PFN_UP instead of the
(>> >> PAGE_SHIFT) will make this consistent with the other VM
code.
We also need to do math for the reserved size with physbase
instead of start_pfn. node_ar.end_pfn << PAGE_SHIFT is
*precisely* the end of the node. However,
(start_pfn << PAGE_SHIFT) is *NOT* precisely the beginning
of the reserved area. That is, of course, physbase.
If we don't use physbase here, the reserve_size can be
made too large.
From: Dave Hansen <dave@linux.vnet.ibm.com>
Tested-by: Geoff Levand <geoffrey.levand@am.sony.com> Tested on PS3.
---
linux-2.6.git-dave/arch/powerpc/mm/numa.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff -puN arch/powerpc/mm/numa.c~reserve-over-fix arch/powerpc/mm/numa.c
--- a/arch/powerpc/mm/numa.c~reserve-over-fix
+++ b/arch/powerpc/mm/numa.c
@@ -19,6 +19,7 @@
#include <linux/notifier.h>
#include <linux/lmb.h>
#include <linux/of.h>
+#include <linux/pfn.h>
#include <asm/sparsemem.h>
#include <asm/prom.h>
#include <asm/system.h>
@@ -882,7 +883,7 @@ static void mark_reserved_regions_for_ni
unsigned long physbase = lmb.reserved.region[i].base;
unsigned long size = lmb.reserved.region[i].size;
unsigned long start_pfn = physbase >> PAGE_SHIFT;
- unsigned long end_pfn = ((physbase + size) >> PAGE_SHIFT);
+ unsigned long end_pfn = PFN_UP(physbase + size);
struct node_active_region node_ar;
unsigned long node_end_pfn = node->node_start_pfn +
node->node_spanned_pages;
@@ -908,7 +909,7 @@ static void mark_reserved_regions_for_ni
*/
if (end_pfn > node_ar.end_pfn)
reserve_size = (node_ar.end_pfn << PAGE_SHIFT)
- - (start_pfn << PAGE_SHIFT);
+ - physbase;
/*
* Only worry about *this* node, others may not
* yet have valid NODE_DATA().
prev parent reply other threads:[~2009-02-12 22:36 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-10-09 20:18 [PATCH v3] powerpc: properly reserve in bootmem the lmb reserved regions that cross NUMA nodes Jon Tollefson
2008-10-09 20:18 ` Jon Tollefson
2008-10-10 4:55 ` Benjamin Herrenschmidt
2008-10-10 4:55 ` Benjamin Herrenschmidt
2008-10-10 4:55 ` Benjamin Herrenschmidt
2008-10-17 4:59 ` Jon Tollefson
2008-10-17 4:59 ` Jon Tollefson
2008-10-17 4:59 ` Jon Tollefson
2009-02-11 3:17 ` problem with numa reserve bootmem Geoff Levand
2009-02-11 3:55 ` Michael Ellerman
2009-02-12 22:36 ` Geoff Levand [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4994A454.4080207@am.sony.com \
--to=geoffrey.levand@am.sony.com \
--cc=Linuxppc-dev@ozlabs.org \
--cc=benh@kernel.crashing.org \
--cc=dave@linux.vnet.ibm.com \
--cc=hannes@cmpxchg.org \
--cc=kniht@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.