From: Andrea Arcangeli <aarcange@redhat.com>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Zijlstra <pzijlstr@redhat.com>, Ingo Molnar <mingo@elte.hu>,
Mel Gorman <mel@csn.ul.ie>, Hugh Dickins <hughd@google.com>,
Rik van Riel <riel@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Hillf Danton <dhillf@gmail.com>,
Andrew Jones <drjones@redhat.com>, Dan Smith <danms@us.ibm.com>,
Thomas Gleixner <tglx@linutronix.de>,
Paul Turner <pjt@google.com>, Christoph Lameter <cl@linux.com>,
Suresh Siddha <suresh.b.siddha@intel.com>,
Mike Galbraith <efault@gmx.de>,
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
Lai Jiangshan <laijs@cn.fujitsu.com>,
Bharata B Rao <bharata.rao@gmail.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>,
Alex Shi <alex.shi@intel.com>,
Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Don Morris <don.morris@hp.com>,
Benjamin Herrenschmidt <benh@kernel.crashing.org>
Subject: [PATCH 29/33] autonuma: page_autonuma
Date: Thu, 4 Oct 2012 01:51:11 +0200 [thread overview]
Message-ID: <1349308275-2174-30-git-send-email-aarcange@redhat.com> (raw)
In-Reply-To: <1349308275-2174-1-git-send-email-aarcange@redhat.com>
Move the autonuma_last_nid from the "struct page" to a separate
page_autonuma data structure allocated in the memsection (with
sparsemem) or in the pgdat (with flatmem).
This is done to avoid growing the size of "struct page". The
page_autonuma data is only allocated if the kernel is booted on real
NUMA hardware and noautonuma is not passed as a parameter to the
kernel.
An alternative would be to takeover 16 bits from the page->flags: but:
1) 32bit are already used (in fact 32bit archs are considering to
adding another 32bit too to avoid losing common code features), 16
bits would be used by the last_nid, and several bits are used by
per-node (readonly) zone/node information, so we would be left with
just an handful of spare PG_ bits if we stole 16 for the last_nid.
2) We cannot exclude we'll want to add more bits of information in the
future (and more than 16 wouldn't fit on page->flags). Changing the
format or layout of the page_autonuma structure is trivial,
compared to altering the format of the page->flags. So
page_autonuma is much more hackable than page->flags.
3) page->flags can be modified from under us with locked ops
(lock_page and all page flags operations). Normally we never change
more than 1 bit at once on it. So the way page->flags could be
updated is through cmpxchg. That's slow and tricky code would need
to be written for it (potentially to drop late in case of point 2
above). Allocating those 2 bytes separately to me looks a lot
cleaner even if it takes 0.048% of memory (but only when booting on
NUMA hardware).
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
include/linux/autonuma.h | 8 ++
include/linux/autonuma_types.h | 19 +++
include/linux/mm_types.h | 11 --
include/linux/mmzone.h | 12 ++
include/linux/page_autonuma.h | 50 +++++++++
init/main.c | 2 +
mm/Makefile | 2 +-
mm/autonuma.c | 37 +++++--
mm/huge_memory.c | 13 ++-
mm/page_alloc.c | 14 +--
mm/page_autonuma.c | 237 ++++++++++++++++++++++++++++++++++++++++
mm/sparse.c | 126 ++++++++++++++++++++-
12 files changed, 490 insertions(+), 41 deletions(-)
create mode 100644 include/linux/page_autonuma.h
create mode 100644 mm/page_autonuma.c
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 02d4875..274c616 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -10,6 +10,13 @@ extern void autonuma_exit(struct mm_struct *mm);
extern void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail);
extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
+
+static inline void autonuma_free_page(struct page *page)
+{
+ if (autonuma_possible())
+ lookup_page_autonuma(page)->autonuma_last_nid = -1;
+}
#define autonuma_printk(format, args...) \
if (autonuma_debug()) printk(format, ##args)
@@ -21,6 +28,7 @@ static inline void autonuma_exit(struct mm_struct *mm) {}
static inline void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail) {}
static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
#endif /* CONFIG_AUTONUMA */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9673ce8..d0c6403 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -78,6 +78,25 @@ struct task_autonuma {
/* do not add more variables here, the above array size is dynamic */
};
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is possible.
+ */
+struct page_autonuma {
+ /*
+ * autonuma_last_nid records the NUMA node that accessed the
+ * page during the last NUMA hinting page fault. If a
+ * different node accesses the page next, AutoNUMA will not
+ * migrate the page. This tries to avoid page thrashing by
+ * requiring that a page be accessed by the same node twice in
+ * a row before it is queued for migration.
+ */
+#if MAX_NUMNODES > 32767
+#error "too many nodes"
+#endif
+ short autonuma_last_nid;
+};
+
extern int alloc_task_autonuma(struct task_struct *tsk,
struct task_struct *orig,
int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9e8398a..c80101c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -152,17 +152,6 @@ struct page {
struct page *first_page; /* Compound tail pages */
};
-#ifdef CONFIG_AUTONUMA
- /*
- * FIXME: move to pgdat section along with the memcg and allocate
- * at runtime only in presence of a numa system.
- */
-#if MAX_NUMNODES > 32767
-#error "too many nodes"
-#endif
- short autonuma_last_nid;
-#endif
-
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f793541..db68389 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -710,6 +710,9 @@ typedef struct pglist_data {
int kswapd_max_order;
enum zone_type classzone_idx;
#ifdef CONFIG_AUTONUMA
+#if !defined(CONFIG_SPARSEMEM)
+ struct page_autonuma *node_page_autonuma;
+#endif
/*
* Lock serializing the per destination node AutoNUMA memory
* migration rate limiting data.
@@ -1081,6 +1084,15 @@ struct mem_section {
* section. (see memcontrol.h/page_cgroup.h about this.)
*/
struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+ /*
+ * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+ * section.
+ */
+ struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_MEMCG) ^ defined(CONFIG_AUTONUMA)
unsigned long pad;
#endif
};
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..6da6c51
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#include <linux/autonuma_flags.h>
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE * \
+ PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif /* CONFIG_SPARSEMEM */
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b286730..586764f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -69,6 +69,7 @@
#include <linux/slab.h>
#include <linux/perf_event.h>
#include <linux/file.h>
+#include <linux/page_autonuma.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -456,6 +457,7 @@ static void __init mm_init(void)
* bigger than MAX_ORDER unless SPARSEMEM.
*/
page_cgroup_init_flatmem();
+ page_autonuma_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 0fd3165..5a4fa30 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,7 +34,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
-obj-$(CONFIG_AUTONUMA) += autonuma.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 1b2530c..b5c5ff6 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -55,10 +55,19 @@ void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail)
{
int last_nid;
+ struct page_autonuma *page_autonuma, *page_tail_autonuma;
- last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ if (!autonuma_possible())
+ return;
+
+ page_autonuma = lookup_page_autonuma(page);
+ page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+ VM_BUG_ON(page_tail_autonuma->autonuma_last_nid != -1);
+
+ last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
if (last_nid >= 0)
- page_tail->autonuma_last_nid = last_nid;
+ page_tail_autonuma->autonuma_last_nid = last_nid;
}
static int sync_isolate_migratepages(struct list_head *migratepages,
@@ -176,13 +185,18 @@ static struct page *alloc_migrate_dst_page(struct page *page,
{
int nid = (int) data;
struct page *newpage;
+ struct page_autonuma *page_autonuma, *newpage_autonuma;
newpage = alloc_pages_exact_node(nid,
(GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
__GFP_NOMEMALLOC | __GFP_NORETRY |
__GFP_NOWARN | __GFP_NO_KSWAPD) &
~GFP_IOFS, 0);
- if (newpage)
- newpage->autonuma_last_nid = page->autonuma_last_nid;
+ if (newpage) {
+ page_autonuma = lookup_page_autonuma(page);
+ newpage_autonuma = lookup_page_autonuma(newpage);
+ newpage_autonuma->autonuma_last_nid =
+ page_autonuma->autonuma_last_nid;
+ }
return newpage;
}
@@ -291,13 +305,14 @@ static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
static inline bool last_nid_set(struct page *page, int this_nid)
{
bool ret = true;
- int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
VM_BUG_ON(this_nid < 0);
VM_BUG_ON(this_nid >= MAX_NUMNODES);
if (autonuma_last_nid != this_nid) {
if (autonuma_last_nid >= 0)
ret = false;
- ACCESS_ONCE(page->autonuma_last_nid) = this_nid;
+ ACCESS_ONCE(page_autonuma->autonuma_last_nid) = this_nid;
}
return ret;
}
@@ -1185,7 +1200,8 @@ static int __init noautonuma_setup(char *str)
}
return 1;
}
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
static bool autonuma_init_checks_failed(void)
{
@@ -1209,7 +1225,12 @@ static int __init autonuma_init(void)
VM_BUG_ON(num_possible_nodes() < 1);
if (num_possible_nodes() <= 1 || !autonuma_possible()) {
- clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+ /* should have been already initialized by page_autonuma */
+ if (autonuma_possible()) {
+ WARN_ON(1);
+ /* try to fixup if it wasn't ok */
+ clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+ }
return -EINVAL;
} else if (autonuma_init_checks_failed()) {
printk("autonuma disengaged: init checks failed\n");
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 757c1cc..86db742 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1850,7 +1850,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
bool mknuma = false;
#ifdef CONFIG_AUTONUMA
int autonuma_last_nid = -1;
+ struct page_autonuma *src_page_an, *page_an = NULL;
+
+ if (autonuma_possible())
+ page_an = lookup_page_autonuma(page);
#endif
+
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1862,12 +1867,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
src_page = pte_page(pteval);
#ifdef CONFIG_AUTONUMA
/* pick the first one, better than nothing */
- if (autonuma_last_nid < 0) {
+ if (autonuma_possible() && autonuma_last_nid < 0) {
+ src_page_an = lookup_page_autonuma(src_page);
autonuma_last_nid =
- ACCESS_ONCE(src_page->
- autonuma_last_nid);
+ ACCESS_ONCE(src_page_an->autonuma_last_nid);
if (autonuma_last_nid >= 0)
- ACCESS_ONCE(page->autonuma_last_nid) =
+ ACCESS_ONCE(page_an->autonuma_last_nid) =
autonuma_last_nid;
}
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e096742..8e6493a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/migrate.h>
#include <linux/page-debug-flags.h>
#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -619,9 +620,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
-#ifdef CONFIG_AUTONUMA
- page->autonuma_last_nid = -1;
-#endif
+ autonuma_free_page(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3797,9 +3796,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
- page->autonuma_last_nid = -1;
-#endif
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
@@ -4402,14 +4398,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
int ret;
pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
- spin_lock_init(&pgdat->autonuma_migrate_lock);
- pgdat->autonuma_migrate_nr_pages = 0;
- pgdat->autonuma_migrate_last_jiffies = jiffies;
-#endif
init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
pgdat_page_cgroup_init(pgdat);
+ pgdat_autonuma_init(pgdat);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..d400d7f
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,237 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+#include <linux/vmalloc.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages)
+{
+ struct page *end;
+ for (end = page + nr_pages; page < end; page++, page_autonuma++)
+ page_autonuma->autonuma_last_nid = -1;
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ spin_lock_init(&pgdat->autonuma_migrate_lock);
+ pgdat->autonuma_migrate_nr_pages = 0;
+ pgdat->autonuma_migrate_last_jiffies = jiffies;
+
+ /* initialize autonuma_possible() */
+ if (num_possible_nodes() <= 1)
+ clear_bit(AUTONUMA_POSSIBLE_FLAG, &autonuma_flags);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+ pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_autonuma *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (unlikely(!base))
+ return NULL;
+#endif
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+ struct page_autonuma *base;
+ unsigned long table_size;
+ unsigned long nr_pages;
+
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+ if (!nr_pages)
+ return 0;
+
+ table_size = sizeof(struct page_autonuma) * nr_pages;
+
+ base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (!base)
+ return -ENOMEM;
+ NODE_DATA(nid)->node_page_autonuma = base;
+ total_usage += table_size;
+ page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+ return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+ int nid, fail;
+
+ /* __pgdat_autonuma_init initialized autonuma_possible() */
+ if (!autonuma_possible())
+ return;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_autonuma(nid);
+ if (fail)
+ goto fail;
+ }
+ printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+ total_usage >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ return;
+fail:
+ printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+ printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+
+ /* if it's not a power of two we may be wasting memory */
+ BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+ (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+ /* memsection must be a power of two */
+ BUILD_BUG_ON(sizeof(struct mem_section) &
+ (sizeof(struct mem_section)-1));
+
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (!section->section_page_autonuma)
+ return NULL;
+#endif
+ return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages)
+{
+ struct page_autonuma *ret;
+ struct page *page;
+ unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+ page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+ get_order(memmap_size));
+ if (page)
+ goto got_map_page_autonuma;
+
+ ret = vmalloc(memmap_size);
+ if (ret)
+ goto out;
+
+ return NULL;
+got_map_page_autonuma:
+ ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+ return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages)
+{
+ if (is_vmalloc_addr(page_autonuma))
+ vfree(page_autonuma);
+ else
+ free_pages((unsigned long)page_autonuma,
+ get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+ int nid)
+{
+ struct page_autonuma *map;
+ unsigned long size;
+
+ map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+ if (map)
+ return map;
+
+ size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+ map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid)
+{
+ void *map;
+ unsigned long pnum;
+ unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+ map = alloc_remap(nodeid, size * map_count);
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ size = PAGE_ALIGN(size);
+ map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ /* fallback */
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+ if (page_autonuma_map[pnum])
+ continue;
+ ms = __nr_to_section(pnum);
+ printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+ "some memory will not be available.\n", __func__);
+ }
+}
+
+#endif /* CONFIG_SPARSEMEM */
diff --git a/mm/sparse.c b/mm/sparse.c
index fac95f2..5b8d018 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
#include "internal.h"
#include <asm/dma.h>
#include <asm/pgalloc.h>
@@ -230,7 +231,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
static int __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
- unsigned long *pageblock_bitmap)
+ unsigned long *pageblock_bitmap,
+ struct page_autonuma *page_autonuma)
{
if (!present_section(ms))
return -EINVAL;
@@ -239,6 +241,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+ if (page_autonuma) {
+ ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+ page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+ }
+#else
+ BUG_ON(page_autonuma);
+#endif
return 1;
}
@@ -480,6 +490,9 @@ void __init sparse_init(void)
int size2;
struct page **map_map;
#endif
+ struct page_autonuma **uninitialized_var(page_autonuma_map);
+ struct page_autonuma *page_autonuma;
+ int size3;
/* Setup pageblock_order for HUGETLB_PAGE_SIZE_VARIABLE */
set_pageblock_order();
@@ -577,6 +590,63 @@ void __init sparse_init(void)
map_count, nodeid_begin);
#endif
+ /* __pgdat_autonuma_init initialized autonuma_possible() */
+ if (autonuma_possible()) {
+ unsigned long total_page_autonuma;
+ unsigned long page_autonuma_count;
+
+ size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+ page_autonuma_map = alloc_bootmem(size3);
+ if (!page_autonuma_map)
+ panic("can not allocate page_autonuma_map\n");
+
+ for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid_begin = sparse_early_nid(ms);
+ pnum_begin = pnum;
+ break;
+ }
+ total_page_autonuma = 0;
+ page_autonuma_count = 1;
+ for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+ int nodeid;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid = sparse_early_nid(ms);
+ if (nodeid == nodeid_begin) {
+ page_autonuma_count++;
+ continue;
+ }
+ /* ok, we need to take cake of from pnum_begin to pnum - 1*/
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+ pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count,
+ nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ /* new start, update count etc*/
+ nodeid_begin = nodeid;
+ pnum_begin = pnum;
+ page_autonuma_count = 1;
+ }
+ /* ok, last chunk */
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count, nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ printk("allocated %lu KBytes of page_autonuma\n",
+ total_page_autonuma >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ }
+
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
if (!present_section_nr(pnum))
continue;
@@ -585,6 +655,13 @@ void __init sparse_init(void)
if (!usemap)
continue;
+ if (autonuma_possible()) {
+ page_autonuma = page_autonuma_map[pnum];
+ if (!page_autonuma)
+ continue;
+ } else
+ page_autonuma = NULL;
+
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
@@ -594,11 +671,13 @@ void __init sparse_init(void)
continue;
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
- usemap);
+ usemap, page_autonuma);
}
vmemmap_populate_print_last();
+ if (autonuma_possible())
+ free_bootmem(__pa(page_autonuma_map), size3);
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
free_bootmem(__pa(map_map), size2);
#endif
@@ -685,7 +764,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+ struct page_autonuma *page_autonuma)
{
struct page *usemap_page;
unsigned long nr_pages;
@@ -699,8 +779,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
*/
if (PageSlab(usemap_page)) {
kfree(usemap);
- if (memmap)
+ if (memmap) {
__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+ if (autonuma_possible())
+ __kfree_section_page_autonuma(page_autonuma,
+ PAGES_PER_SECTION);
+ else
+ BUG_ON(page_autonuma);
+ }
return;
}
@@ -717,6 +803,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>> PAGE_SHIFT;
free_map_bootmem(memmap_page, nr_pages);
+
+ if (autonuma_possible()) {
+ struct page *page_autonuma_page;
+ page_autonuma_page = virt_to_page(page_autonuma);
+ free_map_bootmem(page_autonuma_page, nr_pages);
+ } else
+ BUG_ON(page_autonuma);
}
}
@@ -732,6 +825,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
+ struct page_autonuma *page_autonuma;
unsigned long *usemap;
unsigned long flags;
int ret;
@@ -751,6 +845,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
__kfree_section_memmap(memmap, nr_pages);
return -ENOMEM;
}
+ if (autonuma_possible()) {
+ page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+ nr_pages);
+ if (!page_autonuma) {
+ kfree(usemap);
+ __kfree_section_memmap(memmap, nr_pages);
+ return -ENOMEM;
+ }
+ } else
+ page_autonuma = NULL;
pgdat_resize_lock(pgdat, &flags);
@@ -762,11 +866,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
- ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+ ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+ page_autonuma);
out:
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0) {
+ if (autonuma_possible())
+ __kfree_section_page_autonuma(page_autonuma, nr_pages);
+ else
+ BUG_ON(page_autonuma);
kfree(usemap);
__kfree_section_memmap(memmap, nr_pages);
}
@@ -777,6 +886,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL;
+ struct page_autonuma *page_autonuma = NULL;
if (ms->section_mem_map) {
usemap = ms->pageblock_flags;
@@ -784,8 +894,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
__section_nr(ms));
ms->section_mem_map = 0;
ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+ page_autonuma = ms->section_page_autonuma;
+#endif
}
- free_section_usemap(memmap, usemap);
+ free_section_usemap(memmap, usemap, page_autonuma);
}
#endif
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2012-10-03 23:51 UTC|newest]
Thread overview: 117+ messages / expand[flat|nested] mbox.gz Atom feed top
2012-10-03 23:50 [PATCH 00/33] AutoNUMA27 Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 01/33] autonuma: add Documentation/vm/autonuma.txt Andrea Arcangeli
2012-10-11 10:50 ` Mel Gorman
2012-10-11 16:07 ` Andrea Arcangeli
2012-10-11 19:37 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 02/33] autonuma: make set_pmd_at always available Andrea Arcangeli
2012-10-11 10:54 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 03/33] autonuma: export is_vma_temporary_stack() even if CONFIG_TRANSPARENT_HUGEPAGE=n Andrea Arcangeli
2012-10-11 10:54 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 04/33] autonuma: define _PAGE_NUMA Andrea Arcangeli
2012-10-11 11:01 ` Mel Gorman
2012-10-11 16:43 ` Andrea Arcangeli
2012-10-11 19:48 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 05/33] autonuma: pte_numa() and pmd_numa() Andrea Arcangeli
2012-10-11 11:15 ` Mel Gorman
2012-10-11 16:58 ` Andrea Arcangeli
2012-10-11 19:54 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 06/33] autonuma: teach gup_fast about pmd_numa Andrea Arcangeli
2012-10-11 12:22 ` Mel Gorman
2012-10-11 17:05 ` Andrea Arcangeli
2012-10-11 20:01 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 07/33] autonuma: mm_autonuma and task_autonuma data structures Andrea Arcangeli
2012-10-11 12:28 ` Mel Gorman
2012-10-11 15:24 ` Rik van Riel
2012-10-11 15:57 ` Mel Gorman
2012-10-12 0:23 ` Christoph Lameter
2012-10-12 0:52 ` Andrea Arcangeli
2012-10-11 17:15 ` Andrea Arcangeli
2012-10-11 20:06 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 08/33] autonuma: define the autonuma flags Andrea Arcangeli
2012-10-11 13:46 ` Mel Gorman
2012-10-11 17:34 ` Andrea Arcangeli
2012-10-11 20:17 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 09/33] autonuma: core autonuma.h header Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 10/33] autonuma: CPU follows memory algorithm Andrea Arcangeli
2012-10-11 14:58 ` Mel Gorman
2012-10-12 0:25 ` Andrea Arcangeli
2012-10-12 8:29 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 11/33] autonuma: add the autonuma_last_nid in the page structure Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 12/33] autonuma: Migrate On Fault per NUMA node data Andrea Arcangeli
2012-10-11 15:43 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 13/33] autonuma: autonuma_enter/exit Andrea Arcangeli
2012-10-11 13:50 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 14/33] autonuma: call autonuma_setup_new_exec() Andrea Arcangeli
2012-10-11 15:47 ` Mel Gorman
2012-10-03 23:50 ` [PATCH 15/33] autonuma: alloc/free/init task_autonuma Andrea Arcangeli
2012-10-11 15:53 ` Mel Gorman
2012-10-11 17:34 ` Rik van Riel
[not found] ` <20121011175953.GT1818@redhat.com>
2012-10-12 14:03 ` Rik van Riel
2012-10-03 23:50 ` [PATCH 16/33] autonuma: alloc/free/init mm_autonuma Andrea Arcangeli
2012-10-03 23:50 ` [PATCH 17/33] autonuma: prevent select_task_rq_fair to return -1 Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 18/33] autonuma: teach CFS about autonuma affinity Andrea Arcangeli
2012-10-05 6:41 ` Mike Galbraith
2012-10-05 11:54 ` Andrea Arcangeli
2012-10-06 2:39 ` Mike Galbraith
2012-10-06 12:34 ` Andrea Arcangeli
2012-10-07 6:07 ` Mike Galbraith
2012-10-08 7:03 ` Mike Galbraith
2012-10-03 23:51 ` [PATCH 19/33] autonuma: memory follows CPU algorithm and task/mm_autonuma stats collection Andrea Arcangeli
2012-10-10 22:01 ` Rik van Riel
2012-10-10 22:36 ` Andrea Arcangeli
2012-10-11 18:28 ` Mel Gorman
2012-10-13 18:06 ` Srikar Dronamraju
2012-10-15 8:24 ` Srikar Dronamraju
2012-10-15 9:20 ` Mel Gorman
2012-10-15 10:00 ` Srikar Dronamraju
2012-10-03 23:51 ` [PATCH 20/33] autonuma: default mempolicy follow AutoNUMA Andrea Arcangeli
2012-10-04 20:03 ` KOSAKI Motohiro
2012-10-11 18:32 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 21/33] autonuma: call autonuma_split_huge_page() Andrea Arcangeli
2012-10-11 18:33 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 22/33] autonuma: make khugepaged pte_numa aware Andrea Arcangeli
2012-10-11 18:36 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 23/33] autonuma: retain page last_nid information in khugepaged Andrea Arcangeli
2012-10-11 18:44 ` Mel Gorman
2012-10-12 11:37 ` Rik van Riel
2012-10-12 12:35 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 24/33] autonuma: split_huge_page: transfer the NUMA type from the pmd to the pte Andrea Arcangeli
2012-10-11 18:45 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 25/33] autonuma: numa hinting page faults entry points Andrea Arcangeli
2012-10-11 18:47 ` Mel Gorman
2012-10-03 23:51 ` [PATCH 26/33] autonuma: reset autonuma page data when pages are freed Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 27/33] autonuma: link mm/autonuma.o and kernel/sched/numa.o Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 28/33] autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED Andrea Arcangeli
2012-10-11 18:50 ` Mel Gorman
2012-10-03 23:51 ` Andrea Arcangeli [this message]
2012-10-04 14:16 ` [PATCH 29/33] autonuma: page_autonuma Christoph Lameter
2012-10-04 20:09 ` KOSAKI Motohiro
2012-10-05 11:31 ` Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 30/33] autonuma: bugcheck page_autonuma fields on newly allocated pages Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 31/33] autonuma: boost khugepaged scanning rate Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 32/33] autonuma: add migrate_allow_first_fault knob in sysfs Andrea Arcangeli
2012-10-03 23:51 ` [PATCH 33/33] autonuma: add mm_autonuma working set estimation Andrea Arcangeli
2012-10-04 18:39 ` [PATCH 00/33] AutoNUMA27 Andrew Morton
2012-10-04 20:49 ` Rik van Riel
2012-10-05 23:08 ` Rik van Riel
2012-10-05 23:14 ` Andi Kleen
2012-10-05 23:57 ` Tim Chen
2012-10-06 0:11 ` Andi Kleen
2012-10-08 13:44 ` Don Morris
2012-10-08 20:34 ` Rik van Riel
2012-10-11 10:19 ` Mel Gorman
2012-10-11 14:56 ` Andrea Arcangeli
2012-10-11 15:35 ` Mel Gorman
2012-10-12 0:41 ` Andrea Arcangeli
2012-10-12 14:54 ` Mel Gorman
2012-10-11 21:34 ` Mel Gorman
2012-10-12 1:45 ` Andrea Arcangeli
2012-10-12 8:46 ` Mel Gorman
2012-10-13 18:40 ` Srikar Dronamraju
2012-10-14 4:57 ` Andrea Arcangeli
2012-10-15 8:16 ` Srikar Dronamraju
2012-10-23 16:32 ` Srikar Dronamraju
2012-10-16 13:48 ` Mel Gorman
-- strict thread matches above, loose matches on Subject: below --
2012-10-04 16:50 Andrea Arcangeli
2012-10-04 18:17 ` your mail Christoph Lameter
2012-10-04 18:38 ` [PATCH 29/33] autonuma: page_autonuma Andrea Arcangeli
2012-10-04 19:11 ` Christoph Lameter
2012-10-05 11:11 ` Andrea Arcangeli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1349308275-2174-30-git-send-email-aarcange@redhat.com \
--to=aarcange@redhat.com \
--cc=Lee.Schermerhorn@hp.com \
--cc=akpm@linux-foundation.org \
--cc=alex.shi@intel.com \
--cc=benh@kernel.crashing.org \
--cc=bharata.rao@gmail.com \
--cc=cl@linux.com \
--cc=danms@us.ibm.com \
--cc=dhillf@gmail.com \
--cc=don.morris@hp.com \
--cc=drjones@redhat.com \
--cc=efault@gmx.de \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=konrad.wilk@oracle.com \
--cc=laijs@cn.fujitsu.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mauricfo@linux.vnet.ibm.com \
--cc=mel@csn.ul.ie \
--cc=mingo@elte.hu \
--cc=paulmck@linux.vnet.ibm.com \
--cc=pjt@google.com \
--cc=pzijlstr@redhat.com \
--cc=riel@redhat.com \
--cc=suresh.b.siddha@intel.com \
--cc=tglx@linutronix.de \
--cc=torvalds@linux-foundation.org \
--cc=vatsa@linux.vnet.ibm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).