linux-mm.kvack.org archive mirror
* [RFC] DNUMA: Runtime NUMA memory layout reconfiguration
@ 2013-02-28  2:41 Cody P Schafer
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
  0 siblings, 1 reply; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28  2:41 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer, LKML

These patches allow the NUMA memory layout (meaning the mapping of a page to a
node) to be changed at runtime in place (without hotplugging).

= Why/when is this useful? =

In virtual machines (VMs) running on NUMA systems, this is useful both [a]
when the hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] more generally when
migrating VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems whose firmware currently moves the backing memory around
and has the ability to notify Linux of the new NUMA info.

= Code & testing =

web:
	https://github.com/jmesmon/linux/tree/dnuma/v26
git:
	https://github.com/jmesmon/linux.git dnuma/v26

commit range:
	7e4f3230c9161706ebe9d37d774398082dc352de^..01e16461cf4a914feb1a34ed8dd7b28f3e842645

Some patches are marked "XXX: ..."; they are only for testing or
temporary documentation purposes.

A debugfs interface allows the NUMA memory layout to be changed.  Basically,
you don't need a weird system to test this; in fact, I've done all my
testing so far in plain old qemu-i386.

A script which stripes the memory between nodes or pushes all memory to a
(potentially new) node is available here:

	https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test

Related LSF/MM Topic proposal:

	http://permalink.gmane.org/gmane.linux.kernel.mm/95342

= How are you managing to do this? =

Reconfiguration of page->node mappings is done at the page allocator
level, both by pulling out free pages (when a new memory layout is
committed) and by redirecting pages to their new node as they are freed.

Because we can't change page_node(A) while A is allocated, an rbtree
holding the mapping from pfn ranges to node ids ('struct memlayout')
is introduced to track the pfn->node mapping for
yet-to-be-transplanted pages. A lookup in this rbtree occurs on any
page allocator path that decides which zone to free a page to.

To avoid paying for an rbtree lookup on every free, the rbtree is only
consulted when the page is marked with a new pageflag (LookupNode).
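
As a rough illustration, the free-path check looks approximately like the
following sketch (names are taken from the patches later in this series;
"free_to_nid" is just an illustrative placeholder):

	if (TestClearPageLookupNode(page)) {
		/* only now pay for the rbtree lookup */
		int new_nid = memlayout_pfn_to_nid(page_to_pfn(page));

		if (new_nid != NUMA_NO_NODE && new_nid != page_to_nid(page))
			free_to_nid = new_nid;	/* redirect the free to the new node */
	}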

= Current Limitations =

For the reconfiguration to be effective (and not push the allocator into
poorer choices), the cpu->node mappings also need updating. This patchset
does _not_ handle that. Also missing is a way to update the topology (node
distances), which is less fatal.

These patches only work with SPARSEMEM, and the node id _must_ fit in the
pageflags (it can't be pushed out to the section). This generally rules out
32-bit platforms (unless you hack MAX_PHYS{ADDR,MEM}_BITS as I do for
testing).

This code does the reconfiguration without hotplugging memory at all (one
errant page doesn't keep us from fixing the rest of them), but it still
depends on MEMORY_HOTPLUG for the functions that online nodes & adjust
zone/pgdat sizes.

Things that need doing (but aren't quite bugs):
 - While the interface is meant to be driven via a hypervisor/firmware, that
   portion is not yet included.
 - notifier for kernel users of memory that need/want their allocations on a
   particular node (NODE_DATA(), for instance).
 - notifier for userspace.
 - a way to allocate things from the appropriate node prior to the page
   allocator being fully updated (could just be "allocate it wrong now &
   reallocate later").
 - Make memlayout faster (potentially via per-node allocation, different data
   structure, or more/smarter caching).
 - (potentially) propagation of updated layout knowledge into kmem_caches
   (SL*B).

Known Bugs:
 - Transplanting free pages is _very_ slow due to excessive use of
   stop_machine() via zone_pcp_update() & build_zonelists(). On my i5 laptop,
   it takes ~9 minutes to stripe the layout in blocks of 256 pfns across 3
   nodes on a 128MB, 8-cpu x86_32 VM booted with 2 nodes.
 - memory leak when SLUB is used (struct kmem_cache_nodes are leaked), SLAB
   appears fine.
 - Locking of managed_pages/present_pages needs adjustment, or they need to
   be updated outside of the free page path.
 - Exported numa/memory info in sysfs isn't updated (`numactl --show` segfaults,
   `numactl --hardware` shows new nodes as nearly empty).
 - Uses pageflag setters without "owning" pages, which could cause loss of
   pageflag updates when combined with non-atomic pageflag users in mm/*.
 - Some strange sleep-while-atomic warnings (for me they occur when memory is
   moved out of all the boot nodes).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration
  2013-02-28  2:41 [RFC] DNUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
@ 2013-02-28 20:44 ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE Cody P Schafer
                     ` (24 more replies)
  0 siblings, 25 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Some people asked me to send the patches for this via email instead of just
posting a git tree link.

For reference, this is the original message:
	http://lkml.org/lkml/2013/2/27/374

--

 arch/x86/Kconfig                 |   1 -
 arch/x86/include/asm/sparsemem.h |   4 +-
 arch/x86/mm/numa.c               |  32 +++-
 include/linux/dnuma.h            |  96 +++++++++++
 include/linux/memlayout.h        | 111 +++++++++++++
 include/linux/memory_hotplug.h   |   4 +
 include/linux/mm.h               |   7 +-
 include/linux/page-flags.h       |  18 ++
 include/linux/rbtree.h           |  11 ++
 init/main.c                      |   2 +
 lib/rbtree.c                     |  40 +++++
 mm/Kconfig                       |  44 +++++
 mm/Makefile                      |   2 +
 mm/dnuma.c                       | 351 +++++++++++++++++++++++++++++++++++++++
 mm/internal.h                    |  13 +-
 mm/memlayout-debugfs.c           | 323 +++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h           |  35 ++++
 mm/memlayout.c                   | 267 +++++++++++++++++++++++++++++
 mm/memory_hotplug.c              |  53 +++---
 mm/page_alloc.c                  | 112 +++++++++++--
 20 files changed, 1486 insertions(+), 40 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 02/24] XXX: x86/Kconfig: simplify NUMA config for NUMA_EMU on X86_32 Cody P Schafer
                     ` (23 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

This is a hack I use to allow PAE to be enabled while still fitting the node
id into the pageflags (PAE is enabled as a workaround for a kvm bug).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/include/asm/sparsemem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 4517d6b..548e612 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -17,8 +17,8 @@
 #ifdef CONFIG_X86_32
 # ifdef CONFIG_X86_PAE
 #  define SECTION_SIZE_BITS	29
-#  define MAX_PHYSADDR_BITS	36
-#  define MAX_PHYSMEM_BITS	36
+#  define MAX_PHYSADDR_BITS	32
+#  define MAX_PHYSMEM_BITS	32
 # else
 #  define SECTION_SIZE_BITS	26
 #  define MAX_PHYSADDR_BITS	32
-- 
1.8.1.1
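
For context, a sketch of why shrinking MAX_PHYSMEM_BITS helps here
(illustrative only, not part of the patch):

	/*
	 * With classic sparsemem (no vmemmap) the section index is stored in
	 * page->flags, and
	 *
	 *	SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS
	 *
	 * On X86_32 + PAE that is 36 - 29 = 7 bits; dropping MAX_PHYSMEM_BITS
	 * to 32 shrinks it to 32 - 29 = 3 bits, leaving enough room in
	 * page->flags for the node id, which dynamic NUMA requires (see the
	 * NODE_NOT_IN_PAGE_FLAGS check added in the memlayout header later in
	 * the series).
	 */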

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 02/24] XXX: x86/Kconfig: simplify NUMA config for NUMA_EMU on X86_32.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
  2013-02-28 20:44   ` [PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 03/24] XXX: memory_hotplug locking note in online_pages Cody P Schafer
                     ` (22 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

NUMA_EMU depends on NUMA.
NUMA depends on (X86_64 || (X86_32 && ( list of extended platforms))).

This forced enabling an extended platform when using NUMA emulation on
x86_32, which is silly.

Removing the list of extended platforms (plus EXPERIMENTAL) results in
NUMA depending on X86_64 || X86_32, so simply remove all dependencies
(except SMP) from NUMA.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/Kconfig | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 6a93833..58cd8fb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1228,7 +1228,6 @@ config DIRECT_GBPAGES
 config NUMA
 	bool "Numa Memory Allocation and Scheduler Support"
 	depends on SMP
-	depends on X86_64 || (X86_32 && HIGHMEM64G && (X86_NUMAQ || X86_BIGSMP || X86_SUMMIT && ACPI))
 	default y if (X86_NUMAQ || X86_SUMMIT || X86_BIGSMP)
 	---help---
 	  Enable NUMA (Non Uniform Memory Access) support.
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 03/24] XXX: memory_hotplug locking note in online_pages.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
  2013-02-28 20:44   ` [PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE Cody P Schafer
  2013-02-28 20:44   ` [PATCH 02/24] XXX: x86/Kconfig: simplify NUMA config for NUMA_EMU on X86_32 Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 04/24] rbtree: add postorder iteration functions Cody P Schafer
                     ` (21 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b81a367b..102c06a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -984,6 +984,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_typ
 
 	zone->managed_pages += onlined_pages;
 	zone->present_pages += onlined_pages;
+	/* FIXME: should be protected by pgdat_resize_lock() */
 	zone->zone_pgdat->node_present_pages += onlined_pages;
 	if (onlined_pages) {
 		node_states_set_node(zone_to_nid(zone), &arg);
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 04/24] rbtree: add postorder iteration functions.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (2 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 03/24] XXX: memory_hotplug locking note in online_pages Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 05/24] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
                     ` (20 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Add postorder iteration functions for rbtree. These are useful for
safely freeing an entire rbtree without modifying the tree at all.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h |  4 ++++
 lib/rbtree.c           | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 0022c1b..2879e96 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -68,6 +68,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
 extern struct rb_node *rb_first(const struct rb_root *);
 extern struct rb_node *rb_last(const struct rb_root *);
 
+/* Postorder iteration - always visit the parent after its children */
+extern struct rb_node *rb_first_postorder(const struct rb_root *);
+extern struct rb_node *rb_next_postorder(const struct rb_node *);
+
 /* Fast replacement of a single node without remove/rebalance/add/rebalance */
 extern void rb_replace_node(struct rb_node *victim, struct rb_node *new, 
 			    struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index c0e31fe..65f4eff 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -518,3 +518,43 @@ void rb_replace_node(struct rb_node *victim, struct rb_node *new,
 	*new = *victim;
 }
 EXPORT_SYMBOL(rb_replace_node);
+
+static struct rb_node *rb_left_deepest_node(const struct rb_node *node)
+{
+	for (;;) {
+		if (node->rb_left)
+			node = node->rb_left;
+		else if (node->rb_right)
+			node = node->rb_right;
+		else
+			return (struct rb_node *)node;
+	}
+}
+
+struct rb_node *rb_next_postorder(const struct rb_node *node)
+{
+	const struct rb_node *parent;
+	if (!node)
+		return NULL;
+	parent = rb_parent(node);
+
+	/* If we're sitting on node, we've already seen our children */
+	if (parent && node == parent->rb_left && parent->rb_right) {
+		/* If we are the parent's left node, go to the parent's right
+		 * node then all the way down to the left */
+		return rb_left_deepest_node(parent->rb_right);
+	} else
+		/* Otherwise we are the parent's right node, and the parent
+		 * should be next */
+		return (struct rb_node *)parent;
+}
+EXPORT_SYMBOL(rb_next_postorder);
+
+struct rb_node *rb_first_postorder(const struct rb_root *root)
+{
+	if (!root->rb_node)
+		return NULL;
+
+	return rb_left_deepest_node(root->rb_node);
+}
+EXPORT_SYMBOL(rb_first_postorder);
-- 
1.8.1.1
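
For illustration, freeing every element of an rbtree with these helpers might
look like the following sketch ('struct foo' and free_all() are hypothetical;
assumes <linux/rbtree.h> and <linux/slab.h>):

	struct foo {
		struct rb_node node;
		int key;
	};

	static void free_all(struct rb_root *root)
	{
		struct rb_node *pos = rb_first_postorder(root);

		while (pos) {
			/* fetch the next node before freeing the current one */
			struct rb_node *next = rb_next_postorder(pos);

			kfree(rb_entry(pos, struct foo, node));
			pos = next;
		}
		*root = RB_ROOT;
	}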

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 05/24] rbtree: add rbtree_postorder_for_each_entry_safe() helper.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (3 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 04/24] rbtree: add postorder iteration functions Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 06/24] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
                     ` (19 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/rbtree.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 2879e96..8ff52b2 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -85,4 +85,11 @@ static inline void rb_link_node(struct rb_node * node, struct rb_node * parent,
 	*rb_link = node;
 }
 
+#define rbtree_postorder_for_each_entry_safe(pos, n, root, field)		\
+	for (pos = rb_entry(rb_first_postorder(root), typeof(*pos), field),	\
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field);	\
+	     &pos->field;							\
+	     pos = n,								\
+	      n = rb_entry(rb_next_postorder(&pos->field), typeof(*pos), field))
+
 #endif	/* _LINUX_RBTREE_H */
-- 
1.8.1.1
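
For illustration, the same kind of teardown written with the helper ('struct
foo' as in the previous sketch; 'n' caches the next entry so the current one
can be freed inside the loop body):

	struct foo *pos, *n;

	rbtree_postorder_for_each_entry_safe(pos, n, &root, node)
		kfree(pos);
	root = RB_ROOT;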

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 06/24] mm/memory_hotplug: factor out zone+pgdat growth.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (4 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 05/24] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 07/24] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
                     ` (18 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Create a new function, grow_pgdat_and_zone(), which handles the locking and
growth of a zone and the pgdat it is associated with.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  3 +++
 mm/memory_hotplug.c            | 17 +++++++++++------
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index b6a3be7..cd393014 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -78,6 +78,9 @@ static inline void zone_seqlock_init(struct zone *zone)
 {
 	seqlock_init(&zone->span_seqlock);
 }
+extern void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+				unsigned long end_pfn);
+
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 102c06a..9e4c32b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -390,13 +390,22 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 					pgdat->node_start_pfn;
 }
 
+void grow_pgdat_and_zone(struct zone *zone, unsigned long start_pfn,
+		unsigned long end_pfn)
+{
+	unsigned long flags;
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	grow_zone_span(zone, start_pfn, end_pfn);
+	grow_pgdat_span(zone->zone_pgdat, start_pfn, end_pfn);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
 static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
 	int nr_pages = PAGES_PER_SECTION;
 	int nid = pgdat->node_id;
 	int zone_type;
-	unsigned long flags;
 	int ret;
 
 	zone_type = zone - pgdat->node_zones;
@@ -404,11 +413,7 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
 	if (ret)
 		return ret;
 
-	pgdat_resize_lock(zone->zone_pgdat, &flags);
-	grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
-	grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
-			phys_start_pfn + nr_pages);
-	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+	grow_pgdat_and_zone(zone, phys_start_pfn, phys_start_pfn + nr_pages);
 	memmap_init_zone(nr_pages, nid, zone_type,
 			 phys_start_pfn, MEMMAP_HOTPLUG);
 	return 0;
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 07/24] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (5 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 06/24] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 08/24] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
                     ` (17 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Export ensure_zone_is_initialized() so that it can be used to initialize
new zones within the dynamic NUMA code.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h       | 8 ++++++++
 mm/memory_hotplug.c | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1c0c4cc..6c63752 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -105,6 +105,14 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 extern bool is_free_buddy_page(struct page *page);
 #endif
 
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * in mm/memory_hotplug.c
+ */
+extern int ensure_zone_is_initialized(struct zone *zone,
+			unsigned long start_pfn, unsigned long num_pages);
+#endif
+
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
 
 /*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9e4c32b..9f43c80 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -284,7 +284,7 @@ static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
 
 /* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
  * alloc_bootmem_node_nopanic() */
-static int __ref ensure_zone_is_initialized(struct zone *zone,
+int __ref ensure_zone_is_initialized(struct zone *zone,
 			unsigned long start_pfn, unsigned long num_pages)
 {
 	if (!zone_is_initialized(zone))
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 08/24] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (6 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 07/24] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 20:44   ` [PATCH 09/24] mm: add nid_zone() helper Cody P Schafer
                     ` (16 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Use the *_is_empty() helpers to be more clear about what we're actually
checking for.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memory_hotplug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 9f43c80..eae4a2a 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -242,7 +242,7 @@ static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
 	zone_span_writelock(zone);
 
 	old_zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	if (!zone->spanned_pages || start_pfn < zone->zone_start_pfn)
+	if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
 		zone->zone_start_pfn = start_pfn;
 
 	zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
@@ -383,7 +383,7 @@ static void grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
 	unsigned long old_pgdat_end_pfn =
 		pgdat->node_start_pfn + pgdat->node_spanned_pages;
 
-	if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
+	if (pgdat_is_empty(pgdat) || start_pfn < pgdat->node_start_pfn)
 		pgdat->node_start_pfn = start_pfn;
 
 	pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 09/24] mm: add nid_zone() helper
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (7 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 08/24] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
@ 2013-02-28 20:44   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 10/24] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
                     ` (15 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 20:44 UTC (permalink / raw)
  To: Linux MM; +Cc: David Hansen, Cody P Schafer

Add nid_zone(), which returns the zone corresponding to a given nid & zonenum.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/mm.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e7c3f9a..562304a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -707,9 +707,14 @@ static inline void page_nid_reset_last(struct page *page)
 }
 #endif
 
+static inline struct zone *nid_zone(int nid, enum zone_type zonenum)
+{
+	return &NODE_DATA(nid)->node_zones[zonenum];
+}
+
 static inline struct zone *page_zone(const struct page *page)
 {
-	return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
+	return nid_zone(page_to_nid(page), page_zonenum(page));
 }
 
 #ifdef SECTION_IN_PAGE_FLAGS
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/24] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (8 preceding siblings ...)
  2013-02-28 20:44   ` [PATCH 09/24] mm: add nid_zone() helper Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 11/24] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
                     ` (14 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

Add return_pages_to_zone(): a minimized version of __free_pages_ok() which
handles adding pages that have been removed from another zone into a new
zone.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/internal.h   |  5 ++++-
 mm/page_alloc.c | 17 +++++++++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 6c63752..b075e34 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -104,6 +104,10 @@ extern void prep_compound_page(struct page *page, unsigned long order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone);
+#endif
 
 #ifdef CONFIG_MEMORY_HOTPLUG
 /*
@@ -114,7 +118,6 @@ extern int ensure_zone_is_initialized(struct zone *zone,
 #endif
 
 #if defined CONFIG_COMPACTION || defined CONFIG_CMA
-
 /*
  * in mm/compaction.c
  */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0dade3f..bbc9b6e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -449,6 +449,12 @@ static inline void set_page_order(struct page *page, int order)
 	__SetPageBuddy(page);
 }
 
+static inline void set_free_page_order(struct page *page, int order)
+{
+	set_page_private(page, order);
+	VM_BUG_ON(!PageBuddy(page));
+}
+
 static inline void rmv_page_order(struct page *page)
 {
 	__ClearPageBuddy(page);
@@ -745,6 +751,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	local_irq_restore(flags);
 }
 
+#ifdef CONFIG_DYNAMIC_NUMA
+void return_pages_to_zone(struct page *page, unsigned int order,
+			  struct zone *zone)
+{
+	unsigned long flags;
+	local_irq_save(flags);
+	free_one_page(zone, page, order, get_freepage_migratetype(page));
+	local_irq_restore(flags);
+}
+#endif
+
 /*
  * Read access to zone->managed_pages is safe because it's unsigned long,
  * but we still need to serialize writers. Currently all callers of
-- 
1.8.1.1
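
For illustration, the intended caller (see add_free_page_to_node() in the
dnuma patch later in this series) uses roughly this sequence to transplant a
free page into its new node's zone:

	struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));

	/* adjust the old zone's accounting, set the new node id on the page,
	 * and grow the destination zone/pgdat span */
	dnuma_prior_free_to_new_zone(page, order, dest_zone, dest_nid);
	/* hand the page to the destination zone's buddy free lists */
	return_pages_to_zone(page, order, dest_zone);
	/* account the page to the destination zone */
	dnuma_post_free_to_new_zone(page, order);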

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/24] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (9 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 10/24] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 12/24] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
                     ` (13 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

With dynamic NUMA, pages are going to be gradually moved from one node to
another, causing the page ranges that move_freepages() examines to
contain pages that actually belong to another node.

When dynamic NUMA is enabled, we skip these pages instead of VM_BUG()ing
out on them.

This additionally moves the VM_BUG_ON() (which detects a change in node)
so that it follows the pfn_valid_within() check.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bbc9b6e..972d7cc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -964,6 +964,7 @@ int move_freepages(struct zone *zone,
 	struct page *page;
 	unsigned long order;
 	int pages_moved = 0;
+	int zone_nid = zone_to_nid(zone);
 
 #ifndef CONFIG_HOLES_IN_ZONE
 	/*
@@ -977,14 +978,24 @@ int move_freepages(struct zone *zone,
 #endif
 
 	for (page = start_page; page <= end_page;) {
-		/* Make sure we are not inadvertently changing nodes */
-		VM_BUG_ON(page_to_nid(page) != zone_to_nid(zone));
-
 		if (!pfn_valid_within(page_to_pfn(page))) {
 			page++;
 			continue;
 		}
 
+		if (page_to_nid(page) != zone_nid) {
+#ifndef CONFIG_DYNAMIC_NUMA
+			/*
+			 * In the normal case (without Dynamic NUMA), all pages
+			 * in a pageblock should belong to the same zone (and
+			 * as a result all have the same nid).
+			 */
+			VM_BUG_ON(page_to_nid(page) != zone_nid);
+#endif
+			page++;
+			continue;
+		}
+
 		if (!PageBuddy(page)) {
 			page++;
 			continue;
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 12/24] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (10 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 11/24] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 13/24] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
                     ` (12 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

When dynamic NUMA is enabled, the first or last page in a pageblock may
have been transplanted to a new zone (or may not yet have been transplanted
to a new zone).

Disable a BUG_ON() which checks that start_page and end_page are in the
same zone; if they are not in the proper zone they will simply be skipped.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 972d7cc..274826c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -966,13 +966,16 @@ int move_freepages(struct zone *zone,
 	int pages_moved = 0;
 	int zone_nid = zone_to_nid(zone);
 
-#ifndef CONFIG_HOLES_IN_ZONE
+#if !defined(CONFIG_HOLES_IN_ZONE) && !defined(CONFIG_DYNAMIC_NUMA)
 	/*
-	 * page_zone is not safe to call in this context when
-	 * CONFIG_HOLES_IN_ZONE is set. This bug check is probably redundant
-	 * anyway as we check zone boundaries in move_freepages_block().
-	 * Remove at a later date when no bug reports exist related to
-	 * grouping pages by mobility
+	 * With CONFIG_HOLES_IN_ZONE set, this check is unsafe as start_page or
+	 * end_page may not be "valid".
+	 * With CONFIG_DYNAMIC_NUMA set, this condition is a valid occurrence &
+	 * not a bug.
+	 *
+	 * This bug check is probably redundant anyway as we check zone
+	 * boundaries in move_freepages_block().  Remove at a later date when
+	 * no bug reports exist related to grouping pages by mobility
 	 */
 	BUG_ON(page_zone(start_page) != page_zone(end_page));
 #endif
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 13/24] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (11 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 12/24] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 14/24] memory_hotplug: factor out locks in mem_online_cpu() Cody P Schafer
                     ` (11 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

Add a pageflag called "lookup_node" (PG_lookup_node / Page*LookupNode()).

It is used by dynamic NUMA to indicate that a page has a new node assignment
waiting for it.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/page-flags.h | 18 ++++++++++++++++++
 mm/page_alloc.c            |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675..e0241d8 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,9 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	PG_lookup_node,		/* need to do an extra lookup to determine actual node */
+#endif
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -275,6 +278,17 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
+/* Setting is unconditional, simply leads to an extra lookup.
+ * Clearing must be conditional so we don't miss any memlayout changes.
+ */
+#ifdef CONFIG_DYNAMIC_NUMA
+PAGEFLAG(LookupNode, lookup_node)
+TESTCLEARFLAG(LookupNode, lookup_node)
+#else
+PAGEFLAG_FALSE(LookupNode)
+TESTCLEARFLAG_FALSE(LookupNode)
+#endif
+
 u64 stable_page_flags(struct page *page);
 
 static inline int PageUptodate(struct page *page)
@@ -509,7 +523,11 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
  * Pages being prepped should not have any flags set.  It they are set,
  * there has been a kernel bug or struct page corruption.
  */
+#ifndef CONFIG_DYNAMIC_NUMA
 #define PAGE_FLAGS_CHECK_AT_PREP	((1 << NR_PAGEFLAGS) - 1)
+#else
+#define PAGE_FLAGS_CHECK_AT_PREP	(((1 << NR_PAGEFLAGS) - 1) & ~(1 << PG_lookup_node))
+#endif
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1 << PG_private | 1 << PG_private_2)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 274826c..5eeb547 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6449,6 +6449,9 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_DYNAMIC_NUMA
+	{1UL << PG_lookup_node,		"lookup_node"   },
+#endif
 };
 
 static void dump_page_flags(unsigned long flags)
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/24] memory_hotplug: factor out locks in mem_online_cpu()
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (12 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 13/24] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 15/24] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
                     ` (10 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

In dynamic NUMA, when onlining nodes, lock_memory_hotplug() is already
held when mem_online_node()'s functionality is needed.

Factor out the locking and create a new function __mem_online_node() to
allow reuse.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memory_hotplug.h |  1 +
 mm/memory_hotplug.c            | 29 ++++++++++++++++-------------
 2 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index cd393014..391824d 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -248,6 +248,7 @@ static inline int is_mem_section_removable(unsigned long pfn,
 static inline void try_offline_node(int nid) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
+extern int __mem_online_node(int nid);
 extern int mem_online_node(int nid);
 extern int add_memory(int nid, u64 start, u64 size);
 extern int arch_add_memory(int nid, u64 start, u64 size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index eae4a2a..7b0ab4f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1058,26 +1058,29 @@ static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
 	return;
 }
 
-
-/*
- * called by cpu_up() to online a node without onlined memory.
- */
-int mem_online_node(int nid)
+int __mem_online_node(int nid)
 {
-	pg_data_t	*pgdat;
-	int	ret;
+	pg_data_t *pgdat;
+	int ret;
 
-	lock_memory_hotplug();
 	pgdat = hotadd_new_pgdat(nid, 0);
-	if (!pgdat) {
-		ret = -ENOMEM;
-		goto out;
-	}
+	if (!pgdat)
+		return -ENOMEM;
+
 	node_set_online(nid);
 	ret = register_one_node(nid);
 	BUG_ON(ret);
+	return ret;
+}
 
-out:
+/*
+ * called by cpu_up() to online a node without onlined memory.
+ */
+int mem_online_node(int nid)
+{
+	int ret;
+	lock_memory_hotplug();
+	ret = __mem_online_node(nid);
 	unlock_memory_hotplug();
 	return ret;
 }
-- 
1.8.1.1
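
For illustration: the dnuma layout-commit path (later in this series) already
runs under lock_memory_hotplug(), so it must call __mem_online_node()
directly; calling mem_online_node() there would try to take the same
(non-recursive) lock again. A minimal sketch ('dnuma_ensure_node_online' is a
hypothetical name):

	/* caller already holds lock_memory_hotplug() */
	static void dnuma_ensure_node_online(int nid)
	{
		if (!node_online(nid))
			__mem_online_node(nid);
	}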

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 15/24] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (13 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 14/24] memory_hotplug: factor out locks in mem_online_cpu() Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 16/24] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
                     ` (9 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

On certain systems, the hypervisor can (and will) relocate physical
addresses as seen in a VM between real NUMA nodes. For example, IBM's
Power systems running under PHYP (their proprietary hypervisor) do this.

This change set introduces the infrastructure for tracking & dynamically
changing "memory layouts" (or "memlayouts"): the mapping between page
ranges & the actual backing NUMA node.

A memlayout is an rbtree which maps pfns (really, ranges of pfns) to a
node. This mapping (combined with the LookupNode pageflag) is used to
"transplant" pages (move them between nodes) when they are freed back
to the page allocator.

Additionally, when a new memlayout is committed, the currently free pages
that are now on the wrong zone's freelists are immediately transplanted.
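
As a rough usage sketch (illustrative only; the pfn ranges and node ids below
are made up), a firmware/hypervisor driver would build and apply a new layout
with the interfaces added here like this:

	struct memlayout *ml = memlayout_create(ML_DNUMA);

	if (!ml)
		return -ENOMEM;

	/* describe the new pfn -> node mapping */
	memlayout_new_range(ml, 0x00000, 0x0ffff, 0);
	memlayout_new_range(ml, 0x10000, 0x1ffff, 1);

	/* takes ownership of ml; per the description above, marks pages and
	 * transplants the currently-free ones */
	memlayout_commit(ml);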

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/dnuma.h     |  96 +++++++++++++
 include/linux/memlayout.h | 110 +++++++++++++++
 mm/Kconfig                |  19 +++
 mm/Makefile               |   1 +
 mm/dnuma.c                | 349 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout.c            | 238 +++++++++++++++++++++++++++++++
 6 files changed, 813 insertions(+)
 create mode 100644 include/linux/dnuma.h
 create mode 100644 include/linux/memlayout.h
 create mode 100644 mm/dnuma.c
 create mode 100644 mm/memlayout.c

diff --git a/include/linux/dnuma.h b/include/linux/dnuma.h
new file mode 100644
index 0000000..8f5cbf9
--- /dev/null
+++ b/include/linux/dnuma.h
@@ -0,0 +1,96 @@
+#ifndef LINUX_DNUMA_H_
+#define LINUX_DNUMA_H_
+
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/memlayout.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+
+#ifdef CONFIG_DYNAMIC_NUMA
+/* Must be called _before_ setting a new_ml to the pfn_to_node_map */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml);
+
+/* Must be called _after_ setting a new_ml to the pfn_to_node_map */
+void dnuma_move_free_pages(struct memlayout *new_ml);
+void dnuma_mark_page_range(struct memlayout *new_ml);
+
+static inline bool dnuma_is_active(void)
+{
+	struct memlayout *ml;
+	bool ret;
+
+	rcu_read_lock();
+	ml = rcu_dereference(pfn_to_node_map);
+	ret = ml && (ml->type != ML_INITIAL);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool dnuma_has_memlayout(void)
+{
+	return !!rcu_access_pointer(pfn_to_node_map);
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+	int new_nid, old_nid;
+
+	if (!TestClearPageLookupNode(page))
+		return NUMA_NO_NODE;
+
+	/* FIXME: this does rcu_lock, deref, unlock */
+	if (WARN_ON(!dnuma_is_active()))
+		return NUMA_NO_NODE;
+
+	/* FIXME: and so does this (rcu lock, deref, and unlock) */
+	new_nid = memlayout_pfn_to_nid(page_to_pfn(page));
+	old_nid = page_to_nid(page);
+
+	if (new_nid == NUMA_NO_NODE) {
+		pr_alert("dnuma: pfn %05lx has moved from node %d to a non-memlayout range.\n",
+				page_to_pfn(page), old_nid);
+		return NUMA_NO_NODE;
+	}
+
+	if (new_nid == old_nid)
+		return NUMA_NO_NODE;
+
+	if (WARN_ON(!zone_is_initialized(nid_zone(new_nid, page_zonenum(page)))))
+		return NUMA_NO_NODE;
+
+	return new_nid;
+}
+
+void dnuma_post_free_to_new_zone(struct page *page, int order);
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+				  struct zone *dest_zone,
+				  int dest_nid);
+
+#else /* !defined CONFIG_DYNAMIC_NUMA */
+
+static inline bool dnuma_is_active(void)
+{
+	return false;
+}
+
+static inline void dnuma_prior_free_to_new_zone(struct page *page, int order,
+						struct zone *dest_zone,
+						int dest_nid)
+{
+	BUG();
+}
+
+static inline void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+	BUG();
+}
+
+static inline int dnuma_page_needs_move(struct page *page)
+{
+	return NUMA_NO_NODE;
+}
+#endif /* !defined CONFIG_DYNAMIC_NUMA */
+
+#endif /* defined LINUX_DNUMA_H_ */
diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
new file mode 100644
index 0000000..eeb88e0
--- /dev/null
+++ b/include/linux/memlayout.h
@@ -0,0 +1,110 @@
+#ifndef LINUX_MEMLAYOUT_H_
+#define LINUX_MEMLAYOUT_H_
+
+#include <linux/memblock.h> /* __init_memblock */
+#include <linux/mm.h>       /* NODE_DATA, page_zonenum */
+#include <linux/mmzone.h>   /* pfn_to_nid */
+#include <linux/rbtree.h>
+#include <linux/types.h>    /* size_t */
+
+#ifdef CONFIG_DYNAMIC_NUMA
+# ifdef NODE_NOT_IN_PAGE_FLAGS
+#  error "CONFIG_DYNAMIC_NUMA requires the NODE is in page flags. Try freeing up some flags by decreasing the maximum number of NUMA nodes, or switch to sparsmem-vmemmap"
+# endif
+
+enum memlayout_type {
+	ML_INITIAL,
+	ML_DNUMA,
+	ML_NUM_TYPES
+};
+
+/*
+ * - rbtree of {node, start, end}.
+ * - assumes no 'ranges' overlap.
+ */
+struct rangemap_entry {
+	struct rb_node node;
+	unsigned long pfn_start;
+	/* @pfn_end: inclusive, not stored as a count to make the lookup
+	 *           faster
+	 */
+	unsigned long pfn_end;
+	int nid;
+};
+
+struct memlayout {
+	struct rb_root root;
+	enum memlayout_type type;
+
+	/*
+	 * When a memlayout is committed, 'cache' is accessed (the field is read
+	 * from & written to) by multiple tasks without additional locking
+	 * (other than the rcu locking for accessing the memlayout).
+	 *
+	 * Do not assume that it will not change. Use ACCESS_ONCE() to avoid
+	 * potential races.
+	 */
+	struct rangemap_entry *cache;
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+	unsigned seq;
+	struct dentry *d;
+#endif
+};
+
+extern __rcu struct memlayout *pfn_to_node_map;
+
+/* FIXME: overflow potential in completion check */
+#define ml_for_each_pfn_in_range(rme, pfn)	\
+	for (pfn = rme->pfn_start;		\
+	     pfn <= rme->pfn_end;		\
+	     pfn++)
+
+#define ml_for_each_range(ml, rme) \
+	for (rme = rb_entry(rb_first(&ml->root), typeof(*rme), node);	\
+	     &rme->node;						\
+	     rme = rb_entry(rb_next(&rme->node), typeof(*rme), node))
+
+#define rme_next(rme) rb_entry(rb_next(&rme->node), typeof(*rme), node)
+
+struct memlayout *memlayout_create(enum memlayout_type);
+void              memlayout_destroy(struct memlayout *ml);
+
+/* Callers accessing the same memlayout are assumed to be serialized */
+int memlayout_new_range(struct memlayout *ml,
+		unsigned long pfn_start, unsigned long pfn_end, int nid);
+
+/* only queries the memlayout tracking structures. */
+int memlayout_pfn_to_nid(unsigned long pfn);
+
+/* Put ranges added by memlayout_new_range() into use by
+ * memlayout_pfn_to_nid() and retire old ranges.
+ *
+ * No modifications to a memlayout can be made after it is committed.
+ *
+ * Sleeps via synchronize_rcu().
+ *
+ * memlayout takes ownership of ml; no further memlayout_new_range() calls
+ * should be issued against it.
+ */
+void memlayout_commit(struct memlayout *ml);
+
+/* Sets up an initial memlayout in early boot.
+ * A weak default which uses memblock is provided.
+ */
+void memlayout_global_init(void);
+
+#else /* ! defined(CONFIG_DYNAMIC_NUMA) */
+
+/* memlayout_new_range() & memlayout_commit() are purposefully omitted */
+
+static inline void memlayout_global_init(void)
+{}
+
+static inline int memlayout_pfn_to_nid(unsigned long pfn)
+{
+	return NUMA_NO_NODE;
+}
+#endif /* !defined(CONFIG_DYNAMIC_NUMA) */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 2c7aea7..7209ea5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -169,6 +169,25 @@ config MOVABLE_NODE
 config HAVE_BOOTMEM_INFO_NODE
 	def_bool n
 
+config DYNAMIC_NUMA
+	bool "Dynamic Numa: Allow NUMA layout to change after boot time"
+	depends on NUMA
+	depends on !DISCONTIGMEM
+	depends on MEMORY_HOTPLUG # locking + mem_online_node().
+	help
+	 Dynamic Numa (DNUMA) allows the movement of pages between NUMA nodes at
+	 run time.
+
+	 Typically, this is used on systems running under a hypervisor which
+	 may move the running VM based on the hypervisor's needs. On such a
+	 system, this config option enables Linux to update its knowledge of
+	 the memory layout.
+
+	 If the feature is not used but is enabled, there is a small amount of overhead (an
+	 additional pointer NULL check) added to all page frees.
+
+	 Choose Y if you have one of these systems (XXX: which ones?), otherwise choose N.
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
diff --git a/mm/Makefile b/mm/Makefile
index 3a46287..82fe7c9b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -58,3 +58,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
+obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
new file mode 100644
index 0000000..8bc81b2
--- /dev/null
+++ b/mm/dnuma.c
@@ -0,0 +1,349 @@
+#define pr_fmt(fmt) "dnuma: " fmt
+
+#include <linux/dnuma.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/memory.h>
+
+#include "internal.h"
+
+/* Issues due to pageflag_blocks attached to zones with Discontig Mem (&
+ * Flatmem??).
+ * - Need atomicity over the combination of committing a new memlayout and
+ *   removing the pages from free lists.
+ */
+
+/* XXX: "present pages" is guarded by lock_memory_hotplug(), not the spanlock.
+ * Need to change all users. */
+void adjust_zone_present_pages(struct zone *zone, long delta)
+{
+	unsigned long flags;
+	pgdat_resize_lock(zone->zone_pgdat, &flags);
+	zone_span_writelock(zone);
+
+	zone->managed_pages += delta;
+	zone->present_pages += delta;
+	zone->zone_pgdat->node_present_pages += delta;
+
+	zone_span_writeunlock(zone);
+	pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
+/* - must be called under lock_memory_hotplug() */
+/* TODO: avoid iterating over all PFNs. */
+void dnuma_online_required_nodes_and_zones(struct memlayout *new_ml)
+{
+	struct rangemap_entry *rme;
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn;
+		int nid = rme->nid;
+
+		if (!node_online(nid)) {
+			pr_info("onlining node %d [start]\n", nid);
+
+			/* XXX: somewhere in here do a memory online notify: we
+			 * aren't really onlining memory, but some code uses
+			 * memory online notifications to tell if new nodes
+			 * have been created.
+			 *
+			 * Also note that the notifyers expect to be able to do
+			 * allocations, ie we must allow for might_sleep() */
+			{
+				int ret;
+
+				/* memory_notify() expects:
+				 *	- to add pages at the same time
+				 *	- to add zones at the same time
+				 * We can do neither of these things.
+				 *
+				 * FIXME: Right now we just set the things
+				 * needed by the slub handler.
+				 */
+				struct memory_notify arg = {
+					.status_change_nid_normal = nid,
+				};
+
+				ret = memory_notify(MEM_GOING_ONLINE, &arg);
+				ret = notifier_to_errno(ret);
+				if (WARN_ON(ret)) {
+					/* XXX: other stuff will bug out if we
+					 * keep going, need to actually cancel
+					 * memlayout changes
+					 */
+					memory_notify(MEM_CANCEL_ONLINE, &arg);
+				}
+			}
+
+			/* Consult hotadd_new_pgdat() */
+			__mem_online_node(nid);
+			if (!node_online(nid)) {
+				pr_alert("node %d not online after onlining\n", nid);
+			}
+
+			pr_info("onlining node %d [complete]\n", nid);
+		}
+
+		/* Determine the zones required */
+		for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+			struct zone *zone;
+			if (!pfn_valid(pfn))
+				continue;
+
+			zone = nid_zone(nid, page_zonenum(pfn_to_page(pfn)));
+			/* XXX: we (dnuma paths) can handle this (there will
+			 * just be quite a few WARNS in the logs), but if we
+			 * are indicating error above, should we bail out here
+			 * as well? */
+			WARN_ON(ensure_zone_is_initialized(zone, 0, 0));
+		}
+	}
+}
+
+/*
+ * Cannot be folded into dnuma_move_free_pages() because unmarked pages
+ * could be freed back into the zone as dnuma_move_free_pages() was in
+ * the process of iterating over it.
+ */
+void dnuma_mark_page_range(struct memlayout *new_ml)
+{
+	struct rangemap_entry *rme;
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn;
+		for (pfn = rme->pfn_start; pfn <= rme->pfn_end; pfn++) {
+			if (!pfn_valid(pfn))
+				continue;
+			/* FIXME: should we be skipping compound / buddied
+			 *        pages? */
+			/* FIXME: if PageReserved(), can we just poke the nid
+			 *        directly? Should we? */
+			SetPageLookupNode(pfn_to_page(pfn));
+		}
+	}
+}
+
+#if 0
+static void node_states_set_node(int node, struct memory_notify *arg)
+{
+	if (arg->status_change_nid_normal >= 0)
+		node_set_state(node, N_NORMAL_MEMORY);
+
+	if (arg->status_change_nid_high >= 0)
+		node_set_state(node, N_HIGH_MEMORY);
+
+	node_set_state(node, N_MEMORY);
+}
+#endif
+
+void dnuma_post_free_to_new_zone(struct page *page, int order)
+{
+	adjust_zone_present_pages(page_zone(page), (1 << order));
+}
+
+static void dnuma_prior_return_to_new_zone(struct page *page, int order,
+					   struct zone *dest_zone,
+					   int dest_nid)
+{
+	int i;
+	unsigned long pfn = page_to_pfn(page);
+
+	grow_pgdat_and_zone(dest_zone, pfn, pfn + (1UL << order));
+
+	for (i = 0; i < 1UL << order; i++)
+		set_page_node(&page[i], dest_nid);
+}
+
+static void clear_lookup_node(struct page *page, int order)
+{
+	int i;
+	for (i = 0; i < 1UL << order; i++)
+		ClearPageLookupNode(&page[i]);
+}
+
+/* Does not assume it is called with any locking (but can be called with zone
+ * locks held, if needed) */
+void dnuma_prior_free_to_new_zone(struct page *page, int order,
+				  struct zone *dest_zone,
+				  int dest_nid)
+{
+	struct zone *curr_zone = page_zone(page);
+
+	/* XXX: Fiddle with 1st zone's locks */
+	adjust_zone_present_pages(curr_zone, -(1UL << order));
+
+	/* XXX: fiddles with 2nd zone's locks */
+	dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+}
+
+/* must be called with zone->lock held and memlayout's update_lock held */
+static void remove_free_pages_from_zone(struct zone *zone, struct page *page, int order)
+{
+	/* zone free stats */
+	zone->free_area[order].nr_free--;
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+	adjust_zone_present_pages(zone, -(1UL << order));
+
+	list_del(&page->lru);
+	__ClearPageBuddy(page);
+
+	/* Allowed because we hold the memlayout update_lock. */
+	clear_lookup_node(page, order);
+
+	/* XXX: can we shrink spanned_pages & start_pfn without too much work?
+	 *  - not crutial because having a
+	 *    larger-than-necessary span simply means that more
+	 *    PFNs are iterated over.
+	 *  - would be nice to be able to do this to cut down
+	 *    on overhead caused by PFN iterators.
+	 */
+}
+
+/*
+ * __ref is to allow (__meminit) zone_pcp_update(), which we will have because
+ * DYNAMIC_NUMA depends on MEMORY_HOTPLUG (and all the MEMORY_HOTPLUG comments
+ * indicate __meminit is allowed when they are enabled).
+ */
+static void __ref add_free_page_to_node(int dest_nid, struct page *page, int order)
+{
+	bool need_zonelists_rebuild = false;
+	struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+	VM_BUG_ON(!zone_is_initialized(dest_zone));
+
+	if (zone_is_empty(dest_zone))
+		need_zonelists_rebuild = true;
+
+	/* Add page to new zone */
+	dnuma_prior_return_to_new_zone(page, order, dest_zone, dest_nid);
+	return_pages_to_zone(page, order, dest_zone);
+	dnuma_post_free_to_new_zone(page, order);
+
+	/* XXX: fixme, there are other states that need fixing up */
+	if (!node_state(dest_nid, N_MEMORY))
+		node_set_state(dest_nid, N_MEMORY);
+
+	if (need_zonelists_rebuild) {
+		/* XXX: also does stop_machine() */
+		//zone_pcp_reset(zone);
+		/* XXX: why is this locking actually needed? */
+		mutex_lock(&zonelists_mutex);
+		//build_all_zonelists(NULL, NULL);
+		build_all_zonelists(NULL, dest_zone);
+		mutex_unlock(&zonelists_mutex);
+	} else
+		/* FIXME: does stop_machine() after EVERY SINGLE PAGE */
+		/* XXX: this is probably wrong. What does "update" actually
+		 * indicate in zone_pcp terms? */
+		zone_pcp_update(dest_zone);
+}
+
+static struct rangemap_entry *add_split_pages_to_zones(
+		struct rangemap_entry *first_rme,
+		struct page *page, int order)
+{
+	int i;
+	struct rangemap_entry *rme = first_rme;
+	for (i = 0; i < (1 << order); i++) {
+		unsigned long pfn = page_to_pfn(page);
+		while (pfn > rme->pfn_end) {
+			rme = rme_next(rme);
+		}
+
+		add_free_page_to_node(rme->nid, page + i, 0);
+	}
+
+	return rme;
+}
+
+void dnuma_move_free_pages(struct memlayout *new_ml)
+{
+	/* FIXME: how does this removal of pages from a zone interact with
+	 * migrate types? ISOLATION? */
+	struct rangemap_entry *rme;
+	ml_for_each_range(new_ml, rme) {
+		unsigned long pfn = rme->pfn_start;
+		int range_nid;
+		struct page *page;
+new_rme:
+		range_nid = rme->nid;
+
+		for (; pfn <= rme->pfn_end; pfn++) {
+			struct zone *zone;
+			int page_nid, order;
+			unsigned long flags, last_pfn, first_pfn;
+			if (!pfn_valid(pfn))
+				continue;
+
+			page = pfn_to_page(pfn);
+#if 0
+			/* XXX: can we ensure this is safe? Pages marked
+			 * reserved could be freed into the page allocator if
+			 * they mark memory areas that were allocated via
+			 * earlier allocators. */
+			if (PageReserved(page)) {
+				set_page_node(page, range_nid);
+				/* TODO: adjust spanned_pages & present_pages & start_pfn. */
+			}
+#endif
+
+			/* Currently allocated, will be fixed up when freed. */
+			if (!PageBuddy(page))
+				continue;
+
+			page_nid = page_to_nid(page);
+			if (page_nid == range_nid)
+				continue;
+
+			zone = page_zone(page);
+			spin_lock_irqsave(&zone->lock, flags);
+
+			/* Someone allocated it since we last checked. It will
+			 * be fixed up when it is freed */
+			if (!PageBuddy(page))
+				goto skip_unlock;
+
+			/* It has already been transplanted "somewhere";
+			 * that somewhere should be the proper zone. */
+			if (page_zone(page) != zone) {
+				VM_BUG_ON(zone != nid_zone(range_nid, page_zonenum(page)));
+				goto skip_unlock;
+			}
+
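+			/* PageBuddy is set: this should be the head of a free
+			 * block of 1 << order pages; compute its PFN span */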
+			order = page_order(page);
+			first_pfn = pfn & ~((1 << order) - 1);
+			last_pfn  = pfn |  ((1 << order) - 1);
+			if (WARN(pfn != first_pfn, "pfn %05lx is not first_pfn %05lx\n",
+							pfn, first_pfn)) {
+				pfn = last_pfn;
+				goto skip_unlock;
+			}
+
+			if (last_pfn > rme->pfn_end) {
+				/* this higher order page doesn't fit into the
+				 * current range even though it starts there.
+				 */
+				pr_warn("high-order page from pfn %05lx to %05lx extends beyond end of rme {%05lx - %05lx}:%d\n",
+						first_pfn, last_pfn,
+						rme->pfn_start, rme->pfn_end,
+						rme->nid);
+
+				remove_free_pages_from_zone(zone, page, order);
+				spin_unlock_irqrestore(&zone->lock, flags);
+
+				rme = add_split_pages_to_zones(rme, page, order);
+				pfn = last_pfn + 1;
+				goto new_rme;
+			}
+
+			remove_free_pages_from_zone(zone, page, order);
+			spin_unlock_irqrestore(&zone->lock, flags);
+
+			add_free_page_to_node(range_nid, page, order);
+			pfn = last_pfn;
+			continue;
+skip_unlock:
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+	}
+}
diff --git a/mm/memlayout.c b/mm/memlayout.c
new file mode 100644
index 0000000..69222ac
--- /dev/null
+++ b/mm/memlayout.c
@@ -0,0 +1,238 @@
+/*
+ * memlayout - provides a mapping of PFN ranges to nodes with the requirements
+ * that looking up a node from a PFN is fast and that changes to the mapping
+ * occur relatively infrequently.
+ *
+ */
+#define pr_fmt(fmt) "memlayout: " fmt
+
+#include <linux/dnuma.h>
+#include <linux/export.h>
+#include <linux/memblock.h>
+#include <linux/printk.h>
+#include <linux/rbtree.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
+
+/* protected by memlayout_lock */
+__rcu struct memlayout *pfn_to_node_map;
+DEFINE_MUTEX(memlayout_lock);
+
+static void free_rme_tree(struct rb_root *root)
+{
+	struct rangemap_entry *pos, *n;
+	rbtree_postorder_for_each_entry_safe(pos, n, root, node) {
+		kfree(pos);
+	}
+}
+
+static void ml_destroy_mem(struct memlayout *ml)
+{
+	if (!ml)
+		return;
+	free_rme_tree(&ml->root);
+	kfree(ml);
+}
+
+static int find_insertion_point(struct memlayout *ml, unsigned long pfn_start,
+		unsigned long pfn_end, int nid, struct rb_node ***o_new,
+		struct rb_node **o_parent)
+{
+	struct rb_node **new = &ml->root.rb_node, *parent = NULL;
+	struct rangemap_entry *rme;
+	pr_debug("adding range: {%lX-%lX}:%d\n", pfn_start, pfn_end, nid);
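+	/* walk down the tree: a new range must fall entirely to the left or
+	 * right of every existing range; overlapping ranges are rejected */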
+	while (*new) {
+		rme = rb_entry(*new, typeof(*rme), node);
+
+		parent = *new;
+		if (pfn_end < rme->pfn_start && pfn_start < rme->pfn_end)
+			new = &((*new)->rb_left);
+		else if (pfn_start > rme->pfn_end && pfn_end > rme->pfn_end)
+			new = &((*new)->rb_right);
+		else {
+			/* an embedded region, need to use an interval or
+			 * sequence tree. */
+			pr_warn("tried to embed {%lX,%lX}:%d inside {%lX-%lX}:%d\n",
+				 pfn_start, pfn_end, nid,
+				 rme->pfn_start, rme->pfn_end, rme->nid);
+			return 1;
+		}
+	}
+
+	*o_new = new;
+	*o_parent = parent;
+	return 0;
+}
+
+int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,
+		unsigned long pfn_end, int nid)
+{
+	struct rb_node **new, *parent;
+	struct rangemap_entry *rme;
+
+	if (WARN_ON(nid < 0))
+		return -EINVAL;
+	if (WARN_ON(nid >= MAX_NUMNODES))
+		return -EINVAL;
+
+	if (find_insertion_point(ml, pfn_start, pfn_end, nid, &new, &parent))
+		return 1;
+
+	rme = kmalloc(sizeof(*rme), GFP_KERNEL);
+	if (!rme)
+		return -ENOMEM;
+
+	rme->pfn_start = pfn_start;
+	rme->pfn_end = pfn_end;
+	rme->nid = nid;
+
+	rb_link_node(&rme->node, parent, new);
+	rb_insert_color(&rme->node, &ml->root);
+	return 0;
+}
+
+static inline bool rme_bounds_pfn(struct rangemap_entry *rme, unsigned long pfn)
+{
+	return rme->pfn_start <= pfn && pfn <= rme->pfn_end;
+}
+
+int memlayout_pfn_to_nid(unsigned long pfn)
+{
+	struct rb_node *node;
+	struct memlayout *ml;
+	struct rangemap_entry *rme;
+	rcu_read_lock();
+	ml = rcu_dereference(pfn_to_node_map);
+	if (!ml || (ml->type == ML_INITIAL))
+		goto out;
+
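+	/* fast path: try the most recently matched range first */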
+	rme = ACCESS_ONCE(ml->cache);
+	if (rme && rme_bounds_pfn(rme, pfn)) {
+		rcu_read_unlock();
+		return rme->nid;
+	}
+
+	node = ml->root.rb_node;
+	while (node) {
+		struct rangemap_entry *rme = rb_entry(node, typeof(*rme), node);
+		bool greater_than_start = rme->pfn_start <= pfn;
+		bool less_than_end = pfn <= rme->pfn_end;
+
+		if (greater_than_start && !less_than_end)
+			node = node->rb_right;
+		else if (less_than_end && !greater_than_start)
+			node = node->rb_left;
+		else {
+			/* greater_than_start && less_than_end.
+			 *  the case (!greater_than_start  && !less_than_end)
+			 *  is impossible */
+			int nid = rme->nid;
+			ACCESS_ONCE(ml->cache) = rme;
+			rcu_read_unlock();
+			return nid;
+		}
+	}
+
+out:
+	rcu_read_unlock();
+	return NUMA_NO_NODE;
+}
+
+void memlayout_destroy(struct memlayout *ml)
+{
+	ml_destroy_mem(ml);
+}
+
+struct memlayout *memlayout_create(enum memlayout_type type)
+{
+	struct memlayout *ml;
+
+	if (WARN_ON(type < 0 || type >= ML_NUM_TYPES))
+		return NULL;
+
+	ml = kmalloc(sizeof(*ml), GFP_KERNEL);
+	if (!ml)
+		return NULL;
+
+	ml->root = RB_ROOT;
+	ml->type = type;
+	ml->cache = NULL;
+
+	return ml;
+}
+
+void memlayout_commit(struct memlayout *ml)
+{
+	struct memlayout *old_ml;
+
+	if (ml->type == ML_INITIAL) {
+		if (WARN(dnuma_has_memlayout(), "memlayout marked first is not first, ignoring.\n")) {
+			memlayout_destroy(ml);
+			return;
+		}
+
+		mutex_lock(&memlayout_lock);
+		rcu_assign_pointer(pfn_to_node_map, ml);
+		mutex_unlock(&memlayout_lock);
+		return;
+	}
+
+	lock_memory_hotplug();
+	dnuma_online_required_nodes_and_zones(ml);
+	unlock_memory_hotplug();
+
+	mutex_lock(&memlayout_lock);
+	old_ml = rcu_dereference_protected(pfn_to_node_map,
+			mutex_is_locked(&memlayout_lock));
+
+	rcu_assign_pointer(pfn_to_node_map, ml);
+
+	synchronize_rcu();
+	memlayout_destroy(old_ml);
+
+	/* Must be called only after the new value of pfn_to_node_map has
+	 * propagated to all tasks, otherwise some pages may look up the old
+	 * pfn_to_node_map on free & not transplant themselves to their new-new
+	 * node. */
+	dnuma_mark_page_range(ml);
+
+	/* Do this after the free path is set up so that pages are freed into
+	 * their "new" zones; once this completes, no free pages remain in the
+	 * wrong zone. */
+	dnuma_move_free_pages(ml);
+
+	/* All new _non-pcp_ page allocations now match the memlayout */
+	drain_all_pages();
+	/* All new page allocations now match the memlayout */
+
+	mutex_unlock(&memlayout_lock);
+}
+
+/*
+ * The default memlayout global initializer, using memblock to determine
+ * affinities.
+ * requires: slab_is_available() && memblock not (yet) freed.
+ * sleeps: definitely: memlayout_commit() -> synchronize_rcu()
+ *	   potentially: kmalloc()
+ */
+__weak __meminit
+void memlayout_global_init(void)
+{
+	int i, nid, errs = 0;
+	unsigned long start, end;
+	struct memlayout *ml = memlayout_create(ML_INITIAL);
+	if (WARN_ON(!ml))
+		return;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+		int r = memlayout_new_range(ml, start, end - 1, nid);
+		if (r) {
+			pr_err("failed to add range [%05lx, %05lx] in node %d to mapping\n",
+					start, end, nid);
+			errs++;
+		} else
+			pr_devel("added range [%05lx, %05lx] in node %d\n",
+					start, end, nid);
+	}
+
+	memlayout_commit(ml);
+}
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 16/24] mm: memlayout+dnuma: add debugfs interface
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (14 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 15/24] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 17/24] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
                     ` (8 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

Add a debugfs interface to dnuma/memlayout. It keeps track of a
configurable backlog of memory layouts, provides some statistics on
dnuma-moved pages & pfn-lookup cache performance, and allows setting a
new global memlayout.

TODO: split out the statistics, backlog, & write interfaces from each other.
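
A minimal usage sketch of the write interface (assuming
CONFIG_DNUMA_DEBUGFS_WRITE=y and debugfs mounted at /sys/kernel/debug;
the PFN range and node id below are arbitrary examples):

	# writing 'node' adds the staged [start, end] PFN range to that node
	echo 0x0    > /sys/kernel/debug/memlayout/start
	echo 0x7fff > /sys/kernel/debug/memlayout/end
	echo 1      > /sys/kernel/debug/memlayout/node
	# repeat start/end/node for more ranges, then apply the new layout
	echo 1      > /sys/kernel/debug/memlayout/commit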

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 include/linux/memlayout.h |   1 +
 mm/Kconfig                |  25 ++++
 mm/Makefile               |   1 +
 mm/dnuma.c                |   2 +
 mm/memlayout-debugfs.c    | 323 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/memlayout-debugfs.h    |  35 +++++
 mm/memlayout.c            |  17 ++-
 7 files changed, 402 insertions(+), 2 deletions(-)
 create mode 100644 mm/memlayout-debugfs.c
 create mode 100644 mm/memlayout-debugfs.h

diff --git a/include/linux/memlayout.h b/include/linux/memlayout.h
index eeb88e0..499ab4d 100644
--- a/include/linux/memlayout.h
+++ b/include/linux/memlayout.h
@@ -53,6 +53,7 @@ struct memlayout {
 };
 
 extern __rcu struct memlayout *pfn_to_node_map;
+extern struct mutex memlayout_lock; /* update-side lock */
 
 /* FIXME: overflow potential in completion check */
 #define ml_for_each_pfn_in_range(rme, pfn)	\
diff --git a/mm/Kconfig b/mm/Kconfig
index 7209ea5..5f24e6a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -188,6 +188,31 @@ config DYNAMIC_NUMA
 
 	 Choose Y if you have one of these systems (XXX: which ones?), otherwise choose N.
 
+config DNUMA_DEBUGFS
+	bool "Export DNUMA & memlayout internals via debugfs"
+	depends on DYNAMIC_NUMA
+	help
+	 Provides access to memlayout & dnuma internals via debugfs: the
+	 current & past memory layouts, statistics on pages moved between
+	 nodes, and pfn-lookup cache hit/miss counts.
+
+config DNUMA_BACKLOG
+	int "Number of old memlayouts to keep (0 = None, -1 = unlimited)"
+	depends on DNUMA_DEBUGFS
+	help
+	 Allows access to old memory layouts & statistics in debugfs.
+
+	 Each memlayout will consume some memory, and when set to -1
+	 (unlimited), this can result in unbounded kernel memory use.
+
+config DNUMA_DEBUGFS_WRITE
+	bool "Change NUMA layout via debugfs"
+	depends on DNUMA_DEBUGFS
+	help
+	 Enable the use of <debugfs>/memlayout/{start,end,node,commit}
+
+	 Write a PFN to 'start' & 'end', then a node id to 'node'.
+	 Repeat this until you are satisfied with your memory layout, then
+	 write '1' to 'commit'.
+
 # eventually, we can have this option just 'select SPARSEMEM'
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
diff --git a/mm/Makefile b/mm/Makefile
index 82fe7c9b..b07926c 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -59,3 +59,4 @@ obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_DYNAMIC_NUMA) += dnuma.o memlayout.o
+obj-$(CONFIG_DNUMA_DEBUGFS) += memlayout-debugfs.o
diff --git a/mm/dnuma.c b/mm/dnuma.c
index 8bc81b2..a0139f6 100644
--- a/mm/dnuma.c
+++ b/mm/dnuma.c
@@ -9,6 +9,7 @@
 #include <linux/memory.h>
 
 #include "internal.h"
+#include "memlayout-debugfs.h"
 
 /* Issues due to pageflag_blocks attached to zones with Discontig Mem (&
  * Flatmem??).
@@ -140,6 +141,7 @@ static void node_states_set_node(int node, struct memory_notify *arg)
 void dnuma_post_free_to_new_zone(struct page *page, int order)
 {
 	adjust_zone_present_pages(page_zone(page), (1 << order));
+	ml_stat_count_moved_pages(order);
 }
 
 static void dnuma_prior_return_to_new_zone(struct page *page, int order,
diff --git a/mm/memlayout-debugfs.c b/mm/memlayout-debugfs.c
new file mode 100644
index 0000000..93574d5
--- /dev/null
+++ b/mm/memlayout-debugfs.c
@@ -0,0 +1,323 @@
+#include <linux/debugfs.h>
+
+#include <linux/slab.h> /* kmalloc */
+#include <linux/module.h> /* THIS_MODULE, needed for DEFINE_SIMPLE_ATTR */
+
+#include "memlayout-debugfs.h"
+
+#if CONFIG_DNUMA_BACKLOG > 0
+/* Fixed size backlog */
+#include <linux/kfifo.h>
+DEFINE_KFIFO(ml_backlog, struct memlayout *, CONFIG_DNUMA_BACKLOG);
+void ml_backlog_feed(struct memlayout *ml)
+{
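+	/* if the backlog is full, evict & destroy the oldest layout */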
+	if (kfifo_is_full(&ml_backlog)) {
+		struct memlayout *old_ml;
+		kfifo_get(&ml_backlog, &old_ml);
+		memlayout_destroy(old_ml);
+	}
+
+	kfifo_put(&ml_backlog, &ml);
+}
+#elif CONFIG_DNUMA_BACKLOG < 0
+/* Unlimited backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+	/* TODO: we never use the rme_tree, so we could use ml_destroy_mem() to
+	 * save space. */
+}
+#else /* CONFIG_DNUMA_BACKLOG == 0 */
+/* No backlog */
+void ml_backlog_feed(struct memlayout *ml)
+{
+	memlayout_destroy(ml);
+}
+#endif
+
+static atomic64_t dnuma_moved_page_ct;
+void ml_stat_count_moved_pages(int order)
+{
+	atomic64_add(1 << order, &dnuma_moved_page_ct);
+}
+
+static atomic_t ml_seq = ATOMIC_INIT(0);
+static struct dentry *root_dentry, *current_dentry;
+#define ML_LAYOUT_NAME_SZ \
+	((size_t)(DIV_ROUND_UP(sizeof(unsigned) * 8, 3) + 1 + strlen("layout.")))
+#define ML_REGION_NAME_SZ ((size_t)(2 * BITS_PER_LONG / 4 + 2))
+
+static void ml_layout_name(struct memlayout *ml, char *name)
+{
+	sprintf(name, "layout.%u", ml->seq);
+}
+
+static int dfs_range_get(void *data, u64 *val)
+{
+	*val = (unsigned long)data;
+	return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(range_fops, dfs_range_get, NULL, "%lld\n");
+
+static void _ml_dbgfs_create_range(struct dentry *base,
+		struct rangemap_entry *rme, char *name)
+{
+	struct dentry *rd;
+	sprintf(name, "%05lx-%05lx", rme->pfn_start, rme->pfn_end);
+	rd = debugfs_create_file(name, 0400, base,
+				(void *)(unsigned long)rme->nid, &range_fops);
+	if (!rd)
+		pr_devel("debugfs: failed to create {%lX-%lX}:%d\n",
+				rme->pfn_start, rme->pfn_end, rme->nid);
+	else
+		pr_devel("debugfs: created {%lX-%lX}:%d\n",
+				rme->pfn_start, rme->pfn_end, rme->nid);
+}
+
+/* Must be called with memlayout_lock held */
+static void _ml_dbgfs_set_current(struct memlayout *ml, char *name)
+{
+	ml_layout_name(ml, name);
+
+	if (current_dentry)
+		debugfs_remove(current_dentry);
+	current_dentry = debugfs_create_symlink("current", root_dentry, name);
+}
+
+static void ml_dbgfs_create_layout_assume_root(struct memlayout *ml)
+{
+	char name[ML_LAYOUT_NAME_SZ];
+	ml_layout_name(ml, name);
+	WARN_ON(!root_dentry);
+	ml->d = debugfs_create_dir(name, root_dentry);
+	WARN_ON(!ml->d);
+}
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+
+#define DEFINE_DEBUGFS_GET(___type)					\
+	static int debugfs_## ___type ## _get(void *data, u64 *val)	\
+	{								\
+		*val = *(___type *)data;				\
+		return 0;						\
+	}
+
+DEFINE_DEBUGFS_GET(u32);
+DEFINE_DEBUGFS_GET(u8);
+
+#define DEFINE_WATCHED_ATTR(___type, ___var)			\
+	static int ___var ## _watch_set(void *data, u64 val)	\
+	{							\
+		___type old_val = *(___type *)data;		\
+		int ret = ___var ## _watch(old_val, val);	\
+		if (!ret)					\
+			*(___type *)data = val;			\
+		return ret;					\
+	}							\
+	DEFINE_SIMPLE_ATTRIBUTE(___var ## _fops,		\
+			debugfs_ ## ___type ## _get,		\
+			___var ## _watch_set, "%llu\n");
+
+static u64 dnuma_user_start;
+static u64 dnuma_user_end;
+static u32 dnuma_user_node; /* XXX: I don't care about this var, remove? */
+static u8  dnuma_user_commit; /* XXX: don't care about this one either */
+static struct memlayout *user_ml;
+static DEFINE_MUTEX(dnuma_user_lock);
+static int dnuma_user_node_watch(u32 old_val, u32 new_val)
+{
+	int ret = 0;
+	mutex_lock(&dnuma_user_lock);
+	if (!user_ml)
+		user_ml = memlayout_create(ML_DNUMA);
+
+	if (WARN_ON(!user_ml)) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (new_val >= MAX_NUMNODES) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (dnuma_user_start > dnuma_user_end) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = memlayout_new_range(user_ml, dnuma_user_start, dnuma_user_end,
+				  new_val);
+
+	if (!ret) {
+		dnuma_user_start = 0;
+		dnuma_user_end = 0;
+	}
+out:
+	mutex_unlock(&dnuma_user_lock);
+	return ret;
+}
+
+static int dnuma_user_commit_watch(u8 old_val, u8 new_val)
+{
+	mutex_lock(&dnuma_user_lock);
+	if (user_ml)
+		memlayout_commit(user_ml);
+	user_ml = NULL;
+	mutex_unlock(&dnuma_user_lock);
+	return 0;
+}
+
+DEFINE_WATCHED_ATTR(u32, dnuma_user_node);
+DEFINE_WATCHED_ATTR(u8, dnuma_user_commit);
+# endif /* defined(CONFIG_DNUMA_DEBUGFS_WRITE) */
+
+/* create the entire current memlayout.
+ * only used for the layout which exists prior to debugfs initialization
+ */
+static void ml_dbgfs_create_initial_layout(void)
+{
+	struct rangemap_entry *rme;
+	char name[max(ML_REGION_NAME_SZ, ML_LAYOUT_NAME_SZ)];
+	struct memlayout *old_ml, *new_ml;
+
+	new_ml = kmalloc(sizeof(*new_ml), GFP_KERNEL);
+	if (WARN(!new_ml, "memlayout allocation failed\n"))
+		return;
+
+	mutex_lock(&memlayout_lock);
+
+	old_ml = rcu_dereference_protected(pfn_to_node_map,
+			mutex_is_locked(&memlayout_lock));
+	if (WARN_ON(!old_ml))
+		goto e_out;
+	*new_ml = *old_ml;
+
+	if (WARN_ON(new_ml->d))
+		goto e_out;
+
+	/* this assumption holds as ml_dbgfs_create_initial_layout() (this
+	 * function) is only called by ml_dbgfs_create_root() */
+	ml_dbgfs_create_layout_assume_root(new_ml);
+	if (!new_ml->d)
+		goto e_out;
+
+	ml_for_each_range(new_ml, rme) {
+		_ml_dbgfs_create_range(new_ml->d, rme, name);
+	}
+
+	_ml_dbgfs_set_current(new_ml, name);
+	rcu_assign_pointer(pfn_to_node_map, new_ml);
+	mutex_unlock(&memlayout_lock);
+
+	synchronize_rcu();
+	kfree(old_ml);
+	return;
+e_out:
+	mutex_unlock(&memlayout_lock);
+	kfree(new_ml);
+}
+
+static atomic64_t ml_cache_hits;
+static atomic64_t ml_cache_misses;
+
+void ml_stat_cache_miss(void)
+{
+	atomic64_inc(&ml_cache_misses);
+}
+
+void ml_stat_cache_hit(void)
+{
+	atomic64_inc(&ml_cache_hits);
+}
+
+/* returns 0 if root_dentry has been created */
+static int ml_dbgfs_create_root(void)
+{
+	if (root_dentry)
+		return 0;
+
+	if (!debugfs_initialized()) {
+		pr_devel("debugfs not registered or disabled.\n");
+		return -EINVAL;
+	}
+
+	root_dentry = debugfs_create_dir("memlayout", NULL);
+	if (!root_dentry) {
+		pr_devel("root dir creation failed\n");
+		return -EINVAL;
+	}
+
+	/* TODO: place in a different dir? (to keep memlayout & dnuma separate)
+	 */
+	/* XXX: Horrible atomic64 hack is horrible. */
+	debugfs_create_u64("moved-pages", 0400, root_dentry,
+			   &dnuma_moved_page_ct.counter);
+	debugfs_create_u64("pfn-lookup-cache-misses", 0400, root_dentry,
+			   &ml_cache_misses.counter);
+	debugfs_create_u64("pfn-lookup-cache-hits", 0400, root_dentry,
+			   &ml_cache_hits.counter);
+
+# if defined(CONFIG_DNUMA_DEBUGFS_WRITE)
+	/* Set node last: on write, it adds the range. */
+	debugfs_create_x64("start", 0600, root_dentry, &dnuma_user_start);
+	debugfs_create_x64("end",   0600, root_dentry, &dnuma_user_end);
+	debugfs_create_file("node",  0200, root_dentry,
+			&dnuma_user_node, &dnuma_user_node_fops);
+	debugfs_create_file("commit",  0200, root_dentry,
+			&dnuma_user_commit, &dnuma_user_commit_fops);
+# endif
+
+	/* uses root_dentry */
+	ml_dbgfs_create_initial_layout();
+
+	return 0;
+}
+
+static void ml_dbgfs_create_layout(struct memlayout *ml)
+{
+	if (ml_dbgfs_create_root()) {
+		ml->d = NULL;
+		return;
+	}
+	ml_dbgfs_create_layout_assume_root(ml);
+}
+
+static int ml_dbgfs_init_root(void)
+{
+	ml_dbgfs_create_root();
+	return 0;
+}
+
+void ml_dbgfs_init(struct memlayout *ml)
+{
+	ml->seq = atomic_inc_return(&ml_seq) - 1;
+	ml_dbgfs_create_layout(ml);
+}
+
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme)
+{
+	char name[ML_REGION_NAME_SZ];
+	if (ml->d)
+		_ml_dbgfs_create_range(ml->d, rme, name);
+}
+
+void ml_dbgfs_set_current(struct memlayout *ml)
+{
+	char name[ML_LAYOUT_NAME_SZ];
+	_ml_dbgfs_set_current(ml, name);
+}
+
+void ml_destroy_dbgfs(struct memlayout *ml)
+{
+	if (ml && ml->d)
+		debugfs_remove_recursive(ml->d);
+}
+
+static void __exit ml_dbgfs_exit(void)
+{
+	debugfs_remove_recursive(root_dentry);
+	root_dentry = NULL;
+}
+
+module_init(ml_dbgfs_init_root);
+module_exit(ml_dbgfs_exit);
diff --git a/mm/memlayout-debugfs.h b/mm/memlayout-debugfs.h
new file mode 100644
index 0000000..d8895dd
--- /dev/null
+++ b/mm/memlayout-debugfs.h
@@ -0,0 +1,35 @@
+#ifndef LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+#define LINUX_MM_MEMLAYOUT_DEBUGFS_H_
+
+#include <linux/memlayout.h>
+
+void ml_backlog_feed(struct memlayout *ml);
+
+#ifdef CONFIG_DNUMA_DEBUGFS
+void ml_stat_count_moved_pages(int order);
+void ml_stat_cache_hit(void);
+void ml_stat_cache_miss(void);
+void ml_dbgfs_init(struct memlayout *ml);
+void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme);
+void ml_destroy_dbgfs(struct memlayout *ml);
+void ml_dbgfs_set_current(struct memlayout *ml);
+
+#else /* !defined(CONFIG_DNUMA_DEBUGFS) */
+static inline void ml_stat_count_moved_pages(int order)
+{}
+static inline void ml_stat_cache_hit(void)
+{}
+static inline void ml_stat_cache_miss(void)
+{}
+
+static inline void ml_dbgfs_init(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_create_range(struct memlayout *ml, struct rangemap_entry *rme)
+{}
+static inline void ml_destroy_dbgfs(struct memlayout *ml)
+{}
+static inline void ml_dbgfs_set_current(struct memlayout *ml)
+{}
+#endif
+
+#endif
diff --git a/mm/memlayout.c b/mm/memlayout.c
index 69222ac..5fef032 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -14,6 +14,8 @@
 #include <linux/rcupdate.h>
 #include <linux/slab.h>
 
+#include "memlayout-debugfs.h"
+
 /* protected by memlayout_lock */
 __rcu struct memlayout *pfn_to_node_map;
 DEFINE_MUTEX(memlayout_lock);
@@ -88,6 +90,8 @@ int memlayout_new_range(struct memlayout *ml, unsigned long pfn_start,
 
 	rb_link_node(&rme->node, parent, new);
 	rb_insert_color(&rme->node, &ml->root);
+
+	ml_dbgfs_create_range(ml, rme);
 	return 0;
 }
 
@@ -109,9 +113,12 @@ int memlayout_pfn_to_nid(unsigned long pfn)
 	rme = ACCESS_ONCE(ml->cache);
 	if (rme && rme_bounds_pfn(rme, pfn)) {
 		rcu_read_unlock();
+		ml_stat_cache_hit();
 		return rme->nid;
 	}
 
+	ml_stat_cache_miss();
+
 	node = ml->root.rb_node;
 	while (node) {
 		struct rangemap_entry *rme = rb_entry(node, typeof(*rme), node);
@@ -140,6 +147,7 @@ out:
 
 void memlayout_destroy(struct memlayout *ml)
 {
+	ml_destroy_dbgfs(ml);
 	ml_destroy_mem(ml);
 }
 
@@ -158,6 +166,7 @@ struct memlayout *memlayout_create(enum memlayout_type type)
 	ml->type = type;
 	ml->cache = NULL;
 
+	ml_dbgfs_init(ml);
 	return ml;
 }
 
@@ -167,11 +176,12 @@ void memlayout_commit(struct memlayout *ml)
 
 	if (ml->type == ML_INITIAL) {
 		if (WARN(dnuma_has_memlayout(), "memlayout marked first is not first, ignoring.\n")) {
-			memlayout_destroy(ml);
+			ml_backlog_feed(ml);
 			return;
 		}
 
 		mutex_lock(&memlayout_lock);
+		ml_dbgfs_set_current(ml);
 		rcu_assign_pointer(pfn_to_node_map, ml);
 		mutex_unlock(&memlayout_lock);
 		return;
@@ -182,13 +192,16 @@ void memlayout_commit(struct memlayout *ml)
 	unlock_memory_hotplug();
 
 	mutex_lock(&memlayout_lock);
+
+	ml_dbgfs_set_current(ml);
+
 	old_ml = rcu_dereference_protected(pfn_to_node_map,
 			mutex_is_locked(&memlayout_lock));
 
 	rcu_assign_pointer(pfn_to_node_map, ml);
 
 	synchronize_rcu();
-	memlayout_destroy(old_ml);
+	ml_backlog_feed(old_ml);
 
 	/* Must be called only after the new value for pfn_to_node_map has
 	 * propogated to all tasks, otherwise some pages may lookup the old
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 17/24] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok()
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (15 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 16/24] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 18/24] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
                     ` (7 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

__free_pages_ok() handles higher-order (order != 0) pages. The
transplant hook is added here because this is where the struct zone to
free to is decided.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5eeb547..5c7930f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
 #include <linux/migrate.h>
 #include <linux/page-debug-flags.h>
 #include <linux/sched/rt.h>
+#include <linux/dnuma.h>
 
 #include <asm/tlbflush.h>
 #include <asm/div64.h>
@@ -739,6 +740,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 {
 	unsigned long flags;
 	int migratetype;
+	int dest_nid = dnuma_page_needs_move(page);
+	struct zone *zone;
+
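+	/* if the memlayout says this page now belongs to a different node,
+	 * pick the matching zone on that node as the free target */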
+	if (dest_nid != NUMA_NO_NODE)
+		zone = nid_zone(dest_nid, page_zonenum(page));
+	else
+		zone = page_zone(page);
 
 	if (!free_pages_prepare(page, order))
 		return;
@@ -747,7 +755,11 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	__count_vm_events(PGFREE, 1 << order);
 	migratetype = get_pageblock_migratetype(page);
 	set_freepage_migratetype(page, migratetype);
-	free_one_page(page_zone(page), page, order, migratetype);
+	if (dest_nid != NUMA_NO_NODE)
+		dnuma_prior_free_to_new_zone(page, order, zone, dest_nid);
+	free_one_page(zone, page, order, migratetype);
+	if (dest_nid != NUMA_NO_NODE)
+		dnuma_post_free_to_new_zone(page, order);
 	local_irq_restore(flags);
 }
 
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 18/24] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page()
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (16 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 17/24] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 19/24] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
                     ` (6 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

free_hot_cold_page() is used for order == 0 pages, and is where the
page's zone is decided.

In the normal case, these pages are freed to the per-cpu lists. When a
page needs transplanting (ie: the actual node it belongs to has changed,
and it needs to be moved to another zone), the pcp lists are skipped &
the page is freed via free_one_page().

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5c7930f..5579eda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1364,6 +1364,7 @@ void mark_free_pages(struct zone *zone)
  */
 void free_hot_cold_page(struct page *page, int cold)
 {
+	int dest_nid;
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
@@ -1377,6 +1378,15 @@ void free_hot_cold_page(struct page *page, int cold)
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
 
+	dest_nid = dnuma_page_needs_move(page);
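+	/* a changed node assignment bypasses the pcp lists: free the page
+	 * directly into its new zone instead */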
+	if (dest_nid != NUMA_NO_NODE) {
+		struct zone *dest_zone = nid_zone(dest_nid, page_zonenum(page));
+		dnuma_prior_free_to_new_zone(page, 0, dest_zone, dest_nid);
+		free_one_page(dest_zone, page, 0, migratetype);
+		dnuma_post_free_to_new_zone(page, 0);
+		goto out;
+	}
+
 	/*
 	 * We only track unmovable, reclaimable and movable on pcp lists.
 	 * Free ISOLATE pages back to the allocator because they are being
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 19/24] page_alloc: transplant pages that are being flushed from the per-cpu lists
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (17 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 18/24] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:26   ` [PATCH 20/24] x86: memlayout: add an arch-specific initial memlayout setter Cody P Schafer
                     ` (5 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

In free_pcppages_bulk(), check if a page needs to be moved to a new
node/zone & then perform the transplant (in a slightly deferred manner).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/page_alloc.c | 36 +++++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5579eda..11947c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -650,13 +650,14 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	int migratetype = 0;
 	int batch_free = 0;
 	int to_free = count;
+	struct page *pos, *page;
+	LIST_HEAD(need_move);
 
 	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 
 	while (to_free) {
-		struct page *page;
 		struct list_head *list;
 
 		/*
@@ -679,11 +680,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 
 		do {
 			int mt;	/* migratetype of the to-be-freed page */
+			int dest_nid;
 
 			page = list_entry(list->prev, struct page, lru);
 			/* must delete as __free_one_page list manipulates */
 			list_del(&page->lru);
 			mt = get_freepage_migratetype(page);
+
+			dest_nid = dnuma_page_needs_move(page);
+			if (dest_nid != NUMA_NO_NODE) {
+				dnuma_prior_free_to_new_zone(page, 0,
+						nid_zone(dest_nid,
+							page_zonenum(page)),
+						dest_nid);
+				list_add(&page->lru, &need_move);
+				continue;
+			}
+
 			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
@@ -695,6 +708,27 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 		} while (--to_free && --batch_free && !list_empty(list));
 	}
 	spin_unlock(&zone->lock);
+
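+	/* now free the deferred pages into their new zones, taking each
+	 * destination zone's lock in turn */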
+	list_for_each_entry_safe(page, pos, &need_move, lru) {
+		struct zone *dest_zone = page_zone(page);
+		int mt;
+
+		spin_lock(&dest_zone->lock);
+
+		VM_BUG_ON(dest_zone != page_zone(page));
+		pr_devel("freeing pcp page %pK with changed node\n", page);
+		list_del(&page->lru);
+		mt = get_freepage_migratetype(page);
+		__free_one_page(page, dest_zone, 0, mt);
+		trace_mm_page_pcpu_drain(page, 0, mt);
+
+		/* XXX: fold into "post_free_to_new_zone()" ? */
+		if (is_migrate_cma(mt))
+			__mod_zone_page_state(dest_zone, NR_FREE_CMA_PAGES, 1);
+		dnuma_post_free_to_new_zone(page, 0);
+
+		spin_unlock(&dest_zone->lock);
+	}
 }
 
 static void free_one_page(struct zone *zone, struct page *page, int order,
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 20/24] x86: memlayout: add an arch-specific initial memlayout setter.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (18 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 19/24] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
@ 2013-02-28 21:26   ` Cody P Schafer
  2013-02-28 21:57   ` [PATCH 21/24] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
                     ` (4 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:26 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

On x86, we have numa_meminfo specifically to track the NUMA layout,
which is precisely the data memlayout needs, so use it to create the
initial memlayout.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ff3633c..a2a8dd5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/dnuma.h>
 
 #include <asm/e820.h>
 #include <asm/proto.h>
@@ -32,6 +33,29 @@ __initdata
 #endif
 ;
 
+#ifdef CONFIG_DYNAMIC_NUMA
+void __init memlayout_global_init(void)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	int i;
+	struct numa_memblk *blk;
+	struct memlayout *ml = memlayout_create(ML_INITIAL);
+	if (WARN_ON(!ml))
+		return;
+
+	pr_devel("x86/memlayout: adding ranges from numa_meminfo\n");
+	for (i = 0; i < mi->nr_blks; i++) {
+		blk = mi->blk + i;
+		pr_devel("  adding range {%LX[%LX]-%LX[%LX]}:%d\n",
+			 PFN_DOWN(blk->start), blk->start, PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1), blk->end - 1, blk->nid);
+		memlayout_new_range(ml, PFN_DOWN(blk->start), PFN_DOWN(blk->end - PAGE_SIZE / 2 - 1), blk->nid);
+	}
+	pr_devel("  done adding ranges from numa_meminfo\n");
+
+	memlayout_commit(ml);
+}
+#endif
+
 static int numa_distance_cnt;
 static u8 *numa_distance;
 
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 21/24] init/main: call memlayout_global_init() in start_kernel().
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (19 preceding siblings ...)
  2013-02-28 21:26   ` [PATCH 20/24] x86: memlayout: add an arch-specific initial memlayout setter Cody P Schafer
@ 2013-02-28 21:57   ` Cody P Schafer
  2013-02-28 21:57   ` [PATCH 22/24] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
                     ` (3 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:57 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

memlayout_global_init() initializes the first memlayout, which is
assumed to match the initial page-flag nid settings.

This is done in start_kernel() as the initdata used to populate the
memlayout is purged from memory early in the boot process (XXX: When?).

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 init/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/init/main.c b/init/main.c
index 63534a1..a1c2094 100644
--- a/init/main.c
+++ b/init/main.c
@@ -72,6 +72,7 @@
 #include <linux/ptrace.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
+#include <linux/memlayout.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -618,6 +619,7 @@ asmlinkage void __init start_kernel(void)
 	security_init();
 	dbg_late_init();
 	vfs_caches_init(totalram_pages);
+	memlayout_global_init();
 	signals_init();
 	/* rootfs populating might need page-writeback */
 	page_writeback_init();
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 22/24] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (20 preceding siblings ...)
  2013-02-28 21:57   ` [PATCH 21/24] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
@ 2013-02-28 21:57   ` Cody P Schafer
  2013-02-28 21:57   ` [PATCH 23/24] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
                     ` (2 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:57 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 mm/memlayout.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/mm/memlayout.c b/mm/memlayout.c
index 5fef032..b432b3a 100644
--- a/mm/memlayout.c
+++ b/mm/memlayout.c
@@ -249,3 +249,19 @@ void memlayout_global_init(void)
 
 	memlayout_commit(ml);
 }
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+/*
+ * Provides a default memory_add_physaddr_to_nid() for memory hotplug, unless
+ * overridden by the arch.
+ */
+__weak
+int memory_add_physaddr_to_nid(u64 start)
+{
+	int nid = memlayout_pfn_to_nid(PFN_DOWN(start));
+	if (nid == NUMA_NO_NODE)
+		return 0;
+	return nid;
+}
+EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+#endif
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 23/24] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid.
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (21 preceding siblings ...)
  2013-02-28 21:57   ` [PATCH 22/24] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
@ 2013-02-28 21:57   ` Cody P Schafer
  2013-02-28 21:57   ` [PATCH 24/24] XXX: x86/mm/numa: Avoid spamming warnings due to lack of cpu reconfig Cody P Schafer
  2013-04-04  5:28   ` [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration Simon Jeons
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:57 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

When a memlayout is tracked (i.e. CONFIG_DYNAMIC_NUMA is enabled), rather
than iterating over numa_meminfo, the lookup can be done using memlayout.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a2a8dd5..1ed76d5 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -28,7 +28,7 @@ struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
 
 static struct numa_meminfo numa_meminfo
-#ifndef CONFIG_MEMORY_HOTPLUG
+#if !defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DYNAMIC_NUMA)
 __initdata
 #endif
 ;
@@ -832,7 +832,7 @@ EXPORT_SYMBOL(cpumask_of_node);
 
 #endif	/* !CONFIG_DEBUG_PER_CPU_MAPS */
 
-#ifdef CONFIG_MEMORY_HOTPLUG
+#if defined(CONFIG_MEMORY_HOTPLUG) && !defined(CONFIG_DYNAMIC_NUMA)
 int memory_add_physaddr_to_nid(u64 start)
 {
 	struct numa_meminfo *mi = &numa_meminfo;
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 24/24] XXX: x86/mm/numa: Avoid spamming warnings due to lack of cpu reconfig
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (22 preceding siblings ...)
  2013-02-28 21:57   ` [PATCH 23/24] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
@ 2013-02-28 21:57   ` Cody P Schafer
  2013-04-04  5:28   ` [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration Simon Jeons
  24 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-02-28 21:57 UTC (permalink / raw)
  To: Linux MM; +Cc: Cody P Schafer, David Hansen

cpumask_of_node() wants to map a node id to a cpu mask, but we don't
update the arch-specific cpu masks when onlining a new node. For now,
avoid this warning (as it is expected) when DYNAMIC_NUMA is enabled.

Modifying __mem_online_node() to fix this up would be ideal.

Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
---
 arch/x86/mm/numa.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1ed76d5..e9a50df 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -813,10 +813,14 @@ void __cpuinit numa_remove_cpu(int cpu)
 const struct cpumask *cpumask_of_node(int node)
 {
 	if (node >= nr_node_ids) {
+		/* XXX: this ifdef should be removed when proper cpu to node
+		 * mapping updates are added */
+#ifndef CONFIG_DYNAMIC_NUMA
 		printk(KERN_WARNING
 			"cpumask_of_node(%d): node > nr_node_ids(%d)\n",
 			node, nr_node_ids);
 		dump_stack();
+#endif
 		return cpu_none_mask;
 	}
 	if (node_to_cpumask_map[node] == NULL) {
-- 
1.8.1.1

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration
  2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
                     ` (23 preceding siblings ...)
  2013-02-28 21:57   ` [PATCH 24/24] XXX: x86/mm/numa: Avoid spamming warnings due to lack of cpu reconfig Cody P Schafer
@ 2013-04-04  5:28   ` Simon Jeons
  2013-04-04 19:07     ` Cody P Schafer
  24 siblings, 1 reply; 28+ messages in thread
From: Simon Jeons @ 2013-04-04  5:28 UTC (permalink / raw)
  To: Cody P Schafer; +Cc: Linux MM, David Hansen

Hi Cody,
On 03/01/2013 04:44 AM, Cody P Schafer wrote:
> Some people asked me to send the email patches for this instead of just posting a git tree link
>
> For reference, this is the original message:
> 	http://lkml.org/lkml/2013/2/27/374

Could you show me your test codes?

> --
>
>   arch/x86/Kconfig                 |   1 -
>   arch/x86/include/asm/sparsemem.h |   4 +-
>   arch/x86/mm/numa.c               |  32 +++-
>   include/linux/dnuma.h            |  96 +++++++++++
>   include/linux/memlayout.h        | 111 +++++++++++++
>   include/linux/memory_hotplug.h   |   4 +
>   include/linux/mm.h               |   7 +-
>   include/linux/page-flags.h       |  18 ++
>   include/linux/rbtree.h           |  11 ++
>   init/main.c                      |   2 +
>   lib/rbtree.c                     |  40 +++++
>   mm/Kconfig                       |  44 +++++
>   mm/Makefile                      |   2 +
>   mm/dnuma.c                       | 351 +++++++++++++++++++++++++++++++++++++++
>   mm/internal.h                    |  13 +-
>   mm/memlayout-debugfs.c           | 323 +++++++++++++++++++++++++++++++++++
>   mm/memlayout-debugfs.h           |  35 ++++
>   mm/memlayout.c                   | 267 +++++++++++++++++++++++++++++
>   mm/memory_hotplug.c              |  53 +++---
>   mm/page_alloc.c                  | 112 +++++++++++--
>   20 files changed, 1486 insertions(+), 40 deletions(-)
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration
  2013-04-04  5:28   ` [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration Simon Jeons
@ 2013-04-04 19:07     ` Cody P Schafer
  0 siblings, 0 replies; 28+ messages in thread
From: Cody P Schafer @ 2013-04-04 19:07 UTC (permalink / raw)
  To: Simon Jeons; +Cc: Linux MM, David Hansen

On 04/03/2013 10:28 PM, Simon Jeons wrote:
> Hi Cody,
> On 03/01/2013 04:44 AM, Cody P Schafer wrote:
>> Some people asked me to send the email patches for this instead of
>> just posting a git tree link
>>
>> For reference, this is the original message:
>>     http://lkml.org/lkml/2013/2/27/374
>
> Could you show me your test codes?
>

Sure, I linked to it in the original email

 >	https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test

I generally run something like `dnuma-test s 1 3 512`, which creates 
stripes with size='512 pages' and distributes them between nodes 1, 2, 
and 3.

Also, this patchset has some major issues (not updating the watermarks, 
for example). I've been working on ironing them out, and plan on sending 
another patchset out "soon". Current tree is 
https://github.com/jmesmon/linux/tree/dnuma/v31 (keep in mind that this 
has a few commits in it that I just use for development).

>> --
>>
>>   arch/x86/Kconfig                 |   1 -
>>   arch/x86/include/asm/sparsemem.h |   4 +-
>>   arch/x86/mm/numa.c               |  32 +++-
>>   include/linux/dnuma.h            |  96 +++++++++++
>>   include/linux/memlayout.h        | 111 +++++++++++++
>>   include/linux/memory_hotplug.h   |   4 +
>>   include/linux/mm.h               |   7 +-
>>   include/linux/page-flags.h       |  18 ++
>>   include/linux/rbtree.h           |  11 ++
>>   init/main.c                      |   2 +
>>   lib/rbtree.c                     |  40 +++++
>>   mm/Kconfig                       |  44 +++++
>>   mm/Makefile                      |   2 +
>>   mm/dnuma.c                       | 351
>> +++++++++++++++++++++++++++++++++++++++
>>   mm/internal.h                    |  13 +-
>>   mm/memlayout-debugfs.c           | 323
>> +++++++++++++++++++++++++++++++++++
>>   mm/memlayout-debugfs.h           |  35 ++++
>>   mm/memlayout.c                   | 267 +++++++++++++++++++++++++++++
>>   mm/memory_hotplug.c              |  53 +++---
>>   mm/page_alloc.c                  | 112 +++++++++++--
>>   20 files changed, 1486 insertions(+), 40 deletions(-)
>>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread

Thread overview: 28+ messages
2013-02-28  2:41 [RFC] DNUMA: Runtime NUMA memory layout reconfiguration Cody P Schafer
2013-02-28 20:44 ` [RFC][PATCH 00/24] " Cody P Schafer
2013-02-28 20:44   ` [PATCH 01/24] XXX: reduce MAX_PHYSADDR_BITS & MAX_PHYSMEM_BITS in PAE Cody P Schafer
2013-02-28 20:44   ` [PATCH 02/24] XXX: x86/Kconfig: simplify NUMA config for NUMA_EMU on X86_32 Cody P Schafer
2013-02-28 20:44   ` [PATCH 03/24] XXX: memory_hotplug locking note in online_pages Cody P Schafer
2013-02-28 20:44   ` [PATCH 04/24] rbtree: add postorder iteration functions Cody P Schafer
2013-02-28 20:44   ` [PATCH 05/24] rbtree: add rbtree_postorder_for_each_entry_safe() helper Cody P Schafer
2013-02-28 20:44   ` [PATCH 06/24] mm/memory_hotplug: factor out zone+pgdat growth Cody P Schafer
2013-02-28 20:44   ` [PATCH 07/24] memory_hotplug: export ensure_zone_is_initialized() in mm/internal.h Cody P Schafer
2013-02-28 20:44   ` [PATCH 08/24] mm/memory_hotplug: use {pgdat,zone}_is_empty() when resizing zones & pgdats Cody P Schafer
2013-02-28 20:44   ` [PATCH 09/24] mm: add nid_zone() helper Cody P Schafer
2013-02-28 21:26   ` [PATCH 10/24] page_alloc: add return_pages_to_zone() when DYNAMIC_NUMA is enabled Cody P Schafer
2013-02-28 21:26   ` [PATCH 11/24] page_alloc: in move_freepages(), skip pages instead of VM_BUG on node differences Cody P Schafer
2013-02-28 21:26   ` [PATCH 12/24] page_alloc: when dynamic numa is enabled, don't check that all pages in a block belong to the same zone Cody P Schafer
2013-02-28 21:26   ` [PATCH 13/24] page-flags dnuma: reserve a pageflag for determining if a page needs a node lookup Cody P Schafer
2013-02-28 21:26   ` [PATCH 14/24] memory_hotplug: factor out locks in mem_online_cpu() Cody P Schafer
2013-02-28 21:26   ` [PATCH 15/24] mm: add memlayout & dnuma to track pfn->nid & transplant pages between nodes Cody P Schafer
2013-02-28 21:26   ` [PATCH 16/24] mm: memlayout+dnuma: add debugfs interface Cody P Schafer
2013-02-28 21:26   ` [PATCH 17/24] page_alloc: use dnuma to transplant newly freed pages in __free_pages_ok() Cody P Schafer
2013-02-28 21:26   ` [PATCH 18/24] page_alloc: use dnuma to transplant newly freed pages in free_hot_cold_page() Cody P Schafer
2013-02-28 21:26   ` [PATCH 19/24] page_alloc: transplant pages that are being flushed from the per-cpu lists Cody P Schafer
2013-02-28 21:26   ` [PATCH 20/24] x86: memlayout: add an arch-specific initial memlayout setter Cody P Schafer
2013-02-28 21:57   ` [PATCH 21/24] init/main: call memlayout_global_init() in start_kernel() Cody P Schafer
2013-02-28 21:57   ` [PATCH 22/24] dnuma: memlayout: add memory_add_physaddr_to_nid() for memory_hotplug Cody P Schafer
2013-02-28 21:57   ` [PATCH 23/24] x86/mm/numa: when dnuma is enabled, use memlayout to handle memory hotplug's physaddr_to_nid Cody P Schafer
2013-02-28 21:57   ` [PATCH 24/24] XXX: x86/mm/numa: Avoid spamming warnings due to lack of cpu reconfig Cody P Schafer
2013-04-04  5:28   ` [RFC][PATCH 00/24] DNUMA: Runtime NUMA memory layout reconfiguration Simon Jeons
2013-04-04 19:07     ` Cody P Schafer
