From: Cody P Schafer <cody@linux.vnet.ibm.com>
To: Linux MM <linux-mm@kvack.org>
Cc: David Hansen <dave@linux.vnet.ibm.com>,
	Cody P Schafer <cody@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: [RFC] DNUMA: Runtime NUMA memory layout reconfiguration
Date: Wed, 27 Feb 2013 18:41:12 -0800
Message-ID: <20130228024112.GA24970@negative>

These patches allow the NUMA memory layout (meaning the mapping of a page to a
node) to be changed at runtime in place (without hotplugging).

= Why/when is this useful? =

It is useful in virtual machines (VMs) running on NUMA systems, both [a] when
the hypervisor decides to move their backing memory around (compacting,
prioritizing another VM's desired layout, etc.) and [b] in general for
migration of VMs.

The hardware is _already_ changing the NUMA layout underneath us. We have
powerpc64 systems with firmware that currently moves the backing memory around
and has the ability to notify Linux of new NUMA info.

= Code & testing =

web:
	https://github.com/jmesmon/linux/tree/dnuma/v26
git:
	https://github.com/jmesmon/linux.git dnuma/v26

commit range:
	7e4f3230c9161706ebe9d37d774398082dc352de^..01e16461cf4a914feb1a34ed8dd7b28f3e842645

Some patches are marked "XXX: ..."; they are only for testing or
temporary documentation purposes.

A debugfs interface allows the NUMA memory layout to be changed.  Basically,
you don't need weird systems to test this; in fact, I've done all my
testing so far in plain old qemu-i386.

A script which stripes the memory between nodes or pushes all memory to a
(potentially new) node is available here:

	https://raw.github.com/jmesmon/trifles/master/bin/dnuma-test
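To make "striping" concrete, here is a minimal userspace sketch of the layout
such a script would request: consecutive blocks of page frames assigned
round-robin to nodes. The names `pfn_range` and `stripe_layout` are
hypothetical and do not come from the patches or the script, and the sketch
deliberately does not touch the debugfs interface, whose file names are not
described in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: pfns [start_pfn, end_pfn] (inclusive) map to nid. */
struct pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/*
 * Fill 'out' with a striped layout: consecutive blocks of 'block_pfns'
 * page frames are assigned round-robin to nodes 0..nr_nodes-1.
 * Returns the number of ranges written (at most 'max').
 */
static size_t stripe_layout(struct pfn_range *out, size_t max,
			    unsigned long total_pfns,
			    unsigned long block_pfns, int nr_nodes)
{
	size_t n = 0;
	unsigned long pfn;

	assert(block_pfns > 0 && nr_nodes > 0);

	for (pfn = 0; pfn < total_pfns && n < max; pfn += block_pfns, n++) {
		unsigned long last = pfn + block_pfns - 1;

		/* The final block may be shorter than block_pfns. */
		if (last >= total_pfns)
			last = total_pfns - 1;

		out[n].start_pfn = pfn;
		out[n].end_pfn = last;
		out[n].nid = (int)(n % (size_t)nr_nodes);
	}
	return n;
}
```

Assuming 4 KiB pages, a 128MB guest has 32768 pfns, so striping it in blocks
of 256 pfns across 3 nodes produces 128 ranges.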

Related LSF/MM Topic proposal:

	http://permalink.gmane.org/gmane.linux.kernel.mm/95342

= How are you managing to do this? =

Reconfiguration of page->node mappings is done at the page allocator
level by both pulling out free pages (when a new memory layout is
committed) & redirecting pages on free to their new node.

Because we can't change page_node(A) while A is allocated, an rbtree
holding the mapping from pfn ranges to node ids ('struct memlayout')
is introduced to track the pfn->node mapping for
yet-to-be-transplanted pages. A lookup in this rbtree occurs on any
page allocator path that decides which zone to free a page to.
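The lookup itself is a range query: given a pfn, find the range that contains
it and return that range's node id. The sketch below is a userspace
illustration only. It swaps the rbtree for a sorted array with binary search
to stay self-contained, and the names `ml_range`, `ml_layout`, `ml_pfn_to_nid`
and `ML_NO_NODE` are hypothetical, not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

#define ML_NO_NODE (-1)	/* returned when no range covers the pfn */

/* pfns [start_pfn, end_pfn] (inclusive) belong to node nid. */
struct ml_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/* Ranges must be sorted by start_pfn and must not overlap. */
struct ml_layout {
	const struct ml_range *ranges;
	size_t nr;
};

/* Binary search for the range containing 'pfn'. */
static int ml_pfn_to_nid(const struct ml_layout *ml, unsigned long pfn)
{
	size_t lo = 0, hi;

	assert(ml != NULL);
	hi = ml->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct ml_range *r = &ml->ranges[mid];

		if (pfn < r->start_pfn)
			hi = mid;
		else if (pfn > r->end_pfn)
			lo = mid + 1;
		else
			return r->nid;
	}
	return ML_NO_NODE;
}
```

A tree keeps the same O(log n) lookup while also allowing cheap insertion of
new ranges, which a sorted array does not.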

To avoid horrible performance due to rbtree lookups all the time, the
rbtree is only consulted when the page is marked with a new pageflag
(LookupNode).
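The gating can be pictured as a fast path and a slow path on free. The
following is a toy userspace model, not the patch code: `toy_page`,
`TOY_PG_LOOKUP_NODE`, `toy_slow_lookup` and `toy_free_target_nid` are
hypothetical names, the slow lookup is a stand-in for the rbtree walk, and
clearing the flag after the lookup is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PG_LOOKUP_NODE (1UL << 0)	/* stand-in for the new pageflag */

struct toy_page {
	unsigned long flags;
	int nid;	/* node currently recorded in the page */
};

/* Counts slow-path lookups so their cost is visible. */
static unsigned long toy_lookups;

/* Stand-in for the rbtree walk: 256-pfn stripes over 3 nodes. */
static int toy_slow_lookup(unsigned long pfn)
{
	toy_lookups++;
	return (int)((pfn / 256) % 3);
}

/*
 * Decide which node a page being freed should go to.  Unflagged pages
 * take the fast path and keep their recorded node; flagged pages pay
 * for one lookup and are then retargeted.
 */
static int toy_free_target_nid(struct toy_page *page, unsigned long pfn)
{
	assert(page != NULL);

	if (!(page->flags & TOY_PG_LOOKUP_NODE))
		return page->nid;

	page->nid = toy_slow_lookup(pfn);
	page->flags &= ~TOY_PG_LOOKUP_NODE;	/* assumption: one-shot flag */
	return page->nid;
}
```

In this model only pages whose node changed under them while allocated ever
reach the slow path, so an unchanged layout costs one flag test per free.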

= Current Limitations =

For the reconfiguration to be effective (and not make the allocator make
poorer choices), updating the cpu->node mappings is also needed. This patchset
does _not_ handle this. Also missing is a way to update topology (node
distances), which is less fatal.

These patches only work on SPARSEMEM and the node id _must_ fit in the pageflags
(can't be pushed out to the section). This generally means that 32-bit
platforms are out (unless you hack MAX_PHYS{ADDR,MEM}_BITS like I do for
testing).
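The 32-bit restriction comes down to a bit budget in page->flags: with
SPARSEMEM (without vmemmap) the section number, the zone and the node id all
compete with the real page flags for one unsigned long. The sketch below
mirrors the shape of the kernel's build-time check as a runtime function. The
struct and function names are hypothetical, and the numbers in the usage note
(24 page flags, 2 zone bits, 3 node bits, 29-bit sections) are illustrative
assumptions rather than values taken from these patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Widths, in bits, of the fields packed into page->flags. */
struct flags_budget {
	int bits_per_long;	/* 32 or 64 */
	int nr_pageflags;	/* bits used by the actual page flags */
	int sections_width;	/* bits for the SPARSEMEM section number */
	int zones_width;	/* bits for the zone index */
	int nodes_shift;	/* bits needed for a node id */
};

/* Section-number bits: one section covers 2^section_size_bits bytes. */
static int sections_width(int max_physmem_bits, int section_size_bits)
{
	assert(max_physmem_bits >= section_size_bits);
	return max_physmem_bits - section_size_bits;
}

/* True when the node id can live in page->flags next to everything else. */
static bool node_fits_in_flags(const struct flags_budget *b)
{
	return b->sections_width + b->zones_width + b->nodes_shift
		<= b->bits_per_long - b->nr_pageflags;
}
```

With these assumed numbers, a 32-bit build with 36 physical memory bits needs
7 + 2 + 3 = 12 bits but has only 32 - 24 = 8 left, so the node id does not
fit. Reducing the physical memory bits to 32 shrinks the section number to 3
bits, and 3 + 2 + 3 = 8 fits exactly.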

This code does the reconfiguration without hotplugging memory at all (1
errant page doesn't keep us from fixing the rest of them). But it still
depends on MEMORY_HOTPLUG for functions that online nodes & adjust
zone/pgdat size.

Things that need doing (but aren't quite bugs):
 - While the interface is meant to be driven via a hypervisor/firmware, that
   portion is not yet included.
 - notifier for kernel users of memory that need/want their allocations on a
   particular node (NODE_DATA(), for instance).
 - notifier for userspace.
 - a way to allocate things from the appropriate node prior to the page
   allocator being fully updated (could just be "allocate it wrong now &
   reallocate later").
 - Make memlayout faster (potentially via per-node allocation, different data
   structure, or more/smarter caching).
 - (potentially) propagation of updated layout knowledge into kmem_caches
   (SL*B).

Known Bugs:
 - Transplant of free pages is _very_ slow due to excessive use of
   stop_machine() via zone_pcp_update() & build_zonelists(). On my i5 laptop,
   it takes ~9 minutes to stripe the layout in blocks of 256 pfns to 3 nodes on
   a 128MB 8 cpu x86_32 VM booted with 2 nodes.
 - memory leak when SLUB is used (struct kmem_cache_nodes are leaked), SLAB
   appears fine.
 - Locking of managed_pages/present_pages needs adjustment, or they need to
   be updated outside of the free page path.
 - Exported numa/memory info in sysfs isn't updated (`numactl --show` segfaults,
   `numactl --hardware` shows new nodes as nearly empty).
 - Uses pageflag setters without "owning" pages, could cause loss of pageflag
   updates when combined with non-atomic pageflag users in mm/*.
 - some strange sleeps while atomic (for me they occur when memory is
   moved out of all the boot nodes)


