[PATCH] Document Linux Memory Policy


From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: linux-mm <linux-mm@kvack.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Andi Kleen <ak@suse.de>, Christoph Lameter <clameter@sgi.com>
Subject: [PATCH] Document Linux Memory Policy
Date: Tue, 29 May 2007 15:33:53 -0400
Message-ID: <1180467234.5067.52.camel@localhost>

[PATCH] Document Linux Memory Policy

I couldn't find any memory policy documentation in the Documentation
directory, so here is my attempt to document it.  My objectives are
twofold:

1) to provide missing documentation for anyone interested in this topic,

2) to explain my current understanding, on which I base proposed patches
   to address what I see as missing or broken behavior.

There's lots more that could be written about the internal design--including
data structures, functions, etc.  And one could address the interaction of
memory policy with cpusets.  I haven't tackled that yet.  However, if you
agree that this is better than the nothing that exists now, perhaps it could
be added to -mm.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/memory_policy.txt |  339 +++++++++++++++++++++++++++++++++++++
 1 files changed, 339 insertions(+)

Index: Linux/Documentation/vm/memory_policy.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ Linux/Documentation/vm/memory_policy.txt	2007-05-29 15:08:01.000000000 -0400
@@ -0,0 +1,339 @@
+
+What is Linux Memory Policy?
+
+In the Linux kernel, "memory policy" determines from which node the kernel will
+allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
+The current memory policy support was added to Linux 2.6 around May 2004.  This
+document attempts to describe the concepts and APIs of the 2.6 memory policy
+support.
+
+	TODO:  try to describe internal design?
+
+MEMORY POLICY CONCEPTS
+
+Scope of Memory Policies
+
+The Linux kernel supports four more or less distinct scopes of memory policy:
+
+    System Default Policy:  this policy is "hard coded" into the kernel.  It
+    is the policy that governs all page allocations that aren't controlled
+    by one of the more specific policy scopes discussed below.
+
+    Task/Process Policy:  this is an optional, per-task policy.  When defined
+    for a specific task, this policy controls all page allocations made by or
+    on behalf of the task that aren't controlled by a more specific scope.
+    If a task does not define a task policy, then all page allocations that
+    would have been controlled by the task policy "fall back" to the System
+    Default Policy.
+
+	Because task policy applies to the entire address space of a task,
+	it is inheritable across both fork() [clone() w/o the CLONE_VM flag]
+	and exec*().  Thus, a parent task may establish the task policy for
+	a child task exec()'d from an executable image that has no awareness
+	of memory policy.
+
+	In a multi-threaded task, task policies apply only to the thread
+	[Linux kernel task] that installs the policy and any threads
+	subsequently created by that thread.  Any sibling threads existing
+	at the time a new task policy is installed retain their current
+	policy.
+
+	A task policy applies only to pages allocated after the policy is
+	installed.  Any pages already faulted in by the task remain where
+	they were allocated based on the policy at the time they were
+	allocated.
+
+    VMA Policy:  A "VMA" or "Virtual Memory Area" refers to a range of a task's
+    virtual address space.  A task may define a specific policy for a range
+    of its virtual address space.  This VMA policy will govern the allocation
+    of pages that back this region of the address space.  Any regions of the
+    task's address space that don't have an explicit VMA policy will fall back
+    to the task policy, which may itself fall back to the system default policy.
+
+	VMA policy applies ONLY to anonymous pages.  These include pages
+	allocated for anonymous segments, such as the task stack and heap, and
+	any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
+	Anonymous pages copied from private file mappings [files mmap()ed with
+	the MAP_PRIVATE flag] also obey VMA policy, if defined.
+
+	VMA policies are shared between all tasks that share a virtual address
+	space--a.k.a. threads--independent of when the policy is installed; and
+	they are inherited across fork().  However, because VMA policies refer
+	to a specific region of a task's address space, and because the address
+	space is discarded and recreated on exec*(), VMA policies are NOT
+	inheritable across exec().  Thus, only NUMA-aware applications may
+	use VMA policies.
+
+	A task may install a new VMA policy on a sub-range of a previously
+	mmap()ed region.  When this happens, Linux splits the existing virtual
+	memory area into 2 or 3 VMAs, each with its own policy.
+
+	By default, VMA policy applies only to pages allocated after the policy
+	is installed.  Any pages already faulted into the VMA range remain where
+	they were allocated based on the policy at the time they were
+	allocated.  However, since 2.6.16, Linux supports page migration so
+	that page contents can be moved to match a newly installed policy.
+
+    Shared Policy:  This policy applies to "memory objects" mapped shared into
+    one or more tasks' distinct address spaces.  Shared policies are applied
+    directly to the shared object.  Thus, all tasks that attach to the object
+    share the policy, and all pages allocated for the shared object, by any
+    task, will obey the shared policy.
+
+	Currently [2.6.22], only shared memory segments, created by shmget(),
+	support shared policy.  When shared policy support was added to Linux,
+	the associated data structures were added to shared hugetlbfs segments.
+	However, at the time, hugetlbfs did not support allocation at fault
+	time--a.k.a. lazy allocation--so hugetlbfs segments were never "hooked
+	up" to the shared policy support.  Although hugetlbfs segments now
+	support lazy allocation, their support for shared policy has not been
+	completed.
+
+	Although, internally, the kernel treats shared memory segments as
+	files backed by swap space that have been mmap()ed shared into tasks'
+	address spaces, regular files mmap()ed shared do NOT support shared
+	policy.  Rather, shared page cache pages, including pages backing
+	private mappings that have not yet been written by the task, follow
+	task policy, if any, else system default policy.
+
+	The shared policy infrastructure supports different policies on subset
+	ranges of the shared object.  However, Linux still splits the VMA of
+	the task that installs the policy for each range of distinct policy.
+	Thus, different tasks that attach to a shared memory segment can have
+	different VMA configurations mapping that one shared object.
+
+Components of Memory Policies
+
+    A Linux memory policy is a tuple consisting of a "mode" and an optional set
+    of nodes.  The mode determines the behavior of the policy, while the optional
+    set of nodes can be viewed as the arguments to the behavior.
+
+	Note:  in some functions, the mode is called "policy".  However, to
+	avoid confusion with the policy tuple, this document will continue
+	to use the term "mode".
+
+   Linux memory policy supports the following 4 modes:
+
+	Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
+	context dependent.
+
+	    The system default policy is hard coded to contain the Default mode.
+	    In this context, it means "local" allocation--that is, attempt to
+	    allocate the page from the node associated with the cpu where the
+	    fault occurs.  If the "local" node has no memory, or the node's
+	    memory is exhausted [no free pages available], local allocation
+	    will attempt to allocate pages from "nearby" nodes, using per node
+	    lists of nodes--called zonelists--built at boot time.
+
+		TODO:  address runtime rebuild of node/zonelists when
+		supported.
+
+	    When a task/process policy contains the Default mode, it means
+	    "fall back to the system default mode".  And, as discussed above,
+	    this means use "local" allocation.
+
+	    In the context of a VMA, Default mode means "fall back to task
+	    policy"--which may, itself, fall back to system default policy.
+	    In the context of shared policies, Default mode means fall back
+	    directly to the system default policy.  Note:  the result of this
+	    semantic is that if the task policy is something other than Default,
+	    it is not possible to specify local allocation for a region of the
+	    task's address space using a VMA policy.
+
+	    The Default mode does not use the optional set of nodes.
+
+	MPOL_BIND:  This mode specifies that memory must come from the
+	set of nodes specified by the policy.  The kernel builds a custom
+	zonelist containing just the nodes specified by the Bind policy.
+	If the kernel is unable to allocate a page from the first node in the
+	custom zonelist, it moves on to the next, and so forth.  If it is unable
+	to allocate a page from any of the nodes in this list, the allocation
+	will fail.
+
+	    The memory policy APIs do not specify an order in which the nodes
+	    will be searched.  However, unlike the per node zonelists mentioned
+	    above, the custom zonelist for the Bind policy does not consider the
+	    distance between the nodes.  Rather, the lists are built in order
+	    of numeric node id.
+
+
+	MPOL_PREFERRED:  This mode specifies that the allocation should be
+	attempted from the single node specified in the policy.  If that
+	allocation fails, the kernel will search other nodes, exactly as
+	it would for a local allocation that started at the preferred node--
+	that is, using the per-node zonelists in increasing distance from
+	the preferred node.
+
+	    If the Preferred policy specifies more than one node, the node
+	    with the numerically lowest node id will be selected to start
+	    the allocation scan.
+
+	MPOL_INTERLEAVE:  This mode specifies that page allocations be
+	interleaved, on a page granularity, across the nodes specified in
+	the policy.  This mode also behaves slightly differently, based on
+	the context where it is used:
+
+	    For allocation of anonymous pages and shared memory pages,
+	    Interleave mode indexes the set of nodes specified by the policy
+	    using the page offset of the faulting address into the segment
+	    [VMA] containing the address modulo the number of nodes specified
+	    by the policy.  It then attempts to allocate a page, starting at
+	    the selected node, as if the node had been specified by a Preferred
+	    policy or had been selected by a local allocation.  That is,
+	    allocation will follow the per node zonelist.
+
+	    For allocation of page cache pages, Interleave mode indexes the set
+	    of nodes specified by the policy using a node counter maintained
+	    per task.  This counter wraps around to the lowest specified node
+	    after it reaches the highest specified node.  This will tend to
+	    spread the pages out over the nodes specified by the policy based
+	    on the order in which they are allocated, rather than based on any
+	    page offset into an address range or file.
+
+MEMORY POLICY APIs
+
+Linux supports 3 system calls for controlling memory policy.  These APIs
+always affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space.
+
+	Note:  the headers that define these APIs and the parameter data types
+	for user space applications reside in a package that is not part of
+	the Linux kernel.  The kernel system call interfaces, with the 'sys_'
+	prefix, are defined in <linux/syscalls.h>; the mode and flag
+	definitions are defined in <linux/mempolicy.h>.
+
+Set [Task] Memory Policy:
+
+	long set_mempolicy(int mode, const unsigned long *nmask,
+					unsigned long maxnode);
+
+	Sets the calling task's "task/process memory policy" to the mode
+	specified by the 'mode' argument and the set of nodes defined
+	by 'nmask'.  'nmask' points to a bit mask of node ids containing
+	at least 'maxnode' ids.
+
+	If successful, the specified policy will control the allocation
+	of all pages, by and on behalf of this task and its descendants,
+	that aren't controlled by a more specific VMA or shared policy.
+	If the calling task is part of a multi-threaded application, the
+	task policies of other existing threads are unchanged.
+
+Get [Task] Memory Policy or Related Information
+
+	long get_mempolicy(int *mode,
+			   const unsigned long *nmask, unsigned long maxnode,
+			   void *addr, int flags);
+
+	Queries the "task/process memory policy" of the calling task, or
+	the policy or location of a specified virtual address, depending
+	on the 'flags' argument.
+
+	If 'flags' is 0, get_mempolicy() returns the calling task's policy
+	as set by set_mempolicy() or inherited from its parent.  The mode
+	is stored in the location pointed to by the 'mode' argument, if it
+	is non-NULL.  The associated node mask, if any, is stored in the bit
+	mask pointed to by a non-NULL 'nmask' argument.  When 'nmask' is
+	non-NULL, 'maxnode' must specify one greater than the maximum bit
+	number that can be stored in 'nmask'--i.e., the number of bits.
+
+	If 'flags' specifies MPOL_F_ADDR, get_mempolicy() returns similar
+	policy information that governs the allocation of pages at the
+	specified 'addr'.  This may be different from the task policy--
+	i.e., if a VMA or shared policy applies to that address.
+
+	'flags' may also contain 'MPOL_F_NODE'.  This flag has been
+	described in some get_mempolicy() man pages as "not for application
+	use" and subject to change.  Applications are cautioned against
+	using it.  However, for completeness and because it is useful for
+	testing the kernel memory policy support, current behavior is
+	documented here:
+
+	If 'flags' contains MPOL_F_NODE, but not MPOL_F_ADDR, and if
+	the task policy of the calling task specifies the Interleave
+	mode [MPOL_INTERLEAVE], get_mempolicy() will return the next
+	node on which a page cache page would be allocated by the calling
+	task, in the location pointed to by a non-NULL 'mode'.
+
+	If 'flags' contains MPOL_F_NODE and MPOL_F_ADDR, and 'addr'
+	contains a valid address in the calling task's address space,
+	get_mempolicy() will return the node where the page backing that
+	address resides.  If no page has currently been allocated for
+	the specified address, a page will be allocated as if the task
+	had performed a read/load from that address.  The node of the
+	page allocated will be returned.
+
+	    Note:  if the address specifies an anonymous region of the
+	    task's address space with no page currently allocated, the
+	    resulting "read access fault" will likely just map the shared
+	    ZEROPAGE.  It will NOT, for example, allocate a local page in
+	    the case of default policy [unless the task happens to be
+	    running on the node containing the ZEROPAGE], nor will it obey
+	    VMA policy, if any.
+
+
+Install VMA/Shared Policy for a Range of Task's Address Space
+
+	long mbind(void *start, unsigned long len, int mode,
+		   const unsigned long *nmask, unsigned long maxnode,
+		   unsigned flags);
+
+	mbind() applies the policy specified by (mode, nmask, maxnode) to
+	the range of the calling task's address space specified by the
+	'start' and 'len' arguments.  Additional actions may be requested
+	via the 'flags' argument.
+
+	If the address space range covers an anonymous region or a private
+	mapping of a regular file, a VMA policy will be installed in this
+	region.  This policy will govern all subsequent allocations of pages
+	for that range for all threads in the task.
+
+	    For the case of a private mapping of a regular file, the
+	    specified policy will only govern the allocation of anonymous
+	    pages created when the task writes/stores to an address in the
+	    range.  Pages allocated for read faults will use the faulting
+	    task's task policy, if any, else the system default.
+
+	If the address space range maps a shared object, such as a shared
+	memory segment, a shared policy will be installed on the specified
+	range of the underlying shared object.  This policy will govern all
+	subsequent allocations of pages for that range of the shared object,
+	for all tasks that map/attach the shared object.
+
+	If the address space range maps a shared hugetlbfs segment, a VMA
+	policy will be installed for that range.  This policy will govern
+	subsequent huge page allocations from the calling task, but will
+	be ignored by any subsequent huge page allocations from other tasks
+	that attach to the hugetlb shared memory object.
+
+	If the address space range covers a shared mapping of a regular
+	file, a VMA policy will be installed for that range.  This policy
+	will be ignored for all page allocations by the calling task or
+	by any other task.  Rather, all page allocations in that range will
+	be allocated using the faulting task's task policy, if any, else
+	the system default policy.
+
+	Before 2.6.16, Linux did not support page migration.  Therefore,
+	if any pages were already allocated in the range specified by the
+	mbind() call, the application was stuck with their existing location.
+	However, mbind() did, and still does, support the MPOL_MF_STRICT flag.
+	This flag causes mbind() to check the specified range for any
+	existing pages that don't obey the specified policy.  If any such
+	pages exist, the mbind() call fails with the EIO error number.
+
+	Since 2.6.16, Linux supports direct [synchronous] page migration
+	via the mbind() system call.  When the 'flags' argument specifies
+	MPOL_MF_MOVE, mbind() will attempt to migrate all existing pages
+	in the range to match the specified policy.  However, the MPOL_MF_MOVE
+	flag will migrate only those pages that are only referenced by the
+	calling task's page tables [internally:  page's mapcount == 1].  The
+	MPOL_MF_STRICT flag may be specified to detect whether any pages
+	could not be migrated for this or other reasons.
+
+	A privileged task [with CAP_SYS_NICE] may specify the MPOL_MF_MOVE_ALL
+	flag.  With this flag, mbind() will attempt to migrate pages in the
+	range to match the specified policy, regardless of the number of page
+	table entries referencing the page [regardless of mapcount].  Again,
+	some conditions may still prevent pages from being migrated, and the
+	MPOL_MF_STRICT flag may be specified to detect this condition.
+
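
Illustration only, not part of the patch above: the fallback order described
under "Scope of Memory Policies" and "Default Mode" could be sketched in C
roughly as follows.  Every name here (sketch_policy, sketch_effective_policy,
and so on) is hypothetical and chosen for this example; none of them are
kernel identifiers, and the real kernel lookup is more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; these are not kernel identifiers. */
enum sketch_mode { SK_DEFAULT, SK_BIND, SK_PREFERRED, SK_INTERLEAVE };

struct sketch_policy {
	enum sketch_mode mode;
	unsigned long nodes;	/* bit n set => node n is in the node set */
};

/* System default policy: Default mode ("local" allocation), no node set. */
static const struct sketch_policy sketch_system_default = { SK_DEFAULT, 0UL };

/* A missing (NULL) policy and a Default-mode policy both mean "fall back". */
static int sketch_falls_back(const struct sketch_policy *p)
{
	return p == NULL || p->mode == SK_DEFAULT;
}

/*
 * Pick the policy that governs one page allocation, following the order
 * the document describes:
 *   shared object:  shared policy -> system default
 *   anything else:  VMA policy -> task policy -> system default
 */
static const struct sketch_policy *
sketch_effective_policy(int is_shared,
			const struct sketch_policy *shared,
			const struct sketch_policy *vma,
			const struct sketch_policy *task)
{
	assert(is_shared == 0 || is_shared == 1);

	if (is_shared) {
		/* Default mode on a shared policy skips the task policy. */
		return sketch_falls_back(shared) ? &sketch_system_default
						 : shared;
	}
	if (!sketch_falls_back(vma))
		return vma;
	if (!sketch_falls_back(task))
		return task;
	return &sketch_system_default;
}
```

With a task policy of Interleave and a VMA policy of Default, this sketch
returns the task policy, which matches the note in the document that a VMA
policy cannot request local allocation once the task policy is something
other than Default.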

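Illustration only, not part of the patch above: the two node selection rules
described for Interleave mode could be sketched as below.  The node set is
assumed to fit in a single unsigned long, and all function names are
hypothetical; the kernel's own code differs.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_MAX_NODES ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Count the nodes present in the mask. */
static int sketch_node_count(unsigned long nodes)
{
	int n = 0;

	for (; nodes != 0; nodes &= nodes - 1)
		n++;
	return n;
}

/* Return the id of the index-th set node, lowest id first; -1 if none. */
static int sketch_nth_node(unsigned long nodes, int index)
{
	for (int node = 0; node < SKETCH_MAX_NODES; node++) {
		if (nodes & (1UL << node)) {
			if (index == 0)
				return node;
			index--;
		}
	}
	return -1;
}

/*
 * Anonymous and shared memory pages: index the node set with the page
 * offset of the faulting address into its VMA, modulo the node count.
 */
static int sketch_interleave_by_offset(unsigned long nodes,
				       unsigned long page_offset)
{
	int count = sketch_node_count(nodes);

	assert(count > 0);
	return sketch_nth_node(nodes, (int)(page_offset % (unsigned long)count));
}

/*
 * Page cache pages: use a per-task counter.  Returns the node to use now
 * and advances *next to the following node in the set, wrapping from the
 * highest node back to the lowest.  *next must already name a node in
 * the set.
 */
static int sketch_interleave_by_counter(unsigned long nodes, int *next)
{
	int node = *next;
	int candidate = node;

	assert(node >= 0 && node < SKETCH_MAX_NODES);
	assert(nodes & (1UL << node));
	do {
		candidate = (candidate + 1) % SKETCH_MAX_NODES;
	} while (!(nodes & (1UL << candidate)));
	*next = candidate;
	return node;
}
```

The offset-based rule gives the same node for the same page offset no matter
which task faults first, while the counter-based rule depends only on the
order in which the task allocates.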

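Illustration only, not part of the patch above: the 'nmask' and 'maxnode'
arguments taken by the three system calls form a bit mask of node ids plus
the number of bits in that mask.  A user-space helper for building such a
mask might look like the sketch below.  The helper names are hypothetical,
and the sketch deliberately makes no system calls.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold 'maxnode' bits. */
static size_t sketch_mask_longs(unsigned long maxnode)
{
	return (maxnode + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Clear every word that backs a mask of 'maxnode' bits. */
static void sketch_mask_clear(unsigned long *nmask, unsigned long maxnode)
{
	memset(nmask, 0, sketch_mask_longs(maxnode) * sizeof(unsigned long));
}

/* Set the bit for 'node'; returns 0, or -1 if 'node' does not fit. */
static int sketch_mask_set(unsigned long *nmask, unsigned long maxnode,
			   unsigned long node)
{
	if (node >= maxnode)
		return -1;
	nmask[node / SKETCH_BITS_PER_LONG] |=
		1UL << (node % SKETCH_BITS_PER_LONG);
	return 0;
}

/* Returns 1 if the bit for 'node' is set, else 0. */
static int sketch_mask_test(const unsigned long *nmask, unsigned long maxnode,
			    unsigned long node)
{
	if (node >= maxnode)
		return 0;
	return (nmask[node / SKETCH_BITS_PER_LONG] >>
		(node % SKETCH_BITS_PER_LONG) & 1UL) != 0;
}
```

A caller would size an array with sketch_mask_longs(), set the wanted node
ids, and then hand the array and its bit count to set_mempolicy() or mbind()
as the (nmask, maxnode) pair shown in the prototypes in the patch.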
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
