All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: ak@suse.de, akpm@linux-foundation.org, linux-mm@kvack.org,
	clameter@sgi.com
Subject: [PATCH]  enhance memory policy sys call man pages v1
Date: Fri, 01 Jun 2007 17:15:44 -0400	[thread overview]
Message-ID: <1180732544.5278.158.camel@localhost> (raw)
In-Reply-To: <20070531082016.19080@gmx.net>

Subject was:  Re: [PATCH] Document Linux Memory Policy

On Thu, 2007-05-31 at 10:20 +0200, Michael Kerrisk wrote:
> > > > The docs are wrong. This is fully supported.
> > > 
> > > Yes, I gave up on that one and the warning in the manpage should be 
> > > probably dropped 
> > 
> > OK.  I'll work with the man page maintainers. 
> 
> Hi Lee,
> 
> If you could write a patch for the man page, that would be ideal.
> Location of current tarball is in the .sig.

[PATCH]  enhance memory policy sys call man pages v1

Against man pages 2.51

This patch enhances the 3 memory policy system call man pages
to add description of missing semantics, error return values,
etc.  The descriptions match the semantics of the kernel circa
2.6.21/22, as gleaned from the source code.

I have changed the "policy" parameter to "mode" through out the
descriptions in an attempt to promote the concept that the memory
policy is a tuple consisting of a mode and optional set of nodes.
Also matches internal name and <numaif.h> prototypes for mbind()
and set_mempolicy().

I think I've covered all of the existing errno returns, but may
have missed a few.  

These pages definitely need proofing by other sets of eyes...

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

 man2/get_mempolicy.2 |  222 +++++++++++++++++++++------------
 man2/mbind.2         |  335 +++++++++++++++++++++++++++++++++++++--------------
 man2/set_mempolicy.2 |  173 +++++++++++++++++++++-----
 3 files changed, 526 insertions(+), 204 deletions(-)

Index: Linux/man2/mbind.2
===================================================================
--- Linux.orig/man2/mbind.2	2007-05-11 19:07:02.000000000 -0400
+++ Linux/man2/mbind.2	2007-06-01 12:28:06.000000000 -0400
@@ -18,15 +18,16 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\"
 .\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
 .\"
-.TH MBIND 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
+.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
 .SH NAME
 mbind \- Set memory policy for a memory range
 .SH SYNOPSIS
 .nf
 .B "#include <numaif.h>"
 .sp
-.BI "int mbind(void *" start ", unsigned long " len  ", int " policy ,
+.BI "int mbind(void *" start ", unsigned long " len  ", int " mode ,
 .BI "          unsigned long *" nodemask  ", unsigned long " maxnode ,
 .BI "          unsigned " flags );
 .sp
@@ -34,76 +35,179 @@ mbind \- Set memory policy for a memory 
 .fi
 .SH DESCRIPTION
 .BR mbind ()
-sets the NUMA memory
-.I policy
+sets the NUMA memory policy,
+which consists of a policy mode and zero or more nodes,
 for the memory range starting with
 .I start
 and continuing for
 .IR len
 bytes.
 The memory of a NUMA machine is divided into multiple nodes.
-The memory policy defines in which node memory is allocated.
+The memory policy defines from which node memory is allocated.
+
+If the memory range specified by the
+.IR start " and " len
+arguments includes an "anonymous" region of memory\(emthat is
+a region of memory created using the
+.BR mmap (2)
+system call with the
+.BR MAP_ANONYMOUS \(emor
+a memory mapped file, mapped using the
+.BR mmap (2)
+system call with the
+.B MAP_PRIVATE
+flag, pages will only be allocated according to the specified
+policy when the application writes [stores] to the page.
+For anonymous regions, an initial read access will use a shared
+page in the kernel containing all zeros.
+For a file mapped with
+.BR MAP_PRIVATE ,
+an initial read access will allocate pages according to the
+process policy of the process that causes the page to be allocated.
+This may not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a memory mapped file,
+mapped using the
+.BR mmap (2)
+system call with the
+.B MAP_SHARED
+flag, the specified policy will be ignored for all page allocations
+in this range.
+Rather the pages will be allocated according to the process policy
+of the process that caused the page to be allocated.
+Again, this may not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a shared memory region
+created using the
+.BR shmget (2)
+system call and attached using the
+.BR shmat (2)
+system call,
+pages allocated for the anonymous or shared memory region will
+be allocated according to the policy specified, regardless which
+process attached to the shared memory segment causes the allocation.
+If, however, the shared memory region was created with the
+.B SHM_HUGETLB
+flag,
+the huge pages will be allocated according to the policy specified
+only if the page allocation is caused by the task that calls
+.BR mbind ()
+for that region.
+
+By default,
 .BR mbind ()
 only has an effect for new allocations; if the pages inside
 the range have been already touched before setting the policy,
 then the policy has no effect.
+This default behavior may be overridden by the
+.BR MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+flags described below.
 
-Available policies are
+The
+.I mode
+argument must specify one of
 .BR MPOL_DEFAULT ,
 .BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
-and
+.B MPOL_INTERLEAVE
+or
 .BR MPOL_PREFERRED .
-All policies except
+All policy modes except
 .B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
 .I nodemask
-parameter.
+parameter,
+the node or nodes to which the mode applies.
+
 .I nodemask
-is a bitmask of nodes containing up to
+points to a bitmask of nodes containing up to
 .I maxnode
 bits.
-The actual number of bytes transferred via this argument
-is rounded up to the next multiple of
+The bit mask size is rounded to the next multiple of
 .IR "sizeof(unsigned long)" ,
 but the kernel will only use bits up to
 .IR maxnode .
-A NULL argument means an empty set of nodes.
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
 
 The
 .B MPOL_DEFAULT
-policy is the default and means to use the underlying process policy
-(which can be modified with
-.BR set_mempolicy (2)).
-Unless the process policy has been changed this means to allocate
-memory on the node of the CPU that triggered the allocation.
+mode specifies the default policy.
+When applied to a range of memory via
+.IR mbind (),
+this means to use the process policy,
+ which may have been set with
+.BR set_mempolicy (2).
+If the mode of the process policy is also
+.B MPOL_DEFAULT
+pages will be allocated on the node of the CPU that triggers the allocation.
+For
+.BR MPOL_DEFAULT ,
+the
 .I nodemask
-should be specified as NULL.
+and
+.I maxnode
+arguments must be specify the empty set of nodes.
 
 The
 .B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
-nodes specified in
+mode specifies a strict policy that restricts memory allocation to
+the nodes specified in
 .IR nodemask .
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
 There won't be allocations on other nodes.
 
+The
 .B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
+mode specifies that page allocations be interleaved across the
+set of nodes specified in
 .IR nodemask .
-This optimizes for bandwidth instead of latency.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
 To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+at least 1MB or bigger with a fairly uniform access pattern.
+Accesses to a single page of the area will still be limited to
+the memory bandwidth of a single node.
 
 .B MPOL_PREFERRED
 sets the preferred node for allocation.
-The kernel will try to allocate in this
+The kernel will try to allocate pages from this
 node first and fall back to other nodes if the
 preferred nodes is low on free memory.
-Only the first node in the
+If
 .I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation).
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
+.I nodemask
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation.
+This is the only way to specify "local allocation" for a
+range of memory via
+.IR mbind (2).
 
 If
 .B MPOL_MF_STRICT
@@ -115,17 +219,18 @@ is not
 .BR MPOL_DEFAULT ,
 then the call will fail with the error
 .B EIO
-if the existing pages in the mapping don't follow the policy.
-In 2.6.16 or later the kernel will also try to move pages
-to the requested node with this flag.
+if the existing pages in the memory range don't follow the policy.
+.\" According to the kernel code, the following is not true --lts
+.\" In 2.6.16 or later the kernel will also try to move pages
+.\" to the requested node with this flag.
 
 If
 .B MPOL_MF_MOVE
-is passed in
+is specified in
 .IR flags ,
-then an attempt will be made  to
-move all the pages in the mapping so that they follow the policy.
-Pages that are shared with other processes are not moved.
+then the kernel will attempt to move all the existing pages
+in the memory range so that they follow the policy.
+Pages that are shared with other processes will not be moved.
 If
 .B MPOL_MF_STRICT
 is also specified, then the call will fail with the error
@@ -136,8 +241,8 @@ If
 .B MPOL_MF_MOVE_ALL
 is passed in
 .IR flags ,
-then all pages in the mapping will be moved regardless of whether
-other processes use the pages.
+then the kernel will attempt to move all existing pages in the memory range
+regardless of whether other processes use the pages.
 The calling process must be privileged
 .RB ( CAP_SYS_NICE )
 to use this flag.
@@ -146,6 +251,7 @@ If
 is also specified, then the call will fail with the error
 .B EIO
 if some pages could not be moved.
+.\" ---------------------------------------------------------------
 .SH RETURN VALUE
 On success,
 .BR mbind ()
@@ -153,11 +259,9 @@ returns 0;
 on error, \-1 is returned and
 .I errno
 is set to indicate the error.
+.\" ---------------------------------------------------------------
 .SH ERRORS
-.TP
-.B EFAULT
-There was a unmapped hole in the specified memory range
-or a passed pointer was not valid.
+.\"  I think I got all of the error returns.  --lts
 .TP
 .B EINVAL
 An invalid value was specified for
@@ -169,53 +273,102 @@ or
 was less than
 .IR start ;
 or
-.I policy
-was
-.B MPOL_DEFAULT
+.I start
+is not a multiple of the system page size.
+Or,
+.I mode
+is
+.I MPOL_DEFAULT
 and
 .I nodemask
-pointed to a non-empty set;
+specified a non-empty set;
 or
-.I policy
-was
-.B MPOL_BIND
+.I mode
+is
+.I MPOL_BIND
 or
-.B MPOL_INTERLEAVE
+.I MPOL_INTERLEAVE
 and
 .I nodemask
-pointed to an empty set,
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets.  No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part of all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+Or, there was a unmapped hole in the specified memory range.
 .TP
 .B ENOMEM
-System out of memory.
+Insufficient kernel memory was available.
 .TP
 .B EIO
 .B MPOL_MF_STRICT
 was specified and an existing page was already on a node
-that does not follow the policy.
+that does not follow the policy;
+or
+.B MPOL_MF_MOVE
+or
+.B MPOL_MF_MOVE_ALL
+was specified and the kernel was unable to move all existing
+pages in the range.
+.TP
+.B EPERM
+The
+.I flags
+argument included the
+.B MPOL_MF_MOVE_ALL
+flag and the caller does not have the
+.B CAP_SYS_NICE
+privilege.
+.\" ---------------------------------------------------------------
 .SH NOTES
-NUMA policy is not supported on file mappings.
+NUMA policy is not supported on a memory mapped file range
+that was mapped with the
+.I MAP_SHARED
+flag.
 
 .B MPOL_MF_STRICT
-is  ignored  on  huge page mappings right now.
+is ignored on huge page mappings.
 
-It is unfortunate that the same flag,
+The
 .BR MPOL_DEFAULT ,
-has different effects for
+mode has different effects for
 .BR mbind (2)
 and
 .BR set_mempolicy (2).
-To select "allocation on the node of the CPU that
-triggered the allocation" (like
-.BR set_mempolicy (2)
-.BR MPOL_DEFAULT )
-when calling
+When
+.B MPOL_DEFAULT
+is specified for a range of memory using
 .BR mbind (),
+any pages subsequently allocated for that range will use
+the process' policy, as set by
+.BR set_mempolicy (2).
+This effectively removes the explicit policy from the
+specified range.
+To select "local allocation" for a memory range,
 specify a
-.I policy
+.I mode
 of
 .B MPOL_PREFERRED
-with an empty
-.IR nodemask .
+with an empty set of nodes.
+This method will work for
+.BR set_mempolicy (2),
+as well.
+.\" ---------------------------------------------------------------
 .SH "VERSIONS AND LIBRARY SUPPORT"
 The
 .BR mbind (),
@@ -226,16 +379,18 @@ system calls were added to the Linux ker
 They are only available on kernels compiled with
 .BR CONFIG_NUMA .
 
-Support for huge page policy was added with 2.6.16.
-For interleave policy to be effective on huge page mappings the
-policied memory needs to be tens of megabytes or larger.
-
-.B MPOL_MF_MOVE
-and
-.B MPOL_MF_MOVE_ALL
-are only available on Linux 2.6.16 and later.
+You can link with
+.I -lnuma
+to get system call definitions.
+.I libnuma
+and the required
+.I numaif.h
+header.
+are available in the
+.I numactl
+package.
 
-These system calls should not be used directly.
+However, applications should not use these system calls directly.
 Instead, the higher level interface provided by the
 .BR numa (3)
 functions in the
@@ -245,17 +400,21 @@ The
 .I numactl
 package is available at
 .IR ftp://ftp.suse.com/pub/people/ak/numa/ .
-
-You can link with
-.I -lnuma
-to get system call definitions.
-.I libnuma
-is available in the
-.I numactl
+The package is also included in some Linux distributions.
+Some distributions include the development library and header
+in the separate
+.I numactl-devel
 package.
-This package also has the
-.I numaif.h
-header.
+
+Support for huge page policy was added with 2.6.16.
+For interleave policy to be effective on huge page mappings the
+policied memory needs to be tens of megabytes or larger.
+
+.B MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+are only available on Linux 2.6.16 and later.
+
 .SH CONFORMING TO
 This system call is Linux specific.
 .SH SEE ALSO
@@ -263,4 +422,6 @@ This system call is Linux specific.
 .BR numactl (8),
 .BR set_mempolicy (2),
 .BR get_mempolicy (2),
-.BR mmap (2)
+.BR mmap (2),
+.BR shmget (2),
+.BR shmat (2).
Index: Linux/man2/get_mempolicy.2
===================================================================
--- Linux.orig/man2/get_mempolicy.2	2007-04-12 18:42:49.000000000 -0400
+++ Linux/man2/get_mempolicy.2	2007-06-01 12:29:00.000000000 -0400
@@ -18,6 +18,7 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\"
 .\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
 .\"
 .TH GET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
 .SH SYNOPSIS
@@ -26,9 +27,11 @@ get_mempolicy \- Retrieve NUMA memory po
 .B "#include <numaif.h>"
 .nf
 .sp
-.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
+.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
 .BI "                  unsigned long " maxnode ", unsigned long " addr ,
 .BI "                  unsigned long " flags );
+.sp
+.BI "cc ... \-lnuma"
 .fi
 .\" TBD rewrite this. it is confusing.
 .SH DESCRIPTION
@@ -39,7 +42,7 @@ depending on the setting of
 
 A NUMA machine has different
 memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
 the process.
 
 If
@@ -58,58 +61,75 @@ then information is returned about the p
 address given in
 .IR addr .
 This policy may be different from the process's default policy if
-.BR set_mempolicy (2)
-has been used to establish a policy for the page containing
+.BR mbind (2)
+or one of the helper functions described in
+.BR numa(3)
+has been used to establish a policy for the memory range containing
 .IR addr .
 
-If
-.I policy
-is not NULL, then it is used to return the policy.
+If the
+.I mode
+argument is not NULL, then
+.IR get_mempolicy ()
+will store the policy mode of the requested NUMA policy in the location
+pointed to by this argument.
 If
 .IR nodemask
-is not NULL, then it is used to return the nodemask associated
-with the policy.
+is not NULL, then the nodemask associated with the policy will be stored
+in the location pointed to by this argument.
 .I maxnode
-is the maximum bit number plus one that can be stored into
-.IR nodemask .
-The bit number is always rounded to a multiple of
-.IR "unsigned long" .
-.\"
-.\" If
-.\" .I flags
-.\" specifies both
-.\" .B MPOL_F_NODE
-.\" and
-.\" .BR MPOL_F_ADDR ,
-.\" then
-.\" .I policy
-.\" instead returns the number of the node on which the address
-.\" .I addr
-.\" is allocated.
-.\"
-.\" If
-.\" .I flags
-.\" specifies
-.\" .B MPOL_F_NODE
-.\" but not
-.\" .BR MPOL_F_ADDR ,
-.\" and the process's current policy is
-.\" .BR MPOL_INTERLEAVE ,
-.\" then
-.\" checkme: Andi's text below says that the info is returned in
-.\" 'nodemask', not 'policy':
-.\" .I policy
-.\" instead returns the number of the next node that will be used for
-.\" interleaving allocation.
-.\" FIXME .
-.\" The other valid flag is
-.\" .I MPOL_F_NODE.
-.\" It is only valid when the policy is
-.\" .I MPOL_INTERLEAVE.
-.\" In this case not the interleave mask, but an unsigned long with the next
-.\" node that would be used for interleaving is returned in
-.\" .I nodemask.
-.\" Other flag values are reserved.
+specifies the number of node ids
+that can be stored into
+.IR nodemask \(emthat
+is, the maximum node id plus one.
+The value specified by
+.I maxnode
+is always rounded to a multiple of
+.IR "sizeof(unsigned long)" .
+
+If
+.I flags
+specifies both
+.B MPOL_F_NODE
+and
+.BR MPOL_F_ADDR ,
+.IR get_mempolicy ()
+will return the node id of the node on which the address
+.I addr
+is allocated into the location pointed to by
+.IR mode .
+If no page has yet been allocated for the specified address,
+.IR get_mempolicy ()
+will allocate a page as if the process had performed a read
+[load] access to that address, and return the id of the node
+where that page was allocated.
+
+If
+.I flags
+specifies
+.BR MPOL_F_NODE ,
+but not
+.BR MPOL_F_ADDR ,
+and the process's current policy is
+.BR MPOL_INTERLEAVE ,
+then
+.IR get_mempolicy ()
+will return in the location pointed to by a non-NULL
+.I mode
+argument,
+the node id of the next node that will be used for
+interleaving of internal kernel pages allocated on behalf of the process.
+.\" Note:  code returns next interleave node via 'mode' argument -lts
+These allocations include pages for memory mapped files in
+process memory ranges mapped using the
+.IR mmap (2)
+call with the
+.I MAP_PRIVATE
+flag for read accesses, and in memory ranges mapped with the
+.I MAP_SHARED
+flag for all accesses.
+
+Other flag values are reserved.
 
 For an overview of the possible policies see
 .BR set_mempolicy (2).
@@ -120,40 +140,77 @@ returns 0;
 on error, \-1 is returned and
 .I errno
 is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME writeme -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I nodemask
-.\" is non-NULL, and
-.\" .I maxnode
-.\" is too small;
-.\" or
-.\" .I flags
-.\" specified values other than
-.\" .B MPOL_F_NODE
-.\" or
-.\" .BR MPOL_F_ADDR ;
-.\" or
-.\" .I flags
-.\" specified
-.\" .B MPOL_F_ADDR
-.\" and
-.\" .I addr
-.\" is NULL.
-.\" (And there are other EINVAL cases.)
+.SH ERRORS
+.TP
+.B EINVAL
+The value specified by
+.I maxnode
+is less than the number of node ids supported by the system.
+Or
+.I flags
+specified values other than
+.B MPOL_F_NODE
+or
+.BR MPOL_F_ADDR ;
+or
+.I flags
+specified
+.B MPOL_F_ADDR
+and
+.I addr
+is NULL,
+or
+.I flags
+did not specify
+.B MPOL_F_ADDR
+and
+.I addr
+is not NULL.
+Or,
+.I flags
+specified
+.B MPOL_F_NODE
+but not
+.B MPOL_F_ADDR
+and the current process policy is not
+.BR MPOL_INTERLEAVE .
+(And there are other EINVAL cases.)
+.TP
+.B EFAULT
+Part of all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
 .SH NOTES
-This manual page is incomplete:
-it does not document the details the
-.BR MPOL_F_NODE
-flag,
-which modifies the operation of
-.BR get_mempolicy ().
-This is deliberate: this flag is not intended for application use,
-and its operation may change or it may be removed altogether in
-future kernel versions.
-.B Do not use it.
+If the mode of the process policy or the policy governing allocations at the
+specified address is
+.I MPOL_PREFERRED
+and this policy was installed with an empty
+.IR nodemask \(emspecifying
+local allocation,
+.IR get_mempolicy ()
+will return the mask of on-line node ids in the location pointed to by
+a non-NULL
+.I nodemask
+argument.
+This mask does not take into consideration any adminstratively imposed
+restrictions on the process' context.
+.\" "context" above refers to cpusets.  No man page to reference. --lts
+
+.\"  Christoph says the following is untrue.  These are "fully supported."
+.\"  Andi concedes that he has lost this battle and approves [?]
+.\"  updating the man pages to document the behavior.  --lts
+.\" This manual page is incomplete:
+.\" it does not document the details the
+.\" .BR MPOL_F_NODE
+.\" flag,
+.\" which modifies the operation of
+.\" .BR get_mempolicy ().
+.\" This is deliberate: this flag is not intended for application use,
+.\" and its operation may change or it may be removed altogether in
+.\" future kernel versions.
+.\" .B Do not use it.
 .SH "VERSIONS AND LIBRARY SUPPORT"
 See
 .BR mbind (2).
@@ -161,6 +218,7 @@ See
 This system call is Linux specific.
 .SH SEE ALSO
 .BR mbind (2),
+.BR mmap (2),
 .BR set_mempolicy (2),
 .BR numactl (8),
 .BR numa (3)
Index: Linux/man2/set_mempolicy.2
===================================================================
--- Linux.orig/man2/set_mempolicy.2	2007-04-12 18:42:49.000000000 -0400
+++ Linux/man2/set_mempolicy.2	2007-06-01 12:28:49.000000000 -0400
@@ -18,6 +18,7 @@
 .\" the source, must acknowledge the copyright and authors of this work.
 .\"
 .\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
 .\"
 .TH SET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
 .SH NAME
@@ -26,80 +27,141 @@ set_mempolicy \- set default NUMA memory
 .nf
 .B "#include <numaif.h>"
 .sp
-.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
+.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
 .BI "                  unsigned long " maxnode );
+.sp
+.BI "cc ... \-lnuma"
 .fi
 .SH DESCRIPTION
 .BR set_mempolicy ()
-sets the NUMA memory policy of the calling process to
-.IR policy .
+sets the NUMA memory policy of the calling process,
+which consists of a policy mode and zero or more nodes,
+to the values specified by the
+.IR mode ,
+.I nodemask
+and
+.IR maxnode
+arguments.
 
 A NUMA machine has different
 memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
 the process.
 
-This system call defines the default policy for the process;
-in addition a policy can be set for specific memory ranges using
+This system call defines the default policy for the process.
+The process policy governs allocation of pages in the process'
+address space outside of memory ranges
+controlled by a more specific policy set by
 .BR mbind (2).
+The process default policy also controls allocation of any pages for
+memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag and that are only read [loaded] from by the task
+and of memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_SHARED
+flag, regardless of the access type.
 The policy is only applied when a new page is allocated
 for the process.
 For anonymous memory this is when the page is first
 touched by the application.
 
-Available policies are
+The
+.I mode
+argument must specify one of
 .BR MPOL_DEFAULT ,
 .BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
+.B MPOL_INTERLEAVE
+or
 .BR MPOL_PREFERRED .
-All policies except
+All modes except
 .B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
 .I nodemask
-parameter.
+parameter
+one or more nodes.
+
 .I nodemask
-is pointer to a bit field of nodes that contains up to
+points to a bit mask of node ids that contains up to
 .I maxnode
 bits.
-The bit field size is rounded to the next multiple of
+The bit mask size is rounded to the next multiple of
 .IR "sizeof(unsigned long)" ,
 but the kernel will only use bits up to
 .IR maxnode .
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
 
 The
 .B MPOL_DEFAULT
-policy is the default and means to allocate memory locally,
+mode is the default and means to allocate memory locally,
 i.e., on the node of the CPU that triggered the allocation.
 .I nodemask
-should be specified as NULL.
+must be specified as NULL.
+If the "local node" contains no free memory, the system will
+attempt to allocate memory from a "near by" node.
 
 The
 .B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
+mode defines a strict policy that restricts memory allocation to the
 nodes specified in
 .IR nodemask .
-There won't be allocations on other nodes.
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
+Pages will not be allocated from any node not specified in the
+.IR nodemask .
 
 .B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
-.IR nodemask .
-This optimizes for bandwidth instead of latency.
-To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+interleaves page allocations across the nodes specified in
+.I nodemask
+in numeric node id order.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
+However, accesses to a single page will still be limited to
+the memory bandwidth of a single node.
+.\" NOTE:  the following sentence doesn't make sense in the context
+.\" of set_mempolicy() -- no memory area specified.
+.\" To be effective the memory area should be fairly large,
+.\" at least 1MB or bigger.
 
 .B MPOL_PREFERRED
 sets the preferred node for allocation.
-The kernel will try to allocate in this
-node first and fall back to other nodes if the preferred node is low on free
+The kernel will try to allocate pages from this node first
+and fall back to "near by" nodes if the preferred node is low on free
 memory.
-Only the first node in the
+If
+.I nodemask
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
 .I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation (like
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation (like
 .BR MPOL_DEFAULT ).
 
-The memory policy is preserved across an
+The process memory policy is preserved across an
 .BR execve (2),
 and is inherited by child processes created using
 .BR fork (2)
@@ -107,6 +169,9 @@ or
 .BR clone (2).
 .SH NOTES
 Process policy is not remembered if the page is swapped out.
+When such a page is paged back in, it will use the policy of
+the process or memory range that is in effect at the time the
+page is allocated.
 .SH RETURN VALUE
 On success,
 .BR set_mempolicy ()
@@ -114,12 +179,49 @@ returns 0;
 on error, \-1 is returned and
 .I errno
 is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME writeme -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I mode is invalid.
+.SH ERRORS
+.TP
+.B EINVAL
+.I mode is invalid.
+Or,
+.I mode
+is
+.I MPOL_DEFAULT
+and
+.I nodemask
+is non-empty,
+or
+.I mode
+is
+.I MPOL_BIND
+or
+.I MPOL_INTERLEAVE
+and
+.I nodemask
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets.  No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part of all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+
 .SH "VERSIONS AND LIBRARY SUPPORT"
 See
 .BR mbind (2).
@@ -127,6 +229,7 @@ See
 This system call is Linux specific.
 .SH SEE ALSO
 .BR mbind (2),
+.BR mmap (2),
 .BR get_mempolicy (2),
 .BR numactl (8),
 .BR numa (3)


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2007-06-01 21:15 UTC|newest]

Thread overview: 83+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-29 19:33 [PATCH] Document Linux Memory Policy Lee Schermerhorn
2007-05-29 20:04 ` Christoph Lameter
2007-05-29 20:16   ` Andi Kleen
2007-05-30 16:17     ` Lee Schermerhorn
2007-05-30 17:41       ` Christoph Lameter
2007-05-31  8:20       ` Michael Kerrisk
2007-05-31 14:49         ` Lee Schermerhorn
2007-05-31 15:56           ` Michael Kerrisk
2007-06-01 21:15         ` Lee Schermerhorn [this message]
2007-07-23  6:11           ` [PATCH] enhance memory policy sys call man pages v1 Michael Kerrisk
2007-07-23  6:32           ` mbind.2 man page patch Michael Kerrisk
2007-07-23 14:26             ` Lee Schermerhorn
2007-07-26 17:19               ` Michael Kerrisk
2007-07-26 18:06                 ` Lee Schermerhorn
2007-07-26 18:18                   ` Michael Kerrisk
2007-07-23  6:32           ` get_mempolicy.2 " Michael Kerrisk
2007-07-28  9:31             ` Michael Kerrisk
2007-08-09 18:43               ` Lee Schermerhorn
2007-08-09 20:57                 ` Michael Kerrisk
2007-08-16 20:05               ` Andi Kleen
2007-08-18  5:50                 ` Michael Kerrisk
2007-08-21 15:45                   ` Lee Schermerhorn
2007-08-22  4:10                     ` Michael Kerrisk
2007-08-22 16:08                       ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
2007-08-27 11:29                         ` Michael Kerrisk
2007-08-22 16:10                       ` [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2 Lee Schermerhorn
2007-08-27 11:30                         ` Michael Kerrisk
2007-08-22 16:12                       ` [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2 Lee Schermerhorn
2007-08-27 11:30                         ` Michael Kerrisk
2007-08-27 10:46                 ` get_mempolicy.2 man page patch Michael Kerrisk
2007-07-23  6:33           ` set_mempolicy.2 " Michael Kerrisk
2007-05-30 16:55   ` [PATCH] Document Linux Memory Policy Lee Schermerhorn
2007-05-30 17:56     ` Christoph Lameter
2007-05-31  6:18       ` Gleb Natapov
2007-05-31  6:41         ` Christoph Lameter
2007-05-31  6:47           ` Gleb Natapov
2007-05-31  6:56             ` Christoph Lameter
2007-05-31  7:11               ` Gleb Natapov
2007-05-31  7:24                 ` Christoph Lameter
2007-05-31  7:39                   ` Gleb Natapov
2007-05-31 17:43                     ` Christoph Lameter
2007-05-31 17:07                   ` Lee Schermerhorn
2007-05-31 10:43             ` Andi Kleen
2007-05-31 11:04               ` Gleb Natapov
2007-05-31 11:30                 ` Gleb Natapov
2007-05-31 15:26                   ` Lee Schermerhorn
2007-05-31 17:41                     ` Gleb Natapov
2007-05-31 18:56                       ` Lee Schermerhorn
2007-05-31 20:06                         ` Gleb Natapov
2007-05-31 20:43                           ` Andi Kleen
2007-06-01  9:38                             ` Gleb Natapov
2007-06-01 10:21                               ` Andi Kleen
2007-06-01 12:25                                 ` Gleb Natapov
2007-06-01 13:09                                   ` Andi Kleen
2007-06-01 17:15                                 ` Lee Schermerhorn
2007-06-01 18:43                                   ` Christoph Lameter
2007-06-01 19:38                                     ` Lee Schermerhorn
2007-06-01 19:48                                       ` Christoph Lameter
2007-06-01 21:05                                         ` Lee Schermerhorn
2007-06-01 21:56                                           ` Christoph Lameter
2007-06-04 13:46                                             ` Lee Schermerhorn
2007-06-04 16:34                                               ` Christoph Lameter
2007-06-04 17:02                                                 ` Lee Schermerhorn
2007-06-04 17:11                                                   ` Christoph Lameter
2007-06-04 20:23                                                     ` Andi Kleen
2007-06-04 21:51                                                       ` Christoph Lameter
2007-06-05 14:30                                                         ` Lee Schermerhorn
2007-06-01 20:28                                     ` Gleb Natapov
2007-06-01 20:45                                       ` Christoph Lameter
2007-06-01 21:10                                         ` Lee Schermerhorn
2007-06-01 21:58                                           ` Christoph Lameter
2007-06-02  7:23                                         ` Gleb Natapov
2007-05-31 11:47                 ` Andi Kleen
2007-05-31 11:59                   ` Gleb Natapov
2007-05-31 12:15                     ` Andi Kleen
2007-05-31 12:18                       ` Gleb Natapov
2007-05-31 18:28       ` Lee Schermerhorn
2007-05-31 18:35         ` Christoph Lameter
2007-05-31 19:29           ` Lee Schermerhorn
2007-05-31 19:25       ` Paul Jackson
2007-05-31 20:22         ` Lee Schermerhorn
2007-05-29 20:07 ` Andi Kleen
2007-05-30 16:04   ` Lee Schermerhorn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1180732544.5278.158.camel@localhost \
    --to=lee.schermerhorn@hp.com \
    --cc=ak@suse.de \
    --cc=akpm@linux-foundation.org \
    --cc=clameter@sgi.com \
    --cc=linux-mm@kvack.org \
    --cc=mtk-manpages@gmx.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.