* [PATCH] Document Linux Memory Policy
From: Lee Schermerhorn @ 2007-05-29 19:33 UTC (permalink / raw)
To: linux-mm; +Cc: Andrew Morton, Andi Kleen, Christoph Lameter
[PATCH] Document Linux Memory Policy
I couldn't find any memory policy documentation in the Documentation
directory, so here is my attempt to document it. My objectives are
two fold:
1) to provide missing documentation for anyone interested in this topic,
2) to explain my current understanding, on which I base proposed patches
to address what I see as missing or broken behavior.
There's lots more that could be written about the internal design--including
data structures, functions, etc. And one could address the interaction of
memory policy with cpusets. I haven't tackled that yet. However, if you
agree that this is better than the nothing that exists now, perhaps it could
be added to -mm.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation/vm/memory_policy.txt | 339 +++++++++++++++++++++++++++++++++++++
1 files changed, 339 insertions(+)
Index: Linux/Documentation/vm/memory_policy.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ Linux/Documentation/vm/memory_policy.txt 2007-05-29 15:08:01.000000000 -0400
@@ -0,0 +1,339 @@
+
+What is Linux Memory Policy?
+
+In the Linux kernel, "memory policy" determines from which node the kernel will
+allocate memory in a NUMA system or in an emulated NUMA system. Linux has
+supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
+The current memory policy support was added to Linux 2.6 around May 2004. This
+document attempts to describe the concepts and APIs of the 2.6 memory policy
+support.
+
+ TODO: try to describe internal design?
+
+MEMORY POLICY CONCEPTS
+
+Scope of Memory Policies
+
+The Linux kernel supports four more or less distinct scopes of memory policy:
+
+ System Default Policy: this policy is "hard coded" into the kernel. It
+ is the policy that governs all page allocations that aren't controlled
+ by one of the more specific policy scopes discussed below.
+
+ Task/Process Policy: this is an optional, per-task policy. When defined
+ for a specific task, this policy controls all page allocations made by or
+ on behalf of the task that aren't controlled by a more specific scope.
+ If a task does not define a task policy, then all page allocations that
+ would have been controlled by the task policy "fall back" to the System
+ Default Policy.
+
+ Because task policy applies to the entire address space of a task,
+ it is inheritable across both fork() [clone() w/o the CLONE_VM flag]
+ and exec*(). Thus, a parent task may establish the task policy for
+ a child task exec()'d from an executable image that has no awareness
+ of memory policy.
+
+ In a multi-threaded task, task policies apply only to the thread
+ [Linux kernel task] that installs the policy and any threads
+ subsequently created by that thread. Any sibling threads existing
+ at the time a new task policy is installed retain their current
+ policy.
+
+ A task policy applies only to pages allocated after the policy is
+ installed. Any pages already faulted in by the task remain where
+ they were allocated based on the policy at the time they were
+ allocated.
+
+ VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
+ virtual address space. A task may define a specific policy for a range
+ of its virtual address space. This VMA policy will govern the allocation
+ of pages that back this region of the address space. Any regions of the
+ task's address space that don't have an explicit VMA policy will fall back
+ to the task policy, which may itself fall back to the system default policy.
+
+ VMA policy applies ONLY to anonymous pages. These include pages
+ allocated for anonymous segments, such as the task stack and heap, and
+ any regions of the address space mmap()ed with the MAP_ANONYMOUS flag.
+ Anonymous pages copied from private file mappings [files mmap()ed with
+ the MAP_PRIVATE flag] also obey VMA policy, if defined.
+
+ VMA policies are shared between all tasks that share a virtual address
+ space--a.k.a. threads--independent of when the policy is installed; and
+ they are inherited across fork(). However, because VMA policies refer
+ to a specific region of a task's address space, and because the address
+ space is discarded and recreated on exec*(), VMA policies are NOT
+ inheritable across exec(). Thus, only NUMA-aware applications may
+ use VMA policies.
+
+ A task may install a new VMA policy on a sub-range of a previously
+ mmap()ed region. When this happens, Linux splits the existing virtual
+ memory area into 2 or 3 VMAs, each with its own policy.
+
+ By default, VMA policy applies only to pages allocated after the policy
+ is installed. Any pages already faulted into the VMA range remain where
+ they were allocated based on the policy at the time they were
+ allocated. However, since 2.6.16, Linux supports page migration so
+ that page contents can be moved to match a newly installed policy.
+
+ Shared Policy: This policy applies to "memory objects" mapped shared into
+ one or more tasks' distinct address spaces. Shared policies are applied
+ directly to the shared object. Thus, all tasks that attach to the object
+ share the policy, and all pages allocated for the shared object, by any
+ task, will obey the shared policy.
+
+ Currently [2.6.22], only shared memory segments, created by shmget(),
+ support shared policy. When shared policy support was added to Linux,
+ the associated data structures were added to shared hugetlbfs segments.
+ However, at the time, hugetlbfs did not support allocation at fault
+ time--a.k.a. lazy allocation--so hugetlbfs segments were never "hooked
+ up" to the shared policy support. Although hugetlbfs segments now
+ support lazy allocation, their support for shared policy has not been
+ completed.
+
+ Although, internal to the kernel, shared memory segments are really
+ files backed by swap space that have been mmap()ed shared into tasks'
+ address spaces, regular files mmap()ed shared do NOT support shared
+ policy. Rather, shared page cache pages, including pages backing
+ private mappings that have not yet been written by the task, follow
+ task policy, if any, else system default policy.
+
+ The shared policy infrastructure supports different policies on subset
+ ranges of the shared object. However, Linux still splits the VMA of
+ the task that installs the policy for each range of distinct policy.
+ Thus, different tasks that attach to a shared memory segment can have
+ different VMA configurations mapping that one shared object.
+
+Components of Memory Policies
+
+ A Linux memory policy is a tuple consisting of a "mode" and an optional set
+ of nodes. The mode determines the behavior of the policy, while the optional
+ set of nodes can be viewed as the arguments to the behavior.
+
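+ Conceptually, the tuple can be pictured as follows [a simplified
+ sketch for illustration only; this is not the kernel's actual
+ declaration]:
+
+	struct mempolicy {
+		int		mode;	/* MPOL_DEFAULT, MPOL_BIND, ... */
+		nodemask_t	nodes;	/* optional set of node ids */
+	};
+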
+ Note: in some functions, the mode is called "policy". However, to
+ avoid confusion with the policy tuple, this document will continue
+ to use the term "mode".
+
+ Linux memory policy supports the following 4 modes:
+
+ Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
+ context dependent.
+
+ The system default policy is hard coded to contain the Default mode.
+ In this context, it means "local" allocation--that is, attempt to
+ allocate the page from the node associated with the cpu where the
+ fault occurs. If the "local" node has no memory, or if the node's
+ memory has been exhausted [no free pages available], local allocation
+ will attempt to allocate pages from "nearby" nodes, using per-node
+ lists of nodes--called zonelists--built at boot time.
+
+ TODO: address runtime rebuild of node/zonelists when
+ supported.
+
+ When a task/process policy contains the Default mode, it means
+ "fall back to the system default mode". And, as discussed above,
+ this means use "local" allocation.
+
+ In the context of a VMA, Default mode means "fall back to task
+ policy"--which may, itself, fall back to system default policy.
+ In the context of shared policies, Default mode means fall back
+ directly to the system default policy. Note: the result of this
+ semantic is that if the task policy is something other than Default,
+ it is not possible to specify local allocation for a region of the
+ task's address space using a VMA policy.
+
+ The Default mode does not use the optional set of nodes.
+
+ MPOL_BIND: This mode specifies that memory must come from the
+ set of nodes specified by the policy. The kernel builds a custom
+ zonelist containing just the nodes specified by the Bind policy.
+ If the kernel is unable to allocate a page from the first node in the
+ custom zonelist, it moves on to the next, and so forth. If it is unable
+ to allocate a page from any of the nodes in this list, the allocation
+ will fail.
+
+ The memory policy APIs do not specify an order in which the nodes
+ will be searched. However, unlike the per node zonelists mentioned
+ above, the custom zonelist for the Bind policy does not consider the
+ distance between the nodes. Rather, the list is built in order
+ of numeric node id.
+
+
+ MPOL_PREFERRED: This mode specifies that the allocation should be
+ attempted from the single node specified in the policy. If that
+ allocation fails, the kernel will search other nodes, exactly as
+ it would for a local allocation that started at the preferred node--
+ that is, using the per-node zonelists in increasing distance from
+ the preferred node.
+
+ If the Preferred policy specifies more than one node, the node
+ with the numerically lowest node id will be selected to start
+ the allocation scan.
+
+ MPOL_INTERLEAVE: This mode specifies that page allocations be
+ interleaved, on a page granularity, across the nodes specified in
+ the policy. This mode also behaves slightly differently, based on
+ the context where it is used:
+
+ For allocation of anonymous pages and shared memory pages,
+ Interleave mode indexes the set of nodes specified by the policy
+ using the page offset of the faulting address into the segment
+ [VMA] containing the address modulo the number of nodes specified
+ by the policy. It then attempts to allocate a page, starting at
+ the selected node, as if the node had been specified by a Preferred
+ policy or had been selected by a local allocation. That is,
+ allocation will follow the per node zonelist.
+
+ For allocation of page cache pages, Interleave mode indexes the set
+ of nodes specified by the policy using a node counter maintained
+ per task. This counter wraps around to the lowest specified node
+ after it reaches the highest specified node. This will tend to
+ spread the pages out over the nodes specified by the policy based
+ on the order in which they are allocated, rather than based on any
+ page offset into an address range or file.
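+
+ To illustrate, the anonymous-page case reduces to roughly the
+ following pseudo-code [the names here are illustrative, not the
+ kernel's]:
+
+	/* page offset of the faulting address within the VMA */
+	offset = (fault_addr - vma_start) >> PAGE_SHIFT;
+	/* pick the offset'th node, modulo the policy's node count */
+	node = nth_policy_node(offset % num_policy_nodes);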
+
+MEMORY POLICY APIs
+
+Linux supports 3 system calls for controlling memory policy. These APIs
+always affect only the calling task, the calling task's address space, or
+some shared object mapped into the calling task's address space.
+
+ Note: the headers that define these APIs and the parameter data types
+ for user space applications reside in a package that is not part of
+ the Linux kernel. The kernel system call interfaces, with the 'sys_'
+ prefix, are defined in <linux/syscalls.h>; the mode and flag
+ definitions are defined in <linux/mempolicy.h>.
+
+Set [Task] Memory Policy:
+
+ long set_mempolicy(int mode, const unsigned long *nmask,
+ unsigned long maxnode);
+
+ Sets the calling task's "task/process memory policy" to the mode
+ specified by the 'mode' argument and the set of nodes defined
+ by 'nmask'. 'nmask' points to a bit mask of node ids containing
+ at least 'maxnode' ids.
+
+ If successful, the specified policy will control the allocation
+ of all pages, by and on behalf of this task and its descendants,
+ that aren't controlled by a more specific VMA or shared policy.
+ If the calling task is part of a multi-threaded application, the
+ task policies of other existing threads are unchanged.
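+
+ For example, a minimal sketch that interleaves a task's future
+ allocations across nodes 0 and 1 [this assumes the set_mempolicy()
+ wrapper and MPOL_* definitions from the numactl package's <numaif.h>;
+ error handling is abbreviated]:
+
+	unsigned long nodemask = (1UL << 0) | (1UL << 1);
+
+	if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
+			  sizeof(nodemask) * 8) != 0)
+		perror("set_mempolicy");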
+
+Get [Task] Memory Policy or Related Information
+
+ long get_mempolicy(int *mode,
+ unsigned long *nmask, unsigned long maxnode,
+ void *addr, int flags);
+
+ Queries the "task/process memory policy" of the calling task, or
+ the policy or location of a specified virtual address, depending
+ on the 'flags' argument.
+
+ If 'flags' is 0, get_mempolicy() returns the calling task's policy
+ as set by set_mempolicy() or inherited from its parent. The mode
+ is stored in the location pointed to by the 'mode' argument, if it
+ is non-NULL. The associated node mask, if any, is stored in the bit
+ mask pointed to by a non-NULL 'nmask' argument. When 'nmask' is
+ non-NULL, 'maxnode' must specify one greater than the maximum bit
+ number that can be stored in 'nmask'--i.e., the number of bits.
+
+ If 'flags' specifies MPOL_F_ADDR, get_mempolicy() returns similar
+ policy information that governs the allocation of pages at the
+ specified 'addr'. This may be different from the task policy--
+ e.g., if a VMA or shared policy applies to that address.
+
+ 'flags' may also contain 'MPOL_F_NODE'. This flag has been
+ described in some get_mempolicy() man pages as "not for application
+ use" and subject to change. Applications are cautioned against
+ using it. However, for completeness and because it is useful for
+ testing the kernel memory policy support, current behavior is
+ documented here:
+
+ If 'flags' contains MPOL_F_NODE, but not MPOL_F_ADDR, and if
+ the task policy of the calling task specifies the Interleave
+ mode [MPOL_INTERLEAVE], get_mempolicy() will return the next
+ node on which a page cache page would be allocated by the calling
+ task, in the location pointed to by a non-NULL 'mode'.
+
+ If 'flags' contains MPOL_F_NODE and MPOL_F_ADDR, and 'addr'
+ contains a valid address in the calling task's address space,
+ get_mempolicy() will return the node where the page backing that
+ address resides. If no page has currently been allocated for
+ the specified address, a page will be allocated as if the task
+ had performed a read/load from that address. The node of the
+ page allocated will be returned.
+
+ Note: if the address specifies an anonymous region of the
+ task's address space with no page currently allocated, the
+ resulting "read access fault" will likely just map the shared
+ ZEROPAGE. It will NOT, for example, allocate a local page in
+ the case of default policy [unless the task happens to be
+ running on the node containing the ZEROPAGE], nor will it obey
+ VMA policy, if any.
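+
+ For example, a sketch that asks which node backs a given address
+ [again assuming <numaif.h> from the numactl package; 'addr' is
+ presumed to be a valid, already-touched address in the task's
+ address space]:
+
+	int node;
+
+	if (get_mempolicy(&node, NULL, 0, addr,
+			  MPOL_F_NODE | MPOL_F_ADDR) == 0)
+		printf("page at %p is on node %d\n", addr, node);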
+
+
+Install VMA/Shared Policy for a Range of Task's Address Space
+
+ long mbind(void *start, unsigned long len, int mode,
+ const unsigned long *nmask, unsigned long maxnode,
+ unsigned flags);
+
+ mbind() applies the policy specified by (mode, nmask, maxnode) to
+ the range of the calling task's address space specified by the
+ 'start' and 'len' arguments. Additional actions may be requested
+ via the 'flags' argument.
+
+ If the address space range covers an anonymous region or a private
+ mapping of a regular file, a VMA policy will be installed in this
+ region. This policy will govern all subsequent allocations of pages
+ for that range for all threads in the task.
+
+ For the case of a private mapping of a regular file, the
+ specified policy will only govern the allocation of anonymous
+ pages created when the task writes/stores to an address in the
+ range. Pages allocated for read faults will use the faulting
+ task's task policy, if any, else the system default.
+
+ If the address space range maps a shared object, such as a shared
+ memory segment, a shared policy will be installed on the specified
+ range of the underlying shared object. This policy will govern all
+ subsequent allocations of pages for that range of the shared object,
+ for all tasks that map/attach the shared object.
+
+ If the address space range maps a shared hugetlbfs segment, a VMA
+ policy will be installed for that range. This policy will govern
+ subsequent huge page allocations from the calling task, but will
+ be ignored by any subsequent huge page allocations from other tasks
+ that attach to the hugetlb shared memory object.
+
+ If the address space range covers a shared mapping of a regular
+ file, a VMA policy will be installed for that range. This policy
+ will be ignored for all page allocations by the calling task or
+ by any other task. Rather, all page allocations in that range will
+ be allocated using the faulting task's task policy, if any, else
+ the system default policy.
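+
+ For example, a minimal sketch that restricts an anonymous region to
+ node 1 [assuming <numaif.h> and <sys/mman.h>; error handling is
+ abbreviated]:
+
+	size_t len = 4UL * 1024 * 1024;		/* e.g., a 4MB region */
+	unsigned long nodemask = 1UL << 1;	/* node 1 only */
+	void *start = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+	if (mbind(start, len, MPOL_BIND, &nodemask,
+		  sizeof(nodemask) * 8, 0) != 0)
+		perror("mbind");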
+
+ Before 2.6.16, Linux did not support page migration. Therefore,
+ if any pages were already allocated in the range specified by the
+ mbind() call, the application was stuck with the pages' existing locations.
+ However, mbind() did, and still does, support the MPOL_MF_STRICT flag.
+ This flag causes mbind() to check the specified range for any
+ existing pages that don't obey the specified policy. If any such
+ pages exist, the mbind() call fails with the error EIO.
+
+ Since 2.6.16, Linux supports direct [synchronous] page migration
+ via the mbind() system call. When the 'flags' argument specifies
+ MPOL_MF_MOVE, mbind() will attempt to migrate all existing pages
+ in the range to match the specified policy. However, the MPOL_MF_MOVE
+ flag will migrate only those pages referenced solely by the calling
+ task's page tables [internally: page's mapcount == 1]. The
+ MPOL_MF_STRICT flag may be specified to detect whether any pages
+ could not be migrated for this or other reasons.
+
+ A privileged task [with CAP_SYS_NICE] may specify the MPOL_MF_MOVE_ALL
+ flag. With this flag, mbind() will attempt to migrate pages in the
+ range to match the specified policy, regardless of the number of page
+ table entries referencing the page [regardless of mapcount]. Again,
+ some conditions may still prevent pages from being migrated, and the
+ MPOL_MF_STRICT flag may be specified to detect this condition.
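+
+ For example, a sketch that moves a previously mapped range [start,
+ start+len) to node 0 and detects any pages that could not be moved
+ [same assumptions as the sketch above]:
+
+	unsigned long nodemask = 1UL << 0;	/* node 0 */
+
+	if (mbind(start, len, MPOL_BIND, &nodemask,
+		  sizeof(nodemask) * 8,
+		  MPOL_MF_MOVE | MPOL_MF_STRICT) != 0)
+		perror("mbind");	/* EIO: some pages not migrated */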
+
* Re: [PATCH] Document Linux Memory Policy
From: Christoph Lameter @ 2007-05-29 20:04 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, Andrew Morton, Andi Kleen
On Tue, 29 May 2007, Lee Schermerhorn wrote:
> + A task policy applies only to pages allocated after the policy is
> + installed. Any pages already faulted in by the task remain where
> + they were allocated based on the policy at the time they were
> + allocated.
You can use cpusets to automatically migrate pages and sys_migrate_pages
to manually migrate pages of a process though.
> + VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
> + virtual address space. A task may define a specific policy for a range
> + of its virtual address space. This VMA policy will govern the allocation
> + of pages that back this region of the address space. Any regions of the
> + task's address space that don't have an explicit VMA policy will fall back
> + to the task policy, which may itself fall back to the system default policy.
The system default policy is always the same when the system is running.
There is no way to configure it. So it would be easier to avoid this layer
and say they fall back to node-local allocation.
> + VMA policies are shared between all tasks that share a virtual address
> + space--a.k.a. threads--independent of when the policy is installed; and
> + they are inherited across fork(). However, because VMA policies refer
> + to a specific region of a task's address space, and because the address
> + space is discarded and recreated on exec*(), VMA policies are NOT
> + inheritable across exec(). Thus, only NUMA-aware applications may
> + use VMA policies.
Memory policies require NUMA. Drop the last sentence? You can set the task
policy via numactl though.
> + Shared Policy: This policy applies to "memory objects" mapped shared into
> + one or more tasks' distinct address spaces. Shared policies are applied
> + directly to the shared object. Thus, all tasks that attach to the object
> + share the policy, and all pages allocated for the shared object, by any
> + task, will obey the shared policy.
> +
> + Currently [2.6.22], only shared memory segments, created by shmget(),
> + support shared policy. When shared policy support was added to Linux,
> + the associated data structures were added to shared hugetlbfs segments.
> + However, at the time, hugetlbfs did not support allocation at fault
> + time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> + up" to the shared policy support. Although hugetlbfs segments now
> + support lazy allocation, their support for shared policy has not been
> + completed.
I guess patches would be welcome to complete it. But that may only be
relevant if huge pages are shared between processes. We so far have no
case in which that support is required.
> + Although internal to the kernel shared memory segments are really
> + files backed by swap space that have been mmap()ed shared into tasks'
> + address spaces, regular files mmap()ed shared do NOT support shared
> + policy. Rather, shared page cache pages, including pages backing
> + private mappings that have not yet been written by the task, follow
> + task policy, if any, else system default policy.
Yes. Shared memory segments do not represent file content. The file
content of mmap()ed pages may exist before the mmap. Also, there may be
regular buffered I/O going on, which will also use the task policy.
Having no VMA policy support ensures that page cache pages, whether
they are mmap()ed or not, will get the task policy applied.
> + Linux memory policy supports the following 4 modes:
> +
> + Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> + context dependent.
> +
> + The system default policy is hard coded to contain the Default mode.
> + In this context, it means "local" allocation--that is attempt to
> + allocate the page from the node associated with the cpu where the
> + fault occurs. If the "local" node has no memory, or the node's
> + memory can be exhausted [no free pages available], local allocation
> + will attempt to allocate pages from "nearby" nodes, using a per node
> + list of nodes--called zonelists--built at boot time.
> +
> + TODO: address runtime rebuild of node/zonelists when
> + supported.
Why?
> + When a task/process policy contains the Default mode, it means
> + "fall back to the system default mode". And, as discussed above,
> + this means use "local" allocation.
This would be easier if you would drop the system default mode and simply
say it's node-local.
> + In the context of a VMA, Default mode means "fall back to task
> + policy"--which may, itself, fall back to system default policy.
> + In the context of shared policies, Default mode means fall back
> + directly to the system default policy. Note: the result of this
> + semantic is that if the task policy is something other than Default,
> + it is not possible to specify local allocation for a region of the
> + task's address space using a VMA policy.
> +
> + The Default mode does not use the optional set of nodes.
Neither does the preferred node mode.
> + MPOL_BIND: This mode specifies that memory must come from the
> + set of nodes specified by the policy. The kernel builds a custom
> + zonelist containing just the nodes specified by the Bind policy.
> + If the kernel is unable to allocate a page from the first node in the
> + custom zonelist, it moves on to the next, and so forth. If it is unable
> + to allocate a page from any of the nodes in this list, the allocation
> + will fail.
> +
> + The memory policy APIs do not specify an order in which the nodes
> + will be searched. However, unlike the per node zonelists mentioned
> + above, the custom zonelist for the Bind policy does not consider the
> + distance between the nodes. Rather, the lists are built in order
> + of numeric node id.
Right. TODO: MPOL_BIND needs to pick the best node.
> + MPOL_PREFERRED: This mode specifies that the allocation should be
> + attempted from the single node specified in the policy. If that
> + allocation fails, the kernel will search other nodes, exactly as
> + it would for a local allocation that started at the preferred node--
> + that is, using the per-node zonelists in increasing distance from
> + the preferred node.
> +
> + If the Preferred policy specifies more than one node, the node
> + with the numerically lowest node id will be selected to start
> + the allocation scan.
AFAIK the preferred policy was only intended to specify one node.
> + For allocation of page cache pages, Interleave mode indexes the set
> + of nodes specified by the policy using a node counter maintained
> + per task. This counter wraps around to the lowest specified node
> + after it reaches the highest specified node. This will tend to
> + spread the pages out over the nodes specified by the policy based
> + on the order in which they are allocated, rather than based on any
> + page offset into an address range or file.
Which is particularly important if random pages in a file are used.
> +Linux supports 3 system calls for controlling memory policy. These APIS
> +always affect only the calling task, the calling task's address space, or
> +some shared object mapped into the calling task's address space.
These are wrapped by the numactl library. So these are not exposed to the
user.
> + Note: the headers that define these APIs and the parameter data types
> + for user space applications reside in a package that is not part of
> + the Linux kernel. The kernel system call interfaces, with the 'sys_'
> + prefix, are defined in <linux/syscalls.h>; the mode and flag
> + definitions are defined in <linux/mempolicy.h>.
You need to mention the numactl library here.
> + 'flags' may also contain 'MPOL_F_NODE'. This flag has been
> + described in some get_mempolicy() man pages as "not for application
> + use" and subject to change. Applications are cautioned against
> + using it. However, for completeness and because it is useful for
> + testing the kernel memory policy support, current behavior is
> + documented here:
The docs are wrong. This is fully supported.
> + Note: if the address specifies an anonymous region of the
> + task's address space with no page currently allocated, the
> + resulting "read access fault" will likely just map the shared
> + ZEROPAGE. It will NOT, for example, allocate a local page in
> + the case of default policy [unless the task happens to be
> + running on the node containing the ZEROPAGE], nor will it obey
> + VMA policy, if any.
Yes, the intent was for it to be used on a mapped page.
> + If the address space range covers an anonymous region or a private
> + mapping of a regular file, a VMA policy will be installed in this
> + region. This policy will govern all subsequent allocations of pages
> + for that range for all threads in the task.
Won't it be installed regardless of whether it is anonymous or not?
> + If the address space range covers a shared mapping of a regular
> + file, a VMA policy will be installed for that range. This policy
> + will be ignored for all page allocations by the calling task or
> + by any other task. Rather, all page allocations in that range will
> + be allocated using the faulting task's task policy, if any, else
> + the system default policy.
The policy is going to be used for COW in that range.
* Re: [PATCH] Document Linux Memory Policy
From: Andi Kleen @ 2007-05-29 20:07 UTC (permalink / raw)
To: Lee Schermerhorn, Michael Kerrisk
Cc: linux-mm, Andrew Morton, Christoph Lameter
On Tuesday 29 May 2007 21:33, Lee Schermerhorn wrote:
> [PATCH] Document Linux Memory Policy
>
> I couldn't find any memory policy documentation in the Documentation
> directory, so here is my attempt to document it. My objectives are
> two fold:
The theory is that the comment at the top of mempolicy.c gives a brief,
internals-oriented overview and the manpages describe the details. I must say
I'm not a big fan of too much redundant documentation, because the likelihood
of bitrot increases with redundancy. We also normally don't
keep syscall documentation in Documentation/*.
I see you got a few details that are right now missing in the manpages.
How about you just add them to the mbind/set_mempolicy/etc manpages
(and perhaps a new numa.7) and send a patch to the manpage
maintainer (cc'ed)? I believe having everything in the manpages
is the most useful for userland programmers who hardly look
into Documentation/* (in fact it is often not installed on systems
without kernel source)
The comment in mempolicy.c could probably also be improved a bit
for anything internal.
-Andi
<snip>
* Re: [PATCH] Document Linux Memory Policy
From: Andi Kleen @ 2007-05-29 20:16 UTC (permalink / raw)
To: Christoph Lameter, mtk-manpages; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton
On Tuesday 29 May 2007 22:04, Christoph Lameter wrote:
> > + Currently [2.6.22], only shared memory segments, created by shmget(),
> > + support shared policy. When shared policy support was added to Linux,
> > + the associated data structures were added to shared hugetlbfs segments.
> > + However, at the time, hugetlbfs did not support allocation at fault
> > + time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> > + up" to the shared policy support. Although hugetlbfs segments now
> > + support lazy allocation, their support for shared policy has not been
> > + completed.
>
> I guess patches would be welcome to complete it.
I actually had it working in SLES9 (which sported a lazy hugetlb
implementation somewhat different from what mainline has now).
Somehow it dropped off the radar in mainline, but it should be easy
to re-add.
> But that may only be
> releveant if huge pages are shared between processes.
NUMA policy is useful for multithreaded processes too
> We so far have no
> case in which that support is required.
Besides I think hugetlbfs mappings can be shared anyways.
> > + If the Preferred policy specifies more than one node, the node
> > + with the numerically lowest node id will be selected to start
> > + the allocation scan.
>
> AFAIK perferred policy was only intended to specify one node.
Yes.
Also, the big difference from MPOL_BIND is that it is not strict and will fall
back like the default policy.
> > + For allocation of page cache pages, Interleave mode indexes the set
> > + of nodes specified by the policy using a node counter maintained
> > + per task. This counter wraps around to the lowest specified node
> > + after it reaches the highest specified node. This will tend to
> > + spread the pages out over the nodes specified by the policy based
> > + on the order in which they are allocated, rather than based on any
> > + page offset into an address range or file.
>
> Which is particularly important if random pages in a file are used.
Not sure that should be documented too closely -- it is an implementation
detail that could change.
>
> > + 'flags' may also contain 'MPOL_F_NODE'. This flag has been
> > + described in some get_mempolicy() man pages as "not for application
> > + use" and subject to change. Applications are cautioned against
> > + using it. However, for completeness and because it is useful for
> > + testing the kernel memory policy support, current behavior is
> > + documented here:
>
> The docs are wrong. This is fully supported.
Yes, I gave up on that one, and the warning in the manpage should
probably be dropped.
-Andi
* Re: [PATCH] Document Linux Memory Policy
From: Lee Schermerhorn @ 2007-05-30 16:04 UTC (permalink / raw)
To: Andi Kleen; +Cc: Michael Kerrisk, linux-mm, Andrew Morton, Christoph Lameter
On Tue, 2007-05-29 at 22:07 +0200, Andi Kleen wrote:
> On Tuesday 29 May 2007 21:33, Lee Schermerhorn wrote:
> > [PATCH] Document Linux Memory Policy
> >
> > I couldn't find any memory policy documentation in the Documentation
> > directory, so here is my attempt to document it. My objectives are
> > two fold:
>
> The theory is that the comment at the top of mempolicy.c gives an brief
> internal oriented overview and the manpages describe the details. I must say
> I'm not a big fan of too much redundant documentation because the likelihood
> of bitrotting increases more with more redundancy. We also normally don't
> keep syscall documentation in Documentation/*
I did see the comment in mempolicy.c. Perhaps that is the best place to
document any design details. But I found both it and the man pages quite
sparse on the details. Linux provides a lot of surprising behavior
for anyone who has used NUMA systems before. Memory locality, and the
control thereof, is so important on some NUMA platforms that I think it's
important to describe exactly what the behavior is. I tried to distill
some general concepts on which to hang the existing behavior--a mental
map, if you will.
Regarding the syscall documentation...
>
> I see you got a few details that are right now missing in the manpages.
> How about you just add them to the mbind/set_mempolicy/etc manpages
> (and perhaps a new numa.7) and send a patch to the manpage
> maintainer (cc'ed)? I believe having everything in the manpages
> is the most useful for userland programmers who hardly look
> into Documentation/* (in fact it is often not installed on systems
> without kernel source)
Yes, the man pages do need updating. [I've seen a reference in the
set_mempolicy() man page to a non-existent 'flags' argument. I sort of
wish that it did exist. Could have used it to set global page cache
policy someday. That [global page cache policy] is still in your todo
list in the comment block ;-).]
>
> The comment in mempolicy.c could probably also be improved a bit
> for anything internal.
>
> -Andi
<snip>
I'll address Christoph's and your other points in the context of your
response there...
Lee
* Re: [PATCH] Document Linux Memory Policy
From: Lee Schermerhorn @ 2007-05-30 16:17 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, mtk-manpages, linux-mm, Andrew Morton
On Tue, 2007-05-29 at 22:16 +0200, Andi Kleen wrote:
> On Tuesday 29 May 2007 22:04, Christoph Lameter wrote:
>
> > > + Currently [2.6.22], only shared memory segments, created by shmget(),
> > > + support shared policy. When shared policy support was added to Linux,
> > > + the associated data structures were added to shared hugetlbfs segments.
> > > + However, at the time, hugetlbfs did not support allocation at fault
> > > + time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> > > + up" to the shared policy support. Although hugetlbfs segments now
> > > + support lazy allocation, their support for shared policy has not been
> > > + completed.
> >
> > I guess patches would be welcome to complete it.
>
> I actually had it working in SLES9 (which sported a lazy hugetlb
> implementation somewhat different from what mainline has now)
> Somehow it dropped off the radar in mainline, but it should be easy
> to readd.
Yes. In progress. As I mentioned in our previous discussion, if you
just add the policy vm_ops, it works as far as allocating the pages, but
numa_maps hangs displaying the segment. My series fixed that. I'm
extracting the numa_maps fix and related clean up, and when that works,
I'll post it along with a patch to add the vm_ops. Fixes come first,
right?
>
> > But that may only be
> > releveant if huge pages are shared between processes.
>
> NUMA policy is useful for multithreaded processes too
Two orthogonal concepts, right? A multi-threaded task can use NUMA
policy w/o sharing objects between "processes" [by which I mean a Linux
task plus its address space and associated resources]. I think that is
what Christoph was referring to?
>
> > We so far have no
> > case in which that support is required.
>
> Besides I think hugetlbfs mappings can be shared anyways.
No use case for sharing huge pages between processes, huh?
I'm aware of at least one large enterprise database that uses both huge
pages and shmem segments to good advantage, performance-wise, even on
Linux. That same database uses the NUMA policy support of the various
enterprise unix systems for additional performance gain. I understand
that this support will be enabled for Linux once a process can determine
which cpu/node it's running on--maybe ~2.6.22?
>
>
> > > + If the Preferred policy specifies more than one node, the node
> > > + with the numerically lowest node id will be selected to start
> > > + the allocation scan.
> >
> > AFAIK preferred policy was only intended to specify one node.
>
> Yes.
>
> Also the big difference from MPOL_BIND is that it is not strict and will fall
> back like the default policy.
Right. And since the API argument is a node mask, one might want to
know what happens if more than one node is specified. On the other
hand, we could play hardball and reject the call if more than one is
specified.
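To illustrate the current behavior, a minimal sketch [not tested; assumes
a single-long nodemask, i.e. fewer than 64 nodes, and libnuma's syscall
wrapper from <numaif.h>, linked with -lnuma]:

    #include <numaif.h>

    int main(void)
    {
        /* nodes 2 and 3 both set in the mask */
        unsigned long mask = (1UL << 2) | (1UL << 3);

        /* Succeeds today, but only node 2--the numerically
         * lowest node in the mask--becomes the preferred node. */
        return set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);
    }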
>
> > > + For allocation of page cache pages, Interleave mode indexes the set
> > > + of nodes specified by the policy using a node counter maintained
> > > + per task. This counter wraps around to the lowest specified node
> > > + after it reaches the highest specified node. This will tend to
> > > + spread the pages out over the nodes specified by the policy based
> > > + on the order in which they are allocated, rather than based on any
> > > + page offset into an address range or file.
> >
> > Which is particularly important if random pages in a file are used.
>
> Not sure that should be documented too closely -- it is an implementation
> detail that could change.
I think it's useful for the kernel documentation.
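As a user-level model of what that paragraph describes [illustrative
only--the kernel's real code is interleave_nodes() in mm/mempolicy.c]:

    /* Model of the per-task interleave counter: pick the next node
     * from the policy's node set, wrapping from the highest specified
     * node back to the lowest.  'nodes' is assumed sorted ascending. */
    static unsigned il_next;   /* simplified stand-in for the per-task counter */

    unsigned interleave_pick(const unsigned *nodes, unsigned nr_nodes)
    {
        unsigned nid = nodes[il_next];
        il_next = (il_next + 1) % nr_nodes;   /* wrap around */
        return nid;
    }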
>
> >
> > > + 'flags' may also contain 'MPOL_F_NODE'. This flag has been
> > > + described in some get_mempolicy() man pages as "not for application
> > > + use" and subject to change. Applications are cautioned against
> > > + using it. However, for completeness and because it is useful for
> > > + testing the kernel memory policy support, current behavior is
> > > + documented here:
> >
> > The docs are wrong. This is fully supported.
>
> Yes, I gave up on that one and the warning in the manpage should be
> probably dropped
OK. I'll work with the man page maintainers.
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-29 20:04 ` Christoph Lameter
2007-05-29 20:16 ` Andi Kleen
@ 2007-05-30 16:55 ` Lee Schermerhorn
2007-05-30 17:56 ` Christoph Lameter
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-30 16:55 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, Andi Kleen
On Tue, 2007-05-29 at 13:04 -0700, Christoph Lameter wrote:
> On Tue, 29 May 2007, Lee Schermerhorn wrote:
>
> > + A task policy applies only to pages allocated after the policy is
> > + installed. Any pages already faulted in by the task remain where
> > + they were allocated based on the policy at the time they were
> > + allocated.
>
> You can use cpusets to automatically migrate pages and sys_migrate_pages
> to manually migrate pages of a process though.
I consider cpusets, and the explicit migration APIs, orthogonal to
mempolicy. Mempolicy is an application interface, while cpusets are an
administrative interface that restricts what mempolicy can ask for. And
sys_migrate_pages/sys_move_pages seem to ignore mempolicy altogether.
I would agree, however, that they could be better integrated. E.g., how
can a NUMA-aware application [one that uses the mempolicy APIs]
determine what memories it's allowed to use? So far, all I've been able
to determine is that I try each node in the mask and the ones that don't
error out are valid. Seems a bit awkward...
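Something like this sketch [assumes a single-long nodemask and that
set_mempolicy() rejects disallowed or nonexistent nodes with EINVAL;
link with -lnuma for the <numaif.h> syscall wrappers]:

    #include <errno.h>
    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long mask;
        int nid;

        for (nid = 0; nid < (int)(sizeof(mask) * 8); nid++) {
            mask = 1UL << nid;    /* single-node mask */
            if (set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8) == 0) {
                printf("node %d: allowed\n", nid);
                set_mempolicy(MPOL_DEFAULT, NULL, 0);   /* undo */
            }   /* else EINVAL: node not present or not in our cpuset */
        }
        return 0;
    }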
>
> > + VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
> > + virtual address space. A task may define a specific policy for a range
> > + of its virtual address space. This VMA policy will govern the allocation
> > + of pages that back this region of the address space. Any regions of the
> > + task's address space that don't have an explicit VMA policy will fall back
> > + to the task policy, which may itself fall back to the system default policy.
>
> The system default policy is always the same when the system is running.
> There is no way to configure it. So it would be easier to avoid this layer
> and say they fall back to node local
What you describe is, indeed, the effect, but I'm trying to explain why
it works that way.
>
>
> > + VMA policies are shared between all tasks that share a virtual address
> > + space--a.k.a. threads--independent of when the policy is installed; and
> > + they are inherited across fork(). However, because VMA policies refer
> > + to a specific region of a task's address space, and because the address
> > + space is discarded and recreated on exec*(), VMA policies are NOT
> > + inheritable across exec(). Thus, only NUMA-aware applications may
> > + use VMA policies.
>
> Memory policies require NUMA. Drop the last sentence? You can set the task
> policy via numactl though.
I disagree about dropping the last sentence. I can/will define
NUMA-aware as applications that directly call the mempolicy APIs. You
can run an unmodified, non-NUMA-aware program on a NUMA platform with or
without numactl and take whatever performance you get. In some cases,
you'll be leaving performance on the table, but that may be a trade-off
some are willing to make not to have to modify their existing
applications.
>
> > + Shared Policy: This policy applies to "memory objects" mapped shared into
> > + one or more tasks' distinct address spaces. Shared policies are applied
> > + directly to the shared object. Thus, all tasks that attach to the object
> > + share the policy, and all pages allocated for the shared object, by any
> > + task, will obey the shared policy.
> > +
> > + Currently [2.6.22], only shared memory segments, created by shmget(),
> > + support shared policy. When shared policy support was added to Linux,
> > + the associated data structures were added to shared hugetlbfs segments.
> > + However, at the time, hugetlbfs did not support allocation at fault
> > + time--a.k.a lazy allocation--so hugetlbfs segments were never "hooked
> > + up" to the shared policy support. Although hugetlbfs segments now
> > + support lazy allocation, their support for shared policy has not been
> > + completed.
>
> I guess patches would be welcome to complete it. But that may only be
> relevant if huge pages are shared between processes. We so far have no
> case in which that support is required.
See response to Andi's mail re: data base use of shmem & hugepages.
>
> > + Although internal to the kernel shared memory segments are really
> > + files backed by swap space that have been mmap()ed shared into tasks'
> > + address spaces, regular files mmap()ed shared do NOT support shared
> > + policy. Rather, shared page cache pages, including pages backing
> > + private mappings that have not yet been written by the task, follow
> > + task policy, if any, else system default policy.
>
> Yes. Shared memory segments do not represent file content. The file
> content of mmap pages may exist before the mmap. Also there may be regular
> buffered I/O going on which will also use the task policy.
Unix/Posix/Linux semantics are very flexible with respect to file
description access [read, write, et al] and memory mapped access to
files. One CAN access files via both of these interfaces, and the
system jumps through hoops backwards [e.g., consider truncation] to make
it work. However, some applications just access the files via mmap()
and want to control the NUMA placement like any other component of their
address space. Read/write access to such a file, while I agree it
should work, is, IMO, secondary to load/store access. In such a case,
the performance of the load/store access shouldn't be sacrificed for the
read/write case, which already has to go through system calls, buffer
copies, ...
>
> Having no vma policy support ensures that pagecache pages, whether
> they are mmapped or not, will get the task policy applied.
Which is fine if that's what you want. If you're using a memory mapped
file as a persistent shared memory area that faults pages in where you
specified, as you access them, maybe that's not what you want. I
guarantee that's not what I want.
However, it seems to me, this is our other discussion. What I've tried
to do with this patch is document the existing concepts and behavior, as
I understand them.
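For concreteness, a sketch of the scenario [file name and node number are
arbitrary; the file is assumed to exist and be at least 'len' bytes;
error handling omitted]:

    #include <fcntl.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1UL << 20;
        int fd = open("/tmp/shared.dat", O_RDWR);
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        unsigned long mask = 1UL << 1;    /* node 1 */

        /* Installs a VMA policy on the range... */
        mbind(p, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);

        /* ...but this first touch allocates the page cache page per
         * the faulting task's *task* policy; the VMA policy above is
         * ignored for the shared file mapping. */
        p[0] = 1;

        munmap(p, len);
        close(fd);
        return 0;
    }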
>
> > + Linux memory policy supports the following 4 modes:
> > +
> > + Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> > + context dependent.
> > +
> > + The system default policy is hard coded to contain the Default mode.
> > + In this context, it means "local" allocation--that is attempt to
> > + allocate the page from the node associated with the cpu where the
> > + fault occurs. If the "local" node has no memory, or the node's
> > > + memory is exhausted [no free pages available], local allocation
> > + will attempt to allocate pages from "nearby" nodes, using a per node
> > + list of nodes--called zonelists--built at boot time.
> > +
> > + TODO: address runtime rebuild of node/zonelists when
> > + supported.
>
> Why?
Because "built at boot time" is then not strictly correct, is it?
>
> > + When a task/process policy contains the Default mode, it means
> > + "fall back to the system default mode". And, as discussed above,
> > + this means use "local" allocation.
>
> This would be easier if you would drop the system default mode and simply
> say it's node local
I'm trying to build the reader's mental map.
>
> > + In the context of a VMA, Default mode means "fall back to task
> > + policy"--which may, itself, fall back to system default policy.
> > + In the context of shared policies, Default mode means fall back
> > + directly to the system default policy. Note: the result of this
> > + semantic is that if the task policy is something other than Default,
> > + it is not possible to specify local allocation for a region of the
> > + task's address space using a VMA policy.
> > +
> > + The Default mode does not use the optional set of nodes.
>
> Neither does the preferred node mode.
Actually, it does take the node mask argument. It just selects the
first node therein. See response to Andi.
>
> > + MPOL_BIND: This mode specifies that memory must come from the
> > + set of nodes specified by the policy. The kernel builds a custom
> > + zonelist containing just the nodes specified by the Bind policy.
> > + If the kernel is unable to allocate a page from the first node in the
> > + custom zonelist, it moves on to the next, and so forth. If it is unable
> > + to allocate a page from any of the nodes in this list, the allocation
> > + will fail.
> > +
> > + The memory policy APIs do not specify an order in which the nodes
> > + will be searched. However, unlike the per node zonelists mentioned
> > + above, the custom zonelist for the Bind policy does not consider the
> > + distance between the nodes. Rather, the lists are built in order
> > + of numeric node id.
>
> Right. TODO: MPOL_BIND needs to pick the best node.
>
> > + MPOL_PREFERRED: This mode specifies that the allocation should be
> > + attempted from the single node specified in the policy. If that
> > + allocation fails, the kernel will search other nodes, exactly as
> > + it would for a local allocation that started at the preferred node--
> > + that is, using the per-node zonelists in increasing distance from
> > + the preferred node.
> > +
> > + If the Preferred policy specifies more than one node, the node
> > + with the numerically lowest node id will be selected to start
> > + the allocation scan.
>
> AFAIK preferred policy was only intended to specify one node.
Covered in response to Andi.
>
> > + For allocation of page cache pages, Interleave mode indexes the set
> > + of nodes specified by the policy using a node counter maintained
> > + per task. This counter wraps around to the lowest specified node
> > + after it reaches the highest specified node. This will tend to
> > + spread the pages out over the nodes specified by the policy based
> > + on the order in which they are allocated, rather than based on any
> > + page offset into an address range or file.
>
> Which is particularly important if random pages in a file are used.
>
> > +Linux supports 3 system calls for controlling memory policy. These APIs
> > +always affect only the calling task, the calling task's address space, or
> > +some shared object mapped into the calling task's address space.
>
> These are wrapped by the numactl library. So these are not exposed to the
> user.
>
> > + Note: the headers that define these APIs and the parameter data types
> > + for user space applications reside in a package that is not part of
> > + the Linux kernel. The kernel system call interfaces, with the 'sys_'
> > + prefix, are defined in <linux/syscalls.h>; the mode and flag
> > + definitions are defined in <linux/mempolicy.h>.
>
> You need to mention the numactl library here.
I'm trying to describe kernel behavior. I would expect this to be
picked up by the man pages at some time. As I responded to Andi, I'll
work with the maintainers... when I get the time.
>
> > + 'flags' may also contain 'MPOL_F_NODE'. This flag has been
> > + described in some get_mempolicy() man pages as "not for application
> > + use" and subject to change. Applications are cautioned against
> > + using it. However, for completeness and because it is useful for
> > + testing the kernel memory policy support, current behavior is
> > + documented here:
>
> The docs are wrong. This is fully supported.
>
> > + Note: if the address specifies an anonymous region of the
> > + task's address space with no page currently allocated, the
> > + resulting "read access fault" will likely just map the shared
> > + ZEROPAGE. It will NOT, for example, allocate a local page in
> > + the case of default policy [unless the task happens to be
> > + running on the node containing the ZEROPAGE], nor will it obey
> > + VMA policy, if any.
>
> Yes, the intent was for it to be used on a mapped page.
Just pointing out that this might not be what you expect. E.g., if you
mbind() an anonymous region to some node where the ZEROPAGE does NOT
reside [do we intend to do per node ZEROPAGEs, or was that idea
dropped?], fault in the pages via read access and then query the page
location, either via get_mempolicy() w/ MPOL_F_ADDR|MPOL_F_NODE or via numa_maps,
you'll see the pages on some node you don't expect and think it's
broken. Well, not YOU, but someone not familiar with kernel internals
might.
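E.g., a sketch of such a query [error handling omitted]:

    #include <stdio.h>
    #include <sys/mman.h>
    #include <numaif.h>

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        volatile char c = p[0];   /* read fault: maps the shared ZEROPAGE */
        int nid = -1;

        get_mempolicy(&nid, NULL, 0, p, MPOL_F_ADDR | MPOL_F_NODE);
        printf("page at %p reported on node %d\n", (void *)p, nid);
        (void)c;
        return 0;
    }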
>
> > + If the address space range covers an anonymous region or a private
> > + mapping of a regular file, a VMA policy will be installed in this
> > + region. This policy will govern all subsequent allocations of pages
> > + for that range for all threads in the task.
>
> Won't it be installed regardless of whether it is anonymous or not?
Yes, I suppose I could reword that and the next paragraph differently.
>
> > + If the address space range covers a shared mapping of a regular
> > + file, a VMA policy will be installed for that range. This policy
> > + will be ignored for all page allocations by the calling task or
> > + by any other task. Rather, all page allocations in that range will
> > + be allocated using the faulting task's task policy, if any, else
> > + the system default policy.
>
> The policy is going to be used for COW in that range.
You don't get COW if it's a shared mapping. You use the page cache
pages, which ignore my mbind(). That's my beef! [;-)]
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 16:17 ` Lee Schermerhorn
@ 2007-05-30 17:41 ` Christoph Lameter
2007-05-31 8:20 ` Michael Kerrisk
1 sibling, 0 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-05-30 17:41 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, mtk-manpages, linux-mm, Andrew Morton
On Wed, 30 May 2007, Lee Schermerhorn wrote:
> > Also the big difference from MPOL_BIND is that it is not strict and will fall
> > back like the default policy.
>
> Right. And since the API argument is a node mask, one might want to
> know what happens if more than one node is specified. On the other
> hand, we could play hardball and reject the call if more than one is
> specified.
I think we would like to reject the call if more than one node is
specified.
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 16:55 ` [PATCH] Document Linux Memory Policy Lee Schermerhorn
@ 2007-05-30 17:56 ` Christoph Lameter
2007-05-31 6:18 ` Gleb Natapov
` (2 more replies)
0 siblings, 3 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-05-30 17:56 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, Andrew Morton, Andi Kleen
On Wed, 30 May 2007, Lee Schermerhorn wrote:
> > You can use cpusets to automatically migrate pages and sys_migrate_pages
> > to manually migrate pages of a process though.
>
> I consider cpusets, and the explicit migration APIs, orthogonal to
> mempolicy. Mempolicy is an application interface, while cpusets are an
> administrative interface that restricts what mempolicy can ask for. And
> sys_migrate_pages/sys_move_pages seem to ignore mempolicy altogether.
They have to since they may be used to change page locations when policies
are active. There is a libcpuset library that can be used for application
control of cpusets. I think Paul would disagree with you here.
> I would agree, however, that they could be better integrated. E.g., how
> can a NUMA-aware application [one that uses the mempolicy APIs]
> determine what memories it's allowed to use? So far, all I've been able
> to determine is that I try each node in the mask and the ones that don't
> error out are valid. Seems a bit awkward...
The cpuset interfaces provide this information.
> > There is no way to configure it. So it would be easier to avoid this layer
> > and say they fall back to node local
>
> What you describe is, indeed, the effect, but I'm trying to explain why
> it works that way.
But the explanation adds a new element that only serves to complicate the
description.
> > > + VMA policies are shared between all tasks that share a virtual address
> > > + space--a.k.a. threads--independent of when the policy is installed; and
> > > + they are inherited across fork(). However, because VMA policies refer
> > > + to a specific region of a task's address space, and because the address
> > > + space is discarded and recreated on exec*(), VMA policies are NOT
> > > + inheritable across exec(). Thus, only NUMA-aware applications may
> > > + use VMA policies.
> >
> > Memory policies require NUMA. Drop the last sentence? You can set the task
> > policy via numactl though.
>
> I disagree about dropping the last sentence. I can/will define
> NUMA-aware as applications that directly call the mempolicy APIs. You
Or the cpuset APIs.
> can run an unmodified, non-NUMA-aware program on a NUMA platform with or
> without numactl and take whatever performance you get. In some cases,
Right.
> you'll be leaving performance on the table, but that may be a trade-off
> some are willing to make not to have to modify their existing
> applications.
The sentence still does not make sense. There is no point in using numa
> memory policies if the app is not a NUMA app.
> > > + Although internal to the kernel shared memory segments are really
> > > + files backed by swap space that have been mmap()ed shared into tasks'
> > > + address spaces, regular files mmap()ed shared do NOT support shared
> > > + policy. Rather, shared page cache pages, including pages backing
> > > + private mappings that have not yet been written by the task, follow
> > > + task policy, if any, else system default policy.
> >
> > Yes. Shared memory segments do not represent file content. The file
> > content of mmap pages may exist before the mmap. Also there may be regular
> > buffered I/O going on which will also use the task policy.
>
> Unix/Posix/Linux semantics are very flexible with respect to file
> description access [read, write, et al] and memory mapped access to
> files. One CAN access files via both of these interfaces, and the
> system jumps through hoops backwards [e.g., consider truncation] to make
> it work. However, some applications just access the files via mmap()
> and want to control the NUMA placement like any other component of their
> address space. Read/write access to such a file, while I agree it
> Right, but the pages may already have been in memory due to buffered read
access.
> should work, is, IMO, secondary to load/store access. In such a case,
> the performance of the load/store access shouldn't be sacrificed for the
> read/write case, which already has to go through system calls, buffer
> copies, ...
It's not a matter of sacrifice. It's consistency. Page cache pages are
always subject to the task's memory policy, whether you use buffered I/O or
mmapped I/O.
> > Having no vma policy support ensures that pagecache pages, whether
> > they are mmapped or not, will get the task policy applied.
>
> Which is fine if that's what you want. If you're using a memory mapped
> file as a persistent shared memory area that faults pages in where you
> specified, as you access them, maybe that's not what you want. I
> guarantee that's not what I want.
>
> However, it seems to me, this is our other discussion. What I've tried
> to do with this patch is document the existing concepts and behavior, as
> I understand them.
It seems that you are creating some artificial problems here.
> > > + Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> > > + context dependent.
> > > +
> > > + The system default policy is hard coded to contain the Default mode.
> > > + In this context, it means "local" allocation--that is attempt to
> > > + allocate the page from the node associated with the cpu where the
> > > + fault occurs. If the "local" node has no memory, or the node's
> > > + memory is exhausted [no free pages available], local allocation
> > > + will attempt to allocate pages from "nearby" nodes, using a per node
> > > + list of nodes--called zonelists--built at boot time.
> > > +
> > > + TODO: address runtime rebuild of node/zonelists when
> > > + supported.
> >
> > Why?
>
> Because "built at boot time" is then not strictly correct, is it?
I still do not understand what this is all about. The zonelists are
rebuilt due to Kame-san's patch for the ZONE_DMA problems. Okay. So what
does this have to do with MPOL_DEFAULT?
> > > + The Default mode does not use the optional set of nodes.
> >
> > Neither does the preferred node mode.
>
> Actually, it does take the node mask argument. It just selects the
> first node therein. See response to Andi.
It uses one node, yes. It does not support (or is not intended to support)
a nodemask.
> > > + Note: the headers that define these APIs and the parameter data types
> > > + for user space applications reside in a package that is not part of
> > > + the Linux kernel. The kernel system call interfaces, with the 'sys_'
> > > + prefix, are defined in <linux/syscalls.h>; the mode and flag
> > > + definitions are defined in <linux/mempolicy.h>.
> >
> > You need to mention the numactl library here.
>
> I'm trying to describe kernel behavior. I would expect this to be
> picked up by the man pages at some time. As I responded to Andi, I'll
> work with the maintainers... when I get the time.
I thought you wanted to explain this to users? If so, then you need to
mention the user APIs such as numactl and libcpuset.
> You don't get COW if it's a shared mapping. You use the page cache
> pages, which ignore my mbind(). That's my beef! [;-)]
Page cache pages are subject to a task's memory policy regardless of how we
get to the page cache page. I think that is pretty consistent.
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 17:56 ` Christoph Lameter
@ 2007-05-31 6:18 ` Gleb Natapov
2007-05-31 6:41 ` Christoph Lameter
2007-05-31 18:28 ` Lee Schermerhorn
2007-05-31 19:25 ` Paul Jackson
2 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 6:18 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Wed, May 30, 2007 at 10:56:17AM -0700, Christoph Lameter wrote:
> > You don't get COW if it's a shared mapping. You use the page cache
> > pages, which ignore my mbind(). That's my beef! [;-)]
>
> Page cache pages are subject to a task's memory policy regardless of how we
> get to the page cache page. I think that is pretty consistent.
>
I am a little bit confused here. If two processes mmap some file with
MAP_SHARED and each one marks a different part of the file with
numa_setlocal_memory() (and suppose that no pages were faulted in for
this file yet). Now the first process touches a part of the file that was
marked local by the second process. Will the faulted page be placed in the
first process' local memory or the second's? I surely expect the latter,
but it seems I am wrong.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 6:18 ` Gleb Natapov
@ 2007-05-31 6:41 ` Christoph Lameter
2007-05-31 6:47 ` Gleb Natapov
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-05-31 6:41 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Thu, 31 May 2007, Gleb Natapov wrote:
> On Wed, May 30, 2007 at 10:56:17AM -0700, Christoph Lameter wrote:
> > > You don't get COW if it's a shared mapping. You use the page cache
> > > pages, which ignore my mbind(). That's my beef! [;-)]
> >
> > Page cache pages are subject to a task's memory policy regardless of how we
> > get to the page cache page. I think that is pretty consistent.
> >
> I am a little bit confused here. If two processes mmap some file with
> MAP_SHARED and each one marks a different part of the file with
> numa_setlocal_memory() (and suppose that no pages were faulted in for
The numa_setlocal_memory() has no effect on ranges that map pagecache
pages.
> this file yet). Now the first process touches a part of the file that was
> marked local by the second process. Will the faulted page be placed in the
> first process' local memory or the second's? I surely expect the latter,
> but it seems I am wrong.
The faulted page will use the memory policy of the task that faulted it
in. If that process has numa_set_localalloc() set then the page will be
located as closely as possible to the allocating thread.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 6:41 ` Christoph Lameter
@ 2007-05-31 6:47 ` Gleb Natapov
2007-05-31 6:56 ` Christoph Lameter
2007-05-31 10:43 ` Andi Kleen
0 siblings, 2 replies; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 6:47 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Wed, May 30, 2007 at 11:41:25PM -0700, Christoph Lameter wrote:
> On Thu, 31 May 2007, Gleb Natapov wrote:
>
> > On Wed, May 30, 2007 at 10:56:17AM -0700, Christoph Lameter wrote:
> > > > You don't get COW if it's a shared mapping. You use the page cache
> > > > pages, which ignore my mbind(). That's my beef! [;-)]
> > >
> > > Page cache pages are subject to a task's memory policy regardless of how we
> > > get to the page cache page. I think that is pretty consistent.
> > >
> > I am a little bit confused here. If two processes mmap some file with
> > MAP_SHARED and each one marks a different part of the file with
> > numa_setlocal_memory() (and suppose that no pages were faulted in for
>
> The numa_setlocal_memory() has no effect on ranges that map pagecache
> pages.
>
> > this file yet). Now the first process touches a part of the file that was
> > marked local by the second process. Will the faulted page be placed in the
> > first process' local memory or the second's? I surely expect the latter,
> > but it seems I am wrong.
>
> The faulted page will use the memory policy of the task that faulted it
> in. If that process has numa_set_localalloc() set then the page will be
> located as closely as possible to the allocating thread.
Thanks. But I have to say this feels very unnatural. So to have
the desired effect I have to create shared memory with shmget?
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 6:47 ` Gleb Natapov
@ 2007-05-31 6:56 ` Christoph Lameter
2007-05-31 7:11 ` Gleb Natapov
2007-05-31 10:43 ` Andi Kleen
1 sibling, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-05-31 6:56 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Thu, 31 May 2007, Gleb Natapov wrote:
> > The faulted page will use the memory policy of the task that faulted it
> > in. If that process has numa_set_localalloc() set then the page will be
> > located as closely as possible to the allocating thread.
>
> Thanks. But I have to say this feels very unnatural. So to have
> the desired effect I have to create shared memory with shmget?
Right. From a user perspective: How would you solve the problem that
1. A shared range has multiple tasks that can fault pages in.
The policy of which task should control how the page is allocated?
Is it the last one that set the policy?
2. Pagecache pages can be read and written by buffered I/O and
via mmap. Should there be different allocation semantics
depending on the way you got the page? Obviously no policy
for a memory range can be applied to a page allocated via
buffered I/O. Later it may be mapped via mmap but then
we never use policies if the page is already in memory.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 6:56 ` Christoph Lameter
@ 2007-05-31 7:11 ` Gleb Natapov
2007-05-31 7:24 ` Christoph Lameter
0 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 7:11 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Wed, May 30, 2007 at 11:56:34PM -0700, Christoph Lameter wrote:
> On Thu, 31 May 2007, Gleb Natapov wrote:
>
> > > The faulted page will use the memory policy of the task that faulted it
> > > in. If that process has numa_set_localalloc() set then the page will be
> > > located as closely as possible to the allocating thread.
> >
> > Thanks. But I have to say this feels very unnatural. So to have
> > the desired effect I have to create shared memory with shmget?
>
> Right. From a user perspective: How would you solve the problem that
>
> 1. A shared range has multiple tasks that can fault pages in.
> The policy of which task should control how the page is allocated?
> Is it the last one that set the policy?
How is it done for shmget? For my particular case I would prefer to get an error
from numa_setlocal_memory() if a process tries to set a policy on an area
of the file that already has a policy set. This may happen only as a
result of a bug in my app.
>
> 2. Pagecache pages can be read and written by buffered I/O and
> via mmap. Should there be different allocation semantics
> depending on the way you got the page? Obviously no policy
> for a memory range can be applied to a page allocated via
> buffered I/O. Later it may be mapped via mmap but then
> we never use policies if the page is already in memory.
If the page is already in the pagecache, use it. Or return an error if a strict
policy is in use. Or something else :) In my case I make sure that the file
is accessed only through the mmap interface.
I agree that from the kernel's point of view the current behaviour seems more
logical/easier to implement. After all, memory policy is a property of a
memory space and not a file. But as a user I expect to be able to use mmap to
create a shared space between processes and set a memory policy on this
space.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 7:11 ` Gleb Natapov
@ 2007-05-31 7:24 ` Christoph Lameter
2007-05-31 7:39 ` Gleb Natapov
2007-05-31 17:07 ` Lee Schermerhorn
0 siblings, 2 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-05-31 7:24 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Thu, 31 May 2007, Gleb Natapov wrote:
> > 1. A shared range has multiple tasks that can fault pages in.
> > The policy of which task should control how the page is allocated?
> > Is it the last one that set the policy?
> How is it done for shmget? For my particular case I would prefer to get an error
> from numa_setlocal_memory() if a process tries to set a policy on an area
> of the file that already has a policy set. This may happen only as a
> result of a bug in my app.
Hmmm.... That's an idea. Lee: Do we have some way of returning an error?
We then need to have a function that clears memory policy. Maybe the
default policy is the clear?
> > 2. Pagecache pages can be read and written by buffered I/O and
> > via mmap. Should there be different allocation semantics
> > depending on the way you got the page? Obviously no policy
> > for a memory range can be applied to a page allocated via
> > buffered I/O. Later it may be mapped via mmap but then
> > we never use policies if the page is already in memory.
> If the page is already in the pagecache, use it. Or return an error if a strict
> policy is in use. Or something else :) In my case I make sure that the file
> is accessed only through the mmap interface.
On an mmap we cannot really return an error. If your program has just run
then pages may linger in memory. If you run it on another node then the
earlier used pages may be used.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 7:24 ` Christoph Lameter
@ 2007-05-31 7:39 ` Gleb Natapov
2007-05-31 17:43 ` Christoph Lameter
2007-05-31 17:07 ` Lee Schermerhorn
1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 7:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Thu, May 31, 2007 at 12:24:06AM -0700, Christoph Lameter wrote:
> > > 2. Pagecache pages can be read and written by buffered I/O and
> > > via mmap. Should there be different allocation semantics
> > > depending on the way you got the page? Obviously no policy
> > > for a memory range can be applied to a page allocated via
> > > buffered I/O. Later it may be mapped via mmap but then
> > > we never use policies if the page is already in memory.
>
> > If the page is already in the pagecache, use it. Or return an error if a strict
> > policy is in use. Or something else :) In my case I make sure that the file
> > is accessed only through the mmap interface.
>
> On an mmap we cannot really return an error. If your program has just run
> then pages may linger in memory. If you run it on another node then the
> earlier used pages may be used.
I am OK with that behaviour. For already faulted pages there is nothing
we can do, so if the application really cares it should make sure this doesn't
happen (flush the file from the pagecache before mmap. Is that even possible?).
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 16:17 ` Lee Schermerhorn
2007-05-30 17:41 ` Christoph Lameter
@ 2007-05-31 8:20 ` Michael Kerrisk
2007-05-31 14:49 ` Lee Schermerhorn
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
1 sibling, 2 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-05-31 8:20 UTC (permalink / raw)
To: Lee Schermerhorn, ak; +Cc: akpm, linux-mm, clameter
> > > The docs are wrong. This is fully supported.
> >
> > Yes, I gave up on that one and the warning in the manpage should be
> > probably dropped
>
> OK. I'll work with the man page maintainers.
Hi Lee,
If you could write a patch for the man page, that would be ideal.
Location of current tarball is in the .sig.
Cheers,
Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages ,
read the HOWTOHELP file and grep the source
files for 'FIXME'.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 6:47 ` Gleb Natapov
2007-05-31 6:56 ` Christoph Lameter
@ 2007-05-31 10:43 ` Andi Kleen
2007-05-31 11:04 ` Gleb Natapov
1 sibling, 1 reply; 83+ messages in thread
From: Andi Kleen @ 2007-05-31 10:43 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
> > The faulted page will use the memory policy of the task that faulted it
> > in. If that process has numa_set_localalloc() set then the page will be
> > located as closely as possible to the allocating thread.
>
> Thanks. But I have to say this feels very unnatural.
What do you think is unnatural exactly? First one wins seems like a quite
natural policy to me.
> So to have
> the desired effect I have to create shared memory with shmget?
shmget behaves the same.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 10:43 ` Andi Kleen
@ 2007-05-31 11:04 ` Gleb Natapov
2007-05-31 11:30 ` Gleb Natapov
2007-05-31 11:47 ` Andi Kleen
0 siblings, 2 replies; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 11:04 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
>
> > > The faulted page will use the memory policy of the task that faulted it
> > > in. If that process has numa_set_localalloc() set then the page will be
> > > located as closely as possible to the allocating thread.
> >
> > Thanks. But I have to say this feels very unnatural.
>
> What do you think is unnatural exactly? First one wins seems like a quite
> natural policy to me.
No it is not (not always). I want to create shared memory for
interprocess communication. Process A will write into the memory and
process B will periodically poll it to see if there is a message there.
In a NUMA system I want the physical memory for this VMA to be allocated
from a node close to process B, since it will use it much more frequently.
But I don't want to pre-fault all pages in process B to achieve this
because the region can be huge and because it doesn't guarantee much if
swapping is involved. So numa_set_localalloc() looks like it achieves
exactly this. Without this function I agree that the "first one wins" is
a very sensible assumption, but when each process has stated its preferences
explicitly by calling the function, it is no longer sensible to me as a
user of the API. When you start to think about how memory policy may be
implemented in the kernel and understand that memory policy is a
property of an address space (is it?) and not a file, then you start to
understand the current behaviour, but these are implementation details.
>
> > So to have
> > > the desired effect I have to create shared memory with shmget?
>
> shmget behaves the same.
>
Then I misinterpreted the "Shared Policy" section of Lee's document.
It seems that he states that for a memory region created with shmget the
policy is a property of the shared object and not of a process' address
space.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 11:04 ` Gleb Natapov
@ 2007-05-31 11:30 ` Gleb Natapov
2007-05-31 15:26 ` Lee Schermerhorn
2007-05-31 11:47 ` Andi Kleen
1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 11:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 02:04:12PM +0300, Gleb Natapov wrote:
> On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
> >
> > > > The faulted page will use the memory policy of the task that faulted it
> > > > in. If that process has numa_set_localalloc() set then the page will be
> > > > located as closely as possible to the allocating thread.
> > >
> > > Thanks. But I have to say this feels very unnatural.
> >
> > What do you think is unnatural exactly? First one wins seems like a quite
> > natural policy to me.
> No it is not (not always). I want to create shared memory for
> interprocess communication. Process A will write into the memory and
> process B will periodically poll it to see if there is a message there.
> In a NUMA system I want the physical memory for this VMA to be allocated
> from a node close to process B, since it will use it much more frequently.
> But I don't want to pre-fault all pages in process B to achieve this
> because the region can be huge and because it doesn't guarantee much if
> swapping is involved. So numa_set_localalloc() looks like it achieves
> exactly this. Without this function I agree that the "first one wins" is
> a very sensible assumption, but when each process has stated its preferences
> explicitly by calling the function, it is no longer sensible to me as a
> user of the API. When you start to think about how memory policy may be
OK, now, rereading the man page, I see that numa_tonode_memory() can achieve
this without pre-faulting. A should know what CPU B is running on, but
this is a minor problem.
> implemented in the kernel and understand that memory policy is a
> property of an address space (is it?) and not a file, then you start to
> understand the current behaviour, but these are implementation details.
>
>
> >
> > > So to have
> > > the desired effect I have to create shared memory with shmget?
> >
> > shmget behaves the same.
> >
> Then I misinterpreted the "Shared Policy" section of Lee's document.
> It seems that he states that for a memory region created with shmget the
> policy is a property of the shared object and not of a process' address
> space.
>
Man page states:
Memory policy set for memory areas is shared by all threads of the
process. Memory policy is also shared by other processes mapping the
same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
shared for disk backed file mappings right now although that may change
in the future.
So what does this mean? If I set a local policy for a memory region in process
A, should it be obeyed by memory accesses in process B?
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 11:04 ` Gleb Natapov
2007-05-31 11:30 ` Gleb Natapov
@ 2007-05-31 11:47 ` Andi Kleen
2007-05-31 11:59 ` Gleb Natapov
1 sibling, 1 reply; 83+ messages in thread
From: Andi Kleen @ 2007-05-31 11:47 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
> No it is not (not always).
Natural = as in benefits a large number of applications. Your requirement
seems to be quite special.
> I want to create shared memory for
> interprocess communication. Process A will write into the memory and
> process B will periodically poll it to see if there is a message there.
> In NUMA system I want the physical memory for this VMA to be allocated
> from node close to process B
Then bind it to the node of process B (using numa_set_membind())
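E.g. [a sketch against the nodemask_t interface of that era's libnuma;
check numa(3) for the exact signatures]:

    #include <numa.h>

    /* Bind all future allocations of the calling task to the node
     * process B runs on [node_of_b is whatever that node is]. */
    void bind_to_b(int node_of_b)
    {
        nodemask_t nm;

        nodemask_zero(&nm);
        nodemask_set(&nm, node_of_b);
        numa_set_membind(&nm);
    }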
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 11:47 ` Andi Kleen
@ 2007-05-31 11:59 ` Gleb Natapov
2007-05-31 12:15 ` Andi Kleen
0 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 11:59 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 01:47:28PM +0200, Andi Kleen wrote:
>
> > No it is not (not always).
>
> Natural = as in benefits a large number of applications. Your requirement
> seems to be quite special.
Really. Is use of shared memory to communicate between two processes so
rare and special?
>
> > I want to create shared memory for
> > interprocess communication. Process A will write into the memory and
> > process B will periodically poll it to see if there is a message there.
> > In NUMA system I want the physical memory for this VMA to be allocated
> > from node close to process B
>
> Then bind it to the node of process B (using numa_set_membind())
>
Already found it. Thanks.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 11:59 ` Gleb Natapov
@ 2007-05-31 12:15 ` Andi Kleen
2007-05-31 12:18 ` Gleb Natapov
0 siblings, 1 reply; 83+ messages in thread
From: Andi Kleen @ 2007-05-31 12:15 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
On Thursday 31 May 2007 13:59:31 Gleb Natapov wrote:
> On Thu, May 31, 2007 at 01:47:28PM +0200, Andi Kleen wrote:
> >
> > > No it is not (not always).
> >
> > Natural = as in benefits a large number of applications. Your requirement
> > seems to be quite special.
> Really. Is use of shared memory to communicate between two processes so
> rare and special?
It is rather rare that the first process touching memory is not the one
using it most often. It tends to happen with some memory allocators that
reuse memory, but there is no reasonable way to handle that except asking
for an explicit policy anyway.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 12:15 ` Andi Kleen
@ 2007-05-31 12:18 ` Gleb Natapov
0 siblings, 0 replies; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 12:18 UTC (permalink / raw)
To: Andi Kleen; +Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 02:15:11PM +0200, Andi Kleen wrote:
> On Thursday 31 May 2007 13:59:31 Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 01:47:28PM +0200, Andi Kleen wrote:
> > >
> > > > No it is not (not always).
> > >
> > > Natural = as in benefits a large number of applications. Your requirement
> > > seems to be quite special.
> > Really. Is use of shared memory to communicate between two processes so
> > rare and special?
>
> It is rather rare that the first process touching memory is not the one
> using it most often. It tends to happen with some memory allocators that
> reuse memory, but there is no reasonable way to handle that except asking
> for an explicit policy anyway.
>
OK. It is possible to achieve exactly what I need with the existing API and this is what
matters. Thanks.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 8:20 ` Michael Kerrisk
@ 2007-05-31 14:49 ` Lee Schermerhorn
2007-05-31 15:56 ` Michael Kerrisk
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 14:49 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, akpm, linux-mm, clameter
On Thu, 2007-05-31 at 10:20 +0200, Michael Kerrisk wrote:
> > > > The docs are wrong. This is fully supported.
> > >
> > > Yes, I gave up on that one and the warning in the manpage should be
> > > probably dropped
> >
> > OK. I'll work with the man page maintainers.
>
> Hi Lee,
>
> If you could write a patch for the man page, that would be ideal.
> Location of current tarball is in the .sig.
Michael: I'd be happy to. I'll put that in my queue ;-).
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 11:30 ` Gleb Natapov
@ 2007-05-31 15:26 ` Lee Schermerhorn
2007-05-31 17:41 ` Gleb Natapov
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 15:26 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Andi Kleen, Christoph Lameter, linux-mm, Andrew Morton
On Thu, 2007-05-31 at 14:30 +0300, Gleb Natapov wrote:
> On Thu, May 31, 2007 at 02:04:12PM +0300, Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
> > >
> > > > > The faulted page will use the memory policy of the task that faulted it
> > > > > in. If that process has numa_set_localalloc() set then the page will be
> > > > > located as closely as possible to the allocating thread.
> > > >
> > > > Thanks. But I have to say this feels very unnatural.
> > >
> > > What do you think is unnatural exactly? First one wins seems like a quite
> > > natural policy to me.
> > No it is not (not always). I want to create shared memory for
> > interprocess communication. Process A will write into the memory and
> > process B will periodically poll it to see if there is a message there.
> > In a NUMA system I want the physical memory for this VMA to be allocated
> > from a node close to process B, since it will use it much more frequently.
> > But I don't want to pre-fault all pages in process B to achieve this
> > because the region can be huge and because it doesn't guarantee much if
> > swapping is involved. So numa_set_localalloc() looks like it achieves
> > exactly this. Without this function I agree that the "first one wins" is
> > a very sensible assumption, but when each process has stated its preferences
> > explicitly by calling the function, it is no longer sensible to me as a
> > user of the API. When you start to think about how memory policy may be
> OK, now, rereading the man page, I see that numa_tonode_memory() can achieve
> this without pre-faulting. A should know what CPU B is running on, but
> this is a minor problem.
Gleb: numa_tonode_memory() won't do what you want if the file is
mapped shared. The numa_*_memory() interfaces use mbind() which
installs a VMA policy in the address space of the caller. When a page
is faulted in for a mmap'd file, the page will be allocated using the
faulting task's task policy, if any, else system default.
Now, if it were a private mapping and you write/store to the pages, the
kernel will COW and give you a page that obeys the policy you installed
in your address space. But, shared mappings don't COW, so you retain
the original page that followed the faulting task's policy. There are
currently 2 ways that I know of to explicitly place a page in a shared
file mapping:
1) let the task policy default to local or explicitly specify local
access [numa_set_localalloc()] and then ensure that you prefault the
page while executing on a cpu local to the node where you want the page.
2) set the task policy to bind to the desired node and prefault the
page [see the sketch below].
Of course, if the page gets reclaimed and later faulted back in, the
location will depend on the task policy, and possibly the location, of
the task that causes the refault.
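A sketch of option 2) [assumes 4KB pages, a single-long nodemask, and the
<numaif.h> syscall wrappers]:

    #include <stddef.h>
    #include <numaif.h>

    /* Place the pages backing [addr, addr+len) of a shared file
     * mapping on 'node' by binding the task policy, then prefaulting. */
    void prefault_on_node(volatile const char *addr, size_t len, int node)
    {
        unsigned long mask = 1UL << node;
        size_t off;

        set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8);
        for (off = 0; off < len; off += 4096)
            (void)addr[off];                   /* fault each page in */
        set_mempolicy(MPOL_DEFAULT, NULL, 0);  /* restore default policy */
    }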
I've been proposing patches to generalize the shared policy support
enjoyed by shmem segments for use with shared mmap'd files. I was
beginning to think that I'm the only one with applications [well, with
customers with applications] that need this behavior. Sounds like your
requirements are very similar: huge file [don't want to prefault nor
wait for it to all be read into shmem before starting processing], only
accesses via mmap, ...
>
> > implemented in the kernel and understand that memory policy is a
> > property of an address space (is it?) and not a file, then you start to
> > understand the current behaviour, but these are implementation details.
> >
> >
> > >
> > > > So to have
> > > > the desired effect I have to create shared memory with shmget?
> > >
> > > shmget behaves the same.
> > >
> > Then I misinterpreted the "Shared Policy" section of Lee's document.
> > It seems that he states that for a memory region created with shmget the
> > policy is a property of the shared object and not of a process' address
> > space.
> >
> Man page states:
> Memory policy set for memory areas is shared by all threads of the
> process. Memory policy is also shared by other processes mapping the
> same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
> shared for disk backed file mappings right now although that may change
> in the future.
> So what does this mean? If I set local policy for memory region in process
> A it should be obeyed by memory access in process B?
shmem does, indeed, work this way. Policies installed on ranges of the
shared segment via mbind() are stored with the shared object.
I think the future is now: time to share policy for disk backed file
mappings.
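For example [a sketch--the segment size and target node are arbitrary]:

    #include <numaif.h>         /* mbind(), MPOL_* */
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /* The policy installed here is stored with the shared object, so
     * any task that later shmat()s the segment and faults pages in
     * gets them allocated per this mbind(), not per its own policy. */
    int bind_segment(size_t size, int node)
    {
            int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
            char *seg = shmat(id, NULL, 0);
            unsigned long nodemask = 1UL << node;

            if (id < 0 || seg == (void *)-1)
                    return -1;
            return mbind(seg, size, MPOL_BIND, &nodemask,
                         sizeof(nodemask) * 8, 0);
    }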
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 14:49 ` Lee Schermerhorn
@ 2007-05-31 15:56 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-05-31 15:56 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, linux-mm, akpm, ak
> > If you could write a patch for the man page, that would be ideal.
> > Location of current tarball is in the .sig.
>
> Michael: I'd be happy to. I'll put that in my queue ;-).
Thanks Lee!
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages ,
read the HOWTOHELP file and grep the source
files for 'FIXME'.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 7:24 ` Christoph Lameter
2007-05-31 7:39 ` Gleb Natapov
@ 2007-05-31 17:07 ` Lee Schermerhorn
1 sibling, 0 replies; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 17:07 UTC (permalink / raw)
To: Christoph Lameter, Gleb Natapov; +Cc: linux-mm, Andrew Morton, Andi Kleen
On Thu, 2007-05-31 at 00:24 -0700, Christoph Lameter wrote:
> On Thu, 31 May 2007, Gleb Natapov wrote:
>
> > > 1. A shared range has multiple tasks that can fault pages in.
> > > The policy of which task should control how the page is allocated?
> > > Is it the last one that set the policy?
>
> > How is it done for shmget? For my particular case I would prefer to get an error
> > from numa_setlocal_memory() if a process tries to set policy on the area
> > of the file that already has policy set. This may happen only as a
> > result of a bug in my app.
>
> Hmmm.... That's an idea. Lee: Do we have some way of returning an error?
> We then need to have a function that clears memory policy. Maybe the
> default policy is the clear?
For shmem, mbind() of a range of the object [that's what
numa_setlocal_memory() does] replaces any existing policy in that range.
This is what I would expect--the last one applied takes effect. Multiple
tasks attaching to a shmem segment, or mmap()ing the same file
shared, would, I hope, be cooperating tasks and know what they are
doing. Typically--i.e., in the applications I'm familiar with--only one
task that sets up the shmem or file mapping for the multi-task
application would set the policy.
However, I agree that if I'm ever successful in getting policy attached
to shared file mappings, we'll need a way to delete the policy. I'm
thinking of something like "MPOL_DELETE" that completely deletes the
policy--whether it be on a range of virtual addresses via mbind() or the
task policy, via set_mempolicy(). Of course, MPOL_DELETE would work for
shmem segments as well.
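For comparison, with the current API the nearest thing to a delete that I
know of is re-applying the default policy, relying on the "last one applied
takes effect" behavior above [a sketch; addr/len assumed]:

    mbind(addr, len, MPOL_DEFAULT, NULL, 0, 0);  /* range reverts to default */
    set_mempolicy(MPOL_DEFAULT, NULL, 0);        /* task policy reverts      */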
>
> > > 2. Pagecache pages can be read and written by buffered I/O and
> > > via mmap. Should there be different allocation semantics
> > > depending on the way you got the page? Obviously no policy
> > > for a memory range can be applied to a page allocated via
> > > buffered I/O. Later it may be mapped via mmap but then
> > > we never use policies if the page is already in memory.
>
> > If page is already in the pagecache use it. Or return an error if strict
> > policy is in use. Or something else :) In my case I make sure that the
> > files are accessed only through the mmap interface.
This is the model that I've been trying to support--tasks which have, as
a portion of their address space, a shared mapping of an application
specific file that is only ever accessed via mmap().
>
> On an mmap we cannot really return an error. If your program has just run
> then pages may linger in memory. If you run it on another node then the
> earlier used pages may be used.
It's true that a page of such a file [private to the application, only
accessed by mmap()] may be in the page cache in the wrong location,
either because you run later on another node, as Christoph says, or
because you've just done a backup or restored from one. However, in
this case, if your application is the only one that mmap's the file, and
you only apply policy from the "application initialization task", then
that task will be the only one mapping the file. In that case, you
can use Christoph's excellent MPOL_MF_MOVE facility to ensure that the
pages follow your new policy. If other tasks have the page mapped,
you'll need to use MPOL_MF_MOVE_ALL, which requires special privilege
[CAP_SYS_NICE].
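A sketch of that recovery step [addr/len/node are illustrative; use
MPOL_MF_MOVE_ALL--with CAP_SYS_NICE--if other tasks have the pages mapped]:

    #include <numaif.h>         /* mbind(), MPOL_*, MPOL_MF_MOVE */

    /* Re-apply the binding and ask the kernel to migrate pages that
     * were already faulted in at the wrong location. */
    static long rebind_and_move(void *addr, unsigned long len, int node)
    {
            unsigned long nodemask = 1UL << node;

            return mbind(addr, len, MPOL_BIND, &nodemask,
                         sizeof(nodemask) * 8, MPOL_MF_MOVE);
    }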
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 15:26 ` Lee Schermerhorn
@ 2007-05-31 17:41 ` Gleb Natapov
2007-05-31 18:56 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 17:41 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Christoph Lameter, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 11:26:44AM -0400, Lee Schermerhorn wrote:
> On Thu, 2007-05-31 at 14:30 +0300, Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 02:04:12PM +0300, Gleb Natapov wrote:
> > > On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
> > > >
> > > > > > The faulted page will use the memory policy of the task that faulted it
> > > > > > in. If that process has numa_set_localalloc() set then the page will be
> > > > > > located as closely as possible to the allocating thread.
> > > > >
> > > > > Thanks. But I have to say this feels very unnatural.
> > > >
> > > > What do you think is unnatural exactly? First one wins seems like a quite
> > > > natural policy to me.
> > > No it is not (not always). I want to create shared memory for
> > > interprocess communication. Process A will write into the memory and
> > > process B will periodically poll it to see if there is a message there.
> > > In a NUMA system I want the physical memory for this VMA to be allocated
> > > from a node close to process B since it will use it much more frequently.
> > > But I don't want to pre-fault all pages in process B to achieve this
> > > because the region can be huge and because it doesn't guarantee much if
> > > swapping is involved. So numa_set_localalloc() looks like it achieves
> > > exactly this. Without this function I agree that the "first one wins" is
> > > a very sensible assumption, but when each process stated its preferences
> > > explicitly by calling the function it is no longer sensible to me as a
> > > user of the API. When you start to think about how memory policy may be
> > OK, now, rereading the man page, I see that numa_tonode_memory() can achieve
> > this without pre-faulting. A should know what CPU B is running on, but
> > this is a minor problem.
>
> Gleb: numa_tonode_memory() won't do what you want if the file is
> mapped shared. The numa_*_memory() interfaces use mbind() which
> installs a VMA policy in the address space of the caller. When a page
> is faulted in for a mmap'd file, the page will be allocated using the
> faulting task's task policy, if any, else system default.
>
Suppose I have two processes that want to communicate through shared memory.
They mmap the same file with MAP_SHARED. Now the first process calls
numa_setlocal_memory() on the region where it will receive messages and
calls numa_tonode_memory(second process's node id) on the region where it
will post messages for the second process. The second process does the
same thing. After that, no matter which process touches memory first, the
faulted-in pages should be allocated from the correct memory node. Do I
miss something here?
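In code, I mean roughly the following [the file name, region size and
peer_node are invented for illustration; link with -lnuma]:

    #include <numa.h>     /* numa_setlocal_memory(), numa_tonode_memory() */
    #include <sys/mman.h>
    #include <fcntl.h>

    #define REGION (64UL * 1024 * 1024)     /* made-up region size */

    char *setup_channel(const char *path, int peer_node)
    {
            int fd = open(path, O_RDWR);
            char *base = mmap(NULL, 2 * REGION, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);

            /* first half: messages I receive--want them local to me */
            numa_setlocal_memory(base, REGION);
            /* second half: messages I post--want them on my peer's node */
            numa_tonode_memory(base + REGION, REGION, peer_node);
            return base;
    }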
> I've been proposing patches to generalize the shared policy support
> enjoyed by shmem segments for use with shared mmap'd files. I was
> beginning to think that I'm the only one with applications [well, with
> customers with applications] that need this behavior. Sounds like your
> requirements are very similar: huge file [don't want to prefault nor
> wait for it to all be read into shmem before starting processing], only
> accesses via mmap, ...
I thought this was a pretty common use case, but Andi thinks differently. I
don't have any hard evidence one way or the other.
> > Man page states:
> > Memory policy set for memory areas is shared by all threads of the
> > process. Memory policy is also shared by other processes mapping the
> > same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
> > shared for disk backed file mappings right now although that may change
> > in the future.
> > So what does this mean? If I set local policy for memory region in process
> > A it should be obeyed by memory access in process B?
>
> shmem does, indeed, work this way. Policies installed on ranges of the
> shared segment via mbind() are stored with the shared object.
>
> I think the future is now: time to share policy for disk backed file
> mappings.
>
At least it will be consistent with what you get when shared memory is
created via shmget(). It will be very surprising for a programmer if
his program's logic breaks just because he changes the way shared
memory is created.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 7:39 ` Gleb Natapov
@ 2007-05-31 17:43 ` Christoph Lameter
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-05-31 17:43 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, linux-mm, Andrew Morton, Andi Kleen
On Thu, 31 May 2007, Gleb Natapov wrote:
> happen (flush file from pagecache before mmap. Is it even possible?).
Hmmm.... fadvise or so I guess.
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 17:56 ` Christoph Lameter
2007-05-31 6:18 ` Gleb Natapov
@ 2007-05-31 18:28 ` Lee Schermerhorn
2007-05-31 18:35 ` Christoph Lameter
2007-05-31 19:25 ` Paul Jackson
2 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 18:28 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, Andi Kleen, Gleb Natapov
On Wed, 2007-05-30 at 10:56 -0700, Christoph Lameter wrote:
> On Wed, 30 May 2007, Lee Schermerhorn wrote:
>
> > > You can use cpusets to automatically migrate pages and sys_migrate_pages
> > > to manually migrate pages of a process though.
> >
> > I consider cpusets, and the explicit migration APIs, orthogonal to
> > mempolicy. Mempolicy is an application interface, while cpusets are an
> > administrative interface that restricts what mempolicy can ask for. And
> > sys_migrate_pages/sys_move_pages seem to ignore mempolicy altogether.
>
> They have to since they may be used to change page locations when policies
> are active.
That's fine, I guess. But I still think that makes them orthogonal to
mempolicy...
> There is a libcpuset library that can be used for application
> control of cpusets.
libcpuset is part of the SGI ProPack, right? Is there a generic
version of that available for current kernels? I see ProPack 3 on
the SGI web site, but it appears to be for an older version of Linux and
CpuMemSets, and a tad Altix-specific. [I've been assuming we're talking
about general Linux capabilities.]
I did find in several versions of the on-line ProPack documentation this
statement: "The cpuset facility is primarily a workload manager tool
permitting a system administrator to restrict the number of processors
and memory resources that a process or set of processes may use." This
matches my understanding that cpusets are a "container-like" facility.
Indeed, they appear to be evolving to this upstream.
And certainly a "workload manager tool" can be viewed as an application.
I just tend to separate privileged system admin tools and the facilities
they use from applications such as numerical/scientific computation,
enterprise workloads, web servers, ... Not the only way to view the
world, I agree.
> I think Paul would disagree with you here.
Paul?
>
> > I would agree, however, that they could be better integrated. E.g., how
> > can a NUMA-aware application [one that uses the mempolicy APIs]
> > determine what memories it's allowed to use. So far, all I've been able
> > to determine is that I try each node in the mask and the ones that don't
> > error out are valid. Seems a bit awkward...
>
> The cpuset interfaces provide this information.
Well, NUMA systems don't require cpusets. I agree tho' that they're
very useful for system partitioning and am glad to see them supported by
the standard kernels in the current generation of Enterprise distros.
>
> > > There is no way to configure it. So it would be easier to avoid this layer
> > > and say they fall back to node local
> >
> > What you describe is, indeed, the effect, but I'm trying to explain why
> > it works that way.
>
> But the explanation adds a new element that only serves to complicate the
> description.
I'm reworking the doc to address this and other comments... Where I
don't disagree too strongly ;-).
>
> > > > + VMA policies are shared between all tasks that share a virtual address
> > > > + space--a.k.a. threads--independent of when the policy is installed; and
> > > > + they are inherited across fork(). However, because VMA policies refer
> > > > + to a specific region of a task's address space, and because the address
> > > > + space is discarded and recreated on exec*(), VMA policies are NOT
> > > > + inheritable across exec(). Thus, only NUMA-aware applications may
> > > > + use VMA policies.
> > >
> > > Memory policies require NUMA. Drop the last sentence? You can set the task
> > > policy via numactl though.
> >
> > I disagree about dropping the last sentence. I can/will define
> > NUMA-aware as applications that directly call the mempolicy APIs. You
>
> Or the cpuset APIs.
Yes, an "application" that uses the cpuset APIs would be a NUMA-aware
administration tool. ;-)
>
> > can run an unmodified, non-NUMA-aware program on a NUMA platform with or
> > without numactl and take whatever performance you get. In some cases,
>
> Right.
>
> > you'll be leaving performance on the table, but that may be a trade-off
> > some are willing to make not to have to modify their existing
> > applications.
>
> The sentence still does not make sense. There is no point in using numa
> memory policies if the app is not a NUMA app.
OK. Let me try to explain it this way. You can take a non-NUMA aware
app, that uses neither the memory policy APIs nor the cpuset interface,
perhaps from a dusty old SMP system, and run that on a NUMA system.
Depending on workload, load balancing, etc., you may end up with a lot
of non-local accesses. However, with numactl, you can restrict that
application, without modification, to a single node or set of close
neighbor nodes and achieve some of the benefit of memory policy APIs.
If the application fits in the cpu and memory resources of a single
node, then you probably need do no more. Can't get much more local than
that. If the application requires more than one node's worth of
resources, then at some point it might be worthwhile to make the
application NUMA-aware and use the policy APIs directly. This assumes,
of course, that you have someone who understands the memory access
behavior of the application well enough to specify the policies.
Performance analyzers can help, as can automatic page migration ;-).
>
> > > > + Although internal to the kernel shared memory segments are really
> > > > + files backed by swap space that have been mmap()ed shared into tasks'
> > > > + address spaces, regular files mmap()ed shared do NOT support shared
> > > > + policy. Rather, shared page cache pages, including pages backing
> > > > + private mappings that have not yet been written by the task, follow
> > > > + task policy, if any, else system default policy.
> > >
> > > Yes. shared memory segments do not represent file content. The file
> > > content of mmap pages may exist before the mmap. Also there may be regular
> > > buffered I/O going on which will also use the task policy.
> >
> > Unix/Posix/Linux semantics are very flexible with respect to file
> > descriptor access [read, write, et al] and memory mapped access to
> > files. One CAN access files via both of these interfaces, and the
> > system jumps through hoops backwards [e.g., consider truncation] to make
> > it work. However, some applications just access the files via mmap()
> > and want to control the NUMA placement like any other component of their
> > address space. Read/write access to such a file, while I agree it
>
> Right but the pages may already have been in memory due to buffered read
> access.
True. As we've been discussing in another branch with Gleb Natapov
[added to cc list], some applications use "application private" files
[not to be confused with MAP_PRIVATE, please] that they only ever access
via mmap(). Still, pages could be in the page cache because the file had
just been backed up or restored from backup. However, in this case, the
pages' mapcount should be '1'--the first application task to mmap shared
and apply the policy--so MPOL_MF_MOVE should work.
>
> > should work, is, IMO, secondary to load/store access. In such a case,
> > the performance of the load/store access shouldn't be sacrificed for the
> > read/write case, which already has to go through system calls, buffer
> > copies, ...
>
> It's not a matter of sacrifice. It's consistency. Page cache pages are
> always subject to the task's memory policy whether you use buffered I/O or
> mmapped I/O.
I'm all for consistency when it helps. Here it hurts.
>
> > > Having no vma policy support insures that pagecache pages regardless if
> > > they are mmapped or not will get the task policy applied.
> >
> > Which is fine if that's what you want. If you're using a memory mapped
> > file as a persistent shared memory area that faults pages in where you
> > specified, as you access them, maybe that's not what you want. I
> > guarantee that's not what I want.
> >
> > However, it seems to me, this is our other discussion. What I've tried
> > to do with this patch is document the existing concepts and behavior, as
> > I understand them.
>
> It seems that you are creating some artificial problems here.
Christoph: Let me assure you, I'm not persisting in this exchange
because I'm enjoying it. Quite the opposite, actually. However, like
you, my employer asks me to address our customers' requirements. I'm
trying to understand and play within the rules of the community. I
attempted this documentation patch to address what I saw as missing
documentation and to provide context for further discussion of my patch
set.
>
> > > > + Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> > > > + context dependent.
> > > > +
> > > > + The system default policy is hard coded to contain the Default mode.
> > > > + In this context, it means "local" allocation--that is attempt to
> > > > + allocate the page from the node associated with the cpu where the
> > > > + fault occurs. If the "local" node has no memory, or the node's
> > > > + memory can be exhausted [no free pages available], local allocation
> > > > + will attempt to allocate pages from "nearby" nodes, using a per node
> > > > + list of nodes--called zonelists--built at boot time.
> > > > +
> > > > + TODO: address runtime rebuild of node/zonelists when
> > > > + supported.
> > >
> > > Why?
> >
> > Because "built at boot time" is then not strictly correct, is it?
>
> I still do not understand what this is all about. The zonelists are
> rebuilt due to Kame-san's patch for the ZONE_DMA problems. Okay. So what
> does this have to do with MPOL_DEFAULT?
I'll remove the TODO, OK?
My point was that the description of MPOL_DEFAULT made reference to the
zonelists built at boot time, to distinguish them from the custom
zonelists built for an MPOL_BIND. Since the zonelist reorder patch
hasn't made it out of Andrew's tree yet, I didn't want to refer to it
this round of the doc. If it makes it into the tree, I had planned to say
something like: "at boot time or on request". I should probably add
"or on memory hotplug".
>
> > > > + The Default mode does not use the optional set of nodes.
> > >
> > > Neither does the preferred node mode.
> >
> > Actually, it does take the node mask argument. It just selects the
> > first node therein. See response to Andi.
>
> It uses one node yes. It does not support (or is not intended to support)
> a nodemask.
OK. In the context of this concepts section, I see your point. I've
rewritten this section.
In the context of the API section, the argument is defined as a nodemask
and can have 0 [local allocation], 1, or more [chooses the first]. I'll
fix it up.
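I.e., via the raw interface [a sketch; 'node' is an arbitrary valid node
id]:

    unsigned long nodemask = 1UL << node;

    set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8);
                                            /* prefer the given node   */
    set_mempolicy(MPOL_PREFERRED, NULL, 0); /* empty mask: local alloc */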
>
> > > > + Note: the headers that define these APIs and the parameter data types
> > > > + for user space applications reside in a package that is not part of
> > > > + the Linux kernel. The kernel system call interfaces, with the 'sys_'
> > > > + prefix, are defined in <linux/syscalls.h>; the mode and flag
> > > > + definitions are defined in <linux/mempolicy.h>.
> > >
> > > You need to mention the numactl library here.
> >
> > I'm trying to describe kernel behavior. I would expect this to be
> > picked up by the man pages at some time. As I responded to Andi, I'll
> > work the maintainers... When I get the time.
>
> I thought you wanted to explain this to users? If so then you need to
> mention the user APIs such as numactl and libcpuset.
OK. Since application developers might come here to get information, I
should probably at least point them at libnuma for the wrappers, as
that tends to ship with many distros. I'm still not sure about the
general availability of libcpuset.
But, after I see what gets accepted into the man pages that I've agreed
to update, I'll consider dropping this section altogether. Maybe the
entire document.
>
> > You don't get COW if it's a shared mapping. You use the page cache
> > pages, which ignore my mbind(). That's my beef! [;-)]
>
> page cache pages are subject to a task's memory policy regardless of how we
> get to the page cache page. I think that is pretty consistent.
Oh, it's consistent, alright. Just not pretty [;-)] when it's not what
the application wants.
Later,
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 18:28 ` Lee Schermerhorn
@ 2007-05-31 18:35 ` Christoph Lameter
2007-05-31 19:29 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-05-31 18:35 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, Andrew Morton, Andi Kleen, Gleb Natapov
On Thu, 31 May 2007, Lee Schermerhorn wrote:
> > It seems that you are creating some artificial problems here.
>
> Christoph: Let me assure you, I'm not persisting in this exchange
> because I'm enjoying it. Quite the opposite, actually. However, like
> you, my employer asks me to address our customers' requirements. I'm
> trying to understand and play within the rules of the community. I
> attempted this documentation patch to address what I saw as missing
> documentation and to provide context for further discussion of my patch
> set.
Could you explain to us what kind of user scenario you are addressing? We
have repeatedly asked you for that information. I am happy to hear that
there is an actual customer requirement.
> My point was that the description of MPOL_DEFAULT made reference to the
> zonelists built at boot time, to distinguish them from the custom
> zonelists built for an MPOL_BIND. Since the zonelist reorder patch
> hasn't made it out of Andrew's tree yet, I didn't want to refer to it
> this round of the doc. If it makes it into the tree, I had planned to say
> something like: "at boot time or on request". I should probably add
> "or on memory hotplug".
Hmmm... The zonelists for MPOL_BIND are never rebuilt by Kame-san's
patches. That is a concern.
> But, after I see what gets accepted into the man pages that I've agreed
> to update, I'll consider dropping this section altogether. Maybe the
> entire document.
I'd be very thankful if you could upgrade the manpages. Andi has some
patches from me against numactl pending that include manpage
updates. I can forward those to you.
> > page cache pages are subject to a task's memory policy regardless of how we
> > get to the page cache page. I think that is pretty consistent.
>
> Oh, it's consistent, alright. Just not pretty [;-)] when it's not what
> the application wants.
I sure hope that we can at some point figure out what your application is
doing. It's been a hard road to that information so far.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 17:41 ` Gleb Natapov
@ 2007-05-31 18:56 ` Lee Schermerhorn
2007-05-31 20:06 ` Gleb Natapov
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 18:56 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Andi Kleen, Christoph Lameter, linux-mm, Andrew Morton
On Thu, 2007-05-31 at 20:41 +0300, Gleb Natapov wrote:
> On Thu, May 31, 2007 at 11:26:44AM -0400, Lee Schermerhorn wrote:
> > On Thu, 2007-05-31 at 14:30 +0300, Gleb Natapov wrote:
> > > On Thu, May 31, 2007 at 02:04:12PM +0300, Gleb Natapov wrote:
> > > > On Thu, May 31, 2007 at 12:43:19PM +0200, Andi Kleen wrote:
> > > > >
> > > > > > > The faulted page will use the memory policy of the task that faulted it
> > > > > > > in. If that process has numa_set_localalloc() set then the page will be
> > > > > > > located as closely as possible to the allocating thread.
> > > > > >
> > > > > > Thanks. But I have to say this feels very unnatural.
> > > > >
> > > > > What do you think is unnatural exactly? First one wins seems like a quite
> > > > > natural policy to me.
> > > > No it is not (not always). I want to create shared memory for
> > > > interprocess communication. Process A will write into the memory and
> > > > process B will periodically poll it to see if there is a message there.
> > > > In a NUMA system I want the physical memory for this VMA to be allocated
> > > > from a node close to process B since it will use it much more frequently.
> > > > But I don't want to pre-fault all pages in process B to achieve this
> > > > because the region can be huge and because it doesn't guarantee much if
> > > > swapping is involved. So numa_set_localalloc() looks like it achieves
> > > > exactly this. Without this function I agree that the "first one wins" is
> > > > a very sensible assumption, but when each process stated its preferences
> > > > explicitly by calling the function it is no longer sensible to me as a
> > > > user of the API. When you start to think about how memory policy may be
> > > OK, now, rereading the man page, I see that numa_tonode_memory() can achieve
> > > this without pre-faulting. A should know what CPU B is running on, but
> > > this is a minor problem.
> >
> > Gleb: numa_tonode_memory() won't do what you want if the file is
> > mapped shared. The numa_*_memory() interfaces use mbind() which
> > installs a VMA policy in the address space of the caller. When a page
> > is faulted in for a mmap'd file, the page will be allocated using the
> > faulting task's task policy, if any, else system default.
> >
> Suppose I have two processes that want to communicate through shared memory.
> They mmap the same file with MAP_SHARED. Now the first process calls
> numa_setlocal_memory() on the region where it will receive messages and
> calls numa_tonode_memory(second process's node id) on the region where it
> will post messages for the second process. The second process does the
> same thing. After that, no matter which process touches memory first, the
> faulted-in pages should be allocated from the correct memory node.
Not as I understand your meaning of "correct memory node". Certainly
not [necessarily] the one you implied/specified in the numa_*_memory()
calls.
> Do I
> miss something here?
I think you do.
The policies that each task applies get installed as VMA policies in the
address space of each task. However, because you have mapped the file
shared, these policies are ignored at fault time. Rather, because
you're faulting in a file page, the system allocates a page cache page.
The page cache allocation function will just use the faulting task's
task policy [or system default]. It will NOT consult the address space
of the faulting task. As Christoph pointed out, the page may already be
in the page cache, allocated based on the task policy of the task that
caused the allocation. In this case, the system will just add a page
table entry for that page to your task's page table.
The Mapped File Policy patch series that I posted addresses the behavior
described above--probably not what you expect nor what you want?--by
using the same shared policy infrastructure used by shmem to control
allocation for regular files mmap()'d shared.
Semantics [with my patches] are as follows:
If you map a file MAP_PRIVATE, policy only gets applied to the calling
task's address space. I.e., current behavior. It will be ignored by
page cache allocations. However, if you write to the page, the kernel
will COW the page, making a private anonymous copy for your task. The
anonymous COWed page WILL follow the VMA policy you installed, but won't
be visible to any other task mmap()ing the file--shared or private.
This is also current behavior.
If you map a file MAP_SHARED and DON'T apply a policy--which covers most
existing applications, according to Andi--then page cache allocations
will still default to task policy or system default--again, current
behavior. Even if you write to the page, because you've mapped shared,
you keep the page cache page allocated at fault time.
If you map a file shared and apply a policy via mbind() or one of the
libnuma wrappers you mention above, the policy will "punch through" the
VMA and be installed on the file's internal incarnation [inode +
address_space structures], dynamically allocating the necessary
shared_policy structure. Then, for this file, page_cache allocations
that hit the range on which you installed a policy will use that shared
policy. If you don't cover the entire file with your policy, those
ranges that you don't cover will continue to use task/system default
policy--just like shmem.
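So, with the patches, one task could place a range of a shared file
mapping with a single call [a sketch--names and parameters illustrative]:

    #include <numaif.h>
    #include <sys/mman.h>
    #include <fcntl.h>

    /* With shared file policy, this mbind() would be stored with the
     * file's internal incarnation [inode/address_space], so pages
     * faulted in by ANY task mapping the file would obey it. Today,
     * without the patches, it merely installs a VMA policy that page
     * cache allocation ignores. */
    int place_shared_range(const char *path, size_t len, int node)
    {
            int fd = open(path, O_RDWR);
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
            unsigned long nodemask = 1UL << node;

            return mbind(p, len, MPOL_BIND, &nodemask,
                         sizeof(nodemask) * 8, 0);
    }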
>
> > I've been proposing patches to generalize the shared policy support
> > enjoyed by shmem segments for use with shared mmap'd files. I was
> > beginning to think that I'm the only one with applications [well, with
> > customers with applications] that need this behavior. Sounds like your
> > requirements are very similar: huge file [don't want to prefault nor
> > wait for it to all be read into shmem before starting processing], only
> > accesses via mmap, ...
> I thought this was a pretty common use case, but Andi thinks differently. I
> don't have any hard evidence one way or the other.
The only evidence I have is from customers I've worked with in the past
that we're trying to convert to Linux and requirements apparently coming
from customers with whom I don't have direct contact--i.e., from
marketing, sales/support, ... Or maybe they're just making it up to
keep me busy ;-).
>
> > > Man page states:
> > > Memory policy set for memory areas is shared by all threads of the
> > > process. Memory policy is also shared by other processes mapping the
> > > same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
> > > shared for disk backed file mappings right now although that may change
> > > in the future.
> > > So what does this mean? If I set local policy for memory region in process
> > > A it should be obeyed by memory access in process B?
> >
> > shmem does, indeed, work this way. Policies installed on ranges of the
> > shared segment via mbind() are stored with the shared object.
> >
> > I think the future is now: time to share policy for disk backed file
> > mappings.
> >
> At least it will be consistent with what you get when shared memory is
> created via shmget(). It will be very surprising for a programmer if
> his program's logic breaks just because he changes the way shared
> memory is created.
Yes. A bit inconsistent, from the application programmer's viewpoint.
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-30 17:56 ` Christoph Lameter
2007-05-31 6:18 ` Gleb Natapov
2007-05-31 18:28 ` Lee Schermerhorn
@ 2007-05-31 19:25 ` Paul Jackson
2007-05-31 20:22 ` Lee Schermerhorn
2 siblings, 1 reply; 83+ messages in thread
From: Paul Jackson @ 2007-05-31 19:25 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee.Schermerhorn, linux-mm, akpm, ak
> They have to since they may be used to change page locations when policies
> are active. There is a libcpuset library that can be used for application
> control of cpusets. I think Paul would disagree with you here.
In the most common usage, a batch scheduler uses cpusets to control
a job's memory and placement, and application code within the job uses
the memory policy calls (mbind, set_mempolicy) and scheduler policy
call (sched_setaffinity) to manage its detailed placement.
In particular, the memory policy calls can only be applied to the
current task, so any larger scope control has to be done by cpusets.
The cpuset file system, with its traditional file system hierarchy
and permission model, allows as much control as desired to be passed
on to specific applications, and over time, I expect this to happen
more.
However, there will always be a different focus here.
The primary purpose of the memory and scheduler policy mechanisms is to
maximize the efficient usage of available resources by a co-operating
set of tasks - get tasks close to their memory and things like that.
The mind set is "we own the machine - how can we best use it." For
example, tightly coupled MPI jobs will need to place one compute-bound
thread on each processor, ensure that nothing else is actively running
on those processors, and place data close to the task accessing it. The
expectation is that a job's code may have to be modified, perhaps even
radically rewritten with a new algorithm, to optimize processor and
memory usage, as relative speeds of processor, memory and bus change.
The primary purpose of cpusets is job isolation, ensuring that one job
does not interfere with another, by keeping the jobs on separate cpus
and memory nodes. The mind set is "how can we keep these several jobs
out of each other's hair, minimizing any impact of one job's resource
usage on the runtime of another." The expectation is that jobs must
be controlled externally, without any change to the jobs code or even
any expertise in the fine grained memory or scheduler policy behaviour
of the job.
It may well make sense to document memory policy, for the developers
of large applications that need to use the scheduler or memory policy
routines to manage their multi-threaded, or multiple memory node (NUMA)
placement, -separate- from documenting cpuset placement of jobs on cpus
and memory. It's a quite different audience. In so far as possible,
the cpuset code was designed to enable controlling the placement of
jobs without the developer of those jobs, who might be using the
scheduler and memory placement calls, being aware of cpusets -- it's
just a smaller machine available to their job. Migration should also
be transparent to them -- their machine moved, that's all.
Unfortunately there are a couple of details that leak through:
1) big apps using scheduler and memory policy calls often want to
know how "big" their machine is, which changes under cpusets
from the physical size of the system, and
2) the sched_setaffinity, mbind and set_mempolicy calls take hard
physical CPU and Memory Node numbers, which change under migration
non-transparently.
Therefore I have in libcpuset two kinds of routines:
1) a large powerful set used by heavy weight batch schedulers to
provide sophisticated job placement, and
2) a small simple set used by applications that provide an interface
to sched_setaffinity, mbind and set_mempolicy that is virtualized
to the cpuset, providing cpuset relative CPU and Memory Node
numbering and cpuset relative sizes, safely usable from an
application across a migration to different nodes, without
application awareness.
The ancient, Linux 2.4 kernel based, libcpuset on oss.sgi.com is
really ancient and not relevant here. The cpuset mechanism in
Linux 2.6 is a complete redesign from SGI's cpumemset mechanism
for Linux 2.4 kernels.
SGI releases libcpuset under GPL license, though currently I've just
set this up for customers of SGI's software. Someday I hope to get
the current libcpuset up on oss.sgi.com, for all to use.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 18:35 ` Christoph Lameter
@ 2007-05-31 19:29 ` Lee Schermerhorn
0 siblings, 0 replies; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 19:29 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, Andrew Morton, Andi Kleen, Gleb Natapov
On Thu, 2007-05-31 at 11:35 -0700, Christoph Lameter wrote:
> On Thu, 31 May 2007, Lee Schermerhorn wrote:
>
> > > It seems that you are creating some artificial problems here.
> >
> > Christoph: Let me assure you, I'm not persisting in this exchange
> > because I'm enjoying it. Quite the opposite, actually. However, like
> > you, my employer asks me to address our customers' requirements. I'm
> > trying to understand and play within the rules of the community. I
> > attempted this documentation patch to address what I saw as missing
> > documentation and to provide context for further discussion of my patch
> > set.
>
> Could you explain to us what kind of user scenario you are addressing? We
> have repeatedly asked you for that information. I am happy to hear that
> there is an actual customer requirement.
And I've tried to explain without "naming names". Let me try it this
way: A multi-task application that mmap()s a large file--think O(1TB)
or larger--shared. You can think of it as an in-memory data base, but
the size of the file could exceed physical memory. Various tasks of the
application will access various portions of the memory area in different
ways/with different frequencies, ... [sort of like Gleb described]. The
memory region is large enough and cache locality poor enough that
"locality matters".
Why not just use shmem and read the file in at startup? In those cases
where it would fit, it takes quite a while to read a file of this size
in, and processing can't start until it's all in. Perhaps one could
modify the application to carefully sequence the load so other tasks
could get started before it's all in. And, it only works if the access
pattern is known a priori. And, if the file is larger than memory,
you'll need swap space to back it.
You want persistence across runs of the application--e.g., so that you
could suspend it and continue later. You could just write the entire
shmem out at the end, but again, that takes a long time. The
application could keep track of which regions of memory have been
modified and write them out incrementally, but with a mapped file, the
kernel does this automatically [I won't say "for free" ;-)] with an
occasional msync() or if reclaim becomes necessary.
Granted, these last 2 paragraphs describe how a number of large
enterprise data bases work. So, it's not impossible. It IS a lot of
work if you don't need the type of guarantees that those systems
provide.
Why not just use task policy to place the pages? Task policy affects
all task allocations, including stack, heap, ... Better to let those
default to local. Well, why not place the pages, lock them down and
then change the task policy back to default/local? File might not fit;
even if it did, might not want to commit that much memory, ... And,
yes, it seems unnatural to have to jump through these hoops--at least
for customers bringing applications from environments where they didn't
have to. [I know, I know. Functional parity with other systems... Not
a valid reason... Yada yada. ;-)]
>
> > My point was that the description of MPOL_DEFAULT made reference to the
> > zonelists built at boot time, to distinguish them from the custom
> > zonelists built for an MPOL_BIND. Since the zonelist reorder patch
> > hasn't made it out of Andrew's tree yet, I didn't want to refer to it
> > this round of the doc. If it makes it into the tree, I had planned to say
> > something like: "at boot time or on request". I should probably add
> > "or on memory hotplug".
>
> Hmmm... The zonelists for MPOL_BIND are never rebuilt by Kame-san's
> patches. That is a concern.
Yes. And as we noted earlier, even the initial ones don't consider
distance. The latter should be relatively easy to fix, as we have code
that does it for the node zonelists. Would require some generalization.
Rebuilding policy zonelists would require finding them all somehow.
Either an expensive system [cpuset] wide scan or a system-wide/per
cpuset list of [MPOL_BIND] policies. A per cpuset list might reduce the
scope of the rebuild, but you'd have to scan tasks and reparent their
policies when moving them between cpusets. Not pretty either way.
>
> > But, after I see what gets accepted into the man pages that I've agreed
> > to update, I'll consider dropping this section altogether. Maybe the
> > entire document.
>
> I'd be very thankful if you could upgrade the manpages. Andi has some
> patches from me against numactl pending that include manpage
> updates. I can forward those to you.
>
> > > page cache pages are subject to a task's memory policy regardless of how we
> > > get to the page cache page. I think that is pretty consistent.
> >
> > Oh, it's consistent, alright. Just not pretty [;-)] when it's not what
> > the application wants.
>
> I sure hope that we can at some point figure out what your application is
> doing. It's been a hard road to that information so far.
>
I thought I'd explained before. I guess just too abstractly. Maybe the
description above is a bit too abstract as well. However, Gleb's
application has similar requirements--he wants to control the location
of pages in a shared, mmap'ed file using explicit policies. He's even
willing to issue the identical policy calls from each task--something I
don't think he should need to do--to accomplish it. But, it still won't
work for him...
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 18:56 ` Lee Schermerhorn
@ 2007-05-31 20:06 ` Gleb Natapov
2007-05-31 20:43 ` Andi Kleen
0 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-05-31 20:06 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Christoph Lameter, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 02:56:04PM -0400, Lee Schermerhorn wrote:
> > Suppose I have two processes that want to communicate through shared memory.
> > They mmap the same file with MAP_SHARED. Now the first process calls
> > numa_setlocal_memory() on the region where it will receive messages and
> > calls numa_tonode_memory(second process's node id) on the region where it
> > will post messages for the second process. The second process does the
> > same thing. After that, no matter which process touches memory first, the
> > faulted-in pages should be allocated from the correct memory node.
>
> Not as I understand your meaning of "correct memory node". Certainly
> not [necessarily] the one you implied/specified in the numa_*_memory()
> calls.
>
> > Do I
> > miss something here?
>
> I think you do.
OK. It seems I missed the fact that VMA policy is completely ignored for
pagecache backed files and only task policy is used. So prefaulting is
the only option left. Very sad.
> Semantics [with my patches] are as follows:
>
> If you map a file MAP_PRIVATE, policy only gets applied to the calling
> task's address space. I.e., current behavior. It will be ignored by
> page cache allocations. However, if you write to the page, the kernel
> will COW the page, making a private anonymous copy for your task. The
> anonymous COWed page WILL follow the VMA policy you installed, but won't
> be visible to any other task mmap()ing the file--shared or private.
> This is also current behavior.
>
> If you map a file MAP_SHARED and DON'T apply a policy--which covers most
> existing applications, according to Andi--then page cache allocations
> will still default to task policy or system default--again, current
> behavior. Even if you write to the page, because you've mapped shared,
> you keep the page cache page allocated at fault time.
>
> If you map a file shared and apply a policy via mbind() or one of the
> libnuma wrappers you mention above, the policy will "punch through" the
> VMA and be installed on the file's internal incarnation [inode +
> address_space structures], dynamically allocating the necessary
> shared_policy structure. Then, for this file, page_cache allocations
> that hit the range on which you installed a policy will use that shared
> policy. If you don't cover the entire file with your policy, those
> ranges that you don't cover will continue to use task/system default
> policy--just like shmem.
This sounds very reasonable and is actually what I expected from the system
in the first place.
> > > > Man page states:
> > > > Memory policy set for memory areas is shared by all threads of the
> > > > process. Memory policy is also shared by other processes mapping the
> > > > same memory using shmat(2) or mmap(2) from shmfs/hugetlbfs. It is not
> > > > shared for disk backed file mappings right now although that may change
> > > > in the future.
> > > > So what does this mean? If I set local policy for memory region in process
> > > > A it should be obeyed by memory access in process B?
> > >
> > > shmem does, indeed, work this way. Policies installed on ranges of the
> > > shared segment via mbind() are stored with the shared object.
> > >
> > > I think the future is now: time to share policy for disk backed file
> > > mappings.
> > >
> > At least it will be consistent with what you get when shared memory is
> > created via shmget(). It will be very surprising for a programmer if
> > his program's logic breaks just because he changes the way shared
> > memory is created.
>
>
> Yes. A bit inconsistent, from the application programmer's viewpoint.
>
"A bit" is underestimation :)
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 19:25 ` Paul Jackson
@ 2007-05-31 20:22 ` Lee Schermerhorn
0 siblings, 0 replies; 83+ messages in thread
From: Lee Schermerhorn @ 2007-05-31 20:22 UTC (permalink / raw)
To: Paul Jackson; +Cc: Christoph Lameter, linux-mm, akpm, ak
On Thu, 2007-05-31 at 12:25 -0700, Paul Jackson wrote:
> > They have to since they may be used to change page locations when policies
> > are active. There is a libcpuset library that can be used for application
> > control of cpusets. I think Paul would disagree with you here.
>
> In the most common usage, a batch scheduler uses cpusets to control
> a job's memory and placement, and application code within the job uses
> the memory policy calls (mbind, set_mempolicy) and scheduler policy
> call (sched_setaffinity) to manage its detailed placement.
>
<snip>
Paul: Excellent writeup. Thanks. No disrespect implied by the <snip>.
>
> Unfortunately there are a couple of details that leak through:
> 1) big apps using scheduler and memory policy calls often want to
> know how "big" their machine is, which changes under cpusets
> from the physical size of the system, and
> 2) the sched_setaffinity, mbind and set_mempolicy calls take hard
> physical CPU and Memory Node numbers, which change under migration
> non-transparently.
>
> Therefore I have in libcpuset two kinds of routines:
> 1) a large powerful set used by heavy weight batch schedulers to
> provide sophisticated job placement, and
> 2) a small simple set used by applications that provide an interface
> to sched_setaffinity, mbind and set_mempolicy that is virtualized
> to the cpuset, providing cpuset relative CPU and Memory Node
> numbering and cpuset relative sizes, safely usable from an
> application across a migration to different nodes, without
> application awareness.
>
> The ancient, Linux 2.4 kernel based, libcpuset on oss.sgi.com is
> really ancient and not relevant here. The cpuset mechanism in
> Linux 2.6 is a complete redesign from SGI's cpumemset mechanism
> for Linux 2.4 kernels.
I saw this one on the site and it did appear quite old. I haven't come
across libcpuset "in the wild" yet, but I like the notion of cpuset
relative ids ["container namespaces"?]. I'd also be happy if things
like numa_membind() in libnuma returned just the available mems, hard
physical ids and all.
>
> SGI releases libcpuset under GPL license, though currently I've just
> set this up for customers of SGI's software. Someday I hope to get
> the current libcpuset up on oss.sgi.com, for all to use.
I'll be looking for it...
Thanks, again
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 20:06 ` Gleb Natapov
@ 2007-05-31 20:43 ` Andi Kleen
2007-06-01 9:38 ` Gleb Natapov
0 siblings, 1 reply; 83+ messages in thread
From: Andi Kleen @ 2007-05-31 20:43 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Andrew Morton
> > > Do I
> > > miss something here?
> >
> > I think you do.
> OK. It seems I missed the fact that VMA policy is completely ignored for
> pagecache backed files and only task policy is used.
That's not correct. tmpfs is page cache backed and supports (even shared) VMA policy.
hugetlbfs used to as well, but lost that ability; it will hopefully get it back.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-05-31 20:43 ` Andi Kleen
@ 2007-06-01 9:38 ` Gleb Natapov
2007-06-01 10:21 ` Andi Kleen
0 siblings, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-06-01 9:38 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Andrew Morton
On Thu, May 31, 2007 at 10:43:19PM +0200, Andi Kleen wrote:
>
> > > > Do I
> > > > miss something here?
> > >
> > > I think you do.
> > OK. It seems I missed the fact that VMA policy is completely ignored for
> > pagecache backed files and only task policy is used.
>
> That's not correct. tmpfs is page cache backed and supports (even shared) VMA policy.
> hugetlbfs used to as well, but lost that ability; it will hopefully get it back.
>
This is even more confusing. So numa_*_memory() works differently
depending on where the file is created. I can't rely on this anyway and
have to assume that the numa_*_memory() call is ignored, and prefault.
I think Lee's patches should be applied ASAP to fix this inconsistency.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 9:38 ` Gleb Natapov
@ 2007-06-01 10:21 ` Andi Kleen
2007-06-01 12:25 ` Gleb Natapov
2007-06-01 17:15 ` Lee Schermerhorn
0 siblings, 2 replies; 83+ messages in thread
From: Andi Kleen @ 2007-06-01 10:21 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Andrew Morton
On Friday 01 June 2007 11:38:03 Gleb Natapov wrote:
> On Thu, May 31, 2007 at 10:43:19PM +0200, Andi Kleen wrote:
> >
> > > > > Do I
> > > > > miss something here?
> > > >
> > > > I think you do.
> > > OK. It seems I missed the fact that VMA policy is completely ignored for
> > > pagecache backed files and only task policy is used.
> >
> > That's not correct. tmpfs is page cache backed and supports (even shared) VMA policy.
> > hugetlbfs used to as well, but lost that ability; it will hopefully get it back.
> >
> This is even more confusing.
I see. Anything that doesn't work exactly as your particular
application expects it is "unnatural" and "confusing". I suppose only
in Glebnix would it be different.
> So numa_*_memory() works differently
> depending on where the file is created.
See it as "it doesn't work for files, but only for shared memory".
The main reason for that is that there is no way to make it persistent
for files.
I only objected to your page cache based description because tmpfs
(and even anonymous memory) are page cache based too.
> I can't rely on this anyway and
> have to assume that the numa_*_memory() call is ignored, and prefault.
It's either use shared/anonymous memory or process policy.
> I think Lee's patches should be applied ASAP to fix this inconsistency.
They have serious semantic problems.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 10:21 ` Andi Kleen
@ 2007-06-01 12:25 ` Gleb Natapov
2007-06-01 13:09 ` Andi Kleen
2007-06-01 17:15 ` Lee Schermerhorn
1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-06-01 12:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Andrew Morton
On Fri, Jun 01, 2007 at 12:21:32PM +0200, Andi Kleen wrote:
> On Friday 01 June 2007 11:38:03 Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 10:43:19PM +0200, Andi Kleen wrote:
> > >
> > > > > > Do I
> > > > > > miss something here?
> > > > >
> > > > > I think you do.
> > > > OK. It seems I missed the fact that VMA policy is completely ignored for
> > > > pagecache backed files and only task policy is used.
> > >
> > > That's not correct. tmpfs is page cache backed and supports (even shared) VMA policy.
> > > hugetlbfs used to too, but lost its ability, but will hopefully get it again.
> > >
> > This is even more confusing.
>
> I see. Anything that doesn't work exactly as your particular
> application expects it is "unnatural" and "confusing". I suppose only
> in Glebnix it would be different.
Everything that is defined to work on memory, but sometimes doesn't work
because the memory happens to be backed by a file on disk (and not just
any file), is confusing to me. And it's not that it just doesn't work, as
in "returns an error"; it is silently ignored. I don't know what your
definition of "confusing" is. My application doesn't "expect" anything.
Just tell me how I can achieve what I need and I'll change the
application. Creating the shared file on tmpfs is out of my control, so
the fact that NUMA policy happens to work for tmpfs doesn't help me. The
only option left is to create shared memory with shmget(). In your first
reply to me you said that this would not work either, but you never
followed up on my replies with the man page citations.
>
> > So numa_*_memory() works differently
> > depending on where the file is created.
>
> See it as "it doesn't work for files, but only for shared memory".
> The main reason for that is that there is no way to make it persistent
> for files.
>
> I only objected to your page cache based description because tmpfs
> (and even anonymous memory) are page cache based too.
>
You are right. It should have been "disk-backed files", of course.
> > I can't rely on this anyway and
> > have to assume that the numa_*_memory() call is ignored and prefault instead.
>
> It's either use shared/anonymous memory or process policy.
That is where the confusion is. You use the words "shared memory" here.
Is shared memory created with mmap(MAP_SHARED) not "shared" enough?
Suddenly such memory has become a second-class citizen.
>
> > I think Lee's patches should be applied ASAP to fix this inconsistency.
>
> They have serious semantic problems.
>
Can you point me to the thread where this was discussed?
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 12:25 ` Gleb Natapov
@ 2007-06-01 13:09 ` Andi Kleen
0 siblings, 0 replies; 83+ messages in thread
From: Andi Kleen @ 2007-06-01 13:09 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, Andrew Morton
> > > I can't rely on this anyway and
> > > have to assume that the numa_*_memory() call is ignored and prefault instead.
> >
> > It's either use shared/anonymous memory or process policy.
> That is where the confusion is. You use the words "shared memory" here. Is shared
> memory created with mmap(MAP_SHARED) not "shared" enough?
It's file backed.
> > > I think Lee's patches should be applied ASAP to fix this inconsistency.
> >
> > They have serious semantic problems.
> >
> Can you point me to thread where this was discussed?
See the thread following the patches.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 10:21 ` Andi Kleen
2007-06-01 12:25 ` Gleb Natapov
@ 2007-06-01 17:15 ` Lee Schermerhorn
2007-06-01 18:43 ` Christoph Lameter
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-01 17:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: Gleb Natapov, Christoph Lameter, linux-mm, Andrew Morton
On Fri, 2007-06-01 at 12:21 +0200, Andi Kleen wrote:
> On Friday 01 June 2007 11:38:03 Gleb Natapov wrote:
> > On Thu, May 31, 2007 at 10:43:19PM +0200, Andi Kleen wrote:
> > >
> > > > > > Do I
> > > > > > miss something here?
> > > > >
> > > > > I think you do.
> > > > OK. It seems I missed the fact that VMA policy is completely ignored for
> > > > pagecache backed files and only task policy is used.
> > >
> > > That's not correct. tmpfs is page cache backed and supports (even shared) VMA policy.
> > > hugetlbfs used to too, but lost its ability, but will hopefully get it again.
> > >
> > This is even more confusing.
>
> I see. Anything that doesn't work exactly as your particular
> application expects it is "unnatural" and "confusing". I suppose only
> in Glebnix it would be different.
Andi, as you well know, many POSIX-like systems have had NUMA policies
for quite a while. Most of these systems tried to provide consistent
semantics from the application's viewpoint with respect to control of
the policy of memory objects mapped into the application's address space.
It's not particularly difficult to achieve. Your shared policy
infrastructure provides almost everything that's required, as I've
demonstrated.
Like Gleb, I find the different behaviors for different memory regions
to be unnatural. Not because of the fraction of applications or
deployments that might use them, but because [speaking for customers] I
expect and want to be able to control placement of any object mapped
into an application's address space, subject to permissions and
privileges.
>
> > So numa_*_memory() works differently
> > depending on where the file is created.
>
> See it as "it doesn't work for files, but only for shared memory".
> The main reason for that is that there is no way to make it persistent
> for files.
Your definition of persistence seems to be keeping policy around on
files when the application that owns the file doesn't have it open or
mapped. In the context of my customers' applications and, AFAICT,
Gleb's application, your definition of persistence is a red herring.
You're using it to prevent acceptance of behavior we need because the
patches don't address your definition. From what I can tell from the
discussion so far, YOU don't have a need [or know of anyone who does]
for your definition of persistence. You claim you don't know of any use
case for memory policy on memory mapped files at all.
If you do know of a need for file policy persistence at least as good
as shmem's--i.e., persistence that doesn't survive reboot--that could
be added relatively
easily. But you haven't asked for that. You've rejected the notion
that anyone might have a need for policy on memory mapped files without
such persistence. If you want persistence across reboots--i.e.,
attached to the file as some sort of extended attribute--I expect that
could be done, as well. But, that's a file system issue and, IMO,
mbind() is not the right interface. However, such a feature would
require the kernel to support policies on regular disk-backed files as
it does for swap-backed files.
>
> I only objected to your page cache based description because tmpfs
> (and even anonymous memory) are page cache based too.
Then why does Christoph keep insisting that "page cache pages" must
always follow task policy, when shmem, tmpfs and anonymous pages don't
have to?
>
> > I can't rely on this anyway and
> > have to assume that the numa_*_memory() call is ignored and prefault instead.
>
> It's either use shared/anonymous memory or process policy.
>
> > I think Lee's patches should be applied ASAP to fix this inconsistency.
>
> They have serious semantic problems.
Which, except for your persistence red herring, you haven't described.
Go back to my message to Gleb where I described the semantics provided
by my patches and show me where your problems are. And tell us YOUR use
cases for YOUR definition of persistence that you claim is missing.
They must be very compelling if they're worth blocking a capability that
others want to use.
Regards,
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 17:15 ` Lee Schermerhorn
@ 2007-06-01 18:43 ` Christoph Lameter
2007-06-01 19:38 ` Lee Schermerhorn
2007-06-01 20:28 ` Gleb Natapov
0 siblings, 2 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-06-01 18:43 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
> Like Gleb, I find the different behaviors for different memory regions
> to be unnatural. Not because of the fraction of applications or
> deployments that might use them, but because [speaking for customers] I
> expect and want to be able to control placement of any object mapped
> into an application's address space, subject to permissions and
> privileges.
Same here and I wish we had a clean memory region based implementation.
But that is just what your patches do *not* provide. Instead they are file
based. They should be memory region based.
Would you please come up with such a solution?
> Then why does Christoph keep insisting that "page cache pages" must
> always follow task policy, when shmem, tmpfs and anonymous pages don't
> have to?
No, I just said that the page cache handling consistently follows task
policy.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 18:43 ` Christoph Lameter
@ 2007-06-01 19:38 ` Lee Schermerhorn
2007-06-01 19:48 ` Christoph Lameter
2007-06-01 20:28 ` Gleb Natapov
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-01 19:38 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 2007-06-01 at 11:43 -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
>
> > Like Gleb, I find the different behaviors for different memory regions
> > to be unnatural. Not because of the fraction of applications or
> > deployments that might use them, but because [speaking for customers] I
> > expect and want to be able to control placement of any object mapped
> > into an application's address space, subject to permissions and
> > privileges.
>
> Same here and I wish we had a clean memory region based implementation.
> But that is just what your patches do *not* provide. Instead they are file
> based. They should be memory region based.
>
> Would you please come up with such a solution?
Christoph:
I don't understand what you mean by "memory region based".
Linux does not have bona fide "memory objects" that sit between a task's
address space and the backing store--be it swap or regular files--like
some systems I've worked with. Rather, anonymous regions are described
by the vma_struct, and pages backing those regions must be referenced by
one or more ptes or a swap cache entry, or both. For a disk-backed file
mapped into a task address space, the vma points directly to the
inode+address_space structures via the file structure. Shmem regions
attach to a task address space much like regular files--via a pseudo-fs
inode+address_space. I don't know the rationale, but I suspect that Linux
dispenses with the extra memory object layer to conserve memory for
smaller systems. And that's a good thing, IMO.
So, for a shared memory mapped file, the inode+address_space--i.e., the
in-memory incarnation of the file--is as close to a "memory region" as
we have. It contains the mapping between [file/address] offset and
memory page. It's the only object representing the file and its
in-memory pages that gets shared between multiple task address spaces.
That seems, to me, to be the natural place to hang the shared policy.
Indeed, this is where we attach shared policy to shmem/tmpfs/hugetlbfs
pseudo-files.
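To illustrate that shared-policy behavior, a sketch using a
hypothetical tmpfs file (assumes nodes 0 and 1 exist; error handling
omitted). The mbind() installs the policy on the tmpfs file's in-memory
incarnation, so faults taken by any task that maps the file honor it:

#define _GNU_SOURCE
#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    unsigned long mask = 0x3;       /* interleave across nodes 0 and 1 */
    size_t len = 16UL * 1024 * 1024;
    int fd = open("/dev/shm/region", O_CREAT | O_RDWR, 0600); /* tmpfs */
    void *p;

    ftruncate(fd, len);
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    /* The policy attaches to the shmem inode's shared policy, not
     * just this task's VMA. */
    mbind(p, len, MPOL_INTERLEAVE, &mask, sizeof(mask) * 8, 0);
    return 0;
}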
Even if we had a layer between the vma's and the files/inodes, I don't
see what that would buy us. We'd still want to maintain coherency
between files accessed via file descriptor function calls and files
mapped via mmap(SHARED). That's one of the purposes of a shared page
cache. [I've seen unix variants where these weren't coherent. Now
THAT's unnatural ;-)!] So, yes, any policy applied to the memory mapped
file affects the location of pages accessed via file descriptor access.
That's a good thing for applications that use shared mapped files.
The load/store access by the application that maps the file, and goes to
the trouble of specifying memory policy, takes precedence. Load/store
is the "fast path". File descriptor access system calls are the slow
path.
You're usually gung-ho about locality on a NUMA platform, avoiding off
node access or page allocations, respecting the fast path, ... Why the
resistance here?
>
> > Then why does Christoph keep insisting that "page cache pages" must
> > always follow task policy, when shmem, tmpfs and anonymous pages don't
> > have to?
>
> No, I just said that the page cache handling consistently follows task
> policy.
Well, not for anon, shmem, tmpfs, ... page cache pages. All of those
are page cache based, according to Andi, and they certainly aren't
constrained to "consistently follow task policy".
Of course, I'm just being facetious [and, no doubt, annoying] to make a
point. We're using the same words, sometimes referring to the same
concepts, but in slightly different context and "talking past each
other". I'm trying real hard to believe that this is what's happening
in this entire exchange. That's the most benign reason I can come up
with...
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 19:38 ` Lee Schermerhorn
@ 2007-06-01 19:48 ` Christoph Lameter
2007-06-01 21:05 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-06-01 19:48 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
> > Same here and I wish we had a clean memory region based implementation.
> > But that is just what your patches do *not* provide. Instead they are file
> > based. They should be memory region based.
> >
> > Would you please come up with such a solution?
>
> Christoph:
>
> I don't understand what you mean by "memory region based".
Memory policies control allocations for regions of memory of a
process. They are not file-based policies (they may have been on Tru64).
> So, for a shared memory mapped file, the inode+address_space--i.e., the
> in-memory incarnation of the file--is as close to a "memory region" as
Not at all. Consider a memory region mmapped by a database. The database
is running on nodes 5-8 and has specified an interleave policy for the
data.
Now another process starts on node 1, and it also mmaps the same
file used by the database. It specifies allocation on node 1 and then
terminates.
Now the database will attempt to satisfy its big memory needs from node 1?
This scheme is not working.
> You're usually gung-ho about locality on a NUMA platform, avoiding off
> node access or page allocations, respecting the fast path, ... Why the
> resistance here?
Yes I want consistent memory policies. There are already consistency
issues that need to be solved. Forcing in a Tru64 concept of file memory
allocation policies will just make the situation worse.
And shmem is not really something that should be taken as a general rule.
Shmem allocations can be controlled via a kernel boot option. They exist
even after a process terminates, etc.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 18:43 ` Christoph Lameter
2007-06-01 19:38 ` Lee Schermerhorn
@ 2007-06-01 20:28 ` Gleb Natapov
2007-06-01 20:45 ` Christoph Lameter
1 sibling, 1 reply; 83+ messages in thread
From: Gleb Natapov @ 2007-06-01 20:28 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, Andi Kleen, linux-mm, Andrew Morton
On Fri, Jun 01, 2007 at 11:43:57AM -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
>
> > Like Gleb, I find the different behaviors for different memory regions
> > to be unnatural. Not because of the fraction of applications or
> > deployments that might use them, but because [speaking for customers] I
> > expect and want to be able to control placement of any object mapped
> > into an application's address space, subject to permissions and
> > privileges.
>
> Same here and I wish we had a clean memory region based implementation.
> But that is just what your patches do *not* provide. Instead they are file
> based. They should be memory region based.
Do you want a solution that doesn't associate memory policy with a file
(if a file is mapped shared and disk backed) like Lee's solution does, but
instead installs it into the VMA and respects the policy during page cache
page allocation on behalf of the process? So two processes would have to
cooperate (bind the same part of a file to the same memory node in each
process) to get a consistent result? If so, this will work for me.
I really hate to use shmget(), for all the reasons you've listed in your
other mail and some more.
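A sketch of that cooperation scheme (illustrative only--it assumes the
proposed behavior, not what a stock 2.6.21 kernel does for disk-backed
files): each cooperating process would make the same call on its own
mapping of the file.

#include <stddef.h>
#include <numaif.h>

/* Install the identical VMA policy over the same file range in each
 * cooperating process, after each has mmap'd the file. */
static int bind_my_view(void *base, size_t len, int node)
{
    unsigned long mask = 1UL << node;

    return mbind(base, len, MPOL_BIND, &mask, sizeof(mask) * 8, 0);
}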
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 20:28 ` Gleb Natapov
@ 2007-06-01 20:45 ` Christoph Lameter
2007-06-01 21:10 ` Lee Schermerhorn
2007-06-02 7:23 ` Gleb Natapov
0 siblings, 2 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-06-01 20:45 UTC (permalink / raw)
To: Gleb Natapov; +Cc: Lee Schermerhorn, Andi Kleen, linux-mm, Andrew Morton
On Fri, 1 Jun 2007, Gleb Natapov wrote:
> > Same here and I wish we had a clean memory region based implementation.
> > But that is just what your patches do *not* provide. Instead they are file
> > based. They should be memory region based.
> Do you want a solution that doesn't associate memory policy with a file
> (if a file is mapped shared and disk backed) like Lee's solution does, but
> instead installs it into the VMA and respects the policy during page cache
> page allocation on behalf of the process? So two processes would have to cooperate
Right.
> (bind the same part of a file to the same memory node in each process) to get
> a consistent result? If so, this will work for me.
Yes.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 19:48 ` Christoph Lameter
@ 2007-06-01 21:05 ` Lee Schermerhorn
2007-06-01 21:56 ` Christoph Lameter
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-01 21:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 2007-06-01 at 12:48 -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
>
> > > Same here and I wish we had a clean memory region based implementation.
> > > But that is just what your patches do *not* provide. Instead they are file
> > > based. They should be memory region based.
> > >
> > > Would you please come up with such a solution?
> >
> > Christoph:
> >
> > I don't understand what you mean by "memory region based".
>
> Memory policies control allocations for regions of memory of a
> process. They are not file-based policies (they may have been on Tru64).
By "regions of memory of a process" do you mean VMAs? These are not
shared between processes, so installing a policy in a VMA of one task
will not affect pages faulted in by other cooperating tasks of the
application.
Actually, in Tru64, policies were attached to those "memory
objects"--separate from the inode, but still shared by all mappings of
the file in separate tasks. [Bob Picco's design, IIRC.] It doesn't matter
where you attach the policies. You need to share them between tasks and
they need to control allocations of pages for the mapping--pages that
happen to live in the page cache.
>
> > So, for a shared memory mapped file, the inode+address_space--i.e., the
> > in-memory incarnation of the file--is as close to a "memory region" as
>
> Not at all. Consider a memory region mmapped by a database. The database
> is running on nodes 5-8 and has specified an interleave policy for the
> data.
If the memory region is a shared mmap'd file and the database consists
of multiple tasks, you can't do this today [if you don't want to prefault
in the entire file]--especially if you want to keep your task policy
default/local so that task heap and stack pages stay local.
Maybe you're thinking of a multithreaded task? You're right. You don't
need shared policy. You've only got one address space mapping the file.
And one page table... Somewhat problematic on NUMA systems, as you've
pointed out in the context of Nick's page cache replication patch/rfc.
One reason to use separate tasks sharing files and shmem on a NUMA
system.
>
> Now another process starts on node 1, and it also mmaps the same
> file used by the database. It specifies allocation on node 1 and then
> terminates.
>
> Now the database will attempt to satisfy its big memory needs from node 1?
>
> This scheme is not working.
Red Herring. The same scenario can occur with shmem today. And don't
try to play the "shmem is different" card. For this scenario, they're
the same. If "node 1 task" can mmap your file and specify a different
policy, it can attach your shmem segment and specify a different policy,
with the same result.
And, why would the task on node 1 do that? In this scenario, these are
not cooperating tasks; or it's an application bug. You want to penalize
well-behaved, cooperating tasks that are part of a single application,
sharing application-private files, because you can come up with scenarios
based on non-cooperating or buggy tasks to which you've allowed access
to your application's files?
As it stands today, and as we've been discussing with Gleb, a multitask
application cannot map a file shared and place different ranges on
different nodes reliably without prefaulting in all of the pages. Gleb
was even willing to install the identical policies from each
task--something I don't think he should have to do--but even this would
not achieve his desired results. This is a much more serious shortcoming
than the scenario you describe above. We CAN prevent your scenario:
just don't give non-cooperating tasks access to files whose
policy/location you care about--same as for shmem.
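For reference, "prefaulting in all of the pages" to place successive
ranges of a shared mapping on successive nodes looks roughly like this
(a sketch; assumes nodes 0 through nnodes-1 are online, error handling
omitted):

#include <numaif.h>
#include <stddef.h>
#include <unistd.h>

/* For each node n: bind the task policy to n, then touch every page
 * of chunk n so its page cache pages are allocated there. */
static void place_ranges(char *base, size_t len, int nnodes)
{
    size_t chunk = len / nnodes;
    long pg = sysconf(_SC_PAGESIZE);
    size_t off;
    int n;

    for (n = 0; n < nnodes; n++) {
        unsigned long mask = 1UL << n;

        set_mempolicy(MPOL_BIND, &mask, sizeof(mask) * 8);
        for (off = 0; off < chunk; off += pg)
            (void)*(volatile char *)(base + (size_t)n * chunk + off);
    }
    set_mempolicy(MPOL_DEFAULT, NULL, 0);
}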
>
> > You're usually gung-ho about locality on a NUMA platform, avoiding off
> > node access or page allocations, respecting the fast path, ... Why the
> > resistance here?
>
> Yes I want consistent memory policies. There are already consistency
> issues that need to be solved. Forcing in a Tru64 concept of file memory
> allocation policies will just make the situation worse.
It's NOT a Tru64 concept, Christoph. Another Red Herring. It's about
consistent support of memory policies on any object that I can map into
my address space. And if that object is a disk-based file that lives
in the page cache, and we want to preserve coherency between file
descriptor and shared, memory mapped access [believe me, we do], then
the policy applied to the object needs to affect all page allocations
for that file--even those caused by non-cooperating or buggy tasks, if
we allow them access to the files.
>
> And shmem is not really something that should be taken as a general rule.
I disagree. The shared policy support that shmem has is exactly what I
want for shared mmap'd files. I'm willing to deal with the same issues that
shmem has in order to get shared, mapped file semantics for my shared
regions.
> Shmem allocations can be controlled via a kernel boot option. They exist
> even after a process terminates, etc.
Once again. If you have a use case for shared file policies persisting
after the process terminates [and I suspect not, 'cause you don't even
want them in the first place] then raise that as a requirement. We can
add that--as a subsequent patch. If you have a use case for policies
persisting over system reboot [shmem policies don't, by the way], I
expect the file system folks could come up with a way to attach policies
to files that get loaded when the file is opened or when mmap'ed. It
would still require the in-kernel mechanism to attach policies to the
in-memory structure[s]. This capability is useful without either.
And, Christoph, again, adding shared policy support to shared file
mappings doesn't add any warts or inconsistent behavior that isn't
already there with policy applied to mmap'ed files. Default behavior is
the same--wart-for-wart. Yes, shared policies on mmaped files will have
the same risks as shared policy on shmem does today--e.g., your
scenario--but we find the shared policies on shmem useful enough that
we've all been willing to manage that.
Later,
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 20:45 ` Christoph Lameter
@ 2007-06-01 21:10 ` Lee Schermerhorn
2007-06-01 21:58 ` Christoph Lameter
2007-06-02 7:23 ` Gleb Natapov
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-01 21:10 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Gleb Natapov, Andi Kleen, linux-mm, Andrew Morton
On Fri, 2007-06-01 at 13:45 -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Gleb Natapov wrote:
>
> > > Same here and I wish we had a clean memory region based implementation.
> > > But that is just what your patches do *not* provide. Instead they are file
> > > based. They should be memory region based.
> > Do you want a solution that doesn't associate memory policy with a file
> > (if a file is mapped shared and disk backed) like Lee's solution does, but
> > instead installs it into the VMA and respects the policy during page cache
> > page allocation on behalf of the process? So two processes would have to cooperate
>
> Right.
>
> > (bind the same part of a file to the same memory node in each process) to get
> > a consistent result? If so, this will work for me.
>
> Yes.
But what if the processes install different policies... if they're NOT
cooperating? This was your previous objection. In fact, you've used
just the scenario that Gleb describes as an objection--that different
tasks could have different policies in their address spaces. Not a
problem if the policy is shared. Let one task do the setup. Done! It
just works. Keep those uncooperative tasks away from your file.
What happened to consistency? ;-)
Lee
* [PATCH] enhance memory policy sys call man pages v1
2007-05-31 8:20 ` Michael Kerrisk
2007-05-31 14:49 ` Lee Schermerhorn
@ 2007-06-01 21:15 ` Lee Schermerhorn
2007-07-23 6:11 ` Michael Kerrisk
` (3 more replies)
1 sibling, 4 replies; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-01 21:15 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, akpm, linux-mm, clameter
Subject was: Re: [PATCH] Document Linux Memory Policy
On Thu, 2007-05-31 at 10:20 +0200, Michael Kerrisk wrote:
> > > > The docs are wrong. This is fully supported.
> > >
> > > Yes, I gave up on that one and the warning in the manpage should be
> > > probably dropped
> >
> > OK. I'll work with the man page maintainers.
>
> Hi Lee,
>
> If you could write a patch for the man page, that would be ideal.
> Location of current tarball is in the .sig.
[PATCH] enhance memory policy sys call man pages v1
Against man pages 2.51
This patch enhances the 3 memory policy system call man pages
to add descriptions of missing semantics, error return values,
etc. The descriptions match the semantics of the kernel circa
2.6.21/22, as gleaned from the source code.
I have changed the "policy" parameter to "mode" throughout the
descriptions in an attempt to promote the concept that the memory
policy is a tuple consisting of a mode and an optional set of nodes.
This also matches the internal name and the <numaif.h> prototypes for
mbind() and set_mempolicy().
I think I've covered all of the existing errno returns, but may
have missed a few.
These pages definitely need proofing by other sets of eyes...
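To illustrate the mode-plus-nodemask tuple at a call site (a sketch,
not part of the patch; assumes nodes 0-2 exist):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
    unsigned long nodemask = 0x7;   /* nodes 0, 1 and 2 */
    size_t len = 4UL * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* The tuple: mode = MPOL_INTERLEAVE, nodes = {0,1,2}. */
    return mbind(p, len, MPOL_INTERLEAVE, &nodemask,
                 sizeof(nodemask) * 8, 0) ? 1 : 0;
}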
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
man2/get_mempolicy.2 | 222 +++++++++++++++++++++------------
man2/mbind.2 | 335 +++++++++++++++++++++++++++++++++++++--------------
man2/set_mempolicy.2 | 173 +++++++++++++++++++++-----
3 files changed, 526 insertions(+), 204 deletions(-)
Index: Linux/man2/mbind.2
===================================================================
--- Linux.orig/man2/mbind.2 2007-05-11 19:07:02.000000000 -0400
+++ Linux/man2/mbind.2 2007-06-01 12:28:06.000000000 -0400
@@ -18,15 +18,16 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
-.TH MBIND 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
+.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
.SH NAME
mbind \- Set memory policy for a memory range
.SH SYNOPSIS
.nf
.B "#include <numaif.h>"
.sp
-.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
+.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
.BI " unsigned long *" nodemask ", unsigned long " maxnode ,
.BI " unsigned " flags );
.sp
@@ -34,76 +35,179 @@ mbind \- Set memory policy for a memory
.fi
.SH DESCRIPTION
.BR mbind ()
-sets the NUMA memory
-.I policy
+sets the NUMA memory policy,
+which consists of a policy mode and zero or more nodes,
for the memory range starting with
.I start
and continuing for
.IR len
bytes.
The memory of a NUMA machine is divided into multiple nodes.
-The memory policy defines in which node memory is allocated.
+The memory policy defines from which node memory is allocated.
+
+If the memory range specified by the
+.IR start " and " len
+arguments includes an "anonymous" region of memory\(emthat is
+a region of memory created using the
+.BR mmap (2)
+system call with the
+.BR MAP_ANONYMOUS \(emor
+a memory mapped file, mapped using the
+.BR mmap (2)
+system call with the
+.B MAP_PRIVATE
+flag, pages will only be allocated according to the specified
+policy when the application writes [stores] to the page.
+For anonymous regions, an initial read access will use a shared
+page in the kernel containing all zeros.
+For a file mapped with
+.BR MAP_PRIVATE ,
+an initial read access will allocate pages according to the
+process policy of the process that causes the page to be allocated.
+This may not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a memory mapped file,
+mapped using the
+.BR mmap (2)
+system call with the
+.B MAP_SHARED
+flag, the specified policy will be ignored for all page allocations
+in this range.
+Rather the pages will be allocated according to the process policy
+of the process that caused the page to be allocated.
+Again, this may not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a shared memory region
+created using the
+.BR shmget (2)
+system call and attached using the
+.BR shmat (2)
+system call,
+pages allocated for the anonymous or shared memory region will
+be allocated according to the policy specified, regardless which
+process attached to the shared memory segment causes the allocation.
+If, however, the shared memory region was created with the
+.B SHM_HUGETLB
+flag,
+the huge pages will be allocated according to the policy specified
+only if the page allocation is caused by the task that calls
+.BR mbind ()
+for that region.
+
+By default,
.BR mbind ()
only has an effect for new allocations; if the pages inside
the range have been already touched before setting the policy,
then the policy has no effect.
+This default behavior may be overridden by the
+.BR MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+flags described below.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
-and
+.B MPOL_INTERLEAVE
+or
.BR MPOL_PREFERRED .
-All policies except
+All policy modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
.I nodemask
-parameter.
+parameter,
+the node or nodes to which the mode applies.
+
.I nodemask
-is a bitmask of nodes containing up to
+points to a bitmask of nodes containing up to
.I maxnode
bits.
-The actual number of bytes transferred via this argument
-is rounded up to the next multiple of
+The bit mask size is rounded to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
-A NULL argument means an empty set of nodes.
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to use the underlying process policy
-(which can be modified with
-.BR set_mempolicy (2)).
-Unless the process policy has been changed this means to allocate
-memory on the node of the CPU that triggered the allocation.
+mode specifies the default policy.
+When applied to a range of memory via
+.IR mbind (),
+this means to use the process policy,
+which may have been set with
+.BR set_mempolicy (2).
+If the mode of the process policy is also
+.B MPOL_DEFAULT
+pages will be allocated on the node of the CPU that triggers the allocation.
+For
+.BR MPOL_DEFAULT ,
+the
.I nodemask
-should be specified as NULL.
+and
+.I maxnode
+arguments must specify the empty set of nodes.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
-nodes specified in
+mode specifies a strict policy that restricts memory allocation to
+the nodes specified in
.IR nodemask .
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
There won't be allocations on other nodes.
+The
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
+mode specifies that page allocations be interleaved across the
+set of nodes specified in
.IR nodemask .
-This optimizes for bandwidth instead of latency.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+at least 1MB or bigger with a fairly uniform access pattern.
+Accesses to a single page of the area will still be limited to
+the memory bandwidth of a single node.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
+The kernel will try to allocate pages from this
node first and fall back to other nodes if the
preferred nodes is low on free memory.
-Only the first node in the
+If
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation).
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
+.I nodemask
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation.
+This is the only way to specify "local allocation" for a
+range of memory via
+.IR mbind (2).
If
.B MPOL_MF_STRICT
@@ -115,17 +219,18 @@ is not
.BR MPOL_DEFAULT ,
then the call will fail with the error
.B EIO
-if the existing pages in the mapping don't follow the policy.
-In 2.6.16 or later the kernel will also try to move pages
-to the requested node with this flag.
+if the existing pages in the memory range don't follow the policy.
+.\" According to the kernel code, the following is not true --lts
+.\" In 2.6.16 or later the kernel will also try to move pages
+.\" to the requested node with this flag.
If
.B MPOL_MF_MOVE
-is passed in
+is specified in
.IR flags ,
-then an attempt will be made to
-move all the pages in the mapping so that they follow the policy.
-Pages that are shared with other processes are not moved.
+then the kernel will attempt to move all the existing pages
+in the memory range so that they follow the policy.
+Pages that are shared with other processes will not be moved.
If
.B MPOL_MF_STRICT
is also specified, then the call will fail with the error
@@ -136,8 +241,8 @@ If
.B MPOL_MF_MOVE_ALL
is passed in
.IR flags ,
-then all pages in the mapping will be moved regardless of whether
-other processes use the pages.
+then the kernel will attempt to move all existing pages in the memory range
+regardless of whether other processes use the pages.
The calling process must be privileged
.RB ( CAP_SYS_NICE )
to use this flag.
@@ -146,6 +251,7 @@ If
is also specified, then the call will fail with the error
.B EIO
if some pages could not be moved.
+.\" ---------------------------------------------------------------
.SH RETURN VALUE
On success,
.BR mbind ()
@@ -153,11 +259,9 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
+.\" ---------------------------------------------------------------
.SH ERRORS
-.TP
-.B EFAULT
-There was a unmapped hole in the specified memory range
-or a passed pointer was not valid.
+.\" I think I got all of the error returns. --lts
.TP
.B EINVAL
An invalid value was specified for
@@ -169,53 +273,102 @@ or
was less than
.IR start ;
or
-.I policy
-was
-.B MPOL_DEFAULT
+.I start
+is not a multiple of the system page size.
+Or,
+.I mode
+is
+.I MPOL_DEFAULT
and
.I nodemask
-pointed to a non-empty set;
+specified a non-empty set;
or
-.I policy
-was
-.B MPOL_BIND
+.I mode
+is
+.I MPOL_BIND
or
-.B MPOL_INTERLEAVE
+.I MPOL_INTERLEAVE
and
.I nodemask
-pointed to an empty set,
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+Or, there was an unmapped hole in the specified memory range.
.TP
.B ENOMEM
-System out of memory.
+Insufficient kernel memory was available.
.TP
.B EIO
.B MPOL_MF_STRICT
was specified and an existing page was already on a node
-that does not follow the policy.
+that does not follow the policy;
+or
+.B MPOL_MF_MOVE
+or
+.B MPOL_MF_MOVE_ALL
+was specified and the kernel was unable to move all existing
+pages in the range.
+.TP
+.B EPERM
+The
+.I flags
+argument included the
+.B MPOL_MF_MOVE_ALL
+flag and the caller does not have the
+.B CAP_SYS_NICE
+privilege.
+.\" ---------------------------------------------------------------
.SH NOTES
-NUMA policy is not supported on file mappings.
+NUMA policy is not supported on a memory mapped file range
+that was mapped with the
+.I MAP_SHARED
+flag.
.B MPOL_MF_STRICT
-is ignored on huge page mappings right now.
+is ignored on huge page mappings.
-It is unfortunate that the same flag,
+The
.BR MPOL_DEFAULT ,
-has different effects for
+mode has different effects for
.BR mbind (2)
and
.BR set_mempolicy (2).
-To select "allocation on the node of the CPU that
-triggered the allocation" (like
-.BR set_mempolicy (2)
-.BR MPOL_DEFAULT )
-when calling
+When
+.B MPOL_DEFAULT
+is specified for a range of memory using
.BR mbind (),
+any pages subsequently allocated for that range will use
+the process' policy, as set by
+.BR set_mempolicy (2).
+This effectively removes the explicit policy from the
+specified range.
+To select "local allocation" for a memory range,
specify a
-.I policy
+.I mode
of
.B MPOL_PREFERRED
-with an empty
-.IR nodemask .
+with an empty set of nodes.
+This method will work for
+.BR set_mempolicy (2),
+as well.
+.\" ---------------------------------------------------------------
.SH "VERSIONS AND LIBRARY SUPPORT"
The
.BR mbind (),
@@ -226,16 +379,18 @@ system calls were added to the Linux ker
They are only available on kernels compiled with
.BR CONFIG_NUMA .
-Support for huge page policy was added with 2.6.16.
-For interleave policy to be effective on huge page mappings the
-policied memory needs to be tens of megabytes or larger.
-
-.B MPOL_MF_MOVE
-and
-.B MPOL_MF_MOVE_ALL
-are only available on Linux 2.6.16 and later.
+You can link with
+.I -lnuma
+to get system call definitions.
+.I libnuma
+and the required
+.I numaif.h
+header
+are available in the
+.I numactl
+package.
-These system calls should not be used directly.
+However, applications should not use these system calls directly.
Instead, the higher level interface provided by the
.BR numa (3)
functions in the
@@ -245,17 +400,21 @@ The
.I numactl
package is available at
.IR ftp://ftp.suse.com/pub/people/ak/numa/ .
-
-You can link with
-.I -lnuma
-to get system call definitions.
-.I libnuma
-is available in the
-.I numactl
+The package is also included in some Linux distributions.
+Some distributions include the development library and header
+in the separate
+.I numactl-devel
package.
-This package also has the
-.I numaif.h
-header.
+
+Support for huge page policy was added with 2.6.16.
+For interleave policy to be effective on huge page mappings the
+policied memory needs to be tens of megabytes or larger.
+
+.B MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+are only available on Linux 2.6.16 and later.
+
.SH CONFORMING TO
This system call is Linux specific.
.SH SEE ALSO
@@ -263,4 +422,6 @@ This system call is Linux specific.
.BR numactl (8),
.BR set_mempolicy (2),
.BR get_mempolicy (2),
-.BR mmap (2)
+.BR mmap (2),
+.BR shmget (2),
+.BR shmat (2).
Index: Linux/man2/get_mempolicy.2
===================================================================
--- Linux.orig/man2/get_mempolicy.2 2007-04-12 18:42:49.000000000 -0400
+++ Linux/man2/get_mempolicy.2 2007-06-01 12:29:00.000000000 -0400
@@ -18,6 +18,7 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
.TH GET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
.SH SYNOPSIS
@@ -26,9 +27,11 @@ get_mempolicy \- Retrieve NUMA memory po
.B "#include <numaif.h>"
.nf
.sp
-.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
+.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode ", unsigned long " addr ,
.BI " unsigned long " flags );
+.sp
+.BI "cc ... \-lnuma"
.fi
.\" TBD rewrite this. it is confusing.
.SH DESCRIPTION
@@ -39,7 +42,7 @@ depending on the setting of
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
the process.
If
@@ -58,58 +61,75 @@ then information is returned about the p
address given in
.IR addr .
This policy may be different from the process's default policy if
-.BR set_mempolicy (2)
-has been used to establish a policy for the page containing
+.BR mbind (2)
+or one of the helper functions described in
+.BR numa(3)
+has been used to establish a policy for the memory range containing
.IR addr .
-If
-.I policy
-is not NULL, then it is used to return the policy.
+If the
+.I mode
+argument is not NULL, then
+.IR get_mempolicy ()
+will store the policy mode of the requested NUMA policy in the location
+pointed to by this argument.
If
.IR nodemask
-is not NULL, then it is used to return the nodemask associated
-with the policy.
+is not NULL, then the nodemask associated with the policy will be stored
+in the location pointed to by this argument.
.I maxnode
-is the maximum bit number plus one that can be stored into
-.IR nodemask .
-The bit number is always rounded to a multiple of
-.IR "unsigned long" .
-.\"
-.\" If
-.\" .I flags
-.\" specifies both
-.\" .B MPOL_F_NODE
-.\" and
-.\" .BR MPOL_F_ADDR ,
-.\" then
-.\" .I policy
-.\" instead returns the number of the node on which the address
-.\" .I addr
-.\" is allocated.
-.\"
-.\" If
-.\" .I flags
-.\" specifies
-.\" .B MPOL_F_NODE
-.\" but not
-.\" .BR MPOL_F_ADDR ,
-.\" and the process's current policy is
-.\" .BR MPOL_INTERLEAVE ,
-.\" then
-.\" checkme: Andi's text below says that the info is returned in
-.\" 'nodemask', not 'policy':
-.\" .I policy
-.\" instead returns the number of the next node that will be used for
-.\" interleaving allocation.
-.\" FIXME .
-.\" The other valid flag is
-.\" .I MPOL_F_NODE.
-.\" It is only valid when the policy is
-.\" .I MPOL_INTERLEAVE.
-.\" In this case not the interleave mask, but an unsigned long with the next
-.\" node that would be used for interleaving is returned in
-.\" .I nodemask.
-.\" Other flag values are reserved.
+specifies the number of node ids
+that can be stored into
+.IR nodemask \(emthat
+is, the maximum node id plus one.
+The value specified by
+.I maxnode
+is always rounded to a multiple of
+.IR "sizeof(unsigned long)" .
+
+If
+.I flags
+specifies both
+.B MPOL_F_NODE
+and
+.BR MPOL_F_ADDR ,
+.IR get_mempolicy ()
+will return the node id of the node on which the address
+.I addr
+is allocated into the location pointed to by
+.IR mode .
+If no page has yet been allocated for the specified address,
+.IR get_mempolicy ()
+will allocate a page as if the process had performed a read
+[load] access to that address, and return the id of the node
+where that page was allocated.
+
+If
+.I flags
+specifies
+.BR MPOL_F_NODE ,
+but not
+.BR MPOL_F_ADDR ,
+and the process's current policy is
+.BR MPOL_INTERLEAVE ,
+then
+.IR get_mempolicy ()
+will return in the location pointed to by a non-NULL
+.I mode
+argument,
+the node id of the next node that will be used for
+interleaving of internal kernel pages allocated on behalf of the process.
+.\" Note: code returns next interleave node via 'mode' argument -lts
+These allocations include pages for memory mapped files in
+process memory ranges mapped using the
+.IR mmap (2)
+call with the
+.I MAP_PRIVATE
+flag for read accesses, and in memory ranges mapped with the
+.I MAP_SHARED
+flag for all accesses.
+
+Other flag values are reserved.
For an overview of the possible policies see
.BR set_mempolicy (2).
@@ -120,40 +140,77 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME writeme -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I nodemask
-.\" is non-NULL, and
-.\" .I maxnode
-.\" is too small;
-.\" or
-.\" .I flags
-.\" specified values other than
-.\" .B MPOL_F_NODE
-.\" or
-.\" .BR MPOL_F_ADDR ;
-.\" or
-.\" .I flags
-.\" specified
-.\" .B MPOL_F_ADDR
-.\" and
-.\" .I addr
-.\" is NULL.
-.\" (And there are other EINVAL cases.)
+.SH ERRORS
+.TP
+.B EINVAL
+The value specified by
+.I maxnode
+is less than the number of node ids supported by the system.
+Or
+.I flags
+specified values other than
+.B MPOL_F_NODE
+or
+.BR MPOL_F_ADDR ;
+or
+.I flags
+specified
+.B MPOL_F_ADDR
+and
+.I addr
+is NULL,
+or
+.I flags
+did not specify
+.B MPOL_F_ADDR
+and
+.I addr
+is not NULL.
+Or,
+.I flags
+specified
+.B MPOL_F_NODE
+but not
+.B MPOL_F_ADDR
+and the current process policy is not
+.BR MPOL_INTERLEAVE .
+(And there are other EINVAL cases.)
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
.SH NOTES
-This manual page is incomplete:
-it does not document the details the
-.BR MPOL_F_NODE
-flag,
-which modifies the operation of
-.BR get_mempolicy ().
-This is deliberate: this flag is not intended for application use,
-and its operation may change or it may be removed altogether in
-future kernel versions.
-.B Do not use it.
+If the mode of the process policy or the policy governing allocations at the
+specified address is
+.I MPOL_PREFERRED
+and this policy was installed with an empty
+.IR nodemask \(emspecifying
+local allocation,
+.IR get_mempolicy ()
+will return the mask of on-line node ids in the location pointed to by
+a non-NULL
+.I nodemask
+argument.
+This mask does not take into consideration any administratively imposed
+restrictions on the process' context.
+.\" "context" above refers to cpusets. No man page to reference. --lts
+
+.\" Christoph says the following is untrue. These are "fully supported."
+.\" Andi concedes that he has lost this battle and approves [?]
+.\" updating the man pages to document the behavior. --lts
+.\" This manual page is incomplete:
+.\" it does not document the details the
+.\" .BR MPOL_F_NODE
+.\" flag,
+.\" which modifies the operation of
+.\" .BR get_mempolicy ().
+.\" This is deliberate: this flag is not intended for application use,
+.\" and its operation may change or it may be removed altogether in
+.\" future kernel versions.
+.\" .B Do not use it.
.SH "VERSIONS AND LIBRARY SUPPORT"
See
.BR mbind (2).
@@ -161,6 +218,7 @@ See
This system call is Linux specific.
.SH SEE ALSO
.BR mbind (2),
+.BR mmap (2),
.BR set_mempolicy (2),
.BR numactl (8),
.BR numa (3)
Index: Linux/man2/set_mempolicy.2
===================================================================
--- Linux.orig/man2/set_mempolicy.2 2007-04-12 18:42:49.000000000 -0400
+++ Linux/man2/set_mempolicy.2 2007-06-01 12:28:49.000000000 -0400
@@ -18,6 +18,7 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
.TH SET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
.SH NAME
@@ -26,80 +27,141 @@ set_mempolicy \- set default NUMA memory
.nf
.B "#include <numaif.h>"
.sp
-.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
+.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode );
+.sp
+.BI "cc ... \-lnuma"
.fi
.SH DESCRIPTION
.BR set_mempolicy ()
-sets the NUMA memory policy of the calling process to
-.IR policy .
+sets the NUMA memory policy of the calling process,
+which consists of a policy mode and zero or more nodes,
+to the values specified by the
+.IR mode ,
+.I nodemask
+and
+.IR maxnode
+arguments.
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
the process.
-This system call defines the default policy for the process;
-in addition a policy can be set for specific memory ranges using
+This system call defines the default policy for the process.
+The process policy governs allocation of pages in the process'
+address space outside of memory ranges
+controlled by a more specific policy set by
.BR mbind (2).
+The process default policy also controls allocation of any pages for
+memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag and that are only read [loaded] from by the task
+and of memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_SHARED
+flag, regardless of the access type.
The policy is only applied when a new page is allocated
for the process.
For anonymous memory this is when the page is first
touched by the application.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
+.B MPOL_INTERLEAVE
+or
.BR MPOL_PREFERRED .
-All policies except
+All modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
.I nodemask
-parameter.
+parameter
+one or more nodes.
+
.I nodemask
-is pointer to a bit field of nodes that contains up to
+points to a bit mask of node ids that contains up to
.I maxnode
bits.
-The bit field size is rounded to the next multiple of
+The bit mask size is rounded to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to allocate memory locally,
+mode is the default and means to allocate memory locally,
i.e., on the node of the CPU that triggered the allocation.
.I nodemask
-should be specified as NULL.
+must be specified as NULL.
+If the "local node" contains no free memory, the system will
+attempt to allocate memory from a "nearby" node.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
+mode defines a strict policy that restricts memory allocation to the
nodes specified in
.IR nodemask .
-There won't be allocations on other nodes.
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
+Pages will not be allocated from any node not specified in the
+.IR nodemask .
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
-.IR nodemask .
-This optimizes for bandwidth instead of latency.
-To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+interleaves page allocations across the nodes specified in
+.I nodemask
+in numeric node id order.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
+However, accesses to a single page will still be limited to
+the memory bandwidth of a single node.
+.\" NOTE: the following sentence doesn't make sense in the context
+.\" of set_mempolicy() -- no memory area specified.
+.\" To be effective the memory area should be fairly large,
+.\" at least 1MB or bigger.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
-node first and fall back to other nodes if the preferred node is low on free
+The kernel will try to allocate pages from this node first
+and fall back to "nearby" nodes if the preferred node is low on free
memory.
-Only the first node in the
+If
+.I nodemask
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation (like
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation (like
.BR MPOL_DEFAULT ).
-The memory policy is preserved across an
+The process memory policy is preserved across an
.BR execve (2),
and is inherited by child processes created using
.BR fork (2)
@@ -107,6 +169,9 @@ or
.BR clone (2).
.SH NOTES
Process policy is not remembered if the page is swapped out.
+When such a page is paged back in, it will use the policy of
+the process or memory range that is in effect at the time the
+page is allocated.
.SH RETURN VALUE
On success,
.BR set_mempolicy ()
@@ -114,12 +179,49 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME writeme -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I mode is invalid.
+.SH ERRORS
+.TP
+.B EINVAL
+.I mode
+is invalid.
+Or,
+.I mode
+is
+.I MPOL_DEFAULT
+and
+.I nodemask
+is non-empty,
+or
+.I mode
+is
+.I MPOL_BIND
+or
+.I MPOL_INTERLEAVE
+and
+.I nodemask
+is empty.
+Or,
+.I maxnode
+specifies more than a page's worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+
.SH "VERSIONS AND LIBRARY SUPPORT"
See
.BR mbind (2).
@@ -127,6 +229,7 @@ See
This system call is Linux specific.
.SH SEE ALSO
.BR mbind (2),
+.BR mmap (2),
.BR get_mempolicy (2),
.BR numactl (8),
.BR numa (3)
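For readers unfamiliar with the call, a minimal sketch of the
set_mempolicy() usage documented above (not part of the patch; node
ids 0 and 1 are assumed to exist, and error handling is minimal;
build with "cc ... -lnuma" as the page notes):

/* Interleave all future page allocations of this task across
 * nodes 0 and 1.  Existing pages keep their current placement,
 * as the man page text above says. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);

    /* maxnode: how many bits of the mask the kernel should
     * examine; a full unsigned long's worth is plenty here. */
    if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                      sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        exit(EXIT_FAILURE);
    }

    /* Pages first touched from here on are spread across
     * nodes 0 and 1 in numeric node id order. */
    return 0;
}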
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 21:05 ` Lee Schermerhorn
@ 2007-06-01 21:56 ` Christoph Lameter
2007-06-04 13:46 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-06-01 21:56 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
> > > I don't understand what you mean by "memory region based".
> >
> > Memory policies are controlling allocations for regions of memory of a
> > process. They are not file based policies (they may have been on Tru64).
>
> By "regions of memory of a process" do you mean VMAs? These are not
> shared between processes, so installing a policy in a VMA of one task
> will not affect pages faulted in by other cooperating tasks of the
> application.
Right. Thats how it should be.
> > Not at all. Consider a mmapped memory region by a database. The database
> > is running on nodes 5-8 and has specified an interleave policy for the
> > data.
>
> If the memory region is a shared mmap'd file and the data base consists
> of multiple tasks, you can't do this today if you don't want to prefault
> in the entire file--especially if you want to keep your task policy
> default/local so that task heap and stack pages stay local.
Well the point was that your approach leads to pretty inconsistent
behavior that is very weird and counterintuitive for those running the
software.
> Red Herring. The same scenario can occur with shmem today. And don't
> try to play the "shmem is different" card. For this scenario, they're
> the same. If "node 1 task" can mmap your file and specify a different
> policy, it can attach your shmem segment and specify a different policy,
> with the same result.
Sure, shmem is different. I think it was a mistake to allow memory
policy changes of shmem through the regular memory policy change API.
Shmem also has permissions, so you can prevent the above listed scenario
from occurring.
> And, why would the task on node 1 do that? In this scenario, these are
Because it is a smaller version of the database that is run for some minor
update purpose?
> not cooperating tasks; or it's an application bug. You want to penalize
> well behaved, cooperating tasks that are part of a single application,
> sharing application private files because you can come up with scenarios
> based on non-cooperating or buggy tasks to which you've allowed access
> to your application's files?
I do not want to penalize anyone. I want consistent and easily
understandable memory policy behavior.
> > Yes I want consistent memory policies. There are already consistency
> > issues that need to be solved. Forcing in a Tru64 concept of file memory
> > allocation policies will just make the situation worse.
>
> It's NOT a Tru64 concept, Christoph. Another Red Herring. It's about
> consistent support of memory policies on any object that I can map into
> my address space. And if that object is a disk-based file that lives
> in the page cache, and we want to preserve coherency between file
> descriptor and shared, memory mapped access [believe me, we do], then
> the policy applied to the object needs to affect all page allocations
> for that file--even those caused by non-cooperating or buggy tasks, if
> we allow them access to the files.
The scenario that I just described cannot occur with vma based policies.
And this is just one additional example of weird behaviors resulting from
file based policies.
> > And shmem is not really something that should be taken as a general rule.
>
> I disagree. The shared policy support that shmem has is exactly what I want
> for shared mmapped files. I'm willing to deal with the same issues that
> shmem has in order to get shared, mapped file semantics for my shared
> regions.
I think the current shmem policy approach can only be tolerated because
shmem has other means of control that do not exist for page cache pages.
> And, Christoph, again, adding shared policy support to shared file
> mappings doesn't add any warts or inconsistent behavior that isn't
> already there with policy applied to mmap'ed files. Default behavior is
> the same--wart-for-wart. Yes, shared policies on mmaped files will have
> the same risks as shared policy on shmem does today--e.g., your
> scenario--but we find the shared policies on shmem useful enough that
> we've all been willing to manage that.
Of course it adds lots of warts. Repeating:
1. Another process can modify the memory policies of a running process.
2. Policies persist after a process terminates. E.g., a file is bound to node
1, where we run a performance critical application. Now a process starts
on node 4 using the same file; it does not use memory policies, but its
allocations are redirected to node 1, where the mission critical app
suddenly has no memory available anymore.
3. It is not clear when the file policies will vanish. The point of
reclaim is indeterminate for the user. So sometimes the policy will vanish;
in other cases it will not.
Sorry but these semantics are not acceptable.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 21:10 ` Lee Schermerhorn
@ 2007-06-01 21:58 ` Christoph Lameter
0 siblings, 0 replies; 83+ messages in thread
From: Christoph Lameter @ 2007-06-01 21:58 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Gleb Natapov, Andi Kleen, linux-mm, Andrew Morton
On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
> But, what if the processes install different policies... if they're NOT
> cooperating. This was your previous objection. In fact, you've used
> just the scenario that Gleb describes as an objection--that different
> tasks could have different policies in their address spaces. Not a
> problem if the policy is shared. Let one task do the setup. Done! It
> just works. Keep those uncooperative tasks away from your file.
>
> What happened to consistency? ;-)
It is consistent with page cache pages being able to be "faulted" in
either by buffered I/O or mmapped I/O to an arbitrary node. So the
application does not have the expectation that the pages must be on
certain nodes. This is the same for shared anonymous pages. It would be
fully consistent across all uses of vma based policies.
The new pages are allocated in the context of the vma's memory policy. And
the applicable policy depends on the task doing the allocations.
Again consistent semantics with how anonymous pages are handled.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 20:45 ` Christoph Lameter
2007-06-01 21:10 ` Lee Schermerhorn
@ 2007-06-02 7:23 ` Gleb Natapov
1 sibling, 0 replies; 83+ messages in thread
From: Gleb Natapov @ 2007-06-02 7:23 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, Andi Kleen, linux-mm, Andrew Morton
On Fri, Jun 01, 2007 at 01:45:04PM -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Gleb Natapov wrote:
>
> > > Same here and I wish we had a clean memory region based implementation.
> > > But that is just what your patches do *not* provide. Instead they are file
> > > based. They should be memory region based.
> > Do you want a solution that doesn't associate memory policy with a file
> > (if a file is mapped shared and disk backed) like Lee's solution does, but
> > instead install it into VMA and respect the policy during pagecache page
> > allocation on behalf of the process? So two processes should cooperate
>
> Right.
>
> > (bind the same part of a file to the same memory node in each process) to get
> > consistent result? If yes this will work for me.
>
> Yes.
OK. This would be good enough for me (although I agree with Lee's approach and,
I suppose, we could track which process installed the latest policy on the file's region
and remove it on process exit). But for the sake of consistency, why not handle shmem
in the same way then? Do it Lee's way or do it your way, but PLEASE do it the same way
for all kinds of memory regions! You are claiming that shmem is somehow special
because you can control access to it, but what about files? You surely
can control access to those. And as for persistence of shmem policy, I
don't see how it is useful on a multiuser machine. I see some kind of
use for it on a dedicated server, but that is exactly where it can be
achieved by other means.
--
Gleb.
* Re: [PATCH] Document Linux Memory Policy
2007-06-01 21:56 ` Christoph Lameter
@ 2007-06-04 13:46 ` Lee Schermerhorn
2007-06-04 16:34 ` Christoph Lameter
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-04 13:46 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Fri, 2007-06-01 at 14:56 -0700, Christoph Lameter wrote:
> On Fri, 1 Jun 2007, Lee Schermerhorn wrote:
>
> > > > I don't understand what you mean by "memory region based".
<big snip>
Christoph: obviously I disagree with you on most of the points as well
as your conclusion. I may get preempted out of this exchange for a
while, but in any case, I'm going to try to recap all of the points,
including the application model, current capabilities, the semantics I'm
espousing and why they make sense to me. I fear our mental maps of the
territory are such that no reconciliation is possible, but I need to
make the attempt because you seem to be in a position to block me here.
Later,
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 13:46 ` Lee Schermerhorn
@ 2007-06-04 16:34 ` Christoph Lameter
2007-06-04 17:02 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-06-04 16:34 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
We have discussed this since you began this work more than a year ago
after I asked you to do the memory region based approach. More
documentation will not change the fundamental problems with inode
based policies.
You can likely make the approach less of a catastrophe by enhancing the
shmem tools (ipcs, ipcrm) to work on page cache files so that the sysadmin can
see what kind of policies are set on the inodes in memory right now, so
that any unusual allocation behavior as a result of the crazy semantics
here can be detected and fixed.
For shmem (even without page cache inode policies) it may be useful to at
least modify ipcs to show the memory policies and the distribution of the
pages for shared memory. Frankly, the existing shmem numa policy
implementation is already a grave cause for concern because weird
policies suddenly come into play that the process never set. To
have that for the page cache is a nightmare scenario.
Shmem has at least a determinate lifetime (and therefore also a
determinate lifetime for memory policies attached to shmem) which makes it
more manageable. Plus it is a kind of ramdisk where you would want to have
a policy attached to where the ramdisk data should be placed.
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 16:34 ` Christoph Lameter
@ 2007-06-04 17:02 ` Lee Schermerhorn
2007-06-04 17:11 ` Christoph Lameter
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-04 17:02 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Mon, 2007-06-04 at 09:34 -0700, Christoph Lameter wrote:
> We have discussed this since you began this work more than a year ago
> after I asked you to do the memory region based approach. More
> documentation will not change the fundamental problems with inode
> based policies.
I hope that I can show why a memory region based approach, if I
understand your notion of memory regions, doesn't have the desired
properties. That is, the properties that I desire. I want to make that
clear because you make statements about "fundamental problems",
"catastrophe", "crazy semantics" as if your view is the only valid one.
To quote [paraphrase?] Nick Piggin: "I have thought about this,
some..."
>
> You can likely make the approach less of a catastrophe by enhancing the
> shmem tools (ipcs, ipcrm) to work on page cache files so that the sysadmin can
> see what kind of policies are set on the inodes in memory right now, so
> that any unusual allocation behavior as a result of the crazy semantics
> here can be detected and fixed.
>
> For shmem (even without page cache inode policies) it may be useful to at
> least modify ipcs to show the memory policies and the distribution of the
> pages for shared memory. Frankly, the existing shmem numa policy
> implementation is already a grave cause for concern because weird
> policies suddenly come into play that the process never set. To
> have that for the page cache is a nightmare scenario.
A "nightmare" in your view of the world, not mine. Maybe not in Gleb's,
from what I can tell. As for others, I don't know, as they've all been
silent. ROFL for all I know...
I try to give you the benefit of the doubt that it's my fault for not
explaining things clearly enough where you're making what appear to me
as specious arguments--unintentionally, of course. But your tone just
keeps getting more strident.
>
> Shmem has at least a determinate lifetime (and therefore also a
> determinate lifetime for memory policies attached to shmem) which makes it
> more manageable. Plus it is a kind of ramdisk where you would want to have
> a policy attached to where the ramdisk data should be placed.
So, control over the lifetime of the policies is one of your issues.
Fine, I can deal with that. Name calling and hyperbole don't help.
Later,
Lee
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 17:02 ` Lee Schermerhorn
@ 2007-06-04 17:11 ` Christoph Lameter
2007-06-04 20:23 ` Andi Kleen
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-06-04 17:11 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Mon, 4 Jun 2007, Lee Schermerhorn wrote:
> I try to give you the benefit of the doubt that it's my fault for not
> explaining things clearly enough where you're making what appear to me
> as specious arguments--unintentionally, of course. But your tone just
> keeps getting more strident.
Yes, we have been discussing this for more than a year now. I am a bit
irritated that you keep pushing this. In particular, none of the concerns
have been addressed. It's just as raw as it was then.
> > Shmem has at least a determinate lifetime (and therefore also a
> > determinate lifetime for memory policies attached to shmem) which makes it
> > more manageable. Plus it is a kind of ramdisk where you would want to have
> > a policy attached to where the ramdisk data should be placed.
> So, control over the lifetime of the policies is one of your issue.
> Fine, I can deal with that. Name calling and hyperbole doesn't help.
The other issues will still remain! This is a fundamental change to the
nature of memory policies. They are no longer under the control of the
task but imposed from the outside. If one wants to do this then the whole
scheme of memory policies needs to be reworked and rethought in order to
be consistent and usable. For example you would need the ability to clear
a memory policy. And perhaps call this something different in order not to
cause confusion?
The patchset also changes semantics to deviate from documented behavior.
The memory policies work on memory ranges *not* on page ranges of files.
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 17:11 ` Christoph Lameter
@ 2007-06-04 20:23 ` Andi Kleen
2007-06-04 21:51 ` Christoph Lameter
0 siblings, 1 reply; 83+ messages in thread
From: Andi Kleen @ 2007-06-04 20:23 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Lee Schermerhorn, Gleb Natapov, linux-mm, Andrew Morton
>
> The other issues will still remain! This is a fundamental change to the
> nature of memory policies. They are no longer under the control of the
> task but imposed from the outside.
To be fair this can already happen with tmpfs (and hopefully soon hugetlbfs
again -- I plan to do some other work there anyway and will put
that in too). But with first touch it is relatively benign.
> If one wants to do this then the whole
> scheme of memory policies needs to be reworked and rethought in order to
> be consistent and usable. For example you would need the ability to clear
> a memory policy.
That's just setting it to default.
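For reference, "setting it to default" is a one-line call against the
documented API; a minimal sketch, assuming <numaif.h> from the numactl
package:

/* Drop the task policy back to the system default, i.e., node-local
 * allocation.  nodemask must be NULL (and maxnode zero) for
 * MPOL_DEFAULT, per the set_mempolicy(2) text above. */
#include <numaif.h>

int clear_task_policy(void)
{
    return set_mempolicy(MPOL_DEFAULT, NULL, 0);
}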
Frankly I think this whole discussion is quite useless without discussing
concrete use cases. So far I haven't heard any where this file policy
would be a great improvement. Any further complication of the code, which
is already quite complex, needs a very good rationale.
-Andi
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 20:23 ` Andi Kleen
@ 2007-06-04 21:51 ` Christoph Lameter
2007-06-05 14:30 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Christoph Lameter @ 2007-06-04 21:51 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, Gleb Natapov, linux-mm, Andrew Morton
On Mon, 4 Jun 2007, Andi Kleen wrote:
> > The other issues will still remain! This is a fundamental change to the
> > nature of memory policies. They are no longer under the control of the
> > task but imposed from the outside.
>
> To be fair this can already happen with tmpfs (and hopefully soon hugetlbfs
> again -- I plan to do some other work there anyway and will put
> that in too). But with first touch it is relatively benign.
Well this is pretty restricted for now so the control issues are not that
much of a problem. Both are special areas of memory that only see limited
use.
But in general the association of memory policies with files is not that
clean and it would be best to avoid things like that unless we first clean
up the semantics.
> > If one wants to do this then the whole
> > scheme of memory policies needs to be reworked and rethought in order to
> > be consistent and usable. For example you would need the ability to clear
> > a memory policy.
>
> That's just setting it to default.
Default does not allow one to distinguish between no memory policy set and
the node local policy. This becomes important if you need to arbitrate
multiple processes setting competing memory policies on a file page range.
Right now we are ducking issues here, it seems. If a process with higher
rights sets the node local policy then another process with lower rights
should not be able to change that, etc.
> Frankly I think this whole discussion is quite useless without discussing
> concrete use cases. So far I haven't heard any where this file policy
> would be a great improvement. Any further complication of the code, which
> is already quite complex, needs a very good rationale.
In general I agree (we have now operated for years with the current
mempolicy semantics and I am concerned about any changes causing churn for
our customers) but there is also the consistency issue. Memory policies do
not work in mmapped page cache ranges which is surprising and not
documented.
* Re: [PATCH] Document Linux Memory Policy
2007-06-04 21:51 ` Christoph Lameter
@ 2007-06-05 14:30 ` Lee Schermerhorn
0 siblings, 0 replies; 83+ messages in thread
From: Lee Schermerhorn @ 2007-06-05 14:30 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Andi Kleen, Gleb Natapov, linux-mm, Andrew Morton
On Mon, 2007-06-04 at 14:51 -0700, Christoph Lameter wrote:
> On Mon, 4 Jun 2007, Andi Kleen wrote:
>
> > > The other issues will still remain! This is a fundamental change to the
> > > nature of memory policies. They are no longer under the control of the
> > > task but imposed from the outside.
> >
> > To be fair this can already happen with tmpfs (and hopefully soon hugetlbfs
> > again -- I plan to do some other work there anyway and will put
> > that in too). But with first touch it is relatively benign.
>
> Well this is pretty restricted for now so the control issues are not that
> much of a problem. Both are special areas of memory that only see limited
> use.
>
> But in general the association of memory policies with files is not that
> clean and it would be best to avoid things like that unless we first clean
> up the semantics.
Check out the behavior of mmap(MAP_ANONYMOUS|MAP_SHARED) and mbind().
You get a shared file with shared policies. Andi's shared policy
infrastructure works fine with all file objects to which it has been
applied. Exactly the semantics one would expect with a shared object.
I agree that for this usage, control issues are essentially non-existent
because the file is private to the application. And, I don't know how
widespread the use of mmap(MAP_ANONYMOUS|MAP_SHARED) is, but I would
expect it to be used fairly widely by multi-process applications.
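A minimal sketch of the pattern described here (node ids and sizes are
illustrative; error handling is trimmed): a parent creates a shared
anonymous region, binds a policy to it, and children that inherit the
mapping fault pages in under that shared policy.

/* Shared anonymous region + mbind(): the policy lives with the
 * shared (shmem-backed) object, so it applies no matter which
 * task first touches a page. */
#include <numaif.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL << 20)        /* 64 MB, illustrative */

int main(void)
{
    unsigned long nodemask = (1UL << 0) | (1UL << 1);
    char *region = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_SHARED, -1, 0);

    if (region == MAP_FAILED)
        return 1;

    /* Interleave the shared region across nodes 0 and 1. */
    if (mbind(region, REGION_SIZE, MPOL_INTERLEAVE, &nodemask,
              sizeof(nodemask) * 8, 0) != 0)
        return 1;

    if (fork() == 0) {
        region[0] = 1;  /* first touch: allocated per the shared policy */
        _exit(0);
    }
    return 0;
}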
We can discuss semantics, clean or otherwise, when we have more shared
context vis a vis the models.
>
> > > If one wants to do this then the whole
> > > scheme of memory policies needs to be reworked and rethought in order to
> > > be consistent and usable. For example you would need the ability to clear
> > > a memory policy.
> >
> > That's just setting it to default.
>
> Default does not allow one to distinguish between no memory policy set and
> the node local policy. This becomes important if you need to arbitrate
> multiple processes setting competing memory policies on a file page range.
I agree with Christoph here. I haven't started the patch yet, but I
think we can define a 'MPOL_DELETE' policy that deletes any policies on
objects in the specified virtual address range for mbind(). This would
provide an interface for removing policy from shared, mapped files if
one wanted the policies to persist after last unmap.
For set_mempolicy() it can simply remove the task policy, restoring it
to system default.
Persistence is another area that I agree needs work. As I see it, the
options are:
1) let the policies persist until the inode is recycled. This can only
happen when there are no mappers. This is, in fact, what my patches do
today. I'm not suggesting this is the right way. I just haven't
decided, nor has anyone suggested to me, what the desirable semantics
would be.
2) remove the policy on last unmap. We'll need a way to detect last
unmap, but that shouldn't be too difficult.
3) require the inode to persist while any policies are attached. Then,
we'd need a way to list the files hanging around because policies exist,
and a way to remove the policies. The latter is the easier of the two,
I think: enhance numactl to take a --delete <file-path> option that
mmaps() the entire file range shared and issues mbind() with the
MPOL_DELETE mode mentioned above. I'll have to look into listing files
with just a policy reference holding the inode.
I think #2 is relatively easy to do and has the semantics I need, where
the shared policy is established at application startup. #3 is the most
work, and therefore should have a compelling use case. One use case
would be to set shared file policy via numactl and have it persist after
numactl exits w/o risk of the inode being recycled before you could
start the application for which you've set up the file policy. Maybe
this is what Andi has been thinking but not saying?
> Right now we are ducking issues here, it seems. If a process with higher
> rights sets the node local policy then another process with lower rights
> should not be able to change that, etc.
Yes, we must solve access control if you think this is a problem. We
have file permissions for controlling access to the contents of files.
If you think it necessary, we can require, say, write permission to set
policy. After all a task with write permission can corrupt the
contents. Seems much more serious, to me, than setting the policy
behind some other task's back.
>
> > Frankly I think this whole discussion is quite useless without discussing
> > concrete use cases. So far I haven't heard any where this any file policy
> > would be a great improvement. Any further complication of the code which
> > is already quite complex needs a very good rationale.
Andi:
The use case is multi-process applications that use memory mapped files
as initialized shared memory regions with writeback semantics. We have
customers with applications that do this. The files tend to be large
and cache behavior relatively poor--so locality matters. Typically,
even predating NUMA, these applications have had a single process that
sets up the environment at application start up. Where these
applications use uninitialized shared memory [SysV shmem], the init task
would create that, if necessary [they don't survive reboot], mmap shared
files, ... When NUMA came along, the init task was the logical place to
establish locality on shmem and shared files. After that, "first touch"
faults in the pages. In the shared objects that have explicit policy,
that policy controls the placement, as desired. For process heap,
stack, ... where no policy has been applied, the process gets local
allocation, as desired.
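A sketch of what that init task does (the file path and node ids are
made-up examples; note that, per the mbind.2 NOTES above, mainline
kernels of this era ignore the policy on MAP_SHARED file mappings, so
the mbind() below only has the intended effect under the proposed
patches):

/* Init-task setup: interleave a shared data file across nodes 5-8,
 * leaving the task policy at default so heap and stack stay local. */
#include <fcntl.h>
#include <numaif.h>
#include <sys/mman.h>
#include <sys/stat.h>

int setup_shared_file(const char *path)    /* e.g. "/data/shared.db" */
{
    struct stat st;
    int fd = open(path, O_RDWR);

    if (fd < 0 || fstat(fd, &st) < 0)
        return -1;

    void *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    /* After this, "first touch" by any task faults pages in
     * according to the shared interleave policy. */
    unsigned long nodemask =
        (1UL << 5) | (1UL << 6) | (1UL << 7) | (1UL << 8);
    return mbind(p, st.st_size, MPOL_INTERLEAVE, &nodemask,
                 sizeof(nodemask) * 8, 0);
}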
I don't think this complicates the code. I'd like to think that my
patches actually clean things up a bit [no disrespect intended ;-)].
The basic shared policy infrastructure supports the desired semantics on
all shared files [all page cache pages!] except disk backed files. These
are the odd man out. I'd love to get down to discussing the technical
aspects of the patches, but I understand that we need to agree on the
models and use cases first.
>
> In general I agree (we have now operated for years with the current
> mempolicy semantics and I am concerned about any changes causing churn for
> our customers) but there is also the consistency issue. Memory policies do
> not work in mmapped page cache ranges which is surprising and not
> documented.
I am willing to update the documentation for the new behavior. That's
why I started the documentation thread. I have already sent you a patch
to update the policy man pages to define current behavior.
Default behavior would continue to be as it is today. If any programs
are setting policy on address ranges backed by files mapped shared, they
aren't getting what they expect today. The policy is ignored. They
can't expect that, else why would they have called mbind() or one of the
libnuma wrappers? In fact, the 2.51 man pages that I grabbed from
Michael Kerrisk state in the mbind.2 NOTES section that mbind() isn't
supported on file mappings. I enhanced that a bit to indicate that this
is true for files mapped with MAP_SHARED. I should update the patch to
emphasize that it's only true for regular disk backed files.
If none of your customers are using shared mapped files this way today,
then it won't affect them. This is why I don't understand the
objections on behavioral grounds [I do understand we have a disconnect
on the model of processes/address spaces/memory objects/... that we need
to sort out]. However, it such applications do exist that will be
surprised if shared file policies suddenly start working, we could make
them controllable on a per cpuset [container] basis. Might be a good
idea in any case... if we can sort out the model issue.
Lee
* Re: [PATCH] enhance memory policy sys call man pages v1
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
@ 2007-07-23 6:11 ` Michael Kerrisk
2007-07-23 6:32 ` mbind.2 man page patch Michael Kerrisk
` (2 subsequent siblings)
3 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-23 6:11 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: ak, akpm, linux-mm, clameter, Samuel Thibault
Lee, (and Andi, Christoph),
Sorry that I have not replied sooner.
The patches you have written look like a great piece of work. Thanks! I
have made some light edits to improve wording and grammar, and added a
small fix that came in independently from Samuel Thibault for mbind.2.
I have also rebased the patches to include a few small changes that have
occurred between man-pages-2.51 and man-pages-2.63. These changes are all
minor: formatting changes, reordering of a few sections, and similar.
Andi, Christoph: please see below.
Lee Schermerhorn wrote:
> Subject was: Re: [PATCH] Document Linux Memory Policy
>
> On Thu, 2007-05-31 at 10:20 +0200, Michael Kerrisk wrote:
>>>>> The docs are wrong. This is fully supported.
>>>> Yes, I gave up on that one and the warning in the manpage should be
>>>> probably dropped
>>> OK. I'll work with the man page maintainers.
>> Hi Lee,
>>
>> If you could write a patch for the man page, that would be ideal.
>> Location of current tarball is in the .sig.
>
> [PATCH] enhance memory policy sys call man pages v1
>
> Against man pages 2.51
>
> This patch enhances the 3 memory policy system call man pages
> to add description of missing semantics, error return values,
> etc. The descriptions match the semantics of the kernel circa
> 2.6.21/22, as gleaned from the source code.
>
> I have changed the "policy" parameter to "mode" throughout the
> descriptions in an attempt to promote the concept that the memory
> policy is a tuple consisting of a mode and optional set of nodes.
> Also matches internal name and <numaif.h> prototypes for mbind()
> and set_mempolicy().
>
> I think I've covered all of the existing errno returns, but may
> have missed a few.
>
> These pages definitely need proofing by other sets of eyes...
Andi, Christoph: I don't have enough understanding of these system calls to
technically review the changes that Lee has made. Can one or both of you
please help? I will forward the revised patches as three separate mails
following this one. (NOTE: ignore the patch below; it is now stale.)
Cheers,
Michael
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> man2/get_mempolicy.2 | 222 +++++++++++++++++++++------------
> man2/mbind.2 | 335 +++++++++++++++++++++++++++++++++++++--------------
> man2/set_mempolicy.2 | 173 +++++++++++++++++++++-----
> 3 files changed, 526 insertions(+), 204 deletions(-)
>
> Index: Linux/man2/mbind.2
> ===================================================================
> --- Linux.orig/man2/mbind.2 2007-05-11 19:07:02.000000000 -0400
> +++ Linux/man2/mbind.2 2007-06-01 12:28:06.000000000 -0400
> @@ -18,15 +18,16 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> -.TH MBIND 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
> +.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
> .SH NAME
> mbind \- Set memory policy for a memory range
> .SH SYNOPSIS
> .nf
> .B "#include <numaif.h>"
> .sp
> -.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
> +.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
> .BI " unsigned long *" nodemask ", unsigned long " maxnode ,
> .BI " unsigned " flags );
> .sp
> @@ -34,76 +35,179 @@ mbind \- Set memory policy for a memory
> .fi
> .SH DESCRIPTION
> .BR mbind ()
> -sets the NUMA memory
> -.I policy
> +sets the NUMA memory policy,
> +which consists of a policy mode and zero or more nodes,
> for the memory range starting with
> .I start
> and continuing for
> .IR len
> bytes.
> The memory of a NUMA machine is divided into multiple nodes.
> -The memory policy defines in which node memory is allocated.
> +The memory policy defines from which node memory is allocated.
> +
> +If the memory range specified by the
> +.IR start " and " len
> +arguments includes an "anonymous" region of memory\(emthat is
> +a region of memory created using the
> +.BR mmap (2)
> +system call with the
> +.BR MAP_ANONYMOUS " flag\(emor"
> +a memory mapped file, mapped using the
> +.BR mmap (2)
> +system call with the
> +.B MAP_PRIVATE
> +flag, pages will only be allocated according to the specified
> +policy when the application writes [stores] to the page.
> +For anonymous regions, an initial read access will use a shared
> +page in the kernel containing all zeros.
> +For a file mapped with
> +.BR MAP_PRIVATE ,
> +an initial read access will allocate pages according to the
> +process policy of the process that causes the page to be allocated.
> +This may not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a memory mapped file,
> +mapped using the
> +.BR mmap (2)
> +system call with the
> +.B MAP_SHARED
> +flag, the specified policy will be ignored for all page allocations
> +in this range.
> +Rather the pages will be allocated according to the process policy
> +of the process that caused the page to be allocated.
> +Again, this may not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a shared memory region
> +created using the
> +.BR shmget (2)
> +system call and attached using the
> +.BR shmat (2)
> +system call,
> +pages allocated for the anonymous or shared memory region will
> +be allocated according to the policy specified, regardless of which
> +process attached to the shared memory segment causes the allocation.
> +If, however, the shared memory region was created with the
> +.B SHM_HUGETLB
> +flag,
> +the huge pages will be allocated according to the policy specified
> +only if the page allocation is caused by the task that calls
> +.BR mbind ()
> +for that region.
> +
> +By default,
> .BR mbind ()
> only has an effect for new allocations; if the pages inside
> the range have been already touched before setting the policy,
> then the policy has no effect.
> +This default behavior may be overridden by the
> +.BR MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +flags described below.
>
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
> .BR MPOL_DEFAULT ,
> .BR MPOL_BIND ,
> -.BR MPOL_INTERLEAVE ,
> -and
> +.B MPOL_INTERLEAVE
> +or
> .BR MPOL_PREFERRED .
> -All policies except
> +All policy modes except
> .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify via the
> .I nodemask
> -parameter.
> +parameter,
> +the node or nodes to which the mode applies.
> +
> .I nodemask
> -is a bitmask of nodes containing up to
> +points to a bitmask of nodes containing up to
> .I maxnode
> bits.
> -The actual number of bytes transferred via this argument
> -is rounded up to the next multiple of
> +The bit mask size is rounded to the next multiple of
> .IR "sizeof(unsigned long)" ,
> but the kernel will only use bits up to
> .IR maxnode .
> -A NULL argument means an empty set of nodes.
> +A NULL value of
> +.I nodemask
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero,
> +the
> +.I nodemask
> +argument is ignored.
>
> The
> .B MPOL_DEFAULT
> -policy is the default and means to use the underlying process policy
> -(which can be modified with
> -.BR set_mempolicy (2)).
> -Unless the process policy has been changed this means to allocate
> -memory on the node of the CPU that triggered the allocation.
> +mode specifies the default policy.
> +When applied to a range of memory via
> +.IR mbind (),
> +this means to use the process policy,
> +which may have been set with
> +.BR set_mempolicy (2).
> +If the mode of the process policy is also
> +.BR MPOL_DEFAULT ,
> +pages will be allocated on the node of the CPU that triggers the allocation.
> +For
> +.BR MPOL_DEFAULT ,
> +the
> .I nodemask
> -should be specified as NULL.
> +and
> +.I maxnode
> +arguments must specify the empty set of nodes.
>
> The
> .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> -nodes specified in
> +mode specifies a strict policy that restricts memory allocation to
> +the nodes specified in
> .IR nodemask .
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node id first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node id specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contain free memory.
> There won't be allocations on other nodes.
>
> +The
> .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> +mode specifies that page allocations be interleaved across the
> +set of nodes specified in
> .IR nodemask .
> -This optimizes for bandwidth instead of latency.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
> To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +at least 1MB or bigger with a fairly uniform access pattern.
> +Accesses to a single page of the area will still be limited to
> +the memory bandwidth of a single node.
>
> .B MPOL_PREFERRED
> sets the preferred node for allocation.
> -The kernel will try to allocate in this
> +The kernel will try to allocate pages from this
> node first and fall back to other nodes if the
> preferred nodes is low on free memory.
> -Only the first node in the
> +If
> .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation).
> +specifies more than one node id, the first node in the
> +mask will be selected as the preferred node.
> +If the
> +.I nodemask
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation.
> +This is the only way to specify "local allocation" for a
> +range of memory via
> +.IR mbind (2).
>
> If
> .B MPOL_MF_STRICT
> @@ -115,17 +219,18 @@ is not
> .BR MPOL_DEFAULT ,
> then the call will fail with the error
> .B EIO
> -if the existing pages in the mapping don't follow the policy.
> -In 2.6.16 or later the kernel will also try to move pages
> -to the requested node with this flag.
> +if the existing pages in the memory range don't follow the policy.
> +.\" According to the kernel code, the following is not true --lts
> +.\" In 2.6.16 or later the kernel will also try to move pages
> +.\" to the requested node with this flag.
>
> If
> .B MPOL_MF_MOVE
> -is passed in
> +is specified in
> .IR flags ,
> -then an attempt will be made to
> -move all the pages in the mapping so that they follow the policy.
> -Pages that are shared with other processes are not moved.
> +then the kernel will attempt to move all the existing pages
> +in the memory range so that they follow the policy.
> +Pages that are shared with other processes will not be moved.
> If
> .B MPOL_MF_STRICT
> is also specified, then the call will fail with the error
> @@ -136,8 +241,8 @@ If
> .B MPOL_MF_MOVE_ALL
> is passed in
> .IR flags ,
> -then all pages in the mapping will be moved regardless of whether
> -other processes use the pages.
> +then the kernel will attempt to move all existing pages in the memory range
> +regardless of whether other processes use the pages.
> The calling process must be privileged
> .RB ( CAP_SYS_NICE )
> to use this flag.
> @@ -146,6 +251,7 @@ If
> is also specified, then the call will fail with the error
> .B EIO
> if some pages could not be moved.
> +.\" ---------------------------------------------------------------
> .SH RETURN VALUE
> On success,
> .BR mbind ()
> @@ -153,11 +259,9 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> +.\" ---------------------------------------------------------------
> .SH ERRORS
> -.TP
> -.B EFAULT
> -There was a unmapped hole in the specified memory range
> -or a passed pointer was not valid.
> +.\" I think I got all of the error returns. --lts
> .TP
> .B EINVAL
> An invalid value was specified for
> @@ -169,53 +273,102 @@ or
> was less than
> .IR start ;
> or
> -.I policy
> -was
> -.B MPOL_DEFAULT
> +.I start
> +is not a multiple of the system page size.
> +Or,
> +.I mode
> +is
> +.I MPOL_DEFAULT
> and
> .I nodemask
> -pointed to a non-empty set;
> +specified a non-empty set;
> or
> -.I policy
> -was
> -.B MPOL_BIND
> +.I mode
> +is
> +.I MPOL_BIND
> or
> -.B MPOL_INTERLEAVE
> +.I MPOL_INTERLEAVE
> and
> .I nodemask
> -pointed to an empty set,
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page's worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node ids that are
> +greater than the maximum supported node id,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
> +Or, none of the node ids specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +Or, there was an unmapped hole in the specified memory range.
> .TP
> .B ENOMEM
> -System out of memory.
> +Insufficient kernel memory was available.
> .TP
> .B EIO
> .B MPOL_MF_STRICT
> was specified and an existing page was already on a node
> -that does not follow the policy.
> +that does not follow the policy;
> +or
> +.B MPOL_MF_MOVE
> +or
> +.B MPOL_MF_MOVE_ALL
> +was specified and the kernel was unable to move all existing
> +pages in the range.
> +.TP
> +.B EPERM
> +The
> +.I flags
> +argument included the
> +.B MPOL_MF_MOVE_ALL
> +flag and the caller does not have the
> +.B CAP_SYS_NICE
> +privilege.
> +.\" ---------------------------------------------------------------
> .SH NOTES
> -NUMA policy is not supported on file mappings.
> +NUMA policy is not supported on a memory mapped file range
> +that was mapped with the
> +.I MAP_SHARED
> +flag.
>
> .B MPOL_MF_STRICT
> -is ignored on huge page mappings right now.
> +is ignored on huge page mappings.
>
> -It is unfortunate that the same flag,
> +The
> .BR MPOL_DEFAULT ,
> -has different effects for
> +mode has different effects for
> .BR mbind (2)
> and
> .BR set_mempolicy (2).
> -To select "allocation on the node of the CPU that
> -triggered the allocation" (like
> -.BR set_mempolicy (2)
> -.BR MPOL_DEFAULT )
> -when calling
> +When
> +.B MPOL_DEFAULT
> +is specified for a range of memory using
> .BR mbind (),
> +any pages subsequently allocated for that range will use
> +the process' policy, as set by
> +.BR set_mempolicy (2).
> +This effectively removes the explicit policy from the
> +specified range.
> +To select "local allocation" for a memory range,
> specify a
> -.I policy
> +.I mode
> of
> .B MPOL_PREFERRED
> -with an empty
> -.IR nodemask .
> +with an empty set of nodes.
> +This method will work for
> +.BR set_mempolicy (2),
> +as well.
> +.\" ---------------------------------------------------------------
> .SH "VERSIONS AND LIBRARY SUPPORT"
> The
> .BR mbind (),
> @@ -226,16 +379,18 @@ system calls were added to the Linux ker
> They are only available on kernels compiled with
> .BR CONFIG_NUMA .
>
> -Support for huge page policy was added with 2.6.16.
> -For interleave policy to be effective on huge page mappings the
> -policied memory needs to be tens of megabytes or larger.
> -
> -.B MPOL_MF_MOVE
> -and
> -.B MPOL_MF_MOVE_ALL
> -are only available on Linux 2.6.16 and later.
> +You can link with
> +.I -lnuma
> +to get system call definitions.
> +.I libnuma
> +and the required
> +.I numaif.h
> +header
> +are available in the
> +.I numactl
> +package.
>
> -These system calls should not be used directly.
> +However, applications should not use these system calls directly.
> Instead, the higher level interface provided by the
> .BR numa (3)
> functions in the
> @@ -245,17 +400,21 @@ The
> .I numactl
> package is available at
> .IR ftp://ftp.suse.com/pub/people/ak/numa/ .
> -
> -You can link with
> -.I -lnuma
> -to get system call definitions.
> -.I libnuma
> -is available in the
> -.I numactl
> +The package is also included in some Linux distributions.
> +Some distributions include the development library and header
> +in the separate
> +.I numactl-devel
> package.
> -This package also has the
> -.I numaif.h
> -header.
> +
> +Support for huge page policy was added with 2.6.16.
> +For interleave policy to be effective on huge page mappings the
> +policied memory needs to be tens of megabytes or larger.
> +
> +.B MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +are only available on Linux 2.6.16 and later.
> +
> .SH CONFORMING TO
> This system call is Linux specific.
> .SH SEE ALSO
> @@ -263,4 +422,6 @@ This system call is Linux specific.
> .BR numactl (8),
> .BR set_mempolicy (2),
> .BR get_mempolicy (2),
> -.BR mmap (2)
> +.BR mmap (2),
> +.BR shmget (2),
> +.BR shmat (2).
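As a reader's aid, the "local allocation" recipe from the NOTES above,
as a one-call sketch against the documented API ('p' and 'len' describe
an already-mapped range; error handling omitted):

/* Request "local allocation" for [p, p+len): MPOL_PREFERRED with an
 * empty node set.  Per the text above, a NULL nodemask with maxnode
 * zero specifies the empty set of nodes. */
#include <numaif.h>

int prefer_local(void *p, unsigned long len)
{
    return mbind(p, len, MPOL_PREFERRED, NULL, 0, 0);
}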
> Index: Linux/man2/get_mempolicy.2
> ===================================================================
> --- Linux.orig/man2/get_mempolicy.2 2007-04-12 18:42:49.000000000 -0400
> +++ Linux/man2/get_mempolicy.2 2007-06-01 12:29:00.000000000 -0400
> @@ -18,6 +18,7 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> .TH GET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
> .SH SYNOPSIS
> @@ -26,9 +27,11 @@ get_mempolicy \- Retrieve NUMA memory po
> .B "#include <numaif.h>"
> .nf
> .sp
> -.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
> +.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
> .BI " unsigned long " maxnode ", unsigned long " addr ,
> .BI " unsigned long " flags );
> +.sp
> +.BI "cc ... \-lnuma"
> .fi
> .\" TBD rewrite this. it is confusing.
> .SH DESCRIPTION
> @@ -39,7 +42,7 @@ depending on the setting of
>
> A NUMA machine has different
> memory controllers with different distances to specific CPUs.
> -The memory policy defines in which node memory is allocated for
> +The memory policy defines from which node memory is allocated for
> the process.
>
> If
> @@ -58,58 +61,75 @@ then information is returned about the p
> address given in
> .IR addr .
> This policy may be different from the process's default policy if
> -.BR set_mempolicy (2)
> -has been used to establish a policy for the page containing
> +.BR mbind (2)
> +or one of the helper functions described in
> +.BR numa (3)
> +has been used to establish a policy for the memory range containing
> .IR addr .
>
> -If
> -.I policy
> -is not NULL, then it is used to return the policy.
> +If the
> +.I mode
> +argument is not NULL, then
> +.IR get_mempolicy ()
> +will store the policy mode of the requested NUMA policy in the location
> +pointed to by this argument.
> If
> .IR nodemask
> -is not NULL, then it is used to return the nodemask associated
> -with the policy.
> +is not NULL, then the nodemask associated with the policy will be stored
> +in the location pointed to by this argument.
> .I maxnode
> -is the maximum bit number plus one that can be stored into
> -.IR nodemask .
> -The bit number is always rounded to a multiple of
> -.IR "unsigned long" .
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies both
> -.\" .B MPOL_F_NODE
> -.\" and
> -.\" .BR MPOL_F_ADDR ,
> -.\" then
> -.\" .I policy
> -.\" instead returns the number of the node on which the address
> -.\" .I addr
> -.\" is allocated.
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies
> -.\" .B MPOL_F_NODE
> -.\" but not
> -.\" .BR MPOL_F_ADDR ,
> -.\" and the process's current policy is
> -.\" .BR MPOL_INTERLEAVE ,
> -.\" then
> -.\" checkme: Andi's text below says that the info is returned in
> -.\" 'nodemask', not 'policy':
> -.\" .I policy
> -.\" instead returns the number of the next node that will be used for
> -.\" interleaving allocation.
> -.\" FIXME .
> -.\" The other valid flag is
> -.\" .I MPOL_F_NODE.
> -.\" It is only valid when the policy is
> -.\" .I MPOL_INTERLEAVE.
> -.\" In this case not the interleave mask, but an unsigned long with the next
> -.\" node that would be used for interleaving is returned in
> -.\" .I nodemask.
> -.\" Other flag values are reserved.
> +specifies the number of node ids
> +that can be stored into
> +.IR nodemask \(emthat
> +is, the maximum node id plus one.
> +The value specified by
> +.I maxnode
> +is always rounded to a multiple of
> +.IR "sizeof(unsigned long)" .
> +
> +If
> +.I flags
> +specifies both
> +.B MPOL_F_NODE
> +and
> +.BR MPOL_F_ADDR ,
> +.IR get_mempolicy ()
> +will return the node id of the node on which the address
> +.I addr
> +is allocated into the location pointed to by
> +.IR mode .
> +If no page has yet been allocated for the specified address,
> +.IR get_mempolicy ()
> +will allocate a page as if the process had performed a read
> +[load] access to that address, and return the id of the node
> +where that page was allocated.
> +
> +If
> +.I flags
> +specifies
> +.BR MPOL_F_NODE ,
> +but not
> +.BR MPOL_F_ADDR ,
> +and the process's current policy is
> +.BR MPOL_INTERLEAVE ,
> +then
> +.IR get_mempolicy ()
> +will return in the location pointed to by a non-NULL
> +.I mode
> +argument,
> +the node id of the next node that will be used for
> +interleaving of internal kernel pages allocated on behalf of the process.
> +.\" Note: code returns next interleave node via 'mode' argument -lts
> +These allocations include pages for memory mapped files in
> +process memory ranges mapped using the
> +.IR mmap (2)
> +call with the
> +.I MAP_PRIVATE
> +flag for read accesses, and in memory ranges mapped with the
> +.I MAP_SHARED
> +flag for all accesses.
> +
> +Other flag values are reserved.
>
> For an overview of the possible policies see
> .BR set_mempolicy (2).
> @@ -120,40 +140,77 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> -.\" .SH ERRORS
> -.\" FIXME writeme -- no errors are listed on this page
> -.\" .
> -.\" .TP
> -.\" .B EINVAL
> -.\" .I nodemask
> -.\" is non-NULL, and
> -.\" .I maxnode
> -.\" is too small;
> -.\" or
> -.\" .I flags
> -.\" specified values other than
> -.\" .B MPOL_F_NODE
> -.\" or
> -.\" .BR MPOL_F_ADDR ;
> -.\" or
> -.\" .I flags
> -.\" specified
> -.\" .B MPOL_F_ADDR
> -.\" and
> -.\" .I addr
> -.\" is NULL.
> -.\" (And there are other EINVAL cases.)
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +The value specified by
> +.I maxnode
> +is less than the number of node ids supported by the system.
> +Or
> +.I flags
> +specified values other than
> +.B MPOL_F_NODE
> +or
> +.BR MPOL_F_ADDR ;
> +or
> +.I flags
> +specified
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is NULL,
> +or
> +.I flags
> +did not specify
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is not NULL.
> +Or,
> +.I flags
> +specified
> +.B MPOL_F_NODE
> +but not
> +.B MPOL_F_ADDR
> +and the current process policy is not
> +.BR MPOL_INTERLEAVE .
> +(And there are other EINVAL cases.)
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> .SH NOTES
> -This manual page is incomplete:
> -it does not document the details the
> -.BR MPOL_F_NODE
> -flag,
> -which modifies the operation of
> -.BR get_mempolicy ().
> -This is deliberate: this flag is not intended for application use,
> -and its operation may change or it may be removed altogether in
> -future kernel versions.
> -.B Do not use it.
> +If the mode of the process policy or the policy governing allocations at the
> +specified address is
> +.I MPOL_PREFERRED
> +and this policy was installed with an empty
> +.IR nodemask \(emspecifying
> +local allocation,
> +.IR get_mempolicy ()
> +will return the mask of on-line node ids in the location pointed to by
> +a non-NULL
> +.I nodemask
> +argument.
> +This mask does not take into consideration any administratively imposed
> +restrictions on the process' context.
> +.\" "context" above refers to cpusets. No man page to reference. --lts
> +
> +.\" Christoph says the following is untrue. These are "fully supported."
> +.\" Andi concedes that he has lost this battle and approves [?]
> +.\" updating the man pages to document the behavior. --lts
> +.\" This manual page is incomplete:
> +.\" it does not document the details the
> +.\" .BR MPOL_F_NODE
> +.\" flag,
> +.\" which modifies the operation of
> +.\" .BR get_mempolicy ().
> +.\" This is deliberate: this flag is not intended for application use,
> +.\" and its operation may change or it may be removed altogether in
> +.\" future kernel versions.
> +.\" .B Do not use it.
> .SH "VERSIONS AND LIBRARY SUPPORT"
> See
> .BR mbind (2).
> @@ -161,6 +218,7 @@ See
> This system call is Linux specific.
> .SH SEE ALSO
> .BR mbind (2),
> +.BR mmap (2),
> .BR set_mempolicy (2),
> .BR numactl (8),
> .BR numa (3)
> Index: Linux/man2/set_mempolicy.2
> ===================================================================
> --- Linux.orig/man2/set_mempolicy.2 2007-04-12 18:42:49.000000000 -0400
> +++ Linux/man2/set_mempolicy.2 2007-06-01 12:28:49.000000000 -0400
> @@ -18,6 +18,7 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> .TH SET_MEMPOLICY 2 "2006-02-07" "SuSE Labs" "Linux Programmer's Manual"
> .SH NAME
> @@ -26,80 +27,141 @@ set_mempolicy \- set default NUMA memory
> .nf
> .B "#include <numaif.h>"
> .sp
> -.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
> +.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
> .BI " unsigned long " maxnode );
> +.sp
> +.BI "cc ... \-lnuma"
> .fi
> .SH DESCRIPTION
> .BR set_mempolicy ()
> -sets the NUMA memory policy of the calling process to
> -.IR policy .
> +sets the NUMA memory policy of the calling process,
> +which consists of a policy mode and zero or more nodes,
> +to the values specified by the
> +.IR mode ,
> +.I nodemask
> +and
> +.IR maxnode
> +arguments.
>
> A NUMA machine has different
> memory controllers with different distances to specific CPUs.
> -The memory policy defines in which node memory is allocated for
> +The memory policy defines from which node memory is allocated for
> the process.
>
> -This system call defines the default policy for the process;
> -in addition a policy can be set for specific memory ranges using
> +This system call defines the default policy for the process.
> +The process policy governs allocation of pages in the process'
> +address space outside of memory ranges
> +controlled by a more specific policy set by
> .BR mbind (2).
> +The process default policy also controls allocation of any pages for
> +memory mapped files mapped using the
> +.BR mmap (2)
> +call with the
> +.B MAP_PRIVATE
> +flag and that are only read [loaded] by the task,
> +and of memory mapped files mapped using the
> +.BR mmap (2)
> +call with the
> +.B MAP_SHARED
> +flag, regardless of the access type.
> The policy is only applied when a new page is allocated
> for the process.
> For anonymous memory this is when the page is first
> touched by the application.
>
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
> .BR MPOL_DEFAULT ,
> .BR MPOL_BIND ,
> -.BR MPOL_INTERLEAVE ,
> +.B MPOL_INTERLEAVE
> +or
> .BR MPOL_PREFERRED .
> -All policies except
> +All modes except
> .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify via the
> .I nodemask
> -parameter.
> +parameter
> +one or more nodes.
> +
> .I nodemask
> -is pointer to a bit field of nodes that contains up to
> +points to a bit mask of node ids that contains up to
> .I maxnode
> bits.
> -The bit field size is rounded to the next multiple of
> +The bit mask size is rounded to the next multiple of
> .IR "sizeof(unsigned long)" ,
> but the kernel will only use bits up to
> .IR maxnode .
> +A NULL value of
> +.I nodemask
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero,
> +the
> +.I nodemask
> +argument is ignored.
>
> The
> .B MPOL_DEFAULT
> -policy is the default and means to allocate memory locally,
> +mode is the default and means to allocate memory locally,
> i.e., on the node of the CPU that triggered the allocation.
> .I nodemask
> -should be specified as NULL.
> +must be specified as NULL.
> +If the "local node" contains no free memory, the system will
> +attempt to allocate memory from a "nearby" node.
>
> The
> .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> +mode defines a strict policy that restricts memory allocation to the
> nodes specified in
> .IR nodemask .
> -There won't be allocations on other nodes.
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node id first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node id specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contain free memory.
> +Pages will not be allocated from any node not specified in the
> +.IR nodemask .
>
> .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> -.IR nodemask .
> -This optimizes for bandwidth instead of latency.
> -To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +interleaves page allocations across the nodes specified in
> +.I nodemask
> +in numeric node id order.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
> +However, accesses to a single page will still be limited to
> +the memory bandwidth of a single node.
> +.\" NOTE: the following sentence doesn't make sense in the context
> +.\" of set_mempolicy() -- no memory area specified.
> +.\" To be effective the memory area should be fairly large,
> +.\" at least 1MB or bigger.
>
> .B MPOL_PREFERRED
> sets the preferred node for allocation.
> -The kernel will try to allocate in this
> -node first and fall back to other nodes if the preferred node is low on free
> +The kernel will try to allocate pages from this node first
> +and fall back to "nearby" nodes if the preferred node is low on free
> memory.
> -Only the first node in the
> +If
> +.I nodemask
> +specifies more than one node id, the first node in the
> +mask will be selected as the preferred node.
> +If the
> .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation (like
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation (like
> .BR MPOL_DEFAULT ).
>
> -The memory policy is preserved across an
> +The process memory policy is preserved across an
> .BR execve (2),
> and is inherited by child processes created using
> .BR fork (2)
> @@ -107,6 +169,9 @@ or
> .BR clone (2).
> .SH NOTES
> Process policy is not remembered if the page is swapped out.
> +When such a page is paged back in, it will use the policy of
> +the process or memory range that is in effect at the time the
> +page is allocated.
> .SH RETURN VALUE
> On success,
> .BR set_mempolicy ()
> @@ -114,12 +179,49 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> -.\" .SH ERRORS
> -.\" FIXME writeme -- no errors are listed on this page
> -.\" .
> -.\" .TP
> -.\" .B EINVAL
> -.\" .I mode is invalid.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I mode
> +is invalid.
> +Or,
> +.I mode
> +is
> +.I MPOL_DEFAULT
> +and
> +.I nodemask
> +is non-empty,
> +or
> +.I mode
> +is
> +.I MPOL_BIND
> +or
> +.I MPOL_INTERLEAVE
> +and
> +.I nodemask
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node ids that are
> +greater than the maximum supported node id,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
> +Or, none of the node ids specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +.TP
> +.B ENOMEM
> +Insufficient kernel memory was available.
> +
> .SH "VERSIONS AND LIBRARY SUPPORT"
> See
> .BR mbind (2).
> @@ -127,6 +229,7 @@ See
> This system call is Linux specific.
> .SH SEE ALSO
> .BR mbind (2),
> +.BR mmap (2),
> .BR get_mempolicy (2),
> .BR numactl (8),
> .BR numa (3)
>
>
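To make the intended usage concrete, here is a minimal sketch (not
part of the patch) showing the two calls working together. The node
numbers are hypothetical, the mask is simply sized "large enough for
most kernels", and the program needs <numaif.h> and -lnuma from the
numactl package:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MASK_WORDS 16   /* 1024 bits on 64-bit systems */

    int main(void)
    {
        unsigned long nodemask[MASK_WORDS] = { 0x3 }; /* nodes 0 and 1 */
        unsigned long maxnode = MASK_WORDS * sizeof(unsigned long) * 8;
        unsigned long retmask[MASK_WORDS];
        int mode;

        /* interleave new page allocations across nodes 0 and 1 */
        if (set_mempolicy(MPOL_INTERLEAVE, nodemask, maxnode) == -1) {
            perror("set_mempolicy");
            exit(EXIT_FAILURE);
        }

        /* flags == 0: read back the process policy and its nodemask */
        if (get_mempolicy(&mode, retmask, maxnode, NULL, 0) == -1) {
            perror("get_mempolicy");
            exit(EXIT_FAILURE);
        }
        printf("mode=%d nodemask[0]=0x%lx\n", mode, retmask[0]);
        return 0;
    }

On a machine with fewer than two nodes the set_mempolicy() call
fails with EINVAL, per the ERRORS section above.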
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
* mbind.2 man page patch
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
2007-07-23 6:11 ` Michael Kerrisk
@ 2007-07-23 6:32 ` Michael Kerrisk
2007-07-23 14:26 ` Lee Schermerhorn
2007-07-23 6:32 ` get_mempolicy.2 " Michael Kerrisk
2007-07-23 6:33 ` set_mempolicy.2 " Michael Kerrisk
3 siblings, 1 reply; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-23 6:32 UTC (permalink / raw)
To: ak, clameter; +Cc: Lee Schermerhorn, akpm, linux-mm, Samuel Thibault
Andi, Christoph
Could you please review these changes by Lee to the mbind.2 page? Patch
against man-pages-2.63 (available from
http://www.kernel.org/pub/linux/docs/manpages).
Andi / Christoph / Lee: There are a few points marked FIXME about which I'd
particularly like some input.
Lee: aside from the changes that you made, plus my edits, I added a sentence
to this page that came in independently from Samuel Thibault (marked below).
Cheers,
Michael
--- mbind.2.orig 2007-07-01 06:22:24.000000000 +0200
+++ mbind.2 2007-07-21 09:18:05.000000000 +0200
@@ -1,4 +1,5 @@
.\" Copyright 2003,2004 Andi Kleen, SuSE Labs.
+.\" and Copyright (C) 2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
@@ -18,92 +19,214 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
+.\" more precise specification of behavior.
.\"
-.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual"
+.TH MBIND 2 2007-07-20 Linux "Linux Programmer's Manual"
.SH NAME
mbind \- Set memory policy for a memory range
.SH SYNOPSIS
.nf
.B "#include <numaif.h>"
.sp
-.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
+.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
.BI " unsigned long *" nodemask ", unsigned long " maxnode ,
.BI " unsigned " flags );
.sp
-.BI "cc ... \-lnuma"
+Link with \fI\-lnuma\fP.
.fi
.SH DESCRIPTION
+The memory of a NUMA machine is divided into multiple nodes.
+The memory policy defines the node on which memory is allocated.
.BR mbind ()
-sets the NUMA memory
-.I policy
+sets the NUMA memory policy
for the memory range starting with
.I start
and continuing for
.IR len
bytes.
-The memory of a NUMA machine is divided into multiple nodes.
-The memory policy defines in which node memory is allocated.
+.\" The following sentence added by Samuel Thibault:
+.I start
+must be page aligned.
+
+The NUMA policy consists of a policy mode, specified in
+.IR mode ,
+and a set of zero or more nodes, specified in
+.IR nodemask ;
+these arguments are described below.
+
+If the memory range specified by the
+.IR start " and " len
+arguments includes an anonymous region of memory (i.e.,
+a region of memory created using
+.BR mmap (2)
+with the
+.BR MAP_ANONYMOUS
+flag) or
+a memory mapped file mapped using
+.BR mmap (2)
+with the
+.B MAP_PRIVATE
+flag, pages will only be allocated according to the specified
+policy when the application writes [stores] to the page.
+For anonymous regions, an initial read access will use a shared
+page in the kernel containing all zeros.
+For a file mapped with
+.BR MAP_PRIVATE ,
+an initial read access will allocate pages according to the
+process policy of the process that causes the page to be allocated.
+This might not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a memory mapped file mapped using
+.BR mmap (2)
+with the
+.B MAP_SHARED
+flag, the specified policy will be ignored for all page allocations
+in this range.
+.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
+.\" will be ignored for all page allocations in this range".
+.\" That text seems to be saying that if the memory range contains
+.\" (say) some mappings that are allocated with MAP_SHARED
+.\" and others allocated with MAP_PRIVATE, then the policy
+.\" will be ignored for all of the mappings, including even
+.\" the MAP_PRIVATE mappings. Right? I just want to be
+.\" sure that that is what the text is meaning.
+Instead, the pages will be allocated according to the process policy
+of the process that caused the page to be allocated.
+Again, this might not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a shared memory region
+created using
+.BR shmget (2)
+and attached using
+.BR shmat (2),
+pages allocated for the anonymous or shared memory region will
+be allocated according to the policy specified, regardless of which
+process attached to the shared memory segment causes the allocation.
+If, however, the shared memory region was created with the
+.B SHM_HUGETLB
+flag,
+the huge pages will be allocated according to the policy specified
+only if the page allocation is caused by the task that calls
+.BR mbind ()
+for that region.
+
+By default,
.BR mbind ()
only has an effect for new allocations; if the pages inside
-the range have been already touched before setting the policy,
+the range have already been touched before setting the policy,
then the policy has no effect.
+This default behavior may be overridden by the
+.BR MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+flags described below.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
.BR MPOL_INTERLEAVE ,
-and
+or
.BR MPOL_PREFERRED .
-All policies except
+All policy modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify
+the node or nodes to which the mode applies, via the
.I nodemask
-parameter.
+argument.
+
.I nodemask
-is a bit mask of nodes containing up to
+points to a bit mask of nodes containing up to
.I maxnode
bits.
-The actual number of bytes transferred via this argument
+The actual number of bytes transferred via
+.I nodemask
is rounded up to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
-A NULL argument means an empty set of nodes.
+A NULL value for
+.IR nodemask ,
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero, then the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to use the underlying process policy
-(which can be modified with
-.BR set_mempolicy (2)).
-Unless the process policy has been changed this means to allocate
-memory on the node of the CPU that triggered the allocation.
+mode specifies the default policy.
+When applied to a range of memory via
+.BR mbind (),
+this means that the process policy should be used;
+the process policy can be set with
+.BR set_mempolicy (2).
+If the mode of the process policy is also
+.BR MPOL_DEFAULT ,
+then pages will be allocated on the node of the CPU that
+triggers the allocation.
+For
+.BR MPOL_DEFAULT ,
+the
.I nodemask
-should be specified as NULL.
+and
+.I maxnode
+arguments must specify the empty set of nodes.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
-nodes specified in
+mode specifies a strict policy that restricts memory allocation to
+the nodes specified in
.IR nodemask .
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node ID first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node ID specified in
+.I nodemask
+and so forth, until none of the specified nodes contains free memory.
There won't be allocations on other nodes.
+The
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
+mode specifies that page allocations be interleaved across the
+set of nodes specified in
.IR nodemask .
-This optimizes for bandwidth instead of latency.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+at least 1MB or bigger with a fairly uniform access pattern.
+Accesses to a single page of the area will still be limited to
+the memory bandwidth of a single node.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
+The kernel will try to allocate pages on this
node first and fall back to other nodes if the
preferred nodes is low on free memory.
-Only the first node in the
+If
+.I nodemask
+specifies more than one node ID, the first node in the
+mask will be selected as the preferred node.
+If the
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation).
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation.
+This is the only way to specify "local allocation" for a
+range of memory via
+.BR mbind ().
If
.B MPOL_MF_STRICT
@@ -115,17 +238,20 @@
.BR MPOL_DEFAULT ,
then the call will fail with the error
.B EIO
-if the existing pages in the mapping don't follow the policy.
-In 2.6.16 or later the kernel will also try to move pages
-to the requested node with this flag.
+if the existing pages in the memory range don't follow the policy.
+.\" FIXME Andi / Christoph -- can you please verify Lee's change here:
+.\" According to the kernel code, the following is not true
+.\" -- Lee Schermerhorn:
+.\" In 2.6.16 or later the kernel will also try to move pages
+.\" to the requested node with this flag.
If
.B MPOL_MF_MOVE
-is passed in
+is specified in
.IR flags ,
-then an attempt will be made to
-move all the pages in the mapping so that they follow the policy.
-Pages that are shared with other processes are not moved.
+then the kernel will attempt to move all the existing pages
+in the memory range so that they follow the policy.
+Pages that are shared with other processes will not be moved.
If
.B MPOL_MF_STRICT
is also specified, then the call will fail with the error
@@ -136,8 +262,8 @@
.B MPOL_MF_MOVE_ALL
is passed in
.IR flags ,
-then all pages in the mapping will be moved regardless of whether
-other processes use the pages.
+then the kernel will attempt to move all existing pages in the memory
+range regardless of whether other processes use the pages.
The calling process must be privileged
.RB ( CAP_SYS_NICE )
to use this flag.
@@ -154,10 +280,15 @@
.I errno
is set to indicate the error.
.SH ERRORS
+.\" I think I got all of the error returns. -- Lee Schermerhorn
.TP
.B EFAULT
-There was a unmapped hole in the specified memory range
-or a passed pointer was not valid.
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+Or, there was an unmapped hole in the specified memory range.
.TP
.B EINVAL
An invalid value was specified for
@@ -169,56 +300,96 @@
was less than
.IR start ;
or
-.I policy
-was
+.I start
+is not a multiple of the system page size.
+Or,
+.I mode
+is
.B MPOL_DEFAULT
and
.I nodemask
-pointed to a non-empty set;
+specified a non-empty set;
or
-.I policy
-was
+.I mode
+is
.B MPOL_BIND
or
.B MPOL_INTERLEAVE
and
.I nodemask
-pointed to an empty set,
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node IDs that are
+greater than the maximum supported node ID,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets.
+.\" No man page avail to reference. -- Lee Schermerhorn
+Or, none of the node IDs specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
.TP
.B ENOMEM
-System out of memory.
+Insufficient kernel memory was available.
.TP
.B EIO
.B MPOL_MF_STRICT
was specified and an existing page was already on a node
-that does not follow the policy.
+that does not follow the policy;
+or
+.B MPOL_MF_MOVE
+or
+.B MPOL_MF_MOVE_ALL
+was specified and the kernel was unable to move all existing
+pages in the range.
+.TP
+.B EPERM
+The
+.I flags
+argument included the
+.B MPOL_MF_MOVE_ALL
+flag and the caller does not have the
+.B CAP_SYS_NICE
+privilege.
.SH CONFORMING TO
This system call is Linux specific.
.SH NOTES
-NUMA policy is not supported on file mappings.
+NUMA policy is not supported on a memory mapped file range
+that was mapped with the
+.B MAP_SHARED
+flag.
.B MPOL_MF_STRICT
-is ignored on huge page mappings right now.
+is ignored on huge page mappings.
-It is unfortunate that the same flag,
+The
.BR MPOL_DEFAULT ,
-has different effects for
-.BR mbind (2)
+mode has different effects for
+.BR mbind ()
and
.BR set_mempolicy (2).
-To select "allocation on the node of the CPU that
-triggered the allocation" (like
-.BR set_mempolicy (2)
-.BR MPOL_DEFAULT )
-when calling
+When
+.B MPOL_DEFAULT
+is specified for a range of memory using
.BR mbind (),
+any pages subsequently allocated for that range will use
+the process's policy, as set by
+.BR set_mempolicy (2).
+This effectively removes the explicit policy from the
+specified range.
+To select "local allocation" for a memory range,
specify a
-.I policy
+.I mode
of
.B MPOL_PREFERRED
-with an empty
-.IR nodemask .
-.SS "Versions and Library Support"
+with an empty set of nodes.
+This method will work for
+.BR set_mempolicy (2),
+as well.
+.SS "Versions and LIbrary Support"
The
.BR mbind (),
.BR get_mempolicy (2),
@@ -228,16 +399,17 @@
They are only available on kernels compiled with
.BR CONFIG_NUMA .
-Support for huge page policy was added with 2.6.16.
-For interleave policy to be effective on huge page mappings the
-policied memory needs to be tens of megabytes or larger.
-
-.B MPOL_MF_MOVE
-and
-.B MPOL_MF_MOVE_ALL
-are only available on Linux 2.6.16 and later.
+You can link with
+.I \-lnuma
+to get system call definitions.
+.I libnuma
+and the required
+.I numaif.h
+header are available in the
+.I numactl
+package.
-These system calls should not be used directly.
+However, applications should not use these system calls directly.
Instead, the higher level interface provided by the
.BR numa (3)
functions in the
@@ -247,20 +419,25 @@
.I numactl
package is available at
.IR ftp://ftp.suse.com/pub/people/ak/numa/ .
-
-You can link with
-.I \-lnuma
-to get system call definitions.
-.I libnuma
-is available in the
-.I numactl
+The package is also included in some Linux distributions.
+Some distributions include the development library and header
+in the separate
+.I numactl-devel
package.
-This package also has the
-.I numaif.h
-header.
+
+Support for huge page policy was added with 2.6.16.
+For interleave policy to be effective on huge page mappings the
+policied memory needs to be tens of megabytes or larger.
+
+.B MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+are only available on Linux 2.6.16 and later.
.SH SEE ALSO
-.BR numa (3),
-.BR numactl (8),
-.BR set_mempolicy (2),
.BR get_mempolicy (2),
-.BR mmap (2)
+.BR mmap (2),
+.BR set_mempolicy (2),
+.BR shmat (2),
+.BR shmget (2),
+.BR numa (3),
+.BR numactl (8)
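To make the behavior described above concrete, here is a sketch (not
part of the patch) that binds an anonymous MAP_PRIVATE region to a
single, hypothetical node before its pages are first touched. It
assumes <numaif.h> and -lnuma from the numactl package:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MASK_WORDS 16   /* 1024 bits on 64-bit systems */

    int main(void)
    {
        size_t len = 10 * 1024 * 1024;
        unsigned long nodemask[MASK_WORDS] = { 0x1 };  /* node 0 only */
        unsigned long maxnode = MASK_WORDS * sizeof(unsigned long) * 8;

        /* mmap() returns a page-aligned start, as mbind() requires */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

        /* bind the range before first touch; pages are then allocated
           on node 0 as the process writes [stores] to them */
        if (mbind(p, len, MPOL_BIND, nodemask, maxnode, 0) == -1) {
            perror("mbind");
            exit(EXIT_FAILURE);
        }

        p[0] = 1;   /* first store faults a page in on node 0 */
        return 0;
    }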
* get_mempolicy.2 man page patch
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
2007-07-23 6:11 ` Michael Kerrisk
2007-07-23 6:32 ` mbind.2 man page patch Michael Kerrisk
@ 2007-07-23 6:32 ` Michael Kerrisk
2007-07-28 9:31 ` Michael Kerrisk
2007-07-23 6:33 ` set_mempolicy.2 " Michael Kerrisk
3 siblings, 1 reply; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-23 6:32 UTC (permalink / raw)
To: ak, clameter; +Cc: Lee Schermerhorn, akpm, linux-mm
Andi, Christoph
Could you please review these changes by Lee to the get_mempolicy.2 page?
Patch against man-pages-2.63 (available from
http://www.kernel.org/pub/linux/docs/manpages).
Andi/ Christoph / Lee: There are a few points marked FIXME about which I'd
particularly like some input.
Cheers,
Michael
--- get_mempolicy.2.orig 2007-06-23 09:18:02.000000000 +0200
+++ get_mempolicy.2 2007-07-21 09:18:46.000000000 +0200
@@ -1,4 +1,5 @@
.\" Copyright 2003,2004 Andi Kleen, SuSE Labs.
+.\" and Copyright (C) 2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
@@ -18,19 +19,22 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
+.\" more precise specification of behavior.
.\"
-.TH GET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
+.TH GET_MEMPOLICY 2 2007-07-20 Linux "Linux Programmer's Manual"
.SH NAME
get_mempolicy \- Retrieve NUMA memory policy for a process
.SH SYNOPSIS
.B "#include <numaif.h>"
.nf
.sp
-.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
+.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode ", unsigned long " addr ,
.BI " unsigned long " flags );
+.sp
+Link with \fI\-lnuma\fP.
.fi
-.\" FIXME rewrite this DESCRIPTION. it is confusing.
.SH DESCRIPTION
.BR get_mempolicy ()
retrieves the NUMA policy of the calling process or of a memory address,
@@ -39,7 +43,7 @@
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines the node on which memory is allocated for
the process.
If
@@ -58,58 +62,84 @@
address given in
.IR addr .
This policy may be different from the process's default policy if
-.BR set_mempolicy (2)
-has been used to establish a policy for the page containing
+.\" FIXME Lee changed "set_mempolicy" to "mbind" in the following;
+.\" is that correct?
+.BR mbind (2)
+or one of the helper functions described in
+.BR numa (3)
+has been used to establish a policy for the memory range containing
.IR addr .
-If
-.I policy
-is not NULL, then it is used to return the policy.
+If the
+.I mode
+argument is not NULL, then
+.BR get_mempolicy ()
+will store the policy mode of the requested NUMA policy in the location
+pointed to by this argument.
If
.IR nodemask
-is not NULL, then it is used to return the nodemask associated
-with the policy.
+is not NULL, then the nodemask associated with the policy will be stored
+in the location pointed to by this argument.
.I maxnode
-is the maximum bit number plus one that can be stored into
-.IR nodemask .
-The bit number is always rounded to a multiple of
-.IR "unsigned long" .
-.\"
-.\" If
-.\" .I flags
-.\" specifies both
-.\" .B MPOL_F_NODE
-.\" and
-.\" .BR MPOL_F_ADDR ,
-.\" then
-.\" .I policy
-.\" instead returns the number of the node on which the address
-.\" .I addr
-.\" is allocated.
-.\"
-.\" If
-.\" .I flags
-.\" specifies
-.\" .B MPOL_F_NODE
-.\" but not
-.\" .BR MPOL_F_ADDR ,
-.\" and the process's current policy is
-.\" .BR MPOL_INTERLEAVE ,
-.\" then
-.\" checkme: Andi's text below says that the info is returned in
-.\" 'nodemask', not 'policy':
-.\" .I policy
-.\" instead returns the number of the next node that will be used for
-.\" interleaving allocation.
-.\" FIXME .
-.\" The other valid flag is
-.\" .I MPOL_F_NODE.
-.\" It is only valid when the policy is
-.\" .I MPOL_INTERLEAVE.
-.\" In this case not the interleave mask, but an unsigned long with the next
-.\" node that would be used for interleaving is returned in
-.\" .I nodemask.
-.\" Other flag values are reserved.
+specifies the number of node IDs
+that can be stored into
+.IR nodemask
+(i.e.,
+the maximum node ID plus one).
+The value specified by
+.I maxnode
+is always rounded up to a multiple of
+.IR "sizeof(unsigned long)" .
+.\" FIXME: does the preceding sentence mean that if maxnode is (say)
+.\" 22, then the call could neverthless return node IDs in node mask
+.\" up to 31 -- e.g., node 26?
+
+If
+.I flags
+specifies both
+.B MPOL_F_NODE
+and
+.BR MPOL_F_ADDR ,
+.BR get_mempolicy ()
+will return the node ID of the node on which the address
+.I addr
+is allocated.
+The node ID is returned in the location pointed to by
+.IR mode .
+If no page has yet been allocated for the specified address,
+.BR get_mempolicy ()
+will allocate a page as if the process had performed a read
+[load] access at that address, and return the ID of the node
+where that page was allocated.
+
+If
+.I flags
+specifies
+.BR MPOL_F_NODE ,
+but not
+.BR MPOL_F_ADDR ,
+and the process's current policy is
+.BR MPOL_INTERLEAVE ,
+then
+.BR get_mempolicy ()
+will return, in the location pointed to by a non-NULL
+.I mode
+argument,
+the node ID of the next node that will be used for
+interleaving of internal kernel pages allocated on behalf
+of the process.
+.\" Note: code returns next interleave node via 'mode'
+.\" argument -- Lee Schermerhorn
+These allocations include pages for memory mapped files in
+process memory ranges mapped using the
+.IR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag for read accesses, and in memory ranges mapped with the
+.B MAP_SHARED
+flag for all accesses.
+
+Other flag values are reserved.
For an overview of the possible policies see
.BR set_mempolicy (2).
@@ -120,49 +150,89 @@
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I nodemask
-.\" is non-NULL, and
-.\" .I maxnode
-.\" is too small;
-.\" or
-.\" .I flags
-.\" specified values other than
-.\" .B MPOL_F_NODE
-.\" or
-.\" .BR MPOL_F_ADDR ;
-.\" or
-.\" .I flags
-.\" specified
-.\" .B MPOL_F_ADDR
-.\" and
-.\" .I addr
-.\" is NULL.
-.\" (And there are other
-.\" .B EINVAL
-.\" cases.)
+.SH ERRORS
+.TP
+.B EINVAL
+The value specified by
+.I maxnode
+is less than the number of node IDs supported by the system.
+Or
+.I flags
+specified values other than
+.B MPOL_F_NODE
+or
+.BR MPOL_F_ADDR ;
+or
+.I flags
+specified
+.B MPOL_F_ADDR
+and
+.I addr
+is NULL,
+or
+.I flags
+did not specify
+.B MPOL_F_ADDR
+and
+.I addr
+is not NULL.
+Or,
+.I flags
+specified
+.B MPOL_F_NODE
+but not
+.B MPOL_F_ADDR
+and the current process policy is not
+.BR MPOL_INTERLEAVE .
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
.SH CONFORMING TO
This system call is Linux specific.
.SH NOTES
-This manual page is incomplete:
-it does not document the details the
-.BR MPOL_F_NODE
-flag,
-which modifies the operation of
-.BR get_mempolicy ().
-This is deliberate: this flag is not intended for application use,
-and its operation may change or it may be removed altogether in
-future kernel versions.
-.B Do not use it.
+If the mode of the process policy or the policy governing allocations
+at the specified address is
+.B MPOL_PREFERRED
+and this policy was installed with an empty
+.IR nodemask
+(i.e., specifying local allocation),
+.BR get_mempolicy ()
+will return the mask of on-line node IDs, in the location pointed to by
+a non-NULL
+.I nodemask
+argument.
+This mask does not take into consideration any administratively imposed
+restrictions on the process's context.
+.\" "context" above refers to cpusets.
+.\" No man page to reference. -- Lee Schermerhorn
+.\"
+.\" FIXME: Andi / Lee -- can you please resolve the following (mtk):
+.\"
+.\" Christoph says the following is untrue. These are "fully supported."
+.\" Andi concedes that he has lost this battle and approves [?]
+.\" updating the man pages to document the behavior. -- Lee Schermerhorn
+.\" This manual page is incomplete:
+.\" it does not document the details the
+.\" .BR MPOL_F_NODE
+.\" flag,
+.\" which modifies the operation of
+.\" .BR get_mempolicy ().
+.\" This is deliberate: this flag is not intended for application use,
+.\" and its operation may change or it may be removed altogether in
+.\" future kernel versions.
+.\" .B Do not use it.
.SS "Versions and Library Support"
See
.BR mbind (2).
.SH SEE ALSO
.BR mbind (2),
+.BR mmap (2),
.BR set_mempolicy (2),
-.BR numactl (8),
-.BR numa (3)
+.BR numa (3),
+.BR numactl (8)
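Here is a sketch (not part of the patch) of the MPOL_F_NODE plus
MPOL_F_ADDR usage documented above, asking the kernel which node
holds a given page. It assumes <numaif.h> and -lnuma from the
numactl package:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

        p[0] = 1;   /* touch the page so it is allocated somewhere */

        /* with both flags, the node ID comes back via 'mode' */
        int node;
        if (get_mempolicy(&node, NULL, 0, p,
                          MPOL_F_NODE | MPOL_F_ADDR) == -1) {
            perror("get_mempolicy");
            exit(EXIT_FAILURE);
        }
        printf("page at %p is on node %d\n", (void *)p, node);
        return 0;
    }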
* set_mempolicy.2 man page patch
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
` (2 preceding siblings ...)
2007-07-23 6:32 ` get_mempolicy.2 " Michael Kerrisk
@ 2007-07-23 6:33 ` Michael Kerrisk
3 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-23 6:33 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: ak, akpm, linux-mm, clameter
Andi, Christoph
Could you please review these changes by Lee to the set_mempolicy.2 page?
Patch against man-pages-2.63 (available from
http://www.kernel.org/pub/linux/docs/manpages).
Cheers,
Michael
--- set_mempolicy.2.orig 2007-06-23 09:18:02.000000000 +0200
+++ set_mempolicy.2 2007-07-21 09:17:44.000000000 +0200
@@ -1,4 +1,5 @@
.\" Copyright 2003,2004 Andi Kleen, SuSE Labs.
+.\" and Copyright (C) 2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>
.\"
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
@@ -18,93 +19,161 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
+.\" more precise specification of behavior.
.\"
-.TH SET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
+.TH SET_MEMPOLICY 2 2007-07-20 Linux "Linux Programmer's Manual"
.SH NAME
-set_mempolicy \- set default NUMA memory policy for a process and its children.
+set_mempolicy \- set default NUMA memory policy for a process
+and its children
.SH SYNOPSIS
.nf
.B "#include <numaif.h>"
.sp
-.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
+.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode );
+.sp
+Link with \fI\-lnuma\fP.
.fi
.SH DESCRIPTION
.BR set_mempolicy ()
-sets the NUMA memory policy of the calling process to
-.IR policy .
+sets the NUMA memory policy of the calling process,
+which consists of a policy mode and zero or more nodes,
+to the values specified by the
+.IR mode ,
+.I nodemask
+and
+.IR maxnode
+arguments.
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines the node on which memory is allocated for
the process.
-This system call defines the default policy for the process;
-in addition a policy can be set for specific memory ranges using
+This system call defines the default policy for the process.
+The process policy governs allocation of pages in the process's
+address space outside of memory ranges
+controlled by a more specific policy set by
.BR mbind (2).
+The process default policy also controls allocation of any pages for
+memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag and that are only read [loaded] by the task,
+and of memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_SHARED
+flag, regardless of the access type.
The policy is only applied when a new page is allocated
for the process.
For anonymous memory this is when the page is first
touched by the application.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
+.B MPOL_INTERLEAVE
+or
.BR MPOL_PREFERRED .
-All policies except
+All modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify one or more nodes to which the mode
+applies, via the
.I nodemask
-parameter.
+argument.
+
.I nodemask
-is pointer to a bit field of nodes that contains up to
+points to a bit mask of node IDs that contains up to
.I maxnode
bits.
-The bit field size is rounded to the next multiple of
+The actual number of bytes transferred via
+.I nodemask
+is rounded up to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
+A NULL value for
+.IR nodemask ,
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to allocate memory locally,
-i.e., on the node of the CPU that triggered the allocation.
+mode is the default and means to allocate memory locally
+(i.e., on the node of the CPU that triggered the allocation).
.I nodemask
-should be specified as NULL.
+must be specified as NULL.
+If the "local node" contains no free memory, the system will
+attempt to allocate memory from a "nearby" node.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
+mode defines a strict policy that restricts memory allocation to the
nodes specified in
.IR nodemask .
-There won't be allocations on other nodes.
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node ID first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node ID specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
+Pages will not be allocated from any node not specified in the
+.IR nodemask .
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
-.IR nodemask .
-This optimizes for bandwidth instead of latency.
-To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+interleaves page allocations across the nodes specified in
+.I nodemask
+in numeric node ID order.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
+However, accesses to a single page will still be limited to
+the memory bandwidth of a single node.
+.\" NOTE: the following sentence doesn't make sense in the context
+.\" of set_mempolicy() -- no memory area specified.
+.\" To be effective the memory area should be fairly large,
+.\" at least 1MB or bigger.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
-node first and fall back to other nodes if the preferred node is low on free
+The kernel will try to allocate pages from this node first
+and fall back to "nearby" nodes if the preferred node is low on free
memory.
-Only the first node in the
+If
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation (like
+specifies more than one node ID, the first node in the
+mask will be selected as the preferred node.
+If the
+.I nodemask
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation (like
.BR MPOL_DEFAULT ).
-The memory policy is preserved across an
+The process memory policy is preserved across an
.BR execve (2),
and is inherited by child processes created using
.BR fork (2)
or
.BR clone (2).
.SH RETURN VALUE
On success,
.BR set_mempolicy ()
@@ -112,21 +181,62 @@
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I mode is invalid.
-.SH CONFORMING TO
-This system call is Linux specific.
+.SH ERRORS
+.TP
+.B EINVAL
+.I mode
+is invalid.
+Or,
+.I mode
+is
+.B MPOL_DEFAULT
+and
+.I nodemask
+is non-empty,
+or
+.I mode
+is
+.B MPOL_BIND
+or
+.B MPOL_INTERLEAVE
+and
+.I nodemask
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node IDs that are
+greater than the maximum supported node ID,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets.
+.\" No man page avail to ref. --Lee Schermerhorn
+Or, none of the node IDs specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
.SH NOTES
Process policy is not remembered if the page is swapped out.
+When such a page is paged back in, it will use the policy of
+the process or memory range that is in effect at the time the
+page is allocated.
.SS "Versions and Library Support"
See
.BR mbind (2).
+.SH CONFORMING TO
+This system call is Linux specific.
.SH SEE ALSO
-.BR mbind (2),
.BR get_mempolicy (2),
-.BR numactl (8),
-.BR numa (3)
+.BR mbind (2),
+.BR mmap (2),
+.BR numa (3),
+.BR numactl (8)
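Here is a sketch (not part of the patch) of the "local allocation"
case described above: MPOL_PREFERRED with the empty set of nodes.
It assumes <numaif.h> and -lnuma from the numactl package:

    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* a NULL nodemask with maxnode == 0 is the empty set of
           nodes, so MPOL_PREFERRED requests local allocation */
        if (set_mempolicy(MPOL_PREFERRED, NULL, 0) == -1) {
            perror("set_mempolicy");
            exit(EXIT_FAILURE);
        }

        /* from here on, new pages are allocated on the node of the
           CPU that triggers the allocation, falling back to nearby
           nodes when the local node is low on free memory */
        return 0;
    }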
* Re: mbind.2 man page patch
2007-07-23 6:32 ` mbind.2 man page patch Michael Kerrisk
@ 2007-07-23 14:26 ` Lee Schermerhorn
2007-07-26 17:19 ` Michael Kerrisk
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-07-23 14:26 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, clameter, akpm, linux-mm, Samuel Thibault
On Mon, 2007-07-23 at 08:32 +0200, Michael Kerrisk wrote:
> Andi, Christoph
>
> Could you please review these changes by Lee to the mbind.2 page? Patch
> against man-pages-2.63 (available from
> http://www.kernel.org/pub/linux/docs/manpages).
>
> Andi / Christoph / Lee: There are a few points marked FIXME about which I'd
> particularly like some input.
>
> Lee: aside from the changes that you made, plus my edits, I added a sentence
> to this page that came in independently from Samuel Thibault (marked below).
>
> Cheers,
>
> Michael
>
[...]
> +If the specified memory range includes a memory mapped file mapped using
> +.BR mmap (2)
> +with the
> +.B MAP_SHARED
> +flag, the specified policy will be ignored for all page allocations
> +in this range.
> +.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
> +.\" will be ignored for all page allocations in this range".
> +.\" That text seems to be saying that if the memory range contains
> +.\" (say) some mappings that are allocated with MAP_SHARED
> +.\" and others allocated with MAP_PRIVATE, then the policy
> +.\" will be ignored for all of the mappings, including even
> +.\" the MAP_PRIVATE mappings. Right? I just want to be
> +.\" sure that that is what the text is meaning.
I can see from the wording how you might think this. However, policy
will only be ignored for the SHARED mappings.
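To illustrate with a sketch (the file name is hypothetical; assumes
<numaif.h> and -lnuma from the numactl package):

    #include <fcntl.h>
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define MASK_WORDS 16

    int main(void)
    {
        unsigned long nodemask[MASK_WORDS] = { 0x1 };  /* node 0 */
        unsigned long maxnode = MASK_WORDS * sizeof(unsigned long) * 8;

        int fd = open("datafile", O_RDWR);   /* hypothetical file */
        if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

        char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        if (shared == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

        /* the call succeeds, but page allocations for this MAP_SHARED
           range ignore the range policy; they follow the process
           policy of whichever process faults each page in */
        if (mbind(shared, 4096, MPOL_BIND, nodemask, maxnode, 0) == -1)
            perror("mbind");
        return 0;
    }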
> +Instead, the pages will be allocated according to the process policy
> +of the process that caused the page to be allocated.
> +Again, this might not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a shared memory region
> +created using
> +.BR shmget (2)
> +and attached using
> +.BR shmat (2),
> +pages allocated for the anonymous or shared memory region will
> +be allocated according to the policy specified, regardless of which
> +process attached to the shared memory segment causes the allocation.
> +If, however, the shared memory region was created with the
> +.B SHM_HUGETLB
> +flag,
> +the huge pages will be allocated according to the policy specified
> +only if the page allocation is caused by the task that calls
> +.BR mbind ()
> +for that region.
> +
> +By default,
> .BR mbind ()
> only has an effect for new allocations; if the pages inside
> -the range have been already touched before setting the policy,
> +the range have already been touched before setting the policy,
> then the policy has no effect.
> +This default behavior may be overridden by the
> +.BR MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +flags described below.
>
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
> .BR MPOL_DEFAULT ,
> .BR MPOL_BIND ,
> .BR MPOL_INTERLEAVE ,
> -and
> +or
> .BR MPOL_PREFERRED .
> -All policies except
> +All policy modes except
> .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify
> +the node or nodes to which the mode applies, via the
> .I nodemask
> -parameter.
> +argument.
> +
> .I nodemask
> -is a bit mask of nodes containing up to
> +points to a bit mask of nodes containing up to
> .I maxnode
> bits.
> -The actual number of bytes transferred via this argument
> +The actual number of bytes transferred via
> +.I nodemask
> is rounded up to the next multiple of
> .IR "sizeof(unsigned long)" ,
> but the kernel will only use bits up to
> .IR maxnode .
> -A NULL argument means an empty set of nodes.
> +A NULL value for
> +.IR nodemask ,
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero, then the
> +.I nodemask
> +argument is ignored.
>
> The
> .B MPOL_DEFAULT
> -policy is the default and means to use the underlying process policy
> -(which can be modified with
> -.BR set_mempolicy (2)).
> -Unless the process policy has been changed this means to allocate
> -memory on the node of the CPU that triggered the allocation.
> +mode specifies the default policy.
> +When applied to a range of memory via
> +.BR mbind (),
> +this means that the process policy should be used;
> +the process policy can be set with
> +.BR set_mempolicy (2).
> +If the mode of the process policy is also
> +.BR MPOL_DEFAULT ,
> +then pages will be allocated on the node of the CPU that
> +triggers the allocation.
> +For
> +.BR MPOL_DEFAULT ,
> +the
> .I nodemask
> -should be specified as NULL.
> +and
> +.I maxnode
> +arguments must be specify the empty set of nodes.
>
> The
> .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> -nodes specified in
> +mode specifies a strict policy that restricts memory allocation to
> +the nodes specified in
> .IR nodemask .
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node ID first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node ID specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contains free memory.
> There won't be allocations on other nodes.
>
> +The
> .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> +mode specifies that page allocations be interleaved across the
> +set of nodes specified in
> .IR nodemask .
> -This optimizes for bandwidth instead of latency.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
> To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +at least 1MB or bigger with a fairly uniform access pattern.
> +Accesses to a single page of the area will still be limited to
> +the memory bandwidth of a single node.
>
> .B MPOL_PREFERRED
> sets the preferred node for allocation.
> -The kernel will try to allocate in this
> +The kernel will try to allocate pages on this
> node first and fall back to other nodes if the
> preferred nodes is low on free memory.
> -Only the first node in the
> +If
> +.I nodemask
> +specifies more than one node ID, the first node in the
> +mask will be selected as the preferred node.
> +If the
> .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation).
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation.
> +This is the only way to specify "local allocation" for a
> +range of memory via
> +.BR mbind ().
>
> If
> .B MPOL_MF_STRICT
> @@ -115,17 +238,20 @@
> .BR MPOL_DEFAULT ,
> then the call will fail with the error
> .B EIO
> -if the existing pages in the mapping don't follow the policy.
> -In 2.6.16 or later the kernel will also try to move pages
> -to the requested node with this flag.
> +if the existing pages in the memory range don't follow the policy.
> +.\" FIXME Andi / Christoph -- can you please verify Lee's change here:
> +.\" According to the kernel code, the following is not true
> +.\" -- Lee Schermerhorn:
> +.\" In 2.6.16 or later the kernel will also try to move pages
> +.\" to the requested node with this flag.
>
> If
> .B MPOL_MF_MOVE
> -is passed in
> +is specified in
> .IR flags ,
> -then an attempt will be made to
> -move all the pages in the mapping so that they follow the policy.
> -Pages that are shared with other processes are not moved.
> +then the kernel will attempt to move all the existing pages
> +in the memory range so that they follow the policy.
> +Pages that are shared with other processes will not be moved.
> If
> .B MPOL_MF_STRICT
> is also specified, then the call will fail with the error
> @@ -136,8 +262,8 @@
> .B MPOL_MF_MOVE_ALL
> is passed in
> .IR flags ,
> -then all pages in the mapping will be moved regardless of whether
> -other processes use the pages.
> +then the kernel will attempt to move all existing pages in the memory
> +range regardless of whether other processes use the pages.
> The calling process must be privileged
> .RB ( CAP_SYS_NICE )
> to use this flag.
> @@ -154,10 +280,15 @@
> .I errno
> is set to indicate the error.
> .SH ERRORS
> +.\" I think I got all of the error returns. -- Lee Schermerhorn
> .TP
> .B EFAULT
> -There was a unmapped hole in the specified memory range
> -or a passed pointer was not valid.
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +Or, there was a unmapped hole in the specified memory range.
> .TP
> .B EINVAL
> An invalid value was specified for
> @@ -169,56 +300,96 @@
> was less than
> .IR start ;
> or
> -.I policy
> -was
> +.I start
> +is not a multiple of the system page size.
> +Or,
> +.I mode
> +is
> .B MPOL_DEFAULT
> and
> .I nodemask
> -pointed to a non-empty set;
> +specified a non-empty set;
> or
> -.I policy
> -was
> +.I mode
> +is
> .B MPOL_BIND
> or
> .B MPOL_INTERLEAVE
> and
> .I nodemask
> -pointed to an empty set,
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node IDs that are
> +greater than the maximum supported node ID,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets.
> +.\" No man page avail to reference. -- Lee Schermerhorn
> +Or, none of the node IDs specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
> .TP
> .B ENOMEM
> -System out of memory.
> +Insufficient kernel memory was available.
> .TP
> .B EIO
> .B MPOL_MF_STRICT
> was specified and an existing page was already on a node
> -that does not follow the policy.
> +that does not follow the policy;
> +or
> +.B MPOL_MF_MOVE
> +or
> +.B MPOL_MF_MOVE_ALL
> +was specified and the kernel was unable to move all existing
> +pages in the range.
> +.TP
> +.B EPERM
> +The
> +.I flags
> +argument included the
> +.B MPOL_MF_MOVE_ALL
> +flag and the caller does not have the
> +.B CAP_SYS_NICE
> +privilege.
> .SH CONFORMING TO
> This system call is Linux specific.
> .SH NOTES
> -NUMA policy is not supported on file mappings.
> +NUMA policy is not supported on a memory mapped file range
> +that was mapped with the
> +.B MAP_SHARED
> +flag.
>
> .B MPOL_MF_STRICT
> -is ignored on huge page mappings right now.
> +is ignored on huge page mappings.
>
> -It is unfortunate that the same flag,
> +The
> .BR MPOL_DEFAULT ,
> -has different effects for
> -.BR mbind (2)
> +mode has different effects for
> +.BR mbind ()
> and
> .BR set_mempolicy (2).
> -To select "allocation on the node of the CPU that
> -triggered the allocation" (like
> -.BR set_mempolicy (2)
> -.BR MPOL_DEFAULT )
> -when calling
> +When
> +.B MPOL_DEFAULT
> +is specified for a range of memory using
> .BR mbind (),
> +any pages subsequently allocated for that range will use
> +the process's policy, as set by
> +.BR set_mempolicy (2).
> +This effectively removes the explicit policy from the
> +specified range.
> +To select "local allocation" for a memory range,
> specify a
> -.I policy
> +.I mode
> of
> .B MPOL_PREFERRED
> -with an empty
> -.IR nodemask .
> -.SS "Versions and Library Support"
> +with an empty set of nodes.
> +This method will work for
> +.BR set_mempolicy (2),
> +as well.
> +.SS "Versions and LIbrary Support"
> The
> .BR mbind (),
> .BR get_mempolicy (2),
> @@ -228,16 +399,17 @@
> They are only available on kernels compiled with
> .BR CONFIG_NUMA .
>
> -Support for huge page policy was added with 2.6.16.
> -For interleave policy to be effective on huge page mappings the
> -policied memory needs to be tens of megabytes or larger.
> -
> -.B MPOL_MF_MOVE
> -and
> -.B MPOL_MF_MOVE_ALL
> -are only available on Linux 2.6.16 and later.
> +You can link with
> +.I \-lnuma
> +to get system call definitions.
> +.I libnuma
> +and the required
> +.I numaif.h
> +header are available in the
> +.I numactl
> +package.
>
> -These system calls should not be used directly.
> +However, applications should not use these system calls directly.
> Instead, the higher level interface provided by the
> .BR numa (3)
> functions in the
> @@ -247,20 +419,25 @@
> .I numactl
> package is available at
> .IR ftp://ftp.suse.com/pub/people/ak/numa/ .
> -
> -You can link with
> -.I \-lnuma
> -to get system call definitions.
> -.I libnuma
> -is available in the
> -.I numactl
> +The package is also included in some Linux distributions.
> +Some distributions include the development library and header
> +in the separate
> +.I numactl-devel
> package.
> -This package also has the
> -.I numaif.h
> -header.
> +
> +Support for huge page policy was added with 2.6.16.
> +For interleave policy to be effective on huge page mappings the
> +policied memory needs to be tens of megabytes or larger.
> +
> +.B MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +are only available on Linux 2.6.16 and later.
> .SH SEE ALSO
> -.BR numa (3),
> -.BR numactl (8),
> -.BR set_mempolicy (2),
> .BR get_mempolicy (2),
> -.BR mmap (2)
> +.BR mmap (2),
> +.BR set_mempolicy (2),
> +.BR shmat (2),
> +.BR shmget (2),
> +.BR numa (3),
> +.BR numactl (8)
>
>
>
>
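For concreteness, the MPOL_MF_MOVE/MPOL_MF_STRICT combination documented
above can be exercised with a minimal sketch like the following (node 0
is an assumption, error handling is omitted, link with -lnuma, kernel
2.6.16 or later):

#include <numaif.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 20;
        unsigned long node0 = 1UL << 0;         /* bit mask: node 0 only */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        memset(p, 1, len);              /* fault all pages in first */

        /* Bind the range to node 0 and ask the kernel to migrate the
         * pages that already exist; with MPOL_MF_STRICT, failure to
         * move any page is reported as EIO. */
        mbind(p, len, MPOL_BIND, &node0, 8 * sizeof(node0),
              MPOL_MF_MOVE | MPOL_MF_STRICT);
        return 0;
}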
* Re: mbind.2 man page patch
2007-07-23 14:26 ` Lee Schermerhorn
@ 2007-07-26 17:19 ` Michael Kerrisk
2007-07-26 18:06 ` Lee Schermerhorn
0 siblings, 1 reply; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-26 17:19 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: ak, clameter, akpm, linux-mm, Samuel Thibault
[...]
>> +If the specified memory range includes a memory mapped file mapped using
>> +.BR mmap (2)
>> +with the
>> +.B MAP_SHARED
>> +flag, the specified policy will be ignored for all page allocations
>> +in this range.
>> +.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
>> +.\" will be ignored for all page allocations in this range".
>> +.\" That text seems to be saying that if the memory range contains
>> +.\" (say) some mappings that are allocated with MAP_SHARED
>> +.\" and others allocated with MAP_PRIVATE, then the policy
>> +.\" will be ignored for all of the mappings, including even
>> +.\" the MAP_PRIVATE mappings. Right? I just want to be
>> +.\" sure that that is what the text is meaning.
>
> I can see from the wording how you might think this. However, policy
> will only be ignored for the SHARED mappings.
So is a better wording something like:
The specified policy will be ignored for any MAP_SHARED
file mappings in the specified memory range.
?
Cheers,
Michael
* Re: mbind.2 man page patch
2007-07-26 17:19 ` Michael Kerrisk
@ 2007-07-26 18:06 ` Lee Schermerhorn
2007-07-26 18:18 ` Michael Kerrisk
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-07-26 18:06 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, clameter, akpm, linux-mm, Samuel Thibault
On Thu, 2007-07-26 at 19:19 +0200, Michael Kerrisk wrote:
> [...]
> >> +If the specified memory range includes a memory mapped file mapped using
> >> +.BR mmap (2)
> >> +with the
> >> +.B MAP_SHARED
> >> +flag, the specified policy will be ignored for all page allocations
> >> +in this range.
> >> +.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
> >> +.\" will be ignored for all page allocations in this range".
> >> +.\" That text seems to be saying that if the memory range contains
> >> +.\" (say) some mappings that are allocated with MAP_SHARED
> >> +.\" and others allocated with MAP_PRIVATE, then the policy
> >> +.\" will be ignored for all of the mappings, including even
> >> +.\" the MAP_PRIVATE mappings. Right? I just want to be
> >> +.\" sure that that is what the text is meaning.
> >
> > I can see from the wording how you might think this. However, policy
> > will only be ignored for the SHARED mappings.
>
> So is a better wording something like:
>
> The specified policy will be ignored for any MAP_SHARED
> file mappings in the specified memory range.
>
Wish I'd written that ;-)
Seriously, that is correct.
Lee
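A minimal sketch of the agreed semantics, for concreteness (node 1 and
the temporary file path are assumptions; error handling is omitted;
link with -lnuma). The same mbind() call takes effect for the
MAP_PRIVATE mapping but is ignored for page allocation in the
MAP_SHARED file mapping:

#include <numaif.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
        long pagesz = sysconf(_SC_PAGESIZE);
        int fd = open("/tmp/mbind-demo", O_RDWR | O_CREAT, 0600);
        ftruncate(fd, pagesz);

        char *shr = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        char *prv = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        unsigned long node1 = 1UL << 1;         /* bit mask: node 1 only */

        /* Honored: anonymous MAP_PRIVATE mapping. */
        mbind(prv, pagesz, MPOL_BIND, &node1, 8 * sizeof(node1), 0);

        /* Ignored for allocation: MAP_SHARED file pages follow the
         * process policy of whoever faults them in. */
        mbind(shr, pagesz, MPOL_BIND, &node1, 8 * sizeof(node1), 0);

        prv[0] = 1;     /* page should come from node 1 */
        shr[0] = 1;     /* page placed per the faulting process's policy */
        return 0;
}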
* Re: mbind.2 man page patch
2007-07-26 18:06 ` Lee Schermerhorn
@ 2007-07-26 18:18 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-26 18:18 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: ak, clameter, akpm, linux-mm, Samuel Thibault
Lee Schermerhorn wrote:
> On Thu, 2007-07-26 at 19:19 +0200, Michael Kerrisk wrote:
>> [...]
>>>> +If the specified memory range includes a memory mapped file mapped using
>>>> +.BR mmap (2)
>>>> +with the
>>>> +.B MAP_SHARED
>>>> +flag, the specified policy will be ignored for all page allocations
>>>> +in this range.
>>>> +.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
>>>> +.\" will be ignored for all page allocations in this range".
>>>> +.\" That text seems to be saying that if the memory range contains
>>>> +.\" (say) some mappings that are allocated with MAP_SHARED
>>>> +.\" and others allocated with MAP_PRIVATE, then the policy
>>>> +.\" will be ignored for all of the mappings, including even
>>>> +.\" the MAP_PRIVATE mappings. Right? I just want to be
>>>> +.\" sure that that is what the text is meaning.
>>> I can see from the wording how you might think this. However, policy
>>> will only be ignored for the SHARED mappings.
>> So is a better wording something like:
>>
>> The specified policy will be ignored for any MAP_SHARED
>> file mappings in the specified memory range.
>>
>
> Wish I'd written that ;-)
It's just like code. Simpler is usually better ;-).
> Seriously, that is correct.
Good.
Cheers,
Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
* Re: get_mempolicy.2 man page patch
2007-07-23 6:32 ` get_mempolicy.2 " Michael Kerrisk
@ 2007-07-28 9:31 ` Michael Kerrisk
2007-08-09 18:43 ` Lee Schermerhorn
2007-08-16 20:05 ` Andi Kleen
0 siblings, 2 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-07-28 9:31 UTC (permalink / raw)
To: ak, clameter; +Cc: Michael Kerrisk, Lee Schermerhorn, akpm, linux-mm
Andi, Christoph,
Would one or both of you be willing to review the three man page patches by
Lee (mbind.2, set_mempolicy.2, get_mempolicy.2)?
Cheers,
Michael
Michael Kerrisk wrote:
> Andi, Christoph
>
> Could you please review these changes by Lee to the get_mempolicy.2 page?
> Patch against man-pages-2.63 (available from
> http://www.kernel.org/pub/linux/docs/manpages).
>
> Andi/ Christoph / Lee: There are a few points marked FIXME about which I'd
> particularly like some input.
>
> Cheers,
>
> Michael
>
>
> --- get_mempolicy.2.orig 2007-06-23 09:18:02.000000000 +0200
> +++ get_mempolicy.2 2007-07-21 09:18:46.000000000 +0200
> @@ -1,4 +1,5 @@
> .\" Copyright 2003,2004 Andi Kleen, SuSE Labs.
> +.\" and Copyright (C) 2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> .\"
> .\" Permission is granted to make and distribute verbatim copies of this
> .\" manual provided the copyright notice and this permission notice are
> @@ -18,19 +19,22 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> +.\" more precise specification of behavior.
> .\"
> -.TH GET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
> +.TH GET_MEMPOLICY 2 2007-07-20 Linux "Linux Programmer's Manual"
> .SH NAME
> get_mempolicy \- Retrieve NUMA memory policy for a process
> .SH SYNOPSIS
> .B "#include <numaif.h>"
> .nf
> .sp
> -.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
> +.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
> .BI " unsigned long " maxnode ", unsigned long " addr ,
> .BI " unsigned long " flags );
> +.sp
> +Link with \fI\-lnuma\fP.
> .fi
> -.\" FIXME rewrite this DESCRIPTION. it is confusing.
> .SH DESCRIPTION
> .BR get_mempolicy ()
> retrieves the NUMA policy of the calling process or of a memory address,
> @@ -39,7 +43,7 @@
>
> A NUMA machine has different
> memory controllers with different distances to specific CPUs.
> -The memory policy defines in which node memory is allocated for
> +The memory policy defines the node on which memory is allocated for
> the process.
>
> If
> @@ -58,58 +62,84 @@
> address given in
> .IR addr .
> This policy may be different from the process's default policy if
> -.BR set_mempolicy (2)
> -has been used to establish a policy for the page containing
> +.\" FIXME Lee changed "set_mempolicy" to "mbind" in the following;
> +.\" is that correct?
> +.BR mbind (2)
> +or one of the helper functions described in
> +.BR numa (3)
> +has been used to establish a policy for the memory range containing
> .IR addr .
>
> -If
> -.I policy
> -is not NULL, then it is used to return the policy.
> +If the
> +.I mode
> +argument is not NULL, then
> +.BR get_mempolicy ()
> +will store the policy mode of the requested NUMA policy in the location
> +pointed to by this argument.
> If
> .IR nodemask
> -is not NULL, then it is used to return the nodemask associated
> -with the policy.
> +is not NULL, then the nodemask associated with the policy will be stored
> +in the location pointed to by this argument.
> .I maxnode
> -is the maximum bit number plus one that can be stored into
> -.IR nodemask .
> -The bit number is always rounded to a multiple of
> -.IR "unsigned long" .
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies both
> -.\" .B MPOL_F_NODE
> -.\" and
> -.\" .BR MPOL_F_ADDR ,
> -.\" then
> -.\" .I policy
> -.\" instead returns the number of the node on which the address
> -.\" .I addr
> -.\" is allocated.
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies
> -.\" .B MPOL_F_NODE
> -.\" but not
> -.\" .BR MPOL_F_ADDR ,
> -.\" and the process's current policy is
> -.\" .BR MPOL_INTERLEAVE ,
> -.\" then
> -.\" checkme: Andi's text below says that the info is returned in
> -.\" 'nodemask', not 'policy':
> -.\" .I policy
> -.\" instead returns the number of the next node that will be used for
> -.\" interleaving allocation.
> -.\" FIXME .
> -.\" The other valid flag is
> -.\" .I MPOL_F_NODE.
> -.\" It is only valid when the policy is
> -.\" .I MPOL_INTERLEAVE.
> -.\" In this case not the interleave mask, but an unsigned long with the next
> -.\" node that would be used for interleaving is returned in
> -.\" .I nodemask.
> -.\" Other flag values are reserved.
> +specifies the number of node IDs
> +that can be stored into
> +.IR nodemask
> +(i.e.,
> +the maximum node ID plus one).
> +The value specified by
> +.I maxnode
> +is always rounded up to a multiple of
> +.IR "sizeof(unsigned long)" .
> +.\" FIXME: does the preceding sentence mean that if maxnode is (say)
> +.\" 22, then the call could neverthless return node IDs in node mask
> +.\" up to 31 -- e.g., node 26?
> +
> +If
> +.I flags
> +specifies both
> +.B MPOL_F_NODE
> +and
> +.BR MPOL_F_ADDR ,
> +.BR get_mempolicy ()
> +will return the node ID of the node on which the address
> +.I addr
> +is allocated.
> +The node ID is returned in the location pointed to by
> +.IR mode .
> +If no page has yet been allocated for the specified address,
> +.BR get_mempolicy ()
> +will allocate a page as if the process had performed a read
> +[load] access at that address, and return the ID of the node
> +where that page was allocated.
> +
> +If
> +.I flags
> +specifies
> +.BR MPOL_F_NODE ,
> +but not
> +.BR MPOL_F_ADDR ,
> +and the process's current policy is
> +.BR MPOL_INTERLEAVE ,
> +then
> +.BR get_mempolicy ()
> +will return in the location pointed to by a non-NULL
> +.I mode
> +argument,
> +the node ID of the next node that will be used for
> +interleaving of internal kernel pages allocated on behalf
> +of the process.
> +.\" Note: code returns next interleave node via 'mode'
> +.\" argument -- Lee Schermerhorn
> +These allocations include pages for memory mapped files in
> +process memory ranges mapped using the
> +.BR mmap (2)
> +call with the
> +.B MAP_PRIVATE
> +flag for read accesses, and in memory ranges mapped with the
> +.B MAP_SHARED
> +flag for all accesses.
> +
> +Other flag values are reserved.
>
> For an overview of the possible policies see
> .BR set_mempolicy (2).
> @@ -120,49 +150,89 @@
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> -.\" .SH ERRORS
> -.\" FIXME -- no errors are listed on this page
> -.\" .
> -.\" .TP
> -.\" .B EINVAL
> -.\" .I nodemask
> -.\" is non-NULL, and
> -.\" .I maxnode
> -.\" is too small;
> -.\" or
> -.\" .I flags
> -.\" specified values other than
> -.\" .B MPOL_F_NODE
> -.\" or
> -.\" .BR MPOL_F_ADDR ;
> -.\" or
> -.\" .I flags
> -.\" specified
> -.\" .B MPOL_F_ADDR
> -.\" and
> -.\" .I addr
> -.\" is NULL.
> -.\" (And there are other
> -.\" .B EINVAL
> -.\" cases.)
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +The value specified by
> +.I maxnode
> +is less than the number of node IDs supported by the system.
> +Or
> +.I flags
> +specified values other than
> +.B MPOL_F_NODE
> +or
> +.BR MPOL_F_ADDR ;
> +or
> +.I flags
> +specified
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is NULL,
> +or
> +.I flags
> +did not specify
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is not NULL.
> +Or,
> +.I flags
> +specified
> +.B MPOL_F_NODE
> +but not
> +.B MPOL_F_ADDR
> +and the current process policy is not
> +.BR MPOL_INTERLEAVE .
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> .SH CONFORMING TO
> This system call is Linux specific.
> .SH NOTES
> -This manual page is incomplete:
> -it does not document the details the
> -.BR MPOL_F_NODE
> -flag,
> -which modifies the operation of
> -.BR get_mempolicy ().
> -This is deliberate: this flag is not intended for application use,
> -and its operation may change or it may be removed altogether in
> -future kernel versions.
> -.B Do not use it.
> +If the mode of the process policy or the policy governing allocations
> +at the specified address is
> +.B MPOL_PREFERRED
> +and this policy was installed with an empty
> +.IR nodemask
> +(i.e., specifying local allocation),
> +.BR get_mempolicy ()
> +will return the mask of on-line node IDs, in the location pointed to by
> +a non-NULL
> +.I nodemask
> +argument.
> +This mask does not take into consideration any administratively imposed
> +restrictions on the process's context.
> +.\" "context" above refers to cpusets.
> +.\" No man page to reference. -- Lee Schermerhorn
> +.\"
> +.\" FIXME: Andi / Lee -- can you please resolve the following (mtk):
> +.\"
> +.\" Christoph says the following is untrue. These are "fully supported."
> +.\" Andi concedes that he has lost this battle and approves [?]
> +.\" updating the man pages to document the behavior. -- Lee Schermerhorn
> +.\" This manual page is incomplete:
> +.\" it does not document the details the
> +.\" .BR MPOL_F_NODE
> +.\" flag,
> +.\" which modifies the operation of
> +.\" .BR get_mempolicy ().
> +.\" This is deliberate: this flag is not intended for application use,
> +.\" and its operation may change or it may be removed altogether in
> +.\" future kernel versions.
> +.\" .B Do not use it.
> .SS "Versions and Library Support"
> See
> .BR mbind (2).
> .SH SEE ALSO
> .BR mbind (2),
> +.BR mmap (2),
> .BR set_mempolicy (2),
> -.BR numactl (8),
> -.BR numa (3)
> +.BR numa (3),
> +.BR numactl (8)
>
>
>
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
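For what it's worth, a small sketch of the two MPOL_F_NODE usages being
documented (assumes the prototypes from libnuma's <numaif.h>;
illustrative and untested):

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        int node;
        char *p = malloc(4096);

        p[0] = 0;                       /* make sure a page is allocated */

        /* MPOL_F_NODE | MPOL_F_ADDR: on which node is the page at p? */
        if (get_mempolicy(&node, NULL, 0, p, MPOL_F_NODE | MPOL_F_ADDR) == 0)
                printf("page at %p is on node %d\n", (void *)p, node);

        /* MPOL_F_NODE alone: next interleave node; valid only when the
         * process policy is MPOL_INTERLEAVE, otherwise EINVAL. */
        if (get_mempolicy(&node, NULL, 0, NULL, MPOL_F_NODE) == 0)
                printf("next interleave node: %d\n", node);
        return 0;
}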
* Re: get_mempolicy.2 man page patch
2007-07-28 9:31 ` Michael Kerrisk
@ 2007-08-09 18:43 ` Lee Schermerhorn
2007-08-09 20:57 ` Michael Kerrisk
2007-08-16 20:05 ` Andi Kleen
1 sibling, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-08-09 18:43 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, clameter, akpm, linux-mm
On Sat, 2007-07-28 at 11:31 +0200, Michael Kerrisk wrote:
> Andi, Christoph,
>
> Would one or both of you be willing to review the three man page patches by
>> Lee (mbind.2, set_mempolicy.2, get_mempolicy.2)?
>
> Cheers,
>
> Michael
Michael:
what's the status of these man page updates?
Lee
* Re: get_mempolicy.2 man page patch
2007-08-09 18:43 ` Lee Schermerhorn
@ 2007-08-09 20:57 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-09 20:57 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: ak, clameter, akpm, linux-mm
Lee Schermerhorn wrote:
> On Sat, 2007-07-28 at 11:31 +0200, Michael Kerrisk wrote:
>> Andi, Christoph,
>>
>> Would one or both of you be willing to review the three man page patches by
>> Lee (mbind.2, set_mempolicy.2, get_mempolicy.2)?
>>
>> Cheers,
>>
>> Michael
>
> Michael:
>
> what's the status of these man page updates?
I'm still waiting for review comment from either Andi or Christoph...
Cheers,
Michael
* Re: get_mempolicy.2 man page patch
2007-07-28 9:31 ` Michael Kerrisk
2007-08-09 18:43 ` Lee Schermerhorn
@ 2007-08-16 20:05 ` Andi Kleen
2007-08-18 5:50 ` Michael Kerrisk
2007-08-27 10:46 ` get_mempolicy.2 man page patch Michael Kerrisk
1 sibling, 2 replies; 83+ messages in thread
From: Andi Kleen @ 2007-08-16 20:05 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: ak, clameter, Lee Schermerhorn, akpm, linux-mm
Lee's changes are ok for me.
-Andi
* Re: get_mempolicy.2 man page patch
2007-08-16 20:05 ` Andi Kleen
@ 2007-08-18 5:50 ` Michael Kerrisk
2007-08-21 15:45 ` Lee Schermerhorn
2007-08-27 10:46 ` get_mempolicy.2 man page patch Michael Kerrisk
1 sibling, 1 reply; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-18 5:50 UTC (permalink / raw)
Cc: linux-mm, akpm, Lee.Schermerhorn, clameter, ak
> Lee's changes are ok for me.
>
> -Andi
Thanks Andi.
Lee, for each of the changed pages, could you write me a short summary
of the changes, suitable for inclusion in the change log?
Cheers,
Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages ,
read the HOWTOHELP file and grep the source
files for 'FIXME'.
* Re: get_mempolicy.2 man page patch
2007-08-18 5:50 ` Michael Kerrisk
@ 2007-08-21 15:45 ` Lee Schermerhorn
2007-08-22 4:10 ` Michael Kerrisk
0 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-08-21 15:45 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: Andi Kleen, linux-mm, akpm, clameter
On Sat, 2007-08-18 at 07:50 +0200, Michael Kerrisk wrote:
> > Lee's changes are ok for me.
> >
> > -Andi
>
> Thanks Andi.
>
> Lee, for each of the changed pages, could you write me a short summary
> of the changes, suitable for inclusion in the change log?
Michael:
The terse and generic description re: adding missing semantics and
error returns to match kernel code is not sufficient?
What level of detail would be?
I have rebased the patch against the 2.64 man pages if you'd like me to
send that along. There were a few conflicts, as you or someone had
moved some text around.
Lee
* Re: get_mempolicy.2 man page patch
2007-08-21 15:45 ` Lee Schermerhorn
@ 2007-08-22 4:10 ` Michael Kerrisk
2007-08-22 16:08 ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
` (2 more replies)
0 siblings, 3 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-22 4:10 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, akpm, linux-mm, ak
> > Lee, for each of the changed pages, could you write me a short summary
> > of the changes, suitable for inclusion in the change log?
>
> Michael:
>
> The terse and generic description re: adding missing semantics and
> error returns to match kernel code is not sufficient?
Too terse ;-).
Perhaps you could briefly list which descriptions of semantics
were added?
> What level of detail would be?
>
> I have rebased the patch against the 2.64 man pages if you'd like me to
> send that along. There were a few conflicts, as you or someone had
> moved some text around.
That would be great.
Cheers,
Michael
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance?
Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages ,
read the HOWTOHELP file and grep the source
files for 'FIXME'.
* [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2
2007-08-22 4:10 ` Michael Kerrisk
@ 2007-08-22 16:08 ` Lee Schermerhorn
2007-08-27 11:29 ` Michael Kerrisk
2007-08-22 16:10 ` [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2 Lee Schermerhorn
2007-08-22 16:12 ` [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2 Lee Schermerhorn
2 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-08-22 16:08 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
I've separated the mempolicy man page updates into 3 separate patches,
against the 2.64 man pages. I've added a slightly less terse
description of the changes for the change log.
Here's the first of the 3--mbind.2. I updated the description of the
interaction with MAP_SHARED to the wording you suggested a while back.
---------------------------------
[PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2
Against: man pages 2.64
Changes:
+ changed the "policy" parameter to "mode" throughout the
descriptions in an attempt to promote the concept that the memory
policy is a tuple consisting of a mode and optional set of nodes.
+ rewrite portions of description for clarification.
++ clarify interaction of policy with mmap()'d files and shared
memory regions, including SHM_HUGE regions.
++ defined how "empty set of nodes" specified and what this
means for MPOL_PREFERRED.
++ mention what happens if local/target node contains no
free memory.
++ clarify semantics of multiple nodes to BIND policy.
Note: subject to change. We'll fix the man pages when/if
this happens.
+ added all errors currently returned by sys call.
+ added mmap(2), shmget(2), shmat(2) to See Also list.
man2/mbind.2 | 338 +++++++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 248 insertions(+), 90 deletions(-)
Index: Linux/man2/mbind.2
===================================================================
--- Linux.orig/man2/mbind.2 2007-08-22 11:22:00.000000000 -0400
+++ Linux/man2/mbind.2 2007-08-22 11:56:58.000000000 -0400
@@ -18,15 +18,16 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
-.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual"
+.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
.SH NAME
mbind \- Set memory policy for a memory range
.SH SYNOPSIS
.nf
.B "#include <numaif.h>"
.sp
-.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
+.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
.BI " unsigned long *" nodemask ", unsigned long " maxnode ,
.BI " unsigned " flags );
.sp
@@ -34,76 +35,178 @@ mbind \- Set memory policy for a memory
.fi
.SH DESCRIPTION
.BR mbind ()
-sets the NUMA memory
-.I policy
+sets the NUMA memory policy,
+which consists of a policy mode and zero or more nodes,
for the memory range starting with
.I start
and continuing for
.IR len
bytes.
The memory of a NUMA machine is divided into multiple nodes.
-The memory policy defines in which node memory is allocated.
+The memory policy defines from which node memory is allocated.
+
+If the memory range specified by the
+.IR start " and " len
+arguments includes an "anonymous" region of memory\(emthat is
+a region of memory created using the
+.BR mmap (2)
+system call with the
+.B MAP_ANONYMOUS
+flag\(emor
+a memory mapped file, mapped using the
+.BR mmap (2)
+system call with the
+.B MAP_PRIVATE
+flag, pages will only be allocated according to the specified
+policy when the application writes [stores] to the page.
+For anonymous regions, an initial read access will use a shared
+page in the kernel containing all zeros.
+For a file mapped with
+.BR MAP_PRIVATE ,
+an initial read access will allocate pages according to the
+process policy of the process that causes the page to be allocated.
+This may not be the process that called
+.BR mbind ().
+
+The specified policy will be ignored for any
+.B MAP_SHARED
+mappings in the specified memory range.
+Rather, the pages will be allocated according to the process policy
+of the process that caused the page to be allocated.
+Again, this may not be the process that called
+.BR mbind ().
+
+If the specified memory range includes a shared memory region
+created using the
+.BR shmget (2)
+system call and attached using the
+.BR shmat (2)
+system call,
+pages allocated for the anonymous or shared memory region will
+be allocated according to the policy specified, regardless of which
+process attached to the shared memory segment causes the allocation.
+If, however, the shared memory region was created with the
+.B SHM_HUGETLB
+flag,
+the huge pages will be allocated according to the policy specified
+only if the page allocation is caused by the task that calls
+.BR mbind ()
+for that region.
+
+By default,
.BR mbind ()
only has an effect for new allocations; if the pages inside
the range have been already touched before setting the policy,
then the policy has no effect.
+This default behavior may be overridden by the
+.BR MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+flags described below.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
-and
+.B MPOL_INTERLEAVE
+or
.BR MPOL_PREFERRED .
-All policies except
+All policy modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
.I nodemask
-parameter.
+parameter,
+the node or nodes to which the mode applies.
+
.I nodemask
-is a bit mask of nodes containing up to
+points to a bitmask of nodes containing up to
.I maxnode
bits.
-The actual number of bytes transferred via this argument
-is rounded up to the next multiple of
+The bit mask size is rounded to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
-A NULL argument means an empty set of nodes.
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to use the underlying process policy
-(which can be modified with
-.BR set_mempolicy (2)).
-Unless the process policy has been changed this means to allocate
-memory on the node of the CPU that triggered the allocation.
+mode specifies that the default policy be used.
+When applied to a range of memory via
+.BR mbind (),
+this means to use the process policy,
+which may have been set with
+.BR set_mempolicy (2).
+If the mode of the process policy is also
+.BR MPOL_DEFAULT ,
+the system-wide default policy will be used.
+The system-wide default policy will allocate
+pages on the node of the CPU that triggers the allocation.
+For
+.BR MPOL_DEFAULT ,
+the
.I nodemask
-should be specified as NULL.
+and
+.I maxnode
+arguments must specify the empty set of nodes.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
-nodes specified in
+mode specifies a strict policy that restricts memory allocation to
+the nodes specified in
+.IR nodemask .
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
+Pages will not be allocated from any node not specified in the
.IR nodemask .
-There won't be allocations on other nodes.
+The
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
+mode specifies that page allocations be interleaved across the
+set of nodes specified in
.IR nodemask .
-This optimizes for bandwidth instead of latency.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+at least 1MB or bigger with a fairly uniform access pattern.
+Accesses to a single page of the area will still be limited to
+the memory bandwidth of a single node.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
+The kernel will try to allocate pages from this
node first and fall back to other nodes if the
preferred node is low on free memory.
-Only the first node in the
+If
+.I nodemask
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation).
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation.
+This is the only way to specify "local allocation" for a
+range of memory via
+.BR mbind (2).
If
.B MPOL_MF_STRICT
@@ -115,17 +218,18 @@ is not
.BR MPOL_DEFAULT ,
then the call will fail with the error
.B EIO
-if the existing pages in the mapping don't follow the policy.
-In 2.6.16 or later the kernel will also try to move pages
-to the requested node with this flag.
+if the existing pages in the memory range don't follow the policy.
+.\" According to the kernel code, the following is not true --lts
+.\" In 2.6.16 or later the kernel will also try to move pages
+.\" to the requested node with this flag.
If
.B MPOL_MF_MOVE
-is passed in
+is specified in
.IR flags ,
-then an attempt will be made to
-move all the pages in the mapping so that they follow the policy.
-Pages that are shared with other processes are not moved.
+then the kernel will attempt to move all the existing pages
+in the memory range so that they follow the policy.
+Pages that are shared with other processes will not be moved.
If
.B MPOL_MF_STRICT
is also specified, then the call will fail with the error
@@ -136,8 +240,8 @@ If
.B MPOL_MF_MOVE_ALL
is passed in
.IR flags ,
-then all pages in the mapping will be moved regardless of whether
-other processes use the pages.
+then the kernel will attempt to move all existing pages in the memory range
+regardless of whether other processes use the pages.
The calling process must be privileged
.RB ( CAP_SYS_NICE )
to use this flag.
@@ -146,6 +250,7 @@ If
is also specified, then the call will fail with the error
.B EIO
if some pages could not be moved.
+.\" ---------------------------------------------------------------
.SH RETURN VALUE
On success,
.BR mbind ()
@@ -153,11 +258,9 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
+.\" ---------------------------------------------------------------
.SH ERRORS
-.TP
-.B EFAULT
-There was a unmapped hole in the specified memory range
-or a passed pointer was not valid.
+.\" I think I got all of the error returns. --lts
.TP
.B EINVAL
An invalid value was specified for
@@ -169,55 +272,102 @@ or
was less than
.IR start ;
or
-.I policy
-was
-.B MPOL_DEFAULT
+.I start
+is not a multiple of the system page size.
+Or,
+.I mode
+is
+.B MPOL_DEFAULT
and
.I nodemask
-pointed to a non-empty set;
+specified a non-empty set;
or
-.I policy
-was
-.B MPOL_BIND
+.I mode
+is
+.B MPOL_BIND
or
-.B MPOL_INTERLEAVE
+.B MPOL_INTERLEAVE
and
.I nodemask
-pointed to an empty set,
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+Or, there was an unmapped hole in the specified memory range.
.TP
.B ENOMEM
-System out of memory.
+Insufficient kernel memory was available.
.TP
.B EIO
.B MPOL_MF_STRICT
was specified and an existing page was already on a node
-that does not follow the policy.
-.SH CONFORMING TO
-This system call is Linux specific.
+that does not follow the policy;
+or
+.B MPOL_MF_MOVE
+or
+.B MPOL_MF_MOVE_ALL
+was specified and the kernel was unable to move all existing
+pages in the range.
+.TP
+.B EPERM
+The
+.I flags
+argument included the
+.B MPOL_MF_MOVE_ALL
+flag and the caller does not have the
+.B CAP_SYS_NICE
+privilege.
+.\" ---------------------------------------------------------------
.SH NOTES
-NUMA policy is not supported on file mappings.
+NUMA policy is not supported on a memory mapped file range
+that was mapped with the
+.B MAP_SHARED
+flag.
.B MPOL_MF_STRICT
-is ignored on huge page mappings right now.
+is ignored on huge page mappings.
-It is unfortunate that the same flag,
+The
.BR MPOL_DEFAULT ,
-has different effects for
+mode has different effects for
.BR mbind (2)
and
.BR set_mempolicy (2).
-To select "allocation on the node of the CPU that
-triggered the allocation" (like
-.BR set_mempolicy (2)
-.BR MPOL_DEFAULT )
-when calling
+When
+.B MPOL_DEFAULT
+is specified for a range of memory using
.BR mbind (),
+any pages subsequently allocated for that range will use
+the process's policy, as set by
+.BR set_mempolicy (2).
+This effectively removes the explicit policy from the
+specified range.
+To select "local allocation" for a memory range,
specify a
-.I policy
+.I mode
of
.B MPOL_PREFERRED
-with an empty
-.IR nodemask .
+with an empty set of nodes.
+This method will work for
+.BR set_mempolicy (2),
+as well.
+.\" ---------------------------------------------------------------
.SS "Versions and Library Support"
The
.BR mbind (),
@@ -228,16 +378,18 @@ system calls were added to the Linux ker
They are only available on kernels compiled with
.BR CONFIG_NUMA .
-Support for huge page policy was added with 2.6.16.
-For interleave policy to be effective on huge page mappings the
-policied memory needs to be tens of megabytes or larger.
-
-.B MPOL_MF_MOVE
-and
-.B MPOL_MF_MOVE_ALL
-are only available on Linux 2.6.16 and later.
+You can link with
+.I \-lnuma
+to get system call definitions.
+.I libnuma
+and the required
+.I numaif.h
+header
+are available in the
+.I numactl
+package.
-These system calls should not be used directly.
+However, applications should not use these system calls directly.
Instead, the higher level interface provided by the
.BR numa (3)
functions in the
@@ -247,20 +399,26 @@ The
.I numactl
package is available at
.IR ftp://ftp.suse.com/pub/people/ak/numa/ .
-
-You can link with
-.I \-lnuma
-to get system call definitions.
-.I libnuma
-is available in the
-.I numactl
+The package is also included in some Linux distributions.
+Some distributions include the development library and header
+in the separate
+.I numactl-devel
package.
-This package also has the
-.I numaif.h
-header.
+
+Support for huge page policy was added with 2.6.16.
+For interleave policy to be effective on huge page mappings the
+policied memory needs to be tens of megabytes or larger.
+
+.B MPOL_MF_MOVE
+and
+.B MPOL_MF_MOVE_ALL
+are only available on Linux 2.6.16 and later.
+
.SH SEE ALSO
.BR numa (3),
.BR numactl (8),
.BR set_mempolicy (2),
.BR get_mempolicy (2),
-.BR mmap (2)
+.BR mmap (2),
+.BR shmget (2),
+.BR shmat (2)
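As a quick illustration of the two idioms the page describes, strict
binding and "local allocation" via MPOL_PREFERRED with an empty set of
nodes, here is a hedged sketch (node 0 is an assumption; error handling
omitted; link with -lnuma):

#include <numaif.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 1UL << 20;
        unsigned long node0 = 1UL << 0;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Strict placement: every page of [p, p+len) on node 0 only. */
        mbind(p, len, MPOL_BIND, &node0, 8 * sizeof(node0), MPOL_MF_STRICT);

        /* "Local allocation" for the range: MPOL_PREFERRED with the
         * empty set of nodes (NULL nodemask, maxnode == 0). */
        mbind(p, len, MPOL_PREFERRED, NULL, 0, 0);
        return 0;
}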
* [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2
2007-08-22 4:10 ` Michael Kerrisk
2007-08-22 16:08 ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
@ 2007-08-22 16:10 ` Lee Schermerhorn
2007-08-27 11:30 ` Michael Kerrisk
2007-08-22 16:12 ` [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2 Lee Schermerhorn
2 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-08-22 16:10 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
[PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2
Against: man pages 2.64
Changes:
+ changed the "policy" parameter to "mode" throughout the
descriptions in an attempt to promote the concept that the memory
policy is a tuple consisting of a mode and optional set of nodes.
+ added requirement to link '-lnuma' to synopsis
+ rewrite portions of description for clarification.
++ clarify interaction of policy with mmap()'d files.
++ defined how "empty set of nodes" specified and what this
means for MPOL_PREFERRED.
++ mention what happens if local/target node contains no
free memory.
++ clarify semantics of multiple nodes to BIND policy.
Note: subject to change. We'll fix the man pages when/if
this happens.
+ added all errors currently returned by sys call.
+ added mmap(2) to See Also list.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Index: Linux/man2/set_mempolicy.2
===================================================================
--- Linux.orig/man2/set_mempolicy.2 2007-06-13 17:48:16.000000000 -0400
+++ Linux/man2/set_mempolicy.2 2007-08-10 12:30:14.000000000 -0400
@@ -18,6 +18,7 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
.TH SET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
.SH NAME
@@ -26,80 +27,141 @@ set_mempolicy \- set default NUMA memory
.nf
.B "#include <numaif.h>"
.sp
-.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
+.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode );
+.sp
+.BI "cc ... \-lnuma"
.fi
.SH DESCRIPTION
.BR set_mempolicy ()
-sets the NUMA memory policy of the calling process to
-.IR policy .
+sets the NUMA memory policy of the calling process,
+which consists of a policy mode and zero or more nodes,
+to the values specified by the
+.IR mode ,
+.I nodemask
+and
+.IR maxnode
+arguments.
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
the process.
-This system call defines the default policy for the process;
-in addition a policy can be set for specific memory ranges using
+This system call defines the default policy for the process.
+The process policy governs allocation of pages in the process's
+address space outside of memory ranges
+controlled by a more specific policy set by
.BR mbind (2).
+The process default policy also controls allocation of any pages for
+memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag and that are only read [loaded] from by the task
+and of memory mapped files mapped using the
+.BR mmap (2)
+call with the
+.B MAP_SHARED
+flag, regardless of the access type.
The policy is only applied when a new page is allocated
for the process.
For anonymous memory this is when the page is first
touched by the application.
-Available policies are
+The
+.I mode
+argument must specify one of
.BR MPOL_DEFAULT ,
.BR MPOL_BIND ,
-.BR MPOL_INTERLEAVE ,
+.B MPOL_INTERLEAVE
+or
.BR MPOL_PREFERRED .
-All policies except
+All modes except
.B MPOL_DEFAULT
-require the caller to specify the nodes to which the policy applies in the
+require the caller to specify via the
.I nodemask
-parameter.
+parameter
+one or more nodes.
+
.I nodemask
-is pointer to a bit field of nodes that contains up to
+points to a bit mask of node ids that contains up to
.I maxnode
bits.
-The bit field size is rounded to the next multiple of
+The bit mask size is rounded to the next multiple of
.IR "sizeof(unsigned long)" ,
but the kernel will only use bits up to
.IR maxnode .
+A NULL value of
+.I nodemask
+or a
+.I maxnode
+value of zero specifies the empty set of nodes.
+If the value of
+.I maxnode
+is zero,
+the
+.I nodemask
+argument is ignored.
The
.B MPOL_DEFAULT
-policy is the default and means to allocate memory locally,
+mode is the default and means to allocate memory locally,
i.e., on the node of the CPU that triggered the allocation.
.I nodemask
-should be specified as NULL.
+must be specified as NULL.
+If the "local node" contains no free memory, the system will
+attempt to allocate memory from a "nearby" node.
The
.B MPOL_BIND
-policy is a strict policy that restricts memory allocation to the
+mode defines a strict policy that restricts memory allocation to the
nodes specified in
.IR nodemask .
-There won't be allocations on other nodes.
+If
+.I nodemask
+specifies more than one node, page allocations will come from
+the node with the lowest numeric node id first, until that node
+contains no free memory.
+Allocations will then come from the node with the next highest
+node id specified in
+.I nodemask
+and so forth, until none of the specified nodes contain free memory.
+Pages will not be allocated from any node not specified in the
+.IR nodemask .
.B MPOL_INTERLEAVE
-interleaves allocations to the nodes specified in
-.IR nodemask .
-This optimizes for bandwidth instead of latency.
-To be effective the memory area should be fairly large,
-at least 1MB or bigger.
+interleaves page allocations across the nodes specified in
+.I nodemask
+in numeric node id order.
+This optimizes for bandwidth instead of latency
+by spreading out pages and memory accesses to those pages across
+multiple nodes.
+However, accesses to a single page will still be limited to
+the memory bandwidth of a single node.
+.\" NOTE: the following sentence doesn't make sense in the context
+.\" of set_mempolicy() -- no memory area specified.
+.\" To be effective the memory area should be fairly large,
+.\" at least 1MB or bigger.
.B MPOL_PREFERRED
sets the preferred node for allocation.
-The kernel will try to allocate in this
-node first and fall back to other nodes if the preferred node is low on free
+The kernel will try to allocate pages from this node first
+and fall back to "nearby" nodes if the preferred node is low on free
memory.
-Only the first node in the
+If
+.I nodemask
+specifies more than one node id, the first node in the
+mask will be selected as the preferred node.
+If the
.I nodemask
-is used.
-If no node is set in the mask, then the memory is allocated on
-the node of the CPU that triggered the allocation allocation (like
+and
+.I maxnode
+arguments specify the empty set, then the memory is allocated on
+the node of the CPU that triggered the allocation (like
.BR MPOL_DEFAULT ).
-The memory policy is preserved across an
+The process memory policy is preserved across an
.BR execve (2),
and is inherited by child processes created using
.BR fork (2)
@@ -112,21 +174,62 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I mode is invalid.
+.SH ERRORS
+.TP
+.B EINVAL
+.I mode
+is invalid.
+Or,
+.I mode
+is
+.B MPOL_DEFAULT
+and
+.I nodemask
+is non-empty,
+or
+.I mode
+is
+.B MPOL_BIND
+or
+.B MPOL_INTERLEAVE
+and
+.I nodemask
+is empty.
+Or,
+.I maxnode
+specifies more than a page worth of bits.
+Or,
+.I nodemask
+specifies one or more node ids that are
+greater than the maximum supported node id,
+or are not allowed in the calling task's context.
+.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
+Or, none of the node ids specified by
+.I nodemask
+are on-line, or none of the specified nodes contain memory.
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
+.TP
+.B ENOMEM
+Insufficient kernel memory was available.
+
.SH CONFORMING TO
This system call is Linux specific.
.SH NOTES
Process policy is not remembered if the page is swapped out.
+When such a page is paged back in, it will use the policy of
+the process or memory range that is in effect at the time the
+page is allocated.
.SS "Versions and Library Support"
See
.BR mbind (2).
.SH SEE ALSO
.BR mbind (2),
+.BR mmap (2),
.BR get_mempolicy (2),
.BR numactl (8),
.BR numa (3)
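A sketch of the interleave semantics described above (the node numbers
and the 4 KiB page size are assumptions; error handling omitted; link
with -lnuma):

#include <numaif.h>
#include <stdlib.h>

int main(void)
{
        unsigned long nodes = (1UL << 0) | (1UL << 1);  /* nodes 0 and 1 */
        size_t sz = 64UL << 20, i;
        char *big;

        /* Interleave all new page allocations across nodes 0 and 1. */
        set_mempolicy(MPOL_INTERLEAVE, &nodes, 8 * sizeof(nodes));

        /* Pages alternate between the two nodes as they are first
         * touched (assumes 4 KiB pages). */
        big = malloc(sz);
        for (i = 0; i < sz; i += 4096)
                big[i] = 0;

        /* Back to the default, i.e. local, allocation policy. */
        set_mempolicy(MPOL_DEFAULT, NULL, 0);
        return 0;
}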
* [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2
2007-08-22 4:10 ` Michael Kerrisk
2007-08-22 16:08 ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
2007-08-22 16:10 ` [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2 Lee Schermerhorn
@ 2007-08-22 16:12 ` Lee Schermerhorn
2007-08-27 11:30 ` Michael Kerrisk
2 siblings, 1 reply; 83+ messages in thread
From: Lee Schermerhorn @ 2007-08-22 16:12 UTC (permalink / raw)
To: Michael Kerrisk; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
[PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2
Against: man pages 2.64
Changes:
+ changed the "policy" parameter to "mode" throughout the
descriptions in an attempt to promote the concept that the memory
policy is a tuple consisting of a mode and optional set of nodes.
+ added requirement to link '-lnuma' to synopsis
+ rewrite portions of description for clarification.
+ added all errors currently returned by sys call.
+ removed cautionary note that use of MPOL_F_NODE|MPOL_F_ADDR
is not supported. This is no longer true.
+ added mmap(2) to See Also list.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Index: Linux/man2/get_mempolicy.2
===================================================================
--- Linux.orig/man2/get_mempolicy.2 2007-06-22 14:25:23.000000000 -0400
+++ Linux/man2/get_mempolicy.2 2007-08-10 12:33:23.000000000 -0400
@@ -18,6 +18,7 @@
.\" the source, must acknowledge the copyright and authors of this work.
.\"
.\" 2006-02-03, mtk, substantial wording changes and other improvements
+.\" 2007-06-01, lts, more precise specification of behavior.
.\"
.TH GET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
.SH NAME
@@ -26,9 +27,11 @@ get_mempolicy \- Retrieve NUMA memory po
.B "#include <numaif.h>"
.nf
.sp
-.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
+.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
.BI " unsigned long " maxnode ", unsigned long " addr ,
.BI " unsigned long " flags );
+.sp
+.BI "cc ... \-lnuma"
.fi
.\" FIXME rewrite this DESCRIPTION. it is confusing.
.SH DESCRIPTION
@@ -39,7 +42,7 @@ depending on the setting of
A NUMA machine has different
memory controllers with different distances to specific CPUs.
-The memory policy defines in which node memory is allocated for
+The memory policy defines from which node memory is allocated for
the process.
If
@@ -58,58 +61,75 @@ then information is returned about the p
address given in
.IR addr .
This policy may be different from the process's default policy if
-.BR set_mempolicy (2)
-has been used to establish a policy for the page containing
+.BR mbind (2)
+or one of the helper functions described in
+.BR numa (3)
+has been used to establish a policy for the memory range containing
.IR addr .
-If
-.I policy
-is not NULL, then it is used to return the policy.
+If the
+.I mode
+argument is not NULL, then
+.BR get_mempolicy ()
+will store the policy mode of the requested NUMA policy in the location
+pointed to by this argument.
If
.IR nodemask
-is not NULL, then it is used to return the nodemask associated
-with the policy.
+is not NULL, then the nodemask associated with the policy will be stored
+in the location pointed to by this argument.
.I maxnode
-is the maximum bit number plus one that can be stored into
-.IR nodemask .
-The bit number is always rounded to a multiple of
-.IR "unsigned long" .
-.\"
-.\" If
-.\" .I flags
-.\" specifies both
-.\" .B MPOL_F_NODE
-.\" and
-.\" .BR MPOL_F_ADDR ,
-.\" then
-.\" .I policy
-.\" instead returns the number of the node on which the address
-.\" .I addr
-.\" is allocated.
-.\"
-.\" If
-.\" .I flags
-.\" specifies
-.\" .B MPOL_F_NODE
-.\" but not
-.\" .BR MPOL_F_ADDR ,
-.\" and the process's current policy is
-.\" .BR MPOL_INTERLEAVE ,
-.\" then
-.\" checkme: Andi's text below says that the info is returned in
-.\" 'nodemask', not 'policy':
-.\" .I policy
-.\" instead returns the number of the next node that will be used for
-.\" interleaving allocation.
-.\" FIXME .
-.\" The other valid flag is
-.\" .I MPOL_F_NODE.
-.\" It is only valid when the policy is
-.\" .I MPOL_INTERLEAVE.
-.\" In this case not the interleave mask, but an unsigned long with the next
-.\" node that would be used for interleaving is returned in
-.\" .I nodemask.
-.\" Other flag values are reserved.
+specifies the number of node ids
+that can be stored into
+.IR nodemask \(emthat
+is, the maximum node id plus one.
+The value specified by
+.I maxnode
+is always rounded to a multiple of
+.IR "sizeof(unsigned long)" .
+
+If
+.I flags
+specifies both
+.B MPOL_F_NODE
+and
+.BR MPOL_F_ADDR ,
+.BR get_mempolicy ()
+will return the node id of the node on which the address
+.I addr
+is allocated into the location pointed to by
+.IR mode .
+If no page has yet been allocated for the specified address,
+.BR get_mempolicy ()
+will allocate a page as if the process had performed a read
+[load] access to that address, and return the id of the node
+where that page was allocated.
+
+If
+.I flags
+specifies
+.BR MPOL_F_NODE ,
+but not
+.BR MPOL_F_ADDR ,
+and the process's current policy is
+.BR MPOL_INTERLEAVE ,
+then
+.BR get_mempolicy ()
+will return in the location pointed to by a non-NULL
+.I mode
+argument,
+the node id of the next node that will be used for
+interleaving of internal kernel pages allocated on behalf of the process.
+.\" Note: code returns next interleave node via 'mode' argument -lts
+These allocations include pages for memory mapped files in
+process memory ranges mapped using the
+.BR mmap (2)
+call with the
+.B MAP_PRIVATE
+flag for read accesses, and in memory ranges mapped with the
+.B MAP_SHARED
+flag for all accesses.
+
+Other flag values are reserved.
For an overview of the possible policies see
.BR set_mempolicy (2).
@@ -120,49 +140,84 @@ returns 0;
on error, \-1 is returned and
.I errno
is set to indicate the error.
-.\" .SH ERRORS
-.\" FIXME -- no errors are listed on this page
-.\" .
-.\" .TP
-.\" .B EINVAL
-.\" .I nodemask
-.\" is non-NULL, and
-.\" .I maxnode
-.\" is too small;
-.\" or
-.\" .I flags
-.\" specified values other than
-.\" .B MPOL_F_NODE
-.\" or
-.\" .BR MPOL_F_ADDR ;
-.\" or
-.\" .I flags
-.\" specified
-.\" .B MPOL_F_ADDR
-.\" and
-.\" .I addr
-.\" is NULL.
-.\" (And there are other
-.\" .B EINVAL
-.\" cases.)
-.SH CONFORMING TO
-This system call is Linux specific.
+.SH ERRORS
+.TP
+.B EINVAL
+The value specified by
+.I maxnode
+is less than the number of node ids supported by the system.
+Or
+.I flags
+specified values other than
+.B MPOL_F_NODE
+or
+.BR MPOL_F_ADDR ;
+or
+.I flags
+specified
+.B MPOL_F_ADDR
+and
+.I addr
+is NULL,
+or
+.I flags
+did not specify
+.B MPOL_F_ADDR
+and
+.I addr
+is not NULL.
+Or,
+.I flags
+specified
+.B MPOL_F_NODE
+but not
+.B MPOL_F_ADDR
+and the current process policy is not
+.BR MPOL_INTERLEAVE .
+(And there are other EINVAL cases.)
+.TP
+.B EFAULT
+Part or all of the memory range specified by
+.I nodemask
+and
+.I maxnode
+points outside your accessible address space.
.SH NOTES
-This manual page is incomplete:
-it does not document the details the
-.BR MPOL_F_NODE
-flag,
-which modifies the operation of
-.BR get_mempolicy ().
-This is deliberate: this flag is not intended for application use,
-and its operation may change or it may be removed altogether in
-future kernel versions.
-.B Do not use it.
+If the mode of the process policy or the policy governing allocations at the
+specified address is
+.I MPOL_PREFERRED
+and this policy was installed with an empty
+.IR nodemask \(emspecifying
+local allocation,
+.IR get_mempolicy ()
+will return the mask of on-line node ids in the location pointed to by
+a non-NULL
+.I nodemask
+argument.
+This mask does not take into consideration any administratively imposed
+restrictions on the process' context.
+.\" FIXME:
+.\" "context" above refers to cpusets. No man page to reference. --lts
+
+.\" Christoph says the following is untrue. These are "fully supported."
+.\" Andi concedes that he has lost this battle and approves [?]
+.\" updating the man pages to document the behavior. --lts
+.\" This manual page is incomplete:
+.\" it does not document the details the
+.\" .BR MPOL_F_NODE
+.\" flag,
+.\" which modifies the operation of
+.\" .BR get_mempolicy ().
+.\" This is deliberate: this flag is not intended for application use,
+.\" and its operation may change or it may be removed altogether in
+.\" future kernel versions.
+.\" .B Do not use it.
.SS "Versions and Library Support"
See
.BR mbind (2).
.SH SEE ALSO
.BR mbind (2),
+.BR mmap (2),
.BR set_mempolicy (2),
.BR numactl (8),
.BR numa (3)
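
To make the MPOL_F_NODE|MPOL_F_ADDR behavior documented above concrete,
here is a minimal C sketch; it is an illustration only, not part of the
patch, and it assumes a CONFIG_NUMA kernel with the numaif.h header from
the numactl package (build with -lnuma):

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    char *buf;

    /* Obtain one page-aligned page and fault it in with a store. */
    if (posix_memalign((void **)&buf, pagesize, pagesize) != 0) {
        perror("posix_memalign");
        exit(EXIT_FAILURE);
    }
    memset(buf, 1, pagesize);

    /*
     * With MPOL_F_NODE|MPOL_F_ADDR, the node id is returned via the
     * 'mode' argument; 'nodemask' may be NULL here, in which case
     * maxnode is not consulted.
     */
    int node;
    if (get_mempolicy(&node, NULL, 0, buf,
                      MPOL_F_NODE | MPOL_F_ADDR) == -1) {
        perror("get_mempolicy");
        exit(EXIT_FAILURE);
    }
    printf("page at %p resides on node %d\n", (void *)buf, node);
    return 0;
}

As the text above notes, if no page has yet been allocated at the
address, the call itself will fault one in as if by a read [load] access.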
* Re: get_mempolicy.2 man page patch
2007-08-16 20:05 ` Andi Kleen
2007-08-18 5:50 ` Michael Kerrisk
@ 2007-08-27 10:46 ` Michael Kerrisk
1 sibling, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-27 10:46 UTC (permalink / raw)
To: Andi Kleen; +Cc: clameter, Lee Schermerhorn, akpm, linux-mm
Thanks Andi!
Andi Kleen wrote:
> Lee's changes are ok for me.
>
> -Andi
>
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
* Re: [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2
2007-08-22 16:08 ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
@ 2007-08-27 11:29 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-27 11:29 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
Lee,
Thanks for the tremendous work on these pages!
I've applied the patch below, but added the following copyright line to this (and also to the other two pages):
.\" and Copyright 2007 Lee Schermerhorn, Hewlett Packard
Let me know if this is okay or should be changed in any way.
This patch (and the others) applied for man-pages-2.65.
Cheers,
Michael
Lee Schermerhorn wrote:
> Michael:
>
> I've separated the mempolicy man page updates into 3 separate patches,
> against the 2.64 man pages. I've added a slightly less terse
> description of the changes for the change log.
>
> Here's the first of the 3--mbind.2. I updated the description of the
> interaction with MAP_SHARED to the wording you suggested a while back.
>
> ---------------------------------
>
> [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2
>
> Against: man pages 2.64
>
> Changes:
>
> + changed the "policy" parameter to "mode" through out the
> descriptions in an attempt to promote the concept that the memory
> policy is a tuple consisting of a mode and optional set of nodes.
>
> + rewrite portions of description for clarification.
>
> ++ clarify interaction of policy with mmap()'d files and shared
> memory regions, including SHM_HUGE regions.
>
> ++ defined how "empty set of nodes" specified and what this
> means for MPOL_PREFERRED.
>
> ++ mention what happens if local/target node contains no
> free memory.
>
> ++ clarify semantics of multiple nodes to BIND policy.
> Note: subject to change. We'll fix the man pages when/if
> this happens.
>
> + added all errors currently returned by sys call.
>
> + added mmap(2), shmget(2), shmat(2) to See Also list.
>
>
>
> man2/mbind.2 | 338 +++++++++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 248 insertions(+), 90 deletions(-)
>
> Index: Linux/man2/mbind.2
> ===================================================================
> --- Linux.orig/man2/mbind.2 2007-08-22 11:22:00.000000000 -0400
> +++ Linux/man2/mbind.2 2007-08-22 11:56:58.000000000 -0400
> @@ -18,15 +18,16 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> -.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual"
> +.TH MBIND 2 "2007-06-01" "SuSE Labs" "Linux Programmer's Manual"
> .SH NAME
> mbind \- Set memory policy for a memory range
> .SH SYNOPSIS
> .nf
> .B "#include <numaif.h>"
> .sp
> -.BI "int mbind(void *" start ", unsigned long " len ", int " policy ,
> +.BI "int mbind(void *" start ", unsigned long " len ", int " mode ,
> .BI " unsigned long *" nodemask ", unsigned long " maxnode ,
> .BI " unsigned " flags );
> .sp
> @@ -34,76 +35,178 @@ mbind \- Set memory policy for a memory
> .fi
> .SH DESCRIPTION
> .BR mbind ()
> -sets the NUMA memory
> -.I policy
> +sets the NUMA memory policy,
> +which consists of a policy mode and zero or more nodes,
> for the memory range starting with
> .I start
> and continuing for
> .IR len
> bytes.
> The memory of a NUMA machine is divided into multiple nodes.
> -The memory policy defines in which node memory is allocated.
> +The memory policy defines from which node memory is allocated.
> +
> +If the memory range specified by the
> +.IR start " and " len
> +arguments includes an "anonymous" region of memory\(emthat is,
> +a region of memory created using the
> +.BR mmap (2)
> +system call with the
> +.BR MAP_ANONYMOUS " flag\(emor"
> +a memory mapped file, mapped using the
> +.BR mmap (2)
> +system call with the
> +.B MAP_PRIVATE
> +flag, pages will only be allocated according to the specified
> +policy when the application writes [stores] to the page.
> +For anonymous regions, an initial read access will use a shared
> +page in the kernel containing all zeros.
> +For a file mapped with
> +.BR MAP_PRIVATE ,
> +an initial read access will allocate pages according to the
> +process policy of the process that causes the page to be allocated.
> +This may not be the process that called
> +.BR mbind ().
> +
> +The specified policy will be ignored for any
> +.B MAP_SHARED
> +mappings in the specified memory range.
> +Rather, the pages will be allocated according to the process policy
> +of the process that caused the page to be allocated.
> +Again, this may not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a shared memory region
> +created using the
> +.BR shmget (2)
> +system call and attached using the
> +.BR shmat (2)
> +system call,
> +pages allocated for the anonymous or shared memory region will
> +be allocated according to the policy specified, regardless of which
> +process attached to the shared memory segment causes the allocation.
> +If, however, the shared memory region was created with the
> +.B SHM_HUGETLB
> +flag,
> +the huge pages will be allocated according to the policy specified
> +only if the page allocation is caused by the task that calls
> +.BR mbind ()
> +for that region.
> +
> +By default,
> .BR mbind ()
> only has an effect for new allocations; if the pages inside
> the range have been already touched before setting the policy,
> then the policy has no effect.
> +This default behavior may be overridden by the
> +.BR MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +flags described below.
>
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
> .BR MPOL_DEFAULT ,
> .BR MPOL_BIND ,
> -.BR MPOL_INTERLEAVE ,
> -and
> +.B MPOL_INTERLEAVE
> +or
> .BR MPOL_PREFERRED .
> -All policies except
> +All policy modes except
> .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify via the
> .I nodemask
> -parameter.
> +parameter,
> +the node or nodes to which the mode applies.
> +
> .I nodemask
> -is a bit mask of nodes containing up to
> +points to a bitmask of nodes containing up to
> .I maxnode
> bits.
> -The actual number of bytes transferred via this argument
> -is rounded up to the next multiple of
> +The bit mask size is rounded to the next multiple of
> .IR "sizeof(unsigned long)" ,
> but the kernel will only use bits up to
> .IR maxnode .
> -A NULL argument means an empty set of nodes.
> +A NULL value of
> +.I nodemask
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero,
> +the
> +.I nodemask
> +argument is ignored.
>
> The
> .B MPOL_DEFAULT
> -policy is the default and means to use the underlying process policy
> -(which can be modified with
> -.BR set_mempolicy (2)).
> -Unless the process policy has been changed this means to allocate
> -memory on the node of the CPU that triggered the allocation.
> +mode specifies that the default policy be used.
> +When applied to a range of memory via
> +.IR mbind (),
> +this means to use the process policy,
> +which may have been set with
> +.BR set_mempolicy (2).
> +If the mode of the process policy is also
> +.BR MPOL_DEFAULT ,
> +the system-wide default policy will be used.
> +The system-wide default policy will allocate
> +pages on the node of the CPU that triggers the allocation.
> +For
> +.BR MPOL_DEFAULT ,
> +the
> .I nodemask
> -should be specified as NULL.
> +and
> +.I maxnode
> +arguments must specify the empty set of nodes.
>
> The
> .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> -nodes specified in
> +mode specifies a strict policy that restricts memory allocation to
> +the nodes specified in
> +.IR nodemask .
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node id first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node id specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contain free memory.
> +Pages will not be allocated from any node not specified in the
> .IR nodemask .
> -There won't be allocations on other nodes.
>
> +The
> .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> +mode specifies that page allocations be interleaved across the
> +set of nodes specified in
> .IR nodemask .
> -This optimizes for bandwidth instead of latency.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
> To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +at least 1MB or bigger with a fairly uniform access pattern.
> +Accesses to a single page of the area will still be limited to
> +the memory bandwidth of a single node.
>
> .B MPOL_PREFERRED
> sets the preferred node for allocation.
> -The kernel will try to allocate in this
> +The kernel will try to allocate pages from this
> node first and fall back to other nodes if the
> preferred nodes is low on free memory.
> -Only the first node in the
> +If
> +.I nodemask
> +specifies more than one node id, the first node in the
> +mask will be selected as the preferred node.
> +If the
> .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation).
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation.
> +This is the only way to specify "local allocation" for a
> +range of memory via
> +.IR mbind (2).
>
> If
> .B MPOL_MF_STRICT
> @@ -115,17 +218,18 @@ is not
> .BR MPOL_DEFAULT ,
> then the call will fail with the error
> .B EIO
> -if the existing pages in the mapping don't follow the policy.
> -In 2.6.16 or later the kernel will also try to move pages
> -to the requested node with this flag.
> +if the existing pages in the memory range don't follow the policy.
> +.\" According to the kernel code, the following is not true --lts
> +.\" In 2.6.16 or later the kernel will also try to move pages
> +.\" to the requested node with this flag.
>
> If
> .B MPOL_MF_MOVE
> -is passed in
> +is specified in
> .IR flags ,
> -then an attempt will be made to
> -move all the pages in the mapping so that they follow the policy.
> -Pages that are shared with other processes are not moved.
> +then the kernel will attempt to move all the existing pages
> +in the memory range so that they follow the policy.
> +Pages that are shared with other processes will not be moved.
> If
> .B MPOL_MF_STRICT
> is also specified, then the call will fail with the error
> @@ -136,8 +240,8 @@ If
> .B MPOL_MF_MOVE_ALL
> is passed in
> .IR flags ,
> -then all pages in the mapping will be moved regardless of whether
> -other processes use the pages.
> +then the kernel will attempt to move all existing pages in the memory range
> +regardless of whether other processes use the pages.
> The calling process must be privileged
> .RB ( CAP_SYS_NICE )
> to use this flag.
> @@ -146,6 +250,7 @@ If
> is also specified, then the call will fail with the error
> .B EIO
> if some pages could not be moved.
> +.\" ---------------------------------------------------------------
> .SH RETURN VALUE
> On success,
> .BR mbind ()
> @@ -153,11 +258,9 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> +.\" ---------------------------------------------------------------
> .SH ERRORS
> -.TP
> -.B EFAULT
> -There was a unmapped hole in the specified memory range
> -or a passed pointer was not valid.
> +.\" I think I got all of the error returns. --lts
> .TP
> .B EINVAL
> An invalid value was specified for
> @@ -169,55 +272,102 @@ or
> was less than
> .IR start ;
> or
> -.I policy
> -was
> -.B MPOL_DEFAULT
> +.I start
> +is not a multiple of the system page size.
> +Or,
> +.I mode
> +is
> +.I MPOL_DEFAULT
> and
> .I nodemask
> -pointed to a non-empty set;
> +specified a non-empty set;
> or
> -.I policy
> -was
> -.B MPOL_BIND
> +.I mode
> +is
> +.I MPOL_BIND
> or
> -.B MPOL_INTERLEAVE
> +.I MPOL_INTERLEAVE
> and
> .I nodemask
> -pointed to an empty set,
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node ids that are
> +greater than the maximum supported node id,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
> +Or, none of the node ids specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +Or, there was an unmapped hole in the specified memory range.
> .TP
> .B ENOMEM
> -System out of memory.
> +Insufficient kernel memory was available.
> .TP
> .B EIO
> .B MPOL_MF_STRICT
> was specified and an existing page was already on a node
> -that does not follow the policy.
> -.SH CONFORMING TO
> -This system call is Linux specific.
> +that does not follow the policy;
> +or
> +.B MPOL_MF_MOVE
> +or
> +.B MPOL_MF_MOVE_ALL
> +was specified and the kernel was unable to move all existing
> +pages in the range.
> +.TP
> +.B EPERM
> +The
> +.I flags
> +argument included the
> +.B MPOL_MF_MOVE_ALL
> +flag and the caller does not have the
> +.B CAP_SYS_NICE
> +privilege.
> +.\" ---------------------------------------------------------------
> .SH NOTES
> -NUMA policy is not supported on file mappings.
> +NUMA policy is not supported on a memory mapped file range
> +that was mapped with the
> +.I MAP_SHARED
> +flag.
>
> .B MPOL_MF_STRICT
> -is ignored on huge page mappings right now.
> +is ignored on huge page mappings.
>
> -It is unfortunate that the same flag,
> +The
> .BR MPOL_DEFAULT ,
> -has different effects for
> +mode has different effects for
> .BR mbind (2)
> and
> .BR set_mempolicy (2).
> -To select "allocation on the node of the CPU that
> -triggered the allocation" (like
> -.BR set_mempolicy (2)
> -.BR MPOL_DEFAULT )
> -when calling
> +When
> +.B MPOL_DEFAULT
> +is specified for a range of memory using
> .BR mbind (),
> +any pages subsequently allocated for that range will use
> +the process' policy, as set by
> +.BR set_mempolicy (2).
> +This effectively removes the explicit policy from the
> +specified range.
> +To select "local allocation" for a memory range,
> specify a
> -.I policy
> +.I mode
> of
> .B MPOL_PREFERRED
> -with an empty
> -.IR nodemask .
> +with an empty set of nodes.
> +This method will work for
> +.BR set_mempolicy (2),
> +as well.
> +.\" ---------------------------------------------------------------
> .SS "Versions and Library Support"
> The
> .BR mbind (),
> @@ -228,16 +378,18 @@ system calls were added to the Linux ker
> They are only available on kernels compiled with
> .BR CONFIG_NUMA .
>
> -Support for huge page policy was added with 2.6.16.
> -For interleave policy to be effective on huge page mappings the
> -policied memory needs to be tens of megabytes or larger.
> -
> -.B MPOL_MF_MOVE
> -and
> -.B MPOL_MF_MOVE_ALL
> -are only available on Linux 2.6.16 and later.
> +You can link with
> +.I \-lnuma
> +to get system call definitions.
> +.I libnuma
> +and the required
> +.I numaif.h
> +header
> +are available in the
> +.I numactl
> +package.
>
> -These system calls should not be used directly.
> +However, applications should not use these system calls directly.
> Instead, the higher level interface provided by the
> .BR numa (3)
> functions in the
> @@ -247,20 +399,26 @@ The
> .I numactl
> package is available at
> .IR ftp://ftp.suse.com/pub/people/ak/numa/ .
> -
> -You can link with
> -.I \-lnuma
> -to get system call definitions.
> -.I libnuma
> -is available in the
> -.I numactl
> +The package is also included in some Linux distributions.
> +Some distributions include the development library and header
> +in the separate
> +.I numactl-devel
> package.
> -This package also has the
> -.I numaif.h
> -header.
> +
> +Support for huge page policy was added with 2.6.16.
> +For interleave policy to be effective on huge page mappings the
> +policied memory needs to be tens of megabytes or larger.
> +
> +.B MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +are only available on Linux 2.6.16 and later.
> +
> .SH SEE ALSO
> .BR numa (3),
> .BR numactl (8),
> .BR set_mempolicy (2),
> .BR get_mempolicy (2),
> -.BR mmap (2)
> +.BR mmap (2),
> +.BR shmget (2),
> +.BR shmat (2).
>
>
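
For the mbind() semantics documented in this patch, a short usage sketch
may help. It is illustrative only: it assumes a machine with at least two
on-line nodes (0 and 1), plus numaif.h and -lnuma from the numactl
development package. It interleaves an anonymous MAP_PRIVATE region
before first touch, since by default the policy only governs pages not
yet allocated:

#define _GNU_SOURCE
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024;  /* a fairly large area, per the text */
    long pagesize = sysconf(_SC_PAGESIZE);

    void *area = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED) {
        perror("mmap");
        exit(EXIT_FAILURE);
    }

    /*
     * Interleave across nodes 0 and 1.  A generous maxnode (all bits
     * of one long) is a defensive choice; kernels have differed on
     * whether the limit is counted inclusively.
     */
    unsigned long nodemask = (1UL << 0) | (1UL << 1);
    if (mbind(area, len, MPOL_INTERLEAVE, &nodemask,
              8 * sizeof(nodemask), 0) == -1) {
        perror("mbind");
        exit(EXIT_FAILURE);
    }

    /* Pages are now placed round-robin across the two nodes as they
     * are first written [stored to]. */
    for (size_t i = 0; i < len; i += (size_t)pagesize)
        ((char *)area)[i] = 1;

    return 0;
}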
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
* Re: [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2
2007-08-22 16:10 ` [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2 Lee Schermerhorn
@ 2007-08-27 11:30 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-27 11:30 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
Applied for man-pages-2.65.
Thanks Lee!
Cheers,
Michael
Lee Schermerhorn wrote:
> [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2
>
> Against: man pages 2.64
>
> Changes:
>
> + changed the "policy" parameter to "mode" through out the
> descriptions in an attempt to promote the concept that the memory
> policy is a tuple consisting of a mode and optional set of nodes.
>
> + added requirement to link '-lnuma' to synopsis
>
> + rewrite portions of description for clarification.
>
> ++ clarify interaction of policy with mmap()'d files.
>
> ++ defined how "empty set of nodes" specified and what this
> means for MPOL_PREFERRED.
>
> ++ mention what happens if local/target node contains no
> free memory.
>
> ++ clarify semantics of multiple nodes to BIND policy.
> Note: subject to change. We'll fix the man pages when/if
> this happens.
>
> + added all errors currently returned by sys call.
>
> + added mmap(2) to See Also list.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> Index: Linux/man2/set_mempolicy.2
> ===================================================================
> --- Linux.orig/man2/set_mempolicy.2 2007-06-13 17:48:16.000000000 -0400
> +++ Linux/man2/set_mempolicy.2 2007-08-10 12:30:14.000000000 -0400
> @@ -18,6 +18,7 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> .TH SET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
> .SH NAME
> @@ -26,80 +27,141 @@ set_mempolicy \- set default NUMA memory
> .nf
> .B "#include <numaif.h>"
> .sp
> -.BI "int set_mempolicy(int " policy ", unsigned long *" nodemask ,
> +.BI "int set_mempolicy(int " mode ", unsigned long *" nodemask ,
> .BI " unsigned long " maxnode );
> +.sp
> +.BI "cc ... \-lnuma"
> .fi
> .SH DESCRIPTION
> .BR set_mempolicy ()
> -sets the NUMA memory policy of the calling process to
> -.IR policy .
> +sets the NUMA memory policy of the calling process,
> +which consists of a policy mode and zero or more nodes,
> +to the values specified by the
> +.IR mode ,
> +.I nodemask
> +and
> +.IR maxnode
> +arguments.
>
> A NUMA machine has different
> memory controllers with different distances to specific CPUs.
> -The memory policy defines in which node memory is allocated for
> +The memory policy defines from which node memory is allocated for
> the process.
>
> -This system call defines the default policy for the process;
> -in addition a policy can be set for specific memory ranges using
> +This system call defines the default policy for the process.
> +The process policy governs allocation of pages in the process'
> +address space outside of memory ranges
> +controlled by a more specific policy set by
> .BR mbind (2).
> +The process default policy also controls allocation of any pages for
> +memory mapped files mapped using the
> +.BR mmap (2)
> +call with the
> +.B MAP_PRIVATE
> +flag and that are only read [loaded] from by the task
> +and of memory mapped files mapped using the
> +.BR mmap (2)
> +call with the
> +.B MAP_SHARED
> +flag, regardless of the access type.
> The policy is only applied when a new page is allocated
> for the process.
> For anonymous memory this is when the page is first
> touched by the application.
>
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
> .BR MPOL_DEFAULT ,
> .BR MPOL_BIND ,
> -.BR MPOL_INTERLEAVE ,
> +.B MPOL_INTERLEAVE
> +or
> .BR MPOL_PREFERRED .
> -All policies except
> +All modes except
> .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify via the
> .I nodemask
> -parameter.
> +parameter
> +one or more nodes.
> +
> .I nodemask
> -is pointer to a bit field of nodes that contains up to
> +points to a bit mask of node ids that contains up to
> .I maxnode
> bits.
> -The bit field size is rounded to the next multiple of
> +The bit mask size is rounded to the next multiple of
> .IR "sizeof(unsigned long)" ,
> but the kernel will only use bits up to
> .IR maxnode .
> +A NULL value of
> +.I nodemask
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero,
> +the
> +.I nodemask
> +argument is ignored.
>
> The
> .B MPOL_DEFAULT
> -policy is the default and means to allocate memory locally,
> +mode is the default and means to allocate memory locally,
> i.e., on the node of the CPU that triggered the allocation.
> .I nodemask
> -should be specified as NULL.
> +must be specified as NULL.
> +If the "local node" contains no free memory, the system will
> +attempt to allocate memory from a "nearby" node.
>
> The
> .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> +mode defines a strict policy that restricts memory allocation to the
> nodes specified in
> .IR nodemask .
> -There won't be allocations on other nodes.
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node id first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node id specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contain free memory.
> +Pages will not be allocated from any node not specified in the
> +.IR nodemask .
>
> .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> -.IR nodemask .
> -This optimizes for bandwidth instead of latency.
> -To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +interleaves page allocations across the nodes specified in
> +.I nodemask
> +in numeric node id order.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
> +However, accesses to a single page will still be limited to
> +the memory bandwidth of a single node.
> +.\" NOTE: the following sentence doesn't make sense in the context
> +.\" of set_mempolicy() -- no memory area specified.
> +.\" To be effective the memory area should be fairly large,
> +.\" at least 1MB or bigger.
>
> .B MPOL_PREFERRED
> sets the preferred node for allocation.
> -The kernel will try to allocate in this
> -node first and fall back to other nodes if the preferred node is low on free
> +The kernel will try to allocate pages from this node first
> +and fall back to "near by" nodes if the preferred node is low on free
> memory.
> -Only the first node in the
> +If
> +.I nodemask
> +specifies more than one node id, the first node in the
> +mask will be selected as the preferred node.
> +If the
> .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation (like
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation (like
> .BR MPOL_DEFAULT ).
>
> -The memory policy is preserved across an
> +The process memory policy is preserved across an
> .BR execve (2),
> and is inherited by child processes created using
> .BR fork (2)
> @@ -112,21 +174,62 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> -.\" .SH ERRORS
> -.\" FIXME no errors are listed on this page
> -.\" .
> -.\" .TP
> -.\" .B EINVAL
> -.\" .I mode is invalid.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +.I mode
> +is invalid.
> +Or,
> +.I mode
> +is
> +.I MPOL_DEFAULT
> +and
> +.I nodemask
> +is non-empty,
> +or
> +.I mode
> +is
> +.I MPOL_BIND
> +or
> +.I MPOL_INTERLEAVE
> +and
> +.I nodemask
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node ids that are
> +greater than the maximum supported node id,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets. No man page avail to ref. --lts
> +Or, none of the node ids specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +.TP
> +.B ENOMEM
> +Insufficient kernel memory was available.
> +
> .SH CONFORMING TO
> This system call is Linux specific.
> .SH NOTES
> Process policy is not remembered if the page is swapped out.
> +When such a page is paged back in, it will use the policy of
> +the process or memory range that is in effect at the time the
> +page is allocated.
> .SS "Versions and Library Support"
> See
> .BR mbind (2).
> .SH SEE ALSO
> .BR mbind (2),
> +.BR mmap (2),
> .BR get_mempolicy (2),
> .BR numactl (8),
> .BR numa (3)
>
>
>
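
A corresponding sketch for set_mempolicy(), again an illustration under
the same assumptions (numaif.h and -lnuma from numactl, and a node 0 that
contains memory), binds the process policy to node 0 and then reverts to
the default, which takes the empty set of nodes as described above:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Restrict all further allocations by this process to node 0. */
    unsigned long nodemask = 1UL << 0;
    if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask)) == -1) {
        perror("set_mempolicy(MPOL_BIND)");
        exit(EXIT_FAILURE);
    }

    /* New pages faulted in from here on come from node 0 only. */
    size_t sz = 1024 * 1024;
    char *p = malloc(sz);
    if (p != NULL)
        memset(p, 0, sz);   /* touch the pages under the policy */

    /*
     * Revert to the default policy.  Per the patch above, MPOL_DEFAULT
     * requires the empty node set: a NULL mask and a maxnode of zero.
     */
    if (set_mempolicy(MPOL_DEFAULT, NULL, 0) == -1) {
        perror("set_mempolicy(MPOL_DEFAULT)");
        exit(EXIT_FAILURE);
    }
    free(p);
    return 0;
}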
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
* Re: [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2
2007-08-22 16:12 ` [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2 Lee Schermerhorn
@ 2007-08-27 11:30 ` Michael Kerrisk
0 siblings, 0 replies; 83+ messages in thread
From: Michael Kerrisk @ 2007-08-27 11:30 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: clameter, akpm, linux-mm, ak, Eric Whitney
Applied for man-pages-2.65.
Thanks Lee!
Cheers,
Michael
Lee Schermerhorn wrote:
> [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2
>
> Against: man pages 2.64
>
> Changes:
>
> + changed the "policy" parameter to "mode" through out the
> descriptions in an attempt to promote the concept that the memory
> policy is a tuple consisting of a mode and optional set of nodes.
>
> + added requirement to link '-lnuma' to synopsis
>
> + rewrite portions of description for clarification.
>
> + added all errors currently returned by sys call.
>
> + removed cautionary note that use of MPOL_F_NODE|MPOL_F_ADDR
> is not supported. This is no longer true.
>
> + added mmap(2) to See Also list.
>
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> Index: Linux/man2/get_mempolicy.2
> ===================================================================
> --- Linux.orig/man2/get_mempolicy.2 2007-06-22 14:25:23.000000000 -0400
> +++ Linux/man2/get_mempolicy.2 2007-08-10 12:33:23.000000000 -0400
> @@ -18,6 +18,7 @@
> .\" the source, must acknowledge the copyright and authors of this work.
> .\"
> .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, lts, more precise specification of behavior.
> .\"
> .TH GET_MEMPOLICY 2 2006-02-07 "Linux" "Linux Programmer's Manual"
> .SH NAME
> @@ -26,9 +27,11 @@ get_mempolicy \- Retrieve NUMA memory po
> .B "#include <numaif.h>"
> .nf
> .sp
> -.BI "int get_mempolicy(int *" policy ", unsigned long *" nodemask ,
> +.BI "int get_mempolicy(int *" mode ", unsigned long *" nodemask ,
> .BI " unsigned long " maxnode ", unsigned long " addr ,
> .BI " unsigned long " flags );
> +.sp
> +.BI "cc ... \-lnuma"
> .fi
> .\" FIXME rewrite this DESCRIPTION. it is confusing.
> .SH DESCRIPTION
> @@ -39,7 +42,7 @@ depending on the setting of
>
> A NUMA machine has different
> memory controllers with different distances to specific CPUs.
> -The memory policy defines in which node memory is allocated for
> +The memory policy defines from which node memory is allocated for
> the process.
>
> If
> @@ -58,58 +61,75 @@ then information is returned about the p
> address given in
> .IR addr .
> This policy may be different from the process's default policy if
> -.BR set_mempolicy (2)
> -has been used to establish a policy for the page containing
> +.BR mbind (2)
> +or one of the helper functions described in
> +.BR numa (3)
> +has been used to establish a policy for the memory range containing
> .IR addr .
>
> -If
> -.I policy
> -is not NULL, then it is used to return the policy.
> +If the
> +.I mode
> +argument is not NULL, then
> +.IR get_mempolicy ()
> +will store the policy mode of the requested NUMA policy in the location
> +pointed to by this argument.
> If
> .IR nodemask
> -is not NULL, then it is used to return the nodemask associated
> -with the policy.
> +is not NULL, then the nodemask associated with the policy will be stored
> +in the location pointed to by this argument.
> .I maxnode
> -is the maximum bit number plus one that can be stored into
> -.IR nodemask .
> -The bit number is always rounded to a multiple of
> -.IR "unsigned long" .
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies both
> -.\" .B MPOL_F_NODE
> -.\" and
> -.\" .BR MPOL_F_ADDR ,
> -.\" then
> -.\" .I policy
> -.\" instead returns the number of the node on which the address
> -.\" .I addr
> -.\" is allocated.
> -.\"
> -.\" If
> -.\" .I flags
> -.\" specifies
> -.\" .B MPOL_F_NODE
> -.\" but not
> -.\" .BR MPOL_F_ADDR ,
> -.\" and the process's current policy is
> -.\" .BR MPOL_INTERLEAVE ,
> -.\" then
> -.\" checkme: Andi's text below says that the info is returned in
> -.\" 'nodemask', not 'policy':
> -.\" .I policy
> -.\" instead returns the number of the next node that will be used for
> -.\" interleaving allocation.
> -.\" FIXME .
> -.\" The other valid flag is
> -.\" .I MPOL_F_NODE.
> -.\" It is only valid when the policy is
> -.\" .I MPOL_INTERLEAVE.
> -.\" In this case not the interleave mask, but an unsigned long with the next
> -.\" node that would be used for interleaving is returned in
> -.\" .I nodemask.
> -.\" Other flag values are reserved.
> +specifies the number of node ids
> +that can be stored into
> +.IR nodemask \(emthat
> +is, the maximum node id plus one.
> +The value specified by
> +.I maxnode
> +is always rounded to a multiple of
> +.IR "sizeof(unsigned long)" .
> +
> +If
> +.I flags
> +specifies both
> +.B MPOL_F_NODE
> +and
> +.BR MPOL_F_ADDR ,
> +.IR get_mempolicy ()
> +will return the node id of the node on which the address
> +.I addr
> +is allocated into the location pointed to by
> +.IR mode .
> +If no page has yet been allocated for the specified address,
> +.IR get_mempolicy ()
> +will allocate a page as if the process had performed a read
> +[load] access to that address, and return the id of the node
> +where that page was allocated.
> +
> +If
> +.I flags
> +specifies
> +.BR MPOL_F_NODE ,
> +but not
> +.BR MPOL_F_ADDR ,
> +and the process's current policy is
> +.BR MPOL_INTERLEAVE ,
> +then
> +.IR get_mempolicy ()
> +will return in the location pointed to by a non-NULL
> +.I mode
> +argument,
> +the node id of the next node that will be used for
> +interleaving of internal kernel pages allocated on behalf of the process.
> +.\" Note: code returns next interleave node via 'mode' argument -lts
> +These allocations include pages for memory mapped files in
> +process memory ranges mapped using the
> +.IR mmap (2)
> +call with the
> +.I MAP_PRIVATE
> +flag for read accesses, and in memory ranges mapped with the
> +.I MAP_SHARED
> +flag for all accesses.
> +
> +Other flag values are reserved.
>
> For an overview of the possible policies see
> .BR set_mempolicy (2).
> @@ -120,49 +140,84 @@ returns 0;
> on error, \-1 is returned and
> .I errno
> is set to indicate the error.
> -.\" .SH ERRORS
> -.\" FIXME -- no errors are listed on this page
> -.\" .
> -.\" .TP
> -.\" .B EINVAL
> -.\" .I nodemask
> -.\" is non-NULL, and
> -.\" .I maxnode
> -.\" is too small;
> -.\" or
> -.\" .I flags
> -.\" specified values other than
> -.\" .B MPOL_F_NODE
> -.\" or
> -.\" .BR MPOL_F_ADDR ;
> -.\" or
> -.\" .I flags
> -.\" specified
> -.\" .B MPOL_F_ADDR
> -.\" and
> -.\" .I addr
> -.\" is NULL.
> -.\" (And there are other
> -.\" .B EINVAL
> -.\" cases.)
> -.SH CONFORMING TO
> -This system call is Linux specific.
> +.SH ERRORS
> +.TP
> +.B EINVAL
> +The value specified by
> +.I maxnode
> +is less than the number of node ids supported by the system.
> +Or
> +.I flags
> +specified values other than
> +.B MPOL_F_NODE
> +or
> +.BR MPOL_F_ADDR ;
> +or
> +.I flags
> +specified
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is NULL,
> +or
> +.I flags
> +did not specify
> +.B MPOL_F_ADDR
> +and
> +.I addr
> +is not NULL.
> +Or,
> +.I flags
> +specified
> +.B MPOL_F_NODE
> +but not
> +.B MPOL_F_ADDR
> +and the current process policy is not
> +.BR MPOL_INTERLEAVE .
> +(And there are other EINVAL cases.)
> +.TP
> +.B EFAULT
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> .SH NOTES
> -This manual page is incomplete:
> -it does not document the details the
> -.BR MPOL_F_NODE
> -flag,
> -which modifies the operation of
> -.BR get_mempolicy ().
> -This is deliberate: this flag is not intended for application use,
> -and its operation may change or it may be removed altogether in
> -future kernel versions.
> -.B Do not use it.
> +If the mode of the process policy or the policy governing allocations at the
> +specified address is
> +.I MPOL_PREFERRED
> +and this policy was installed with an empty
> +.IR nodemask \(emspecifying
> +local allocation,
> +.IR get_mempolicy ()
> +will return the mask of on-line node ids in the location pointed to by
> +a non-NULL
> +.I nodemask
> +argument.
> +This mask does not take into consideration any administratively imposed
> +restrictions on the process' context.
> +.\" FIXME:
> +.\" "context" above refers to cpusets. No man page to reference. --lts
> +
> +.\" Christoph says the following is untrue. These are "fully supported."
> +.\" Andi concedes that he has lost this battle and approves [?]
> +.\" updating the man pages to document the behavior. --lts
> +.\" This manual page is incomplete:
> +.\" it does not document the details the
> +.\" .BR MPOL_F_NODE
> +.\" flag,
> +.\" which modifies the operation of
> +.\" .BR get_mempolicy ().
> +.\" This is deliberate: this flag is not intended for application use,
> +.\" and its operation may change or it may be removed altogether in
> +.\" future kernel versions.
> +.\" .B Do not use it.
> .SS "Versions and Library Support"
> See
> .BR mbind (2).
> .SH SEE ALSO
> .BR mbind (2),
> +.BR mmap (2),
> .BR set_mempolicy (2),
> .BR numactl (8),
> .BR numa (3)
>
>
>
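
To round out the get_mempolicy() description above, here is a hedged
sketch of the plain query case, flags of zero, where the process policy
mode and its nodemask are returned. MAX_NODES is an assumed upper bound
introduced for this example only; an EINVAL from the call would mean it
is smaller than the number of node ids the running system supports:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_NODES 256   /* assumed bound on node ids for this sketch */

int main(void)
{
    int mode;
    unsigned long nodemask[MAX_NODES / (8 * sizeof(unsigned long))] = { 0 };

    /* flags == 0 queries the process policy; addr must then be NULL. */
    if (get_mempolicy(&mode, nodemask, MAX_NODES, NULL, 0) == -1) {
        perror("get_mempolicy");
        exit(EXIT_FAILURE);
    }

    const char *name =
        mode == MPOL_DEFAULT    ? "MPOL_DEFAULT"    :
        mode == MPOL_PREFERRED  ? "MPOL_PREFERRED"  :
        mode == MPOL_BIND       ? "MPOL_BIND"       :
        mode == MPOL_INTERLEAVE ? "MPOL_INTERLEAVE" : "unknown";
    printf("process policy mode: %s\n", name);
    printf("first word of nodemask: %#lx\n", nodemask[0]);
    return 0;
}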
--
Michael Kerrisk
maintainer of Linux man pages Sections 2, 3, 4, 5, and 7
Want to help with man page maintenance? Grab the latest tarball at
http://www.kernel.org/pub/linux/docs/manpages/
read the HOWTOHELP file and grep the source files for 'FIXME'.
Thread overview: 83+ messages
2007-05-29 19:33 [PATCH] Document Linux Memory Policy Lee Schermerhorn
2007-05-29 20:04 ` Christoph Lameter
2007-05-29 20:16 ` Andi Kleen
2007-05-30 16:17 ` Lee Schermerhorn
2007-05-30 17:41 ` Christoph Lameter
2007-05-31 8:20 ` Michael Kerrisk
2007-05-31 14:49 ` Lee Schermerhorn
2007-05-31 15:56 ` Michael Kerrisk
2007-06-01 21:15 ` [PATCH] enhance memory policy sys call man pages v1 Lee Schermerhorn
2007-07-23 6:11 ` Michael Kerrisk
2007-07-23 6:32 ` mbind.2 man page patch Michael Kerrisk
2007-07-23 14:26 ` Lee Schermerhorn
2007-07-26 17:19 ` Michael Kerrisk
2007-07-26 18:06 ` Lee Schermerhorn
2007-07-26 18:18 ` Michael Kerrisk
2007-07-23 6:32 ` get_mempolicy.2 " Michael Kerrisk
2007-07-28 9:31 ` Michael Kerrisk
2007-08-09 18:43 ` Lee Schermerhorn
2007-08-09 20:57 ` Michael Kerrisk
2007-08-16 20:05 ` Andi Kleen
2007-08-18 5:50 ` Michael Kerrisk
2007-08-21 15:45 ` Lee Schermerhorn
2007-08-22 4:10 ` Michael Kerrisk
2007-08-22 16:08 ` [PATCH] Mempolicy Man Pages 2.64 1/3 - mbind.2 Lee Schermerhorn
2007-08-27 11:29 ` Michael Kerrisk
2007-08-22 16:10 ` [PATCH] Mempolicy Man Pages 2.64 2/3 - set_mempolicy.2 Lee Schermerhorn
2007-08-27 11:30 ` Michael Kerrisk
2007-08-22 16:12 ` [PATCH] Mempolicy Man Pages 2.64 3/3 - get_mempolicy.2 Lee Schermerhorn
2007-08-27 11:30 ` Michael Kerrisk
2007-08-27 10:46 ` get_mempolicy.2 man page patch Michael Kerrisk
2007-07-23 6:33 ` set_mempolicy.2 " Michael Kerrisk
2007-05-30 16:55 ` [PATCH] Document Linux Memory Policy Lee Schermerhorn
2007-05-30 17:56 ` Christoph Lameter
2007-05-31 6:18 ` Gleb Natapov
2007-05-31 6:41 ` Christoph Lameter
2007-05-31 6:47 ` Gleb Natapov
2007-05-31 6:56 ` Christoph Lameter
2007-05-31 7:11 ` Gleb Natapov
2007-05-31 7:24 ` Christoph Lameter
2007-05-31 7:39 ` Gleb Natapov
2007-05-31 17:43 ` Christoph Lameter
2007-05-31 17:07 ` Lee Schermerhorn
2007-05-31 10:43 ` Andi Kleen
2007-05-31 11:04 ` Gleb Natapov
2007-05-31 11:30 ` Gleb Natapov
2007-05-31 15:26 ` Lee Schermerhorn
2007-05-31 17:41 ` Gleb Natapov
2007-05-31 18:56 ` Lee Schermerhorn
2007-05-31 20:06 ` Gleb Natapov
2007-05-31 20:43 ` Andi Kleen
2007-06-01 9:38 ` Gleb Natapov
2007-06-01 10:21 ` Andi Kleen
2007-06-01 12:25 ` Gleb Natapov
2007-06-01 13:09 ` Andi Kleen
2007-06-01 17:15 ` Lee Schermerhorn
2007-06-01 18:43 ` Christoph Lameter
2007-06-01 19:38 ` Lee Schermerhorn
2007-06-01 19:48 ` Christoph Lameter
2007-06-01 21:05 ` Lee Schermerhorn
2007-06-01 21:56 ` Christoph Lameter
2007-06-04 13:46 ` Lee Schermerhorn
2007-06-04 16:34 ` Christoph Lameter
2007-06-04 17:02 ` Lee Schermerhorn
2007-06-04 17:11 ` Christoph Lameter
2007-06-04 20:23 ` Andi Kleen
2007-06-04 21:51 ` Christoph Lameter
2007-06-05 14:30 ` Lee Schermerhorn
2007-06-01 20:28 ` Gleb Natapov
2007-06-01 20:45 ` Christoph Lameter
2007-06-01 21:10 ` Lee Schermerhorn
2007-06-01 21:58 ` Christoph Lameter
2007-06-02 7:23 ` Gleb Natapov
2007-05-31 11:47 ` Andi Kleen
2007-05-31 11:59 ` Gleb Natapov
2007-05-31 12:15 ` Andi Kleen
2007-05-31 12:18 ` Gleb Natapov
2007-05-31 18:28 ` Lee Schermerhorn
2007-05-31 18:35 ` Christoph Lameter
2007-05-31 19:29 ` Lee Schermerhorn
2007-05-31 19:25 ` Paul Jackson
2007-05-31 20:22 ` Lee Schermerhorn
2007-05-29 20:07 ` Andi Kleen
2007-05-30 16:04 ` Lee Schermerhorn