From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: ak@suse.de, clameter@sgi.com, akpm@linux-foundation.org,
	linux-mm@kvack.org,
	Samuel Thibault <samuel.thibault@ens-lyon.org>
Subject: Re: mbind.2 man page patch
Date: Mon, 23 Jul 2007 10:26:08 -0400
Message-ID: <1185200768.5074.10.camel@localhost>
In-Reply-To: <46A44B8D.2040200@gmx.net>

On Mon, 2007-07-23 at 08:32 +0200, Michael Kerrisk wrote:
> Andi, Christoph
> 
> Could you please review these changes by Lee to the mbind.2 page?  Patch
> against man-pages-2.63 (available from
> http://www.kernel.org/pub/linux/docs/manpages).
> 
> Andi / Christoph / Lee: There are a few points marked FIXME about which I'd
> particularly like some input.
> 
> Lee: aside from the changes that you made, plus my edits, I added a sentence
> to this page that came in independently from Samuel Thibault (marked below).
> 
> Cheers,
> 
> Michael
> 
> --- mbind.2.orig        2007-07-01 06:22:24.000000000 +0200
> +++ mbind.2     2007-07-21 09:18:05.000000000 +0200
> @@ -1,4 +1,5 @@
>  .\" Copyright 2003,2004 Andi Kleen, SuSE Labs.
> +.\" and Copyright (C) 2007 Lee Schermerhorn <Lee.Schermerhorn@hp.com>
>  .\"
>  .\" Permission is granted to make and distribute verbatim copies of this
>  .\" manual provided the copyright notice and this permission notice are
> @@ -18,92 +19,214 @@
>  .\" the source, must acknowledge the copyright and authors of this work.
>  .\"
>  .\" 2006-02-03, mtk, substantial wording changes and other improvements
> +.\" 2007-06-01, Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> +.\"     more precise specification of behavior.
>  .\"
> -.TH MBIND 2 2006-02-07 "Linux" "Linux Programmer's Manual"
> +.TH MBIND 2 2007-07-20 Linux "Linux Programmer's Manual"
>  .SH NAME
>  mbind \- Set memory policy for a memory range
>  .SH SYNOPSIS
>  .nf
>  .B "#include <numaif.h>"
>  .sp
> -.BI "int mbind(void *" start ", unsigned long " len  ", int " policy ,
> +.BI "int mbind(void *" start ", unsigned long " len  ", int " mode ,
>  .BI "          unsigned long *" nodemask  ", unsigned long " maxnode ,
>  .BI "          unsigned " flags );
>  .sp
> -.BI "cc ... \-lnuma"
> +Link with \fI\-lnuma\fP.
>  .fi
>  .SH DESCRIPTION
> +The memory of a NUMA machine is divided into multiple nodes.
> +The memory policy defines the node on which memory is allocated.
>  .BR mbind ()
> -sets the NUMA memory
> -.I policy
> +sets the NUMA memory policy
>  for the memory range starting with
>  .I start
>  and continuing for
>  .IR len
>  bytes.
> -The memory of a NUMA machine is divided into multiple nodes.
> -The memory policy defines in which node memory is allocated.
> +.\" The following sentence added by Samuel Thibault:
> +.I start
> +must be page aligned.
> +
> +The NUMA policy consists of a policy mode, specified in
> +.IR mode ,
> +and a set of zero or more nodes, specified in
> +.IR nodemask ;
> +these arguments are described below.
> +
> +If the memory range specified by the
> +.IR start " and " len
> +arguments includes an anonymous region of memory (i.e.,
> +a region of memory created using
> +.BR mmap (2)
> +with the
> +.BR MAP_ANONYMOUS
> +flag) or
> +a memory mapped file mapped using
> +.BR mmap (2)
> +with the
> +.B MAP_PRIVATE
> +flag, pages will only be allocated according to the specified
> +policy when the application writes [stores] to the page.
> +For anonymous regions, an initial read access will use a shared
> +page in the kernel containing all zeros.
> +For a file mapped with
> +.BR MAP_PRIVATE ,
> +an initial read access will allocate pages according to the
> +process policy of the process that causes the page to be allocated.
> +This might not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a memory mapped file mapped using
> +.BR mmap (2)
> +with the
> +.B MAP_SHARED
> +flag, the specified policy will be ignored for all page allocations
> +in this range.
> +.\" FIXME Lee / Andi: can you clarify/confirm "the specified policy
> +.\" will be ignored for all page allocations in this range".
> +.\" That text seems to be saying that if the memory range contains
> +.\" (say) some mappings that are allocated with MAP_SHARED
> +.\" and others allocated with MAP_PRIVATE, then the policy
> +.\" will be ignored for all of the mappings, including even
> +.\" the MAP_PRIVATE mappings.  Right?  I just want to be
> +.\" sure that that is what the text is meaning.

I can see from the wording how you might think this.  However, the policy
will be ignored only for the MAP_SHARED mappings in the range.
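
To make the private-mapping case concrete, here is a minimal sketch (not
taken from the page or from libnuma).  It maps an anonymous MAP_PRIVATE
region and asks for "local allocation" with MPOL_PREFERRED and an empty
node set, as the page describes.  Assumptions: it calls the raw system
call through syscall(2) instead of linking with -lnuma, so the
SKETCH_MPOL_PREFERRED constant and the names sketch_mbind() and
map_private_local() are made up for illustration.  The call can
legitimately fail (for example on a kernel built without CONFIG_NUMA),
so the error is reported to the caller rather than treated as fatal.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: mirrors MPOL_PREFERRED from <linux/mempolicy.h>, so that
 * the sketch does not depend on numaif.h being installed. */
#define SKETCH_MPOL_PREFERRED 1

/* Hypothetical thin wrapper around the raw mbind system call. */
static long sketch_mbind(void *start, unsigned long len, int mode,
                         const unsigned long *nodemask,
                         unsigned long maxnode, unsigned flags)
{
    return syscall(SYS_mbind, start, len, mode, nodemask, maxnode, flags);
}

/* Map len bytes of private anonymous memory and request local
 * allocation for it.  Returns the mapping, or NULL if mmap() failed.
 * *mbind_errno is set to 0 on success, or to the errno from mbind. */
static void *map_private_local(size_t len, int *mbind_errno)
{
    assert(len > 0 && mbind_errno != NULL);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* mmap() returns a page-aligned address, as mbind requires.
     * NULL nodemask plus maxnode 0 specifies the empty set of nodes. */
    *mbind_errno = 0;
    if (sketch_mbind(p, len, SKETCH_MPOL_PREFERRED, NULL, 0, 0) != 0)
        *mbind_errno = errno;

    return p;
}
```

Pages in such a region are only allocated under the policy when the
application first writes to them.  A MAP_SHARED file mapping covered by
the same call would keep using the process policy.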

> +Instead, the pages will be allocated according to the process policy
> +of the process that caused the page to be allocated.
> +Again, this might not be the process that called
> +.BR mbind ().
> +
> +If the specified memory range includes a shared memory region
> +created using
> +.BR shmget (2)
> +and attached using
> +.BR shmat (2),
> +pages allocated for the anonymous or shared memory region will
> +be allocated according to the policy specified, regardless of which
> +process attached to the shared memory segment causes the allocation.
> +If, however, the shared memory region was created with the
> +.B SHM_HUGETLB
> +flag,
> +the huge pages will be allocated according to the policy specified
> +only if the page allocation is caused by the task that calls
> +.BR mbind ()
> +for that region.
> +
> +By default,
>  .BR mbind ()
>  only has an effect for new allocations; if the pages inside
> -the range have been already touched before setting the policy,
> +the range have already been touched before setting the policy,
>  then the policy has no effect.
> +This default behavior may be overridden by the
> +.BR MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +flags described below.
> 
> -Available policies are
> +The
> +.I mode
> +argument must specify one of
>  .BR MPOL_DEFAULT ,
>  .BR MPOL_BIND ,
>  .BR MPOL_INTERLEAVE ,
> -and
> +or
>  .BR MPOL_PREFERRED .
> -All policies except
> +All policy modes except
>  .B MPOL_DEFAULT
> -require the caller to specify the nodes to which the policy applies in the
> +require the caller to specify
> +the node or nodes to which the mode applies, via the
>  .I nodemask
> -parameter.
> +argument.
> +
>  .I nodemask
> -is a bit mask of nodes containing up to
> +points to a bit mask of nodes containing up to
>  .I maxnode
>  bits.
> -The actual number of bytes transferred via this argument
> +The actual number of bytes transferred via
> +.I nodemask
>  is rounded up to the next multiple of
>  .IR "sizeof(unsigned long)" ,
>  but the kernel will only use bits up to
>  .IR maxnode .
> -A NULL argument means an empty set of nodes.
> +A NULL value for
> +.IR nodemask ,
> +or a
> +.I maxnode
> +value of zero specifies the empty set of nodes.
> +If the value of
> +.I maxnode
> +is zero, then the
> +.I nodemask
> +argument is ignored.
> 
>  The
>  .B MPOL_DEFAULT
> -policy is the default and means to use the underlying process policy
> -(which can be modified with
> -.BR set_mempolicy (2)).
> -Unless the process policy has been changed this means to allocate
> -memory on the node of the CPU that triggered the allocation.
> +mode specifies the default policy.
> +When applied to a range of memory via
> +.BR mbind (),
> +this means that the process policy should be used;
> +the process policy can be set with
> +.BR set_mempolicy (2).
> +If the mode of the process policy is also
> +.BR MPOL_DEFAULT ,
> +then pages will be allocated on the node of the CPU that
> +triggers the allocation.
> +For
> +.BR MPOL_DEFAULT ,
> +the
>  .I nodemask
> -should be specified as NULL.
> +and
> +.I maxnode
> +arguments must specify the empty set of nodes.
> 
>  The
>  .B MPOL_BIND
> -policy is a strict policy that restricts memory allocation to the
> -nodes specified in
> +mode specifies a strict policy that restricts memory allocation to
> +the nodes specified in
>  .IR nodemask .
> +If
> +.I nodemask
> +specifies more than one node, page allocations will come from
> +the node with the lowest numeric node ID first, until that node
> +contains no free memory.
> +Allocations will then come from the node with the next highest
> +node ID specified in
> +.I nodemask
> +and so forth, until none of the specified nodes contains free memory.
>  There won't be allocations on other nodes.
> 
> +The
>  .B MPOL_INTERLEAVE
> -interleaves allocations to the nodes specified in
> +mode specifies that page allocations be interleaved across the
> +set of nodes specified in
>  .IR nodemask .
> -This optimizes for bandwidth instead of latency.
> +This optimizes for bandwidth instead of latency
> +by spreading out pages and memory accesses to those pages across
> +multiple nodes.
>  To be effective the memory area should be fairly large,
> -at least 1MB or bigger.
> +at least 1MB or bigger with a fairly uniform access pattern.
> +Accesses to a single page of the area will still be limited to
> +the memory bandwidth of a single node.
> 
>  .B MPOL_PREFERRED
>  sets the preferred node for allocation.
> -The kernel will try to allocate in this
> +The kernel will try to allocate pages on this
>  node first and fall back to other nodes if the
>  preferred nodes is low on free memory.
> -Only the first node in the
> +If
> +.I nodemask
> +specifies more than one node ID, the first node in the
> +mask will be selected as the preferred node.
> +If the
>  .I nodemask
> -is used.
> -If no node is set in the mask, then the memory is allocated on
> -the node of the CPU that triggered the allocation allocation).
> +and
> +.I maxnode
> +arguments specify the empty set, then the memory is allocated on
> +the node of the CPU that triggered the allocation.
> +This is the only way to specify "local allocation" for a
> +range of memory via
> +.BR mbind ().
> 
>  If
>  .B MPOL_MF_STRICT
> @@ -115,17 +238,20 @@
>  .BR MPOL_DEFAULT ,
>  then the call will fail with the error
>  .B EIO
> -if the existing pages in the mapping don't follow the policy.
> -In 2.6.16 or later the kernel will also try to move pages
> -to the requested node with this flag.
> +if the existing pages in the memory range don't follow the policy.
> +.\" FIXME Andi / Christoph -- can you please verify Lee's change here:
> +.\" According to the kernel code, the following is not true
> +.\" -- Lee Schermerhorn:
> +.\" In 2.6.16 or later the kernel will also try to move pages
> +.\" to the requested node with this flag.
> 
>  If
>  .B MPOL_MF_MOVE
> -is passed in
> +is specified in
>  .IR flags ,
> -then an attempt will be made  to
> -move all the pages in the mapping so that they follow the policy.
> -Pages that are shared with other processes are not moved.
> +then the kernel will attempt to move all the existing pages
> +in the memory range so that they follow the policy.
> +Pages that are shared with other processes will not be moved.
>  If
>  .B MPOL_MF_STRICT
>  is also specified, then the call will fail with the error
> @@ -136,8 +262,8 @@
>  .B MPOL_MF_MOVE_ALL
>  is passed in
>  .IR flags ,
> -then all pages in the mapping will be moved regardless of whether
> -other processes use the pages.
> +then the kernel will attempt to move all existing pages in the memory
> +range regardless of whether other processes use the pages.
>  The calling process must be privileged
>  .RB ( CAP_SYS_NICE )
>  to use this flag.
> @@ -154,10 +280,15 @@
>  .I errno
>  is set to indicate the error.
>  .SH ERRORS
> +.\"  I think I got all of the error returns.  -- Lee Schermerhorn
>  .TP
>  .B EFAULT
> -There was a unmapped hole in the specified memory range
> -or a passed pointer was not valid.
> +Part or all of the memory range specified by
> +.I nodemask
> +and
> +.I maxnode
> +points outside your accessible address space.
> +Or, there was an unmapped hole in the specified memory range.
>  .TP
>  .B EINVAL
>  An invalid value was specified for
> @@ -169,56 +300,96 @@
>  was less than
>  .IR start ;
>  or
> -.I policy
> -was
> +.I start
> +is not a multiple of the system page size.
> +Or,
> +.I mode
> +is
>  .B MPOL_DEFAULT
>  and
>  .I nodemask
> -pointed to a non-empty set;
> +specified a non-empty set;
>  or
> -.I policy
> -was
> +.I mode
> +is
>  .B MPOL_BIND
>  or
>  .B MPOL_INTERLEAVE
>  and
>  .I nodemask
> -pointed to an empty set,
> +is empty.
> +Or,
> +.I maxnode
> +specifies more than a page worth of bits.
> +Or,
> +.I nodemask
> +specifies one or more node IDs that are
> +greater than the maximum supported node ID,
> +or are not allowed in the calling task's context.
> +.\" "calling task's context" refers to cpusets.
> +.\" No man page avail to reference. -- Lee Schermerhorn
> +Or, none of the node IDs specified by
> +.I nodemask
> +are on-line, or none of the specified nodes contain memory.
>  .TP
>  .B ENOMEM
> -System out of memory.
> +Insufficient kernel memory was available.
>  .TP
>  .B EIO
>  .B MPOL_MF_STRICT
>  was specified and an existing page was already on a node
> -that does not follow the policy.
> +that does not follow the policy;
> +or
> +.B MPOL_MF_MOVE
> +or
> +.B MPOL_MF_MOVE_ALL
> +was specified and the kernel was unable to move all existing
> +pages in the range.
> +.TP
> +.B EPERM
> +The
> +.I flags
> +argument included the
> +.B MPOL_MF_MOVE_ALL
> +flag and the caller does not have the
> +.B CAP_SYS_NICE
> +privilege.
>  .SH CONFORMING TO
>  This system call is Linux specific.
>  .SH NOTES
> -NUMA policy is not supported on file mappings.
> +NUMA policy is not supported on a memory mapped file range
> +that was mapped with the
> +.B MAP_SHARED
> +flag.
> 
>  .B MPOL_MF_STRICT
> -is  ignored  on  huge page mappings right now.
> +is ignored on huge page mappings.
> 
> -It is unfortunate that the same flag,
> +The
>  .BR MPOL_DEFAULT ,
> -has different effects for
> -.BR mbind (2)
> +mode has different effects for
> +.BR mbind ()
>  and
>  .BR set_mempolicy (2).
> -To select "allocation on the node of the CPU that
> -triggered the allocation" (like
> -.BR set_mempolicy (2)
> -.BR MPOL_DEFAULT )
> -when calling
> +When
> +.B MPOL_DEFAULT
> +is specified for a range of memory using
>  .BR mbind (),
> +any pages subsequently allocated for that range will use
> +the process's policy, as set by
> +.BR set_mempolicy (2).
> +This effectively removes the explicit policy from the
> +specified range.
> +To select "local allocation" for a memory range,
>  specify a
> -.I policy
> +.I mode
>  of
>  .B MPOL_PREFERRED
> -with an empty
> -.IR nodemask .
> -.SS "Versions and Library Support"
> +with an empty set of nodes.
> +This method will work for
> +.BR set_mempolicy (2),
> +as well.
> +.SS "Versions and Library Support"
>  The
>  .BR mbind (),
>  .BR get_mempolicy (2),
> @@ -228,16 +399,17 @@
>  They are only available on kernels compiled with
>  .BR CONFIG_NUMA .
> 
> -Support for huge page policy was added with 2.6.16.
> -For interleave policy to be effective on huge page mappings the
> -policied memory needs to be tens of megabytes or larger.
> -
> -.B MPOL_MF_MOVE
> -and
> -.B MPOL_MF_MOVE_ALL
> -are only available on Linux 2.6.16 and later.
> +You can link with
> +.I \-lnuma
> +to get system call definitions.
> +.I libnuma
> +and the required
> +.I numaif.h
> +header are available in the
> +.I numactl
> +package.
> 
> -These system calls should not be used directly.
> +However, applications should not use these system calls directly.
>  Instead, the higher level interface provided by the
>  .BR numa (3)
>  functions in the
> @@ -247,20 +419,25 @@
>  .I numactl
>  package is available at
>  .IR ftp://ftp.suse.com/pub/people/ak/numa/ .
> -
> -You can link with
> -.I \-lnuma
> -to get system call definitions.
> -.I libnuma
> -is available in the
> -.I numactl
> +The package is also included in some Linux distributions.
> +Some distributions include the development library and header
> +in the separate
> +.I numactl-devel
>  package.
> -This package also has the
> -.I numaif.h
> -header.
> +
> +Support for huge page policy was added with 2.6.16.
> +For interleave policy to be effective on huge page mappings the
> +policied memory needs to be tens of megabytes or larger.
> +
> +.B MPOL_MF_MOVE
> +and
> +.B MPOL_MF_MOVE_ALL
> +are only available on Linux 2.6.16 and later.
>  .SH SEE ALSO
> -.BR numa (3),
> -.BR numactl (8),
> -.BR set_mempolicy (2),
>  .BR get_mempolicy (2),
> -.BR mmap (2)
> +.BR mmap (2),
> +.BR set_mempolicy (2),
> +.BR shmat (2),
> +.BR shmget (2),
> +.BR numa (3),
> +.BR numactl (8)
> 
> 
> 
> 
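
Since the nodemask/maxnode wording changes the most in this patch, here
is a small sketch of how a caller might build and inspect such a mask.
It follows only the description in the page: the mask is an array of
unsigned long, its size is rounded up to a multiple of
sizeof(unsigned long), and a NULL mask or a maxnode of zero means the
empty set.  All helper names (nodemask_words() and friends) are
hypothetical, not part of libnuma or the kernel interface.  Reading
"the first node in the mask" as the lowest-numbered set bit is my
assumption, and the exact bit-count handling inside the kernel is not
checked here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold maxnode bits, rounded up. */
static size_t nodemask_words(unsigned long maxnode)
{
    return (maxnode + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
}

/* Add one node ID to the mask; the ID must be below maxnode. */
static void nodemask_set(unsigned long *mask, unsigned long maxnode,
                         unsigned long node)
{
    assert(mask != NULL && node < maxnode);
    mask[node / SKETCH_BITS_PER_LONG] |= 1UL << (node % SKETCH_BITS_PER_LONG);
}

/* Lowest-numbered node in the mask, or -1 for the empty set.
 * A NULL mask or a maxnode of zero counts as the empty set. */
static long nodemask_first(const unsigned long *mask, unsigned long maxnode)
{
    if (mask == NULL)
        return -1;
    for (unsigned long n = 0; n < maxnode; n++) {
        if (mask[n / SKETCH_BITS_PER_LONG] &
            (1UL << (n % SKETCH_BITS_PER_LONG)))
            return (long)n;
    }
    return -1;
}

/* Non-zero if the arguments specify the empty set of nodes. */
static int nodemask_is_empty(const unsigned long *mask, unsigned long maxnode)
{
    return nodemask_first(mask, maxnode) < 0;
}
```

With helpers like these, the argument checks listed under EINVAL can be
done before the call: MPOL_DEFAULT wants nodemask_is_empty() to be
true, while MPOL_BIND and MPOL_INTERLEAVE want it to be false.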

