From: Randy Dunlap <randy.dunlap@oracle.com>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, akpm@linux-foundation.org,
Mel Gorman <mel@csn.ul.ie>, Nishanth Aravamudan <nacc@us.ibm.com>,
David Rientjes <rientjes@google.com>,
linux-numa@vger.kernel.org, Adam Litke <agl@us.ibm.com>,
Andy Whitcroft <apw@canonical.com>,
eric.whitney@hp.com
Subject: Re: [PATCH 6/6] hugetlb: update hugetlb documentation for mempolicy based management.
Date: Thu, 3 Sep 2009 13:42:10 -0700
Message-ID: <20090903134210.5a27611d.randy.dunlap@oracle.com>
In-Reply-To: <20090828160351.11080.21379.sendpatchset@localhost.localdomain>
On Fri, 28 Aug 2009 12:03:51 -0400 Lee Schermerhorn wrote:
(Thanks for cc:, David.)
> [PATCH 6/6] hugetlb: update hugetlb documentation for mempolicy based management.
>
> Against: 2.6.31-rc7-mmotm-090827-0057
>
> V2: Add brief description of per node attributes.
>
> This patch updates the kernel huge tlb documentation to describe the
> numa memory policy based huge page management. Additionally, the patch
> includes a fair amount of rework to improve consistency, eliminate
> duplication and set the context for documenting the memory policy
> interaction.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> Documentation/vm/hugetlbpage.txt | 257 ++++++++++++++++++++++++++-------------
> 1 file changed, 172 insertions(+), 85 deletions(-)
>
> Index: linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-rc7-mmotm-090827-0057.orig/Documentation/vm/hugetlbpage.txt 2009-08-28 09:21:16.000000000 -0400
> +++ linux-2.6.31-rc7-mmotm-090827-0057/Documentation/vm/hugetlbpage.txt 2009-08-28 09:21:32.000000000 -0400
> @@ -53,26 +51,25 @@ HugePages_Surp is short for "surplus,"
> /proc/filesystems should also show a filesystem of type "hugetlbfs" configured
> in the kernel.
>
> -/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
> -pages in the kernel. Super user can dynamically request more (or free some
> -pre-configured) huge pages.
> -The allocation (or deallocation) of hugetlb pages is possible only if there are
> -enough physically contiguous free pages in system (freeing of huge pages is
> -possible only if there are enough hugetlb pages free that can be transferred
> -back to regular memory pool).
> -
> -Pages that are used as hugetlb pages are reserved inside the kernel and cannot
> -be used for other purposes.
> -
> -Once the kernel with Hugetlb page support is built and running, a user can
> -use either the mmap system call or shared memory system calls to start using
> -the huge pages. It is required that the system administrator preallocate
> -enough memory for huge page purposes.
> -
> -The administrator can preallocate huge pages on the kernel boot command line by
> -specifying the "hugepages=N" parameter, where 'N' = the number of huge pages
> -requested. This is the most reliable method for preallocating huge pages as
> -memory has not yet become fragmented.
> +/proc/sys/vm/nr_hugepages indicates the current number of huge pages pre-
> +allocated in the kernel's huge page pool. These are called "persistent"
> +huge pages. A user with root privileges can dynamically allocate more or
> +free some persistent huge pages by increasing or decreasing the value of
> +'nr_hugepages'.
> +
> +Pages that are used as huge pages are reserved inside the kernel and cannot
> +be used for other purposes. Huge pages can not be swapped out under
cannot
> +memory pressure.
> +
> +Once a number of huge pages have been pre-allocated to the kernel huge page
> +pool, a user with appropriate privilege can use either the mmap system call
> +or shared memory system calls to use the huge pages. See the discussion of
> +Using Huge Pages, below
below.
> +
> +The administrator can preallocate persistent huge pages on the kernel boot
> +command line by specifying the "hugepages=N" parameter, where 'N' = the
> +number of requested huge pages requested. This is the most reliable method
drop first "requested"
> +or preallocating huge pages as memory has not yet become fragmented.
of
>
> Some platforms support multiple huge page sizes. To preallocate huge pages
> of a specific size, one must preceed the huge pages boot command parameters
> @@ -80,19 +77,24 @@ with a huge page size selection paramete
> be specified in bytes with optional scale suffix [kKmMgG]. The default huge
> page size may be selected with the "default_hugepagesz=<size>" boot parameter.
>
> -/proc/sys/vm/nr_hugepages indicates the current number of configured [default
> -size] hugetlb pages in the kernel. Super user can dynamically request more
> -(or free some pre-configured) huge pages.
> -
> -Use the following command to dynamically allocate/deallocate default sized
> -huge pages:
> +When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
> +indicates the current number of pre-allocated huge pages of the default size.
> +Thus, one can use the following command to dynamically allocate/deallocate
> +default sized persistent huge pages:
>
> echo 20 > /proc/sys/vm/nr_hugepages
>
> -This command will try to configure 20 default sized huge pages in the system.
> +This command will try to adjust the number of default sized huge pages in the
> +huge page pool to 20, allocating or freeing huge pages, as required.
> +
> On a NUMA platform, the kernel will attempt to distribute the huge page pool
> -over the all on-line nodes. These huge pages, allocated when nr_hugepages
> -is increased, are called "persistent huge pages".
> +over the all the nodes specified by the NUMA memory policy of the task that
drop first "the"
> +modifies nr_hugepages that contain sufficient available contiguous memory.
whoa. too many "that"s. confusing.
> +These nodes are called the huge pages "allowed nodes". The default for the
> +huge pages allowed nodes--when the task has default memory policy--is all
> +on-line nodes. See the discussion below of the interaction of task memory
> +policy, cpusets and per node attributes with the allocation and freeing of
> +persistent huge pages.
>
> The success or failure of huge page allocation depends on the amount of
> physically contiguous memory that is preset in system at the time of the
> @@ -101,11 +103,11 @@ some nodes in a NUMA system, it will att
...
> @@ -113,39 +115,40 @@ distribution of huge pages in a NUMA sys
> /proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
> huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
> requested by applications. Writing any non-zero value into this file
> -indicates that the hugetlb subsystem is allowed to try to obtain "surplus"
> -huge pages from the buddy allocator, when the normal pool is exhausted. As
> -these surplus huge pages go out of use, they are freed back to the buddy
> -allocator.
> +indicates that the hugetlb subsystem is allowed to try to obtain that
> +number of "surplus" huge pages from the kernel's normal page pool, when the
> +persistent huge page pool is exhausted. As these surplus huge pages become
> +unused, they are freed back to the kernel's normal page pool.
>
> -When increasing the huge page pool size via nr_hugepages, any surplus
> +When increasing the huge page pool size via nr_hugepages, any existing surplus
> pages will first be promoted to persistent huge pages. Then, additional
> huge pages will be allocated, if necessary and if possible, to fulfill
> -the new huge page pool size.
> +the new persistent huge page pool size.
>
> The administrator may shrink the pool of preallocated huge pages for
> the default huge page size by setting the nr_hugepages sysctl to a
> smaller value. The kernel will attempt to balance the freeing of huge pages
> -across all on-line nodes. Any free huge pages on the selected nodes will
> -be freed back to the buddy allocator.
> -
> -Caveat: Shrinking the pool via nr_hugepages such that it becomes less
> -than the number of huge pages in use will convert the balance to surplus
> -huge pages even if it would exceed the overcommit value. As long as
> -this condition holds, however, no more surplus huge pages will be
> -allowed on the system until one of the two sysctls are increased
> -sufficiently, or the surplus huge pages go out of use and are freed.
> +across all nodes in the memory policy of the task modifying nr_hugepages.
> +Any free huge pages on the selected nodes will be freed back to the kernel's
> +normal page pool.
> +
> +Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
> +it becomes less than the number of huge pages in use will convert the balance
> +of the in-use huge pages to surplus huge pages. This will occur even if
surplus allocated huge pages? vs. surplus available huge pages?
"Surplus" (to me) implies available/unallocated...
Reading more below, I see that "surplus" here means "overcommitted". Oh well ;)
> +the number of surplus pages it would exceed the overcommit value. As long as
> +this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
> +increased sufficiently, or the surplus huge pages go out of use and are freed--
> +no more surplus huge pages will be allowed to be allocated.
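A small worked example might make this caveat easier to follow. The sketch below is only an illustration of the rule as stated in the text: the helper name surplus_after_shrink and the sample numbers are hypothetical, and it ignores reserved pages.

```shell
#!/bin/sh
# surplus_after_shrink IN_USE NEW_NR   (hypothetical helper, illustration only)
# Prints how many in-use huge pages would be accounted as surplus after
# nr_hugepages is lowered to NEW_NR while IN_USE pages are still in use.
surplus_after_shrink() {
    in_use=$1
    new_nr=$2
    if [ "$in_use" -gt "$new_nr" ]; then
        # The balance of the in-use pages is converted to surplus.
        echo $((in_use - new_nr))
    else
        # The persistent pool still covers every in-use page.
        echo 0
    fi
}

surplus_after_shrink 30 20
surplus_after_shrink 10 20
```

With 30 pages in use and nr_hugepages lowered to 20, the 10 pages over the new pool size are the ones counted as surplus, whatever nr_overcommit_hugepages says.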
>
> With support for multiple huge page pools at run-time available, much of
> -the huge page userspace interface has been duplicated in sysfs. The above
> -information applies to the default huge page size which will be
> -controlled by the /proc interfaces for backwards compatibility. The root
> -huge page control directory in sysfs is:
> +the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
> +The /proc interfaces discussed above have been retained for backwards
> +compatibility. The root huge page control directory in sysfs is:
>
> /sys/kernel/mm/hugepages
>
> For each huge page size supported by the running kernel, a subdirectory
> -will exist, of the form
> +will exist, of the form:
>
> hugepages-${size}kB
>
> @@ -159,6 +162,98 @@ Inside each of these directories, the sa
>
> which function as described above for the default huge page-sized case.
>
> +
> +Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
> +
> +Whether huge pages are allocated and freed via the /proc interface or
> +the /sysfs interface, the NUMA nodes from which huge pages are allocated
> +or freed are controlled by the NUMA memory policy of the task that modifies
> +the nr_hugepages parameter. [nr_overcommit_hugepages is a global limit.]
> +
> +The recommended method to allocate or free huge pages to/from the kernel
> +huge page pool, using the nr_hugepages example above, is:
> +
> + numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
drop '.'
> +
> +or, more succinctly:
> +
> + numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages.
ditto
> +
> +This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> +specified in <node-list>, depending on whether nr_hugepages is initially
> +less than or greater than 20, respectively. No huge pages will be
> +allocated nor freed on any node not included in the specified <node-list>.
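For readers who want to check the abs(20 - nr_hugepages) arithmetic, here is a minimal sketch. The function name hugepage_delta is hypothetical; it only computes the expected adjustment from two numbers and does not touch /proc or run numactl.

```shell
#!/bin/sh
# hugepage_delta CURRENT TARGET   (hypothetical helper, illustration only)
# Reports how many persistent huge pages the kernel would try to allocate
# or free when nr_hugepages goes from CURRENT to TARGET.
hugepage_delta() {
    current=$1
    target=$2
    if [ "$target" -gt "$current" ]; then
        echo "allocate $((target - current))"
    elif [ "$target" -lt "$current" ]; then
        echo "free $((current - target))"
    else
        echo "no change"
    fi
}

hugepage_delta 12 20
hugepage_delta 32 20
```

Whether the kernel can actually satisfy an allocation still depends on the contiguous memory available on the nodes in <node-list>.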
> +
> +Any memory policy mode--bind, preferred, local or interleave--may be
> +used. The effect on persistent huge page allocation will be as follows:
I would just use present tense as much as possible, e.g.,
allocation is as follows:
> +
> +1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> + persistent huge pages will be distributed across the node or nodes
> + specified in the mempolicy as if "interleave" had been specified.
> + However, if a node in the policy does not contain sufficient contiguous
> + memory for a huge page, the allocation will not "fallback" to the nearest
> + neighbor node with sufficient contiguous memory. To do this would cause
> + undesirable imbalance in the distribution of the huge page pool, or
> + possibly, allocation of persistent huge pages on nodes not allowed by
> + the task's memory policy.
> +
> +2) One or more nodes may be specified with the bind or interleave policy.
> + If more than one node is specified with the preferred policy, only the
> + lowest numeric id will be used. Local policy will select the node where
> + the task is running at the time the nodes_allowed mask is constructed.
> +
> +3) For local policy to be deterministic, the task must be bound to a cpu or
> + cpus in a single node. Otherwise, the task could be migrated to some
I prefer s/cpu/CPU/ in all of Documentation/ text, but the cat is already out
of the bag on that.
> + other node at any time after launch and the resulting node will be
> + indeterminate. Thus, local policy is not very useful for this purpose.
> + Any of the other mempolicy modes may be used to specify a single node.
> +
> +4) The nodes allowed mask will be derived from any non-default task mempolicy,
> + whether this policy was set explicitly by the task itself or one of its
> + ancestors, such as numactl. This means that if the task is invoked from a
> + shell with non-default policy, that policy will be used. One can specify a
> + node list of "all" with numactl --interleave or --membind [-m] to achieve
> + interleaving over all nodes in the system or cpuset.
> +
> +5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> + the resource limits of any cpuset in which the task runs. Thus, there will
> + be no way for a task with non-default policy running in a cpuset with a
> + subset of the system nodes to allocate huge pages outside the cpuset
> + without first moving to a cpuset that contains all of the desired nodes.
> +
> +6) Hugepages allocated at boot time always use the node_online_map.
> +
> +
> +Per Node Hugepages Attributes
> +
> +A subset of the contents of the root huge page control directory in sysfs,
> +described above, has been replicated under each "node" system device in:
> +
> + /sys/devices/system/node/node[0-9]*/hugepages/
> +
> +Under this directory, the subdirectory for each supported huge page size
> +contains the following attribute files:
> +
> + nr_hugepages
> + free_hugepages
> + surplus_hugepages
> +
> +The free_' and surplus_' attribute files are read-only. They return the number
> +of free and surplus [overcommitted] huge pages, respectively, on the parent
> +node.
> +
> +The nr_hugepages attribute will return the total number of huge pages on the
> +specified node. When this attribute is written, the number of persistent huge
> +pages on the parent node will be adjusted to the specified value, if sufficient
> +resources exist, regardless of the task's mempolicy or cpuset constraints.
> +
> +Note that the number of overcommit and reserve pages remain global quantities,
> +as we don't know until fault time, when the faulting task's mempolicy is applied,
> +from which node the huge page allocation will be attempted.
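To make the directory layout concrete, here is a sketch that assembles a per-node attribute path from the pieces described above. node_hugepage_attr is a hypothetical helper, and the 2048 kB size is only an example; the sizes present depend on the platform.

```shell
#!/bin/sh
# node_hugepage_attr NODE SIZE_KB ATTR   (hypothetical helper)
# Builds the per-node sysfs path for one huge page attribute file.
node_hugepage_attr() {
    node=$1
    size_kb=$2
    attr=$3
    echo "/sys/devices/system/node/node${node}/hugepages/hugepages-${size_kb}kB/${attr}"
}

# Print the path of the nr_hugepages attribute for node 0.
node_hugepage_attr 0 2048 nr_hugepages

# Read the free page count for node 0, but only if the file exists here.
path=$(node_hugepage_attr 0 2048 free_hugepages)
if [ -r "$path" ]; then
    cat "$path"
fi
```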
> +
> +
> +Using Huge Pages:
> +
> If the user applications are going to request huge pages using mmap system
> call, then it is required that system administrator mount a file system of
> type hugetlbfs:
> @@ -206,9 +301,11 @@ map_hugetlb.c.
...
> @@ -237,14 +334,8 @@ map_hugetlb.c.
...
> @@ -302,10 +393,12 @@ int main(void)
...
> @@ -317,14 +410,8 @@ int main(void)
...
---
~Randy
LPC 2009, Sept. 23-25, Portland, Oregon
http://linuxplumbersconf.org/2009/