From: Mel Gorman <mel@csn.ul.ie>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org,
akpm@linux-foundation.org, Nishanth Aravamudan <nacc@us.ibm.com>,
David Rientjes <rientjes@google.com>, Adam Litke <agl@us.ibm.com>,
Andy Whitcroft <apw@canonical.com>,
eric.whitney@hp.com
Subject: Re: [PATCH 3/5] hugetlb: derive huge pages nodes allowed from task mempolicy
Date: Tue, 25 Aug 2009 11:22:04 +0100
Message-ID: <20090825102204.GC4427@csn.ul.ie>
In-Reply-To: <20090824192752.10317.96125.sendpatchset@localhost.localdomain>
On Mon, Aug 24, 2009 at 03:27:52PM -0400, Lee Schermerhorn wrote:
> [PATCH 3/4] hugetlb: derive huge pages nodes allowed from task mempolicy
>
> Against: 2.6.31-rc6-mmotm-090820-1918
>
> V2:
> + cleaned up comments, removed some deemed unnecessary,
> added some suggested by review
> + removed check for !current in huge_mpol_nodes_allowed().
> + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
> + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
> catch out of range node id.
> + add examples to patch description
>
> V3: Factored this patch from V2 patch 2/3
>
> V4: added back missing "kfree(nodes_allowed)" in set_max_huge_pages()
>
> This patch derives a "nodes_allowed" node mask from the numa
> mempolicy of the task modifying the number of persistent huge
> pages to control the allocation, freeing and adjusting of surplus
> huge pages. This mask is derived as follows:
>
> * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
> is produced. This will cause the hugetlb subsystem to use
> node_online_map as the "nodes_allowed". This preserves the
> behavior before this patch.
> * For "preferred" mempolicy, including explicit local allocation,
> a nodemask with the single preferred node will be produced.
> "local" policy will NOT track any internode migrations of the
> task adjusting nr_hugepages.
> * For "bind" and "interleave" policy, the mempolicy's nodemask
> will be used.
> * Other than to inform the construction of the nodes_allowed node
> mask, the actual mempolicy mode is ignored. That is, all modes
> behave like interleave over the resulting nodes_allowed mask
> with no "fallback".
>
> Notes:
>
> 1) This patch introduces a subtle change in behavior: huge page
> allocation and freeing will be constrained by any mempolicy
> that the task adjusting the huge page pool inherits from its
> parent. This policy could come from a distant ancestor. The
> administrator adjusting the huge page pool without explicitly
> specifying a mempolicy via numactl might be surprised by this.
> Additionally, any mempolicy specified by numactl will be
> constrained by the cpuset in which numactl is invoked.
>
> 2) Hugepages allocated at boot time use the node_online_map.
> An additional patch could implement a temporary boot time
> huge pages nodes_allowed command line parameter.
>
> 3) Using mempolicy to control persistent huge page allocation
> and freeing requires no change to hugeadm when invoking
> it via numactl, as shown in the examples below. However,
> hugeadm could be enhanced to take the allowed nodes as an
> argument and set its task mempolicy itself. This would allow
> it to detect and warn about any non-default mempolicy that it
> inherited from its parent, thus alleviating the issue described
> in Note 1 above.
>
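As an illustration of what note 2 suggests, a boot-time nodes_allowed
parameter could be as small as the sketch below. The "hugepages_nodes="
name and the hugepages_nodes_allowed mask are hypothetical, invented
here; nodelist_parse() and __setup() are the existing kernel helpers:

static nodemask_t hugepages_nodes_allowed = NODE_MASK_ALL;

static int __init hugepages_nodes_setup(char *s)
{
	/* Parse a "0-1,3"-style node list into the mask. */
	if (nodelist_parse(s, hugepages_nodes_allowed))
		hugepages_nodes_allowed = NODE_MASK_ALL;	/* bad input */
	return 1;
}
__setup("hugepages_nodes=", hugepages_nodes_setup);

The boot-time allocation path would then consult that mask instead of
node_online_map.
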
> See the updated documentation [next patch] for more information
> about the implications of this patch.
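Regarding note 3: the detect-and-bind step hugeadm would need is
small. Here is a minimal userspace sketch, assuming libnuma's
<numaif.h> wrappers for get_mempolicy(2)/set_mempolicy(2), with the
node mask hardcoded where hugeadm would parse its argument:

#include <numaif.h>
#include <stdio.h>

int main(void)
{
	int mode;
	unsigned long nodes = (1UL << 0) | (1UL << 1);	/* e.g. nodes 0-1 */

	/* Warn about a non-default policy inherited from an ancestor,
	 * i.e. the surprise described in note 1. */
	if (get_mempolicy(&mode, NULL, 0, NULL, 0) == 0 &&
	    mode != MPOL_DEFAULT)
		fprintf(stderr, "warning: inherited mempolicy (mode %d)\n",
			mode);

	/* Bind to the requested nodes, then adjust the pool exactly
	 * as hugeadm does today. */
	if (set_mempolicy(MPOL_BIND, &nodes, sizeof(nodes) * 8) != 0) {
		perror("set_mempolicy");
		return 1;
	}
	/* ... write nr_hugepages ... */
	return 0;
}

Functionally that is just "numactl -m 0,1 hugeadm ..." folded into the
tool, but it gives hugeadm a place to emit the warning itself.
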
>
> Examples:
>
> Starting with:
>
> Node 0 HugePages_Total: 0
> Node 1 HugePages_Total: 0
> Node 2 HugePages_Total: 0
> Node 3 HugePages_Total: 0
>
> Default behavior [with or without this patch] balances persistent
> hugepage allocation across nodes [with sufficient contiguous memory]:
>
> hugeadm --pool-pages-min=2048Kb:32
>
> yields:
>
> Node 0 HugePages_Total: 8
> Node 1 HugePages_Total: 8
> Node 2 HugePages_Total: 8
> Node 3 HugePages_Total: 8
>
> Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> '--membind' because it allows multiple nodes to be specified
> and it's easy to type]--we can allocate huge pages on
> individual nodes or sets of nodes. So, starting from the
> condition above, with 8 huge pages per node:
>
> numactl -m 2 hugeadm --pool-pages-min=2048Kb:+8
>
> yields:
>
> Node 0 HugePages_Total: 8
> Node 1 HugePages_Total: 8
> Node 2 HugePages_Total: 16
> Node 3 HugePages_Total: 8
>
> The incremental 8 huge pages were restricted to node 2 by the
> specified mempolicy.
>
> Similarly, we can use mempolicy to free persistent huge pages
> from specified nodes:
>
> numactl -m 0,1 hugeadm --pool-pages-min=2048Kb:-8
>
> yields:
>
> Node 0 HugePages_Total: 4
> Node 1 HugePages_Total: 4
> Node 2 HugePages_Total: 16
> Node 3 HugePages_Total: 8
>
> The 8 huge pages freed were balanced over nodes 0 and 1.
>
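Incidentally, the per-node numbers in the examples come from the
per-node sysfs meminfo files. For anyone reproducing them, a
throwaway reader like this sketch does the job (the four-node loop
just matches the example system):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char path[64], line[128];
	int nid;

	for (nid = 0; nid < 4; nid++) {
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/meminfo", nid);
		f = fopen(path, "r");
		if (!f)
			break;		/* node not present */
		while (fgets(line, sizeof(line), f))
			if (strstr(line, "HugePages_Total"))
				fputs(line, stdout);
		fclose(f);
	}
	return 0;
}
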
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
I haven't been able to test this yet because of some build and deploy
issues, but I didn't spot anything wrong when eyeballing the patch. For
the moment:
Acked-by: Mel Gorman <mel@csn.ul.ie>
>
> include/linux/mempolicy.h | 3 ++
> mm/hugetlb.c | 14 ++++++----
> mm/mempolicy.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 73 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/mempolicy.c 2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/mempolicy.c 2009-08-24 12:12:53.000000000 -0400
> @@ -1564,6 +1564,67 @@ struct zonelist *huge_zonelist(struct vm
> }
> return zl;
> }
> +
> +/*
> + * huge_mpol_nodes_allowed -- mempolicy extension for huge pages.
> + *
> + * Returns a [pointer to a] nodemask based on the current task's mempolicy
> + * to constrain the allocation and freeing of persistent huge pages.
> + * 'Preferred', 'local' and 'interleave' mempolicy will behave more like
> + * 'bind' policy in this context. An attempt to allocate a persistent huge
> + * page will never "fallback" to another node inside the buddy system
> + * allocator.
> + *
> + * If the task's mempolicy is "default" [NULL], just return NULL for
> + * default behavior. Otherwise, extract the policy nodemask for 'bind'
> + * or 'interleave' policy or construct a nodemask for 'preferred' or
> + * 'local' policy and return a pointer to a kmalloc()ed nodemask_t.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */
> +nodemask_t *huge_mpol_nodes_allowed(void)
> +{
> + nodemask_t *nodes_allowed = NULL;
> + struct mempolicy *mempolicy;
> + int nid;
> +
> + if (!current->mempolicy)
> + return NULL;
> +
> + mpol_get(current->mempolicy);
> + nodes_allowed = kmalloc(sizeof(*nodes_allowed), GFP_KERNEL);
> + if (!nodes_allowed) {
> + printk(KERN_WARNING "%s unable to allocate nodes allowed mask "
> + "for huge page allocation.\nFalling back to default.\n",
> + current->comm);
> + goto out;
> + }
> + nodes_clear(*nodes_allowed);
> +
> + mempolicy = current->mempolicy;
> + switch (mempolicy->mode) {
> + case MPOL_PREFERRED:
> + if (mempolicy->flags & MPOL_F_LOCAL)
> + nid = numa_node_id();
> + else
> + nid = mempolicy->v.preferred_node;
> + node_set(nid, *nodes_allowed);
> + break;
> +
> + case MPOL_BIND:
> + /* Fall through */
> + case MPOL_INTERLEAVE:
> + *nodes_allowed = mempolicy->v.nodes;
> + break;
> +
> + default:
> + BUG();
> + }
> +
> +out:
> + mpol_put(current->mempolicy);
> + return nodes_allowed;
> +}
> #endif
>
> /* Allocate a page in interleaved policy.
> Index: linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/include/linux/mempolicy.h 2009-08-24 12:12:44.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/include/linux/mempolicy.h 2009-08-24 12:12:53.000000000 -0400
> @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
> extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> unsigned long addr, gfp_t gfp_flags,
> struct mempolicy **mpol, nodemask_t **nodemask);
> +extern nodemask_t *huge_mpol_nodes_allowed(void);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
> return node_zonelist(0, gfp_flags);
> }
>
> +static inline nodemask_t *huge_mpol_nodes_allowed(void) { return NULL; }
> +
> static inline int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes,
> const nodemask_t *to_nodes, int flags)
> Index: linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-rc6-mmotm-090820-1918.orig/mm/hugetlb.c 2009-08-24 12:12:50.000000000 -0400
> +++ linux-2.6.31-rc6-mmotm-090820-1918/mm/hugetlb.c 2009-08-24 12:12:53.000000000 -0400
> @@ -1257,10 +1257,13 @@ static int adjust_pool_surplus(struct hs
> static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count)
> {
> unsigned long min_count, ret;
> + nodemask_t *nodes_allowed;
>
> if (h->order >= MAX_ORDER)
> return h->max_huge_pages;
>
> + nodes_allowed = huge_mpol_nodes_allowed();
> +
> /*
> * Increase the pool size
> * First take pages out of surplus state. Then make up the
> @@ -1274,7 +1277,7 @@ static unsigned long set_max_huge_pages(
> */
> spin_lock(&hugetlb_lock);
> while (h->surplus_huge_pages && count > persistent_huge_pages(h)) {
> - if (!adjust_pool_surplus(h, NULL, -1))
> + if (!adjust_pool_surplus(h, nodes_allowed, -1))
> break;
> }
>
> @@ -1285,7 +1288,7 @@ static unsigned long set_max_huge_pages(
> * and reducing the surplus.
> */
> spin_unlock(&hugetlb_lock);
> - ret = alloc_fresh_huge_page(h, NULL);
> + ret = alloc_fresh_huge_page(h, nodes_allowed);
> spin_lock(&hugetlb_lock);
> if (!ret)
> goto out;
> @@ -1309,18 +1312,19 @@ static unsigned long set_max_huge_pages(
> */
> min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages;
> min_count = max(count, min_count);
> - try_to_free_low(h, min_count, NULL);
> + try_to_free_low(h, min_count, nodes_allowed);
> while (min_count < persistent_huge_pages(h)) {
> - if (!free_pool_huge_page(h, NULL, 0))
> + if (!free_pool_huge_page(h, nodes_allowed, 0))
> break;
> }
> while (count < persistent_huge_pages(h)) {
> - if (!adjust_pool_surplus(h, NULL, 1))
> + if (!adjust_pool_surplus(h, nodes_allowed, 1))
> break;
> }
> out:
> ret = persistent_huge_pages(h);
> spin_unlock(&hugetlb_lock);
> + kfree(nodes_allowed);
> return ret;
> }
>
>
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab