From: Mel Gorman <mel@csn.ul.ie>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org,
akpm@linux-foundation.org, Randy Dunlap <randy.dunlap@oracle.com>,
Nishanth Aravamudan <nacc@us.ibm.com>,
David Rientjes <rientjes@google.com>, Adam Litke <agl@us.ibm.com>,
Andy Whitcroft <apw@canonical.com>,
eric.whitney@hp.com
Subject: Re: [PATCH 8/11] hugetlb: Optionally use mempolicy for persistent huge page allocation
Date: Wed, 16 Sep 2009 14:48:57 +0100
Message-ID: <20090916134857.GF1993@csn.ul.ie>
In-Reply-To: <20090915204510.4828.10825.sendpatchset@localhost.localdomain>
On Tue, Sep 15, 2009 at 04:45:10PM -0400, Lee Schermerhorn wrote:
> [PATCH 8/11] hugetlb: Optionally use mempolicy for persistent huge page allocation
>
> From: Mel Gorman <mel@csn.ul.ie>
>
> Against: 2.6.31-mmotm-090914-0157
>
> Patch "derive huge pages nodes allowed from task mempolicy" brought
> huge page support more in line with the core VM in that tuning the size
> of the static huge page pool would obey memory policies. Using this,
> administrators could interleave allocation of huge pages from a subset
> of nodes. This is consistent with how dynamic hugepage pool resizing
> works and how hugepages get allocated to applications at run-time.
>
> However, it was pointed out that scripts may exist that depend on being
> able to drain all hugepages via /proc/sys/vm/nr_hugepages from processes
> that are running within a memory policy. This patch adds
> /proc/sys/vm/nr_hugepages_mempolicy which, when written to, will obey
> memory policies. /proc/sys/vm/nr_hugepages then continues to be a
> system-wide tunable regardless of memory policy.
>
> Replicate the vm/nr_hugepages_mempolicy sysctl under the sysfs global
> hstate attributes directory.
>
> Note: with this patch, hugeadm will require an update to write to the
> vm/nr_hugepages_mempolicy sysctl/attribute when one wants to adjust
> the hugepage pool on a specific set of nodes.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
>
> Documentation/vm/hugetlbpage.txt | 36 ++++++++-------
> include/linux/hugetlb.h | 6 ++
> kernel/sysctl.c | 12 +++++
> mm/hugetlb.c | 91 ++++++++++++++++++++++++++++++++-------
> 4 files changed, 114 insertions(+), 31 deletions(-)
>
> Index: linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/include/linux/hugetlb.h 2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/include/linux/hugetlb.h 2009-09-15 13:48:11.000000000 -0400
> @@ -23,6 +23,12 @@ void reset_vma_resv_huge_pages(struct vm
> int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
> + void __user *, size_t *, loff_t *);
> +#endif
> +
> int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
> int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> struct page **, struct vm_area_struct **,
> Index: linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/kernel/sysctl.c 2009-09-15 13:23:01.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/kernel/sysctl.c 2009-09-15 13:43:36.000000000 -0400
> @@ -1170,6 +1170,18 @@ static struct ctl_table vm_table[] = {
> .extra1 = (void *)&hugetlb_zero,
> .extra2 = (void *)&hugetlb_infinity,
> },
> +#ifdef CONFIG_NUMA
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "nr_hugepages_mempolicy",
> + .data = NULL,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0644,
> + .proc_handler = &hugetlb_mempolicy_sysctl_handler,
> + .extra1 = (void *)&hugetlb_zero,
> + .extra2 = (void *)&hugetlb_infinity,
> + },
> +#endif
> {
> .ctl_name = VM_HUGETLB_GROUP,
> .procname = "hugetlb_shm_group",
> Index: linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/mm/hugetlb.c 2009-09-15 13:43:13.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/mm/hugetlb.c 2009-09-15 13:50:28.000000000 -0400
> @@ -1243,6 +1243,7 @@ static int adjust_pool_surplus(struct hs
> return ret;
> }
>
> +#define HUGETLB_NO_NODE_OBEY_MEMPOLICY (NUMA_NO_NODE - 1)
> #define persistent_huge_pages(h) (h->nr_huge_pages - h->surplus_huge_pages)
> static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> int nid)
As a pre-emptive note to David: I made a quick stab at getting rid of
HUGETLB_NO_NODE_OBEY_MEMPOLICY by allocating the nodemask higher up in
the chain. However, it was getting progressively more horrible looking,
and I decided that the current definition was easier to understand. I
may be biased, though.
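
For the curious, the alternative had roughly the shape sketched below.
To be clear, this is an illustrative sketch only, not code from any
posted patch; it assumes the alloc_nodemask_of_mempolicy() helper from
earlier in this series and invents an obey_mempolicy flag for the
common handler:

/*
 * Sketch only: pass a nodemask down instead of a sentinel nid.
 * set_max_huge_pages() would take the mask directly:
 *
 *	static unsigned long set_max_huge_pages(struct hstate *h,
 *			unsigned long count, nodemask_t *nodes_allowed);
 *
 * and every caller grows allocation and cleanup handling:
 */
static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
		struct ctl_table *table, int write,
		void __user *buffer, size_t *length, loff_t *ppos)
{
	struct hstate *h = &default_hstate;
	unsigned long tmp;

	if (!write)
		tmp = h->max_huge_pages;

	table->data = &tmp;
	table->maxlen = sizeof(unsigned long);
	proc_doulongvec_minmax(table, write, buffer, length, ppos);

	if (write) {
		/* fall back to all online nodes if no policy mask */
		nodemask_t *nodes_allowed = &node_online_map;

		if (obey_mempolicy) {
			nodes_allowed = alloc_nodemask_of_mempolicy();
			if (!nodes_allowed)
				nodes_allowed = &node_online_map;
		}

		h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);

		if (nodes_allowed != &node_online_map)
			kfree(nodes_allowed);
	}

	return 0;
}

Replicate that allocate/fallback/free dance across the global sysfs
store and the per-node stores as well and it stopped looking like an
improvement over the sentinel, at least to my eye.
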
> @@ -1253,9 +1254,14 @@ static unsigned long set_max_huge_pages(
> if (h->order >= MAX_ORDER)
> return h->max_huge_pages;
>
> - if (nid == NUMA_NO_NODE) {
> + switch (nid) {
> + case HUGETLB_NO_NODE_OBEY_MEMPOLICY:
> nodes_allowed = alloc_nodemask_of_mempolicy();
> - } else {
> + break;
> + case NUMA_NO_NODE:
> + nodes_allowed = &node_online_map;
> + break;
> + default:
> /*
> * incoming 'count' is for node 'nid' only, so
> * adjust count to global, but restrict alloc/free
> @@ -1354,23 +1360,24 @@ static struct hstate *kobj_to_hstate(str
>
> for (i = 0; i < HUGE_MAX_HSTATE; i++)
> if (hstate_kobjs[i] == kobj) {
> - if (nidp)
> - *nidp = NUMA_NO_NODE;
> +			/*
> +			 * Global hstate attr: leave *nidp at caller's default.
> +			 */
> return &hstates[i];
> }
>
> return kobj_to_node_hstate(kobj, nidp);
> }
>
> -static ssize_t nr_hugepages_show(struct kobject *kobj,
> +static ssize_t nr_hugepages_show_common(int nid_default, struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> struct hstate *h;
> unsigned long nr_huge_pages;
> - int nid;
> + int nid = nid_default;
>
> h = kobj_to_hstate(kobj, &nid);
> - if (nid == NUMA_NO_NODE)
> + if (nid < 0)
> nr_huge_pages = h->nr_huge_pages;
> else
> nr_huge_pages = h->nr_huge_pages_node[nid];
> @@ -1378,12 +1385,12 @@ static ssize_t nr_hugepages_show(struct
> return sprintf(buf, "%lu\n", nr_huge_pages);
> }
>
> -static ssize_t nr_hugepages_store(struct kobject *kobj,
> +static ssize_t nr_hugepages_store_common(int nid_default, struct kobject *kobj,
> struct kobj_attribute *attr, const char *buf, size_t len)
> {
> unsigned long count;
> struct hstate *h;
> - int nid;
> + int nid = nid_default;
> int err;
>
> err = strict_strtoul(buf, 10, &count);
> @@ -1395,8 +1402,42 @@ static ssize_t nr_hugepages_store(struct
>
> return len;
> }
> +
> +static ssize_t nr_hugepages_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return nr_hugepages_show_common(NUMA_NO_NODE, kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_store(struct kobject *kobj,
> + struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> + return nr_hugepages_store_common(NUMA_NO_NODE, kobj, attr, buf, len);
> +}
> HSTATE_ATTR(nr_hugepages);
>
> +#ifdef CONFIG_NUMA
> +
> +/*
> + * hstate attribute for optionally mempolicy-based constraint on persistent
> + * huge page alloc/free.
> + */
> +static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj,
> + struct kobj_attribute *attr, char *buf)
> +{
> + return nr_hugepages_show_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> + kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
> + struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> + return nr_hugepages_store_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> + kobj, attr, buf, len);
> +}
> +HSTATE_ATTR(nr_hugepages_mempolicy);
> +#endif
> +
> static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
> struct kobj_attribute *attr, char *buf)
> {
> @@ -1429,7 +1470,7 @@ static ssize_t free_hugepages_show(struc
> {
> struct hstate *h;
> unsigned long free_huge_pages;
> - int nid;
> + int nid = NUMA_NO_NODE;
>
> h = kobj_to_hstate(kobj, &nid);
> if (nid == NUMA_NO_NODE)
> @@ -1454,7 +1495,7 @@ static ssize_t surplus_hugepages_show(st
> {
> struct hstate *h;
> unsigned long surplus_huge_pages;
> - int nid;
> + int nid = NUMA_NO_NODE;
>
> h = kobj_to_hstate(kobj, &nid);
> if (nid == NUMA_NO_NODE)
> @@ -1472,6 +1513,9 @@ static struct attribute *hstate_attrs[]
> &free_hugepages_attr.attr,
> &resv_hugepages_attr.attr,
> &surplus_hugepages_attr.attr,
> +#ifdef CONFIG_NUMA
> + &nr_hugepages_mempolicy_attr.attr,
> +#endif
> NULL,
> };
>
> @@ -1809,9 +1853,9 @@ static unsigned int cpuset_mems_nr(unsig
> }
>
> #ifdef CONFIG_SYSCTL
> -int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> - void __user *buffer,
> - size_t *length, loff_t *ppos)
> +static int hugetlb_sysctl_handler_common(int no_node,
> + struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos)
> {
> struct hstate *h = &default_hstate;
> unsigned long tmp;
> @@ -1824,7 +1868,7 @@ int hugetlb_sysctl_handler(struct ctl_ta
> proc_doulongvec_minmax(table, write, buffer, length, ppos);
>
> if (write)
> - h->max_huge_pages = set_max_huge_pages(h, tmp, NUMA_NO_NODE);
> + h->max_huge_pages = set_max_huge_pages(h, tmp, no_node);
>
> return 0;
> }
> @@ -1864,6 +1908,23 @@ int hugetlb_overcommit_handler(struct ct
> return 0;
> }
>
> +int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +
> + return hugetlb_sysctl_handler_common(NUMA_NO_NODE,
> + table, write, buffer, length, ppos);
> +}
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos)
> +{
> + return hugetlb_sysctl_handler_common(HUGETLB_NO_NODE_OBEY_MEMPOLICY,
> + table, write, buffer, length, ppos);
> +}
> +#endif /* CONFIG_NUMA */
> +
> #endif /* CONFIG_SYSCTL */
>
> void hugetlb_report_meminfo(struct seq_file *m)
> Index: linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt
> ===================================================================
> --- linux-2.6.31-mmotm-090914-0157.orig/Documentation/vm/hugetlbpage.txt 2009-09-15 13:43:32.000000000 -0400
> +++ linux-2.6.31-mmotm-090914-0157/Documentation/vm/hugetlbpage.txt 2009-09-15 13:43:36.000000000 -0400
> @@ -155,6 +155,7 @@ will exist, of the form:
> Inside each of these directories, the same set of files will exist:
>
> nr_hugepages
> + nr_hugepages_mempolicy
> nr_overcommit_hugepages
> free_hugepages
> resv_hugepages
> @@ -166,26 +167,30 @@ which function as described above for th
> Interaction of Task Memory Policy with Huge Page Allocation/Freeing:
>
> Whether huge pages are allocated and freed via the /proc interface or
> -the /sysfs interface, the NUMA nodes from which huge pages are allocated
> -or freed are controlled by the NUMA memory policy of the task that modifies
> -the nr_hugepages parameter. [nr_overcommit_hugepages is a global limit.]
> +the /sysfs interface, using the nr_hugepages_mempolicy sysctl or attribute
> +respectively, the NUMA nodes from which huge pages are allocated or freed
> +are controlled by the NUMA memory policy of the task that modifies the
> +tunable. When the nr_hugepages sysctl or attribute is used instead, the
> +task's mempolicy is ignored.
>
> The recommended method to allocate or free huge pages to/from the kernel
> huge page pool, using the nr_hugepages example above, is:
>
> - numactl --interleave <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> + numactl --interleave <node-list> echo 20 \
> + >/proc/sys/vm/nr_hugepages_mempolicy
>
> or, more succinctly:
>
> - numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages
> + numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
>
> This will allocate or free abs(20 - nr_hugepages) to or from the nodes
> -specified in <node-list>, depending on whether nr_hugepages is initially
> -less than or greater than 20, respectively. No huge pages will be
> +specified in <node-list>, depending on whether the number of persistent huge
> +pages is initially less than or greater than 20, respectively. No huge pages will be
> allocated nor freed on any node not included in the specified <node-list>.
>
> -Any memory policy mode--bind, preferred, local or interleave--may be
> -used. The effect on persistent huge page allocation is as follows:
> +When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
> +memory policy mode--bind, preferred, local or interleave--may be used. The
> +resulting effect on persistent huge page allocation is as follows:
>
> 1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
> persistent huge pages will be distributed across the node or nodes
> @@ -201,27 +206,26 @@ used. The effect on persistent huge pag
> If more than one node is specified with the preferred policy, only the
> lowest numeric id will be used. Local policy will select the node where
> the task is running at the time the nodes_allowed mask is constructed.
> -
> -3) For local policy to be deterministic, the task must be bound to a cpu or
> + For local policy to be deterministic, the task must be bound to a cpu or
> cpus in a single node. Otherwise, the task could be migrated to some
> other node at any time after launch and the resulting node will be
> indeterminate. Thus, local policy is not very useful for this purpose.
> Any of the other mempolicy modes may be used to specify a single node.
>
> -4) The nodes allowed mask will be derived from any non-default task mempolicy,
> +3) The nodes allowed mask will be derived from any non-default task mempolicy,
> whether this policy was set explicitly by the task itself or one of its
> ancestors, such as numactl. This means that if the task is invoked from a
> shell with non-default policy, that policy will be used. One can specify a
> node list of "all" with numactl --interleave or --membind [-m] to achieve
> interleaving over all nodes in the system or cpuset.
>
> -5) Any task mempolicy specifed--e.g., using numactl--will be constrained by
> -4) Any task mempolicy specified--e.g., using numactl--will be constrained by
> the resource limits of any cpuset in which the task runs. Thus, there will
> be no way for a task with non-default policy running in a cpuset with a
> subset of the system nodes to allocate huge pages outside the cpuset
> without first moving to a cpuset that contains all of the desired nodes.
>
> -6) Boot-time huge page allocation attempts to distribute the requested number
> +5) Boot-time huge page allocation attempts to distribute the requested number
> of huge pages over all on-line nodes.
>
> Per Node Hugepages Attributes
> @@ -248,8 +252,8 @@ pages on the parent node will be adjuste
> resources exist, regardless of the task's mempolicy or cpuset constraints.
>
> Note that the number of overcommit and reserve pages remain global quantities,
> -as we don't know until fault time, when the faulting task's mempolicy is applied,
> -from which node the huge page allocation will be attempted.
> +as we don't know until fault time, when the faulting task's mempolicy is
> +applied, from which node the huge page allocation will be attempted.
>
>
> Using Huge Pages:
>
--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab