linux-numa.vger.kernel.org archive mirror
From: Mel Gorman <mel@csn.ul.ie>
To: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: linux-mm@kvack.org, linux-numa@vger.kernel.org,
	akpm@linux-foundation.org, clameter@sgi.com,
	Randy Dunlap <randy.dunlap@oracle.com>,
	Nishanth Aravamudan <nacc@us.ibm.com>,
	David Rientjes <rientjes@google.com>, Adam Litke <agl@us.ibm.com>,
	Andy Whitcroft <apw@canonical.com>,
	eric.whitney@hp.com
Subject: Re: [PATCH 4/10] hugetlb:  derive huge pages nodes allowed from task mempolicy
Date: Fri, 2 Oct 2009 11:11:05 +0100
Message-ID: <20091002101105.GM21906@csn.ul.ie>
In-Reply-To: <20091001165832.32248.32725.sendpatchset@localhost.localdomain>

On Thu, Oct 01, 2009 at 12:58:32PM -0400, Lee Schermerhorn wrote:
> [PATCH 4/10] hugetlb:  derive huge pages nodes allowed from task mempolicy
> 
> Against:  2.6.31-mmotm-090925-1435
> 
> V2: + cleaned up comments, removed some deemed unnecessary,
>       add some suggested by review
>     + removed check for !current in huge_mpol_nodes_allowed().
>     + added 'current->comm' to warning message in huge_mpol_nodes_allowed().
>     + added VM_BUG_ON() assertion in hugetlb.c next_node_allowed() to
>       catch out of range node id.
>     + add examples to patch description
> 
> V3: Factored this patch from V2 patch 2/3
> 
> V4: added back missing "kfree(nodes_allowed)" in set_max_nr_hugepages()
> 
> V5: remove internal '\n' from printk in huge_mpol_nodes_allowed()
> 
> V6: + rename 'huge_mpol_nodes_allowed()' to 'alloc_nodemask_of_mempolicy()'
>     + move the printk() when we can't kmalloc() a nodemask_t to
>       set_max_huge_pages(), as alloc_nodemask_of_mempolicy() is no longer
>       hugepage specific.
>     + handle movement of nodes_allowed initialization:
>     ++ Don't kfree() nodes_allowed when it points at node_online_map.
> 
> V7: + drop mpol-get/put from alloc_nodemask_of_mempolicy().  Not needed
>       here because the current task is examining its own mempolicy.  Add
>       comment to that effect.
>     + use init_nodemask_of_node() to initialize the nodes_allowed for
>       single node policies [preferred/local].
> 
> V8:  + fold in subsequent patches to:
>        1) define a new sysctl and hugepages sysfs attribute
>           nr_hugepages_mempolicy which will modify the huge page pool
>           under the current task's mempolicy.  Modifications via the
>           existing nr_hugepages will continue to ignore mempolicy.
>           NOTE:  This part comes from a patch from Mel Gorman.
>        2) reorganize sysctl and sysfs attribute handlers to create
>           and pass nodes_allowed mask to set_max_huge_pages().
> 
> This patch derives a "nodes_allowed" node mask from the numa
> mempolicy of the task modifying the number of persistent huge
> pages to control the allocation, freeing and adjusting of surplus
> huge pages when the pool page count is modified via the new sysctl
> or sysfs attribute "nr_hugepages_mempolicy".  The nodes_allowed
> mask is derived as follows:
> 
> * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
>   is produced.  This will cause the hugetlb subsystem to use
>   node_online_map as the "nodes_allowed".  This preserves the
>   behavior before this patch.
> * For "preferred" mempolicy, including explicit local allocation,
>   a nodemask with the single preferred node will be produced.
>   "local" policy will NOT track any internode migrations of the
>   task adjusting nr_hugepages.
> * For "bind" and "interleave" policy, the mempolicy's nodemask
>   will be used.
> * Other than to inform the construction of the nodes_allowed node
>   mask, the actual mempolicy mode is ignored.  That is, all modes
>   behave like interleave over the resulting nodes_allowed mask
>   with no "fallback".
> 
> See the updated documentation [next patch] for more information
> about the implications of this patch.
> 
> Examples:
> 
> Starting with:
> 
> 	Node 0 HugePages_Total:     0
> 	Node 1 HugePages_Total:     0
> 	Node 2 HugePages_Total:     0
> 	Node 3 HugePages_Total:     0
> 
> Default behavior [with or without this patch] balances persistent
> hugepage allocation across nodes [with sufficient contiguous memory]:
> 
> 	sysctl vm.nr_hugepages[_mempolicy]=32
> 
> yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:     8
> 	Node 3 HugePages_Total:     8
> 
> Of course, we only have nr_hugepages_mempolicy with the patch,
> but with default mempolicy, nr_hugepages_mempolicy behaves the
> same as nr_hugepages.
> 
> Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
> '--membind' because it allows multiple nodes to be specified
> and it's easy to type]--we can allocate huge pages on
> individual nodes or sets of nodes.  So, starting from the
> condition above, with 8 huge pages per node, add 8 more to
> node 2 using:
> 
> 	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
> 
> This yields:
> 
> 	Node 0 HugePages_Total:     8
> 	Node 1 HugePages_Total:     8
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The incremental 8 huge pages were restricted to node 2 by the
> specified mempolicy.
> 
> Similarly, we can use mempolicy to free persistent huge pages
> from specified nodes:
> 
> 	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
> 
> yields:
> 
> 	Node 0 HugePages_Total:     4
> 	Node 1 HugePages_Total:     4
> 	Node 2 HugePages_Total:    16
> 	Node 3 HugePages_Total:     8
> 
> The 8 huge pages freed were balanced over nodes 0 and 1.
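The distribution in these examples can be sketched with a toy userspace model
(again, the names are mine; this only illustrates the round-robin interleave
over nodes_allowed described above, ignoring page availability and the actual
alloc/free machinery):

```c
#include <assert.h>

#define MAX_NODES 4

/*
 * Toy model: each page of the delta (positive = allocate, negative =
 * free) goes to the next allowed node in round-robin order, with no
 * fallback outside the mask.
 */
static void adjust_pool(long pages[MAX_NODES], unsigned nodes_allowed,
			long delta)
{
	long step = delta > 0 ? 1 : -1;
	int nid = 0;

	assert(nodes_allowed);		/* empty mask would never terminate */
	while (delta != 0) {
		if (nodes_allowed & (1u << nid)) {
			pages[nid] += step;
			delta -= step;
		}
		nid = (nid + 1) % MAX_NODES;
	}
}
```

Starting from 8 pages per node, adding 8 with mask {2} yields 8/8/16/8, and
then freeing 8 with mask {0,1} yields 4/4/16/8, matching the numactl examples.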
> 
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 
>  include/linux/hugetlb.h   |    6 ++
>  include/linux/mempolicy.h |    3 +
>  kernel/sysctl.c           |   16 ++++++-
>  mm/hugetlb.c              |   97 +++++++++++++++++++++++++++++++++++++++-------
>  mm/mempolicy.c            |   47 ++++++++++++++++++++++
>  5 files changed, 154 insertions(+), 15 deletions(-)
> 
> Index: linux-2.6.31-mmotm-090925-1435/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.31-mmotm-090925-1435.orig/mm/mempolicy.c	2009-09-30 12:48:45.000000000 -0400
> +++ linux-2.6.31-mmotm-090925-1435/mm/mempolicy.c	2009-09-30 12:48:46.000000000 -0400
> @@ -1564,6 +1564,53 @@ struct zonelist *huge_zonelist(struct vm
>  	}
>  	return zl;
>  }
> +
> +/*
> + * init_nodemask_of_mempolicy
> + *
> + * If the current task's mempolicy is "default" [NULL], return 'false'
> + * to indicate default policy.  Otherwise, extract the policy nodemask
> + * for 'bind' or 'interleave' policy into the argument nodemask, or
> + * initialize the argument nodemask to contain the single node for
> + * 'preferred' or 'local' policy and return 'true' to indicate presence
> + * of non-default mempolicy.
> + *
> + * We don't bother with reference counting the mempolicy [mpol_get/put]
> + * because the current task is examining its own mempolicy and a task's
> + * mempolicy is only ever changed by the task itself.
> + *
> + * N.B., it is the caller's responsibility to free a returned nodemask.
> + */
> +bool init_nodemask_of_mempolicy(nodemask_t *mask)
> +{
> +	struct mempolicy *mempolicy;
> +	int nid;
> +
> +	if (!current->mempolicy)
> +		return false;
> +
> +	mempolicy = current->mempolicy;
> +	switch (mempolicy->mode) {
> +	case MPOL_PREFERRED:
> +		if (mempolicy->flags & MPOL_F_LOCAL)
> +			nid = numa_node_id();
> +		else
> +			nid = mempolicy->v.preferred_node;
> +		init_nodemask_of_node(mask, nid);
> +		break;
> +
> +	case MPOL_BIND:
> +		/* Fall through */
> +	case MPOL_INTERLEAVE:
> +	*mask = mempolicy->v.nodes;
> +		break;
> +
> +	default:
> +		BUG();
> +	}
> +
> +	return true;
> +}
>  #endif
>  
>  /* Allocate a page in interleaved policy.
> Index: linux-2.6.31-mmotm-090925-1435/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.31-mmotm-090925-1435.orig/include/linux/mempolicy.h	2009-09-30 12:48:45.000000000 -0400
> +++ linux-2.6.31-mmotm-090925-1435/include/linux/mempolicy.h	2009-09-30 12:48:46.000000000 -0400
> @@ -201,6 +201,7 @@ extern void mpol_fix_fork_child_flag(str
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
>  				unsigned long addr, gfp_t gfp_flags,
>  				struct mempolicy **mpol, nodemask_t **nodemask);
> +extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -328,6 +329,8 @@ static inline struct zonelist *huge_zone
>  	return node_zonelist(0, gfp_flags);
>  }
>  
> +static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +
>  static inline int do_migrate_pages(struct mm_struct *mm,
>  			const nodemask_t *from_nodes,
>  			const nodemask_t *to_nodes, int flags)
> Index: linux-2.6.31-mmotm-090925-1435/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.31-mmotm-090925-1435.orig/mm/hugetlb.c	2009-09-30 12:48:45.000000000 -0400
> +++ linux-2.6.31-mmotm-090925-1435/mm/hugetlb.c	2009-10-01 12:13:25.000000000 -0400
> @@ -1334,29 +1334,71 @@ static struct hstate *kobj_to_hstate(str
>  	return NULL;
>  }
>  
> -static ssize_t nr_hugepages_show(struct kobject *kobj,
> +static ssize_t nr_hugepages_show_common(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
>  	struct hstate *h = kobj_to_hstate(kobj);
>  	return sprintf(buf, "%lu\n", h->nr_huge_pages);
>  }
> -static ssize_t nr_hugepages_store(struct kobject *kobj,
> -		struct kobj_attribute *attr, const char *buf, size_t count)
> +static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> +			struct kobject *kobj, struct kobj_attribute *attr,
> +			const char *buf, size_t len)
>  {
>  	int err;
> -	unsigned long input;
> +	unsigned long count;
>  	struct hstate *h = kobj_to_hstate(kobj);
> +	NODEMASK_ALLOC(nodemask, nodes_allowed);
>  
> -	err = strict_strtoul(buf, 10, &input);
> +	err = strict_strtoul(buf, 10, &count);
>  	if (err)
>  		return 0;
>  
> -	h->max_huge_pages = set_max_huge_pages(h, input, &node_online_map);
> +	if (!(obey_mempolicy && init_nodemask_of_mempolicy(nodes_allowed))) {
> +		NODEMASK_FREE(nodes_allowed);
> +		nodes_allowed = &node_states[N_HIGH_MEMORY];
> +	}
> +	h->max_huge_pages = set_max_huge_pages(h, count, &node_online_map);
>  

Should that node_online_map not have changed to nodes_allowed?
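i.e., presumably the intended hunk is

```diff
-	h->max_huge_pages = set_max_huge_pages(h, count, &node_online_map);
+	h->max_huge_pages = set_max_huge_pages(h, count, nodes_allowed);
```

matching the sysctl path later in the patch.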

> -	return count;
> +	if (nodes_allowed != &node_states[N_HIGH_MEMORY])
> +		NODEMASK_FREE(nodes_allowed);
> +
> +	return len;
> +}
> +
> +static ssize_t nr_hugepages_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(false, kobj, attr, buf, len);
>  }
>  HSTATE_ATTR(nr_hugepages);
>  
> +#ifdef CONFIG_NUMA
> +
> +/*
> + * hstate attribute for optionally mempolicy-based constraint on persistent
> + * huge page alloc/free.
> + */
> +static ssize_t nr_hugepages_mempolicy_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return nr_hugepages_show_common(kobj, attr, buf);
> +}
> +
> +static ssize_t nr_hugepages_mempolicy_store(struct kobject *kobj,
> +	       struct kobj_attribute *attr, const char *buf, size_t len)
> +{
> +	return nr_hugepages_store_common(true, kobj, attr, buf, len);
> +}
> +HSTATE_ATTR(nr_hugepages_mempolicy);
> +#endif
> +
> +
>  static ssize_t nr_overcommit_hugepages_show(struct kobject *kobj,
>  					struct kobj_attribute *attr, char *buf)
>  {
> @@ -1412,6 +1454,9 @@ static struct attribute *hstate_attrs[]
>  	&free_hugepages_attr.attr,
>  	&resv_hugepages_attr.attr,
>  	&surplus_hugepages_attr.attr,
> +#ifdef CONFIG_NUMA
> +	&nr_hugepages_mempolicy_attr.attr,
> +#endif
>  	NULL,
>  };
>  
> @@ -1578,9 +1623,9 @@ static unsigned int cpuset_mems_nr(unsig
>  }
>  
>  #ifdef CONFIG_SYSCTL
> -int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> -			   void __user *buffer,
> -			   size_t *length, loff_t *ppos)
> +static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> +			 struct ctl_table *table, int write,
> +			 void __user *buffer, size_t *length, loff_t *ppos)
>  {
>  	struct hstate *h = &default_hstate;
>  	unsigned long tmp;
> @@ -1592,13 +1637,39 @@ int hugetlb_sysctl_handler(struct ctl_ta
>  	table->maxlen = sizeof(unsigned long);
>  	proc_doulongvec_minmax(table, write, buffer, length, ppos);
>  
> -	if (write)
> -		h->max_huge_pages = set_max_huge_pages(h, tmp,
> -							&node_online_map);
> +	if (write) {
> +		NODEMASK_ALLOC(nodemask, nodes_allowed);
> +		if (!(obey_mempolicy &&
> +			       init_nodemask_of_mempolicy(nodes_allowed))) {
> +			NODEMASK_FREE(nodes_allowed);
> +			nodes_allowed = &node_states[N_HIGH_MEMORY];
> +		}
> +		h->max_huge_pages = set_max_huge_pages(h, tmp, nodes_allowed);
> +
> +		if (nodes_allowed != &node_states[N_HIGH_MEMORY])
> +			NODEMASK_FREE(nodes_allowed);
> +	}
>  
>  	return 0;
>  }
>  
> +int hugetlb_sysctl_handler(struct ctl_table *table, int write,
> +			  void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +
> +	return hugetlb_sysctl_handler_common(false, table, write,
> +							buffer, length, ppos);
> +}
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *table, int write,
> +			  void __user *buffer, size_t *length, loff_t *ppos)
> +{
> +	return hugetlb_sysctl_handler_common(true, table, write,
> +							buffer, length, ppos);
> +}
> +#endif /* CONFIG_NUMA */
> +
>  int hugetlb_treat_movable_handler(struct ctl_table *table, int write,
>  			void __user *buffer,
>  			size_t *length, loff_t *ppos)
> Index: linux-2.6.31-mmotm-090925-1435/include/linux/hugetlb.h
> ===================================================================
> --- linux-2.6.31-mmotm-090925-1435.orig/include/linux/hugetlb.h	2009-09-30 12:48:45.000000000 -0400
> +++ linux-2.6.31-mmotm-090925-1435/include/linux/hugetlb.h	2009-09-30 12:48:46.000000000 -0400
> @@ -23,6 +23,12 @@ void reset_vma_resv_huge_pages(struct vm
>  int hugetlb_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_overcommit_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
>  int hugetlb_treat_movable_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *);
> +
> +#ifdef CONFIG_NUMA
> +int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
> +					void __user *, size_t *, loff_t *);
> +#endif
> +
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
>  int follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  			struct page **, struct vm_area_struct **,
> Index: linux-2.6.31-mmotm-090925-1435/kernel/sysctl.c
> ===================================================================
> --- linux-2.6.31-mmotm-090925-1435.orig/kernel/sysctl.c	2009-09-30 12:48:45.000000000 -0400
> +++ linux-2.6.31-mmotm-090925-1435/kernel/sysctl.c	2009-09-30 12:48:46.000000000 -0400
> @@ -1164,7 +1164,7 @@ static struct ctl_table vm_table[] = {
>  		.extra2		= &one_hundred,
>  	},
>  #ifdef CONFIG_HUGETLB_PAGE
> -	 {
> +	{
>  		.procname	= "nr_hugepages",
>  		.data		= NULL,
>  		.maxlen		= sizeof(unsigned long),
> @@ -1172,7 +1172,19 @@ static struct ctl_table vm_table[] = {
>  		.proc_handler	= &hugetlb_sysctl_handler,
>  		.extra1		= (void *)&hugetlb_zero,
>  		.extra2		= (void *)&hugetlb_infinity,
> -	 },
> +	},
> +#ifdef CONFIG_NUMA
> +	{
> +	       .ctl_name       = CTL_UNNUMBERED,
> +	       .procname       = "nr_hugepages_mempolicy",
> +	       .data           = NULL,
> +	       .maxlen         = sizeof(unsigned long),
> +	       .mode           = 0644,
> +	       .proc_handler   = &hugetlb_mempolicy_sysctl_handler,
> +	       .extra1	 = (void *)&hugetlb_zero,
> +	       .extra2	 = (void *)&hugetlb_infinity,
> +	},
> +#endif
>  	 {
>  		.ctl_name	= VM_HUGETLB_GROUP,
>  		.procname	= "hugetlb_shm_group",
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

