* [PATCH/RFC 1/5] Mem Policy: fix reference counting
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
@ 2007-08-30 18:51 ` Lee Schermerhorn
2007-09-11 18:48 ` Mel Gorman
2007-08-30 18:51 ` [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy Lee Schermerhorn
` (5 subsequent siblings)
6 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 18:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, ak, mtk-manpages, clameter, solo, Lee Schermerhorn,
eric.whitney
PATCH/RFC 1/5 Memory Policy: fix reference counting
Against 2.6.23-rc3-mm1
This patch proposes fixes to the reference counting of memory policy
in the page allocation paths and in show_numa_map().
Shared policy lookup [shmem] has always added a reference to the
policy, but this was never unrefed after page allocation or after
formatting the numa map data.
Default system policy should not require additional ref counting,
nor should the current task's task policy. However, show_numa_map()
calls get_vma_policy() to examine what may be [likely is] another
task's policy. The latter case needs protection against freeing
of the policy.
This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another
task's mempolicy. Again, shared policy is already reference
counted on lookup. A matching "unref" [__mpol_free()] is performed
in alloc_page_vma() for shared and vma policies, and in
show_numa_map() for shared and another task's mempolicy.
We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL
policy.
Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a
shared or vma 'BIND policy. In this case, we should hold the
reference until after the huge page allocation in
dequeue_huge_page(). The patch modifies huge_zonelist() to
return a pointer to the mempolicy if it needs to be unref'd
after allocation.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mempolicy.h | 4 +-
mm/hugetlb.c | 4 +-
mm/mempolicy.c | 79 ++++++++++++++++++++++++++++++++++++++++------
3 files changed, 75 insertions(+), 12 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-29 10:05:19.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-29 13:31:42.000000000 -0400
@@ -1083,21 +1083,37 @@ asmlinkage long compat_sys_mbind(compat_
#endif
-/* Return effective policy for a VMA */
+/*
+ * get_vma_policy(@task, @vma, @addr)
+ * @task - task for fallback if vma policy == default
+ * @vma - virtual memory area whose policy is sought
+ * @addr - address in @vma for shared policy lookup
+ *
+ * Returns effective policy for a VMA at specified address.
+ * Falls back to @task or system default policy, as necessary.
+ * Returned policy has extra reference count if shared, vma,
+ * or some other task's policy [show_numa_maps() can pass
+ * @task != current]. It is the caller's responsibility to
+ * free the reference in these cases.
+ */
static struct mempolicy * get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
+ int shared_pol = 0;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
pol = vma->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy &&
+ shared_pol = 1; /* if pol non-NULL, that is */
+ } else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
+ else if (!shared_pol && pol != current->mempolicy)
+ mpol_get(pol); /* vma or other task's policy */
return pol;
}
@@ -1213,19 +1229,45 @@ static inline unsigned interleave_nid(st
}
#ifdef CONFIG_HUGETLBFS
-/* Return a zonelist suitable for a huge page allocation. */
+/*
+ * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * @vma = virtual memory area whose policy is sought
+ * @addr = address in @vma for shared policy lookup and interleave policy
+ * @gfp_flags = for requested zone
+ * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ *
+ * Returns a zonelist suitable for a huge page allocation.
+ * If the effective policy is 'BIND, returns pointer to policy's zonelist.
+ * If it is also a policy for which get_vma_policy() returns an extra
+ * reference, we must hold that reference until after allocation.
+ * In that case, return policy via @mpol so hugetlb allocation can drop
+ * the reference. For non-'BIND referenced policies, we can/do drop the
+ * reference here, so the caller doesn't need to know about the special case
+ * for default and current task policy.
+ */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, struct mempolicy **mpol)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
+ *mpol = NULL; /* probably no unref needed */
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ __mpol_free(pol);
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
- return zonelist_policy(GFP_HIGHUSER, pol);
+
+ zl = zonelist_policy(GFP_HIGHUSER, pol);
+ if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
+ if (pol->policy != MPOL_BIND)
+ __mpol_free(pol); /* finished with pol */
+ else
+ *mpol = pol; /* unref needed after allocation */
+ }
+ return zl;
}
#endif
@@ -1270,6 +1312,7 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
cpuset_update_task_memory_state();
@@ -1279,7 +1322,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ zl = zonelist_policy(gfp, pol);
+ if (pol != &default_policy && pol != current->mempolicy) {
+ /*
+ * slow path: ref counted policy -- shared or vma
+ */
+ struct page *page = __alloc_pages(gfp, 0, zl);
+ __mpol_free(pol);
+ return page;
+ }
+ /*
+ * fast path: default or task policy
+ */
+ return __alloc_pages(gfp, 0, zl);
}
/**
@@ -1878,6 +1933,7 @@ int show_numa_map(struct seq_file *m, vo
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mempolicy *pol;
int n;
char buffer[50];
@@ -1888,8 +1944,13 @@ int show_numa_map(struct seq_file *m, vo
if (!md)
return 0;
- mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol);
+ /*
+ * unref shared or other task's mempolicy
+ */
+ if (pol != &default_policy && pol != current->mempolicy)
+ __mpol_free(pol);
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-08-29 10:56:14.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-08-29 13:32:05.000000000 -0400
@@ -150,7 +150,7 @@ extern void mpol_fix_fork_child_flag(str
extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags);
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
extern unsigned slab_node(struct mempolicy *policy);
extern enum zone_type policy_zone;
@@ -238,7 +238,7 @@ static inline void mpol_fix_fork_child_f
}
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags)
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
{
return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
}
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c 2007-08-29 10:56:13.000000000 -0400
+++ Linux/mm/hugetlb.c 2007-08-29 13:19:39.000000000 -0400
@@ -71,8 +71,9 @@ static struct page *dequeue_huge_page(st
{
int nid;
struct page *page = NULL;
+ struct mempolicy *mpol;
struct zonelist *zonelist = huge_zonelist(vma, address,
- htlb_alloc_mask);
+ htlb_alloc_mask, &mpol);
struct zone **z;
for (z = zonelist->zones; *z; z++) {
@@ -87,6 +88,7 @@ static struct page *dequeue_huge_page(st
break;
}
}
+ mpol_free(mpol); /* maybe need unref */
return page;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
^ permalink raw reply [flat|nested] 76+ messages in thread

* Re: [PATCH/RFC 1/5] Mem Policy: fix reference counting
2007-08-30 18:51 ` [PATCH/RFC 1/5] Mem Policy: fix reference counting Lee Schermerhorn
@ 2007-09-11 18:48 ` Mel Gorman
2007-09-11 18:12 ` Lee Schermerhorn
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-11 18:48 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
You know this stuff better than I do. Take suggestions here with a large
grain of salt.
On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> PATCHRFC 1/5 Memory Policy: fix reference counting
>
> Against 2.6.23-rc3-mm1
>
> This patch proposes fixes to the reference counting of memory policy
> in the page allocation paths and in show_numa_map().
>
> Shared policy lookup [shmem] has always added a reference to the
> policy, but this was never unrefed after page allocation or after
> formatting the numa map data.
>
> Default system policy should not require additional ref counting,
> nor should the current task's task policy. However, show_numa_map()
> calls get_vma_policy() to examine what may be [likely is] another
> task's policy. The latter case needs protection against freeing
> of the policy.
>
> This patch adds a reference count to a mempolicy returned by
> get_vma_policy() when the policy is a vma policy or another
> task's mempolicy. Again, shared policy is already reference
> counted on lookup. A matching "unref" [__mpol_free()] is performed
> in alloc_page_vma() for shared and vma policies, and in
> show_numa_map() for shared and another task's mempolicy.
> We can call __mpol_free() directly, saving an admittedly
> inexpensive inline NULL test, because we know we have a non-NULL
> policy.
>
> Handling policy ref counts for hugepages is a bit trickier.
> huge_zonelist() returns a zone list that might come from a
> shared or vma 'BIND policy. In this case, we should hold the
> reference until after the huge page allocation in
> dequeue_huge_page(). The patch modifies huge_zonelist() to
> return a pointer to the mempolicy if it needs to be unref'd
> after allocation.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> include/linux/mempolicy.h | 4 +-
> mm/hugetlb.c | 4 +-
> mm/mempolicy.c | 79 ++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 75 insertions(+), 12 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-08-29 10:05:19.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-08-29 13:31:42.000000000 -0400
> @@ -1083,21 +1083,37 @@ asmlinkage long compat_sys_mbind(compat_
>
> #endif
>
> -/* Return effective policy for a VMA */
> +/*
> + * get_vma_policy(@task, @vma, @addr)
> + * @task - task for fallback if vma policy == default
> + * @vma - virtual memory area whose policy is sought
> + * @addr - address in @vma for shared policy lookup
> + *
> + * Returns effective policy for a VMA at specified address.
> + * Falls back to @task or system default policy, as necessary.
> + * Returned policy has extra reference count if shared, vma,
> + * or some other task's policy [show_numa_maps() can pass
> + * @task != current]. It is the caller's responsibility to
> + * free the reference in these cases.
> + */
> static struct mempolicy * get_vma_policy(struct task_struct *task,
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = task->mempolicy;
> + int shared_pol = 0;
>
> if (vma) {
> - if (vma->vm_ops && vma->vm_ops->get_policy)
> + if (vma->vm_ops && vma->vm_ops->get_policy) {
> pol = vma->vm_ops->get_policy(vma, addr);
> - else if (vma->vm_policy &&
> + shared_pol = 1; /* if pol non-NULL, that is */
What do you mean here by "pol non-NULL, that is". Where do you check
that vm_ops->get_policy() returned a non-NULL value?
Should the comment be
/* Policy if set is shared, check later */
and rename the variable to check_shared_pol?
> + } else if (vma->vm_policy &&
> vma->vm_policy->policy != MPOL_DEFAULT)
> pol = vma->vm_policy;
> }
> if (!pol)
> pol = &default_policy;
> + else if (!shared_pol && pol != current->mempolicy)
> + mpol_get(pol); /* vma or other task's policy */
> return pol;
> }
>
> @@ -1213,19 +1229,45 @@ static inline unsigned interleave_nid(st
> }
>
> #ifdef CONFIG_HUGETLBFS
> -/* Return a zonelist suitable for a huge page allocation. */
> +/*
> + * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
> + * @vma = virtual memory area whose policy is sought
> + * @addr = address in @vma for shared policy lookup and interleave policy
> + * @gfp_flags = for requested zone
> + * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
> + *
> + * Returns a zonelist suitable for a huge page allocation.
> + * If the effective policy is 'BIND, returns pointer to policy's zonelist.
This comment here becomes redundant if applied on top of one-zonelist as
you suggest you will be doing later. The zonelist returned for MPOL_BIND
is the node's zonelist, but it is filtered based on a nodemask.
> + * If it is also a policy for which get_vma_policy() returns an extra
> + * reference, we must hold that reference until after allocation.
> + * In that case, return policy via @mpol so hugetlb allocation can drop
> + * the reference. For non-'BIND referenced policies, we can/do drop the
> + * reference here, so the caller doesn't need to know about the special case
> + * for default and current task policy.
> + */
> struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> - gfp_t gfp_flags)
> + gfp_t gfp_flags, struct mempolicy **mpol)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct zonelist *zl;
>
> + *mpol = NULL; /* probably no unref needed */
> if (pol->policy == MPOL_INTERLEAVE) {
> unsigned nid;
>
> nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> + __mpol_free(pol);
So, __mpol_free() here acts as a put on the get_vma_policy() right?
Either that needs commenting or __mpol_free() needs to be renamed to
__mpol_put() assuming that when the count reaches 0, it really gets
free.
> return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
> }
> - return zonelist_policy(GFP_HIGHUSER, pol);
> +
> + zl = zonelist_policy(GFP_HIGHUSER, pol);
> + if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> + if (pol->policy != MPOL_BIND)
> + __mpol_free(pol); /* finished with pol */
> + else
> + *mpol = pol; /* unref needed after allocation */
> + }
> + return zl;
> }
> #endif
>
> @@ -1270,6 +1312,7 @@ struct page *
> alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct zonelist *zl;
>
> cpuset_update_task_memory_state();
>
> @@ -1279,7 +1322,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
> nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
> return alloc_page_interleave(gfp, 0, nid);
> }
> - return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> + zl = zonelist_policy(gfp, pol);
> + if (pol != &default_policy && pol != current->mempolicy) {
> + /*
> + * slow path: ref counted policy -- shared or vma
> + */
> + struct page *page = __alloc_pages(gfp, 0, zl);
> + __mpol_free(pol);
> + return page;
> + }
> + /*
> + * fast path: default or task policy
> + */
> + return __alloc_pages(gfp, 0, zl);
> }
>
> /**
> @@ -1878,6 +1933,7 @@ int show_numa_map(struct seq_file *m, vo
> struct numa_maps *md;
> struct file *file = vma->vm_file;
> struct mm_struct *mm = vma->vm_mm;
> + struct mempolicy *pol;
> int n;
> char buffer[50];
>
> @@ -1888,8 +1944,13 @@ int show_numa_map(struct seq_file *m, vo
> if (!md)
> return 0;
>
> - mpol_to_str(buffer, sizeof(buffer),
> - get_vma_policy(priv->task, vma, vma->vm_start));
> + pol = get_vma_policy(priv->task, vma, vma->vm_start);
> + mpol_to_str(buffer, sizeof(buffer), pol);
> + /*
> + * unref shared or other task's mempolicy
> + */
> + if (pol != &default_policy && pol != current->mempolicy)
> + __mpol_free(pol);
>
> seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>
> Index: Linux/include/linux/mempolicy.h
> ===================================================================
> --- Linux.orig/include/linux/mempolicy.h 2007-08-29 10:56:14.000000000 -0400
> +++ Linux/include/linux/mempolicy.h 2007-08-29 13:32:05.000000000 -0400
> @@ -150,7 +150,7 @@ extern void mpol_fix_fork_child_flag(str
>
> extern struct mempolicy default_policy;
> extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> - unsigned long addr, gfp_t gfp_flags);
> + unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -238,7 +238,7 @@ static inline void mpol_fix_fork_child_f
> }
>
> static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> - unsigned long addr, gfp_t gfp_flags)
> + unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
> {
> return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> }
> Index: Linux/mm/hugetlb.c
> ===================================================================
> --- Linux.orig/mm/hugetlb.c 2007-08-29 10:56:13.000000000 -0400
> +++ Linux/mm/hugetlb.c 2007-08-29 13:19:39.000000000 -0400
> @@ -71,8 +71,9 @@ static struct page *dequeue_huge_page(st
> {
> int nid;
> struct page *page = NULL;
> + struct mempolicy *mpol;
> struct zonelist *zonelist = huge_zonelist(vma, address,
> - htlb_alloc_mask);
> + htlb_alloc_mask, &mpol);
> struct zone **z;
>
> for (z = zonelist->zones; *z; z++) {
> @@ -87,6 +88,7 @@ static struct page *dequeue_huge_page(st
> break;
> }
> }
> + mpol_free(mpol); /* maybe need unref */
> return page;
> }
^ permalink raw reply [flat|nested] 76+ messages in thread

* Re: [PATCH/RFC 1/5] Mem Policy: fix reference counting
2007-09-11 18:48 ` Mel Gorman
@ 2007-09-11 18:12 ` Lee Schermerhorn
2007-09-13 9:45 ` Mel Gorman
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 18:12 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 19:48 +0100, Mel Gorman wrote:
> You know this stuff better than I do. Take suggestions here with a large
> grain of salt.
Your comments are on the mark. See responses below.
>
> On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
<patch description snipped>
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-08-29 10:05:19.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-08-29 13:31:42.000000000 -0400
> > @@ -1083,21 +1083,37 @@ asmlinkage long compat_sys_mbind(compat_
> >
> > #endif
> >
> > -/* Return effective policy for a VMA */
> > +/*
> > + * get_vma_policy(@task, @vma, @addr)
> > + * @task - task for fallback if vma policy == default
> > + * @vma - virtual memory area whose policy is sought
> > + * @addr - address in @vma for shared policy lookup
> > + *
> > + * Returns effective policy for a VMA at specified address.
> > + * Falls back to @task or system default policy, as necessary.
> > + * Returned policy has extra reference count if shared, vma,
> > + * or some other task's policy [show_numa_maps() can pass
> > + * @task != current]. It is the caller's responsibility to
> > + * free the reference in these cases.
> > + */
> > static struct mempolicy * get_vma_policy(struct task_struct *task,
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > struct mempolicy *pol = task->mempolicy;
> > + int shared_pol = 0;
> >
> > if (vma) {
> > - if (vma->vm_ops && vma->vm_ops->get_policy)
> > + if (vma->vm_ops && vma->vm_ops->get_policy) {
> > pol = vma->vm_ops->get_policy(vma, addr);
> > - else if (vma->vm_policy &&
> > + shared_pol = 1; /* if pol non-NULL, that is */
>
> What do you mean here by "pol non-NULL, that is". Where do you check
> that vm_ops->get_policy() returned a non-NULL value?
>
> Should the comment be
>
> /* Policy if set is shared, check later */
>
> and rename the variable to check_shared_pol?
You interpret my cryptic comment correctly. However, your suggested fix
doesn't quite capture my way of looking at it. Would it work for you if
I change it to: /* if pol non-NULL, add ref below */ ??? That fits in
80 columns ;-)!
>
> > + } else if (vma->vm_policy &&
> > vma->vm_policy->policy != MPOL_DEFAULT)
> > pol = vma->vm_policy;
> > }
> > if (!pol)
> > pol = &default_policy;
> > + else if (!shared_pol && pol != current->mempolicy)
> > + mpol_get(pol); /* vma or other task's policy */
> > return pol;
> > }
> >
> > @@ -1213,19 +1229,45 @@ static inline unsigned interleave_nid(st
> > }
> >
> > #ifdef CONFIG_HUGETLBFS
> > -/* Return a zonelist suitable for a huge page allocation. */
> > +/*
> > + * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
> > + * @vma = virtual memory area whose policy is sought
> > + * @addr = address in @vma for shared policy lookup and interleave policy
> > + * @gfp_flags = for requested zone
> > + * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
> > + *
> > + * Returns a zonelist suitable for a huge page allocation.
> > + * If the effective policy is 'BIND, returns pointer to policy's zonelist.
>
> This comment here becomes redundant if applied on top of one-zonelist as
> you suggest you will be doing later. The zonelist returned for MPOL_BIND
> is the node's zonelist but it is filtered based on a nodemask.
Agreed. When I get around to rebasing atop your patches [under the
assumption they'll hit the mm tree before these] I'll fix this up. For
now, I've added myself a 'TODO' comment.
Note, however, that unless we take a copy of the policy's nodemask,
we'll still need to hold the reference over the allocation, I think.
Haven't looked that closely, yet.
>
> > + * If it is also a policy for which get_vma_policy() returns an extra
> > + * reference, we must hold that reference until after allocation.
> > + * In that case, return policy via @mpol so hugetlb allocation can drop
> > + * the reference. For non-'BIND referenced policies, we can/do drop the
> > + * reference here, so the caller doesn't need to know about the special case
> > + * for default and current task policy.
> > + */
> > struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> > - gfp_t gfp_flags)
> > + gfp_t gfp_flags, struct mempolicy **mpol)
> > {
> > struct mempolicy *pol = get_vma_policy(current, vma, addr);
> > + struct zonelist *zl;
> >
> > + *mpol = NULL; /* probably no unref needed */
> > if (pol->policy == MPOL_INTERLEAVE) {
> > unsigned nid;
> >
> > nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> > + __mpol_free(pol);
>
> So, __mpol_free() here acts as a put on the get_vma_policy() right?
> Either that needs commenting or __mpol_free() needs to be renamed to
> __mpol_put() assuming that when the count reaches 0, it really gets
> free.
Yes, the '__' version of mpol_free() takes a non-NULL policy pointer and
decrements the reference. [w/o the '__', a NULL policy pointer is a
no-op.] If the resulting count is zero, the policy structure, and any
attached zonelist [or nodemask, if we make those remote, as discussed in
Cambridge] is freed. The 'free' notation is Andi's original naming.
For now, rather than change that throughout the code, I'll comment this
instance.
Thanks,
Lee
^ permalink raw reply [flat|nested] 76+ messages in thread

* Re: [PATCH/RFC 1/5] Mem Policy: fix reference counting
2007-09-11 18:12 ` Lee Schermerhorn
@ 2007-09-13 9:45 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 9:45 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On (11/09/07 14:12), Lee Schermerhorn didst pronounce:
> On Tue, 2007-09-11 at 19:48 +0100, Mel Gorman wrote:
> > You know this stuff better than I do. Take suggestions here with a large
> > grain of salt.
>
> Your comments are on the mark. See responses below.
>
Great.
> >
> > On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> <patch description snipped>
> > > Index: Linux/mm/mempolicy.c
> > > ===================================================================
> > > --- Linux.orig/mm/mempolicy.c 2007-08-29 10:05:19.000000000 -0400
> > > +++ Linux/mm/mempolicy.c 2007-08-29 13:31:42.000000000 -0400
> > > @@ -1083,21 +1083,37 @@ asmlinkage long compat_sys_mbind(compat_
> > >
> > > #endif
> > >
> > > -/* Return effective policy for a VMA */
> > > +/*
> > > + * get_vma_policy(@task, @vma, @addr)
> > > + * @task - task for fallback if vma policy == default
> > > + * @vma - virtual memory area whose policy is sought
> > > + * @addr - address in @vma for shared policy lookup
> > > + *
> > > + * Returns effective policy for a VMA at specified address.
> > > + * Falls back to @task or system default policy, as necessary.
> > > + * Returned policy has extra reference count if shared, vma,
> > > + * or some other task's policy [show_numa_maps() can pass
> > > + * @task != current]. It is the caller's responsibility to
> > > + * free the reference in these cases.
> > > + */
> > > static struct mempolicy * get_vma_policy(struct task_struct *task,
> > > struct vm_area_struct *vma, unsigned long addr)
> > > {
> > > struct mempolicy *pol = task->mempolicy;
> > > + int shared_pol = 0;
> > >
> > > if (vma) {
> > > - if (vma->vm_ops && vma->vm_ops->get_policy)
> > > + if (vma->vm_ops && vma->vm_ops->get_policy) {
> > > pol = vma->vm_ops->get_policy(vma, addr);
> > > - else if (vma->vm_policy &&
> > > + shared_pol = 1; /* if pol non-NULL, that is */
> >
> > What do you mean here by "pol non-NULL, that is". Where do you check
> > that vm_ops->get_policy() returned a non-NULL value?
> >
> > Should the comment be
> >
> > /* Policy if set is shared, check later */
> >
> > and rename the variable to check_shared_pol?
>
> You interpret my cryptic comment correctly. However, your suggested fix
> doesn't quite capture my way of looking at it. Would it work for you if
> I change it to: /* if pol non-NULL, add ref below */ ??? That fits in
> 80 columns ;-)!
>
> >
> > > + } else if (vma->vm_policy &&
> > > vma->vm_policy->policy != MPOL_DEFAULT)
> > > pol = vma->vm_policy;
> > > }
> > > if (!pol)
> > > pol = &default_policy;
> > > + else if (!shared_pol && pol != current->mempolicy)
> > > + mpol_get(pol); /* vma or other task's policy */
> > > return pol;
> > > }
> > >
> > > @@ -1213,19 +1229,45 @@ static inline unsigned interleave_nid(st
> > > }
> > >
> > > #ifdef CONFIG_HUGETLBFS
> > > -/* Return a zonelist suitable for a huge page allocation. */
> > > +/*
> > > + * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
> > > + * @vma = virtual memory area whose policy is sought
> > > + * @addr = address in @vma for shared policy lookup and interleave policy
> > > + * @gfp_flags = for requested zone
> > > + * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
> > > + *
> > > + * Returns a zonelist suitable for a huge page allocation.
> > > + * If the effective policy is 'BIND, returns pointer to policy's zonelist.
> >
> > This comment here becomes redundant if applied on top of one-zonelist as
> > you suggest you will be doing later. The zonelist returned for MPOL_BIND
> > is the node's zonelist but it is filtered based on a nodemask.
>
> Agreed. When I get around to rebasing atop your patches [under the
> assumption they'll hit the mm tree before these] I'll fix this up. For
> now, I've added myself a 'TODO' comment.
>
I'm hoping they will hit -mm soon. They have dragged on a long time and the
last change was more cosmetic than anything else. Their last real test is
from you really as you catch policy bugs way better than me.
> Note, however, that unless we take a copy of the policy's nodemask,
> we'll still need to hold the reference over the allocation, I think.
> Haven't looked that closely, yet.
>
Whatever you have to do for MPOL_PREFERRED is likely the same as what
you'll have to do for MPOL_BIND.
> >
> > > + * If it is also a policy for which get_vma_policy() returns an extra
> > > + * reference, we must hold that reference until after allocation.
> > > + * In that case, return policy via @mpol so hugetlb allocation can drop
> > > + * the reference. For non-'BIND referenced policies, we can/do drop the
> > > + * reference here, so the caller doesn't need to know about the special case
> > > + * for default and current task policy.
> > > + */
> > > struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> > > - gfp_t gfp_flags)
> > > + gfp_t gfp_flags, struct mempolicy **mpol)
> > > {
> > > struct mempolicy *pol = get_vma_policy(current, vma, addr);
> > > + struct zonelist *zl;
> > >
> > > + *mpol = NULL; /* probably no unref needed */
> > > if (pol->policy == MPOL_INTERLEAVE) {
> > > unsigned nid;
> > >
> > > nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> > > + __mpol_free(pol);
> >
> > So, __mpol_free() here acts as a put on the get_vma_policy() right?
> > Either that needs commenting or __mpol_free() needs to be renamed to
> > __mpol_put() assuming that when the count reaches 0, it really gets
> > free.
>
> Yes, the '__' version of mpol_free() takes a non-NULL policy pointer and
> decrements the reference. [w/o the '__', a NULL policy pointer is a
> no-op.] If the resulting count is zero, the policy structure, and any
> attached zonelist [or nodemask, if we make those remote, as discussed in
> Cambridge] is freed. The 'free' notation is Andi's original naming.
> For now, rather than change that throughout the code, I'll comment this
> instance.
>
Or have a standalone patch that just does the renaming for clarity. The
__func convention in almost every other case refers to a lockless
version of func() or an internal helper of some sort. __func being a
reference count helper for func() is confusing to me and while it's not
your fault, you might as well fix it up while you're here.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
^ permalink raw reply [flat|nested] 76+ messages in thread
* [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
2007-08-30 18:51 ` [PATCH/RFC 1/5] Mem Policy: fix reference counting Lee Schermerhorn
@ 2007-08-30 18:51 ` Lee Schermerhorn
2007-09-11 18:54 ` Mel Gorman
2007-08-30 18:51 ` [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation" Lee Schermerhorn
` (4 subsequent siblings)
6 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 18:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, ak, mtk-manpages, clameter, solo, Lee Schermerhorn,
eric.whitney
PATCH/RFC 2/5 Use MPOL_PREFERRED for system-wide default policy
Against: 2.6.23-rc3-mm1
V1 -> V2:
+ restore BUG()s in switch(policy) default cases -- per
Christoph
+ eliminate unneeded re-init of struct mempolicy policy member
before freeing
Currently, when one specifies MPOL_DEFAULT via a NUMA memory
policy API [set_mempolicy(), mbind() and internal versions],
the kernel simply installs a NULL struct mempolicy pointer in
the appropriate context: task policy, vma policy, or shared
policy. This causes any use of that policy to "fall back" to
the next most specific policy scope. The only use of MPOL_DEFAULT
to mean "local allocation" is in the system default policy.
There is another, "preferred" way to specify local allocation via
the APIs. That is using the MPOL_PREFERRED policy mode with an
empty nodemask. Internally, the empty nodemask gets converted to
a preferred_node id of '-1'. All internal usage of MPOL_PREFERRED
will convert the '-1' to the id of the node local to the cpu
where the allocation occurs.
System default policy, except during boot, is hard-coded to
"local allocation". By using the MPOL_PREFERRED mode with a
negative value of preferred node for system default policy,
MPOL_DEFAULT will never occur in the 'policy' member of a
struct mempolicy. Thus, we can remove all checks for
MPOL_DEFAULT when converting policy to a node id/zonelist in
the allocation paths.
In slab_node() return local node id when policy pointer is NULL.
No need to set a pol value to take the switch default. Replace
switch default with BUG()--i.e., shouldn't happen.
With this patch MPOL_DEFAULT is only used in the APIs, including
internal calls to do_set_mempolicy() and in the display of policy
in /proc/&lt;pid&gt;/numa_maps. It always means "fall back" to the
next most specific policy scope. This simplifies the description
of memory policies quite a bit, with no visible change in behavior.
This patch updates Documentation to reflect this change.
Tested with set_mempolicy() using numactl with memtoy, and
tested mbind() with memtoy. All seems to work "as expected".
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation/vm/numa_memory_policy.txt | 70 ++++++++++++--------------------
mm/mempolicy.c | 31 ++++++--------
2 files changed, 41 insertions(+), 60 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-29 11:43:06.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-29 11:44:03.000000000 -0400
@@ -105,9 +105,13 @@ static struct kmem_cache *sn_cache;
policied. */
enum zone_type policy_zone = 0;
+/*
+ * run-time system-wide default policy => local allocation
+ */
struct mempolicy default_policy = {
.refcnt = ATOMIC_INIT(1), /* never free it */
- .policy = MPOL_DEFAULT,
+ .policy = MPOL_PREFERRED,
+ .v = { .preferred_node = -1 },
};
static void mpol_rebind_policy(struct mempolicy *pol,
@@ -180,7 +184,8 @@ static struct mempolicy *mpol_new(int mo
mode, nodes ? nodes_addr(*nodes)[0] : -1);
if (mode == MPOL_DEFAULT)
- return NULL;
+ return NULL; /* simply delete any existing policy */
+
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
if (!policy)
return ERR_PTR(-ENOMEM);
@@ -493,8 +498,6 @@ static void get_zonemask(struct mempolic
node_set(zone_to_nid(p->v.zonelist->zones[i]),
*nodes);
break;
- case MPOL_DEFAULT:
- break;
case MPOL_INTERLEAVE:
*nodes = p->v.nodes;
break;
@@ -1106,8 +1109,7 @@ static struct mempolicy * get_vma_policy
if (vma->vm_ops && vma->vm_ops->get_policy) {
pol = vma->vm_ops->get_policy(vma, addr);
shared_pol = 1; /* if pol non-NULL, that is */
- } else if (vma->vm_policy &&
- vma->vm_policy->policy != MPOL_DEFAULT)
+ } else if (vma->vm_policy)
pol = vma->vm_policy;
}
if (!pol)
@@ -1136,7 +1138,6 @@ static struct zonelist *zonelist_policy(
return policy->v.zonelist;
/*FALL THROUGH*/
case MPOL_INTERLEAVE: /* should not happen */
- case MPOL_DEFAULT:
nd = numa_node_id();
break;
default:
@@ -1166,9 +1167,10 @@ static unsigned interleave_nodes(struct
*/
unsigned slab_node(struct mempolicy *policy)
{
- int pol = policy ? policy->policy : MPOL_DEFAULT;
+ if (!policy)
+ return numa_node_id();
- switch (pol) {
+ switch (policy->policy) {
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
@@ -1182,10 +1184,10 @@ unsigned slab_node(struct mempolicy *pol
case MPOL_PREFERRED:
if (policy->v.preferred_node >= 0)
return policy->v.preferred_node;
- /* Fall through */
+ return numa_node_id();
default:
- return numa_node_id();
+ BUG();
}
}
@@ -1410,8 +1412,6 @@ int __mpol_equal(struct mempolicy *a, st
if (a->policy != b->policy)
return 0;
switch (a->policy) {
- case MPOL_DEFAULT:
- return 1;
case MPOL_INTERLEAVE:
return nodes_equal(a->v.nodes, b->v.nodes);
case MPOL_PREFERRED:
@@ -1436,7 +1436,6 @@ void __mpol_free(struct mempolicy *p)
return;
if (p->policy == MPOL_BIND)
kfree(p->v.zonelist);
- p->policy = MPOL_DEFAULT;
kmem_cache_free(policy_cache, p);
}
@@ -1603,7 +1602,7 @@ void mpol_shared_policy_init(struct shar
if (policy != MPOL_DEFAULT) {
struct mempolicy *newpol;
- /* Falls back to MPOL_DEFAULT on any error */
+ /* Falls back to NULL policy [MPOL_DEFAULT] on any error */
newpol = mpol_new(policy, policy_nodes);
if (!IS_ERR(newpol)) {
/* Create pseudo-vma that contains just the policy */
@@ -1724,8 +1723,6 @@ static void mpol_rebind_policy(struct me
return;
switch (pol->policy) {
- case MPOL_DEFAULT:
- break;
case MPOL_INTERLEAVE:
nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
pol->v.nodes = tmp;
Index: Linux/Documentation/vm/numa_memory_policy.txt
===================================================================
--- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:23:56.000000000 -0400
+++ Linux/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:43:10.000000000 -0400
@@ -149,63 +149,47 @@ Components of Memory Policies
Linux memory policy supports the following 4 behavioral modes:
- Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
- context or scope dependent.
+ Default Mode--MPOL_DEFAULT: This mode is only used in the memory
+ policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
+ memory policy in all policy scopes. Any existing non-default policy
+ will simply be removed when MPOL_DEFAULT is specified. As a result,
+ MPOL_DEFAULT means "fall back to the next most specific policy scope."
+
+ For example, a NULL or default task policy will fall back to the
+ system default policy. A NULL or default vma policy will fall
+ back to the task policy.
- As mentioned in the Policy Scope section above, during normal
- system operation, the System Default Policy is hard coded to
- contain the Default mode.
-
- In this context, default mode means "local" allocation--that is
- attempt to allocate the page from the node associated with the cpu
- where the fault occurs. If the "local" node has no memory, or the
- node's memory can be exhausted [no free pages available], local
- allocation will "fallback to"--attempt to allocate pages from--
- "nearby" nodes, in order of increasing "distance".
-
- Implementation detail -- subject to change: "Fallback" uses
- a per node list of sibling nodes--called zonelists--built at
- boot time, or when nodes or memory are added or removed from
- the system [memory hotplug]. These per node zonelist are
- constructed with nodes in order of increasing distance based
- on information provided by the platform firmware.
-
- When a task/process policy or a shared policy contains the Default
- mode, this also means "local allocation", as described above.
-
- In the context of a VMA, Default mode means "fall back to task
- policy"--which may or may not specify Default mode. Thus, Default
- mode can not be counted on to mean local allocation when used
- on a non-shared region of the address space. However, see
- MPOL_PREFERRED below.
-
- The Default mode does not use the optional set of nodes.
+ When specified in one of the memory policy APIs, the Default mode
+ does not use the optional set of nodes.
MPOL_BIND: This mode specifies that memory must come from the
set of nodes specified by the policy.
The memory policy APIs do not specify an order in which the nodes
- will be searched. However, unlike "local allocation", the Bind
- policy does not consider the distance between the nodes. Rather,
- allocations will fallback to the nodes specified by the policy in
- order of numeric node id. Like everything in Linux, this is subject
- to change.
+ will be searched. However, unlike "local allocation" discussed
+ below, the Bind policy does not consider the distance between the
+ nodes. Rather, allocations will fallback to the nodes specified
+ by the policy in order of numeric node id. Like everything in
+ Linux, this is subject to change.
MPOL_PREFERRED: This mode specifies that the allocation should be
attempted from the single node specified in the policy. If that
- allocation fails, the kernel will search other nodes, exactly as
- it would for a local allocation that started at the preferred node
- in increasing distance from the preferred node. "Local" allocation
- policy can be viewed as a Preferred policy that starts at the node
- containing the cpu where the allocation takes place.
+ allocation fails, the kernel will search other nodes, in order of
+ increasing distance from the preferred node based on information
+ provided by the platform firmware.
Internally, the Preferred policy uses a single node--the
preferred_node member of struct mempolicy. A "distinguished"
value of this preferred_node, currently '-1', is interpreted
as "the node containing the cpu where the allocation takes
- place"--local allocation. This is the way to specify
- local allocation for a specific range of addresses--i.e. for
- VMA policies.
+ place"--local allocation. "Local" allocation policy can be
+ viewed as a Preferred policy that starts at the node containing
+ the cpu where the allocation takes place.
+
+ As mentioned in the Policy Scope section above, during normal
+ system operation, the System Default Policy is hard coded to
+ specify "local allocation". This policy uses the Preferred
+ policy with the special negative value of preferred_node.
MPOL_INTERLEAVED: This mode specifies that page allocations be
interleaved, on a page granularity, across the nodes specified in
* Re: [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy
2007-08-30 18:51 ` [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy Lee Schermerhorn
@ 2007-09-11 18:54 ` Mel Gorman
2007-09-11 18:22 ` Lee Schermerhorn
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-11 18:54 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> PATCH/RFC 2/5 Use MPOL_PREFERRED for system-wide default policy
>
> Against: 2.6.23-rc3-mm1
>
> V1 -> V2:
> + restore BUG()s in switch(policy) default cases -- per
> Christoph
> + eliminate unneeded re-init of struct mempolicy policy member
> before freeing
>
> Currently, when one specifies MPOL_DEFAULT via a NUMA memory
> policy API [set_mempolicy(), mbind() and internal versions],
> the kernel simply installs a NULL struct mempolicy pointer in
> the appropriate context: task policy, vma policy, or shared
> policy. This causes any use of that policy to "fall back" to
> the next most specific policy scope. The only use of MPOL_DEFAULT
> to mean "local allocation" is in the system default policy.
>
In general, this seems like a good idea. It's certainly simpler to
always assume a policy exists because it discourages "bah, I don't care
about policies" style of thinking.
> There is another, "preferred" way to specify local allocation via
> the APIs. That is using the MPOL_PREFERRED policy mode with an
> empty nodemask. Internally, the empty nodemask gets converted to
> a preferred_node id of '-1'. All internal usage of MPOL_PREFERRED
> will convert the '-1' to the id of the node local to the cpu
> where the allocation occurs.
>
> System default policy, except during boot, is hard-coded to
> "local allocation". By using the MPOL_PREFERRED mode with a
> negative value of preferred node for system default policy,
> MPOL_DEFAULT will never occur in the 'policy' member of a
> struct mempolicy. Thus, we can remove all checks for
> MPOL_DEFAULT when converting policy to a node id/zonelist in
> the allocation paths.
>
> In slab_node() return local node id when policy pointer is NULL.
> No need to set a pol value to take the switch default. Replace
> switch default with BUG()--i.e., shouldn't happen.
>
> With this patch MPOL_DEFAULT is only used in the APIs, including
> internal calls to do_set_mempolicy() and in the display of policy
> in /proc/&lt;pid&gt;/numa_maps. It always means "fall back" to the
> next most specific policy scope. This simplifies the description
> of memory policies quite a bit, with no visible change in behavior.
> This patch updates Documentation to reflect this change.
>
> Tested with set_mempolicy() using numactl with memtoy, and
> tested mbind() with memtoy. All seems to work "as expected".
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> Documentation/vm/numa_memory_policy.txt | 70 ++++++++++++--------------------
> mm/mempolicy.c | 31 ++++++--------
> 2 files changed, 41 insertions(+), 60 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-08-29 11:43:06.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-08-29 11:44:03.000000000 -0400
> @@ -105,9 +105,13 @@ static struct kmem_cache *sn_cache;
> policied. */
> enum zone_type policy_zone = 0;
>
> +/*
> + * run-time system-wide default policy => local allocation
> + */
> struct mempolicy default_policy = {
> .refcnt = ATOMIC_INIT(1), /* never free it */
> - .policy = MPOL_DEFAULT,
> + .policy = MPOL_PREFERRED,
> + .v = { .preferred_node = -1 },
> };
>
fairly clear.
> static void mpol_rebind_policy(struct mempolicy *pol,
> @@ -180,7 +184,8 @@ static struct mempolicy *mpol_new(int mo
> mode, nodes ? nodes_addr(*nodes)[0] : -1);
>
> if (mode == MPOL_DEFAULT)
> - return NULL;
> + return NULL; /* simply delete any existing policy */
> +
Why do we not return default_policy and insert that into the VMA or
whatever?
> policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
> if (!policy)
> return ERR_PTR(-ENOMEM);
> @@ -493,8 +498,6 @@ static void get_zonemask(struct mempolic
> node_set(zone_to_nid(p->v.zonelist->zones[i]),
> *nodes);
> break;
> - case MPOL_DEFAULT:
> - break;
> case MPOL_INTERLEAVE:
> *nodes = p->v.nodes;
> break;
> @@ -1106,8 +1109,7 @@ static struct mempolicy * get_vma_policy
> if (vma->vm_ops && vma->vm_ops->get_policy) {
> pol = vma->vm_ops->get_policy(vma, addr);
> shared_pol = 1; /* if pol non-NULL, that is */
> - } else if (vma->vm_policy &&
> - vma->vm_policy->policy != MPOL_DEFAULT)
> + } else if (vma->vm_policy)
> pol = vma->vm_policy;
> }
> if (!pol)
> @@ -1136,7 +1138,6 @@ static struct zonelist *zonelist_policy(
> return policy->v.zonelist;
> /*FALL THROUGH*/
> case MPOL_INTERLEAVE: /* should not happen */
> - case MPOL_DEFAULT:
> nd = numa_node_id();
> break;
> default:
> @@ -1166,9 +1167,10 @@ static unsigned interleave_nodes(struct
> */
> unsigned slab_node(struct mempolicy *policy)
> {
> - int pol = policy ? policy->policy : MPOL_DEFAULT;
> + if (!policy)
> + return numa_node_id();
>
> - switch (pol) {
> + switch (policy->policy) {
> case MPOL_INTERLEAVE:
> return interleave_nodes(policy);
>
> @@ -1182,10 +1184,10 @@ unsigned slab_node(struct mempolicy *pol
> case MPOL_PREFERRED:
> if (policy->v.preferred_node >= 0)
> return policy->v.preferred_node;
> - /* Fall through */
> + return numa_node_id();
>
> default:
> - return numa_node_id();
> + BUG();
> }
> }
>
> @@ -1410,8 +1412,6 @@ int __mpol_equal(struct mempolicy *a, st
> if (a->policy != b->policy)
> return 0;
> switch (a->policy) {
> - case MPOL_DEFAULT:
> - return 1;
> case MPOL_INTERLEAVE:
> return nodes_equal(a->v.nodes, b->v.nodes);
> case MPOL_PREFERRED:
> @@ -1436,7 +1436,6 @@ void __mpol_free(struct mempolicy *p)
> return;
> if (p->policy == MPOL_BIND)
> kfree(p->v.zonelist);
> - p->policy = MPOL_DEFAULT;
> kmem_cache_free(policy_cache, p);
> }
>
> @@ -1603,7 +1602,7 @@ void mpol_shared_policy_init(struct shar
> if (policy != MPOL_DEFAULT) {
> struct mempolicy *newpol;
>
> - /* Falls back to MPOL_DEFAULT on any error */
> + /* Falls back to NULL policy [MPOL_DEFAULT] on any error */
> newpol = mpol_new(policy, policy_nodes);
> if (!IS_ERR(newpol)) {
> /* Create pseudo-vma that contains just the policy */
> @@ -1724,8 +1723,6 @@ static void mpol_rebind_policy(struct me
> return;
>
> switch (pol->policy) {
> - case MPOL_DEFAULT:
> - break;
> case MPOL_INTERLEAVE:
> nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
> pol->v.nodes = tmp;
> Index: Linux/Documentation/vm/numa_memory_policy.txt
> ===================================================================
> --- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:23:56.000000000 -0400
> +++ Linux/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:43:10.000000000 -0400
> @@ -149,63 +149,47 @@ Components of Memory Policies
>
> Linux memory policy supports the following 4 behavioral modes:
>
> - Default Mode--MPOL_DEFAULT: The behavior specified by this mode is
> - context or scope dependent.
> + Default Mode--MPOL_DEFAULT: This mode is only used in the memory
> + policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
> + memory policy in all policy scopes. Any existing non-default policy
> + will simply be removed when MPOL_DEFAULT is specified. As a result,
> + MPOL_DEFAULT means "fall back to the next most specific policy scope."
> +
> + For example, a NULL or default task policy will fall back to the
> + system default policy. A NULL or default vma policy will fall
> + back to the task policy.
>
> - As mentioned in the Policy Scope section above, during normal
> - system operation, the System Default Policy is hard coded to
> - contain the Default mode.
> -
> - In this context, default mode means "local" allocation--that is
> - attempt to allocate the page from the node associated with the cpu
> - where the fault occurs. If the "local" node has no memory, or the
> - node's memory can be exhausted [no free pages available], local
> - allocation will "fallback to"--attempt to allocate pages from--
> - "nearby" nodes, in order of increasing "distance".
> -
> - Implementation detail -- subject to change: "Fallback" uses
> - a per node list of sibling nodes--called zonelists--built at
> - boot time, or when nodes or memory are added or removed from
> - the system [memory hotplug]. These per node zonelist are
> - constructed with nodes in order of increasing distance based
> - on information provided by the platform firmware.
> -
> - When a task/process policy or a shared policy contains the Default
> - mode, this also means "local allocation", as described above.
> -
> - In the context of a VMA, Default mode means "fall back to task
> - policy"--which may or may not specify Default mode. Thus, Default
> - mode can not be counted on to mean local allocation when used
> - on a non-shared region of the address space. However, see
> - MPOL_PREFERRED below.
> -
> - The Default mode does not use the optional set of nodes.
> + When specified in one of the memory policy APIs, the Default mode
> + does not use the optional set of nodes.
>
> MPOL_BIND: This mode specifies that memory must come from the
> set of nodes specified by the policy.
>
> The memory policy APIs do not specify an order in which the nodes
> - will be searched. However, unlike "local allocation", the Bind
> - policy does not consider the distance between the nodes. Rather,
> - allocations will fallback to the nodes specified by the policy in
> - order of numeric node id. Like everything in Linux, this is subject
> - to change.
> + will be searched. However, unlike "local allocation" discussed
> + below, the Bind policy does not consider the distance between the
> + nodes. Rather, allocations will fallback to the nodes specified
> + by the policy in order of numeric node id. Like everything in
> + Linux, this is subject to change.
>
> MPOL_PREFERRED: This mode specifies that the allocation should be
> attempted from the single node specified in the policy. If that
> - allocation fails, the kernel will search other nodes, exactly as
> - it would for a local allocation that started at the preferred node
> - in increasing distance from the preferred node. "Local" allocation
> - policy can be viewed as a Preferred policy that starts at the node
> - containing the cpu where the allocation takes place.
> + allocation fails, the kernel will search other nodes, in order of
> + increasing distance from the preferred node based on information
> + provided by the platform firmware.
>
> Internally, the Preferred policy uses a single node--the
> preferred_node member of struct mempolicy. A "distinguished"
> value of this preferred_node, currently '-1', is interpreted
> as "the node containing the cpu where the allocation takes
> - place"--local allocation. This is the way to specify
> - local allocation for a specific range of addresses--i.e. for
> - VMA policies.
> + place"--local allocation. "Local" allocation policy can be
> + viewed as a Preferred policy that starts at the node containing
> + the cpu where the allocation takes place.
> +
> + As mentioned in the Policy Scope section above, during normal
> + system operation, the System Default Policy is hard coded to
> + specify "local allocation". This policy uses the Preferred
> + policy with the special negative value of preferred_node.
>
> MPOL_INTERLEAVED: This mode specifies that page allocations be
> interleaved, on a page granularity, across the nodes specified in
>
* Re: [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy
2007-09-11 18:54 ` Mel Gorman
@ 2007-09-11 18:22 ` Lee Schermerhorn
2007-09-13 9:48 ` Mel Gorman
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 18:22 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 19:54 +0100, Mel Gorman wrote:
> On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> > PATCH/RFC 2/5 Use MPOL_PREFERRED for system-wide default policy
> >
> > Against: 2.6.23-rc3-mm1
> >
> > V1 -> V2:
> > + restore BUG()s in switch(policy) default cases -- per
> > Christoph
> > + eliminate unneeded re-init of struct mempolicy policy member
> > before freeing
> >
> > Currently, when one specifies MPOL_DEFAULT via a NUMA memory
> > policy API [set_mempolicy(), mbind() and internal versions],
> > the kernel simply installs a NULL struct mempolicy pointer in
> > the appropriate context: task policy, vma policy, or shared
> > policy. This causes any use of that policy to "fall back" to
> > the next most specific policy scope. The only use of MPOL_DEFAULT
> > to mean "local allocation" is in the system default policy.
> >
>
> In general, this seems like a good idea. It's certainly simpler to
> always assume a policy exists because it discourages "bah, I don't care
> about policies" style of thinking.
More importantly, IMO, it eliminates 2 meanings for MPOL_DEFAULT in
different contexts and promotes the use of [MPOL_PREFERRED,
-1/null-nodemask] for local allocation. I think this makes the
resulting documentation clearer.
<snip>
> >
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-08-29 11:43:06.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-08-29 11:44:03.000000000 -0400
> > @@ -105,9 +105,13 @@ static struct kmem_cache *sn_cache;
> > policied. */
> > enum zone_type policy_zone = 0;
> >
> > +/*
> > + * run-time system-wide default policy => local allocation
> > + */
> > struct mempolicy default_policy = {
> > .refcnt = ATOMIC_INIT(1), /* never free it */
> > - .policy = MPOL_DEFAULT,
> > + .policy = MPOL_PREFERRED,
> > + .v = { .preferred_node = -1 },
> > };
> >
>
> fairly clear.
>
> > static void mpol_rebind_policy(struct mempolicy *pol,
> > @@ -180,7 +184,8 @@ static struct mempolicy *mpol_new(int mo
> > mode, nodes ? nodes_addr(*nodes)[0] : -1);
> >
> > if (mode == MPOL_DEFAULT)
> > - return NULL;
> > + return NULL; /* simply delete any existing policy */
> > +
>
> Why do we not return default_policy and insert that into the VMA or
> whatever?
>
Because then, if we're installing a shared policy [shmem], we'll go
ahead and create an rb-node and insert the [default] policy in the tree
in the shared policy struct, instead of just deleting any policy ranges
that the new policy covers. Andi already implemented the code to delete
shared policy ranges covered by a subsequent null/default policy. I
like this approach.
I have additional patches, to come later, that dynamically allocate the
shared policy structure when a non-null [non-default] policy is
installed. At some point, I plan on enhancing this to use a single
policy pointer, instead of the shared policy struct, when the policy
covers the entire object range, and delete any existing shared policy
struct when/if a default policy covers the entire range.
Thanks,
Lee
* Re: [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy
2007-09-11 18:22 ` Lee Schermerhorn
@ 2007-09-13 9:48 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 9:48 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On (11/09/07 14:22), Lee Schermerhorn didst pronounce:
> On Tue, 2007-09-11 at 19:54 +0100, Mel Gorman wrote:
> > On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> > > PATCH/RFC 2/5 Use MPOL_PREFERRED for system-wide default policy
> > >
> > > Against: 2.6.23-rc3-mm1
> > >
> > > V1 -> V2:
> > > + restore BUG()s in switch(policy) default cases -- per
> > > Christoph
> > > + eliminate unneeded re-init of struct mempolicy policy member
> > > before freeing
> > >
> > > Currently, when one specifies MPOL_DEFAULT via a NUMA memory
> > > policy API [set_mempolicy(), mbind() and internal versions],
> > > the kernel simply installs a NULL struct mempolicy pointer in
> > > the appropriate context: task policy, vma policy, or shared
> > > policy. This causes any use of that policy to "fall back" to
> > > the next most specific policy scope. The only use of MPOL_DEFAULT
> > > to mean "local allocation" is in the system default policy.
> > >
> >
> > In general, this seems like a good idea. It's certainly simpler to
> > always assume a policy exists because it discourages "bah, I don't care
> > about policies" style of thinking.
>
> More importantly, IMO, it eliminates 2 meanings for MPOL_DEFAULT in
> different contexts and promotes the use of [MPOL_PREFERRED,
> -1/null-nodemask] for local allocation. I think this makes the
> resulting documentation clearer.
>
That's a fair point.
> <snip>
> > >
> > > Index: Linux/mm/mempolicy.c
> > > ===================================================================
> > > --- Linux.orig/mm/mempolicy.c 2007-08-29 11:43:06.000000000 -0400
> > > +++ Linux/mm/mempolicy.c 2007-08-29 11:44:03.000000000 -0400
> > > @@ -105,9 +105,13 @@ static struct kmem_cache *sn_cache;
> > > policied. */
> > > enum zone_type policy_zone = 0;
> > >
> > > +/*
> > > + * run-time system-wide default policy => local allocation
> > > + */
> > > struct mempolicy default_policy = {
> > > .refcnt = ATOMIC_INIT(1), /* never free it */
> > > - .policy = MPOL_DEFAULT,
> > > + .policy = MPOL_PREFERRED,
> > > + .v = { .preferred_node = -1 },
> > > };
> > >
> >
> > fairly clear.
> >
> > > static void mpol_rebind_policy(struct mempolicy *pol,
> > > @@ -180,7 +184,8 @@ static struct mempolicy *mpol_new(int mo
> > > mode, nodes ? nodes_addr(*nodes)[0] : -1);
> > >
> > > if (mode == MPOL_DEFAULT)
> > > - return NULL;
> > > + return NULL; /* simply delete any existing policy */
> > > +
> >
> > Why do we not return default_policy and insert that into the VMA or
> > whatever?
> >
>
> Because then, if we're installing a shared policy [shmem], we'll go
> ahead and create an rb-node and insert the [default] policy in the tree
> in the shared policy struct, instead of just deleting any policy ranges
> that the new policy covers. Andi already implemented the code to delete
> shared policy ranges covered by a subsequent null/default policy. I
> like this approach.
>
Right, thanks for clearing this up.
> I have additional patches, to come later, that dynamically allocate the
> shared policy structure when a non-null [non-default] policy is
> installed. At some point, I plan on enhancing this to use a single
> policy pointer, instead of the shared policy struct, when the policy
> covers the entire object range, and delete any existing shared policy
> struct when/if a default policy covers the entire range.
>
It sounds reasonable. I don't know the policy code well enough to say if
it's a good idea but it certainly seems like one.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
2007-08-30 18:51 ` [PATCH/RFC 1/5] Mem Policy: fix reference counting Lee Schermerhorn
2007-08-30 18:51 ` [PATCH/RFC 2/5] Mem Policy: Use MPOL_PREFERRED for system-wide default policy Lee Schermerhorn
@ 2007-08-30 18:51 ` Lee Schermerhorn
2007-09-11 18:58 ` Mel Gorman
2007-09-12 22:06 ` Christoph Lameter
2007-08-30 18:51 ` [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy Lee Schermerhorn
` (3 subsequent siblings)
6 siblings, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 18:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, ak, mtk-manpages, clameter, solo, Lee Schermerhorn,
eric.whitney
PATCH/RFC 03/05 - MPOL_PREFERRED cleanups for "local allocation" - V4
Against: 2.6.23-rc3-mm1
V3 -> V4:
+ updated Documentation/vm/numa_memory_policy.txt to better explain
[I think] the "local allocation" feature of MPOL_PREFERRED.
V2 -> V3:
+ renamed get_nodemask() to get_policy_nodemask() to more closely
match what it's doing.
V1 -> V2:
+ renamed get_zonemask() to get_nodemask(). Mel Gorman suggested this
was a valid "cleanup".
Here are a couple of "cleanups" for MPOL_PREFERRED behavior
when v.preferred_node < 0 -- i.e., "local allocation":
1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
to fetch the nodemask associated with a policy. Currently,
get_policy_nodemask() returns the set of nodes with memory, when
the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
Return the set of allowed nodes instead. This will already have
been masked to include only nodes with memory.
2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
called to adjust any task and vma policy nodes to be valid in the
new cpuset. However, when the policy is MPOL_PREFERRED, and the
preferred_node is <0, no rebind is necessary. The "local allocation"
indication is valid in any cpuset. Existing code will "do the right
thing" because node_remap() will just return the argument node when
it is outside of the valid range of node ids. However, I think it is
clearer and cleaner to skip the remap explicitly in this case.
3) mpol_to_str() produces a printable, "human readable" string from a
struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
the entire set of valid nodes. Although, technically, MPOL_PREFERRED
takes only a single node, preferred_node <0 is a local allocation policy,
with the preferred node determined by the context where the task
is executing. All of the allowed nodes are possible, as the task
migrates among the nodes in the cpuset. Without this change, I believe
that node_set() [via set_bit()] will set bit 31, resulting in a misleading
display.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 41 ++++++++++++++++++++++++++++++-----------
1 file changed, 30 insertions(+), 11 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-30 13:20:13.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-30 13:36:04.000000000 -0400
@@ -486,8 +486,10 @@ static long do_set_mempolicy(int mode, n
return 0;
}
-/* Fill a zone bitmap for a policy */
-static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
+/*
+ * Return a node bitmap for a policy
+ */
+static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
{
int i;
@@ -502,9 +504,11 @@ static void get_zonemask(struct mempolic
*nodes = p->v.nodes;
break;
case MPOL_PREFERRED:
- /* or use current node instead of memory_map? */
+ /*
+ * for "local policy", return allowed memories
+ */
if (p->v.preferred_node < 0)
- *nodes = node_states[N_HIGH_MEMORY];
+ *nodes = cpuset_current_mems_allowed;
else
node_set(p->v.preferred_node, *nodes);
break;
@@ -578,7 +582,7 @@ static long do_get_mempolicy(int *policy
err = 0;
if (nmask)
- get_zonemask(pol, nmask);
+ get_policy_nodemask(pol, nmask);
out:
if (vma)
@@ -1715,6 +1719,7 @@ static void mpol_rebind_policy(struct me
{
nodemask_t *mpolmask;
nodemask_t tmp;
+ int nid;
if (!pol)
return;
@@ -1731,9 +1736,15 @@ static void mpol_rebind_policy(struct me
*mpolmask, *newmask);
break;
case MPOL_PREFERRED:
- pol->v.preferred_node = node_remap(pol->v.preferred_node,
+ /*
+ * no need to remap "local policy"
+ */
+ nid = pol->v.preferred_node;
+ if (nid >= 0) {
+ pol->v.preferred_node = node_remap(nid,
*mpolmask, *newmask);
- *mpolmask = *newmask;
+ *mpolmask = *newmask;
+ }
break;
case MPOL_BIND: {
nodemask_t nodes;
@@ -1808,7 +1819,7 @@ static const char * const policy_types[]
static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
{
char *p = buffer;
- int l;
+ int nid, l;
nodemask_t nodes;
int mode = pol ? pol->policy : MPOL_DEFAULT;
@@ -1818,12 +1829,20 @@ static inline int mpol_to_str(char *buff
break;
case MPOL_PREFERRED:
- nodes_clear(nodes);
- node_set(pol->v.preferred_node, nodes);
+ nid = pol->v.preferred_node;
+ /*
+ * local allocation, show all valid nodes
+ */
+ if (nid < 0)
+ nodes = cpuset_current_mems_allowed;
+ else {
+ nodes_clear(nodes);
+ node_set(nid, nodes);
+ }
break;
case MPOL_BIND:
- get_zonemask(pol, &nodes);
+ get_policy_nodemask(pol, &nodes);
break;
case MPOL_INTERLEAVE:
--
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-08-30 18:51 ` [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation" Lee Schermerhorn
@ 2007-09-11 18:58 ` Mel Gorman
2007-09-11 18:34 ` Lee Schermerhorn
2007-09-12 22:06 ` Christoph Lameter
1 sibling, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-11 18:58 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> PATCH/RFC 03/05 - MPOL_PREFERRED cleanups for "local allocation" - V4
>
> Against: 2.6.23-rc3-mm1
>
> V3 -> V4:
> + updated Documentation/vm/numa_memory_policy.txt to better explain
> [I think] the "local allocation" feature of MPOL_PREFERRED.
>
> V2 -> V3:
> + renamed get_nodemask() to get_policy_nodemask() to more closely
> match what it's doing.
>
> V1 -> V2:
> + renamed get_zonemask() to get_nodemask(). Mel Gorman suggested this
> was a valid "cleanup".
>
> Here are a couple of "cleanups" for MPOL_PREFERRED behavior
> when v.preferred_node < 0 -- i.e., "local allocation":
>
> 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
> to fetch the nodemask associated with a policy. Currently,
> get_policy_nodemask() returns the set of nodes with memory, when
> the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
> Return the set of allowed nodes instead. This will already have
> been masked to include only nodes with memory.
>
Better name all right.
> 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
> called to adjust any task and vma policy nodes to be valid in the
> new cpuset. However, when the policy is MPOL_PREFERRED, and the
> preferred_node is <0, no rebind is necessary. The "local allocation"
> indication is valid in any cpuset. Existing code will "do the right
> thing" because node_remap() will just return the argument node when
> it is outside of the valid range of node ids. However, I think it is
> clearer and cleaner to skip the remap explicitly in this case.
>
> 3) mpol_to_str() produces a printable, "human readable" string from a
> struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
> the entire set of valid nodes. Although, technically, MPOL_PREFERRED
> takes only a single node, preferred_node <0 is a local allocation policy,
> with the preferred node determined by the context where the task
> is executing. All of the allowed nodes are possible, as the task
> migrates among the nodes in the cpuset. Without this change, I believe
> that node_set() [via set_bit()] will set bit 31, resulting in a misleading
> display.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> mm/mempolicy.c | 41 ++++++++++++++++++++++++++++++-----------
> 1 file changed, 30 insertions(+), 11 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-08-30 13:20:13.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-08-30 13:36:04.000000000 -0400
> @@ -486,8 +486,10 @@ static long do_set_mempolicy(int mode, n
> return 0;
> }
>
> -/* Fill a zone bitmap for a policy */
> -static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
> +/*
> + * Return a node bitmap for a policy
> + */
> +static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> {
> int i;
>
> @@ -502,9 +504,11 @@ static void get_zonemask(struct mempolic
> *nodes = p->v.nodes;
> break;
> case MPOL_PREFERRED:
> - /* or use current node instead of memory_map? */
> + /*
> + * for "local policy", return allowed memories
> + */
> if (p->v.preferred_node < 0)
> - *nodes = node_states[N_HIGH_MEMORY];
> + *nodes = cpuset_current_mems_allowed;
Is this change intentional? It looks like something that belongs as part
of the memoryless patch set.
> else
> node_set(p->v.preferred_node, *nodes);
> break;
> @@ -578,7 +582,7 @@ static long do_get_mempolicy(int *policy
>
> err = 0;
> if (nmask)
> - get_zonemask(pol, nmask);
> + get_policy_nodemask(pol, nmask);
>
> out:
> if (vma)
> @@ -1715,6 +1719,7 @@ static void mpol_rebind_policy(struct me
> {
> nodemask_t *mpolmask;
> nodemask_t tmp;
> + int nid;
>
> if (!pol)
> return;
> @@ -1731,9 +1736,15 @@ static void mpol_rebind_policy(struct me
> *mpolmask, *newmask);
> break;
> case MPOL_PREFERRED:
> - pol->v.preferred_node = node_remap(pol->v.preferred_node,
> + /*
> + * no need to remap "local policy"
> + */
> + nid = pol->v.preferred_node;
> + if (nid >= 0) {
> + pol->v.preferred_node = node_remap(nid,
> *mpolmask, *newmask);
> - *mpolmask = *newmask;
> + *mpolmask = *newmask;
> + }
> break;
> case MPOL_BIND: {
> nodemask_t nodes;
> @@ -1808,7 +1819,7 @@ static const char * const policy_types[]
> static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
> {
> char *p = buffer;
> - int l;
> + int nid, l;
> nodemask_t nodes;
> int mode = pol ? pol->policy : MPOL_DEFAULT;
>
> @@ -1818,12 +1829,20 @@ static inline int mpol_to_str(char *buff
> break;
>
> case MPOL_PREFERRED:
> - nodes_clear(nodes);
> - node_set(pol->v.preferred_node, nodes);
> + nid = pol->v.preferred_node;
> + /*
> + * local interleave, show all valid nodes
> + */
> + if (nid < 0)
> + nodes = cpuset_current_mems_allowed;
> + else {
> + nodes_clear(nodes);
> + node_set(nid, nodes);
> + }
> break;
>
> case MPOL_BIND:
> - get_zonemask(pol, &nodes);
> + get_policy_nodemask(pol, &nodes);
> break;
>
> case MPOL_INTERLEAVE:
>
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-11 18:58 ` Mel Gorman
@ 2007-09-11 18:34 ` Lee Schermerhorn
2007-09-12 22:10 ` Christoph Lameter
2007-09-13 9:55 ` Mel Gorman
0 siblings, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 18:34 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 19:58 +0100, Mel Gorman wrote:
> On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> > PATCH/RFC 03/05 - MPOL_PREFERRED cleanups for "local allocation" - V4
> >
> > Against: 2.6.23-rc3-mm1
> >
> > V3 -> V4:
> > + updated Documentation/vm/numa_memory_policy.txt to better explain
> > [I think] the "local allocation" feature of MPOL_PREFERRED.
> >
> > V2 -> V3:
> > + renamed get_nodemask() to get_policy_nodemask() to more closely
> > match what it's doing.
> >
> > V1 -> V2:
> > + renamed get_zonemask() to get_nodemask(). Mel Gorman suggested this
> > was a valid "cleanup".
> >
> > Here are a couple of "cleanups" for MPOL_PREFERRED behavior
> > when v.preferred_node < 0 -- i.e., "local allocation":
> >
> > 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
> > to fetch the nodemask associated with a policy. Currently,
> > get_policy_nodemask() returns the set of nodes with memory, when
> > the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
> > Return the set of allowed nodes instead. This will already have
> > been masked to include only nodes with memory.
> >
>
> Better name all right.
:-) That's why you suggested it, right?
<snip>
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-08-30 13:20:13.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-08-30 13:36:04.000000000 -0400
> > @@ -486,8 +486,10 @@ static long do_set_mempolicy(int mode, n
> > return 0;
> > }
> >
> > -/* Fill a zone bitmap for a policy */
> > -static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
> > +/*
> > + * Return a node bitmap for a policy
> > + */
> > +static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> > {
> > int i;
> >
> > @@ -502,9 +504,11 @@ static void get_zonemask(struct mempolic
> > *nodes = p->v.nodes;
> > break;
> > case MPOL_PREFERRED:
> > - /* or use current node instead of memory_map? */
> > + /*
> > + * for "local policy", return allowed memories
> > + */
> > if (p->v.preferred_node < 0)
> > - *nodes = node_states[N_HIGH_MEMORY];
> > + *nodes = cpuset_current_mems_allowed;
>
> Is this change intentional? It looks like something that belongs as part
> of the memoryless patch set.
>
Absolutely intentional. The use of 'node_states[N_HIGH_MEMORY]' was
added by the memoryless nodes patches. Formerly, this was
'node_online_map'. However, even this results in misleading info for
tasks running in a cpuset.
When a task queries its memory policy via get_mempolicy(2), and the
policy is MPOL_PREFERRED with the '-1' policy node--i.e., local
allocation--the memory can come from any node from which the task is
allowed to allocate. Initially it will try to allocate only from nodes
containing cpus on which the task is allowed to execute. But, the
allocation could overflow onto some other node allowed in the cpuset.
It's a fine point, but I think this is "more correct" than the existing
code. I'm hoping that this change, with a corresponding man page update
will head off some head scratching and support calls down the road.
Lee
<snip>
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-11 18:34 ` Lee Schermerhorn
@ 2007-09-12 22:10 ` Christoph Lameter
2007-09-13 13:51 ` Lee Schermerhorn
2007-09-13 9:55 ` Mel Gorman
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-12 22:10 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Mel Gorman, linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney
On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
> > > case MPOL_PREFERRED:
> > > - /* or use current node instead of memory_map? */
> > > + /*
> > > + * for "local policy", return allowed memories
> > > + */
> > > if (p->v.preferred_node < 0)
> > > - *nodes = node_states[N_HIGH_MEMORY];
> > > + *nodes = cpuset_current_mems_allowed;
> >
> > Is this change intentional? It looks like something that belongs as part
> > of the memoryless patch set.
> >
>
> Absolutely intentional. The use of 'node_states[N_HIGH_MEMORY]' was
> added by the memoryless nodes patches. Formerly, this was
> 'node_online_map'. However, even this results in misleading info for
> tasks running in a cpuset.
Sort of. This just means that the policy does not restrict the valid
nodes. The cpuset does. I think this is okay but we may be confusing users
as to which mechanism performs the restriction.
> It's a fine point, but I think this is "more correct" than the existing
> code. I'm hoping that this change, with a corresponding man page update
> will head off some head scratching and support calls down the road.
How does this sync with the nodemasks used by other policies? So far we
are using a sort of cpuset agnostic nodeset and limit it when it is
applied. I think the integration between cpuset and memory policies could
use some work and this is certainly something valid to do. Is there any
way to describe that and have output that clarifies that distinction and
helps the user figure out what is going on?
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-12 22:10 ` Christoph Lameter
@ 2007-09-13 13:51 ` Lee Schermerhorn
2007-09-13 18:18 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 13:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney
On Wed, 2007-09-12 at 15:10 -0700, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
>
> > > > case MPOL_PREFERRED:
> > > > - /* or use current node instead of memory_map? */
> > > > + /*
> > > > + * for "local policy", return allowed memories
> > > > + */
> > > > if (p->v.preferred_node < 0)
> > > > - *nodes = node_states[N_HIGH_MEMORY];
> > > > + *nodes = cpuset_current_mems_allowed;
> > >
> > > Is this change intentional? It looks like something that belongs as part
> > > of the memoryless patch set.
> > >
> >
> > Absolutely intentional. The use of 'node_states[N_HIGH_MEMORY]' was
> > added by the memoryless nodes patches. Formerly, this was
> > 'node_online_map'. However, even this results in misleading info for
> > tasks running in a cpuset.
>
> Sort of. This just means that the policy does not restrict the valid
> nodes. The cpuset does. I think this is okay but we may be confusing users
> as to which mechanism performs the restriction.
>
> > It's a fine point, but I think this is "more correct" than the existing
> > code. I'm hoping that this change, with a corresponding man page update
> > will head off some head scratching and support calls down the road.
>
> How does this sync with the nodemasks used by other policies? So far we
> are using a sort of cpuset agnostic nodeset and limit it when it is
> applied.
Not exactly: set_mempolicy() calls "contextualize_policy()" that
returns an error if the nodemask is not a subset of mems_allowed; and
then calls mpol_check_policy() to further vet the syscall args.
Now, I see that sys_mbind() does just AND the nodemask with
mems_allowed. So, it won't give an error.
Should these be the same? If so, which way: error or silently mask off
dis-allowed nodes? The latter doesn't let the user know what's going
on, but with my new MPOL_F_MEMS_ALLOWED flag, a user can query the
allowed nodes. And, I can update the man pages to state exactly what
happens. So, how about:
1) changing contextualize_policy() to mask off dis-allowed nodes rather
than giving an error [this is a change in behavior for
set_mempolicy()], and
2) changing mbind() to use contextualize_policy() like
set_mempolicy()--no change in behavior here.
Thoughts?
> I think the integration between cpuset and memory policies could
> use some work and this is certainly something valid to do. Is there any
> way to describe that and have output that clarifies that distinction and
> helps the user figure out what is going on?
Man pages can/will be updated and the ability to query allowed nodes
should provide the necessary info. Would this satisfy your concern?
Lee
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-13 13:51 ` Lee Schermerhorn
@ 2007-09-13 18:18 ` Christoph Lameter
0 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:18 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Mel Gorman, linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney
On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
> > How does this sync with the nodemasks used by other policies? So far we
> > are using a sort of cpuset agnostic nodeset and limit it when it is
> > applied.
>
> Not exactly: set_mempolicy() calls "contextualize_policy()" that
> returns an error if the nodemask is not a subset of mems_allowed; and
> then calls mpol_check_policy() to further vet the syscall args.
>
> Now, I see that sys_mbind() does just AND the nodemask with
> mems_allowed. So, it won't give an error.
Correct.
> Should these be the same? If so, which way: error or silently mask off
> dis-allowed nodes? The latter doesn't let the user know what's going
> on, but with my new MPOL_F_MEMS_ALLOWED flag, a user can query the
> allowed nodes. And, I can update the man pages to state exactly what
> happens. So, how about:
I think an error is better. However, the use of MPOL_F_MEMS_ALLOWED may
race with process migration. The mems allowed are not reliable in that
sense. If the process stores information about allowable nodes then the
process must have some way to be notified that the underlying nodes are
changing for page migration. Otherwise the process may start trying to
allocate from nodes that it is no longer allowed to get memory from.
That is another reason why we have considered relative nodemasks in the
past. If it's relative to mems_allowed then the mems_allowed can change
without the process having to change the nodemasks that it may have
stored.
> > I think the integration between cpuset and memory policies could
> > use some work and this is certainly something valid to do. Is there any
> > way to describe that and have output that clarifies that distinction and
> > helps the user figure out what is going on?
>
> Man pages can/will be updated and the ability to query allowed nodes
> should provide the necessary info. Would this satisfy your concern?
I have no concern about the manpages. You are doing an excellent job. I am
worried about the consistency of the API and breakage that may be
introduced because of partially relative and partially absolute nodemasks
in use. The F_MEMS_ALLOWED needs to have a big warning in the manpage that
the allowed nodemask may change at any time.
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-11 18:34 ` Lee Schermerhorn
2007-09-12 22:10 ` Christoph Lameter
@ 2007-09-13 9:55 ` Mel Gorman
1 sibling, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 9:55 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On (11/09/07 14:34), Lee Schermerhorn didst pronounce:
> On Tue, 2007-09-11 at 19:58 +0100, Mel Gorman wrote:
> > On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> > > PATCH/RFC 03/05 - MPOL_PREFERRED cleanups for "local allocation" - V4
> > >
> > > Against: 2.6.23-rc3-mm1
> > >
> > > V3 -> V4:
> > > + updated Documentation/vm/numa_memory_policy.txt to better explain
> > > [I think] the "local allocation" feature of MPOL_PREFERRED.
> > >
> > > V2 -> V3:
> > > + renamed get_nodemask() to get_policy_nodemask() to more closely
> > > match what it's doing.
> > >
> > > V1 -> V2:
> > > + renamed get_zonemask() to get_nodemask(). Mel Gorman suggested this
> > > was a valid "cleanup".
> > >
> > > Here are a couple of "cleanups" for MPOL_PREFERRED behavior
> > > when v.preferred_node < 0 -- i.e., "local allocation":
> > >
> > > 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
> > > to fetch the nodemask associated with a policy. Currently,
> > > get_policy_nodemask() returns the set of nodes with memory, when
> > > the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
> > > Return the set of allowed nodes instead. This will already have
> > > been masked to include only nodes with memory.
> > >
> >
> > Better name all right.
>
> :-) That's why you suggested it, right?
>
I did? Probably why I like it then :)
> <snip>
>
> > > Index: Linux/mm/mempolicy.c
> > > ===================================================================
> > > --- Linux.orig/mm/mempolicy.c 2007-08-30 13:20:13.000000000 -0400
> > > +++ Linux/mm/mempolicy.c 2007-08-30 13:36:04.000000000 -0400
> > > @@ -486,8 +486,10 @@ static long do_set_mempolicy(int mode, n
> > > return 0;
> > > }
> > >
> > > -/* Fill a zone bitmap for a policy */
> > > -static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
> > > +/*
> > > + * Return a node bitmap for a policy
> > > + */
> > > +static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
> > > {
> > > int i;
> > >
> > > @@ -502,9 +504,11 @@ static void get_zonemask(struct mempolic
> > > *nodes = p->v.nodes;
> > > break;
> > > case MPOL_PREFERRED:
> > > - /* or use current node instead of memory_map? */
> > > + /*
> > > + * for "local policy", return allowed memories
> > > + */
> > > if (p->v.preferred_node < 0)
> > > - *nodes = node_states[N_HIGH_MEMORY];
> > > + *nodes = cpuset_current_mems_allowed;
> >
> > Is this change intentional? It looks like something that belongs as part
> of the memoryless patch set.
> >
>
> Absolutely intentional. The use of 'node_states[N_HIGH_MEMORY]' was
> added by the memoryless nodes patches. Formerly, this was
> 'node_online_map'. However, even this results in misleading info for
> tasks running in a cpuset.
>
Right, because the map would contain nodes outside of the cpuset which
is very misleading.
> When a task queries its memory policy via get_mempolicy(2), and the
> policy is MPOL_PREFERRED with the '-1' policy node--i.e., local
> allocation--the memory can come from any node from which the task is
> allowed to allocate. Initially it will try to allocate only from nodes
> containing cpus on which the task is allowed to execute. But, the
> allocation could overflow onto some other node allowed in the cpuset.
>
> It's a fine point, but I think this is "more correct" than the existing
> code. I'm hoping that this change, with a corresponding man page update
> will head off some head scratching and support calls down the road.
>
I agree. The change just seemed out-of-context in this patchset so I
thought I would flag it in case it had crept in from another patchset
by accident.
Thanks for the clarification
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-08-30 18:51 ` [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation" Lee Schermerhorn
2007-09-11 18:58 ` Mel Gorman
@ 2007-09-12 22:06 ` Christoph Lameter
2007-09-13 13:35 ` Lee Schermerhorn
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-12 22:06 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney
On Thu, 30 Aug 2007, Lee Schermerhorn wrote:
> 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
> to fetch the nodemask associated with a policy. Currently,
> get_policy_nodemask() returns the set of nodes with memory, when
> the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
> Return the set of allowed nodes instead. This will already have
> been masked to include only nodes with memory.
Ok.
> 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
> called to adjust any task and vma policy nodes to be valid in the
> new cpuset. However, when the policy is MPOL_PREFERRED, and the
> preferred_node is <0, no rebind is necessary. The "local allocation"
> indication is valid in any cpuset. Existing code will "do the right
> thing" because node_remap() will just return the argument node when
> it is outside of the valid range of node ids. However, I think it is
> clearer and cleaner to skip the remap explicitly in this case.
Sounds good. This is on the way to having cpuset relative node
numbering???
> 3) mpol_to_str() produces a printable, "human readable" string from a
> struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
> the entire set of valid nodes. Although, technically, MPOL_PREFERRED
> takes only a single node, preferred_node <0 is a local allocation policy,
> with the preferred node determined by the context where the task
> is executing. All of the allowed nodes are possible, as the task
> migrates among the nodes in the cpuset. Without this change, I believe
> that node_set() [via set_bit()] will set bit 31, resulting in a misleading
> display.
Hmmm. But one wants mpol_to_str to represent the memory policy not the
context information that may change through migration. What you
do there is provide information from the context. You could add the
nodemask but I think we need to have some indicator that this policy is
referring to the local policy.
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-12 22:06 ` Christoph Lameter
@ 2007-09-13 13:35 ` Lee Schermerhorn
2007-09-13 18:21 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 13:35 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Wed, 2007-09-12 at 15:06 -0700, Christoph Lameter wrote:
> On Thu, 30 Aug 2007, Lee Schermerhorn wrote:
>
> > 1) [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
> > to fetch the nodemask associated with a policy. Currently,
> > get_policy_nodemask() returns the set of nodes with memory, when
> > the policy 'mode' is MPOL_PREFERRED, and the preferred_node is < 0.
> > Return the set of allowed nodes instead. This will already have
> > been masked to include only nodes with memory.
>
> Ok.
>
> > 2) When a task is moved into a [new] cpuset, mpol_rebind_policy() is
> > called to adjust any task and vma policy nodes to be valid in the
> > new cpuset. However, when the policy is MPOL_PREFERRED, and the
> > preferred_node is <0, no rebind is necessary. The "local allocation"
> > indication is valid in any cpuset. Existing code will "do the right
> > thing" because node_remap() will just return the argument node when
> > it is outside of the valid range of node ids. However, I think it is
> > clearer and cleaner to skip the remap explicitly in this case.
>
> Sounds good. This is on the way to having cpuset relative node
> numbering???
>
> > 3) mpol_to_str() produces a printable, "human readable" string from a
> > struct mempolicy. For MPOL_PREFERRED with preferred_node <0, show
> > the entire set of valid nodes. Although, technically, MPOL_PREFERRED
> > takes only a single node, preferred_node <0 is a local allocation policy,
> > with the preferred node determined by the context where the task
> > is executing. All of the allowed nodes are possible, as the task
> > migrates among the nodes in the cpuset. Without this change, I believe
> > that node_set() [via set_bit()] will set bit 31, resulting in a misleading
> > display.
>
> Hmmm. But one wants mpol_to_str to represent the memory policy not the
> context information that may change through migration. What you
> do there is provide information from the context. You could add the
> nodemask but I think we need to have some indicator that this policy is
> referring to the local policy.
True. I could make mpol_to_str return something like "*" for the
nodemask and document this as "any allowed".
Thoughts?
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation"
2007-09-13 13:35 ` Lee Schermerhorn
@ 2007-09-13 18:21 ` Christoph Lameter
0 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:21 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
> > Hmmm. But one wants mpol_to_str to represent the memory policy not the
> > context information that may change through migration. What you
> > do there is provide information from the context. You could add the
> > nodemask but I think we need to have some indicator that this policy is
> > referring to the local policy.
>
> True. I could make mpol_to_str return something like "*" for the
> nodemask and document this as "any allowed".
Or print (any) ?
^ permalink raw reply [flat|nested] 76+ messages in thread
* [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
` (2 preceding siblings ...)
2007-08-30 18:51 ` [PATCH/RFC 3/5] Mem Policy: MPOL_PREFERRED fixups for "local allocation" Lee Schermerhorn
@ 2007-08-30 18:51 ` Lee Schermerhorn
2007-09-12 21:20 ` Ethan Solomita
2007-09-12 21:59 ` Ethan Solomita
2007-08-30 18:51 ` [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag Lee Schermerhorn
` (2 subsequent siblings)
6 siblings, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 18:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, ak, mtk-manpages, clameter, solo, Lee Schermerhorn,
eric.whitney
PATCH/RFC 04/05 - cpuset-independent interleave policy
Against: 2.6.23-rc3-mm1
Interleave memory policy uses physical node ids. When
a task executes in a cpuset, any policies that it installs
are constrained to use only nodes that are valid in the cpuset.
This makes it difficult to use shared policies--e.g., on shmem/shm
segments--in this environment; especially in disjoint cpusets. Any
policy installed by a task in one of the cpusets is invalid in a
disjoint cpuset.
Local allocation, whether as a result of default policy or preferred
policy with the local preferred_node token [-1 internally, null/empty
nodemask in the APIs], does not suffer from this problem. It is a
"context dependent" or cpuset-independent policy.
This patch introduces a cpuset-independent interleave policy that will
work in shared policies applied to shared memory segments attached by
tasks in disjoint cpusets. The cpuset-independent policy effectively
says "interleave across all valid nodes in the context where page
allocation occurs."
API: following the lead of the "preferred local" policy, a null or
empty node mask specified with MPOL_INTERLEAVE specifies "all nodes
valid in the allocating context."
Internally, it's not quite as easy as storing a special token [node
id == -1] in the preferred_node member. MPOL_INTERLEAVE policy uses
a nodemask embedded in the mempolicy structure. The nodemask is
"unioned" with preferred_node. The only otherwise invalid value of
the nodemask that one could use to indicate the context-dependent
interleave mask is the empty set. Coding-wise this would be simple:
if (nodes_empty(mpol->v.nodes)) ...
However, this will involve testing possibly several words of
bitmask in the allocation path. Instead, I chose to encode the
"context-dependent policy" indication in the upper bits of the
policy member of the mempolicy structure. This member must
already be tested to determine the policy mode, so no extra
memory references should be required. However, for testing the
policy--e.g., in the several switch() and if() statements--the
context flag must be masked off using the policy_mode() inline
function. On the upside, this allows additional flags to be so
encoded, should that become useful.
Another potential issue is that this requires fetching the
interleave nodemask--either from the mempolicy struct or
current_cpuset_mems_allowed, depending on the context flag, during
page allocation time. However, interleaving is already a fairly
heavy-weight policy, so maybe this won't be noticeable.
Functionally tested OK: i.e., tasks in disjoint cpusets sharing
shmem with a shared, cpuset-independent interleave policy. It
appears to work.
TODO: see intentional '// TODO' in patch
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation/vm/numa_memory_policy.txt | 15 +++++-
include/linux/mempolicy.h | 16 ++++++
mm/mempolicy.c | 74 ++++++++++++++++++++++----------
3 files changed, 80 insertions(+), 25 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-30 13:36:04.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-30 13:37:51.000000000 -0400
@@ -128,12 +128,15 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
case MPOL_BIND:
- case MPOL_INTERLEAVE:
/* Preferred will only use the first bit, but allow
more for now. */
if (empty)
return -EINVAL;
break;
+ case MPOL_INTERLEAVE:
+ if (empty)
+ return 0; /* context dependent interleave */
+ break;
}
return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
}
@@ -193,6 +196,10 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
+ if (nodes_weight(*nodes) == 0) {
+ mode |= MPOL_CONTEXT;
+ break;
+ }
nodes_and(policy->v.nodes, policy->v.nodes,
node_states[N_HIGH_MEMORY]);
if (nodes_weight(policy->v.nodes) == 0) {
@@ -468,6 +475,19 @@ static void mpol_set_task_struct_flag(vo
mpol_fix_fork_child_flag(current);
}
+/*
+ * Return node mask of specified [possibly contextualized] interleave policy.
+ */
+static nodemask_t *get_interleave_nodes(struct mempolicy *p)
+{
+ VM_BUG_ON(policy_mode(p) != MPOL_INTERLEAVE);
+
+ if (unlikely(p->policy & MPOL_CONTEXT))
+ return &cpuset_current_mems_allowed;
+
+ return &p->v.nodes;
+}
+
/* Set the process memory policy */
static long do_set_mempolicy(int mode, nodemask_t *nodes)
{
@@ -481,8 +501,8 @@ static long do_set_mempolicy(int mode, n
mpol_free(current->mempolicy);
current->mempolicy = new;
mpol_set_task_struct_flag();
- if (new && new->policy == MPOL_INTERLEAVE)
- current->il_next = first_node(new->v.nodes);
+ if (new && policy_mode(new) == MPOL_INTERLEAVE)
+ current->il_next = first_node(*get_interleave_nodes(new));
return 0;
}
@@ -494,14 +514,14 @@ static void get_policy_nodemask(struct m
int i;
nodes_clear(*nodes);
- switch (p->policy) {
+ switch (policy_mode(p)) {
case MPOL_BIND:
for (i = 0; p->v.zonelist->zones[i]; i++)
node_set(zone_to_nid(p->v.zonelist->zones[i]),
*nodes);
break;
case MPOL_INTERLEAVE:
- *nodes = p->v.nodes;
+ *nodes = *get_interleave_nodes(p);
break;
case MPOL_PREFERRED:
/*
@@ -566,7 +586,7 @@ static long do_get_mempolicy(int *policy
goto out;
*policy = err;
} else if (pol == current->mempolicy &&
- pol->policy == MPOL_INTERLEAVE) {
+ policy_mode(pol) == MPOL_INTERLEAVE) {
*policy = current->il_next;
} else {
err = -EINVAL;
@@ -1128,7 +1148,7 @@ static struct zonelist *zonelist_policy(
{
int nd;
- switch (policy->policy) {
+ switch (policy_mode(policy)) {
case MPOL_PREFERRED:
nd = policy->v.preferred_node;
if (nd < 0)
@@ -1155,13 +1175,13 @@ static struct zonelist *zonelist_policy(
static unsigned interleave_nodes(struct mempolicy *policy)
{
unsigned nid, next;
- struct task_struct *me = current;
+ nodemask_t *nodes = get_interleave_nodes(policy);
- nid = me->il_next;
- next = next_node(nid, policy->v.nodes);
+ nid = current->il_next;
+ next = next_node(nid, *nodes);
if (next >= MAX_NUMNODES)
- next = first_node(policy->v.nodes);
- me->il_next = next;
+ next = first_node(*nodes);
+ current->il_next = next;
return nid;
}
@@ -1174,7 +1194,7 @@ unsigned slab_node(struct mempolicy *pol
if (!policy)
return numa_node_id();
- switch (policy->policy) {
+ switch (policy_mode(policy)) {
case MPOL_INTERLEAVE:
return interleave_nodes(policy);
@@ -1199,14 +1219,15 @@ unsigned slab_node(struct mempolicy *pol
static unsigned offset_il_node(struct mempolicy *pol,
struct vm_area_struct *vma, unsigned long off)
{
- unsigned nnodes = nodes_weight(pol->v.nodes);
+ nodemask_t *nodes = get_interleave_nodes(pol);
+ unsigned nnodes = nodes_weight(*nodes);
unsigned target = (unsigned)off % nnodes;
int c;
int nid = -1;
c = 0;
do {
- nid = next_node(nid, pol->v.nodes);
+ nid = next_node(nid, *nodes);
c++;
} while (c <= target);
return nid;
@@ -1258,7 +1279,7 @@ struct zonelist *huge_zonelist(struct vm
struct zonelist *zl;
*mpol = NULL; /* probably no unref needed */
- if (pol->policy == MPOL_INTERLEAVE) {
+ if (policy_mode(pol) == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
@@ -1268,7 +1289,7 @@ struct zonelist *huge_zonelist(struct vm
zl = zonelist_policy(GFP_HIGHUSER, pol);
if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
- if (pol->policy != MPOL_BIND)
+ if (policy_mode(pol) != MPOL_BIND)
__mpol_free(pol); /* finished with pol */
else
*mpol = pol; /* unref needed after allocation */
@@ -1322,7 +1343,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
cpuset_update_task_memory_state();
- if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+ if (unlikely(policy_mode(pol) == MPOL_INTERLEAVE)) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
@@ -1370,7 +1391,7 @@ struct page *alloc_pages_current(gfp_t g
cpuset_update_task_memory_state();
if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
pol = &default_policy;
- if (pol->policy == MPOL_INTERLEAVE)
+ if (policy_mode(pol) == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
}
@@ -1415,9 +1436,10 @@ int __mpol_equal(struct mempolicy *a, st
return 0;
if (a->policy != b->policy)
return 0;
- switch (a->policy) {
+ switch (policy_mode(a)) {
case MPOL_INTERLEAVE:
- return nodes_equal(a->v.nodes, b->v.nodes);
+ return a->policy & MPOL_CONTEXT ||
+ nodes_equal(a->v.nodes, b->v.nodes);
case MPOL_PREFERRED:
return a->v.preferred_node == b->v.preferred_node;
case MPOL_BIND: {
@@ -1735,6 +1757,11 @@ static void mpol_rebind_policy(struct me
current->il_next = node_remap(current->il_next,
*mpolmask, *newmask);
break;
+ case MPOL_INTERLEAVE|MPOL_CONTEXT:
+ /*
+ * No remap necessary for contextual interleave
+ */
+ break;
case MPOL_PREFERRED:
/*
* no need to remap "local policy"
@@ -1821,7 +1848,7 @@ static inline int mpol_to_str(char *buff
char *p = buffer;
int nid, l;
nodemask_t nodes;
- int mode = pol ? pol->policy : MPOL_DEFAULT;
+ int mode = pol ? policy_mode(pol) : MPOL_DEFAULT;
switch (mode) {
case MPOL_DEFAULT:
@@ -1846,7 +1873,8 @@ static inline int mpol_to_str(char *buff
break;
case MPOL_INTERLEAVE:
- nodes = pol->v.nodes;
+ nodes = *get_interleave_nodes(pol);
+// TODO: or show indication of context-dependent interleave?
break;
default:
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-08-30 13:36:04.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-08-30 13:38:33.000000000 -0400
@@ -15,6 +15,13 @@
#define MPOL_INTERLEAVE 3
#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MODE 0x0ff /* reserve 8 bits for policy "mode" */
+
+/*
+ * OR'd into struct mempolicy 'policy' member for "context-dependent interleave"
+ * -- i.e., interleave across all nodes allowed in current context.
+ */
+#define MPOL_CONTEXT (1 << 8)
/* Flags for get_mem_policy */
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
@@ -72,6 +79,15 @@ struct mempolicy {
};
/*
+ * Return 'policy' [a.k.a. 'mode'] member of mpol, less CONTEXT
+ * or any other modifiers.
+ */
+static inline int policy_mode(struct mempolicy *mpol)
+{
+ return mpol->policy & MPOL_MODE;
+}
+
+/*
* Support for managing mempolicy data objects (clone, copy, destroy)
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
*/
Index: Linux/Documentation/vm/numa_memory_policy.txt
===================================================================
--- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-08-30 13:36:04.000000000 -0400
+++ Linux/Documentation/vm/numa_memory_policy.txt 2007-08-30 13:36:07.000000000 -0400
@@ -193,7 +193,15 @@ Components of Memory Policies
MPOL_INTERLEAVED: This mode specifies that page allocations be
interleaved, on a page granularity, across the nodes specified in
- the policy. This mode also behaves slightly differently, based on
+ the policy.
+
+ If an empty nodemask is supplied to the MPOL_INTERLEAVED mode via one
+ of the memory policy APIs, the kernel treats this as "contextual
interleave". That is, it will interleave allocations across all nodes
+ that are allowed in the context [cpuset] where the allocation occurs.
+ See the discussion of MEMORY POLICIES AND CPUSETS below.
+
+ The MPOL_INTERLEAVED mode also behaves slightly differently, based on
the context where it is used:
For allocation of anonymous pages and shared memory pages,
@@ -313,4 +321,7 @@ couple of reasons:
the memory policy APIs, as well as knowing in what cpusets other task might
be attaching to the shared region, to use the cpuset information.
Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
- allocation is the only valid policy.
+
+Note, however, that local allocation, whether specified by MPOL_DEFAULT or
+MPOL_PREFERRED with an empty nodemask and "contextual interleave"--
+MPOL_INTERLEAVE with an empty nodemask--are valid policies in any context.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-08-30 18:51 ` [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy Lee Schermerhorn
@ 2007-09-12 21:20 ` Ethan Solomita
2007-09-12 22:14 ` Christoph Lameter
2007-09-13 13:26 ` Lee Schermerhorn
2007-09-12 21:59 ` Ethan Solomita
1 sibling, 2 replies; 76+ messages in thread
From: Ethan Solomita @ 2007-09-12 21:20 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
Hi Lee -- sorry for the delay in responding. Yes, this provides exactly
the feature set I was interested in. One question regarding:
Lee Schermerhorn wrote:
>
> However, this will involve testing possibly several words of
> bitmask in the allocation path. Instead, I chose to encode the
> "context-dependent policy" indication in the upper bits of the
> policy member of the mempolicy structure. This member must
> already be tested to determine the policy mode, so no extra
> memory references should be required. However, for testing the
> policy--e.g., in the several switch() and if() statements--the
> context flag must be masked off using the policy_mode() inline
> function. On the upside, this allows additional flags to be so
> encoded, should that become useful.
Instead of creating MPOL_CONTEXT, did you consider instead creating a
new MPOL for this, such as MPOL_INTERLEAVE_ALL? If the only intended
user of the MPOL_CONTEXT "flag" is just MPOL_INTERLEAVE_ALL, it seems
like you'll have simpler code this way.
-- Ethan
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-12 21:20 ` Ethan Solomita
@ 2007-09-12 22:14 ` Christoph Lameter
2007-09-13 13:26 ` Lee Schermerhorn
1 sibling, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-12 22:14 UTC (permalink / raw)
To: Ethan Solomita
Cc: Lee Schermerhorn, linux-mm, akpm, ak, mtk-manpages, eric.whitney
On Wed, 12 Sep 2007, Ethan Solomita wrote:
> Hi Lee -- sorry for the delay in responding. Yes, this provides
> exactly the feature set I was interested in. One question regarding:
>
> Lee Schermerhorn wrote:
> >
> > However, this will involve testing possibly several words of
> > bitmask in the allocation path. Instead, I chose to encode the
> > "context-dependent policy" indication in the upper bits of the
> > policy member of the mempolicy structure. This member must
> > already be tested to determine the policy mode, so no extra
> > memory references should be required. However, for testing the
> > policy--e.g., in the several switch() and if() statements--the
> > context flag must be masked off using the policy_mode() inline
> > function. On the upside, this allows additional flags to be so
> > encoded, should that become useful.
>
> Instead of creating MPOL_CONTEXT, did you consider instead creating a
> new MPOL for this, such as MPOL_INTERLEAVE_ALL? If the only intended user of
> the MPOL_CONTEXT "flag" is just MPOL_INTERLEAVE_ALL, it seems like you'll have
> simpler code this way.
I think were we are ultimately heading is more memory policies that are
context independent. So it makes sense to add the flags. It would be
good to see a conceptual description how this would all work out in the
end.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-12 21:20 ` Ethan Solomita
2007-09-12 22:14 ` Christoph Lameter
@ 2007-09-13 13:26 ` Lee Schermerhorn
2007-09-13 17:17 ` Ethan Solomita
1 sibling, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 13:26 UTC (permalink / raw)
To: Ethan Solomita
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney,
Mel Gorman
On Wed, 2007-09-12 at 14:20 -0700, Ethan Solomita wrote:
> Hi Lee -- sorry for the delay in responding. Yes, this provides exactly
> the feature set I was interested in. One question regarding:
>
> Lee Schermerhorn wrote:
> >
> > However, this will involve testing possibly several words of
> > bitmask in the allocation path. Instead, I chose to encode the
> > "context-dependent policy" indication in the upper bits of the
> > policy member of the mempolicy structure. This member must
> > already be tested to determine the policy mode, so no extra
> > memory references should be required. However, for testing the
> > policy--e.g., in the several switch() and if() statements--the
> > context flag must be masked off using the policy_mode() inline
> > function. On the upside, this allows additional flags to be so
> > encoded, should that become useful.
>
> Instead of creating MPOL_CONTEXT, did you consider instead creating a
> new MPOL for this, such as MPOL_INTERLEAVE_ALL? If the only intended
> user of the MPOL_CONTEXT "flag" is just MPOL_INTERLEAVE_ALL, it seems
> like you'll have simpler code this way.
I did think about it, and I did see your mail about this. I guess
"simpler code" is in the eye of the beholder. I consider "cpuset
independent interleave" to be an instance of MPOL_INTERLEAVE using a
context dependent nodemask. If we have a separate policy for this
[really should be MPOL_INTERLEAVE_ALLOWED, don't you think?], would we
then want a separate policy for "local preferred"--e.g.,
MPOL_PREFERRED_LOCAL? If we did this internally, I wouldn't want to
expose it via the APIs. We already have an established way to indicate
"local preferred"--the NULL/empty nodemask. Can't break the API, so I
chose to use the same way to indicate "all allowed" interleave.
I agree that the MPOL_CONTEXT flag looks a bit odd [could be renamed
MPOL_ALLOWED?], but the policy_mode() wrapper hides this; and looks OK
to me. Keeps the number of cases in the switch and comparisons to
MPOL_INTERLEAVE the same in most places. Anyway, the MPOL_CONTEXT flag
may go away after Mel Gorman's zonelist patches get merged. We have
some ideas for further work on the policies that may give us a different
way to indicate this. I don't expect the policy patches to proceed
until Mel's patches settle down and get merged. Then we can revisit
this.
Thanks,
Lee
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-13 13:26 ` Lee Schermerhorn
@ 2007-09-13 17:17 ` Ethan Solomita
0 siblings, 0 replies; 76+ messages in thread
From: Ethan Solomita @ 2007-09-13 17:17 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney,
Mel Gorman
Lee Schermerhorn wrote:
>
> I did think about it, and I did see your mail about this. I guess
> "simpler code" is in the eye of the beholder. I consider "cpuset
> independent interleave" to be an instance of MPOL_INTERLEAVE using a
> context dependent nodemask. If we have a separate policy for this
> [really should be MPOL_INTERLEAVE_ALLOWED, don't you think?], would we
> then want a separate policy for "local preferred"--e.g.,
> MPOL_PREFERRED_LOCAL? If we did this internally, I wouldn't want to
> expose it via the APIs. We already have an established way to indicate
> "local preferred"--the NULL/empty nodemask. Can't break the API, so I
> chose to use the same way to indicate "all allowed" interleave.
As long as you expect to add more uses for it, I have no complaint.
-- Ethan
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-08-30 18:51 ` [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy Lee Schermerhorn
2007-09-12 21:20 ` Ethan Solomita
@ 2007-09-12 21:59 ` Ethan Solomita
2007-09-13 13:32 ` Lee Schermerhorn
1 sibling, 1 reply; 76+ messages in thread
From: Ethan Solomita @ 2007-09-12 21:59 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
Lee Schermerhorn wrote:
> - return nodes_equal(a->v.nodes, b->v.nodes);
> + return a->policy & MPOL_CONTEXT ||
> + nodes_equal(a->v.nodes, b->v.nodes);
For the sake of my sanity, can we add () around a->policy &
MPOL_CONTEXT? 8-) This falls into order of precedence that I don't trust
myself to memorize.
-- Ethan
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-12 21:59 ` Ethan Solomita
@ 2007-09-13 13:32 ` Lee Schermerhorn
2007-09-13 17:19 ` Ethan Solomita
` (2 more replies)
0 siblings, 3 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 13:32 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
On Wed, 2007-09-12 at 14:59 -0700, Ethan Solomita wrote:
> Just one code note:
>
> Lee Schermerhorn wrote:
> > - return nodes_equal(a->v.nodes, b->v.nodes);
> > + return a->policy & MPOL_CONTEXT ||
> > + nodes_equal(a->v.nodes, b->v.nodes);
>
> For the sake of my sanity, can we add () around a->policy &
> MPOL_CONTEXT? 8-) This falls into order of precedence that I don't trust
> myself to memorize.
I agree and I would have done that, but then someone would have dinged
me for "unneeded parentheses"--despite the fact that I can't find
anything in the style guide about this [except in the bit about macro
definitions that says to always add parentheses around expressions
defining constants]. Can't win for losin' :-(.
Lee
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-13 13:32 ` Lee Schermerhorn
@ 2007-09-13 17:19 ` Ethan Solomita
2007-09-13 18:20 ` Christoph Lameter
2007-10-09 6:15 ` Ethan Solomita
2 siblings, 0 replies; 76+ messages in thread
From: Ethan Solomita @ 2007-09-13 17:19 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
Lee Schermerhorn wrote:
> On Wed, 2007-09-12 at 14:59 -0700, Ethan Solomita wrote:
>> Just one code note:
>>
>> Lee Schermerhorn wrote:
>>> - return nodes_equal(a->v.nodes, b->v.nodes);
>>> + return a->policy & MPOL_CONTEXT ||
>>> + nodes_equal(a->v.nodes, b->v.nodes);
>> For the sake of my sanity, can we add () around a->policy &
>> MPOL_CONTEXT? 8-) This falls into order of precedence that I don't trust
>> myself to memorize.
>
> I agree and I would have done that, but then someone would have dinged
> me for "unneeded parentheses"--despite the fact that I can't find
> anything in the style guide about this [except in the bit about macro
> definitions that says to always add parentheses around expressions
> defining constants]. Can't win for losin' :-(.
Don't despair! :-)
You could do:
#if CONFIG_ILIKEPARENTHESIS
return (a->policy & MPOL_CONTEXT) ||
#else
return a->policy & MPOL_CONTEXT ||
#endif
-- Ethan
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-13 13:32 ` Lee Schermerhorn
2007-09-13 17:19 ` Ethan Solomita
@ 2007-09-13 18:20 ` Christoph Lameter
2007-10-09 6:15 ` Ethan Solomita
2 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:20 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Ethan Solomita, linux-mm, akpm, ak, mtk-manpages, eric.whitney
On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
> On Wed, 2007-09-12 at 14:59 -0700, Ethan Solomita wrote:
> > Just one code note:
> >
> > Lee Schermerhorn wrote:
> > > - return nodes_equal(a->v.nodes, b->v.nodes);
> > > + return a->policy & MPOL_CONTEXT ||
> > > + nodes_equal(a->v.nodes, b->v.nodes);
> >
> > For the sake of my sanity, can we add () around a->policy &
> > MPOL_CONTEXT? 8-) This falls into order of precedence that I don't trust
> > myself to memorize.
>
> I agree and I would have done that, but then someone would have dinged
> me for "unneeded parentheses"--despite the fact that I can't find
> anything in the style guide about this [except in the bit about macro
> definitions that says to always add parentheses around expressions
> defining constants]. Can't win for losin' :-(.
What Mel suggests is correct.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-09-13 13:32 ` Lee Schermerhorn
2007-09-13 17:19 ` Ethan Solomita
2007-09-13 18:20 ` Christoph Lameter
@ 2007-10-09 6:15 ` Ethan Solomita
2007-10-09 13:39 ` Lee Schermerhorn
2007-10-09 18:49 ` Christoph Lameter
2 siblings, 2 replies; 76+ messages in thread
From: Ethan Solomita @ 2007-10-09 6:15 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
Do we want do_get_mempolicy() to return a policy number with
MPOL_CONTEXT set? That's what's happening with this patch, and I expect
it'll confuse userland apps, e.g. numactl.
-- Ethan
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-10-09 6:15 ` Ethan Solomita
@ 2007-10-09 13:39 ` Lee Schermerhorn
2007-10-09 18:49 ` Christoph Lameter
1 sibling, 0 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-10-09 13:39 UTC (permalink / raw)
To: Ethan Solomita; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, eric.whitney
On Mon, 2007-10-08 at 23:15 -0700, Ethan Solomita wrote:
> Do we want do_get_mempolicy() to return a policy number with
> MPOL_CONTEXT set? That's what's happening with this patch, and I expect
> it'll confuse userland apps, e.g. numactl.
No, that's a bug! I'll make sure I stomp it before next repost.
Thanks,
Lee
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-10-09 6:15 ` Ethan Solomita
2007-10-09 13:39 ` Lee Schermerhorn
@ 2007-10-09 18:49 ` Christoph Lameter
2007-10-09 19:02 ` Lee Schermerhorn
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-10-09 18:49 UTC (permalink / raw)
To: Ethan Solomita
Cc: Lee Schermerhorn, linux-mm, akpm, ak, mtk-manpages, eric.whitney
On Mon, 8 Oct 2007, Ethan Solomita wrote:
> Do we want do_get_mempolicy() to return a policy number with
> MPOL_CONTEXT set? That's what's happening with this patch, and I expect it'll
> confuse userland apps, e.g. numactl.
Do we have a consistent way to deal with MPOL_CONTEXT now? I thought this
was just to test some ideas.
^ permalink raw reply [flat|nested] 76+ messages in thread
* Re: [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy
2007-10-09 18:49 ` Christoph Lameter
@ 2007-10-09 19:02 ` Lee Schermerhorn
0 siblings, 0 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-10-09 19:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Ethan Solomita, linux-mm, akpm, ak, mtk-manpages, eric.whitney
On Tue, 2007-10-09 at 11:49 -0700, Christoph Lameter wrote:
> On Mon, 8 Oct 2007, Ethan Solomita wrote:
>
> > Do we want do_get_mempolicy() to return a policy number with
> > MPOL_CONTEXT set? That's what's happening with this patch, and I expect it'll
> > confuse userland apps, e.g. numactl.
>
> Do we have a consistent way to deal with MPOL_CONTEXT now? I thought this
> was just to test some ideas.
Not sure of your meaning here, Christoph. Ethan did find a bug in my
patch. That WAS an RFC, I believe. I plan on reissuing that patch
along with the other cleanups after Mel's stuff goes in, as I'm
currently testing my patches atop his. I have a couple of other
independent, and more urgent fixes that I'll post as soon as I finish
testing.
Lee
* [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
` (3 preceding siblings ...)
2007-08-30 18:51 ` [PATCH/RFC 4/5] Mem Policy: cpuset-independent interleave policy Lee Schermerhorn
@ 2007-08-30 18:51 ` Lee Schermerhorn
2007-09-11 19:07 ` Mel Gorman
` (2 more replies)
2007-09-11 16:20 ` [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
2007-09-17 19:00 ` [PATCH] Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
6 siblings, 3 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-08-30 18:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, ak, mtk-manpages, clameter, solo, Lee Schermerhorn,
eric.whitney
PATCH/RFC 05/05 - add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Against: 2.6.23-rc3-mm1
Allow an application to query the memories allowed by its context.
Updated numa_memory_policy.txt to mention that applications can use this
to obtain allowed memories for constructing valid policies.
TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g., numa_get_mems_allowed() ?
Tested with memtoy V>=0.13.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Documentation/vm/numa_memory_policy.txt | 28 +++++++++++-----------------
include/linux/mempolicy.h | 1 +
mm/mempolicy.c | 14 +++++++++++++-
3 files changed, 25 insertions(+), 18 deletions(-)
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-08-29 11:44:18.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-08-29 11:45:23.000000000 -0400
@@ -26,6 +26,7 @@
/* Flags for get_mem_policy */
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
#define MPOL_F_ADDR (1<<1) /* look up vma using address */
+#define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-08-29 11:45:09.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-08-29 11:45:23.000000000 -0400
@@ -560,8 +560,20 @@ static long do_get_mempolicy(int *policy
struct mempolicy *pol = current->mempolicy;
cpuset_update_task_memory_state();
- if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
+ if (flags &
+ ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
return -EINVAL;
+
+ if (flags & MPOL_F_MEMS_ALLOWED) {
+ if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
+ return -EINVAL;
+ *policy = 0; /* just so it's initialized */
+ if (!nmask)
+ return -EFAULT;
+ *nmask = cpuset_current_mems_allowed;
+ return 0;
+ }
+
if (flags & MPOL_F_ADDR) {
down_read(&mm->mmap_sem);
vma = find_vma_intersection(mm, addr, addr+1);
Index: Linux/Documentation/vm/numa_memory_policy.txt
===================================================================
--- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:44:18.000000000 -0400
+++ Linux/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:45:23.000000000 -0400
@@ -294,24 +294,20 @@ MEMORY POLICIES AND CPUSETS
Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
-nodes whose memories are allowed by the cpuset constraints. If the
-intersection of the set of nodes specified for the policy and the set of nodes
-allowed by the cpuset is the empty set, the policy is considered invalid and
-cannot be installed.
+nodes whose memories are allowed by the cpuset constraints. If the nodemask
+specified for the policy contains nodes that are not allowed by the cpuset, or
+the intersection of the set of nodes specified for the policy and the set of
+nodes with memory is the empty set, the policy is considered invalid
+and cannot be installed.
The interaction of memory policies and cpusets can be problematic for a
couple of reasons:
-1) the memory policy APIs take physical node id's as arguments. However, the
- memory policy APIs do not provide a way to determine what nodes are valid
- in the context where the application is running. An application MAY consult
- the cpuset file system [directly or via an out of tree, and not generally
- available, libcpuset API] to obtain this information, but then the
- application must be aware that it is running in a cpuset and use what are
- intended primarily as administrative APIs.
-
- However, as long as the policy specifies at least one node that is valid
- in the controlling cpuset, the policy can be used.
+1) the memory policy APIs take physical node id's as arguments. As mentioned
+ above, it is illegal to specify nodes that are not allowed in the cpuset.
+ The application must query the allowed nodes using the get_mempolicy()
+ API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
+ restrict itself to those nodes.
2) when tasks in two cpusets share access to a memory region, such as shared
memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
@@ -321,7 +317,5 @@ couple of reasons:
the memory policy APIs, as well as knowing in what cpusets other tasks might
be attaching to the shared region, to use the cpuset information.
Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
+ allocation and "contextual interleave" are the only valid policies.
-Note, however, that local allocation, whether specified by MPOL_DEFAULT or
-MPOL_PREFERRED with an empty nodemask and "contextual interleave"--
-MPOL_INTERLEAVE with an empty nodemask--are valid policies in any context.
* Re: [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
2007-08-30 18:51 ` [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag Lee Schermerhorn
@ 2007-09-11 19:07 ` Mel Gorman
2007-09-11 18:42 ` Lee Schermerhorn
2007-09-12 22:14 ` Christoph Lameter
2007-09-14 20:24 ` [PATCH] " Lee Schermerhorn
2 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-11 19:07 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> PATCH/RFC 05/05 - add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
>
> Against: 2.6.23-rc3-mm1
>
> Allow an application to query the memories allowed by its context.
>
I think you may be underplaying the significance of this patch here.
From what I understand, an application that is only policy aware can run
inside a cpuset and think it can use nodes it's not allowed. If that is
right, then the language here implies that a policy-aware application
can now get useful information without going through complicated hoops.
That is pretty important.
> Updated numa_memory_policy.txt to mention that applications can use this
> to obtain allowed memories for constructing valid policies.
>
> TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
> wrapper--e.g., numa_get_mems_allowed() ?
>
> Tested with memtoy V>=0.13.
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> Documentation/vm/numa_memory_policy.txt | 28 +++++++++++-----------------
> include/linux/mempolicy.h | 1 +
> mm/mempolicy.c | 14 +++++++++++++-
> 3 files changed, 25 insertions(+), 18 deletions(-)
>
> Index: Linux/include/linux/mempolicy.h
> ===================================================================
> --- Linux.orig/include/linux/mempolicy.h 2007-08-29 11:44:18.000000000 -0400
> +++ Linux/include/linux/mempolicy.h 2007-08-29 11:45:23.000000000 -0400
> @@ -26,6 +26,7 @@
> /* Flags for get_mem_policy */
> #define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
> #define MPOL_F_ADDR (1<<1) /* look up vma using address */
> +#define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
>
> /* Flags for mbind */
> #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-08-29 11:45:09.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-08-29 11:45:23.000000000 -0400
> @@ -560,8 +560,20 @@ static long do_get_mempolicy(int *policy
> struct mempolicy *pol = current->mempolicy;
>
> cpuset_update_task_memory_state();
> - if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> + if (flags &
> + ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
> return -EINVAL;
> +
> + if (flags & MPOL_F_MEMS_ALLOWED) {
> + if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
> + return -EINVAL;
> + *policy = 0; /* just so it's initialized */
> + if (!nmask)
> + return -EFAULT;
> + *nmask = cpuset_current_mems_allowed;
> + return 0;
> + }
> +
Seems a fair implementation.
> if (flags & MPOL_F_ADDR) {
> down_read(&mm->mmap_sem);
> vma = find_vma_intersection(mm, addr, addr+1);
> Index: Linux/Documentation/vm/numa_memory_policy.txt
> ===================================================================
> --- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:44:18.000000000 -0400
> +++ Linux/Documentation/vm/numa_memory_policy.txt 2007-08-29 11:45:23.000000000 -0400
> @@ -294,24 +294,20 @@ MEMORY POLICIES AND CPUSETS
>
> Memory policies work within cpusets as described above. For memory policies
> that require a node or set of nodes, the nodes are restricted to the set of
> -nodes whose memories are allowed by the cpuset constraints. If the
> -intersection of the set of nodes specified for the policy and the set of nodes
> -allowed by the cpuset is the empty set, the policy is considered invalid and
> -cannot be installed.
> +nodes whose memories are allowed by the cpuset constraints. If the nodemask
> +specified for the policy contains nodes that are not allowed by the cpuset, or
> +the intersection of the set of nodes specified for the policy and the set of
> +nodes with memory is the empty set, the policy is considered invalid
> +and cannot be installed.
>
> The interaction of memory policies and cpusets can be problematic for a
> couple of reasons:
>
> -1) the memory policy APIs take physical node id's as arguments. However, the
> - memory policy APIs do not provide a way to determine what nodes are valid
> - in the context where the application is running. An application MAY consult
> - the cpuset file system [directly or via an out of tree, and not generally
> - available, libcpuset API] to obtain this information, but then the
> - application must be aware that it is running in a cpuset and use what are
> - intended primarily as administrative APIs.
> -
> - However, as long as the policy specifies at least one node that is valid
> - in the controlling cpuset, the policy can be used.
> +1) the memory policy APIs take physical node id's as arguments. As mentioned
> + above, it is illegal to specify nodes that are not allowed in the cpuset.
> + The application must query the allowed nodes using the get_mempolicy()
> + API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
> + restrict itself to those nodes.
>
> 2) when tasks in two cpusets share access to a memory region, such as shared
> memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
> @@ -321,7 +317,5 @@ couple of reasons:
> the memory policy APIs, as well as knowing in what cpusets other tasks might
> be attaching to the shared region, to use the cpuset information.
> Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
> + allocation and "contextual interleave" are the only valid policies.
>
> -Note, however, that local allocation, whether specified by MPOL_DEFAULT or
> -MPOL_PREFERRED with an empty nodemask and "contextual interleave"--
> -MPOL_INTERLEAVE with an empty nodemask--are valid policies in any context.
>
* Re: [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
2007-09-11 19:07 ` Mel Gorman
@ 2007-09-11 18:42 ` Lee Schermerhorn
0 siblings, 0 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 18:42 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 20:07 +0100, Mel Gorman wrote:
> On Thu, 2007-08-30 at 14:51 -0400, Lee Schermerhorn wrote:
> > PATCH/RFC 05/05 - add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
> >
> > Against: 2.6.23-rc3-mm1
> >
> > Allow an application to query the memories allowed by its context.
> >
>
> I think you may be underplaying the significance of this patch here.
> From what I understand, an application that is only policy aware can run
> inside a cpuset and think it can use nodes it's not allowed. If that is
> right, then the language here implies that a policy-aware application
> can now get useful information without going through complicated hoops.
> That is pretty important.
I thought so. In my memtoy test program, I tried to find a way to get
this info with just the existing APIs--i.e., without diving into the
cpuset file system [even with library wrappers]--and couldn't. Having
convinced myself that this can't break existing apps--they can't use
undefined flags w/o getting an error--it seemed like the way to go.
>
> > Updated numa_memory_policy.txt to mention that applications can use this
> > to obtain allowed memories for constructing valid policies.
> >
> > TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
> > wrapper--e.g., numa_get_mems_allowed() ?
> >
> > Tested with memtoy V>=0.13.
> >
> > Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
> >
> > Documentation/vm/numa_memory_policy.txt | 28 +++++++++++-----------------
> > include/linux/mempolicy.h | 1 +
> > mm/mempolicy.c | 14 +++++++++++++-
> > 3 files changed, 25 insertions(+), 18 deletions(-)
> >
> > Index: Linux/include/linux/mempolicy.h
> > ===================================================================
> > --- Linux.orig/include/linux/mempolicy.h 2007-08-29 11:44:18.000000000 -0400
> > +++ Linux/include/linux/mempolicy.h 2007-08-29 11:45:23.000000000 -0400
> > @@ -26,6 +26,7 @@
> > /* Flags for get_mem_policy */
> > #define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
> > #define MPOL_F_ADDR (1<<1) /* look up vma using address */
> > +#define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
> >
> > /* Flags for mbind */
> > #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-08-29 11:45:09.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-08-29 11:45:23.000000000 -0400
> > @@ -560,8 +560,20 @@ static long do_get_mempolicy(int *policy
> > struct mempolicy *pol = current->mempolicy;
> >
> > cpuset_update_task_memory_state();
> > - if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> > + if (flags &
> > + ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
> > return -EINVAL;
> > +
> > + if (flags & MPOL_F_MEMS_ALLOWED) {
> > + if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
> > + return -EINVAL;
> > + *policy = 0; /* just so it's initialized */
> > + if (!nmask)
> > + return -EFAULT;
> > + *nmask = cpuset_current_mems_allowed;
> > + return 0;
> > + }
> > +
>
> Seems a fair implementation.
Except that I don't need the test of nmask. This is a lower level
function. sys_get_mempolicy() always passes a non-NULL pointer to an
on-stack nodemask. I realized this mistake after I'd sent the patches.
Fixed in my tree.
Thanks, again,
Lee
* Re: [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
2007-08-30 18:51 ` [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag Lee Schermerhorn
2007-09-11 19:07 ` Mel Gorman
@ 2007-09-12 22:14 ` Christoph Lameter
2007-09-14 20:24 ` [PATCH] " Lee Schermerhorn
2 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-12 22:14 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney
* [PATCH] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
2007-08-30 18:51 ` [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag Lee Schermerhorn
2007-09-11 19:07 ` Mel Gorman
2007-09-12 22:14 ` Christoph Lameter
@ 2007-09-14 20:24 ` Lee Schermerhorn
2007-09-14 20:27 ` Christoph Lameter
2 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:24 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, ak, mtk-manpages, clameter, eric.whitney, Mel Gorman
PATCH add MPOL_F_MEMS_ALLOWED get_mempolicy() flag
Against: 2.6.23-rc4-mm1
V1 -> V2:
+ extracted from earlier mempolicy series as stand alone patch
+ update numa_memory_policy to indicate that cpuset resources can
change after task queries allowed nodes. Suggestion from
Christoph L.
Allow an application to query the memories allowed by its context.
Updated numa_memory_policy.txt to mention that applications can use this
to obtain allowed memories for constructing valid policies.
TODO: update out-of-tree libnuma wrapper[s], or maybe add a new
wrapper--e.g., numa_get_mems_allowed() ?
Also, update numa syscall man pages.
Tested with memtoy V>=0.13.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
V1 was:
Acked-by: Christoph Lameter <clameter@sgi.com>
Documentation/vm/numa_memory_policy.txt | 33 +++++++++++++++-----------------
include/linux/mempolicy.h | 1
mm/mempolicy.c | 12 ++++++++++-
3 files changed, 28 insertions(+), 18 deletions(-)
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-09-14 12:00:38.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-09-14 12:03:12.000000000 -0400
@@ -19,6 +19,7 @@
/* Flags for get_mem_policy */
#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
#define MPOL_F_ADDR (1<<1) /* look up vma using address */
+#define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-09-14 12:00:38.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-09-14 12:03:12.000000000 -0400
@@ -533,8 +533,18 @@ static long do_get_mempolicy(int *policy
struct mempolicy *pol = current->mempolicy;
cpuset_update_task_memory_state();
- if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
+ if (flags &
+ ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR|MPOL_F_MEMS_ALLOWED))
return -EINVAL;
+
+ if (flags & MPOL_F_MEMS_ALLOWED) {
+ if (flags & (MPOL_F_NODE|MPOL_F_ADDR))
+ return -EINVAL;
+ *policy = 0; /* just so it's initialized */
+ *nmask = cpuset_current_mems_allowed;
+ return 0;
+ }
+
if (flags & MPOL_F_ADDR) {
down_read(&mm->mmap_sem);
vma = find_vma_intersection(mm, addr, addr+1);
Index: Linux/Documentation/vm/numa_memory_policy.txt
===================================================================
--- Linux.orig/Documentation/vm/numa_memory_policy.txt 2007-09-12 09:02:50.000000000 -0400
+++ Linux/Documentation/vm/numa_memory_policy.txt 2007-09-14 12:10:30.000000000 -0400
@@ -302,31 +302,30 @@ MEMORY POLICIES AND CPUSETS
Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
-nodes whose memories are allowed by the cpuset constraints. If the
-intersection of the set of nodes specified for the policy and the set of nodes
-allowed by the cpuset is the empty set, the policy is considered invalid and
-cannot be installed.
+nodes whose memories are allowed by the cpuset constraints. If the nodemask
+specified for the policy contains nodes that are not allowed by the cpuset, or
+the intersection of the set of nodes specified for the policy and the set of
+nodes with memory is the empty set, the policy is considered invalid
+and cannot be installed.
The interaction of memory policies and cpusets can be problematic for a
couple of reasons:
-1) the memory policy APIs take physical node id's as arguments. However, the
- memory policy APIs do not provide a way to determine what nodes are valid
- in the context where the application is running. An application MAY consult
- the cpuset file system [directly or via an out of tree, and not generally
- available, libcpuset API] to obtain this information, but then the
- application must be aware that it is running in a cpuset and use what are
- intended primarily as administrative APIs.
-
- However, as long as the policy specifies at least one node that is valid
- in the controlling cpuset, the policy can be used.
+1) the memory policy APIs take physical node id's as arguments. As mentioned
+ above, it is illegal to specify nodes that are not allowed in the cpuset.
+ The application must query the allowed nodes using the get_mempolicy()
+ API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
+ restrict itself to those nodes. However, the resources available to a
+ cpuset can be changed by the system administrator, or a workload manager
+ application, at any time. So, a task may still get errors attempting to
+ specify policy nodes, and must query the allowed memories again.
2) when tasks in two cpusets share access to a memory region, such as shared
memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
MAP_SHARED flags, and any of the tasks install shared policy on the region,
only nodes whose memories are allowed in both cpusets may be used in the
- policies. Again, obtaining this information requires "stepping outside"
- the memory policy APIs, as well as knowing in what cpusets other task might
- be attaching to the shared region, to use the cpuset information.
+ policies. Obtaining this information requires "stepping outside" the
+ memory policy APIs to use the cpuset information and requires that one
+ know in what cpusets other tasks might be attaching to the shared region.
Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
allocation is the only valid policy.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
` (4 preceding siblings ...)
2007-08-30 18:51 ` [PATCH/RFC 5/5] Mem Policy: add MPOL_F_MEMS_ALLOWED get_mempolicy() flag Lee Schermerhorn
@ 2007-09-11 16:20 ` Lee Schermerhorn
2007-09-11 19:12 ` Mel Gorman
2007-09-12 22:17 ` Christoph Lameter
2007-09-17 19:00 ` [PATCH] Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
6 siblings, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 16:20 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, ak, mtk-manpages, clameter, solo, eric.whitney, Mel Gorman
Andi, Christoph, Mel [added to cc]:
Any comments on these patches, posted 30aug? I've rebased to
23-rc4-mm1, but before reposting, I wanted to give you a chance to
comment.
I'm going to add Mel's "one zonelist" series to my mempolicy tree with
these patches and see how that goes. I'll slide Mel's patches in below
these, as it looks like they're closer to acceptance into -mm.
Ethan: I believe that patch #4 provides the cpuset independent
interleave capability that you were looking for. Does this meet your
requirements?
Regards,
Lee
On Thu, 2007-08-30 at 14:50 -0400, Lee Schermerhorn wrote:
> Some of these patches have been previously posted for comment
> individually. The patches also update the numa_memory_policy
> document where applicable. Needed man page updates are flagged.
> I will provide the needed updates for any of the patches that
> are accepted.
>
> Cleanups:
>
> 1) Fix reference counting for shared, vma and other task's
> mempolicies. This was discussed back in late June'07, but
> never went anywhere. Closes possible races and fixes potential
> memory leak for shared policies. Adds code to allocation paths.
>
> Patch does NOT update numa_memory_policy doc--the doc doesn't
> go into that much detail regarding design. Perhaps it should.
>
> 2) use MPOL_PREFERRED with preferred_node = -1 for system default
> local allocation. This removes all usage of MPOL_DEFAULT in
> in-kernel struct mempolicy 'policy' members. MPOL_DEFAULT is
> now an API-only value that requests fall back to the default
> policy for the target context [task or vma/shared policy]. This
> simplifies the description of policies and removes some runtime
> tests in the page allocation paths.
>
> Needs man page update to clarify meaning of MPOL_DEFAULT with
> this patch. Should simplify things a bit.
>
> 3) cleanup MPOL_PREFERRED "local allocation" handling -- i.e., when
> preferred_node == -1.
>
> Needs man page update to clarify returned nodemask when
> MPOL_PREFERRED policy specifies local allocation.
>
> Enhancements:
>
> 4) cpuset-independent [a.k.a. "contextual"] interleave policy: NULL
> or empty nodemask to mempolicy API [set_mempolicy() and mbind()]
> now means "interleave over all permitted nodes in allocation
> context".
>
> Needs man page update to describe contextual interleave--how to
> specify, behavior, ...
>
> 5) add MPOL_F_MEMS_ALLOWED flag for get_mempolicy(). Allows an
> application to query the valid nodes to avoid EINVAL errors when
> attempting to install memory policies from within a memory
> constrained cpuset.
>
> Needs man page update to describe flag, behavior.
> Could also use libnuma update -- e.g., new numa_mems_allowed()
>
> Testing:
>
> I've run with these patches for the past few weeks. Some moderate
> stress testing and functional testing on an ia64 NUMA platform, shows
> no issues nor regression. memtoy >= 0.13 supports MPOL_F_MEMS_ALLOWED
> flag [mems command].
>
> Some of the patches [ref count fix, contextual interleave] do add
> code in some of the allocation paths. I hope to get some time in
> the next month on a terabyte system [~64 million 16KB pages] to
> measure the overhead of these patches allocating and migrating a few
> million pages to expose any increased overhead.
>
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-11 16:20 ` [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
@ 2007-09-11 19:12 ` Mel Gorman
2007-09-11 18:45 ` Lee Schermerhorn
2007-09-12 22:17 ` Christoph Lameter
1 sibling, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-11 19:12 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 12:20 -0400, Lee Schermerhorn wrote:
> Andi, Christoph, Mel [added to cc]:
>
> Any comments on these patches, posted 30aug? I've rebased to
> 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> comment.
>
I hadn't intended to comment but because you asked, I took a look
through. It wasn't an in-depth review but nothing jumped out as broken
to me and I commented on what I spotted. The last patch to me was the
most interesting and justifies the set unless someone can think of a
real reason to not extend the get_mempolicy() API to retrieve this
information. I made comments on what I saw, but as I'm not a frequent
user of policies, take the suggestions with a grain of salt.
Unless something jumps out to someone else, I think it'll be ready for
wider testing after your next release.
> I'm going to add Mel's "one zonelist" series to my mempolicy tree with
> these patches and see how that goes. I'll slide Mel's patches in below
> these, as it looks like they're closer to acceptance into -mm.
>
> Ethan: I believe that patch #4 provides the cpuset independent
> interleave capability that you were looking for. Does this meet your
> requirements?
>
> Regards,
> Lee
>
>
>
> On Thu, 2007-08-30 at 14:50 -0400, Lee Schermerhorn wrote:
> > Some of these patches have been previously posted for comment
> > individually. The patches also update the numa_memory_policy
> > document where applicable. Needed man page updates are flagged.
> > I will provide the needed updates for any of the patches that
> > are accepted.
> >
> > Cleanups:
> >
> > 1) Fix reference counting for shared, vma and other task's
> > mempolicies. This was discussed back in late June'07, but
> > never went anywhere. Closes possible races and fixes potential
> > memory leak for shared policies. Adds code to allocation paths.
> >
> > Patch does NOT update numa_memory_policy doc--the doc doesn't
> > go into that much detail regarding design. Perhaps it should.
> >
> > 2) use MPOL_PREFERRED with preferred_node = -1 for system default
> > local allocation. This removes all usage of MPOL_DEFAULT in
> > in-kernel struct mempolicy 'policy' members. MPOL_DEFAULT is
> > now an API-only value that requests fall back to the default
> > policy for the target context [task or vma/shared policy]. This
> > simplifies the description of policies and removes some runtime
> > tests in the page allocation paths.
> >
> > Needs man page update to clarify meaning of MPOL_DEFAULT with
> > this patch. Should simplify things a bit.
> >
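The local-allocation convention can be modelled in a few lines. This is a sketch with invented names (only `MPOL_PREFERRED`'s uapi value is real; `preferred_nid` and the toy struct are assumptions of this illustration):

```c
#include <assert.h>

#define MPOL_PREFERRED 1       /* matches the kernel's uapi value */

struct mempolicy {
    int mode;
    int preferred_node;        /* -1 means "local allocation" */
};

/* System default policy becomes: prefer the local node. */
static const struct mempolicy default_policy = {
    .mode = MPOL_PREFERRED,
    .preferred_node = -1,
};

/*
 * Resolve the node to allocate from.  With preferred_node == -1 the
 * allocation falls back to the node the task is running on, so the
 * allocation path no longer needs a run-time test for MPOL_DEFAULT.
 */
static int preferred_nid(const struct mempolicy *pol, int local_node)
{
    return pol->preferred_node < 0 ? local_node : pol->preferred_node;
}
```

With this convention, MPOL_DEFAULT survives only at the API boundary as "fall back to the surrounding context's policy".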
> > 3) cleanup MPOL_PREFERRED "local allocation" handling -- i.e., when
> > preferred_node == -1.
> >
> > Needs man page update to clarify returned nodemask when
> > MPOL_PREFERRED policy specifies local allocation.
> >
> > Enhancements:
> >
> > 4) cpuset-independent [a.k.a. "contextual"] interleave policy: NULL
> > or empty nodemask to mempolicy API [set_mempolicy() and mbind()]
> > now means "interleave over all permitted nodes in allocation
> > context".
> >
> > Needs man page update to describe contextual interleave--how to
> > specify, behavior, ...
> >
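The "interleave over all permitted nodes in allocation context" semantics above can be sketched as: derive the target node from the nodes permitted at allocation time. This is a user-space model with a toy bitmask nodemask, not the kernel's implementation; all helper names are invented:

```c
#include <assert.h>

/* A nodemask is a simple bitmask in this sketch (bit n = node n). */
typedef unsigned long nodemask_t;

/* Number of nodes in the mask (cf. the kernel's nodes_weight()). */
static int mask_weight(nodemask_t m)
{
    int n = 0;
    for (; m; m &= m - 1)
        n++;
    return n;
}

/* Node number of the nth set bit of 'allowed'. */
static int nth_node(nodemask_t allowed, int n)
{
    int node = 0;
    for (;; node++)
        if (((allowed >> node) & 1) && n-- == 0)
            return node;
}

/*
 * Contextual interleave: the target node is computed from the nodes
 * permitted at *allocation* time, so the policy keeps working when
 * the task's cpuset grows, shrinks, or moves.
 */
static int interleave_node(nodemask_t mems_allowed, unsigned long counter)
{
    return nth_node(mems_allowed, counter % mask_weight(mems_allowed));
}
```

The point of evaluating the mask late is visible in the model: a policy installed with an empty/NULL nodemask never goes stale when mems_allowed changes.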
> > 5) add MPOL_F_MEMS_ALLOWED flag for get_mempolicy(). Allows an
> > application to query the valid nodes to avoid EINVAL errors when
> > attempting to install memory policies from within a memory
> > constrained cpuset.
> >
> > Needs man page update to describe flag, behavior.
> > Could also use libnuma update -- e.g., new numa_mems_allowed()
> >
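The contract of the new flag can be illustrated with a user-space model — not the kernel implementation, and `get_mempolicy_model` plus the stand-in mask are invented for the sketch (only the `MPOL_F_MEMS_ALLOWED` bit value comes from the patch):

```c
#include <assert.h>
#include <stddef.h>

#define MPOL_F_MEMS_ALLOWED (1 << 2)   /* flag value added by the patch */

typedef unsigned long nodemask_t;

/* Stand-in for the calling task's cpuset-permitted nodes (0 and 2). */
static nodemask_t current_mems_allowed = 0x5;

/*
 * Model of the extended get_mempolicy() contract: with
 * MPOL_F_MEMS_ALLOWED set, the policy mode is ignored and the task's
 * allowed nodes are returned in *nodemask, so an application can
 * avoid EINVAL when installing a policy from inside a
 * memory-constrained cpuset.
 */
static int get_mempolicy_model(int *mode, nodemask_t *nodemask,
                               unsigned long flags)
{
    (void)mode;                 /* ignored when querying mems allowed */
    if (flags & MPOL_F_MEMS_ALLOWED) {
        if (nodemask)
            *nodemask = current_mems_allowed;
        return 0;
    }
    return -1;                  /* other query modes elided in this sketch */
}
```

An application would query this mask first, then intersect its desired nodes with it before calling mbind()/set_mempolicy().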
> > Testing:
> >
> > I've run with these patches for the past few weeks. Some moderate
> > stress testing and functional testing on an ia64 NUMA platform shows
> > no issues or regressions. memtoy >= 0.13 supports MPOL_F_MEMS_ALLOWED
> > flag [mems command].
> >
> > Some of the patches [ref count fix, contextual interleave] do add
> > code in some of the allocation paths. I hope to get some time in
> > the next month on a terabyte system [~64 million 16KB pages] to
> > measure the overhead of these patches allocating and migrating a few
> > million pages to expose any increased overhead.
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-11 19:12 ` Mel Gorman
@ 2007-09-11 18:45 ` Lee Schermerhorn
0 siblings, 0 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-11 18:45 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, akpm, ak, mtk-manpages, clameter, solo, eric.whitney
On Tue, 2007-09-11 at 20:12 +0100, Mel Gorman wrote:
> On Tue, 2007-09-11 at 12:20 -0400, Lee Schermerhorn wrote:
> > Andi, Christoph, Mel [added to cc]:
> >
> > Any comments on these patches, posted 30aug? I've rebased to
> > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > comment.
> >
>
> I hadn't intended to comment but because you asked, I took a look
> through. It wasn't an in-depth review but nothing jumped out as broken
> to me and I commented on what I spotted. The last patch to me was the
> most interesting and justifies the set unless someone can think of a
> real reason to not extend the get_mempolicy() API to retrieve this
> information. I made comments on what I saw, but as I'm not a frequent
> user of policies, take the suggestions with a grain of salt.
>
> Unless something jumps out to someone else, I think it'll be ready for
> wider testing after your next release.
>
> > I'm going to add Mel's "one zonelist" series to my mempolicy tree with
> > these patches and see how that goes. I'll slide Mel's patches in below
> > these, as it looks like they're closer to acceptance into -mm.
> >
Thanks, again, Mel. As I mentioned in today's "ping", I'm going to try
to merge this with your patches and wanted to give you a heads up. The
patches will collide--in code, as well as comments, I think.
Later,
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-11 16:20 ` [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
2007-09-11 19:12 ` Mel Gorman
@ 2007-09-12 22:17 ` Christoph Lameter
2007-09-13 13:57 ` Lee Schermerhorn
2007-09-13 15:49 ` Lee Schermerhorn
1 sibling, 2 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-12 22:17 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
> Andi, Christoph, Mel [added to cc]:
>
> Any comments on these patches, posted 30aug? I've rebased to
> 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> comment.
Sorry that it took some time, but I only just got around to looking at them.
The one patch that I acked may be of higher priority and should probably
go in immediately to be merged for 2.6.24.
> I'm going to add Mel's "one zonelist" series to my mempolicy tree with
> these patches and see how that goes. I'll slide Mel's patches in below
> these, as it looks like they're closer to acceptance into -mm.
That patchset will have a significant impact on yours. You may be able to
get rid of some of the switch statements. It would be great if we had some
description as to where you are heading with the incremental changes to
the memory policy semantics. I sure wish we would have something more
consistent and easier to understand.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-12 22:17 ` Christoph Lameter
@ 2007-09-13 13:57 ` Lee Schermerhorn
2007-09-13 15:31 ` Mel Gorman
2007-09-13 18:19 ` Christoph Lameter
2007-09-13 15:49 ` Lee Schermerhorn
1 sibling, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 13:57 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Wed, 2007-09-12 at 15:17 -0700, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
>
> > Andi, Christoph, Mel [added to cc]:
> >
> > Any comments on these patches, posted 30aug? I've rebased to
> > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > comment.
>
> Sorry that it took some time, but I only just got around to looking at them.
> The one patch that I acked may be of higher priority and should probably
> go in immediately to be merged for 2.6.24.
OK. I'll pull that from the set and post it separately. I'll see if it
conflicts with Mel's set. If so, we'll need to decide on the ordering.
Do we think Mel's patches will make .24?
>
> > I'm going to add Mel's "one zonelist" series to my mempolicy tree with
> > these patches and see how that goes. I'll slide Mel's patches in below
> > these, as it looks like they're closer to acceptance into -mm.
>
> That patchset will have a significant impact on yours. You may be able to
> get rid of some of the switch statements. It would be great if we had some
> description as to where you are heading with the incremental changes to
> the memory policy semantics. I sure wish we would have something more
> consistent and easier to understand.
The general reaction to such descriptions is "show me the code." So, if
we agree that Mel's patches should go first, I'll rebase and update the
numa_memory_policy doc accordingly to explain the resulting semantics.
Perhaps Mel should consider updating that document where his patches
change/invalidate the current descriptions.
Does this sound like a reasonable way to proceed?
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 13:57 ` Lee Schermerhorn
@ 2007-09-13 15:31 ` Mel Gorman
2007-09-13 15:01 ` Lee Schermerhorn
2007-09-13 18:19 ` Christoph Lameter
1 sibling, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 15:31 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, linux-mm, akpm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 2007-09-13 at 09:57 -0400, Lee Schermerhorn wrote:
> On Wed, 2007-09-12 at 15:17 -0700, Christoph Lameter wrote:
> > On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
> >
> > > Andi, Christoph, Mel [added to cc]:
> > >
> > > Any comments on these patches, posted 30aug? I've rebased to
> > > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > > comment.
> >
> > Sorry that it took some time, but I only just got around to looking at them.
> > The one patch that I acked may be of higher priority and should probably
> > go in immediately to be merged for 2.6.24.
>
> OK. I'll pull that from the set and post it separately. I'll see if it
> conflicts with Mel's set. If so, we'll need to decide on the ordering.
> Do we think Mel's patches will make .24?
>
I am hoping they will. They remove the nasty hack in relation to
MPOL_BIND applying to the top two highest zones when ZONE_MOVABLE is
configured and let MPOL_BIND use a node-local policy which I feel is
important. There hasn't been a consensus on this yet and I would expect
that to be hashed out at the start of the merge window as usual.
> >
> > > I'm going to add Mel's "one zonelist" series to my mempolicy tree with
> > > these patches and see how that goes. I'll slide Mel's patches in below
> > > these, as it looks like they're closer to acceptance into -mm.
> >
> > That patchset will have a significant impact on yours. You may be able to
> > get rid of some of the switch statements. It would be great if we had some
> > description as to where you are heading with the incremental changes to
> > > the memory policy semantics. I sure wish we would have something more
> > consistent and easier to understand.
>
> The general reaction to such descriptions is "show me the code." So, if
> we agree that Mel's patches should go first, I'll rebase and update the
> numa_memory_policy doc accordingly to explain the resulting semantics.
> Perhaps Mel should consider updating that document where his patches
> change/invalidate the current descriptions.
>
Yes, I should be. When the patches finalise (any day now *sigh*), I'll
reread the documentation and see what I have affected and send a
follow-up patch.
> Does this sound like a reasonable way to proceed?
>
Yes.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 15:31 ` Mel Gorman
@ 2007-09-13 15:01 ` Lee Schermerhorn
2007-09-13 18:55 ` Mel Gorman
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 15:01 UTC (permalink / raw)
To: Mel Gorman
Cc: Christoph Lameter, linux-mm, akpm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 2007-09-13 at 16:31 +0100, Mel Gorman wrote:
> On Thu, 2007-09-13 at 09:57 -0400, Lee Schermerhorn wrote:
> > On Wed, 2007-09-12 at 15:17 -0700, Christoph Lameter wrote:
> > > On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
> > >
> > > > Andi, Christoph, Mel [added to cc]:
> > > >
> > > > Any comments on these patches, posted 30aug? I've rebased to
> > > > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > > > comment.
> > >
> > > Sorry that it took some time, but I only just got around to looking at them.
> > > The one patch that I acked may be of higher priority and should probably
> > > go in immediately to be merged for 2.6.24.
> >
> > OK. I'll pull that from the set and post it separately. I'll see if it
> > conflicts with Mel's set. If so, we'll need to decide on the ordering.
> > Do we think Mel's patches will make .24?
> >
>
> I am hoping they will. They remove the nasty hack in relation to
> MPOL_BIND applying to the top two highest zones when ZONE_MOVABLE is
> configured and let MPOL_BIND use a node-local policy which I feel is
> important. There hasn't been a consensus on this yet and I would expect
> that to be hashed out at the start of the merge window as usual.
OK. I haven't had a chance to check your patch for the mem controller
issue yet. Should I test with your latest series that uses the struct
containing the zone id? I know that Christoph had concerns about
increasing the number of cache lines. Which way do you think we'll go
on this?
<snip>
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 15:01 ` Lee Schermerhorn
@ 2007-09-13 18:55 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 18:55 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Christoph Lameter, linux-mm, akpm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 2007-09-13 at 11:01 -0400, Lee Schermerhorn wrote:
> On Thu, 2007-09-13 at 16:31 +0100, Mel Gorman wrote:
> > On Thu, 2007-09-13 at 09:57 -0400, Lee Schermerhorn wrote:
> > > On Wed, 2007-09-12 at 15:17 -0700, Christoph Lameter wrote:
> > > > On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
> > > >
> > > > > Andi, Christoph, Mel [added to cc]:
> > > > >
> > > > > Any comments on these patches, posted 30aug? I've rebased to
> > > > > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > > > > comment.
> > > >
> > > > Sorry that it took some time, but I only just got around to looking at them.
> > > > The one patch that I acked may be of higher priority and should probably
> > > > go in immediately to be merged for 2.6.24.
> > >
> > > OK. I'll pull that from the set and post it separately. I'll see if it
> > > conflicts with Mel's set. If so, we'll need to decide on the ordering.
> > > Do we think Mel's patches will make .24?
> > >
> >
> > I am hoping they will. They remove the nasty hack in relation to
> > MPOL_BIND applying to the top two highest zones when ZONE_MOVABLE is
> > configured and let MPOL_BIND use a node-local policy which I feel is
> > important. There hasn't been a consensus on this yet and I would expect
> > that to be hashed out at the start of the merge window as usual.
>
> OK. I haven't had a chance to check your patch for the mem controller
> issue yet. Should I test with your latest series that uses the struct
> containing the zone id? I know that Christoph had concerns about
> increasing the number of cache lines. Which way do you think we'll go
> on this?
>
I think we'll be going with the struct with zone id for the moment.
Christoph's concerns are valid but it's an easier starting point for
trying out optimisations in the page allocator path. We may end up doing
the pointer packing after a while if there are no better optimisations
but we should have a flexible starting point.
I've sent you V7 so hopefully it'll pass your tests.
Thanks
--
Mel Gorman
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 13:57 ` Lee Schermerhorn
2007-09-13 15:31 ` Mel Gorman
@ 2007-09-13 18:19 ` Christoph Lameter
2007-09-13 18:23 ` Mel Gorman
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:19 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
> Do we think Mel's patches will make .24?
No.
> > That patchset will have a significant impact on yours. You may be able to
> > get rid of some of the switch statements. It would be great if we had some
> > description as to where you are heading with the incremental changes to
> > the memory policy semantics. I sure wish we would have something more
> > consistent and easier to understand.
>
> The general reaction to such descriptions is "show me the code." So, if
> we agree that Mel's patches should go first, I'll rebase and update the
> numa_memory_policy doc accordingly to explain the resulting semantics.
> Perhaps Mel should consider updating that document where his patches
> change/invalidate the current descriptions.
>
> Does this sound like a reasonable way to proceed?
Well, I am not the "show me the code" type. I'd like to have some
documentation first on how this would all work together with page
migration, cpusets, and the use of various memory policies.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 18:19 ` Christoph Lameter
@ 2007-09-13 18:23 ` Mel Gorman
2007-09-13 18:26 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-13 18:23 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-mm, akpm, ak, mtk-manpages, solo,
eric.whitney
On (13/09/07 11:19), Christoph Lameter didst pronounce:
> On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
>
> > Do we think Mel's patches will make .24?
>
> No,
>
What do you see holding it up? Is it the fact we are no longer doing the
pointer packing and you don't want that structure to exist, or is it simply
a case that 2.6.23 is too close to the door and it won't get adequate
coverage in -mm?
> <snip>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 18:23 ` Mel Gorman
@ 2007-09-13 18:26 ` Christoph Lameter
2007-09-13 21:17 ` Andrew Morton
0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, linux-mm, akpm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 13 Sep 2007, Mel Gorman wrote:
> What do you see holding it up? Is it the fact we are no longer doing the
> pointer packing and you don't want that structure to exist, or is it simply
> a case that 2.6.23 is too close to the door and it won't get adequate
> coverage in -mm?
No, it's not the pointer packing. The problem is that the patches have not
been merged yet and 2.6.23 is close. We would need to merge it very soon
and get some exposure in mm. Andrew?
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 18:26 ` Christoph Lameter
@ 2007-09-13 21:17 ` Andrew Morton
2007-09-14 2:20 ` Christoph Lameter
2007-09-14 8:53 ` Mel Gorman
0 siblings, 2 replies; 76+ messages in thread
From: Andrew Morton @ 2007-09-13 21:17 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Lee Schermerhorn, linux-mm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Thu, 13 Sep 2007, Mel Gorman wrote:
>
> > What do you see holding it up? Is it the fact we are no longer doing the
> > pointer packing and you don't want that structure to exist, or is it simply
> > a case that 2.6.23 is too close to the door and it won't get adequate
> > coverage in -mm?
>
> No, it's not the pointer packing. The problem is that the patches have not
> been merged yet and 2.6.23 is close. We would need to merge it very soon
> and get some exposure in mm. Andrew?
You rang?
To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
That's still in my queue somewhere, but a) it has "RFC" in it which usually
makes me run away and b) we already have no fewer than 221 memory
management patches queued.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 21:17 ` Andrew Morton
@ 2007-09-14 2:20 ` Christoph Lameter
2007-09-14 8:53 ` Mel Gorman
1 sibling, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-14 2:20 UTC (permalink / raw)
To: Andrew Morton
Cc: Mel Gorman, Lee Schermerhorn, linux-mm, ak, mtk-manpages, solo,
eric.whitney
On Thu, 13 Sep 2007, Andrew Morton wrote:
> On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Thu, 13 Sep 2007, Mel Gorman wrote:
> >
> > > What do you see holding it up? Is it the fact we are no longer doing the
> > > pointer packing and you don't want that structure to exist, or is it simply
> > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > coverage in -mm?
> >
> > No, it's not the pointer packing. The problem is that the patches have not
> > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > and get some exposure in mm. Andrew?
>
> You rang?
>
> To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> That's still in my queue somewhere, but a) it has "RFC" in it which usually
> makes me run away and b) we already have no fewer than 221 memory
> management patches queued.
To Mel's one zonelist patchset that is supposed to supplant the numa
policy "hack" merged in .23.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 21:17 ` Andrew Morton
2007-09-14 2:20 ` Christoph Lameter
@ 2007-09-14 8:53 ` Mel Gorman
2007-09-14 15:06 ` Lee Schermerhorn
2007-09-14 20:15 ` Lee Schermerhorn
1 sibling, 2 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-14 8:53 UTC (permalink / raw)
To: Andrew Morton
Cc: Christoph Lameter, Lee Schermerhorn, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On (13/09/07 14:17), Andrew Morton didst pronounce:
> On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Thu, 13 Sep 2007, Mel Gorman wrote:
> >
> > > What do you see holding it up? Is it the fact we are no longer doing the
> > > pointer packing and you don't want that structure to exist, or is it simply
> > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > coverage in -mm?
> >
> > No, it's not the pointer packing. The problem is that the patches have not
> > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > and get some exposure in mm. Andrew?
>
> You rang?
>
> To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> That's still in my queue somewhere, but a) it has "RFC" in it which usually
> makes me run away and b) we already have no fewer than 221 memory
> management patches queued.
>
Christoph's question is in relation to the patchset "Use one zonelist per
node instead of multiple zonelists v7" and whether one zonelist will be
merged in 2.6.24 in your opinion. I am hoping "yes" because it removes that
hack with ZONE_MOVABLE and policies. I had sent you a version (v5) but there
were further suggestions on ways to improve it so we're up to v7 now. Lee
will hopefully be able to determine if v7 regresses policy behaviour or not.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 8:53 ` Mel Gorman
@ 2007-09-14 15:06 ` Lee Schermerhorn
2007-09-14 17:46 ` Mel Gorman
2007-09-14 20:15 ` Lee Schermerhorn
1 sibling, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 15:06 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On Fri, 2007-09-14 at 09:53 +0100, Mel Gorman wrote:
> On (13/09/07 14:17), Andrew Morton didst pronounce:
> > On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Thu, 13 Sep 2007, Mel Gorman wrote:
> > >
> > > > What do you see holding it up? Is it the fact we are no longer doing the
> > > > pointer packing and you don't want that structure to exist, or is it simply
> > > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > > coverage in -mm?
> > >
> > > No, it's not the pointer packing. The problem is that the patches have not
> > > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > > and get some exposure in mm. Andrew?
> >
> > You rang?
> >
> > To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> > That's still in my queue somewhere, but a) it has "RFC" in it which usually
> > makes me run away and b) we already have no fewer than 221 memory
> > management patches queued.
> >
>
> Christoph's question is in relation to the patchset "Use one zonelist per
> node instead of multiple zonelists v7" and whether one zonelist will be
> merged in 2.6.24 in your opinion. I am hoping "yes" because it removes that
> hack with ZONE_MOVABLE and policies. I had sent you a version (v5) but there
> were further suggestions on ways to improve it so we're up to v7 now. Lee
> will hopefully be able to determine if v7 regresses policy behaviour or not.
>
Hi, Mel:
I'm running with your patches now. An earlier version--just received v7
end of day yesterday. Will rebuild today. I've been using the kernel
with your patches for general patch development and kernel building on
my ia64 numa platform. Before I rebooted to test another kernel
[reclaim scalability/noreclaim patch set], your mail prompted me to try
a couple of memtoy migration scripts. I managed to hang/panic the
system with a null pointer deref and a very interesting stack trace,
which I didn't capture [want to test w/ v7]. The trace included some
kprobes functions--which I'm not using--and a lot of tcp/network stack
routines. I have no clue whether these are related to your patches.
The hang that I experienced before the panic could have been a local
site network glitch [happens, sometimes] that triggered a fault in the
network stack.
Again, I'll retest with the v7 patches today. In the meantime, you
might want to grab memtoy from:
http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz
and try out the test scripts in the Xpm-tests/Mbind directory. Note
that these scripts assume a 4-node numa system, but from the comments
you should be able to mod them for whatever numa system you have
available. Building memtoy can also be a bit of a challenge, depending
on your environment--no autoconfig or such. Check out the README for
instructions/caveats/...
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 15:06 ` Lee Schermerhorn
@ 2007-09-14 17:46 ` Mel Gorman
2007-09-14 18:41 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-14 17:46 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Mel Gorman, Andrew Morton, Christoph Lameter, linux-mm, ak,
mtk-manpages, solo, eric.whitney
On Fri, 2007-09-14 at 11:06 -0400, Lee Schermerhorn wrote:
> On Fri, 2007-09-14 at 09:53 +0100, Mel Gorman wrote:
> > On (13/09/07 14:17), Andrew Morton didst pronounce:
> > > On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> > > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > > On Thu, 13 Sep 2007, Mel Gorman wrote:
> > > >
> > > > > What do you see holding it up? Is it the fact we are no longer doing the
> > > > > pointer packing and you don't want that structure to exist, or is it simply
> > > > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > > > coverage in -mm?
> > > >
> > > > No, it's not the pointer packing. The problem is that the patches have not
> > > > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > > > and get some exposure in mm. Andrew?
> > >
> > > You rang?
> > >
> > > To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> > > That's still in my queue somewhere, but a) it has "RFC" in it which usually
> > > makes me run away and b) we already have no fewer than 221 memory
> > > management patches queued.
> > >
> >
> > Christoph's question is in relation to the patchset "Use one zonelist per
> > node instead of multiple zonelists v7" and whether one zonelist will be
> > merged in 2.6.24 in your opinion. I am hoping "yes" because it removes that
> > hack with ZONE_MOVABLE and policies. I had sent you a version (v5) but there
> > were further suggestions on ways to improve it so we're up to v7 now. Lee
> > will hopefully be able to determine if v7 regresses policy behaviour or not.
> >
>
> Hi, Mel:
>
> I'm running with your patches now. An earlier version--just received v7
> end of day yesterday. Will rebuild today. I've been using the kernel
> with your patches for general patch development and kernel building on
> my ia64 numa platform. Before I rebooted to test another kernel
> [reclaim scalability/noreclaim patch set], your mail prompted me to try
> a couple of memtoy migration scripts. I managed to hang/panic the
> system with a null pointer deref and a very interesting stack trace,
> which I didn't capture [want to test w/ v7].
Very uncool. If I am null dereferencing anywhere, it's going to be in
the iterator, so check for get_page_from_freelist() in the stack.
> The trace included some
> kprobes functions--which I'm not using--and a lot of tcp/network stack
> routines. I have no clue whether these are related to your patches.
> The hang that I experienced before the panic could have been a local
> site network glitch [happens, sometimes] that triggered a fault in the
> network stack.
>
How bizarre.
> Again, I'll retest with the v7 patches today. In the meantime, you
> might want to grab memtoy from:
> http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz
> and try out the test scripts in the Xpm-tests/Mbind directory. Note
> that these scripts assume a 4-node numa system, but from the comments
> you should be able to mod them for whatever numa system you have
> available. Building memtoy can also be a bit of a challenge, depending
> on your environment--no autoconfig or such. Check out the README for
> instructions/caveats/...
>
Ok, I'll give it a shot and see how I get on. Thanks Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 17:46 ` Mel Gorman
@ 2007-09-14 18:41 ` Christoph Lameter
2007-09-16 18:02 ` Mel Gorman
0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-14 18:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, Andrew Morton, linux-mm, ak, mtk-manpages, solo,
eric.whitney, Kamezawa Hiroyuki
Since you are going to rework the one zonelist patch again anyway I would
suggest to either use Kame-san full approach and include the node in the
zonelist item structure (it does not add any memory since the struct is
word aligned and this is an int) or go back to the earlier approach of
packing the zone id into the low bits which would reduce the cache
footprint.
Kame-san's approach is likely very useful if we have a lot of nodes and
need to match a nodemask to a zonelist.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 18:41 ` Christoph Lameter
@ 2007-09-16 18:02 ` Mel Gorman
2007-09-17 18:12 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-16 18:02 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, Andrew Morton, linux-mm, ak, mtk-manpages, solo,
eric.whitney, Kamezawa Hiroyuki
On (14/09/07 11:41), Christoph Lameter didst pronounce:
> Since you are going to rework the one zonelist patch again anyway
I only intend to rework the patches again if there is a known problem
with them. In this case, that means that Lee finds a functional problem
or a performance problem is found.
> I would
> suggest to either use Kame-san full approach and include the node in the
> zonelist item structure (it does not add any memory since the struct is
> word aligned and this is an int)
It increases the size on 32-bit NUMA, which we've established is not
confined to the NUMAQ. I think it's best to evaluate adding the node
separately at a later time.
> or go back to the earlier approach of
> packing the zone id into the low bits which would reduce the cache
> footprint.
>
If adding the node does not work out, I will re-evaluate this approach.
> Kame-san's approach is likely very useful if we have a lot of nodes and
> need to match a nodemask to a zonelist.
>
I agree but I don't have the tools to prove it yet.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-16 18:02 ` Mel Gorman
@ 2007-09-17 18:12 ` Christoph Lameter
2007-09-17 18:19 ` Christoph Lameter
2007-09-17 20:03 ` Mel Gorman
0 siblings, 2 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 18:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, Andrew Morton, linux-mm, ak, mtk-manpages, solo,
eric.whitney, Kamezawa Hiroyuki
On Sun, 16 Sep 2007, Mel Gorman wrote:
> It increases the size on 32-bit NUMA, which we've established is not
> confined to the NUMAQ. I think it's best to evaluate adding the node
> separately at a later time.
Na... 32 bit NUMA is not that important.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-17 18:12 ` Christoph Lameter
@ 2007-09-17 18:19 ` Christoph Lameter
2007-09-17 20:14 ` Mel Gorman
2007-09-17 20:03 ` Mel Gorman
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 18:19 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, Andrew Morton, linux-mm, ak, mtk-manpages, solo,
eric.whitney, Kamezawa Hiroyuki
On Mon, 17 Sep 2007, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Mel Gorman wrote:
>
> > It increases the size on 32 bit NUMA which we've established is not
> > confined to the NUMAQ. I think it's best to evaluate adding the node
> > separetly at a later time.
>
> Na... 32 bit NUMA is not that important.
>
Oh, and another thing: if you make both the node and the zoneid an unsigned
short then there is no loss. The zoneid is always < 4 and the node is always <
1024. We could even make the zoneid a u8 and pack some more state into the
list.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-17 18:19 ` Christoph Lameter
@ 2007-09-17 20:14 ` Mel Gorman
2007-09-17 19:16 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Mel Gorman @ 2007-09-17 20:14 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Lee Schermerhorn, Andrew Morton, linux-mm, ak,
mtk-manpages, solo, eric.whitney, Kamezawa Hiroyuki
On Mon, 2007-09-17 at 11:19 -0700, Christoph Lameter wrote:
> On Mon, 17 Sep 2007, Christoph Lameter wrote:
>
> > On Sun, 16 Sep 2007, Mel Gorman wrote:
> >
> > > It increases the size on 32-bit NUMA, which we've established is not
> > > confined to the NUMAQ. I think it's best to evaluate adding the node
> > > separately at a later time.
> >
> > Na... 32 bit NUMA is not that important.
> >
>
> Oh, and another thing: if you make both the node and the zoneid an unsigned
> short then there is no loss. The zoneid is always < 4 and the node is always <
> 1024. We could even make the zoneid a u8 and pack some more state into the
> list.
>
Ok, that is a very good point. When -mm is settled and one-zonelist in
its v7 incarnation has been merged, I'll roll a patch that does this
and go through another test cycle. Lee's testing has passed one-zonelist
in its current incarnation, and this close to merging I don't want to do
more revisions. The patch to do this will be very small, so it should be
easy to review in isolation. Sound like a plan?
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-17 20:14 ` Mel Gorman
@ 2007-09-17 19:16 ` Christoph Lameter
0 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:16 UTC (permalink / raw)
To: Mel Gorman
Cc: Mel Gorman, Lee Schermerhorn, Andrew Morton, linux-mm, ak,
mtk-manpages, solo, eric.whitney, Kamezawa Hiroyuki
On Mon, 17 Sep 2007, Mel Gorman wrote:
> Ok, that is a very good point. When -mm is settled and one-zonelist in
> its v7 incarnation has been merged, I'll roll a patch that does this
> and go through another test cycle. Lee's testing has passed one-zonelist
> in its current incarnation, and this close to merging I don't want to do
> more revisions. The patch to do this will be very small, so it should be
> easy to review in isolation. Sound like a plan?
Yeah. Adding patches on top of your patches sounds great and will give
Andrew the stability he needs.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-17 18:12 ` Christoph Lameter
2007-09-17 18:19 ` Christoph Lameter
@ 2007-09-17 20:03 ` Mel Gorman
1 sibling, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-17 20:03 UTC (permalink / raw)
To: Christoph Lameter
Cc: Mel Gorman, Lee Schermerhorn, Andrew Morton, linux-mm, ak,
mtk-manpages, solo, eric.whitney, Kamezawa Hiroyuki, lethal
On Mon, 2007-09-17 at 11:12 -0700, Christoph Lameter wrote:
> On Sun, 16 Sep 2007, Mel Gorman wrote:
>
> > It increases the size on 32-bit NUMA, which we've established is not
> > confined to the NUMAQ. I think it's best to evaluate adding the node
> > separately at a later time.
>
> Na... 32 bit NUMA is not that important.
Paul Mundt might disagree. Later when I revisit the node-id-in-zoneref
issue, I'll be asking him to do a quick performance check if I can't get
a simulator.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 8:53 ` Mel Gorman
2007-09-14 15:06 ` Lee Schermerhorn
@ 2007-09-14 20:15 ` Lee Schermerhorn
2007-09-16 18:05 ` Mel Gorman
1 sibling, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-14 20:15 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On Fri, 2007-09-14 at 09:53 +0100, Mel Gorman wrote:
> On (13/09/07 14:17), Andrew Morton didst pronounce:
> > On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> > Christoph Lameter <clameter@sgi.com> wrote:
> >
> > > On Thu, 13 Sep 2007, Mel Gorman wrote:
> > >
> > > > What do you see holding it up? Is it the fact we are no longer doing the
> > > > pointer packing and you don't want that structure to exist, or is it simply
> > > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > > coverage in -mm?
> > >
> > > No, it's not the pointer packing. The problem is that the patches have not
> > > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > > and get some exposure in -mm. Andrew?
> >
> > You rang?
> >
> > To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> > That's still in my queue somewhere, but a) it has "RFC" in it which usually
> > makes me run away and b) we already have no fewer than 221 memory
> > management patches queued.
> >
>
> Christoph's question is in relation to the patchset "Use one zonelist per
> node instead of multiple zonelists v7" and whether one zonelist will be
> merged in 2.6.24 in your opinion. I am hoping "yes" because it removes that
> hack with ZONE_MOVABLE and policies. I had sent you a version (v5) but there
> were further suggestions on ways to improve it so we're up to v7 now. Lee
> will hopefully be able to determine if v7 regresses policy behaviour or not.
>
I've been testing the "one zonelist patches" with various memtoy
scripts, and they seem to be working--i.e., pages ending up where I
expect. The tests aren't exhaustive or even particularly stressful, but
I did test all of the policies. I also measured the time to allocate
4G [256K pages] with several policies using memtoy. Here are the
results--rough averages of 10 runs; very close grouping for each
test--smaller is better:
Test 23-rc4-mm1 +one zonelist patches
sys default policy 2.768s > 2.755s
task pol bind local(1) 2.789s ~= 2.789s
task pol bind remote(2) 3.774s < 3.780s
vma pol bind local(3) 2.794s > 2.790s
vma pol bind remote(4) 3.769s < 3.777s
vma pol pref local(5) 2.774s > 2.770s
vma interleave 0-3 3.446s > 3.436s
Notes:
1) numactl -c3 -m3
2) numactl -c1 -m3
3) memtoy bound to node 3, mbind MPOL_BIND to node 3
4) memtoy bound to node 1, mbind MPOL_BIND to node 3
5) mbind MPOL_PREFERRED, null nodemask [preferred_node == -1 internally]
The results are very close, but it looks like one-zonelist is a bit
faster for local allocations and a bit slower for remote allocations.
None of these tests overflowed the target node.
I've also run a moderate stress test [half an hour now] and it's holding
up.
I'm still trying to absorb the patches, but so far they look good.
Perhaps Andrew can tack them onto the bottom of the next -mm so that if
someone else finds issues, they won't complicate merging earlier patches
upstream?
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-14 20:15 ` Lee Schermerhorn
@ 2007-09-16 18:05 ` Mel Gorman
2007-09-16 19:34 ` Andrew Morton
` (2 more replies)
0 siblings, 3 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-16 18:05 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: Andrew Morton, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On (14/09/07 16:15), Lee Schermerhorn didst pronounce:
> On Fri, 2007-09-14 at 09:53 +0100, Mel Gorman wrote:
> > On (13/09/07 14:17), Andrew Morton didst pronounce:
> > > On Thu, 13 Sep 2007 11:26:19 -0700 (PDT)
> > > Christoph Lameter <clameter@sgi.com> wrote:
> > >
> > > > On Thu, 13 Sep 2007, Mel Gorman wrote:
> > > >
> > > > > What do you see holding it up? Is it the fact we are no longer doing the
> > > > > pointer packing and you don't want that structure to exist, or is it simply
> > > > > a case that 2.6.23 is too close to the door and it won't get adequate
> > > > > coverage in -mm?
> > > >
> > > > No, it's not the pointer packing. The problem is that the patches have not
> > > > been merged yet and 2.6.23 is close. We would need to merge it very soon
> > > > and get some exposure in -mm. Andrew?
> > >
> > > You rang?
> > >
> > > To which patches do you refer? "Memory Policy Cleanups and Enhancements"?
> > > That's still in my queue somewhere, but a) it has "RFC" in it which usually
> > > makes me run away and b) we already have no fewer than 221 memory
> > > management patches queued.
> > >
> >
> > Christoph's question is in relation to the patchset "Use one zonelist per
> > node instead of multiple zonelists v7" and whether one zonelist will be
> > merged in 2.6.24 in your opinion. I am hoping "yes" because it removes that
> > hack with ZONE_MOVABLE and policies. I had sent you a version (v5) but there
> > were further suggestions on ways to improve it so we're up to v7 now. Lee
> > will hopefully be able to determine if v7 regresses policy behaviour or not.
> >
>
> I've been testing the "one zonelist patches" with various memtoy
> scripts, and they seem to be working--i.e., pages ending up where I
> expect.
Very very cool. Thanks Lee, I haven't gotten off the ground with
memtoy. It's promising, but as I don't have a machine image for long
periods of time, the ramp-up time for making memtoy useful to me is
large.
> The tests aren't exhaustive or even particularly stressful, but
> I did test all of the policies. I also measured the time to allocate
> 4G [256K pages] with several policies using memtoy. Here are the
> results--rough averages of 10 runs; very close grouping for each
> test--smaller is better:
>
> Test 23-rc4-mm1 +one zonelist patches
> sys default policy 2.768s > 2.755s
> task pol bind local(1) 2.789s ~= 2.789s
> task pol bind remote(2) 3.774s < 3.780s
> vma pol bind local(3) 2.794s > 2.790s
> vma pol bind remote(4) 3.769s < 3.777s
> vma pol pref local(5) 2.774s > 2.770s
> vma interleave 0-3 3.446s > 3.436s
>
> Notes:
> 1) numactl -c3 -m3
> 2) numactl -c1 -m3
> 3) memtoy bound to node 3, mbind MPOL_BIND to node 3
> 4) memtoy bound to node 1, mbind MPOL_BIND to node 3
> 5) mbind MPOL_PREFERRED, null nodemask [preferred_node == -1 internally]
>
> The results are very close, but it looks like one-zonelist is a bit
> faster for local allocations and a bit slower for remote allocations.
I wonder why they are slower for the remote allocations. I wonder if what we're
seeing is due to a longer zonelist that is filtered instead of a customised
shorter zonelist. By and large though, the differences are too small to
be noticed.
> None of these tests overflowed the target node.
>
> I've also run a moderate stress test [half an hour now] and it's holding
> up.
>
Great.
> I'm still trying to absorb the patches, but so far they look good.
> Perhaps Andrew can tack them onto the bottom of the next -mm so that if
> someone else finds issues, they won't complicate merging earlier patches
> upstream?
>
I hope so. Andrew, how do you feel about pulling V7 into -mm? I don't
intend to do a revision around pointer packing or node packing until I
have something in place to prove they are worthwhile.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-16 18:05 ` Mel Gorman
@ 2007-09-16 19:34 ` Andrew Morton
2007-09-16 21:22 ` Mel Gorman
2007-09-17 13:29 ` Lee Schermerhorn
2007-09-17 18:14 ` Christoph Lameter
2 siblings, 1 reply; 76+ messages in thread
From: Andrew Morton @ 2007-09-16 19:34 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On Sun, 16 Sep 2007 19:05:27 +0100 mel@skynet.ie (Mel Gorman) wrote:
> > I'm still trying to absorb the patches, but so far they look good.
> > Perhaps Andrew can tack them onto the bottom of the next -mm so that if
> > someone else finds issues, they won't complicate merging earlier patches
> > upstream?
> >
>
> I hope so. Andrew, how do you feel about pulling V7 into -mm?
umm, sure, once the churn rate falls to less than one new revision per day?
I need to get rc6-mm1 out, and it'll be crap, and it'll need another -mm shortly
after that to get things vaguely stable.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-16 19:34 ` Andrew Morton
@ 2007-09-16 21:22 ` Mel Gorman
0 siblings, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-16 21:22 UTC (permalink / raw)
To: Andrew Morton
Cc: Lee Schermerhorn, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On (16/09/07 12:34), Andrew Morton didst pronounce:
> On Sun, 16 Sep 2007 19:05:27 +0100 mel@skynet.ie (Mel Gorman) wrote:
>
> > > I'm still trying to absorb the patches, but so far they look good.
> > > Perhaps Andrew can tack them onto the bottom of the next -mm so that if
> > > someone else finds issues, they won't complicate merging earlier patches
> > > upstream?
> > >
> >
> > I hope so. Andrew, how do you feel about pulling V7 into -mm?
>
> umm, sure, once the churn rate falls to less than one new revision per day?
>
I don't intend to revise it any more except in response to bugs. Lee's tests
have gone well and no one has reported performance problems.
The suggestion for a future revision was putting the node id in the struct
zoneref that makes up the zonelist, but that needs to be justified because
it's not zero-cost to add with 32-bit NUMA. I don't intend to go back to
packing the zone_id into a pointer because the code is a little opaque and
it's not clear it gains in performance. We might justify it later because
"clearly it uses less cache", but it's not a decision that is going to be
made in a day.
> I need to get rc6-mm1 out, and it'll be crap, and it'll need another -mm shortly
> after that to get things vaguely stable.
>
If you want to wait until the -mm after rc6-mm1, that's fine. I don't intend
to revise the patches for the moment so hopefully I can put the time
into helping debug -mm instead.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-16 18:05 ` Mel Gorman
2007-09-16 19:34 ` Andrew Morton
@ 2007-09-17 13:29 ` Lee Schermerhorn
2007-09-17 18:14 ` Christoph Lameter
2 siblings, 0 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 13:29 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrew Morton, Christoph Lameter, linux-mm, ak, mtk-manpages,
solo, eric.whitney
On Sun, 2007-09-16 at 19:05 +0100, Mel Gorman wrote:
> On (14/09/07 16:15), Lee Schermerhorn didst pronounce:
> >
> > I've also run a moderate stress test [half an hour now] and it's holding
> > up.
> >
>
> Great.
FYI: I let the exerciser tests run over the weekend. As of this
morning they had run for ~65.5 hours with no console messages. One of
the "bin" tests [hexdump] had exited with non-zero status at some point,
but there was insufficient info to diagnose it.
Still, pretty solid.
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-16 18:05 ` Mel Gorman
2007-09-16 19:34 ` Andrew Morton
2007-09-17 13:29 ` Lee Schermerhorn
@ 2007-09-17 18:14 ` Christoph Lameter
2 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 18:14 UTC (permalink / raw)
To: Mel Gorman
Cc: Lee Schermerhorn, Andrew Morton, linux-mm, ak, mtk-manpages, solo,
eric.whitney
On Sun, 16 Sep 2007, Mel Gorman wrote:
> I wonder why they are slower for the remote allocations. I wonder if what we're
> seeing is due to a longer zonelist that is filtered instead of a customised
> shorter zonelist. By and large though, the differences are too small to
> be noticed.
Maybe because of the node lookups in the zone? A remote allocation with an
MPOL_BIND list requires a longer traversal of the zonelist and a comparison
of node ids by dereferencing each zone.
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-12 22:17 ` Christoph Lameter
2007-09-13 13:57 ` Lee Schermerhorn
@ 2007-09-13 15:49 ` Lee Schermerhorn
2007-09-13 18:22 ` Christoph Lameter
1 sibling, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 15:49 UTC (permalink / raw)
To: Christoph Lameter
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Wed, 2007-09-12 at 15:17 -0700, Christoph Lameter wrote:
> On Tue, 11 Sep 2007, Lee Schermerhorn wrote:
>
> > Andi, Christoph, Mel [added to cc]:
> >
> > Any comments on these patches, posted 30aug? I've rebased to
> > 23-rc4-mm1, but before reposting, I wanted to give you a chance to
> > comment.
>
Sorry that it took some time, but I only just got around to looking at them.
The one patch that I acked may be of higher priority and should probably
go in immediately to be merged for 2.6.24.
>
One thing I forgot to ask in my previous response:
What about patch #1 in the series: fix reference counting?
Andi thought this was important enough to go into 2.6.23 as a bug fix. I
think it's at least as important as the one you acked. However, I don't
know that anyone has ever cited a problem with it. Then again, they
might not notice if they occasionally leak a mempolicy struct, or use a
freed or reused one for page allocation. I'd be happier to see it cook
in -mm for a while. So, I'll go ahead and rebase atop Mel's patches.
Thoughts?
Lee
* Re: [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements
2007-09-13 15:49 ` Lee Schermerhorn
@ 2007-09-13 18:22 ` Christoph Lameter
0 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-13 18:22 UTC (permalink / raw)
To: Lee Schermerhorn
Cc: linux-mm, akpm, ak, mtk-manpages, solo, eric.whitney, Mel Gorman
On Thu, 13 Sep 2007, Lee Schermerhorn wrote:
> Andi thought this was important enough to go into 2.6.23 as a bug fix. I
> think it's at least as important as the one you acked. However, I don't
> know that anyone has ever cited a problem with it. Then again, they
> might not notice if they occasionally leak a mempolicy struct, or use a
> freed or reused one for page allocation. I'd be happier to see it cook
> in -mm for a while. So, I'll go ahead and rebase atop Mel's patches.
>
> Thoughts?
There is the concern about performance issues because of new refcounter
increments. I thought you said that you wanted to do performance tests
first?
* [PATCH] Fix NUMA Memory Policy Reference Counting
2007-08-30 18:50 [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
` (5 preceding siblings ...)
2007-09-11 16:20 ` [PATCH/RFC 0/5] Memory Policy Cleanups and Enhancements Lee Schermerhorn
@ 2007-09-17 19:00 ` Lee Schermerhorn
2007-09-17 19:14 ` Christoph Lameter
2007-09-18 10:36 ` Mel Gorman
6 siblings, 2 replies; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 19:00 UTC (permalink / raw)
To: linux-mm; +Cc: akpm, ak, clameter, eric.whitney, Mel Gorman
Andi pinged me to submit this patch stand-alone. He would like to
see this go into 2.6.23, as he considers it a high-priority bug fix. This
version of the patch is against 23-rc4-mm1. I have another version
rebased against 23-rc6. I would understand if folks were not
comfortable with this going in at this late date. However, I will post
that version so that it can be added to .23 or one of the subsequent
stable releases, if folks so choose.
Christoph L was concerned about performance regression in the page
allocation path, so I've included the results of some page allocation
micro-benchmarks, plus kernel build results on the version of the patch
against 23-rc6.
Note that this patch will collide with Mel Gorman's one zonelist series.
I can rebase atop that, if Mel's series makes it into -mm before this
one.
Mel suggested renaming mpol_free() to mpol_put() because it removes a
reference and frees only if ref count goes to zero. I haven't gotten
around to that in this patch. I would welcome other opinions on this.
Lee
--------------------------------
PATCH Memory Policy: fix reference counting
Against 2.6.23-rc4-mm1
This patch proposes fixes to the reference counting of memory policy
in the page allocation paths and in show_numa_map(). Extracted from
my "Memory Policy Cleanups and Enhancements" series as stand-alone.
Shared policy lookup [shmem] has always added a reference to the
policy, but this was never unrefed after page allocation or after
formatting the numa map data.
Default system policy should not require additional ref counting,
nor should the current task's task policy. However, show_numa_map()
calls get_vma_policy() to examine what may be [likely is] another
task's policy. The latter case needs protection against freeing
of the policy.
This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another
task's mempolicy. Again, shared policy is already reference
counted on lookup. A matching "unref" [__mpol_free()] is performed
in alloc_page_vma() for shared and vma policies, and in
show_numa_map() for shared and another task's mempolicy.
We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL
policy.
Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a
shared or vma 'BIND policy. In this case, we should hold the
reference until after the huge page allocation in
dequeue_hugepage(). The patch modifies huge_zonelist() to
return a pointer to the mempolicy if it needs to be unref'd
after allocation.
Page allocation micro-benchmark:
Time to fault in 256K 16k pages [ia64] into a 4G anon segment.
Test 23-rc4-mm1 +mempol ref count
sys default policy 2.769s > 2.763s [-0.006s = 0.22%]
task pol bind local(1) 2.789s ~= 2.790s
task pol bind remote(2) 3.774s < 3.777s
vma pol bind local(3) 2.793s > 2.790s
vma pol bind remote(4) 3.768s > 3.764s
vma pol pref local(5) 2.774s < 2.780s [+0.006s = 0.22%]
vma interleave 0-3 3.445s ~= 3.444s
Notes:
1) numactl -c3 -m3
2) numactl -c1 -m3
3) memtoy bound to node 3, mbind MPOL_BIND to node 3
4) memtoy bound to node 1, mbind MPOL_BIND to node 3
5) mbind MPOL_PREFERRED, null nodemask [preferred_node == -1 internally]
I think the difference in performance, for these tests, is in the noise: 0.22% max.
In one case it is in favor of the patch [system default policy], and in the other,
in favor of the unpatched kernel [explicit local policy].
Kernel Build [16cpu, 32GB, ia64] - same patch, but atop 2.6.23-rc6;
average of 10 runs:
w/o patch w/ refcount patch
Avg Std Devn Avg Std Devn
Real: 100.59 0.38 100.63 0.43
User: 1209.60 0.37 1209.91 0.31
System: 81.52 0.42 81.64 0.34
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/mempolicy.h | 4 +-
mm/hugetlb.c | 4 +-
mm/mempolicy.c | 79 ++++++++++++++++++++++++++++++++++++++++------
3 files changed, 75 insertions(+), 12 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-09-17 12:18:47.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-09-17 14:48:08.000000000 -0400
@@ -1086,21 +1086,37 @@ asmlinkage long compat_sys_mbind(compat_
#endif
-/* Return effective policy for a VMA */
+/*
+ * get_vma_policy(@task, @vma, @addr)
+ * @task - task for fallback if vma policy == default
+ * @vma - virtual memory area whose policy is sought
+ * @addr - address in @vma for shared policy lookup
+ *
+ * Returns effective policy for a VMA at specified address.
+ * Falls back to @task or system default policy, as necessary.
+ * Returned policy has extra reference count if shared, vma,
+ * or some other task's policy [show_numa_maps() can pass
+ * @task != current]. It is the caller's responsibility to
+ * free the reference in these cases.
+ */
static struct mempolicy * get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
+ int shared_pol = 0;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
pol = vma->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy &&
+ shared_pol = 1; /* if pol non-NULL, add ref below */
+ } else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
+ else if (!shared_pol && pol != current->mempolicy)
+ mpol_get(pol); /* vma or other task's policy */
return pol;
}
@@ -1216,19 +1232,45 @@ static inline unsigned interleave_nid(st
}
#ifdef CONFIG_HUGETLBFS
-/* Return a zonelist suitable for a huge page allocation. */
+/*
+ * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * @vma = virtual memory area whose policy is sought
+ * @addr = address in @vma for shared policy lookup and interleave policy
+ * @gfp_flags = for requested zone
+ * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ *
+ * Returns a zonelist suitable for a huge page allocation.
+ * If the effective policy is 'BIND, returns pointer to policy's zonelist.
+ * If it is also a policy for which get_vma_policy() returns an extra
+ * reference, we must hold that reference until after allocation.
+ * In that case, return policy via @mpol so hugetlb allocation can drop
+ * the reference. For non-'BIND referenced policies, we can/do drop the
+ * reference here, so the caller doesn't need to know about the special case
+ * for default and current task policy.
+ */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, struct mempolicy **mpol)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
+ *mpol = NULL; /* probably no unref needed */
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ __mpol_free(pol); /* finished with pol */
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
- return zonelist_policy(GFP_HIGHUSER, pol);
+
+ zl = zonelist_policy(GFP_HIGHUSER, pol);
+ if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
+ if (pol->policy != MPOL_BIND)
+ __mpol_free(pol); /* finished with pol */
+ else
+ *mpol = pol; /* unref needed after allocation */
+ }
+ return zl;
}
#endif
@@ -1273,6 +1315,7 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
cpuset_update_task_memory_state();
@@ -1282,7 +1325,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ zl = zonelist_policy(gfp, pol);
+ if (pol != &default_policy && pol != current->mempolicy) {
+ /*
+ * slow path: ref counted policy -- shared or vma
+ */
+ struct page *page = __alloc_pages(gfp, 0, zl);
+ __mpol_free(pol);
+ return page;
+ }
+ /*
+ * fast path: default or task policy
+ */
+ return __alloc_pages(gfp, 0, zl);
}
/**
@@ -1881,6 +1936,7 @@ int show_numa_map(struct seq_file *m, vo
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mempolicy *pol;
int n;
char buffer[50];
@@ -1891,8 +1947,13 @@ int show_numa_map(struct seq_file *m, vo
if (!md)
return 0;
- mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol);
+ /*
+ * unref shared or other task's mempolicy
+ */
+ if (pol != &default_policy && pol != current->mempolicy)
+ __mpol_free(pol);
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-09-17 12:18:47.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-09-17 14:47:58.000000000 -0400
@@ -150,7 +150,7 @@ extern void mpol_fix_fork_child_flag(str
extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags);
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
extern unsigned slab_node(struct mempolicy *policy);
extern enum zone_type policy_zone;
@@ -238,7 +238,7 @@ static inline void mpol_fix_fork_child_f
}
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags)
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
{
return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
}
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c 2007-09-17 14:47:54.000000000 -0400
+++ Linux/mm/hugetlb.c 2007-09-17 14:47:58.000000000 -0400
@@ -71,8 +71,9 @@ static struct page *dequeue_huge_page(st
{
int nid;
struct page *page = NULL;
+ struct mempolicy *mpol;
struct zonelist *zonelist = huge_zonelist(vma, address,
- htlb_alloc_mask);
+ htlb_alloc_mask, &mpol);
struct zone **z;
for (z = zonelist->zones; *z; z++) {
@@ -87,6 +88,7 @@ static struct page *dequeue_huge_page(st
break;
}
}
+ mpol_free(mpol); /* maybe need unref */
return page;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: dont@kvack.org
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-17 19:00 ` [PATCH] Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
@ 2007-09-17 19:14 ` Christoph Lameter
2007-09-17 19:38 ` Lee Schermerhorn
2007-09-18 10:36 ` Mel Gorman
1 sibling, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:14 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, eric.whitney, Mel Gorman
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> Page allocation micro-benchmark:
>
> Time to fault in 256K 16k pages [ia64] into a 4G anon segment.
You need to run this as a test that concurrently allocates these pages
from as many processors as possible in a common address space. A single
thread will not cause cache line bouncing. I suspect this will expose an
additional issue beyond what we already have with mmap_sem locking.
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-17 19:14 ` Christoph Lameter
@ 2007-09-17 19:38 ` Lee Schermerhorn
2007-09-17 19:43 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 19:38 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, akpm, ak, eric.whitney, Mel Gorman
On Mon, 2007-09-17 at 12:14 -0700, Christoph Lameter wrote:
> On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
>
> > Page allocation micro-benchmark:
> >
> > Time to fault in 256K 16k pages [ia64] into a 4G anon segment.
>
> You need to run this as a test that concurrently allocates these pages
> from as many processors as possible in a common address space. A single
> thread will not cause cache line bouncing. I suspect this will cause an
> additional issue than what we already have with mmap_sem locking.
>
Yeah, I'll have to write a custom, multithreaded test for this, or
enhance memtoy to attach shm segments by id and run lots of them
together. I'll try to get to it asap.
Lee
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-17 19:38 ` Lee Schermerhorn
@ 2007-09-17 19:43 ` Christoph Lameter
2007-09-19 22:03 ` Lee Schermerhorn
0 siblings, 1 reply; 76+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:43 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, eric.whitney, Mel Gorman
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> Yeah, I'll have to write a custom, multithreaded test for this, or
> enhance memtoy to attach shm segments by id and run lots of them
> together. I'll try to get to it asap.
Maybe my old pft.c tool would help:
http://lkml.org/lkml/2006/8/29/294
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-17 19:43 ` Christoph Lameter
@ 2007-09-19 22:03 ` Lee Schermerhorn
2007-09-19 22:23 ` Christoph Lameter
0 siblings, 1 reply; 76+ messages in thread
From: Lee Schermerhorn @ 2007-09-19 22:03 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-mm, akpm, ak, eric.whitney, Mel Gorman
On Mon, 2007-09-17 at 12:43 -0700, Christoph Lameter wrote:
> On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
>
> > Yeah, I'll have to write a custom, multithreaded test for this, or
> > enhance memtoy to attach shm segments by id and run lots of them
> > together. I'll try to get to it asap.
>
> Maybe my old pft.c tool would help:
>
> http://lkml.org/lkml/2006/8/29/294
Christoph:
pft did help. It didn't do exactly what I needed so I cloned it and
hacked the copy. What I wanted it to do is:
* allocate a single large, page aligned mem area: valloc()'d [was
malloc()] or shmem. [maybe later use mmap(_ANONYMOUS)]
* optionally apply a vma policy to the region via mbind(). I want to be
able to test faulting with system default policy, task policy [via
numactl] and vma policy using mbind().
* fork off a single child so that we can collect rusage [fault count]
using RUSAGE_CHILDREN. Your version created multiple children, but I
only need a single task with multiple threads to test vma policy
reference counting.
* create multiple threads to touch and fault in different ranges of the
test memory region. Pretty much what your pft already did.
I made a few more changes to allocate the memory region and do as much
setup outside the measurement interval as I could.
At some point, I might want to add back multiple tasks to test shmem
policy ref counting. However, we always added a ref count to shmem on
allocation--we just never released it. So the fix does add an extra
write to release the policy. But, the biggest change is the taking of a
reference for vma policy, so this is what I wanted to test.
I've placed a tarball containing the original and modified pft, a
Makefile and a wrapper script to invoke pft with from 1 to <nr_cpus-1>
threads at:
http://free.linux.hp.com/~lts/Tools/pft-0.01.tar.gz
I ran this modified version on my 16-cpu numa platform--therefore the
runs go from 1 to 15 threads each. I ran for both valloc()d memory and
shmem [8GB each] with system default and vma policy, on an unpatched
2.6.23-rc4-mm1 and the same with the ref counting patch. Raw results and a
plot of the valloc()'d runs can be found at:
http://free.linux.hp.com/~lts/Mempolicy/
I'll place a plot of the shmem runs there tomorrow.
Bottom line: the run to run variability seems greater than the
difference between 23-rc4-mm1 with and without the patch. Also, it
appears that the contention on the page table, and perhaps the
radix-tree in the shmem case, overshadow any differences due to the
reference counting. Take a look and see what you think.
Perhaps you could grab the modified version and have it run on a larger
altix system.
Later,
Lee
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-19 22:03 ` Lee Schermerhorn
@ 2007-09-19 22:23 ` Christoph Lameter
0 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2007-09-19 22:23 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, eric.whitney, Mel Gorman
On Wed, 19 Sep 2007, Lee Schermerhorn wrote:
> Bottom line: the run to run variability seems greater than the
> difference between 23-rc4-mm1 with and without the patch. Also, it
> appears that the contention on the page table, and perhaps the
> radix-tree in the shmem case, overshadow any differences due to the
> reference counting. Take a look and see what you think.
Looks good. Thanks. Makes sense that the radix tree overshadows it. There
are lots of reasons to not use shmem.
* Re: [PATCH] Fix NUMA Memory Policy Reference Counting
2007-09-17 19:00 ` [PATCH] Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
2007-09-17 19:14 ` Christoph Lameter
@ 2007-09-18 10:36 ` Mel Gorman
1 sibling, 0 replies; 76+ messages in thread
From: Mel Gorman @ 2007-09-18 10:36 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-mm, akpm, ak, clameter, eric.whitney
On (17/09/07 15:00), Lee Schermerhorn didst pronounce:
> Andi pinged me to submit this patch as stand alone. He would like to
> see this go into 2.6.23, as he considers it a high priority bug. This
> verion of the patch is against 23-rc4-mm1. I have another version
> rebased against 23-rc6. I would understand if folks were not
> comfortable with this going in at this late date. However, I will post
> that version so that it can be added to .23 or one of the subsequent
> stable release, if folks so choose.
>
> Christoph L was concerned about performance regression in the page
> allocation path, so I've included the results of some page allocation
> micro-benchmarks, plus kernel build results on the version of the patch
> against 23-rc6.
>
> Note that this patch will collide with Mel Gorman's one zonelist series.
> I can rebase atop that, if Mel's series makes it into -mm before this
> one.
>
> Mel suggested renaming mpol_free() to mpol_put() because it removes a
> reference and frees only if ref count goes to zero. I haven't gotten
> around to that in this patch. I would welcome other opinions on this.
>
If this is being pushed as a fix to 2.6.23, then do the clean-up separately
in the next cycle to avoid obscuring the fix this close to release. I still
think that mpol_put() is a much better name for a drop-ref-and-free-if-0
function than mpol_free.
> Lee
>
> --------------------------------
> PATCH Memory Policy: fix reference counting
>
> Against 2.6.23-rc4-mm1
>
> This patch proposes fixes to the reference counting of memory policy
> in the page allocation paths and in show_numa_map(). Extracted from
> my "Memory Policy Cleanups and Enhancements" series as stand-alone.
>
> Shared policy lookup [shmem] has always added a reference to the
> policy, but this was never unrefed after page allocation or after
> formatting the numa map data.
>
> Default system policy should not require additional ref counting,
> nor should the current task's task policy. However, show_numa_map()
> calls get_vma_policy() to examine what may be [likely is] another
> task's policy. The latter case needs protection against freeing
> of the policy.
>
> This patch adds a reference count to a mempolicy returned by
> get_vma_policy() when the policy is a vma policy or another
> task's mempolicy. Again, shared policy is already reference
> counted on lookup. A matching "unref" [__mpol_free()] is performed
> in alloc_page_vma() for shared and vma policies, and in
> show_numa_map() for shared and another task's mempolicy.
> We can call __mpol_free() directly, saving an admittedly
> inexpensive inline NULL test, because we know we have a non-NULL
> policy.
>
> Handling policy ref counts for hugepages is a bit trickier.
> huge_zonelist() returns a zone list that might come from a
> shared or vma 'BIND policy. In this case, we should hold the
> reference until after the huge page allocation in
> dequeue_hugepage(). The patch modifies huge_zonelist() to
> return a pointer to the mempolicy if it needs to be unref'd
> after allocation.
>
> Page allocation micro-benchmark:
>
> Time to fault in 256K 16k pages [ia64] into a 4G anon segment.
>
> Test 23-rc4-mm1 +mempol ref count
> sys default policy 2.769s > 2.763s [-0.006s = 0.22%]
> task pol bind local(1) 2.789s ~= 2.790s
> task pol bind remote(2) 3.774s < 3.777s
> vma pol bind local(3) 2.793s > 2.790s
> vma pol bind remote(4) 3.768s > 3.764s
> vma pol pref local(5) 2.774s < 2.780s [+0.006s = 0.22%]
> vma interleave 0-3 3.445s ~= 3.444s
>
> Notes:
> 1) numactl -c3 -m3
> 2) numactl -c1 -m3
> 3) memtoy bound to node 3, mbind MPOL_BIND to node 3
> 4) memtoy bound to node 1, mbind MPOL_BIND to node 3
> 5) mbind MPOL_PREFERRED, null nodemask [preferred_node == -1 internally]
>
> I think the difference in performance, for these tests, is in the noise: 0.22% max.
> In one case in favor of the patch [system default policy] and in the other case,
> in favor of unpatched kernel [explicit local policy].
>
> Kernel Build [16cpu, 32GB, ia64] - same patch, but atop 2.6.23-rc6;
> average of 10 runs:
> w/o patch w/ refcount patch
> Avg Std Devn Avg Std Devn
> Real: 100.59 0.38 100.63 0.43
> User: 1209.60 0.37 1209.91 0.31
> System: 81.52 0.42 81.64 0.34
>
>
> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
>
> include/linux/mempolicy.h | 4 +-
> mm/hugetlb.c | 4 +-
> mm/mempolicy.c | 79 ++++++++++++++++++++++++++++++++++++++++------
> 3 files changed, 75 insertions(+), 12 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-09-17 12:18:47.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-09-17 14:48:08.000000000 -0400
> @@ -1086,21 +1086,37 @@ asmlinkage long compat_sys_mbind(compat_
>
> #endif
>
> -/* Return effective policy for a VMA */
> +/*
> + * get_vma_policy(@task, @vma, @addr)
> + * @task - task for fallback if vma policy == default
> + * @vma - virtual memory area whose policy is sought
> + * @addr - address in @vma for shared policy lookup
> + *
> + * Returns effective policy for a VMA at specified address.
> + * Falls back to @task or system default policy, as necessary.
> + * Returned policy has extra reference count if shared, vma,
> + * or some other task's policy [show_numa_maps() can pass
> + * @task != current]. It is the caller's responsibility to
> + * free the reference in these cases.
> + */
> static struct mempolicy * get_vma_policy(struct task_struct *task,
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = task->mempolicy;
> + int shared_pol = 0;
>
> if (vma) {
> - if (vma->vm_ops && vma->vm_ops->get_policy)
> + if (vma->vm_ops && vma->vm_ops->get_policy) {
> pol = vma->vm_ops->get_policy(vma, addr);
> - else if (vma->vm_policy &&
> + shared_pol = 1; /* if pol non-NULL, add ref below */
> + } else if (vma->vm_policy &&
> vma->vm_policy->policy != MPOL_DEFAULT)
> pol = vma->vm_policy;
> }
> if (!pol)
> pol = &default_policy;
> + else if (!shared_pol && pol != current->mempolicy)
> + mpol_get(pol); /* vma or other task's policy */
> return pol;
> }
>
> @@ -1216,19 +1232,45 @@ static inline unsigned interleave_nid(st
> }
>
> #ifdef CONFIG_HUGETLBFS
> -/* Return a zonelist suitable for a huge page allocation. */
> +/*
> + * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
> + * @vma = virtual memory area whose policy is sought
> + * @addr = address in @vma for shared policy lookup and interleave policy
> + * @gfp_flags = for requested zone
> + * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
> + *
> + * Returns a zonelist suitable for a huge page allocation.
> + * If the effective policy is 'BIND, returns pointer to policy's zonelist.
> + * If it is also a policy for which get_vma_policy() returns an extra
> + * reference, we must hold that reference until after allocation.
> + * In that case, return policy via @mpol so hugetlb allocation can drop
> + * the reference. For non-'BIND referenced policies, we can/do drop the
> + * reference here, so the caller doesn't need to know about the special case
> + * for default and current task policy.
> + */
> struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> - gfp_t gfp_flags)
> + gfp_t gfp_flags, struct mempolicy **mpol)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct zonelist *zl;
>
> + *mpol = NULL; /* probably no unref needed */
> if (pol->policy == MPOL_INTERLEAVE) {
> unsigned nid;
>
> nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> + __mpol_free(pol); /* finished with pol */
> return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
> }
This different handling of pol vs mpol is a bit confusing. It took me a
few minutes to pick apart what is going on despite the comment before
the function. What would the consequence be of always passing back mpol
and having the caller drop the reference?
Again, not big enough to actually halt the fix.
> - return zonelist_policy(GFP_HIGHUSER, pol);
> +
> + zl = zonelist_policy(GFP_HIGHUSER, pol);
> + if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> + if (pol->policy != MPOL_BIND)
> + __mpol_free(pol); /* finished with pol */
> + else
> + *mpol = pol; /* unref needed after allocation */
> + }
> + return zl;
> }
> #endif
>
> @@ -1273,6 +1315,7 @@ struct page *
> alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
> + struct zonelist *zl;
>
> cpuset_update_task_memory_state();
>
> @@ -1282,7 +1325,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
> nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
> return alloc_page_interleave(gfp, 0, nid);
> }
> - return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> + zl = zonelist_policy(gfp, pol);
> + if (pol != &default_policy && pol != current->mempolicy) {
Bit of a nit-pick. This
if (pol != &default_policy && pol != current->mempolicy)
check happens quite frequently. Consider making it a helper function like
is_foreign_policy() or something as a future clean-up. That's not a great
name but you get the idea. It's not vital to address as part of a fix.
> + /*
> + * slow path: ref counted policy -- shared or vma
> + */
> + struct page *page = __alloc_pages(gfp, 0, zl);
> + __mpol_free(pol);
> + return page;
> + }
> + /*
> + * fast path: default or task policy
> + */
> + return __alloc_pages(gfp, 0, zl);
> }
>
> /**
> @@ -1881,6 +1936,7 @@ int show_numa_map(struct seq_file *m, vo
> struct numa_maps *md;
> struct file *file = vma->vm_file;
> struct mm_struct *mm = vma->vm_mm;
> + struct mempolicy *pol;
> int n;
> char buffer[50];
>
> @@ -1891,8 +1947,13 @@ int show_numa_map(struct seq_file *m, vo
> if (!md)
> return 0;
>
> - mpol_to_str(buffer, sizeof(buffer),
> - get_vma_policy(priv->task, vma, vma->vm_start));
> + pol = get_vma_policy(priv->task, vma, vma->vm_start);
> + mpol_to_str(buffer, sizeof(buffer), pol);
> + /*
> + * unref shared or other task's mempolicy
> + */
> + if (pol != &default_policy && pol != current->mempolicy)
> + __mpol_free(pol);
>
> seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>
> Index: Linux/include/linux/mempolicy.h
> ===================================================================
> --- Linux.orig/include/linux/mempolicy.h 2007-09-17 12:18:47.000000000 -0400
> +++ Linux/include/linux/mempolicy.h 2007-09-17 14:47:58.000000000 -0400
> @@ -150,7 +150,7 @@ extern void mpol_fix_fork_child_flag(str
>
> extern struct mempolicy default_policy;
> extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> - unsigned long addr, gfp_t gfp_flags);
> + unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -238,7 +238,7 @@ static inline void mpol_fix_fork_child_f
> }
>
> static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> - unsigned long addr, gfp_t gfp_flags)
> + unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
> {
> return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
> }
> Index: Linux/mm/hugetlb.c
> ===================================================================
> --- Linux.orig/mm/hugetlb.c 2007-09-17 14:47:54.000000000 -0400
> +++ Linux/mm/hugetlb.c 2007-09-17 14:47:58.000000000 -0400
> @@ -71,8 +71,9 @@ static struct page *dequeue_huge_page(st
> {
> int nid;
> struct page *page = NULL;
> + struct mempolicy *mpol;
> struct zonelist *zonelist = huge_zonelist(vma, address,
> - htlb_alloc_mask);
> + htlb_alloc_mask, &mpol);
> struct zone **z;
>
> for (z = zonelist->zones; *z; z++) {
> @@ -87,6 +88,7 @@ static struct page *dequeue_huge_page(st
> break;
> }
> }
> + mpol_free(mpol); /* maybe need unref */
If you always passed back mpol and did the free here, it ref/unref
should be clearer.
> return page;
> }
>
I haven't tested it but it looks good. Any problems I have are of the
clean-up or nit-pick nature and not sufficient to warrant another
revision. I would like to see the cleanups after 2.6.23 though.
Acked-by: Mel Gorman <mel@csn.ul.ie>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab