[NUMA] Fix memory policy refcounting

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [NUMA] Fix memory policy refcounting
@ 2007-10-26 23:41 Christoph Lameter
  2007-10-29 15:48 ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-10-26 23:41 UTC (permalink / raw)
  To: Lee Schermerhorn; +Cc: David Rientjes, Paul Jackson, linux-mm

The refcounting fix that went into 2.6.23 left race conditions open because:

1. Reference counts were taken in get_vma_policy without necessarily
   holding another lock that guaranteed the existence of the object
   on which the reference count was taken.

2. For shared memory policies we were taking reference counts twice
   but only release one reference. So the memory leak was not fixed.

3. The patch figures out in multiple places if a reference
   count on the memory policy was taken or not. However, the refcount
   is only taken under certain conditions and these conditions may
   change. The logic to figure out when to drop the refcount is resulting
   in code that easily breaks.

This patch fixes the issues by:

1. Removing the logic to determine if a refcount was taken earlier.

2. Adding a flag to all functions that can potentially take a refcount
   on a memory. That flag is set if the refcount was taken. The code
   using the memory policy can then just free the refcount if it was
   actually taken.

3. Protect against races between set_mem_policy and numa_maps by taking
   an mmap_sem writelock when a tasks memory policy is changed.

4. There were a couple of places where get_vma_policy() etc is used that
   were missed in the 2.6.23 fix. Fix those too.

Note: IMHO removing the shared memory policy support would be preferable.
Shared policies can still interact in surprising ways with cpusets
and cannot be handled in a reasonable way by page migration.

The removal of shared policy support would result in the refcount issues
going away and code would be much simpler. Semantics would be consistent in
that memory policies only apply to a single process. Sharing of memory policies
would only occur in a controlled way that does not require extra refcounting
for the use of a policy.

We have already vma policy pointers that are currently unused for shmem areas
and could replicate shared policies by setting these pointers in each vma that
is pointing to a shmem area. Changing a shared policy would then require
iterating over all processes using the policy using the reverse maps. At that
point cpuset constraints etc could be considered and eventually a policy change
could even be rejected on the ground that a consistent change is not possible
given the other constraints of the shmem area.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 include/linux/mempolicy.h |    7 +-
 include/linux/mm.h        |    2 
 ipc/shm.c                 |    4 -
 mm/hugetlb.c              |    6 +-
 mm/mempolicy.c            |  113 ++++++++++++++++++----------------------------
 mm/shmem.c                |   17 ++++--
 6 files changed, 67 insertions(+), 82 deletions(-)

Index: linux-2.6/include/linux/mempolicy.h
===================================================================
--- linux-2.6.orig/include/linux/mempolicy.h	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/include/linux/mempolicy.h	2007-10-26 15:50:00.000000000 -0700
@@ -140,7 +140,7 @@ int mpol_set_shared_policy(struct shared
 				struct mempolicy *new);
 void mpol_free_shared_policy(struct shared_policy *p);
 struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
-					    unsigned long idx);
+			    unsigned long idx, int *ref);
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
@@ -151,7 +151,8 @@ extern void mpol_fix_fork_child_flag(str
 
 extern struct mempolicy default_policy;
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
-		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
+			unsigned long addr, gfp_t gfp_flags,
+			struct mempolicy **mpol, int *ref);
 extern unsigned slab_node(struct mempolicy *policy);
 
 extern enum zone_type policy_zone;
@@ -239,7 +240,7 @@ static inline void mpol_fix_fork_child_f
 }
 
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- 		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
+		unsigned long addr, gfp_t gfp_flags, int *ref)
 {
 	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
 }
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/hugetlb.c	2007-10-26 15:50:46.000000000 -0700
@@ -75,9 +75,10 @@ static struct page *dequeue_huge_page(st
 {
 	int nid;
 	struct page *page = NULL;
+	int ref = 0;
 	struct mempolicy *mpol;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
-					htlb_alloc_mask, &mpol);
+					htlb_alloc_mask, &mpol, &ref);
 	struct zone **z;
 
 	for (z = zonelist->zones; *z; z++) {
@@ -94,7 +95,8 @@ static struct page *dequeue_huge_page(st
 			break;
 		}
 	}
-	mpol_free(mpol);	/* unref if mpol !NULL */
+	if (ref)
+		mpol_free(mpol);
 	return page;
 }
 
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/mempolicy.c	2007-10-26 16:23:54.000000000 -0700
@@ -531,6 +531,7 @@ static long do_get_mempolicy(int *policy
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma = NULL;
 	struct mempolicy *pol = current->mempolicy;
+	int ref = 0;
 
 	cpuset_update_task_memory_state();
 	if (flags &
@@ -553,7 +554,7 @@ static long do_get_mempolicy(int *policy
 			return -EFAULT;
 		}
 		if (vma->vm_ops && vma->vm_ops->get_policy)
-			pol = vma->vm_ops->get_policy(vma, addr);
+			pol = vma->vm_ops->get_policy(vma, addr, &ref);
 		else
 			pol = vma->vm_policy;
 	} else if (addr)
@@ -587,7 +588,9 @@ static long do_get_mempolicy(int *policy
 	if (nmask)
 		get_zonemask(pol, nmask);
 
- out:
+out:
+	if (ref)
+		mpol_free(pol);
 	if (vma)
 		up_read(&current->mm->mmap_sem);
 	return err;
@@ -917,7 +920,10 @@ asmlinkage long sys_set_mempolicy(int mo
 	err = get_nodes(&nodes, nmask, maxnode);
 	if (err)
 		return err;
-	return do_set_mempolicy(mode, &nodes);
+	down_write(&current->mm->mmap_sem);
+	err = do_set_mempolicy(mode, &nodes);
+	up_write(&current->mm->mmap_sem);
+	return err;
 }
 
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
@@ -1097,35 +1103,31 @@ asmlinkage long compat_sys_mbind(compat_
 
 /*
  * get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
+ * @task  - task for fallback if vma policy == default
  * @vma   - virtual memory area whose policy is sought
  * @addr  - address in @vma for shared policy lookup
+ * @ref   - reference was taken against policy
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to @task or system default policy, as necessary.
- * Returned policy has extra reference count if shared, vma,
- * or some other task's policy [show_numa_maps() can pass
- * @task != current].  It is the caller's responsibility to
- * free the reference in these cases.
- */
-static struct mempolicy * get_vma_policy(struct task_struct *task,
-		struct vm_area_struct *vma, unsigned long addr)
+ * It is the caller's responsibility to free the reference count
+ * on the policy if any was taken. The refcount guarantees that
+ * the memory policy does not vanish from under us.
+*/
+static struct mempolicy *get_vma_policy(struct task_struct *task,
+		struct vm_area_struct *vma, unsigned long addr, int *ref)
 {
 	struct mempolicy *pol = task->mempolicy;
-	int shared_pol = 0;
 
 	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy) {
-			pol = vma->vm_ops->get_policy(vma, addr);
-			shared_pol = 1;	/* if pol non-NULL, add ref below */
-		} else if (vma->vm_policy &&
+		if (vma->vm_ops && vma->vm_ops->get_policy)
+			pol = vma->vm_ops->get_policy(vma, addr, ref);
+		else if (vma->vm_policy &&
 				vma->vm_policy->policy != MPOL_DEFAULT)
 			pol = vma->vm_policy;
 	}
 	if (!pol)
 		pol = &default_policy;
-	else if (!shared_pol && pol != current->mempolicy)
-		mpol_get(pol);	/* vma or other task's policy */
 	return pol;
 }
 
@@ -1247,39 +1249,23 @@ static inline unsigned interleave_nid(st
  * @addr = address in @vma for shared policy lookup and interleave policy
  * @gfp_flags = for requested zone
  * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ * @ref = indicates that a refcount was taken against mpol.
  *
- * Returns a zonelist suitable for a huge page allocation.
- * If the effective policy is 'BIND, returns pointer to policy's zonelist.
- * If it is also a policy for which get_vma_policy() returns an extra
- * reference, we must hold that reference until after allocation.
- * In that case, return policy via @mpol so hugetlb allocation can drop
- * the reference.  For non-'BIND referenced policies, we can/do drop the
- * reference here, so the caller doesn't need to know about the special case
- * for default and current task policy.
+ * Returns a zonelist suitable for a huge page allocation. The caller must
+ * release the memory policy refcount if ref != 0. The zonelist may be freed
+ * at any time after the reference count is released.
  */
 struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
-				gfp_t gfp_flags, struct mempolicy **mpol)
+			gfp_t gfp_flags, struct mempolicy **mpol, int *ref)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
-	struct zonelist *zl;
-
-	*mpol = NULL;		/* probably no unref needed */
-	if (pol->policy == MPOL_INTERLEAVE) {
+	*mpol = get_vma_policy(current, vma, addr, ref);
+	if ((*mpol)->policy == MPOL_INTERLEAVE) {
 		unsigned nid;
 
-		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
-		__mpol_free(pol);		/* finished with pol */
+		nid = interleave_nid(*mpol, vma, addr, HPAGE_SHIFT);
 		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
 	}
-
-	zl = zonelist_policy(GFP_HIGHUSER, pol);
-	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
-		if (pol->policy != MPOL_BIND)
-			__mpol_free(pol);	/* finished with pol */
-		else
-			*mpol = pol;	/* unref needed after allocation */
-	}
-	return zl;
+	return zonelist_policy(GFP_HIGHUSER, *mpol);
 }
 #endif
 
@@ -1323,8 +1309,9 @@ static struct page *alloc_page_interleav
 struct page *
 alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = get_vma_policy(current, vma, addr);
-	struct zonelist *zl;
+	int ref = 0;
+	struct mempolicy *pol = get_vma_policy(current, vma, addr, &ref);
+	struct page *page;
 
 	cpuset_update_task_memory_state();
 
@@ -1332,21 +1319,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		unsigned nid;
 
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
-		return alloc_page_interleave(gfp, 0, nid);
-	}
-	zl = zonelist_policy(gfp, pol);
-	if (pol != &default_policy && pol != current->mempolicy) {
-		/*
-		 * slow path: ref counted policy -- shared or vma
-		 */
-		struct page *page =  __alloc_pages(gfp, 0, zl);
-		__mpol_free(pol);
-		return page;
-	}
-	/*
-	 * fast path:  default or task policy
-	 */
-	return __alloc_pages(gfp, 0, zl);
+		page = alloc_page_interleave(gfp, 0, nid);
+	} else
+		page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	if (ref)
+		mpol_free(pol);
+	return page;
 }
 
 /**
@@ -1519,7 +1497,8 @@ static void sp_insert(struct shared_poli
 
 /* Find shared policy intersecting idx */
 struct mempolicy *
-mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+mpol_shared_policy_lookup(struct shared_policy *sp,
+			unsigned long idx, int *ref)
 {
 	struct mempolicy *pol = NULL;
 	struct sp_node *sn;
@@ -1531,6 +1510,7 @@ mpol_shared_policy_lookup(struct shared_
 	if (sn) {
 		mpol_get(sn->policy);
 		pol = sn->policy;
+		(*ref)++;
 	}
 	spin_unlock(&sp->lock);
 	return pol;
@@ -1945,8 +1925,9 @@ int show_numa_map(struct seq_file *m, vo
 	struct numa_maps *md;
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
-	struct mempolicy *pol;
 	int n;
+	int ref = 0;
+	struct mempolicy *mpol;
 	char buffer[50];
 
 	if (!mm)
@@ -1956,13 +1937,10 @@ int show_numa_map(struct seq_file *m, vo
 	if (!md)
 		return 0;
 
-	pol = get_vma_policy(priv->task, vma, vma->vm_start);
-	mpol_to_str(buffer, sizeof(buffer), pol);
-	/*
-	 * unref shared or other task's mempolicy
-	 */
-	if (pol != &default_policy && pol != current->mempolicy)
-		__mpol_free(pol);
+	mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref);
+	mpol_to_str(buffer, sizeof(buffer), mpol);
+	if (ref)
+		mpol_free(mpol);
 
 	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
 
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/shmem.c	2007-10-26 15:50:00.000000000 -0700
@@ -1015,14 +1015,16 @@ static struct page *shmem_swapin_async(s
 {
 	struct page *page;
 	struct vm_area_struct pvma;
+	int ref = 0;
 
 	/* Create a pseudo vma that just contains the policy */
 	memset(&pvma, 0, sizeof(struct vm_area_struct));
 	pvma.vm_end = PAGE_SIZE;
 	pvma.vm_pgoff = idx;
-	pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(p, idx, &ref);
 	page = read_swap_cache_async(entry, &pvma, 0);
-	mpol_free(pvma.vm_policy);
+	if (ref)
+		mpol_free(pvma.vm_policy);
 	return page;
 }
 
@@ -1052,13 +1054,16 @@ shmem_alloc_page(gfp_t gfp, struct shmem
 {
 	struct vm_area_struct pvma;
 	struct page *page;
+	int ref = 0;
 
 	memset(&pvma, 0, sizeof(struct vm_area_struct));
-	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy,
+							idx, &ref);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
 	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
-	mpol_free(pvma.vm_policy);
+	if (ref)
+		mpol_free(pvma.vm_policy);
 	return page;
 }
 #else
@@ -1336,13 +1341,13 @@ static int shmem_set_policy(struct vm_ar
 }
 
 static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
-					  unsigned long addr)
+				 unsigned long addr, int *ref)
 {
 	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
 	unsigned long idx;
 
 	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx, ref);
 }
 #endif
 
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/include/linux/mm.h	2007-10-26 15:50:00.000000000 -0700
@@ -173,7 +173,7 @@ struct vm_operations_struct {
 #ifdef CONFIG_NUMA
 	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
-					unsigned long addr);
+				unsigned long addr, int *ref);
 	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
 		const nodemask_t *to, unsigned long flags);
 #endif
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c	2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/ipc/shm.c	2007-10-26 15:50:00.000000000 -0700
@@ -279,14 +279,14 @@ static int shm_set_policy(struct vm_area
 }
 
 static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
-					unsigned long addr)
+					unsigned long addr, int *ref)
 {
 	struct file *file = vma->vm_file;
 	struct shm_file_data *sfd = shm_file_data(file);
 	struct mempolicy *pol = NULL;
 
 	if (sfd->vm_ops->get_policy)
-		pol = sfd->vm_ops->get_policy(vma, addr);
+		pol = sfd->vm_ops->get_policy(vma, addr, ref);
 	else if (vma->vm_policy)
 		pol = vma->vm_policy;
 	else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-26 23:41 [NUMA] Fix memory policy refcounting Christoph Lameter
@ 2007-10-29 15:48 ` Lee Schermerhorn
  2007-10-29 20:24   ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-29 15:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Fri, 2007-10-26 at 16:41 -0700, Christoph Lameter wrote:
> The refcounting fix that went into 2.6.23 left race conditions open because:

Christoph:  I've been testing a set of patches to address the races,
based on our recent discussions [your feed back to my last attempt].
I'll send that along for comment shortly.  Below, a few comments on this
patch.

> 
> 1. Reference counts were taken in get_vma_policy without necessarily
>    holding another lock that guaranteed the existence of the object
>    on which the reference count was taken.

Yes, this was true for the show_numa_maps() case, as we've discussed.  I
agree we need to take the mmap_sem for write in do_set_mempolicy() as we
do in do_mbind().

> 
> 2. For shared memory policies we were taking reference counts twice
>    but only release one reference. So the memory leak was not fixed.

I don't think this was the case.  You removed the code that attempted to
prevent this.  But, I admit I might have missed something...

> 
> 3. The patch figures out in multiple places if a reference
>    count on the memory policy was taken or not. However, the refcount
>    is only taken under certain conditions and these conditions may
>    change. The logic to figure out when to drop the refcount is resulting
>    in code that easily breaks.

Agreed, especially since with the fix for "other task's policy" and
noting that vma policy is protected by mmap_sem, we only need the extra
ref for shared policy and need a way to determine that we need to remove
that.

> 
> This patch fixes the issues by:
> 
> 1. Removing the logic to determine if a refcount was taken earlier.
> 
> 2. Adding a flag to all functions that can potentially take a refcount
>    on a memory. That flag is set if the refcount was taken. The code
>    using the memory policy can then just free the refcount if it was
>    actually taken.

This does add some additional code in the alloc path and adds an
additional arg to a lot of functions that I think we can remove by
marking shared policies as such and only derefing those.  

> 
> 3. Protect against races between set_mem_policy and numa_maps by taking
>    an mmap_sem writelock when a tasks memory policy is changed.

Agreed.

> 
> 4. There were a couple of places where get_vma_policy() etc is used that
>    were missed in the 2.6.23 fix. Fix those too.
> 
> Note: IMHO removing the shared memory policy support would be preferable.
> Shared policies can still interact in surprising ways with cpusets
> and cannot be handled in a reasonable way by page migration.

Yeah, yeah, yeah.  But I consider that to be cpusets' fault and not
shared memory policy.  I still have use for the latter.  We need to find
a way to accomodate all of our requirements, even if it means
documenting that shared memory policy must be used very carefully with
cpusets--or not at all with dynamically changing cpusets.  I can
certainly live with that.

> 
> The removal of shared policy support would result in the refcount issues
> going away and code would be much simpler. Semantics would be consistent in
> that memory policies only apply to a single process. Sharing of memory policies
> would only occur in a controlled way that does not require extra refcounting
> for the use of a policy.

Yes, and we'd loose control over placement of shared pages except by
hacking our task policy and prefaulting, or requiring every program that
attaches to be aware of the numa policy of the overall application.  I
find this as objectionable as you find shared policies.  

> 
> We have already vma policy pointers that are currently unused for shmem areas
> and could replicate shared policies by setting these pointers in each vma that
> is pointing to a shmem area. 

Doesn't work for me. :(

> Changing a shared policy would then require
> iterating over all processes using the policy using the reverse maps. At that
> point cpuset constraints etc could be considered and eventually a policy change
> could even be rejected on the ground that a consistent change is not possible
> given the other constraints of the shmem area.

Policy remapping isn't already complex enough for you, huh? :-)

> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 
> ---
>  include/linux/mempolicy.h |    7 +-
>  include/linux/mm.h        |    2 
>  ipc/shm.c                 |    4 -
>  mm/hugetlb.c              |    6 +-
>  mm/mempolicy.c            |  113 ++++++++++++++++++----------------------------
>  mm/shmem.c                |   17 ++++--
>  6 files changed, 67 insertions(+), 82 deletions(-)
> 
> Index: linux-2.6/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mempolicy.h	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/include/linux/mempolicy.h	2007-10-26 15:50:00.000000000 -0700
> @@ -140,7 +140,7 @@ int mpol_set_shared_policy(struct shared
>  				struct mempolicy *new);
>  void mpol_free_shared_policy(struct shared_policy *p);
>  struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> -					    unsigned long idx);
> +			    unsigned long idx, int *ref);
>  
>  extern void numa_default_policy(void);
>  extern void numa_policy_init(void);
> @@ -151,7 +151,8 @@ extern void mpol_fix_fork_child_flag(str
>  
>  extern struct mempolicy default_policy;
>  extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> -		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
> +			unsigned long addr, gfp_t gfp_flags,
> +			struct mempolicy **mpol, int *ref);
>  extern unsigned slab_node(struct mempolicy *policy);
>  
>  extern enum zone_type policy_zone;
> @@ -239,7 +240,7 @@ static inline void mpol_fix_fork_child_f
>  }
>  
>  static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> - 		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
> +		unsigned long addr, gfp_t gfp_flags, int *ref)
>  {
>  	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
>  }
> Index: linux-2.6/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.orig/mm/hugetlb.c	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/mm/hugetlb.c	2007-10-26 15:50:46.000000000 -0700
> @@ -75,9 +75,10 @@ static struct page *dequeue_huge_page(st
>  {
>  	int nid;
>  	struct page *page = NULL;
> +	int ref = 0;
>  	struct mempolicy *mpol;
>  	struct zonelist *zonelist = huge_zonelist(vma, address,
> -					htlb_alloc_mask, &mpol);
> +					htlb_alloc_mask, &mpol, &ref);
>  	struct zone **z;
>  
>  	for (z = zonelist->zones; *z; z++) {
> @@ -94,7 +95,8 @@ static struct page *dequeue_huge_page(st
>  			break;
>  		}
>  	}
> -	mpol_free(mpol);	/* unref if mpol !NULL */
> +	if (ref)
> +		mpol_free(mpol);
This shouldn't be necessary if huge_zonelist only returns a non-NULL
mpol if the unref is required, as I had done.  mpol_free() on a NULL
mpol is a no-op, as the comment was intended to convey.  You could drop
the extra ref argument to huge_zonelist--not that this should be much of
a fast path.

>  	return page;
>  }
>  
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/mm/mempolicy.c	2007-10-26 16:23:54.000000000 -0700
> @@ -531,6 +531,7 @@ static long do_get_mempolicy(int *policy
>  	struct mm_struct *mm = current->mm;
>  	struct vm_area_struct *vma = NULL;
>  	struct mempolicy *pol = current->mempolicy;
> +	int ref = 0;
>  
>  	cpuset_update_task_memory_state();
>  	if (flags &
> @@ -553,7 +554,7 @@ static long do_get_mempolicy(int *policy
>  			return -EFAULT;
>  		}
>  		if (vma->vm_ops && vma->vm_ops->get_policy)
> -			pol = vma->vm_ops->get_policy(vma, addr);
> +			pol = vma->vm_ops->get_policy(vma, addr, &ref);
>  		else
>  			pol = vma->vm_policy;
>  	} else if (addr)
> @@ -587,7 +588,9 @@ static long do_get_mempolicy(int *policy
>  	if (nmask)
>  		get_zonemask(pol, nmask);
>  
> - out:
> +out:
> +	if (ref)
> +		mpol_free(pol);
>  	if (vma)
>  		up_read(&current->mm->mmap_sem);
>  	return err;
> @@ -917,7 +920,10 @@ asmlinkage long sys_set_mempolicy(int mo
>  	err = get_nodes(&nodes, nmask, maxnode);
>  	if (err)
>  		return err;
> -	return do_set_mempolicy(mode, &nodes);
> +	down_write(&current->mm->mmap_sem);
> +	err = do_set_mempolicy(mode, &nodes);
> +	up_write(&current->mm->mmap_sem);
> +	return err;
>  }
>  
>  asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
> @@ -1097,35 +1103,31 @@ asmlinkage long compat_sys_mbind(compat_
>  
>  /*
>   * get_vma_policy(@task, @vma, @addr)
> - * @task - task for fallback if vma policy == default
> + * @task  - task for fallback if vma policy == default
>   * @vma   - virtual memory area whose policy is sought
>   * @addr  - address in @vma for shared policy lookup
> + * @ref   - reference was taken against policy
>   *
>   * Returns effective policy for a VMA at specified address.
>   * Falls back to @task or system default policy, as necessary.
> - * Returned policy has extra reference count if shared, vma,
> - * or some other task's policy [show_numa_maps() can pass
> - * @task != current].  It is the caller's responsibility to
> - * free the reference in these cases.
> - */
> -static struct mempolicy * get_vma_policy(struct task_struct *task,
> -		struct vm_area_struct *vma, unsigned long addr)
> + * It is the caller's responsibility to free the reference count
> + * on the policy if any was taken. The refcount guarantees that
> + * the memory policy does not vanish from under us.
> +*/
> +static struct mempolicy *get_vma_policy(struct task_struct *task,
> +		struct vm_area_struct *vma, unsigned long addr, int *ref)
>  {
>  	struct mempolicy *pol = task->mempolicy;
> -	int shared_pol = 0;
>  
>  	if (vma) {
> -		if (vma->vm_ops && vma->vm_ops->get_policy) {
> -			pol = vma->vm_ops->get_policy(vma, addr);
> -			shared_pol = 1;	/* if pol non-NULL, add ref below */
Mea culpa.  Comment is wrong here.  It AVOIDS adding a ref to shared
pols.

> -		} else if (vma->vm_policy &&
> +		if (vma->vm_ops && vma->vm_ops->get_policy)
> +			pol = vma->vm_ops->get_policy(vma, addr, ref);
> +		else if (vma->vm_policy &&
>  				vma->vm_policy->policy != MPOL_DEFAULT)
>  			pol = vma->vm_policy;
>  	}
>  	if (!pol)
>  		pol = &default_policy;
> -	else if (!shared_pol && pol != current->mempolicy)
                   ^^^^^^^^^^^  won't do the get if shared_pol is true
> -		mpol_get(pol);	/* vma or other task's policy */
>  	return pol;
>  }
>  
> @@ -1247,39 +1249,23 @@ static inline unsigned interleave_nid(st
>   * @addr = address in @vma for shared policy lookup and interleave policy
>   * @gfp_flags = for requested zone
>   * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
> + * @ref = indicates that a refcount was taken against mpol.
>   *
> - * Returns a zonelist suitable for a huge page allocation.
> - * If the effective policy is 'BIND, returns pointer to policy's zonelist.
> - * If it is also a policy for which get_vma_policy() returns an extra
> - * reference, we must hold that reference until after allocation.
> - * In that case, return policy via @mpol so hugetlb allocation can drop
> - * the reference.  For non-'BIND referenced policies, we can/do drop the
> - * reference here, so the caller doesn't need to know about the special case
> - * for default and current task policy.
> + * Returns a zonelist suitable for a huge page allocation. The caller must
> + * release the memory policy refcount if ref != 0. The zonelist may be freed
> + * at any time after the reference count is released.
>   */
>  struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> -				gfp_t gfp_flags, struct mempolicy **mpol)
> +			gfp_t gfp_flags, struct mempolicy **mpol, int *ref)
>  {
> -	struct mempolicy *pol = get_vma_policy(current, vma, addr);
> -	struct zonelist *zl;
> -
> -	*mpol = NULL;		/* probably no unref needed */
> -	if (pol->policy == MPOL_INTERLEAVE) {
> +	*mpol = get_vma_policy(current, vma, addr, ref);
> +	if ((*mpol)->policy == MPOL_INTERLEAVE) {
>  		unsigned nid;
>  
> -		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
> -		__mpol_free(pol);		/* finished with pol */

Again, his was an error on my part.  Should have unconditionally free'd.
Most recent series fixes this.

> +		nid = interleave_nid(*mpol, vma, addr, HPAGE_SHIFT);
>  		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
>  	}
> -
> -	zl = zonelist_policy(GFP_HIGHUSER, pol);
> -	if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
> -		if (pol->policy != MPOL_BIND)
> -			__mpol_free(pol);	/* finished with pol */
> -		else
> -			*mpol = pol;	/* unref needed after allocation */
> -	}
> -	return zl;
> +	return zonelist_policy(GFP_HIGHUSER, *mpol);
>  }
>  #endif
>  
> @@ -1323,8 +1309,9 @@ static struct page *alloc_page_interleav
>  struct page *
>  alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
>  {
> -	struct mempolicy *pol = get_vma_policy(current, vma, addr);
> -	struct zonelist *zl;
> +	int ref = 0;
> +	struct mempolicy *pol = get_vma_policy(current, vma, addr, &ref);
> +	struct page *page;
>  
>  	cpuset_update_task_memory_state();
>  
> @@ -1332,21 +1319,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area
>  		unsigned nid;
>  
>  		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
I also missed an unref here.  Fixed in soon to come series.

> -		return alloc_page_interleave(gfp, 0, nid);
> -	}
> -	zl = zonelist_policy(gfp, pol);
> -	if (pol != &default_policy && pol != current->mempolicy) {
With my latest attempt, this is only necessary for shared policies.
> -		/*
> -		 * slow path: ref counted policy -- shared or vma
> -		 */
> -		struct page *page =  __alloc_pages(gfp, 0, zl);
> -		__mpol_free(pol);
> -		return page;
> -	}
> -	/*
> -	 * fast path:  default or task policy
> -	 */
> -	return __alloc_pages(gfp, 0, zl);
> +		page = alloc_page_interleave(gfp, 0, nid);
> +	} else
> +		page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> +	if (ref)
> +		mpol_free(pol);
> +	return page;

Andi wanted to keep the fast path as a tail-call here.  I was trying to
preserve that.  
>  }
>  
>  /**
> @@ -1519,7 +1497,8 @@ static void sp_insert(struct shared_poli
>  
>  /* Find shared policy intersecting idx */
>  struct mempolicy *
> -mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
> +mpol_shared_policy_lookup(struct shared_policy *sp,
> +			unsigned long idx, int *ref)
>  {
>  	struct mempolicy *pol = NULL;
>  	struct sp_node *sn;
> @@ -1531,6 +1510,7 @@ mpol_shared_policy_lookup(struct shared_
>  	if (sn) {
>  		mpol_get(sn->policy);
>  		pol = sn->policy;
> +		(*ref)++;
>  	}
>  	spin_unlock(&sp->lock);
>  	return pol;
> @@ -1945,8 +1925,9 @@ int show_numa_map(struct seq_file *m, vo
>  	struct numa_maps *md;
>  	struct file *file = vma->vm_file;
>  	struct mm_struct *mm = vma->vm_mm;
> -	struct mempolicy *pol;
>  	int n;
> +	int ref = 0;
> +	struct mempolicy *mpol;
>  	char buffer[50];
>  
>  	if (!mm)
> @@ -1956,13 +1937,10 @@ int show_numa_map(struct seq_file *m, vo
>  	if (!md)
>  		return 0;
>  
> -	pol = get_vma_policy(priv->task, vma, vma->vm_start);
> -	mpol_to_str(buffer, sizeof(buffer), pol);
> -	/*
> -	 * unref shared or other task's mempolicy
> -	 */
> -	if (pol != &default_policy && pol != current->mempolicy)
> -		__mpol_free(pol);
> +	mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref);
> +	mpol_to_str(buffer, sizeof(buffer), mpol);
> +	if (ref)
> +		mpol_free(mpol);
If we really want to add the ref argument to get_vma_policy(), we could
avoid bringing it down this deeply by requiring that all get_policy()
vm_ops add the extra ref [these are only used for shared memory policy
now] and set ref !0 when get_policy() returns a non-null policy.  This
would be an alternative to marking shared policies as such.

>  
>  	seq_printf(m, "%08lx %s", vma->vm_start, buffer);
>  
> Index: linux-2.6/mm/shmem.c
> ===================================================================
> --- linux-2.6.orig/mm/shmem.c	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/mm/shmem.c	2007-10-26 15:50:00.000000000 -0700
> @@ -1015,14 +1015,16 @@ static struct page *shmem_swapin_async(s
>  {
>  	struct page *page;
>  	struct vm_area_struct pvma;
> +	int ref = 0;
>  
>  	/* Create a pseudo vma that just contains the policy */
>  	memset(&pvma, 0, sizeof(struct vm_area_struct));
>  	pvma.vm_end = PAGE_SIZE;
>  	pvma.vm_pgoff = idx;
> -	pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
> +	pvma.vm_policy = mpol_shared_policy_lookup(p, idx, &ref);
>  	page = read_swap_cache_async(entry, &pvma, 0);
> -	mpol_free(pvma.vm_policy);

By the way, my shared policy series, that you love so much ;), obviates
all of this pseudo-vma stuff.  This function becomes practically a
one-liner.  I've been holding off on that while all the other mempolicy
stuff [Mel's work, ref counting, ...] settles down.

> +	if (ref)
> +		mpol_free(pvma.vm_policy);
>  	return page;
>  }
>  
> @@ -1052,13 +1054,16 @@ shmem_alloc_page(gfp_t gfp, struct shmem
>  {
>  	struct vm_area_struct pvma;
>  	struct page *page;
> +	int ref = 0;
>  
>  	memset(&pvma, 0, sizeof(struct vm_area_struct));
> -	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
> +	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy,
> +							idx, &ref);
>  	pvma.vm_pgoff = idx;
>  	pvma.vm_end = PAGE_SIZE;
>  	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);

Again, we can dispense with pseudo-vma's here, in time.

> -	mpol_free(pvma.vm_policy);
> +	if (ref)
> +		mpol_free(pvma.vm_policy);
>  	return page;
>  }
>  #else
> @@ -1336,13 +1341,13 @@ static int shmem_set_policy(struct vm_ar
>  }
>  
>  static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
> -					  unsigned long addr)
> +				 unsigned long addr, int *ref)
>  {
>  	struct inode *i = vma->vm_file->f_path.dentry->d_inode;
>  	unsigned long idx;
>  
>  	idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> -	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
> +	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx, ref);
>  }
>  #endif
>  
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/include/linux/mm.h	2007-10-26 15:50:00.000000000 -0700
> @@ -173,7 +173,7 @@ struct vm_operations_struct {
>  #ifdef CONFIG_NUMA
>  	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
> -					unsigned long addr);
> +				unsigned long addr, int *ref);
>  	int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
>  		const nodemask_t *to, unsigned long flags);
>  #endif
> Index: linux-2.6/ipc/shm.c
> ===================================================================
> --- linux-2.6.orig/ipc/shm.c	2007-10-26 15:46:53.000000000 -0700
> +++ linux-2.6/ipc/shm.c	2007-10-26 15:50:00.000000000 -0700
> @@ -279,14 +279,14 @@ static int shm_set_policy(struct vm_area
>  }
>  
>  static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
> -					unsigned long addr)
> +					unsigned long addr, int *ref)
>  {
>  	struct file *file = vma->vm_file;
>  	struct shm_file_data *sfd = shm_file_data(file);
>  	struct mempolicy *pol = NULL;
>  
>  	if (sfd->vm_ops->get_policy)
> -		pol = sfd->vm_ops->get_policy(vma, addr);
> +		pol = sfd->vm_ops->get_policy(vma, addr, ref);
>  	else if (vma->vm_policy)
>  		pol = vma->vm_policy;
>  	else

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-29 15:48 ` Lee Schermerhorn
@ 2007-10-29 20:24   ` Christoph Lameter
  2007-10-29 21:34     ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-10-29 20:24 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Mon, 29 Oct 2007, Lee Schermerhorn wrote:

> > 1. Reference counts were taken in get_vma_policy without necessarily
> >    holding another lock that guaranteed the existence of the object
> >    on which the reference count was taken.
> 
> Yes, this was true for the show_numa_maps() case, as we've discussed.  I
> agree we need to take the mmap_sem for write in do_set_mempolicy() as we
> do in do_mbind().

Yeah it seems that we were safe for shared policies since we took the 
refcount twice?

> > 2. Adding a flag to all functions that can potentially take a refcount
> >    on a memory. That flag is set if the refcount was taken. The code
> >    using the memory policy can then just free the refcount if it was
> >    actually taken.
> 
> This does add some additional code in the alloc path and adds an
> additional arg to a lot of functions that I think we can remove by
> marking shared policies as such and only derefing those.  

The get_policy function and friends may not only return shared policies 
but also task etc policies.

> Yeah, yeah, yeah.  But I consider that to be cpusets' fault and not
> shared memory policy.  I still have use for the latter.  We need to find
> a way to accomodate all of our requirements, even if it means
> documenting that shared memory policy must be used very carefully with
> cpusets--or not at all with dynamically changing cpusets.  I can
> certainly live with that.

There is no reason that this issue should exist. We can have your shared 
policies with proper enforcement that no bad things happen if we get rid 
of get_policy etc and instead use the vma policy pointer to point to the 
shared policy. Take a refcount for each vma as it is setup to point to a 
shared policy and you will not have to take the refcount in the hot paths.

> > The removal of shared policy support would result in the refcount issues
> > going away and code would be much simpler. Semantics would be consistent in
> > that memory policies only apply to a single process. Sharing of memory policies
> > would only occur in a controlled way that does not require extra refcounting
> > for the use of a policy.
> 
> Yes, and we'd loose control over placement of shared pages except by
> hacking our task policy and prefaulting, or requiring every program that
> attaches to be aware of the numa policy of the overall application.  I
> find this as objectionable as you find shared policies.  

That is not true.

> > We have already vma policy pointers that are currently unused for shmem areas
> > and could replicate shared policies by setting these pointers in each vma that
> > is pointing to a shmem area. 
> 
> Doesn't work for me. :(

Why not? Point them to the shared policy and add a refcount and things 
would be much easier.

> > Changing a shared policy would then require
> > iterating over all processes using the policy using the reverse maps. At that
> > point cpuset constraints etc could be considered and eventually a policy change
> > could even be rejected on the ground that a consistent change is not possible
> > given the other constraints of the shmem area.
> 
> Policy remapping isn't already complex enough for you, huh? :-)

Policy remapping is one thing since we can correct the policy. Shared 
policies cannot be correct if applied to multiple cpuset context. The 
looping over reverse maps is a pretty standard way of doing things which 
would allow us to detect bad policies at the time they are created and not 
later.

> >  		}
> >  	}
> > -	mpol_free(mpol);	/* unref if mpol !NULL */
> > +	if (ref)
> > +		mpol_free(mpol);
> This shouldn't be necessary if huge_zonelist only returns a non-NULL
> mpol if the unref is required, as I had done.  mpol_free() on a NULL
> mpol is a no-op, as the comment was intended to convey.  You could drop
> the extra ref argument to huge_zonelist--not that this should be much of
> a fast path.

True.

> > -	mpol_to_str(buffer, sizeof(buffer), pol);
> > -	/*
> > -	 * unref shared or other task's mempolicy
> > -	 */
> > -	if (pol != &default_policy && pol != current->mempolicy)
> > -		__mpol_free(pol);
> > +	mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref);
> > +	mpol_to_str(buffer, sizeof(buffer), mpol);
> > +	if (ref)
> > +		mpol_free(mpol);
> If we really want to add the ref argument to get_vma_policy(), we could
> avoid bringing it down this deeply by requiring that all get_policy()
> vm_ops add the extra ref [these are only used for shared memory policy
> now] and set ref !0 when get_policy() returns a non-null policy.  This
> would be an alternative to marking shared policies as such.

Please be aware that refcounting in the hot paths is a bad thing to do. 
We are potentially bouncing a cacheline. I'd rather get rid of that 
completely.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-29 20:24   ` Christoph Lameter
@ 2007-10-29 21:34     ` Lee Schermerhorn
  2007-10-29 21:43       ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-29 21:34 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Mon, 2007-10-29 at 13:24 -0700, Christoph Lameter wrote:
> On Mon, 29 Oct 2007, Lee Schermerhorn wrote:
> 
> > > 1. Reference counts were taken in get_vma_policy without necessarily
> > >    holding another lock that guaranteed the existence of the object
> > >    on which the reference count was taken.
> > 
> > Yes, this was true for the show_numa_maps() case, as we've discussed.  I
> > agree we need to take the mmap_sem for write in do_set_mempolicy() as we
> > do in do_mbind().
> 
> Yeah it seems that we were safe for shared policies since we took the 
> refcount twice?

I still don't see where we took the ref twice on shared policies.  My
comment was wrong.  It should have said "avoid extra ref" which is what
the code [in get_vma_policy()] does now, contradicting the comment.

> 
> > > 2. Adding a flag to all functions that can potentially take a refcount
> > >    on a memory. That flag is set if the refcount was taken. The code
> > >    using the memory policy can then just free the refcount if it was
> > >    actually taken.
> > 
> > This does add some additional code in the alloc path and adds an
> > additional arg to a lot of functions that I think we can remove by
> > marking shared policies as such and only derefing those.  
> 
> The get_policy function and friends may not only return shared policies 
> but also task etc policies.

Right, but they don't [don't need to] add extra ref's like the shared
policies do.  So, if we can id a shared policy, we unref only those.
I'll send out my RFC probably tomorrow am.  

> 
> > Yeah, yeah, yeah.  But I consider that to be cpusets' fault and not
> > shared memory policy.  I still have use for the latter.  We need to find
> > a way to accomodate all of our requirements, even if it means
> > documenting that shared memory policy must be used very carefully with
> > cpusets--or not at all with dynamically changing cpusets.  I can
> > certainly live with that.
> 
> There is no reason that this issue should exist. We can have your shared 
> policies with proper enforcement that no bad things happen if we get rid 
> of get_policy etc and instead use the vma policy pointer to point to the 
> shared policy. Take a refcount for each vma as it is setup to point to a 
> shared policy and you will not have to take the refcount in the hot paths.

We support different policies on different ranges of a shared memory
segment.  In the task which installs this policy, we split the vmas, but
any other tasks which already have the segment attached or which
subsequently attach don't split the vmas along policy bondaries.  This
also makes numa_maps lie when we have multiple subrange policies.

My run of your page fault test with and without the ref count patches
showed no add'l overhead for reference counting.  Of course, I can't go
up the number of nodes you can.  You could try it on one of your
platform with a high node count to see.  

> 
> > > The removal of shared policy support would result in the refcount issues
> > > going away and code would be much simpler. Semantics would be consistent in
> > > that memory policies only apply to a single process. Sharing of memory policies
> > > would only occur in a controlled way that does not require extra refcounting
> > > for the use of a policy.
> > 
> > Yes, and we'd loose control over placement of shared pages except by
> > hacking our task policy and prefaulting, or requiring every program that
> > attaches to be aware of the numa policy of the overall application.  I
> > find this as objectionable as you find shared policies.  
> 
> That is not true.

???

> 
> > > We have already vma policy pointers that are currently unused for shmem areas
> > > and could replicate shared policies by setting these pointers in each vma that
> > > is pointing to a shmem area. 
> > 
> > Doesn't work for me. :(
> 
> Why not? Point them to the shared policy and add a refcount and things 
> would be much easier.

I'd really like to avoid having to map multiple vmas when I attach a shm
seg with different policies on different ranges, and  fix up all
existing tasks attached to a shmem when installing policies on ranges of
the segment.   Having the share policy lookup add a ref while holding
the shared policy spinlock handles this fine now.

> 
> > > Changing a shared policy would then require
> > > iterating over all processes using the policy using the reverse maps. At that
> > > point cpuset constraints etc could be considered and eventually a policy change
> > > could even be rejected on the ground that a consistent change is not possible
> > > given the other constraints of the shmem area.
> > 
> > Policy remapping isn't already complex enough for you, huh? :-)
> 
> Policy remapping is one thing since we can correct the policy. Shared 
> policies cannot be correct if applied to multiple cpuset context. 

I agree.  Especially if the attaching task's aren't cooperating, and
when the cpuset is reconfigured or tasks moved.  I don't need it to work
in these cases. 

> The 
> looping over reverse maps is a pretty standard way of doing things which 
> would allow us to detect bad policies at the time they are created and not 
> later.

I guess we could also do the rmap walk at attach time and reject the
attach if the shmem policy is not valid in the attaching task's cpuset?
And when we try to move the task to a new cpuset, reject the move if any
policy would be invalid in the new cpuset?  

Maybe better to just document the constraints and the behavior that one
might expect if they violate the constraints.  For example, a
default/local policy would be valid in all cpusets.  My proposal for
"cpuset-independent interleave policy" [which is semantically the same
as David's "interleave over all allowed", altho' we propose different
syscall arg values to specify it, and David's method requires remapping
the interleave nodemask, I think, where mine doesn't--details] would be
valid in any cpuset.  In these cases, page placement might be different
than for a set of tasks in a single, static cpuset attached to shmem
segments, which might be "surprising" behavior for some tasks, but we
could document that.

Any policy that explicitly specifies nodes that are not valid in all
cpusets containing tasks sharing a shmem segment would be problematic
and should be avoided.  A possible solution to this is to do the
cpuset-relative to physical nodeid translation at allocation time, but I
don't think you want that in the allocation path!

> 
> > >  		}
> > >  	}
> > > -	mpol_free(mpol);	/* unref if mpol !NULL */
> > > +	if (ref)
> > > +		mpol_free(mpol);
> > This shouldn't be necessary if huge_zonelist only returns a non-NULL
> > mpol if the unref is required, as I had done.  mpol_free() on a NULL
> > mpol is a no-op, as the comment was intended to convey.  You could drop
> > the extra ref argument to huge_zonelist--not that this should be much of
> > a fast path.
> 
> True.
> 
> > > -	mpol_to_str(buffer, sizeof(buffer), pol);
> > > -	/*
> > > -	 * unref shared or other task's mempolicy
> > > -	 */
> > > -	if (pol != &default_policy && pol != current->mempolicy)
> > > -		__mpol_free(pol);
> > > +	mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref);
> > > +	mpol_to_str(buffer, sizeof(buffer), mpol);
> > > +	if (ref)
> > > +		mpol_free(mpol);
> > If we really want to add the ref argument to get_vma_policy(), we could
> > avoid bringing it down this deeply by requiring that all get_policy()
> > vm_ops add the extra ref [these are only used for shared memory policy
> > now] and set ref !0 when get_policy() returns a non-null policy.  This
> > would be an alternative to marking shared policies as such.
> 
> Please be aware that refcounting in the hot paths is a bad thing to do. 
> We are potentially bouncing a cacheline. I'd rather get rid of that 
> completely.

I understand the goal.  Sometimes one must ref count for adequate
protection to get the semantics/features one wants--that I want, anyway.
Again, my tests showed no measureable overhead.  [Yes, lots of little
"no measureable overhead" changes can add up...].

Anyway, let's keep chipping away.  I'll send out my series for RFC and
then do some more measurements.

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-29 21:34     ` Lee Schermerhorn
@ 2007-10-29 21:43       ` Christoph Lameter
  2007-10-30 16:39         ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-10-29 21:43 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Mon, 29 Oct 2007, Lee Schermerhorn wrote:

> > > Yeah, yeah, yeah.  But I consider that to be cpusets' fault and not
> > > shared memory policy.  I still have use for the latter.  We need to find
> > > a way to accomodate all of our requirements, even if it means
> > > documenting that shared memory policy must be used very carefully with
> > > cpusets--or not at all with dynamically changing cpusets.  I can
> > > certainly live with that.
> > 
> > There is no reason that this issue should exist. We can have your shared 
> > policies with proper enforcement that no bad things happen if we get rid 
> > of get_policy etc and instead use the vma policy pointer to point to the 
> > shared policy. Take a refcount for each vma as it is setup to point to a 
> > shared policy and you will not have to take the refcount in the hot paths.
> 
> We support different policies on different ranges of a shared memory
> segment.  In the task which installs this policy, we split the vmas, but
> any other tasks which already have the segment attached or which
> subsequently attach don't split the vmas along policy bondaries.  This
> also makes numa_maps lie when we have multiple subrange policies.

Which would also be fixed if we would split the vmas properly.
 
> > > > We have already vma policy pointers that are currently unused for shmem areas
> > > > and could replicate shared policies by setting these pointers in each vma that
> > > > is pointing to a shmem area. 
> > > 
> > > Doesn't work for me. :(
> > 
> > Why not? Point them to the shared policy and add a refcount and things 
> > would be much easier.
> 
> I'd really like to avoid having to map multiple vmas when I attach a shm
> seg with different policies on different ranges, and  fix up all
> existing tasks attached to a shmem when installing policies on ranges of
> the segment.   Having the share policy lookup add a ref while holding
> the shared policy spinlock handles this fine now.

Why? Attaching a policy is a rare thing right?

> > The 
> > looping over reverse maps is a pretty standard way of doing things which 
> > would allow us to detect bad policies at the time they are created and not 
> > later.
> 
> I guess we could also do the rmap walk at attach time and reject the
> attach if the shmem policy is not valid in the attaching task's cpuset?
> And when we try to move the task to a new cpuset, reject the move if any
> policy would be invalid in the new cpuset?  

Yes reject or do some corrective action. At least be aware of the 
situation.

> Any policy that explicitly specifies nodes that are not valid in all
> cpusets containing tasks sharing a shmem segment would be problematic
> and should be avoided.  A possible solution to this is to do the
> cpuset-relative to physical nodeid translation at allocation time, but I
> don't think you want that in the allocation path!

The problem is that it can create subtle issues right now. If we would 
handle it through vma policy pointers then we would be able to detect many 
things and the refcount issues would go away.

> > Please be aware that refcounting in the hot paths is a bad thing to do. 
> > We are potentially bouncing a cacheline. I'd rather get rid of that 
> > completely.
> 
> I understand the goal.  Sometimes one must ref count for adequate
> protection to get the semantics/features one wants--that I want, anyway.
> Again, my tests showed no measureable overhead.  [Yes, lots of little
> "no measureable overhead" changes can add up...].

The refcounting in hot paths causes a lot of issues here not only scaling. 
Removing them would be the smart thing to do. The logic to update vmas, 
loop over rmaps etc etc all exists. The code would likely be minimal.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-29 21:43       ` Christoph Lameter
@ 2007-10-30 16:39         ` Lee Schermerhorn
  2007-10-30 18:42           ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-30 16:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Mon, 2007-10-29 at 14:43 -0700, Christoph Lameter wrote:
> On Mon, 29 Oct 2007, Lee Schermerhorn wrote:
> 
> > > > Yeah, yeah, yeah.  But I consider that to be cpusets' fault and not
> > > > shared memory policy.  I still have use for the latter.  We need to find
> > > > a way to accomodate all of our requirements, even if it means
> > > > documenting that shared memory policy must be used very carefully with
> > > > cpusets--or not at all with dynamically changing cpusets.  I can
> > > > certainly live with that.
> > > 
> > > There is no reason that this issue should exist. We can have your shared 
> > > policies with proper enforcement that no bad things happen if we get rid 
> > > of get_policy etc and instead use the vma policy pointer to point to the 
> > > shared policy. Take a refcount for each vma as it is setup to point to a 
> > > shared policy and you will not have to take the refcount in the hot paths.
> > 
> > We support different policies on different ranges of a shared memory
> > segment.  In the task which installs this policy, we split the vmas, but
> > any other tasks which already have the segment attached or which
> > subsequently attach don't split the vmas along policy bondaries.  This
> > also makes numa_maps lie when we have multiple subrange policies.
> 
> Which would also be fixed if we would split the vmas properly.

The problem I see with splitting vmas for shared policy is that, to be
correct, when you apply a sub-range policy to a shm segment that already
has tasks attached, you'd have to split those task's vmas as
well--either from outside the tasks, or somehow notify them to do it
themselves.  In general, I really want to avoid requiring every process
in a multi-task application to install policies on shared objects
uniformly to get correct behavior.  However, something you said
yesterday [about vma pointers to shared policies] got me thinking last
evening of another approach.  Here's an idea.

First, the situation we have today:

task1 creates [shmget()] and attaches [shmat()] a shm segment.  W/o
SHM_HUGETLB flag, we get a tmpfs mapping with shmem vm_ops.  These
vm_ops support shared mempolicy that maintains ranges of mempolicy in an
rbtree.  After task1 installs [mbind()] two mempolicies on subset ranges
of the shm segment, we get the vma connections shown below on the left
[vma1.[12]] in Figure 1 [please forgive the lame ascii art].  The
reference count of 1 on the mempolicies represents the reference held by
the shared policy rbtree itself.  The horizontal "arrows" do NOT
represent actual pointers.  Rather they represent the association
between the vm_start of the vma and the offset of the start of the
policy range.  The vertical "arrows" represent the length of the range
of virtual addresses mapped by each vma.  The original vma was split in
two when the mempolicies were installed.

Now task2 attaches [shmat] the segment, without installing any policy.
[NUMA layout and mempolicy installation is the responsibility of task1
in this mythical multi-task application.]  Because task2 attaches
[do_mmap*() internally] the entire segment--unlike mmap(), shmat() has
no provision to attach a subset of the segment--it gets a single vma
mapping the entire segment.   We get the vma connection shown on the
right in Figure 1.  [We'd get this same configuration if task2 were
already attached when task1 installs the policies.] Again, the vertical
arrow represents the range of virtual addresses mapped by the single
vma.  The attach does not increment the reference count.

                                    Figure 1

        task1,                   Shared Policy               task2,
      mm_struct1,                (w/ rb tree)              mm_struct2,
                              -------------------
       vma1.1---------------->|                 |<------------vma2.1
                            | |  mode, nodemask,| |
                            | |     ref = 1     | |
                            V |                 | |
                              ------------------- |
       vma1.2---------------->|                 | |
                            | |  mode, nodemask,| |
                            | |     ref = 1     | |
                            V |                 | V
                              -------------------

Note that if we cat /proc/<pid1>/numa_maps to display task1's numa maps,
we'll see both policies in the the rbtree.  If we display task2's numa
maps, we'll see only the policy at the front of the segment.  However,
we'll count the page stats over the entire range and report these.  I
can show you an example of this using memtoy, if you'd like, but it's
somewhat orthogonal to the reference counting issue.  Still, I can
imagine that it could confuse customers and result in unnecessary
service calls...

Next,

As part of my shared policy cleanup and enhancement series, I "fixed"
numa_maps to display the sub-ranges of policies in a shm segment mapped
by a single vma. As part of this fix, I also modified mempolicy.c so
that it does not split vmas that support set_policy vm_ops, because
handling both split vmas and non-split vmas for a single shm segment
would have complicated the code more than I thought necessary.  This is
still at prototype stage--altho' it works against 23-rc8-mm2.

With the these changes, the vma connections and ref counts, look like
this:

                                    Figure 2

        task1,                   Shared Policy               task2,
      mm_struct1,                (w/ rb tree)              mm_struct2,
                              -------------------
       vma1.1---------------->|                 |<------------vma2.1
                            | |  mode, nodemask,| |
                            | |     ref = 1     | |
                            | |                 | |
                            | ------------------- |
                            | |                 | |
                            | |  mode, nodemask,| |
                            | |     ref = 1     | |
                            V |                 | V
                              -------------------

With this config, my fix to numa_maps will show the same policy ranges
from either task.  And, of course, the get_policy() vm_op still gets the
correct policy based on the faulting address.

Now, if we modify the shmem mmap() file_op [mmap() vm_op for any
mmap()ed segment who's {set|get}_mempolicy() ops supports sub-range
policies and non-split vmas] to add a reference to shared policies for
each vma attached, we get the following picture:

                                   Figure 3

        task1,                   Shared Policy               task2,
      mm_struct1,                (w/ rb tree)              mm_struct2,
                              -------------------
       vma1.1---------------->|                 |<------------vma2.1
                            | |  mode, nodemask,| |
                            | |     ref = 3     | |
                            | |                 | |
                            | ------------------- |
                            | |                 | |
                            | |  mode, nodemask,| |
                            | |     ref = 3     | |
                            V |                 | V
                              -------------------

Re:  'ref = 3' -- One reference for the rbtree--the shm segment and it's
policies continue to exist independent of any vma mappings--and one for
each attached vma.  Because the vma references are protected by the
respective task/mm_struct's  mmap_sem, we won't need to add an
additional reference during lookup, nor release it when finished with
the policy.  And, we won't need to mess with any other task's mm data
structures when installing/removing shmem policies.  Of course, munmap()
of a vma will need to decrement the ref count of all policies in a
shared policy tree, but this is not a "fast path".  Unfortunately, we
don't have a unmap file operation, so I'd have to add one, or otherwise
arrange to remove the unmapping vma's ref--perhaps via a vm_op so that
we only need to call it on vmas that support it--i.e., that support
shared policy.

I could extract the parts of my shared policy series that gets us to
Figure 2 and add in the necessary mods to prototype Figure 3 if you
would be agreeable to this approach.  However, in that case, I should
produce a minimal patch to make the current reference counting correct,
if overkill.  This involves:  1) fixing do_set_mempolicy() to hold
mmap_sem for write over change, 2) fixing up reference counting for
interleaving for both normal [forgot unref] and huge [unconditional
unref should be conditional] and 3) adding ref to policy in
shm_get_policy() to match shmem_get_policy.  All 3 of these are required
to be correct w/o changing any of the rest of the current ref counting.

Then, once the vma-protected shared policy mechanism discussed above is
in mergable, we can back out all of the extra ref's on other task and
vma policies and the lookup-time ref on shared policies, along with all
of the matching unrefs.

Thoughts?

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-30 16:39         ` Lee Schermerhorn
@ 2007-10-30 18:42           ` Christoph Lameter
  2007-10-30 20:18             ` Lee Schermerhorn
  2007-11-06 18:56             ` Lee Schermerhorn
  0 siblings, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-30 18:42 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Tue, 30 Oct 2007, Lee Schermerhorn wrote:

> As part of my shared policy cleanup and enhancement series, I "fixed"
> numa_maps to display the sub-ranges of policies in a shm segment mapped
> by a single vma. As part of this fix, I also modified mempolicy.c so
> that it does not split vmas that support set_policy vm_ops, because
> handling both split vmas and non-split vmas for a single shm segment
> would have complicated the code more than I thought necessary.  This is
> still at prototype stage--altho' it works against 23-rc8-mm2.

I have not looked at that yet. Maybe you could post another patch?

> Re:  'ref = 3' -- One reference for the rbtree--the shm segment and it's
> policies continue to exist independent of any vma mappings--and one for
> each attached vma.  Because the vma references are protected by the
> respective task/mm_struct's  mmap_sem, we won't need to add an
> additional reference during lookup, nor release it when finished with
> the policy.  And, we won't need to mess with any other task's mm data
> structures when installing/removing shmem policies.  Of course, munmap()
> of a vma will need to decrement the ref count of all policies in a
> shared policy tree, but this is not a "fast path".  Unfortunately, we
> don't have a unmap file operation, so I'd have to add one, or otherwise
> arrange to remove the unmapping vma's ref--perhaps via a vm_op so that
> we only need to call it on vmas that support it--i.e., that support
> shared policy.

Yup that sounds like it is going to be a good solution.

> if overkill.  This involves:  1) fixing do_set_mempolicy() to hold
> mmap_sem for write over change, 2) fixing up reference counting for
> interleaving for both normal [forgot unref] and huge [unconditional
> unref should be conditional] and 3) adding ref to policy in
> shm_get_policy() to match shmem_get_policy.  All 3 of these are required
> to be correct w/o changing any of the rest of the current ref counting.

I know about 1. I'd have to look through 2 + 3. I would suggest to fix the 
refcounting by doing the refcounting using vmas as you explained above and 
simply remove the problems that exist there right now.

> Then, once the vma-protected shared policy mechanism discussed above is
> in mergable, we can back out all of the extra ref's on other task and
> vma policies and the lookup-time ref on shared policies, along with all
> of the matching unrefs.

Too complicated. Lets go there directly.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-30 18:42           ` Christoph Lameter
@ 2007-10-30 20:18             ` Lee Schermerhorn
  2007-11-06 18:56             ` Lee Schermerhorn
  1 sibling, 0 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-30 20:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney

On Tue, 2007-10-30 at 11:42 -0700, Christoph Lameter wrote:
> On Tue, 30 Oct 2007, Lee Schermerhorn wrote:
> 
> > As part of my shared policy cleanup and enhancement series, I "fixed"
> > numa_maps to display the sub-ranges of policies in a shm segment mapped
> > by a single vma. As part of this fix, I also modified mempolicy.c so
> > that it does not split vmas that support set_policy vm_ops, because
> > handling both split vmas and non-split vmas for a single shm segment
> > would have complicated the code more than I thought necessary.  This is
> > still at prototype stage--altho' it works against 23-rc8-mm2.
> 
> I have not looked at that yet. Maybe you could post another patch?

I will, when I get around to rebasing them.  For reference, you can look
at the last posting back in late June:

"Shared Policy Infrastructure 3/11 let vma policy ops handle sub-vma
policies":
	http://marc.info/?l=linux-mm&m=118280133031696&w=4

and 

Shared Policy Infrastructure 4/11 fix show_numa_maps()
	http://marc.info/?l=linux-mm&m=118280133220503&w=4

> 
> > Re:  'ref = 3' -- One reference for the rbtree--the shm segment and it's
> > policies continue to exist independent of any vma mappings--and one for
> > each attached vma.  Because the vma references are protected by the
> > respective task/mm_struct's  mmap_sem, we won't need to add an
> > additional reference during lookup, nor release it when finished with
> > the policy.  And, we won't need to mess with any other task's mm data
> > structures when installing/removing shmem policies.  Of course, munmap()
> > of a vma will need to decrement the ref count of all policies in a
> > shared policy tree, but this is not a "fast path".  Unfortunately, we
> > don't have a unmap file operation, so I'd have to add one, or otherwise
> > arrange to remove the unmapping vma's ref--perhaps via a vm_op so that
> > we only need to call it on vmas that support it--i.e., that support
> > shared policy.
> 
> Yup that sounds like it is going to be a good solution.
> 
> > if overkill.  This involves:  1) fixing do_set_mempolicy() to hold
> > mmap_sem for write over change, 2) fixing up reference counting for
> > interleaving for both normal [forgot unref] and huge [unconditional
> > unref should be conditional] and 3) adding ref to policy in
> > shm_get_policy() to match shmem_get_policy.  All 3 of these are required
> > to be correct w/o changing any of the rest of the current ref counting.
> 
> I know about 1. I'd have to look through 2 + 3. I would suggest to fix the 
> refcounting by doing the refcounting using vmas as you explained above and 
> simply remove the problems that exist there right now.
> 
> > Then, once the vma-protected shared policy mechanism discussed above is
> > in mergable, we can back out all of the extra ref's on other task and
> > vma policies and the lookup-time ref on shared policies, along with all
> > of the matching unrefs.
> 
> Too complicated. Lets go there directly.

OK.  It's just that I predict we'll take some time to agree on the
details and I know that #3 in the previous paragraph will bug out if
anyone tries to use SHM_HUGETLB segments with non-default policy on
2.6.23.  The unconditional unref in #2 might [will?] also result in
premature freeing of an interleave policy for huge pages that can result
in a bug out on next allocation that tries to use the freed policy.

That's why I suggested a temporary minimal fix for 24 and 23-stable.
I'll go ahead and put one together, just in case.  You can ack or nack
as you see fit.

Meanwhile, my adventures in testing shared policies applied from
different tasks have, just today, uncovered another bug we can hit in
rmap.c:vma_address().  Appears to have been there back at least since
2.6.18.  I'll send out details of that in a subsequent message.  I think
my shared policy patches [non-split vma for shmem segments] fixes that.
I'll test such a kernel before I post my findings.

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-10-30 18:42           ` Christoph Lameter
  2007-10-30 20:18             ` Lee Schermerhorn
@ 2007-11-06 18:56             ` Lee Schermerhorn
  2007-11-06 19:15               ` Christoph Lameter
  1 sibling, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-11-06 18:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 2007-10-30 at 11:42 -0700, Christoph Lameter wrote:
> On Tue, 30 Oct 2007, Lee Schermerhorn wrote:
> 
> > As part of my shared policy cleanup and enhancement series, I "fixed"
> > numa_maps to display the sub-ranges of policies in a shm segment mapped
> > by a single vma. As part of this fix, I also modified mempolicy.c so
> > that it does not split vmas that support set_policy vm_ops, because
> > handling both split vmas and non-split vmas for a single shm segment
> > would have complicated the code more than I thought necessary.  This is
> > still at prototype stage--altho' it works against 23-rc8-mm2.
> 
> I have not looked at that yet. Maybe you could post another patch?
> 
> > Re:  'ref = 3' -- One reference for the rbtree--the shm segment and it's
> > policies continue to exist independent of any vma mappings--and one for
> > each attached vma.  Because the vma references are protected by the
> > respective task/mm_struct's  mmap_sem, we won't need to add an
> > additional reference during lookup, nor release it when finished with
> > the policy.  And, we won't need to mess with any other task's mm data
> > structures when installing/removing shmem policies.  Of course, munmap()
> > of a vma will need to decrement the ref count of all policies in a
> > shared policy tree, but this is not a "fast path".  Unfortunately, we
> > don't have a unmap file operation, so I'd have to add one, or otherwise
> > arrange to remove the unmapping vma's ref--perhaps via a vm_op so that
> > we only need to call it on vmas that support it--i.e., that support
> > shared policy.
> 
> Yup that sounds like it is going to be a good solution.
> 

Christoph:

After looking at this and attempting to implement it, I find that it
won't work.  The reason is that I can't tell from just vma references
whether an mempolicy in the shared policy rbtree is actually in use.  A
task is allowed to change the policies in the rbtree at any time--a
feature that I understand you have no use for and therefore don't like,
but which is fundamental to shared policy semantics.  If I try to
install a policy that completely covers/replaces an existing policy, I
need to be able to do this, regardless of how many vmas have the shared
region attached/mapped.  So, this doesn't protect any task that is
currently examining the policy for page allocation, get_mempolicy() or
show_numa_maps() without the extra ref.  Andi had probably figured this
out back when he implemented shared policies.

I have another approach that still involves adding a ref to shared
policies at lookup time, and dropping the ref when finished with the
policy.  I know you don't like the idea of taking references in the vma
policy lookup path.  However, the 'get() is already there [for shared
policies].  I just need to add the 'free() [which Mel G would like to
see renamed at mpol_put()].  I have a patch that does the unref only for
shared policies, along with the other cleanups necessary in this area.

I hope to post soon, but I've said that before.  I'll also rerun the pft
tests with and without this change when I can.

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-11-06 18:56             ` Lee Schermerhorn
@ 2007-11-06 19:15               ` Christoph Lameter
  2007-11-06 19:35                 ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-11-06 19:15 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 6 Nov 2007, Lee Schermerhorn wrote:

> After looking at this and attempting to implement it, I find that it
> won't work.  The reason is that I can't tell from just vma references
> whether an mempolicy in the shared policy rbtree is actually in use.  A
> task is allowed to change the policies in the rbtree at any time--a
> feature that I understand you have no use for and therefore don't like,

I am not sure what my dislikes have to do with anything. This needs to 
work and be made to work in such a way that it does not negatively impact 
the rest of the system.

What do you mean by in use? If a vma can potentially use a shared policy 
in a rbtree then it is in use right?

> but which is fundamental to shared policy semantics.  If I try to
> install a policy that completely covers/replaces an existing policy, I
> need to be able to do this, regardless of how many vmas have the shared
> region attached/mapped.  So, this doesn't protect any task that is
> currently examining the policy for page allocation, get_mempolicy() or
> show_numa_maps() without the extra ref.  Andi had probably figured this
> out back when he implemented shared policies.

AFAICT: If you take a reference on the shared policy for each 
vma then you can tell from the references that the policy is in use.
 
> I have another approach that still involves adding a ref to shared
> policies at lookup time, and dropping the ref when finished with the
> policy.  I know you don't like the idea of taking references in the vma
> policy lookup path.  However, the 'get() is already there [for shared
> policies].  I just need to add the 'free() [which Mel G would like to
> see renamed at mpol_put()].  I have a patch that does the unref only for
> shared policies, along with the other cleanups necessary in this area.
> 
> I hope to post soon, but I've said that before.  I'll also rerun the pft
> tests with and without this change when I can.

I am fine with this if it only affects the shmem policies and no critical 
performance hot paths for regular anonymous and page cache allocations.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-11-06 19:15               ` Christoph Lameter
@ 2007-11-06 19:35                 ` Lee Schermerhorn
  2007-11-06 19:43                   ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-11-06 19:35 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 2007-11-06 at 11:15 -0800, Christoph Lameter wrote:
> On Tue, 6 Nov 2007, Lee Schermerhorn wrote:
> 
> > After looking at this and attempting to implement it, I find that it
> > won't work.  The reason is that I can't tell from just vma references
> > whether an mempolicy in the shared policy rbtree is actually in use.  A
> > task is allowed to change the policies in the rbtree at any time--a
> > feature that I understand you have no use for and therefore don't like,
> 
> I am not sure what my dislikes have to do with anything. This needs to 
> work and be made to work in such a way that it does not negatively impact 
> the rest of the system.

We always seem to rathole on that subject.  I just hoped to head that
off...

> 
> What do you mean by in use? If a vma can potentially use a shared policy 
> in a rbtree then it is in use right?

Not really--not for shared policies.  Again, another task is allowed to
remove or replace the shared policies at any time, regardless of the
number of task's attached to the segment.  We can't differentiate
between simple attachment and current use.  We need the lookup-time
ref/unref to know that the policy is actually in use.  We can still
replace it in the tree while it's "in use".  This will remove the tree's
reference on the policy, but the policy won't be freed until the task
holding the extra ref drops it.  

I suppose we could stick any replaced mempolicy on a list associated
with the segment and keep them there until all tasks detach from the
shared segment.  Not too much of a memory leak, as long as a task
doesn't keep changing policy on a shmem segment just to be perverse.
Don't know how I'd limit that.  On top of all the other mechanism I'd
have to add to track vma refs to shared policies [and it appears to be
quite a bit], it wouldn't be all that much additional work.  But, the
entire effort didn't seem worth it when I determined that I couldn't
safely free the policies when they're removed/replaced.  That was the
purpose of the exercise...

> 
> > but which is fundamental to shared policy semantics.  If I try to
> > install a policy that completely covers/replaces an existing policy, I
> > need to be able to do this, regardless of how many vmas have the shared
> > region attached/mapped.  So, this doesn't protect any task that is
> > currently examining the policy for page allocation, get_mempolicy() or
> > show_numa_maps() without the extra ref.  Andi had probably figured this
> > out back when he implemented shared policies.
> 
> AFAICT: If you take a reference on the shared policy for each 
> vma then you can tell from the references that the policy is in use.

See above.  A vma reference does not constitute use for a shared policy.
Again, I could replace them, I just can't free the replaced ones...
Unless I can figure out some [s]rcu mechanism.  If not, I'll have to
think about how to limit a task from causing an arbitrary number of them
to be allocated w/o being freed.

>  
> > I have another approach that still involves adding a ref to shared
> > policies at lookup time, and dropping the ref when finished with the
> > policy.  I know you don't like the idea of taking references in the vma
> > policy lookup path.  However, the 'get() is already there [for shared
> > policies].  I just need to add the 'free() [which Mel G would like to
> > see renamed at mpol_put()].  I have a patch that does the unref only for
> > shared policies, along with the other cleanups necessary in this area.
> > 
> > I hope to post soon, but I've said that before.  I'll also rerun the pft
> > tests with and without this change when I can.
> 
> I am fine with this if it only affects the shmem policies and no critical 
> performance hot paths for regular anonymous and page cache allocations.

So far, that's what it looks like.  Again, I'll continue to think about
the vma refs along with some RCU mechanism for deferred freeing of the
policies.  

Later,
Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-11-06 19:35                 ` Lee Schermerhorn
@ 2007-11-06 19:43                   ` Christoph Lameter
  2007-11-06 20:08                     ` Lee Schermerhorn
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-11-06 19:43 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 6 Nov 2007, Lee Schermerhorn wrote:

> We always seem to rathole on that subject.  I just hoped to head that
> off...

Well fix this and the rathole will be gone.,

> > What do you mean by in use? If a vma can potentially use a shared policy 
> > in a rbtree then it is in use right?
> 
> Not really--not for shared policies.  Again, another task is allowed to
> remove or replace the shared policies at any time, regardless of the
> number of task's attached to the segment.  We can't differentiate
> between simple attachment and current use.  We need the lookup-time
> ref/unref to know that the policy is actually in use.  We can still
> replace it in the tree while it's "in use".  This will remove the tree's
> reference on the policy, but the policy won't be freed until the task
> holding the extra ref drops it.  

Stil unclear as to why we need lookup time ref/unref. A task can replace 
the shared policy at any time you just need to update the refcounts. If 
you have a pointer to the policy in the vma then its possible to do so.

> I suppose we could stick any replaced mempolicy on a list associated
> with the segment and keep them there until all tasks detach from the
> shared segment.  Not too much of a memory leak, as long as a task

Well you have the refcount on the policy? Why keep the mempolicy around?

> > AFAICT: If you take a reference on the shared policy for each 
> > vma then you can tell from the references that the policy is in use.
> 
> See above.  A vma reference does not constitute use for a shared policy.

Why not? What does constitute "use" of a shared policy? A page that has 
used the policy?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-11-06 19:43                   ` Christoph Lameter
@ 2007-11-06 20:08                     ` Lee Schermerhorn
  2007-11-06 20:19                       ` Christoph Lameter
  0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-11-06 20:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 2007-11-06 at 11:43 -0800, Christoph Lameter wrote:
> On Tue, 6 Nov 2007, Lee Schermerhorn wrote:
> 
> > We always seem to rathole on that subject.  I just hoped to head that
> > off...
> 
> Well fix this and the rathole will be gone.,

I'll hold you to that! :-)

> 
> > > What do you mean by in use? If a vma can potentially use a shared policy 
> > > in a rbtree then it is in use right?
> > 
> > Not really--not for shared policies.  Again, another task is allowed to
> > remove or replace the shared policies at any time, regardless of the
> > number of task's attached to the segment.  We can't differentiate
> > between simple attachment and current use.  We need the lookup-time
> > ref/unref to know that the policy is actually in use.  We can still
> > replace it in the tree while it's "in use".  This will remove the tree's
> > reference on the policy, but the policy won't be freed until the task
> > holding the extra ref drops it.  
> 
> Stil unclear as to why we need lookup time ref/unref. A task can replace 
> the shared policy at any time you just need to update the refcounts. If 
> you have a pointer to the policy in the vma then its possible to do so.

A pointer in the vma won't work.  Different tasks could apply policies
on different ranges and shared policy semantics dictate that all tasks
see the same policy for a particular offset in the region--modulo
set/get races.  The only way we could keep a pointer in the vma would be
to split the vmas in every task that has the shared region attached
whenever any task changes the policy of a range of the region, so that
all tasks have the same set of vma's all pointing to the same set of
policies in the tree.  I don't think we can be changing other task's
address space externally like this.  And it still wouldn't work, I
think, for shared policy semantics--again, except maybe with some sort
of rcu mechanism.  More below on what constitutes actual "use".

> 
> > I suppose we could stick any replaced mempolicy on a list associated
> > with the segment and keep them there until all tasks detach from the
> > shared segment.  Not too much of a memory leak, as long as a task
> 
> Well you have the refcount on the policy? Why keep the mempolicy around?

A non-zero ref count is what keeps the policy around.  It implies that
some structure has a pointer to the policy, or some task is actively
examining the policy and will drop the reference when finsished with it.
[The latter is what's NOT happening now for shared policy.]  

> 
> > > AFAICT: If you take a reference on the shared policy for each 
> > > vma then you can tell from the references that the policy is in use.
> > 
> > See above.  A vma reference does not constitute use for a shared policy.
> 
> Why not? What does constitute "use" of a shared policy? A page that has 
> used the policy?

Currently, when you lookup the policy [based on offset] in the rbtree
under spin_lock, the lookup function does an mpol_get() before dropping
the lock.  Now, you can use the policy to allocate a page or to report
via get_mempolicy(MPOL_F_ADDR) or show_numa_maps()/mpol_to_str().   When
you're finished with the policy, you mpol_free() to release the
reference.  While you're holding this ref, another task that has the
shared region attached can replace/delete the policy, removing it from
the rbtree and dropping the rbtree's reference via mpol_free().  Now,
the only reference to the policy is any reference held by a task that
has looked it up, but not yet mpol_free()ed it.  When the last task
holding such a reference releases it, we'll free it back to the kmem
cache.

This is the type of use that I can't infer from vma counts or even vma
pointer refs.  I should be able to replace the vma pointer/ref at any
time when the shared policy changes, and mpol_free() the policy for each
such vma pointer/ref.  That leaves no ref to hold the policy should it
be in use [as discussed above].

Lee

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [NUMA] Fix memory policy refcounting
  2007-11-06 20:08                     ` Lee Schermerhorn
@ 2007-11-06 20:19                       ` Christoph Lameter
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-11-06 20:19 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andi Kleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson

On Tue, 6 Nov 2007, Lee Schermerhorn wrote:

> > Stil unclear as to why we need lookup time ref/unref. A task can replace 
> > the shared policy at any time you just need to update the refcounts. If 
> > you have a pointer to the policy in the vma then its possible to do so.
> 
> A pointer in the vma won't work.  Different tasks could apply policies
> on different ranges and shared policy semantics dictate that all tasks
> see the same policy for a particular offset in the region--modulo
> set/get races.  The only way we could keep a pointer in the vma would be
> to split the vmas in every task that has the shared region attached
> whenever any task changes the policy of a range of the region, so that
> all tasks have the same set of vma's all pointing to the same set of
> policies in the tree.  I don't think we can be changing other task's
> address space externally like this.  And it still wouldn't work, I
> think, for shared policy semantics--again, except maybe with some sort
> of rcu mechanism.  More below on what constitutes actual "use".

You can split vmas by holding a writelock on mmap_sem. But we have 
discussed that before. If different policies are in effect for ranges of a 
shared mapping (that is what you are talking about?) then vmas need to 
correspond to these ranges and contain a pointer to the shard policy.

> > > I suppose we could stick any replaced mempolicy on a list associated
> > > with the segment and keep them there until all tasks detach from the
> > > shared segment.  Not too much of a memory leak, as long as a task
> > 
> > Well you have the refcount on the policy? Why keep the mempolicy around?
> 
> A non-zero ref count is what keeps the policy around.  It implies that
> some structure has a pointer to the policy, or some task is actively
> examining the policy and will drop the reference when finsished with it.
> [The latter is what's NOT happening now for shared policy.]  

The shared policy has one refcount because it is in the rbtree. The others 
are currently temporary.

With establishment of references via vmas you have 3 types of references 
that are taken on a policy

1. Rbtree

2. vma

3. temporary (is this really needed?)

If the refcount of a shared policy becomes one then you can remove a 
policy from the rbtree and free it.

The role of the rbtree is drastically reduced if you put the pointer into 
the vma because lookups in the tree are only needed when the vma is 
established. shmem operations will be faster because of the vma policy 
pointer.

> > Why not? What does constitute "use" of a shared policy? A page that has 
> > used the policy?
> 
> Currently, when you lookup the policy [based on offset] in the rbtree
> under spin_lock, the lookup function does an mpol_get() before dropping
> the lock.  Now, you can use the policy to allocate a page or to report
> via get_mempolicy(MPOL_F_ADDR) or show_numa_maps()/mpol_to_str().   When
> you're finished with the policy, you mpol_free() to release the
> reference.  While you're holding this ref, another task that has the
> shared region attached can replace/delete the policy, removing it from
> the rbtree and dropping the rbtree's reference via mpol_free().  Now,
> the only reference to the policy is any reference held by a task that

The policy can be removed while holding a refcount?

You can do the same then with the refcounts via vma. Drop the policy from
the rbtree and then go through the vmas to drop the refs. At some point 
the refcount reaches zero and the policy vanished.

> This is the type of use that I can't infer from vma counts or even vma
> pointer refs.  I should be able to replace the vma pointer/ref at any
> time when the shared policy changes, and mpol_free() the policy for each
> such vma pointer/ref.  That leaves no ref to hold the policy should it
> be in use [as discussed above].

AFAICT you are able to do all that you need. If a policy is in use then it 
has a refcount. You need to take the mmap_sem writelock to change 
policies. At that point you are guaranteed that no fault is in progress 
thus no user of the policy exists.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2007-11-06 20:19 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-26 23:41 [NUMA] Fix memory policy refcounting Christoph Lameter
2007-10-29 15:48 ` Lee Schermerhorn
2007-10-29 20:24   ` Christoph Lameter
2007-10-29 21:34     ` Lee Schermerhorn
2007-10-29 21:43       ` Christoph Lameter
2007-10-30 16:39         ` Lee Schermerhorn
2007-10-30 18:42           ` Christoph Lameter
2007-10-30 20:18             ` Lee Schermerhorn
2007-11-06 18:56             ` Lee Schermerhorn
2007-11-06 19:15               ` Christoph Lameter
2007-11-06 19:35                 ` Lee Schermerhorn
2007-11-06 19:43                   ` Christoph Lameter
2007-11-06 20:08                     ` Lee Schermerhorn
2007-11-06 20:19                       ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).