* [NUMA] Fix memory policy refcounting
@ 2007-10-26 23:41 Christoph Lameter
2007-10-29 15:48 ` Lee Schermerhorn
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-10-26 23:41 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: David Rientjes, Paul Jackson, linux-mm
The refcounting fix that went into 2.6.23 left race conditions open because:
1. Reference counts were taken in get_vma_policy without necessarily
holding another lock that guaranteed the existence of the object
on which the reference count was taken.
2. For shared memory policies we were taking reference counts twice
but only release one reference. So the memory leak was not fixed.
3. The patch figures out in multiple places if a reference
count on the memory policy was taken or not. However, the refcount
is only taken under certain conditions and these conditions may
change. The logic to figure out when to drop the refcount is resulting
in code that easily breaks.
This patch fixes the issues by:
1. Removing the logic to determine if a refcount was taken earlier.
2. Adding a flag to all functions that can potentially take a refcount
on a memory. That flag is set if the refcount was taken. The code
using the memory policy can then just free the refcount if it was
actually taken.
3. Protect against races between set_mem_policy and numa_maps by taking
an mmap_sem writelock when a tasks memory policy is changed.
4. There were a couple of places where get_vma_policy() etc is used that
were missed in the 2.6.23 fix. Fix those too.
Note: IMHO removing the shared memory policy support would be preferable.
Shared policies can still interact in surprising ways with cpusets
and cannot be handled in a reasonable way by page migration.
The removal of shared policy support would result in the refcount issues
going away and code would be much simpler. Semantics would be consistent in
that memory policies only apply to a single process. Sharing of memory policies
would only occur in a controlled way that does not require extra refcounting
for the use of a policy.
We have already vma policy pointers that are currently unused for shmem areas
and could replicate shared policies by setting these pointers in each vma that
is pointing to a shmem area. Changing a shared policy would then require
iterating over all processes using the policy using the reverse maps. At that
point cpuset constraints etc could be considered and eventually a policy change
could even be rejected on the ground that a consistent change is not possible
given the other constraints of the shmem area.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/mempolicy.h | 7 +-
include/linux/mm.h | 2
ipc/shm.c | 4 -
mm/hugetlb.c | 6 +-
mm/mempolicy.c | 113 ++++++++++++++++++----------------------------
mm/shmem.c | 17 ++++--
6 files changed, 67 insertions(+), 82 deletions(-)
Index: linux-2.6/include/linux/mempolicy.h
===================================================================
--- linux-2.6.orig/include/linux/mempolicy.h 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/include/linux/mempolicy.h 2007-10-26 15:50:00.000000000 -0700
@@ -140,7 +140,7 @@ int mpol_set_shared_policy(struct shared
struct mempolicy *new);
void mpol_free_shared_policy(struct shared_policy *p);
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
- unsigned long idx);
+ unsigned long idx, int *ref);
extern void numa_default_policy(void);
extern void numa_policy_init(void);
@@ -151,7 +151,8 @@ extern void mpol_fix_fork_child_flag(str
extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
+ unsigned long addr, gfp_t gfp_flags,
+ struct mempolicy **mpol, int *ref);
extern unsigned slab_node(struct mempolicy *policy);
extern enum zone_type policy_zone;
@@ -239,7 +240,7 @@ static inline void mpol_fix_fork_child_f
}
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
+ unsigned long addr, gfp_t gfp_flags, int *ref)
{
return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
}
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/hugetlb.c 2007-10-26 15:50:46.000000000 -0700
@@ -75,9 +75,10 @@ static struct page *dequeue_huge_page(st
{
int nid;
struct page *page = NULL;
+ int ref = 0;
struct mempolicy *mpol;
struct zonelist *zonelist = huge_zonelist(vma, address,
- htlb_alloc_mask, &mpol);
+ htlb_alloc_mask, &mpol, &ref);
struct zone **z;
for (z = zonelist->zones; *z; z++) {
@@ -94,7 +95,8 @@ static struct page *dequeue_huge_page(st
break;
}
}
- mpol_free(mpol); /* unref if mpol !NULL */
+ if (ref)
+ mpol_free(mpol);
return page;
}
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/mempolicy.c 2007-10-26 16:23:54.000000000 -0700
@@ -531,6 +531,7 @@ static long do_get_mempolicy(int *policy
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct mempolicy *pol = current->mempolicy;
+ int ref = 0;
cpuset_update_task_memory_state();
if (flags &
@@ -553,7 +554,7 @@ static long do_get_mempolicy(int *policy
return -EFAULT;
}
if (vma->vm_ops && vma->vm_ops->get_policy)
- pol = vma->vm_ops->get_policy(vma, addr);
+ pol = vma->vm_ops->get_policy(vma, addr, &ref);
else
pol = vma->vm_policy;
} else if (addr)
@@ -587,7 +588,9 @@ static long do_get_mempolicy(int *policy
if (nmask)
get_zonemask(pol, nmask);
- out:
+out:
+ if (ref)
+ mpol_free(pol);
if (vma)
up_read(¤t->mm->mmap_sem);
return err;
@@ -917,7 +920,10 @@ asmlinkage long sys_set_mempolicy(int mo
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
- return do_set_mempolicy(mode, &nodes);
+ down_write(¤t->mm->mmap_sem);
+ err = do_set_mempolicy(mode, &nodes);
+ up_write(¤t->mm->mmap_sem);
+ return err;
}
asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
@@ -1097,35 +1103,31 @@ asmlinkage long compat_sys_mbind(compat_
/*
* get_vma_policy(@task, @vma, @addr)
- * @task - task for fallback if vma policy == default
+ * @task - task for fallback if vma policy == default
* @vma - virtual memory area whose policy is sought
* @addr - address in @vma for shared policy lookup
+ * @ref - reference was taken against policy
*
* Returns effective policy for a VMA at specified address.
* Falls back to @task or system default policy, as necessary.
- * Returned policy has extra reference count if shared, vma,
- * or some other task's policy [show_numa_maps() can pass
- * @task != current]. It is the caller's responsibility to
- * free the reference in these cases.
- */
-static struct mempolicy * get_vma_policy(struct task_struct *task,
- struct vm_area_struct *vma, unsigned long addr)
+ * It is the caller's responsibility to free the reference count
+ * on the policy if any was taken. The refcount guarantees that
+ * the memory policy does not vanish from under us.
+*/
+static struct mempolicy *get_vma_policy(struct task_struct *task,
+ struct vm_area_struct *vma, unsigned long addr, int *ref)
{
struct mempolicy *pol = task->mempolicy;
- int shared_pol = 0;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy) {
- pol = vma->vm_ops->get_policy(vma, addr);
- shared_pol = 1; /* if pol non-NULL, add ref below */
- } else if (vma->vm_policy &&
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ pol = vma->vm_ops->get_policy(vma, addr, ref);
+ else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
- else if (!shared_pol && pol != current->mempolicy)
- mpol_get(pol); /* vma or other task's policy */
return pol;
}
@@ -1247,39 +1249,23 @@ static inline unsigned interleave_nid(st
* @addr = address in @vma for shared policy lookup and interleave policy
* @gfp_flags = for requested zone
* @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ * @ref = indicates that a refcount was taken against mpol.
*
- * Returns a zonelist suitable for a huge page allocation.
- * If the effective policy is 'BIND, returns pointer to policy's zonelist.
- * If it is also a policy for which get_vma_policy() returns an extra
- * reference, we must hold that reference until after allocation.
- * In that case, return policy via @mpol so hugetlb allocation can drop
- * the reference. For non-'BIND referenced policies, we can/do drop the
- * reference here, so the caller doesn't need to know about the special case
- * for default and current task policy.
+ * Returns a zonelist suitable for a huge page allocation. The caller must
+ * release the memory policy refcount if ref != 0. The zonelist may be freed
+ * at any time after the reference count is released.
*/
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
- gfp_t gfp_flags, struct mempolicy **mpol)
+ gfp_t gfp_flags, struct mempolicy **mpol, int *ref)
{
- struct mempolicy *pol = get_vma_policy(current, vma, addr);
- struct zonelist *zl;
-
- *mpol = NULL; /* probably no unref needed */
- if (pol->policy == MPOL_INTERLEAVE) {
+ *mpol = get_vma_policy(current, vma, addr, ref);
+ if ((*mpol)->policy == MPOL_INTERLEAVE) {
unsigned nid;
- nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- __mpol_free(pol); /* finished with pol */
+ nid = interleave_nid(*mpol, vma, addr, HPAGE_SHIFT);
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
-
- zl = zonelist_policy(GFP_HIGHUSER, pol);
- if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
- if (pol->policy != MPOL_BIND)
- __mpol_free(pol); /* finished with pol */
- else
- *mpol = pol; /* unref needed after allocation */
- }
- return zl;
+ return zonelist_policy(GFP_HIGHUSER, *mpol);
}
#endif
@@ -1323,8 +1309,9 @@ static struct page *alloc_page_interleav
struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
- struct mempolicy *pol = get_vma_policy(current, vma, addr);
- struct zonelist *zl;
+ int ref = 0;
+ struct mempolicy *pol = get_vma_policy(current, vma, addr, &ref);
+ struct page *page;
cpuset_update_task_memory_state();
@@ -1332,21 +1319,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
- return alloc_page_interleave(gfp, 0, nid);
- }
- zl = zonelist_policy(gfp, pol);
- if (pol != &default_policy && pol != current->mempolicy) {
- /*
- * slow path: ref counted policy -- shared or vma
- */
- struct page *page = __alloc_pages(gfp, 0, zl);
- __mpol_free(pol);
- return page;
- }
- /*
- * fast path: default or task policy
- */
- return __alloc_pages(gfp, 0, zl);
+ page = alloc_page_interleave(gfp, 0, nid);
+ } else
+ page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ if (ref)
+ mpol_free(pol);
+ return page;
}
/**
@@ -1519,7 +1497,8 @@ static void sp_insert(struct shared_poli
/* Find shared policy intersecting idx */
struct mempolicy *
-mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+mpol_shared_policy_lookup(struct shared_policy *sp,
+ unsigned long idx, int *ref)
{
struct mempolicy *pol = NULL;
struct sp_node *sn;
@@ -1531,6 +1510,7 @@ mpol_shared_policy_lookup(struct shared_
if (sn) {
mpol_get(sn->policy);
pol = sn->policy;
+ (*ref)++;
}
spin_unlock(&sp->lock);
return pol;
@@ -1945,8 +1925,9 @@ int show_numa_map(struct seq_file *m, vo
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
- struct mempolicy *pol;
int n;
+ int ref = 0;
+ struct mempolicy *mpol;
char buffer[50];
if (!mm)
@@ -1956,13 +1937,10 @@ int show_numa_map(struct seq_file *m, vo
if (!md)
return 0;
- pol = get_vma_policy(priv->task, vma, vma->vm_start);
- mpol_to_str(buffer, sizeof(buffer), pol);
- /*
- * unref shared or other task's mempolicy
- */
- if (pol != &default_policy && pol != current->mempolicy)
- __mpol_free(pol);
+ mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref);
+ mpol_to_str(buffer, sizeof(buffer), mpol);
+ if (ref)
+ mpol_free(mpol);
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/mm/shmem.c 2007-10-26 15:50:00.000000000 -0700
@@ -1015,14 +1015,16 @@ static struct page *shmem_swapin_async(s
{
struct page *page;
struct vm_area_struct pvma;
+ int ref = 0;
/* Create a pseudo vma that just contains the policy */
memset(&pvma, 0, sizeof(struct vm_area_struct));
pvma.vm_end = PAGE_SIZE;
pvma.vm_pgoff = idx;
- pvma.vm_policy = mpol_shared_policy_lookup(p, idx);
+ pvma.vm_policy = mpol_shared_policy_lookup(p, idx, &ref);
page = read_swap_cache_async(entry, &pvma, 0);
- mpol_free(pvma.vm_policy);
+ if (ref)
+ mpol_free(pvma.vm_policy);
return page;
}
@@ -1052,13 +1054,16 @@ shmem_alloc_page(gfp_t gfp, struct shmem
{
struct vm_area_struct pvma;
struct page *page;
+ int ref = 0;
memset(&pvma, 0, sizeof(struct vm_area_struct));
- pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
+ pvma.vm_policy = mpol_shared_policy_lookup(&info->policy,
+ idx, &ref);
pvma.vm_pgoff = idx;
pvma.vm_end = PAGE_SIZE;
page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
- mpol_free(pvma.vm_policy);
+ if (ref)
+ mpol_free(pvma.vm_policy);
return page;
}
#else
@@ -1336,13 +1341,13 @@ static int shmem_set_policy(struct vm_ar
}
static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, int *ref)
{
struct inode *i = vma->vm_file->f_path.dentry->d_inode;
unsigned long idx;
idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);
+ return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx, ref);
}
#endif
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/include/linux/mm.h 2007-10-26 15:50:00.000000000 -0700
@@ -173,7 +173,7 @@ struct vm_operations_struct {
#ifdef CONFIG_NUMA
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
- unsigned long addr);
+ unsigned long addr, int *ref);
int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
const nodemask_t *to, unsigned long flags);
#endif
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c 2007-10-26 15:46:53.000000000 -0700
+++ linux-2.6/ipc/shm.c 2007-10-26 15:50:00.000000000 -0700
@@ -279,14 +279,14 @@ static int shm_set_policy(struct vm_area
}
static struct mempolicy *shm_get_policy(struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, int *ref)
{
struct file *file = vma->vm_file;
struct shm_file_data *sfd = shm_file_data(file);
struct mempolicy *pol = NULL;
if (sfd->vm_ops->get_policy)
- pol = sfd->vm_ops->get_policy(vma, addr);
+ pol = sfd->vm_ops->get_policy(vma, addr, ref);
else if (vma->vm_policy)
pol = vma->vm_policy;
else
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* Re: [NUMA] Fix memory policy refcounting 2007-10-26 23:41 [NUMA] Fix memory policy refcounting Christoph Lameter @ 2007-10-29 15:48 ` Lee Schermerhorn 2007-10-29 20:24 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-29 15:48 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Fri, 2007-10-26 at 16:41 -0700, Christoph Lameter wrote: > The refcounting fix that went into 2.6.23 left race conditions open because: Christoph: I've been testing a set of patches to address the races, based on our recent discussions [your feed back to my last attempt]. I'll send that along for comment shortly. Below, a few comments on this patch. > > 1. Reference counts were taken in get_vma_policy without necessarily > holding another lock that guaranteed the existence of the object > on which the reference count was taken. Yes, this was true for the show_numa_maps() case, as we've discussed. I agree we need to take the mmap_sem for write in do_set_mempolicy() as we do in do_mbind(). > > 2. For shared memory policies we were taking reference counts twice > but only release one reference. So the memory leak was not fixed. I don't think this was the case. You removed the code that attempted to prevent this. But, I admit I might have missed something... > > 3. The patch figures out in multiple places if a reference > count on the memory policy was taken or not. However, the refcount > is only taken under certain conditions and these conditions may > change. The logic to figure out when to drop the refcount is resulting > in code that easily breaks. Agreed, especially since with the fix for "other task's policy" and noting that vma policy is protected by mmap_sem, we only need the extra ref for shared policy and need a way to determine that we need to remove that. > > This patch fixes the issues by: > > 1. Removing the logic to determine if a refcount was taken earlier. > > 2. Adding a flag to all functions that can potentially take a refcount > on a memory. That flag is set if the refcount was taken. The code > using the memory policy can then just free the refcount if it was > actually taken. This does add some additional code in the alloc path and adds an additional arg to a lot of functions that I think we can remove by marking shared policies as such and only derefing those. > > 3. Protect against races between set_mem_policy and numa_maps by taking > an mmap_sem writelock when a tasks memory policy is changed. Agreed. > > 4. There were a couple of places where get_vma_policy() etc is used that > were missed in the 2.6.23 fix. Fix those too. > > Note: IMHO removing the shared memory policy support would be preferable. > Shared policies can still interact in surprising ways with cpusets > and cannot be handled in a reasonable way by page migration. Yeah, yeah, yeah. But I consider that to be cpusets' fault and not shared memory policy. I still have use for the latter. We need to find a way to accomodate all of our requirements, even if it means documenting that shared memory policy must be used very carefully with cpusets--or not at all with dynamically changing cpusets. I can certainly live with that. > > The removal of shared policy support would result in the refcount issues > going away and code would be much simpler. Semantics would be consistent in > that memory policies only apply to a single process. Sharing of memory policies > would only occur in a controlled way that does not require extra refcounting > for the use of a policy. Yes, and we'd loose control over placement of shared pages except by hacking our task policy and prefaulting, or requiring every program that attaches to be aware of the numa policy of the overall application. I find this as objectionable as you find shared policies. > > We have already vma policy pointers that are currently unused for shmem areas > and could replicate shared policies by setting these pointers in each vma that > is pointing to a shmem area. Doesn't work for me. :( > Changing a shared policy would then require > iterating over all processes using the policy using the reverse maps. At that > point cpuset constraints etc could be considered and eventually a policy change > could even be rejected on the ground that a consistent change is not possible > given the other constraints of the shmem area. Policy remapping isn't already complex enough for you, huh? :-) > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > > --- > include/linux/mempolicy.h | 7 +- > include/linux/mm.h | 2 > ipc/shm.c | 4 - > mm/hugetlb.c | 6 +- > mm/mempolicy.c | 113 ++++++++++++++++++---------------------------- > mm/shmem.c | 17 ++++-- > 6 files changed, 67 insertions(+), 82 deletions(-) > > Index: linux-2.6/include/linux/mempolicy.h > =================================================================== > --- linux-2.6.orig/include/linux/mempolicy.h 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/include/linux/mempolicy.h 2007-10-26 15:50:00.000000000 -0700 > @@ -140,7 +140,7 @@ int mpol_set_shared_policy(struct shared > struct mempolicy *new); > void mpol_free_shared_policy(struct shared_policy *p); > struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, > - unsigned long idx); > + unsigned long idx, int *ref); > > extern void numa_default_policy(void); > extern void numa_policy_init(void); > @@ -151,7 +151,8 @@ extern void mpol_fix_fork_child_flag(str > > extern struct mempolicy default_policy; > extern struct zonelist *huge_zonelist(struct vm_area_struct *vma, > - unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol); > + unsigned long addr, gfp_t gfp_flags, > + struct mempolicy **mpol, int *ref); > extern unsigned slab_node(struct mempolicy *policy); > > extern enum zone_type policy_zone; > @@ -239,7 +240,7 @@ static inline void mpol_fix_fork_child_f > } > > static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma, > - unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol) > + unsigned long addr, gfp_t gfp_flags, int *ref) > { > return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags); > } > Index: linux-2.6/mm/hugetlb.c > =================================================================== > --- linux-2.6.orig/mm/hugetlb.c 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/mm/hugetlb.c 2007-10-26 15:50:46.000000000 -0700 > @@ -75,9 +75,10 @@ static struct page *dequeue_huge_page(st > { > int nid; > struct page *page = NULL; > + int ref = 0; > struct mempolicy *mpol; > struct zonelist *zonelist = huge_zonelist(vma, address, > - htlb_alloc_mask, &mpol); > + htlb_alloc_mask, &mpol, &ref); > struct zone **z; > > for (z = zonelist->zones; *z; z++) { > @@ -94,7 +95,8 @@ static struct page *dequeue_huge_page(st > break; > } > } > - mpol_free(mpol); /* unref if mpol !NULL */ > + if (ref) > + mpol_free(mpol); This shouldn't be necessary if huge_zonelist only returns a non-NULL mpol if the unref is required, as I had done. mpol_free() on a NULL mpol is a no-op, as the comment was intended to convey. You could drop the extra ref argument to huge_zonelist--not that this should be much of a fast path. > return page; > } > > Index: linux-2.6/mm/mempolicy.c > =================================================================== > --- linux-2.6.orig/mm/mempolicy.c 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/mm/mempolicy.c 2007-10-26 16:23:54.000000000 -0700 > @@ -531,6 +531,7 @@ static long do_get_mempolicy(int *policy > struct mm_struct *mm = current->mm; > struct vm_area_struct *vma = NULL; > struct mempolicy *pol = current->mempolicy; > + int ref = 0; > > cpuset_update_task_memory_state(); > if (flags & > @@ -553,7 +554,7 @@ static long do_get_mempolicy(int *policy > return -EFAULT; > } > if (vma->vm_ops && vma->vm_ops->get_policy) > - pol = vma->vm_ops->get_policy(vma, addr); > + pol = vma->vm_ops->get_policy(vma, addr, &ref); > else > pol = vma->vm_policy; > } else if (addr) > @@ -587,7 +588,9 @@ static long do_get_mempolicy(int *policy > if (nmask) > get_zonemask(pol, nmask); > > - out: > +out: > + if (ref) > + mpol_free(pol); > if (vma) > up_read(¤t->mm->mmap_sem); > return err; > @@ -917,7 +920,10 @@ asmlinkage long sys_set_mempolicy(int mo > err = get_nodes(&nodes, nmask, maxnode); > if (err) > return err; > - return do_set_mempolicy(mode, &nodes); > + down_write(¤t->mm->mmap_sem); > + err = do_set_mempolicy(mode, &nodes); > + up_write(¤t->mm->mmap_sem); > + return err; > } > > asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode, > @@ -1097,35 +1103,31 @@ asmlinkage long compat_sys_mbind(compat_ > > /* > * get_vma_policy(@task, @vma, @addr) > - * @task - task for fallback if vma policy == default > + * @task - task for fallback if vma policy == default > * @vma - virtual memory area whose policy is sought > * @addr - address in @vma for shared policy lookup > + * @ref - reference was taken against policy > * > * Returns effective policy for a VMA at specified address. > * Falls back to @task or system default policy, as necessary. > - * Returned policy has extra reference count if shared, vma, > - * or some other task's policy [show_numa_maps() can pass > - * @task != current]. It is the caller's responsibility to > - * free the reference in these cases. > - */ > -static struct mempolicy * get_vma_policy(struct task_struct *task, > - struct vm_area_struct *vma, unsigned long addr) > + * It is the caller's responsibility to free the reference count > + * on the policy if any was taken. The refcount guarantees that > + * the memory policy does not vanish from under us. > +*/ > +static struct mempolicy *get_vma_policy(struct task_struct *task, > + struct vm_area_struct *vma, unsigned long addr, int *ref) > { > struct mempolicy *pol = task->mempolicy; > - int shared_pol = 0; > > if (vma) { > - if (vma->vm_ops && vma->vm_ops->get_policy) { > - pol = vma->vm_ops->get_policy(vma, addr); > - shared_pol = 1; /* if pol non-NULL, add ref below */ Mea culpa. Comment is wrong here. It AVOIDS adding a ref to shared pols. > - } else if (vma->vm_policy && > + if (vma->vm_ops && vma->vm_ops->get_policy) > + pol = vma->vm_ops->get_policy(vma, addr, ref); > + else if (vma->vm_policy && > vma->vm_policy->policy != MPOL_DEFAULT) > pol = vma->vm_policy; > } > if (!pol) > pol = &default_policy; > - else if (!shared_pol && pol != current->mempolicy) ^^^^^^^^^^^ won't do the get if shared_pol is true > - mpol_get(pol); /* vma or other task's policy */ > return pol; > } > > @@ -1247,39 +1249,23 @@ static inline unsigned interleave_nid(st > * @addr = address in @vma for shared policy lookup and interleave policy > * @gfp_flags = for requested zone > * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy > + * @ref = indicates that a refcount was taken against mpol. > * > - * Returns a zonelist suitable for a huge page allocation. > - * If the effective policy is 'BIND, returns pointer to policy's zonelist. > - * If it is also a policy for which get_vma_policy() returns an extra > - * reference, we must hold that reference until after allocation. > - * In that case, return policy via @mpol so hugetlb allocation can drop > - * the reference. For non-'BIND referenced policies, we can/do drop the > - * reference here, so the caller doesn't need to know about the special case > - * for default and current task policy. > + * Returns a zonelist suitable for a huge page allocation. The caller must > + * release the memory policy refcount if ref != 0. The zonelist may be freed > + * at any time after the reference count is released. > */ > struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr, > - gfp_t gfp_flags, struct mempolicy **mpol) > + gfp_t gfp_flags, struct mempolicy **mpol, int *ref) > { > - struct mempolicy *pol = get_vma_policy(current, vma, addr); > - struct zonelist *zl; > - > - *mpol = NULL; /* probably no unref needed */ > - if (pol->policy == MPOL_INTERLEAVE) { > + *mpol = get_vma_policy(current, vma, addr, ref); > + if ((*mpol)->policy == MPOL_INTERLEAVE) { > unsigned nid; > > - nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT); > - __mpol_free(pol); /* finished with pol */ Again, his was an error on my part. Should have unconditionally free'd. Most recent series fixes this. > + nid = interleave_nid(*mpol, vma, addr, HPAGE_SHIFT); > return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags); > } > - > - zl = zonelist_policy(GFP_HIGHUSER, pol); > - if (unlikely(pol != &default_policy && pol != current->mempolicy)) { > - if (pol->policy != MPOL_BIND) > - __mpol_free(pol); /* finished with pol */ > - else > - *mpol = pol; /* unref needed after allocation */ > - } > - return zl; > + return zonelist_policy(GFP_HIGHUSER, *mpol); > } > #endif > > @@ -1323,8 +1309,9 @@ static struct page *alloc_page_interleav > struct page * > alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr) > { > - struct mempolicy *pol = get_vma_policy(current, vma, addr); > - struct zonelist *zl; > + int ref = 0; > + struct mempolicy *pol = get_vma_policy(current, vma, addr, &ref); > + struct page *page; > > cpuset_update_task_memory_state(); > > @@ -1332,21 +1319,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area > unsigned nid; > > nid = interleave_nid(pol, vma, addr, PAGE_SHIFT); I also missed an unref here. Fixed in soon to come series. > - return alloc_page_interleave(gfp, 0, nid); > - } > - zl = zonelist_policy(gfp, pol); > - if (pol != &default_policy && pol != current->mempolicy) { With my latest attempt, this is only necessary for shared policies. > - /* > - * slow path: ref counted policy -- shared or vma > - */ > - struct page *page = __alloc_pages(gfp, 0, zl); > - __mpol_free(pol); > - return page; > - } > - /* > - * fast path: default or task policy > - */ > - return __alloc_pages(gfp, 0, zl); > + page = alloc_page_interleave(gfp, 0, nid); > + } else > + page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol)); > + if (ref) > + mpol_free(pol); > + return page; Andi wanted to keep the fast path as a tail-call here. I was trying to preserve that. > } > > /** > @@ -1519,7 +1497,8 @@ static void sp_insert(struct shared_poli > > /* Find shared policy intersecting idx */ > struct mempolicy * > -mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx) > +mpol_shared_policy_lookup(struct shared_policy *sp, > + unsigned long idx, int *ref) > { > struct mempolicy *pol = NULL; > struct sp_node *sn; > @@ -1531,6 +1510,7 @@ mpol_shared_policy_lookup(struct shared_ > if (sn) { > mpol_get(sn->policy); > pol = sn->policy; > + (*ref)++; > } > spin_unlock(&sp->lock); > return pol; > @@ -1945,8 +1925,9 @@ int show_numa_map(struct seq_file *m, vo > struct numa_maps *md; > struct file *file = vma->vm_file; > struct mm_struct *mm = vma->vm_mm; > - struct mempolicy *pol; > int n; > + int ref = 0; > + struct mempolicy *mpol; > char buffer[50]; > > if (!mm) > @@ -1956,13 +1937,10 @@ int show_numa_map(struct seq_file *m, vo > if (!md) > return 0; > > - pol = get_vma_policy(priv->task, vma, vma->vm_start); > - mpol_to_str(buffer, sizeof(buffer), pol); > - /* > - * unref shared or other task's mempolicy > - */ > - if (pol != &default_policy && pol != current->mempolicy) > - __mpol_free(pol); > + mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref); > + mpol_to_str(buffer, sizeof(buffer), mpol); > + if (ref) > + mpol_free(mpol); If we really want to add the ref argument to get_vma_policy(), we could avoid bringing it down this deeply by requiring that all get_policy() vm_ops add the extra ref [these are only used for shared memory policy now] and set ref !0 when get_policy() returns a non-null policy. This would be an alternative to marking shared policies as such. > > seq_printf(m, "%08lx %s", vma->vm_start, buffer); > > Index: linux-2.6/mm/shmem.c > =================================================================== > --- linux-2.6.orig/mm/shmem.c 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/mm/shmem.c 2007-10-26 15:50:00.000000000 -0700 > @@ -1015,14 +1015,16 @@ static struct page *shmem_swapin_async(s > { > struct page *page; > struct vm_area_struct pvma; > + int ref = 0; > > /* Create a pseudo vma that just contains the policy */ > memset(&pvma, 0, sizeof(struct vm_area_struct)); > pvma.vm_end = PAGE_SIZE; > pvma.vm_pgoff = idx; > - pvma.vm_policy = mpol_shared_policy_lookup(p, idx); > + pvma.vm_policy = mpol_shared_policy_lookup(p, idx, &ref); > page = read_swap_cache_async(entry, &pvma, 0); > - mpol_free(pvma.vm_policy); By the way, my shared policy series, that you love so much ;), obviates all of this pseudo-vma stuff. This function becomes practically a one-liner. I've been holding off on that while all the other mempolicy stuff [Mel's work, ref counting, ...] settles down. > + if (ref) > + mpol_free(pvma.vm_policy); > return page; > } > > @@ -1052,13 +1054,16 @@ shmem_alloc_page(gfp_t gfp, struct shmem > { > struct vm_area_struct pvma; > struct page *page; > + int ref = 0; > > memset(&pvma, 0, sizeof(struct vm_area_struct)); > - pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx); > + pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, > + idx, &ref); > pvma.vm_pgoff = idx; > pvma.vm_end = PAGE_SIZE; > page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0); Again, we can dispense with pseudo-vma's here, in time. > - mpol_free(pvma.vm_policy); > + if (ref) > + mpol_free(pvma.vm_policy); > return page; > } > #else > @@ -1336,13 +1341,13 @@ static int shmem_set_policy(struct vm_ar > } > > static struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, > - unsigned long addr) > + unsigned long addr, int *ref) > { > struct inode *i = vma->vm_file->f_path.dentry->d_inode; > unsigned long idx; > > idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; > - return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx); > + return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx, ref); > } > #endif > > Index: linux-2.6/include/linux/mm.h > =================================================================== > --- linux-2.6.orig/include/linux/mm.h 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/include/linux/mm.h 2007-10-26 15:50:00.000000000 -0700 > @@ -173,7 +173,7 @@ struct vm_operations_struct { > #ifdef CONFIG_NUMA > int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new); > struct mempolicy *(*get_policy)(struct vm_area_struct *vma, > - unsigned long addr); > + unsigned long addr, int *ref); > int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, > const nodemask_t *to, unsigned long flags); > #endif > Index: linux-2.6/ipc/shm.c > =================================================================== > --- linux-2.6.orig/ipc/shm.c 2007-10-26 15:46:53.000000000 -0700 > +++ linux-2.6/ipc/shm.c 2007-10-26 15:50:00.000000000 -0700 > @@ -279,14 +279,14 @@ static int shm_set_policy(struct vm_area > } > > static struct mempolicy *shm_get_policy(struct vm_area_struct *vma, > - unsigned long addr) > + unsigned long addr, int *ref) > { > struct file *file = vma->vm_file; > struct shm_file_data *sfd = shm_file_data(file); > struct mempolicy *pol = NULL; > > if (sfd->vm_ops->get_policy) > - pol = sfd->vm_ops->get_policy(vma, addr); > + pol = sfd->vm_ops->get_policy(vma, addr, ref); > else if (vma->vm_policy) > pol = vma->vm_policy; > else -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-29 15:48 ` Lee Schermerhorn @ 2007-10-29 20:24 ` Christoph Lameter 2007-10-29 21:34 ` Lee Schermerhorn 0 siblings, 1 reply; 14+ messages in thread From: Christoph Lameter @ 2007-10-29 20:24 UTC (permalink / raw) To: Lee Schermerhorn Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Mon, 29 Oct 2007, Lee Schermerhorn wrote: > > 1. Reference counts were taken in get_vma_policy without necessarily > > holding another lock that guaranteed the existence of the object > > on which the reference count was taken. > > Yes, this was true for the show_numa_maps() case, as we've discussed. I > agree we need to take the mmap_sem for write in do_set_mempolicy() as we > do in do_mbind(). Yeah it seems that we were safe for shared policies since we took the refcount twice? > > 2. Adding a flag to all functions that can potentially take a refcount > > on a memory. That flag is set if the refcount was taken. The code > > using the memory policy can then just free the refcount if it was > > actually taken. > > This does add some additional code in the alloc path and adds an > additional arg to a lot of functions that I think we can remove by > marking shared policies as such and only derefing those. The get_policy function and friends may not only return shared policies but also task etc policies. > Yeah, yeah, yeah. But I consider that to be cpusets' fault and not > shared memory policy. I still have use for the latter. We need to find > a way to accomodate all of our requirements, even if it means > documenting that shared memory policy must be used very carefully with > cpusets--or not at all with dynamically changing cpusets. I can > certainly live with that. There is no reason that this issue should exist. We can have your shared policies with proper enforcement that no bad things happen if we get rid of get_policy etc and instead use the vma policy pointer to point to the shared policy. Take a refcount for each vma as it is setup to point to a shared policy and you will not have to take the refcount in the hot paths. > > The removal of shared policy support would result in the refcount issues > > going away and code would be much simpler. Semantics would be consistent in > > that memory policies only apply to a single process. Sharing of memory policies > > would only occur in a controlled way that does not require extra refcounting > > for the use of a policy. > > Yes, and we'd loose control over placement of shared pages except by > hacking our task policy and prefaulting, or requiring every program that > attaches to be aware of the numa policy of the overall application. I > find this as objectionable as you find shared policies. That is not true. > > We have already vma policy pointers that are currently unused for shmem areas > > and could replicate shared policies by setting these pointers in each vma that > > is pointing to a shmem area. > > Doesn't work for me. :( Why not? Point them to the shared policy and add a refcount and things would be much easier. > > Changing a shared policy would then require > > iterating over all processes using the policy using the reverse maps. At that > > point cpuset constraints etc could be considered and eventually a policy change > > could even be rejected on the ground that a consistent change is not possible > > given the other constraints of the shmem area. > > Policy remapping isn't already complex enough for you, huh? :-) Policy remapping is one thing since we can correct the policy. Shared policies cannot be correct if applied to multiple cpuset context. The looping over reverse maps is a pretty standard way of doing things which would allow us to detect bad policies at the time they are created and not later. > > } > > } > > - mpol_free(mpol); /* unref if mpol !NULL */ > > + if (ref) > > + mpol_free(mpol); > This shouldn't be necessary if huge_zonelist only returns a non-NULL > mpol if the unref is required, as I had done. mpol_free() on a NULL > mpol is a no-op, as the comment was intended to convey. You could drop > the extra ref argument to huge_zonelist--not that this should be much of > a fast path. True. > > - mpol_to_str(buffer, sizeof(buffer), pol); > > - /* > > - * unref shared or other task's mempolicy > > - */ > > - if (pol != &default_policy && pol != current->mempolicy) > > - __mpol_free(pol); > > + mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref); > > + mpol_to_str(buffer, sizeof(buffer), mpol); > > + if (ref) > > + mpol_free(mpol); > If we really want to add the ref argument to get_vma_policy(), we could > avoid bringing it down this deeply by requiring that all get_policy() > vm_ops add the extra ref [these are only used for shared memory policy > now] and set ref !0 when get_policy() returns a non-null policy. This > would be an alternative to marking shared policies as such. Please be aware that refcounting in the hot paths is a bad thing to do. We are potentially bouncing a cacheline. I'd rather get rid of that completely. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-29 20:24 ` Christoph Lameter @ 2007-10-29 21:34 ` Lee Schermerhorn 2007-10-29 21:43 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-29 21:34 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Mon, 2007-10-29 at 13:24 -0700, Christoph Lameter wrote: > On Mon, 29 Oct 2007, Lee Schermerhorn wrote: > > > > 1. Reference counts were taken in get_vma_policy without necessarily > > > holding another lock that guaranteed the existence of the object > > > on which the reference count was taken. > > > > Yes, this was true for the show_numa_maps() case, as we've discussed. I > > agree we need to take the mmap_sem for write in do_set_mempolicy() as we > > do in do_mbind(). > > Yeah it seems that we were safe for shared policies since we took the > refcount twice? I still don't see where we took the ref twice on shared policies. My comment was wrong. It should have said "avoid extra ref" which is what the code [in get_vma_policy()] does now, contradicting the comment. > > > > 2. Adding a flag to all functions that can potentially take a refcount > > > on a memory. That flag is set if the refcount was taken. The code > > > using the memory policy can then just free the refcount if it was > > > actually taken. > > > > This does add some additional code in the alloc path and adds an > > additional arg to a lot of functions that I think we can remove by > > marking shared policies as such and only derefing those. > > The get_policy function and friends may not only return shared policies > but also task etc policies. Right, but they don't [don't need to] add extra ref's like the shared policies do. So, if we can id a shared policy, we unref only those. I'll send out my RFC probably tomorrow am. > > > Yeah, yeah, yeah. But I consider that to be cpusets' fault and not > > shared memory policy. I still have use for the latter. We need to find > > a way to accomodate all of our requirements, even if it means > > documenting that shared memory policy must be used very carefully with > > cpusets--or not at all with dynamically changing cpusets. I can > > certainly live with that. > > There is no reason that this issue should exist. We can have your shared > policies with proper enforcement that no bad things happen if we get rid > of get_policy etc and instead use the vma policy pointer to point to the > shared policy. Take a refcount for each vma as it is setup to point to a > shared policy and you will not have to take the refcount in the hot paths. We support different policies on different ranges of a shared memory segment. In the task which installs this policy, we split the vmas, but any other tasks which already have the segment attached or which subsequently attach don't split the vmas along policy bondaries. This also makes numa_maps lie when we have multiple subrange policies. My run of your page fault test with and without the ref count patches showed no add'l overhead for reference counting. Of course, I can't go up the number of nodes you can. You could try it on one of your platform with a high node count to see. > > > > The removal of shared policy support would result in the refcount issues > > > going away and code would be much simpler. Semantics would be consistent in > > > that memory policies only apply to a single process. Sharing of memory policies > > > would only occur in a controlled way that does not require extra refcounting > > > for the use of a policy. > > > > Yes, and we'd loose control over placement of shared pages except by > > hacking our task policy and prefaulting, or requiring every program that > > attaches to be aware of the numa policy of the overall application. I > > find this as objectionable as you find shared policies. > > That is not true. ??? > > > > We have already vma policy pointers that are currently unused for shmem areas > > > and could replicate shared policies by setting these pointers in each vma that > > > is pointing to a shmem area. > > > > Doesn't work for me. :( > > Why not? Point them to the shared policy and add a refcount and things > would be much easier. I'd really like to avoid having to map multiple vmas when I attach a shm seg with different policies on different ranges, and fix up all existing tasks attached to a shmem when installing policies on ranges of the segment. Having the share policy lookup add a ref while holding the shared policy spinlock handles this fine now. > > > > Changing a shared policy would then require > > > iterating over all processes using the policy using the reverse maps. At that > > > point cpuset constraints etc could be considered and eventually a policy change > > > could even be rejected on the ground that a consistent change is not possible > > > given the other constraints of the shmem area. > > > > Policy remapping isn't already complex enough for you, huh? :-) > > Policy remapping is one thing since we can correct the policy. Shared > policies cannot be correct if applied to multiple cpuset context. I agree. Especially if the attaching task's aren't cooperating, and when the cpuset is reconfigured or tasks moved. I don't need it to work in these cases. > The > looping over reverse maps is a pretty standard way of doing things which > would allow us to detect bad policies at the time they are created and not > later. I guess we could also do the rmap walk at attach time and reject the attach if the shmem policy is not valid in the attaching task's cpuset? And when we try to move the task to a new cpuset, reject the move if any policy would be invalid in the new cpuset? Maybe better to just document the constraints and the behavior that one might expect if they violate the constraints. For example, a default/local policy would be valid in all cpusets. My proposal for "cpuset-independent interleave policy" [which is semantically the same as David's "interleave over all allowed", altho' we propose different syscall arg values to specify it, and David's method requires remapping the interleave nodemask, I think, where mine doesn't--details] would be valid in any cpuset. In these cases, page placement might be different than for a set of tasks in a single, static cpuset attached to shmem segments, which might be "surprising" behavior for some tasks, but we could document that. Any policy that explicitly specifies nodes that are not valid in all cpusets containing tasks sharing a shmem segment would be problematic and should be avoided. A possible solution to this is to do the cpuset-relative to physical nodeid translation at allocation time, but I don't think you want that in the allocation path! > > > > } > > > } > > > - mpol_free(mpol); /* unref if mpol !NULL */ > > > + if (ref) > > > + mpol_free(mpol); > > This shouldn't be necessary if huge_zonelist only returns a non-NULL > > mpol if the unref is required, as I had done. mpol_free() on a NULL > > mpol is a no-op, as the comment was intended to convey. You could drop > > the extra ref argument to huge_zonelist--not that this should be much of > > a fast path. > > True. > > > > - mpol_to_str(buffer, sizeof(buffer), pol); > > > - /* > > > - * unref shared or other task's mempolicy > > > - */ > > > - if (pol != &default_policy && pol != current->mempolicy) > > > - __mpol_free(pol); > > > + mpol = get_vma_policy(priv->task, vma, vma->vm_start, &ref); > > > + mpol_to_str(buffer, sizeof(buffer), mpol); > > > + if (ref) > > > + mpol_free(mpol); > > If we really want to add the ref argument to get_vma_policy(), we could > > avoid bringing it down this deeply by requiring that all get_policy() > > vm_ops add the extra ref [these are only used for shared memory policy > > now] and set ref !0 when get_policy() returns a non-null policy. This > > would be an alternative to marking shared policies as such. > > Please be aware that refcounting in the hot paths is a bad thing to do. > We are potentially bouncing a cacheline. I'd rather get rid of that > completely. I understand the goal. Sometimes one must ref count for adequate protection to get the semantics/features one wants--that I want, anyway. Again, my tests showed no measureable overhead. [Yes, lots of little "no measureable overhead" changes can add up...]. Anyway, let's keep chipping away. I'll send out my series for RFC and then do some more measurements. Later, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-29 21:34 ` Lee Schermerhorn @ 2007-10-29 21:43 ` Christoph Lameter 2007-10-30 16:39 ` Lee Schermerhorn 0 siblings, 1 reply; 14+ messages in thread From: Christoph Lameter @ 2007-10-29 21:43 UTC (permalink / raw) To: Lee Schermerhorn Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Mon, 29 Oct 2007, Lee Schermerhorn wrote: > > > Yeah, yeah, yeah. But I consider that to be cpusets' fault and not > > > shared memory policy. I still have use for the latter. We need to find > > > a way to accomodate all of our requirements, even if it means > > > documenting that shared memory policy must be used very carefully with > > > cpusets--or not at all with dynamically changing cpusets. I can > > > certainly live with that. > > > > There is no reason that this issue should exist. We can have your shared > > policies with proper enforcement that no bad things happen if we get rid > > of get_policy etc and instead use the vma policy pointer to point to the > > shared policy. Take a refcount for each vma as it is setup to point to a > > shared policy and you will not have to take the refcount in the hot paths. > > We support different policies on different ranges of a shared memory > segment. In the task which installs this policy, we split the vmas, but > any other tasks which already have the segment attached or which > subsequently attach don't split the vmas along policy bondaries. This > also makes numa_maps lie when we have multiple subrange policies. Which would also be fixed if we would split the vmas properly. > > > > We have already vma policy pointers that are currently unused for shmem areas > > > > and could replicate shared policies by setting these pointers in each vma that > > > > is pointing to a shmem area. > > > > > > Doesn't work for me. :( > > > > Why not? Point them to the shared policy and add a refcount and things > > would be much easier. > > I'd really like to avoid having to map multiple vmas when I attach a shm > seg with different policies on different ranges, and fix up all > existing tasks attached to a shmem when installing policies on ranges of > the segment. Having the share policy lookup add a ref while holding > the shared policy spinlock handles this fine now. Why? Attaching a policy is a rare thing right? > > The > > looping over reverse maps is a pretty standard way of doing things which > > would allow us to detect bad policies at the time they are created and not > > later. > > I guess we could also do the rmap walk at attach time and reject the > attach if the shmem policy is not valid in the attaching task's cpuset? > And when we try to move the task to a new cpuset, reject the move if any > policy would be invalid in the new cpuset? Yes reject or do some corrective action. At least be aware of the situation. > Any policy that explicitly specifies nodes that are not valid in all > cpusets containing tasks sharing a shmem segment would be problematic > and should be avoided. A possible solution to this is to do the > cpuset-relative to physical nodeid translation at allocation time, but I > don't think you want that in the allocation path! The problem is that it can create subtle issues right now. If we would handle it through vma policy pointers then we would be able to detect many things and the refcount issues would go away. > > Please be aware that refcounting in the hot paths is a bad thing to do. > > We are potentially bouncing a cacheline. I'd rather get rid of that > > completely. > > I understand the goal. Sometimes one must ref count for adequate > protection to get the semantics/features one wants--that I want, anyway. > Again, my tests showed no measureable overhead. [Yes, lots of little > "no measureable overhead" changes can add up...]. The refcounting in hot paths causes a lot of issues here not only scaling. Removing them would be the smart thing to do. The logic to update vmas, loop over rmaps etc etc all exists. The code would likely be minimal. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-29 21:43 ` Christoph Lameter @ 2007-10-30 16:39 ` Lee Schermerhorn 2007-10-30 18:42 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-30 16:39 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Mon, 2007-10-29 at 14:43 -0700, Christoph Lameter wrote: > On Mon, 29 Oct 2007, Lee Schermerhorn wrote: > > > > > Yeah, yeah, yeah. But I consider that to be cpusets' fault and not > > > > shared memory policy. I still have use for the latter. We need to find > > > > a way to accomodate all of our requirements, even if it means > > > > documenting that shared memory policy must be used very carefully with > > > > cpusets--or not at all with dynamically changing cpusets. I can > > > > certainly live with that. > > > > > > There is no reason that this issue should exist. We can have your shared > > > policies with proper enforcement that no bad things happen if we get rid > > > of get_policy etc and instead use the vma policy pointer to point to the > > > shared policy. Take a refcount for each vma as it is setup to point to a > > > shared policy and you will not have to take the refcount in the hot paths. > > > > We support different policies on different ranges of a shared memory > > segment. In the task which installs this policy, we split the vmas, but > > any other tasks which already have the segment attached or which > > subsequently attach don't split the vmas along policy bondaries. This > > also makes numa_maps lie when we have multiple subrange policies. > > Which would also be fixed if we would split the vmas properly. The problem I see with splitting vmas for shared policy is that, to be correct, when you apply a sub-range policy to a shm segment that already has tasks attached, you'd have to split those task's vmas as well--either from outside the tasks, or somehow notify them to do it themselves. In general, I really want to avoid requiring every process in a multi-task application to install policies on shared objects uniformly to get correct behavior. However, something you said yesterday [about vma pointers to shared policies] got me thinking last evening of another approach. Here's an idea. First, the situation we have today: task1 creates [shmget()] and attaches [shmat()] a shm segment. W/o SHM_HUGETLB flag, we get a tmpfs mapping with shmem vm_ops. These vm_ops support shared mempolicy that maintains ranges of mempolicy in an rbtree. After task1 installs [mbind()] two mempolicies on subset ranges of the shm segment, we get the vma connections shown below on the left [vma1.[12]] in Figure 1 [please forgive the lame ascii art]. The reference count of 1 on the mempolicies represents the reference held by the shared policy rbtree itself. The horizontal "arrows" do NOT represent actual pointers. Rather they represent the association between the vm_start of the vma and the offset of the start of the policy range. The vertical "arrows" represent the length of the range of virtual addresses mapped by each vma. The original vma was split in two when the mempolicies were installed. Now task2 attaches [shmat] the segment, without installing any policy. [NUMA layout and mempolicy installation is the responsibility of task1 in this mythical multi-task application.] Because task2 attaches [do_mmap*() internally] the entire segment--unlike mmap(), shmat() has no provision to attach a subset of the segment--it gets a single vma mapping the entire segment. We get the vma connection shown on the right in Figure 1. [We'd get this same configuration if task2 were already attached when task1 installs the policies.] Again, the vertical arrow represents the range of virtual addresses mapped by the single vma. The attach does not increment the reference count. Figure 1 task1, Shared Policy task2, mm_struct1, (w/ rb tree) mm_struct2, ------------------- vma1.1---------------->| |<------------vma2.1 | | mode, nodemask,| | | | ref = 1 | | V | | | ------------------- | vma1.2---------------->| | | | | mode, nodemask,| | | | ref = 1 | | V | | V ------------------- Note that if we cat /proc/<pid1>/numa_maps to display task1's numa maps, we'll see both policies in the the rbtree. If we display task2's numa maps, we'll see only the policy at the front of the segment. However, we'll count the page stats over the entire range and report these. I can show you an example of this using memtoy, if you'd like, but it's somewhat orthogonal to the reference counting issue. Still, I can imagine that it could confuse customers and result in unnecessary service calls... Next, As part of my shared policy cleanup and enhancement series, I "fixed" numa_maps to display the sub-ranges of policies in a shm segment mapped by a single vma. As part of this fix, I also modified mempolicy.c so that it does not split vmas that support set_policy vm_ops, because handling both split vmas and non-split vmas for a single shm segment would have complicated the code more than I thought necessary. This is still at prototype stage--altho' it works against 23-rc8-mm2. With the these changes, the vma connections and ref counts, look like this: Figure 2 task1, Shared Policy task2, mm_struct1, (w/ rb tree) mm_struct2, ------------------- vma1.1---------------->| |<------------vma2.1 | | mode, nodemask,| | | | ref = 1 | | | | | | | ------------------- | | | | | | | mode, nodemask,| | | | ref = 1 | | V | | V ------------------- With this config, my fix to numa_maps will show the same policy ranges from either task. And, of course, the get_policy() vm_op still gets the correct policy based on the faulting address. Now, if we modify the shmem mmap() file_op [mmap() vm_op for any mmap()ed segment who's {set|get}_mempolicy() ops supports sub-range policies and non-split vmas] to add a reference to shared policies for each vma attached, we get the following picture: Figure 3 task1, Shared Policy task2, mm_struct1, (w/ rb tree) mm_struct2, ------------------- vma1.1---------------->| |<------------vma2.1 | | mode, nodemask,| | | | ref = 3 | | | | | | | ------------------- | | | | | | | mode, nodemask,| | | | ref = 3 | | V | | V ------------------- Re: 'ref = 3' -- One reference for the rbtree--the shm segment and it's policies continue to exist independent of any vma mappings--and one for each attached vma. Because the vma references are protected by the respective task/mm_struct's mmap_sem, we won't need to add an additional reference during lookup, nor release it when finished with the policy. And, we won't need to mess with any other task's mm data structures when installing/removing shmem policies. Of course, munmap() of a vma will need to decrement the ref count of all policies in a shared policy tree, but this is not a "fast path". Unfortunately, we don't have a unmap file operation, so I'd have to add one, or otherwise arrange to remove the unmapping vma's ref--perhaps via a vm_op so that we only need to call it on vmas that support it--i.e., that support shared policy. I could extract the parts of my shared policy series that gets us to Figure 2 and add in the necessary mods to prototype Figure 3 if you would be agreeable to this approach. However, in that case, I should produce a minimal patch to make the current reference counting correct, if overkill. This involves: 1) fixing do_set_mempolicy() to hold mmap_sem for write over change, 2) fixing up reference counting for interleaving for both normal [forgot unref] and huge [unconditional unref should be conditional] and 3) adding ref to policy in shm_get_policy() to match shmem_get_policy. All 3 of these are required to be correct w/o changing any of the rest of the current ref counting. Then, once the vma-protected shared policy mechanism discussed above is in mergable, we can back out all of the extra ref's on other task and vma policies and the lookup-time ref on shared policies, along with all of the matching unrefs. Thoughts? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-30 16:39 ` Lee Schermerhorn @ 2007-10-30 18:42 ` Christoph Lameter 2007-10-30 20:18 ` Lee Schermerhorn 2007-11-06 18:56 ` Lee Schermerhorn 0 siblings, 2 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-30 18:42 UTC (permalink / raw) To: Lee Schermerhorn Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Tue, 30 Oct 2007, Lee Schermerhorn wrote: > As part of my shared policy cleanup and enhancement series, I "fixed" > numa_maps to display the sub-ranges of policies in a shm segment mapped > by a single vma. As part of this fix, I also modified mempolicy.c so > that it does not split vmas that support set_policy vm_ops, because > handling both split vmas and non-split vmas for a single shm segment > would have complicated the code more than I thought necessary. This is > still at prototype stage--altho' it works against 23-rc8-mm2. I have not looked at that yet. Maybe you could post another patch? > Re: 'ref = 3' -- One reference for the rbtree--the shm segment and it's > policies continue to exist independent of any vma mappings--and one for > each attached vma. Because the vma references are protected by the > respective task/mm_struct's mmap_sem, we won't need to add an > additional reference during lookup, nor release it when finished with > the policy. And, we won't need to mess with any other task's mm data > structures when installing/removing shmem policies. Of course, munmap() > of a vma will need to decrement the ref count of all policies in a > shared policy tree, but this is not a "fast path". Unfortunately, we > don't have a unmap file operation, so I'd have to add one, or otherwise > arrange to remove the unmapping vma's ref--perhaps via a vm_op so that > we only need to call it on vmas that support it--i.e., that support > shared policy. Yup that sounds like it is going to be a good solution. > if overkill. This involves: 1) fixing do_set_mempolicy() to hold > mmap_sem for write over change, 2) fixing up reference counting for > interleaving for both normal [forgot unref] and huge [unconditional > unref should be conditional] and 3) adding ref to policy in > shm_get_policy() to match shmem_get_policy. All 3 of these are required > to be correct w/o changing any of the rest of the current ref counting. I know about 1. I'd have to look through 2 + 3. I would suggest to fix the refcounting by doing the refcounting using vmas as you explained above and simply remove the problems that exist there right now. > Then, once the vma-protected shared policy mechanism discussed above is > in mergable, we can back out all of the extra ref's on other task and > vma policies and the lookup-time ref on shared policies, along with all > of the matching unrefs. Too complicated. Lets go there directly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-30 18:42 ` Christoph Lameter @ 2007-10-30 20:18 ` Lee Schermerhorn 2007-11-06 18:56 ` Lee Schermerhorn 1 sibling, 0 replies; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-30 20:18 UTC (permalink / raw) To: Christoph Lameter Cc: David Rientjes, Paul Jackson, linux-mm, Andi Kleen, Eric Whitney On Tue, 2007-10-30 at 11:42 -0700, Christoph Lameter wrote: > On Tue, 30 Oct 2007, Lee Schermerhorn wrote: > > > As part of my shared policy cleanup and enhancement series, I "fixed" > > numa_maps to display the sub-ranges of policies in a shm segment mapped > > by a single vma. As part of this fix, I also modified mempolicy.c so > > that it does not split vmas that support set_policy vm_ops, because > > handling both split vmas and non-split vmas for a single shm segment > > would have complicated the code more than I thought necessary. This is > > still at prototype stage--altho' it works against 23-rc8-mm2. > > I have not looked at that yet. Maybe you could post another patch? I will, when I get around to rebasing them. For reference, you can look at the last posting back in late June: "Shared Policy Infrastructure 3/11 let vma policy ops handle sub-vma policies": http://marc.info/?l=linux-mm&m=118280133031696&w=4 and Shared Policy Infrastructure 4/11 fix show_numa_maps() http://marc.info/?l=linux-mm&m=118280133220503&w=4 > > > Re: 'ref = 3' -- One reference for the rbtree--the shm segment and it's > > policies continue to exist independent of any vma mappings--and one for > > each attached vma. Because the vma references are protected by the > > respective task/mm_struct's mmap_sem, we won't need to add an > > additional reference during lookup, nor release it when finished with > > the policy. And, we won't need to mess with any other task's mm data > > structures when installing/removing shmem policies. Of course, munmap() > > of a vma will need to decrement the ref count of all policies in a > > shared policy tree, but this is not a "fast path". Unfortunately, we > > don't have a unmap file operation, so I'd have to add one, or otherwise > > arrange to remove the unmapping vma's ref--perhaps via a vm_op so that > > we only need to call it on vmas that support it--i.e., that support > > shared policy. > > Yup that sounds like it is going to be a good solution. > > > if overkill. This involves: 1) fixing do_set_mempolicy() to hold > > mmap_sem for write over change, 2) fixing up reference counting for > > interleaving for both normal [forgot unref] and huge [unconditional > > unref should be conditional] and 3) adding ref to policy in > > shm_get_policy() to match shmem_get_policy. All 3 of these are required > > to be correct w/o changing any of the rest of the current ref counting. > > I know about 1. I'd have to look through 2 + 3. I would suggest to fix the > refcounting by doing the refcounting using vmas as you explained above and > simply remove the problems that exist there right now. > > > Then, once the vma-protected shared policy mechanism discussed above is > > in mergable, we can back out all of the extra ref's on other task and > > vma policies and the lookup-time ref on shared policies, along with all > > of the matching unrefs. > > Too complicated. Lets go there directly. OK. It's just that I predict we'll take some time to agree on the details and I know that #3 in the previous paragraph will bug out if anyone tries to use SHM_HUGETLB segments with non-default policy on 2.6.23. The unconditional unref in #2 might [will?] also result in premature freeing of an interleave policy for huge pages that can result in a bug out on next allocation that tries to use the freed policy. That's why I suggested a temporary minimal fix for 24 and 23-stable. I'll go ahead and put one together, just in case. You can ack or nack as you see fit. Meanwhile, my adventures in testing shared policies applied from different tasks have, just today, uncovered another bug we can hit in rmap.c:vma_address(). Appears to have been there back at least since 2.6.18. I'll send out details of that in a subsequent message. I think my shared policy patches [non-split vma for shmem segments] fixes that. I'll test such a kernel before I post my findings. Later, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-10-30 18:42 ` Christoph Lameter 2007-10-30 20:18 ` Lee Schermerhorn @ 2007-11-06 18:56 ` Lee Schermerhorn 2007-11-06 19:15 ` Christoph Lameter 1 sibling, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-11-06 18:56 UTC (permalink / raw) To: Christoph Lameter Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 2007-10-30 at 11:42 -0700, Christoph Lameter wrote: > On Tue, 30 Oct 2007, Lee Schermerhorn wrote: > > > As part of my shared policy cleanup and enhancement series, I "fixed" > > numa_maps to display the sub-ranges of policies in a shm segment mapped > > by a single vma. As part of this fix, I also modified mempolicy.c so > > that it does not split vmas that support set_policy vm_ops, because > > handling both split vmas and non-split vmas for a single shm segment > > would have complicated the code more than I thought necessary. This is > > still at prototype stage--altho' it works against 23-rc8-mm2. > > I have not looked at that yet. Maybe you could post another patch? > > > Re: 'ref = 3' -- One reference for the rbtree--the shm segment and it's > > policies continue to exist independent of any vma mappings--and one for > > each attached vma. Because the vma references are protected by the > > respective task/mm_struct's mmap_sem, we won't need to add an > > additional reference during lookup, nor release it when finished with > > the policy. And, we won't need to mess with any other task's mm data > > structures when installing/removing shmem policies. Of course, munmap() > > of a vma will need to decrement the ref count of all policies in a > > shared policy tree, but this is not a "fast path". Unfortunately, we > > don't have a unmap file operation, so I'd have to add one, or otherwise > > arrange to remove the unmapping vma's ref--perhaps via a vm_op so that > > we only need to call it on vmas that support it--i.e., that support > > shared policy. > > Yup that sounds like it is going to be a good solution. > Christoph: After looking at this and attempting to implement it, I find that it won't work. The reason is that I can't tell from just vma references whether an mempolicy in the shared policy rbtree is actually in use. A task is allowed to change the policies in the rbtree at any time--a feature that I understand you have no use for and therefore don't like, but which is fundamental to shared policy semantics. If I try to install a policy that completely covers/replaces an existing policy, I need to be able to do this, regardless of how many vmas have the shared region attached/mapped. So, this doesn't protect any task that is currently examining the policy for page allocation, get_mempolicy() or show_numa_maps() without the extra ref. Andi had probably figured this out back when he implemented shared policies. I have another approach that still involves adding a ref to shared policies at lookup time, and dropping the ref when finished with the policy. I know you don't like the idea of taking references in the vma policy lookup path. However, the 'get() is already there [for shared policies]. I just need to add the 'free() [which Mel G would like to see renamed at mpol_put()]. I have a patch that does the unref only for shared policies, along with the other cleanups necessary in this area. I hope to post soon, but I've said that before. I'll also rerun the pft tests with and without this change when I can. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-11-06 18:56 ` Lee Schermerhorn @ 2007-11-06 19:15 ` Christoph Lameter 2007-11-06 19:35 ` Lee Schermerhorn 0 siblings, 1 reply; 14+ messages in thread From: Christoph Lameter @ 2007-11-06 19:15 UTC (permalink / raw) To: Lee Schermerhorn Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 6 Nov 2007, Lee Schermerhorn wrote: > After looking at this and attempting to implement it, I find that it > won't work. The reason is that I can't tell from just vma references > whether an mempolicy in the shared policy rbtree is actually in use. A > task is allowed to change the policies in the rbtree at any time--a > feature that I understand you have no use for and therefore don't like, I am not sure what my dislikes have to do with anything. This needs to work and be made to work in such a way that it does not negatively impact the rest of the system. What do you mean by in use? If a vma can potentially use a shared policy in a rbtree then it is in use right? > but which is fundamental to shared policy semantics. If I try to > install a policy that completely covers/replaces an existing policy, I > need to be able to do this, regardless of how many vmas have the shared > region attached/mapped. So, this doesn't protect any task that is > currently examining the policy for page allocation, get_mempolicy() or > show_numa_maps() without the extra ref. Andi had probably figured this > out back when he implemented shared policies. AFAICT: If you take a reference on the shared policy for each vma then you can tell from the references that the policy is in use. > I have another approach that still involves adding a ref to shared > policies at lookup time, and dropping the ref when finished with the > policy. I know you don't like the idea of taking references in the vma > policy lookup path. However, the 'get() is already there [for shared > policies]. I just need to add the 'free() [which Mel G would like to > see renamed at mpol_put()]. I have a patch that does the unref only for > shared policies, along with the other cleanups necessary in this area. > > I hope to post soon, but I've said that before. I'll also rerun the pft > tests with and without this change when I can. I am fine with this if it only affects the shmem policies and no critical performance hot paths for regular anonymous and page cache allocations. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-11-06 19:15 ` Christoph Lameter @ 2007-11-06 19:35 ` Lee Schermerhorn 2007-11-06 19:43 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-11-06 19:35 UTC (permalink / raw) To: Christoph Lameter Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 2007-11-06 at 11:15 -0800, Christoph Lameter wrote: > On Tue, 6 Nov 2007, Lee Schermerhorn wrote: > > > After looking at this and attempting to implement it, I find that it > > won't work. The reason is that I can't tell from just vma references > > whether an mempolicy in the shared policy rbtree is actually in use. A > > task is allowed to change the policies in the rbtree at any time--a > > feature that I understand you have no use for and therefore don't like, > > I am not sure what my dislikes have to do with anything. This needs to > work and be made to work in such a way that it does not negatively impact > the rest of the system. We always seem to rathole on that subject. I just hoped to head that off... > > What do you mean by in use? If a vma can potentially use a shared policy > in a rbtree then it is in use right? Not really--not for shared policies. Again, another task is allowed to remove or replace the shared policies at any time, regardless of the number of task's attached to the segment. We can't differentiate between simple attachment and current use. We need the lookup-time ref/unref to know that the policy is actually in use. We can still replace it in the tree while it's "in use". This will remove the tree's reference on the policy, but the policy won't be freed until the task holding the extra ref drops it. I suppose we could stick any replaced mempolicy on a list associated with the segment and keep them there until all tasks detach from the shared segment. Not too much of a memory leak, as long as a task doesn't keep changing policy on a shmem segment just to be perverse. Don't know how I'd limit that. On top of all the other mechanism I'd have to add to track vma refs to shared policies [and it appears to be quite a bit], it wouldn't be all that much additional work. But, the entire effort didn't seem worth it when I determined that I couldn't safely free the policies when they're removed/replaced. That was the purpose of the exercise... > > > but which is fundamental to shared policy semantics. If I try to > > install a policy that completely covers/replaces an existing policy, I > > need to be able to do this, regardless of how many vmas have the shared > > region attached/mapped. So, this doesn't protect any task that is > > currently examining the policy for page allocation, get_mempolicy() or > > show_numa_maps() without the extra ref. Andi had probably figured this > > out back when he implemented shared policies. > > AFAICT: If you take a reference on the shared policy for each > vma then you can tell from the references that the policy is in use. See above. A vma reference does not constitute use for a shared policy. Again, I could replace them, I just can't free the replaced ones... Unless I can figure out some [s]rcu mechanism. If not, I'll have to think about how to limit a task from causing an arbitrary number of them to be allocated w/o being freed. > > > I have another approach that still involves adding a ref to shared > > policies at lookup time, and dropping the ref when finished with the > > policy. I know you don't like the idea of taking references in the vma > > policy lookup path. However, the 'get() is already there [for shared > > policies]. I just need to add the 'free() [which Mel G would like to > > see renamed at mpol_put()]. I have a patch that does the unref only for > > shared policies, along with the other cleanups necessary in this area. > > > > I hope to post soon, but I've said that before. I'll also rerun the pft > > tests with and without this change when I can. > > I am fine with this if it only affects the shmem policies and no critical > performance hot paths for regular anonymous and page cache allocations. So far, that's what it looks like. Again, I'll continue to think about the vma refs along with some RCU mechanism for deferred freeing of the policies. Later, Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-11-06 19:35 ` Lee Schermerhorn @ 2007-11-06 19:43 ` Christoph Lameter 2007-11-06 20:08 ` Lee Schermerhorn 0 siblings, 1 reply; 14+ messages in thread From: Christoph Lameter @ 2007-11-06 19:43 UTC (permalink / raw) To: Lee Schermerhorn Cc: AndiKleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 6 Nov 2007, Lee Schermerhorn wrote: > We always seem to rathole on that subject. I just hoped to head that > off... Well fix this and the rathole will be gone., > > What do you mean by in use? If a vma can potentially use a shared policy > > in a rbtree then it is in use right? > > Not really--not for shared policies. Again, another task is allowed to > remove or replace the shared policies at any time, regardless of the > number of task's attached to the segment. We can't differentiate > between simple attachment and current use. We need the lookup-time > ref/unref to know that the policy is actually in use. We can still > replace it in the tree while it's "in use". This will remove the tree's > reference on the policy, but the policy won't be freed until the task > holding the extra ref drops it. Stil unclear as to why we need lookup time ref/unref. A task can replace the shared policy at any time you just need to update the refcounts. If you have a pointer to the policy in the vma then its possible to do so. > I suppose we could stick any replaced mempolicy on a list associated > with the segment and keep them there until all tasks detach from the > shared segment. Not too much of a memory leak, as long as a task Well you have the refcount on the policy? Why keep the mempolicy around? > > AFAICT: If you take a reference on the shared policy for each > > vma then you can tell from the references that the policy is in use. > > See above. A vma reference does not constitute use for a shared policy. Why not? What does constitute "use" of a shared policy? A page that has used the policy? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-11-06 19:43 ` Christoph Lameter @ 2007-11-06 20:08 ` Lee Schermerhorn 2007-11-06 20:19 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-11-06 20:08 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 2007-11-06 at 11:43 -0800, Christoph Lameter wrote: > On Tue, 6 Nov 2007, Lee Schermerhorn wrote: > > > We always seem to rathole on that subject. I just hoped to head that > > off... > > Well fix this and the rathole will be gone., I'll hold you to that! :-) > > > > What do you mean by in use? If a vma can potentially use a shared policy > > > in a rbtree then it is in use right? > > > > Not really--not for shared policies. Again, another task is allowed to > > remove or replace the shared policies at any time, regardless of the > > number of task's attached to the segment. We can't differentiate > > between simple attachment and current use. We need the lookup-time > > ref/unref to know that the policy is actually in use. We can still > > replace it in the tree while it's "in use". This will remove the tree's > > reference on the policy, but the policy won't be freed until the task > > holding the extra ref drops it. > > Stil unclear as to why we need lookup time ref/unref. A task can replace > the shared policy at any time you just need to update the refcounts. If > you have a pointer to the policy in the vma then its possible to do so. A pointer in the vma won't work. Different tasks could apply policies on different ranges and shared policy semantics dictate that all tasks see the same policy for a particular offset in the region--modulo set/get races. The only way we could keep a pointer in the vma would be to split the vmas in every task that has the shared region attached whenever any task changes the policy of a range of the region, so that all tasks have the same set of vma's all pointing to the same set of policies in the tree. I don't think we can be changing other task's address space externally like this. And it still wouldn't work, I think, for shared policy semantics--again, except maybe with some sort of rcu mechanism. More below on what constitutes actual "use". > > > I suppose we could stick any replaced mempolicy on a list associated > > with the segment and keep them there until all tasks detach from the > > shared segment. Not too much of a memory leak, as long as a task > > Well you have the refcount on the policy? Why keep the mempolicy around? A non-zero ref count is what keeps the policy around. It implies that some structure has a pointer to the policy, or some task is actively examining the policy and will drop the reference when finsished with it. [The latter is what's NOT happening now for shared policy.] > > > > AFAICT: If you take a reference on the shared policy for each > > > vma then you can tell from the references that the policy is in use. > > > > See above. A vma reference does not constitute use for a shared policy. > > Why not? What does constitute "use" of a shared policy? A page that has > used the policy? Currently, when you lookup the policy [based on offset] in the rbtree under spin_lock, the lookup function does an mpol_get() before dropping the lock. Now, you can use the policy to allocate a page or to report via get_mempolicy(MPOL_F_ADDR) or show_numa_maps()/mpol_to_str(). When you're finished with the policy, you mpol_free() to release the reference. While you're holding this ref, another task that has the shared region attached can replace/delete the policy, removing it from the rbtree and dropping the rbtree's reference via mpol_free(). Now, the only reference to the policy is any reference held by a task that has looked it up, but not yet mpol_free()ed it. When the last task holding such a reference releases it, we'll free it back to the kmem cache. This is the type of use that I can't infer from vma counts or even vma pointer refs. I should be able to replace the vma pointer/ref at any time when the shared policy changes, and mpol_free() the policy for each such vma pointer/ref. That leaves no ref to hold the policy should it be in use [as discussed above]. Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [NUMA] Fix memory policy refcounting 2007-11-06 20:08 ` Lee Schermerhorn @ 2007-11-06 20:19 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-11-06 20:19 UTC (permalink / raw) To: Lee Schermerhorn Cc: Andi Kleen, linux-mm, Eric Whitney, David Rientjes, Paul Jackson On Tue, 6 Nov 2007, Lee Schermerhorn wrote: > > Stil unclear as to why we need lookup time ref/unref. A task can replace > > the shared policy at any time you just need to update the refcounts. If > > you have a pointer to the policy in the vma then its possible to do so. > > A pointer in the vma won't work. Different tasks could apply policies > on different ranges and shared policy semantics dictate that all tasks > see the same policy for a particular offset in the region--modulo > set/get races. The only way we could keep a pointer in the vma would be > to split the vmas in every task that has the shared region attached > whenever any task changes the policy of a range of the region, so that > all tasks have the same set of vma's all pointing to the same set of > policies in the tree. I don't think we can be changing other task's > address space externally like this. And it still wouldn't work, I > think, for shared policy semantics--again, except maybe with some sort > of rcu mechanism. More below on what constitutes actual "use". You can split vmas by holding a writelock on mmap_sem. But we have discussed that before. If different policies are in effect for ranges of a shared mapping (that is what you are talking about?) then vmas need to correspond to these ranges and contain a pointer to the shard policy. > > > I suppose we could stick any replaced mempolicy on a list associated > > > with the segment and keep them there until all tasks detach from the > > > shared segment. Not too much of a memory leak, as long as a task > > > > Well you have the refcount on the policy? Why keep the mempolicy around? > > A non-zero ref count is what keeps the policy around. It implies that > some structure has a pointer to the policy, or some task is actively > examining the policy and will drop the reference when finsished with it. > [The latter is what's NOT happening now for shared policy.] The shared policy has one refcount because it is in the rbtree. The others are currently temporary. With establishment of references via vmas you have 3 types of references that are taken on a policy 1. Rbtree 2. vma 3. temporary (is this really needed?) If the refcount of a shared policy becomes one then you can remove a policy from the rbtree and free it. The role of the rbtree is drastically reduced if you put the pointer into the vma because lookups in the tree are only needed when the vma is established. shmem operations will be faster because of the vma policy pointer. > > Why not? What does constitute "use" of a shared policy? A page that has > > used the policy? > > Currently, when you lookup the policy [based on offset] in the rbtree > under spin_lock, the lookup function does an mpol_get() before dropping > the lock. Now, you can use the policy to allocate a page or to report > via get_mempolicy(MPOL_F_ADDR) or show_numa_maps()/mpol_to_str(). When > you're finished with the policy, you mpol_free() to release the > reference. While you're holding this ref, another task that has the > shared region attached can replace/delete the policy, removing it from > the rbtree and dropping the rbtree's reference via mpol_free(). Now, > the only reference to the policy is any reference held by a task that The policy can be removed while holding a refcount? You can do the same then with the refcounts via vma. Drop the policy from the rbtree and then go through the vmas to drop the refs. At some point the refcount reaches zero and the policy vanished. > This is the type of use that I can't infer from vma counts or even vma > pointer refs. I should be able to replace the vma pointer/ref at any > time when the shared policy changes, and mpol_free() the policy for each > such vma pointer/ref. That leaves no ref to hold the policy should it > be in use [as discussed above]. AFAICT you are able to do all that you need. If a policy is in use then it has a refcount. You need to take the mmap_sem writelock to change policies. At that point you are guaranteed that no fault is in progress thus no user of the policy exists. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2007-11-06 20:19 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-10-26 23:41 [NUMA] Fix memory policy refcounting Christoph Lameter 2007-10-29 15:48 ` Lee Schermerhorn 2007-10-29 20:24 ` Christoph Lameter 2007-10-29 21:34 ` Lee Schermerhorn 2007-10-29 21:43 ` Christoph Lameter 2007-10-30 16:39 ` Lee Schermerhorn 2007-10-30 18:42 ` Christoph Lameter 2007-10-30 20:18 ` Lee Schermerhorn 2007-11-06 18:56 ` Lee Schermerhorn 2007-11-06 19:15 ` Christoph Lameter 2007-11-06 19:35 ` Lee Schermerhorn 2007-11-06 19:43 ` Christoph Lameter 2007-11-06 20:08 ` Lee Schermerhorn 2007-11-06 20:19 ` Christoph Lameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).