* [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes
@ 2007-10-12 15:48 Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:48 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, eric.whitney, clameter, mel
PATCH 0/4 Memory Policy: More Reference Counting and Fallback Fixes
While testing huge pages using shm segments created with the SHM_HUGETLB flag,
came across additional problems with memory policy reference counting. Some
of the problems were introduced by myself in my previous mempolicy ref counting
patch, and some were just paths that I missed in my inspection and testing.
This resend separates the 2nd patch of the previous posting into 3 separate
fixes for 3 separate, but interrelated, issues. At Christoph Lameter's
request. Rebuilt and cursory testing on x86_64 numa platform only.
These 4 patches are based against 2.6.23-rc8-mm2, but should also probably
go to the .23-stable tree when it opens.
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread* [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver 2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn @ 2007-10-12 15:49 ` Lee Schermerhorn 2007-10-12 17:29 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn ` (2 subsequent siblings) 3 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, mel, eric.whitney, clameter PATCH 1/4 Fix memory policy usage in pci driver Against: 2.6.23-rc8-mm2 In an attempt to ensure memory allocation from the local node, the pci driver temporarily replaces the current task's memory policy with the system default policy. Trying to be a good citizen, the driver then call's mpol_get() on the new policy. When it's finished probing, it undoes the '_get by calling mpol_free() [on the system default policy] and then restores the current task's saved mempolicy. A couple of issues here: 1) it's never necessary to set a task's mempolicy to the system default policy in order to get system default allocation behavior. Simply set the current task's mempolicy to NULL and allocations will fall back to system default policy. 2) we should never [need to] call mpol_free() on the system default policy. [I plan on trapping this with a VM_BUG_ON() in a subsequent patch.] This patch removes the calls to mpol_get() and mpol_free() and uses NULL for the temporary task mempolicy to effect default allocation behavior. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> drivers/pci/pci-driver.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) Index: Linux/drivers/pci/pci-driver.c =================================================================== --- Linux.orig/drivers/pci/pci-driver.c 2007-10-09 14:31:57.000000000 -0400 +++ Linux/drivers/pci/pci-driver.c 2007-10-09 14:43:57.000000000 -0400 @@ -177,13 +177,11 @@ static int pci_call_probe(struct pci_dri set_cpus_allowed(current, node_to_cpumask(node)); /* And set default memory allocation policy */ oldpol = current->mempolicy; - current->mempolicy = &default_policy; - mpol_get(current->mempolicy); + current->mempolicy = NULL; /* fall back to system default policy */ #endif error = drv->probe(dev, id); #ifdef CONFIG_NUMA set_cpus_allowed(current, oldmask); - mpol_free(current->mempolicy); current->mempolicy = oldpol; #endif return error; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver 2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn @ 2007-10-12 17:29 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-12 17:29 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, mel, eric.whitney I already acked this one twice. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting 2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn 2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn @ 2007-10-12 15:49 ` Lee Schermerhorn 2007-10-12 17:30 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn 2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn 3 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw) To: akpm; +Cc: linux-mm, clameter, ak, eric.whitney, mel PATCH 2/4 Mem Policy: fix reference counting for SHM_HUGETLB segments Against: 2.6.23-rc8-mm2 Separated from previous multi-issue patch 2/2 NOTE: w/o this fix, one will BUG-out on 2nd page fault to a SHM_HUGETLB segment with memory policy applied via mbind(). get_vma_policy() assumes that shared policies are referenced by the get_policy() vm_op, if any. This is true for shmem_get_policy() but not for shm_get_policy() when the "backing file" does not support a get_policy() vm_op. The latter is the case for SHM_HUGETLB segments. Because get_vma_policy() expects the get_policy() op to have added a ref, it doesn't do so itself. This results in premature freeing of the policy. Add the mpol_get() to the shm_get_policy() op when the backing file doesn't support shared policies. Further, shm_get_policy() was falling back to current task's task policy if the backing file did not support get_policy() vm_op and the vma policy was null. This is not valid when get_vma_policy() is called from show_numa_map() as task != current. Also, this did not match the behavior of the shmem_get_policy() vm_op which did NOT fall back to task policy. So, modify shm_get_policy() NOT to fall back to current->mempolicy. Document mempolicy return value reference semantics assumed by the changes discussed above for the set_ and get_policy vm_ops in <linux/mm.h>--where the prototypes are defined. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Index: Linux/ipc/shm.c =================================================================== --- Linux.orig/ipc/shm.c 2007-10-10 14:58:12.000000000 -0400 +++ Linux/ipc/shm.c 2007-10-10 14:59:13.000000000 -0400 @@ -263,10 +263,10 @@ static struct mempolicy *shm_get_policy( if (sfd->vm_ops->get_policy) pol = sfd->vm_ops->get_policy(vma, addr); - else if (vma->vm_policy) + else if (vma->vm_policy) { pol = vma->vm_policy; - else - pol = current->mempolicy; + mpol_get(pol); /* get_vma_policy() assumes this */ + } return pol; } #endif Index: Linux/include/linux/mm.h =================================================================== --- Linux.orig/include/linux/mm.h 2007-10-10 14:58:12.000000000 -0400 +++ Linux/include/linux/mm.h 2007-10-11 14:07:43.000000000 -0400 @@ -173,7 +173,21 @@ struct vm_operations_struct { * writable, if an error is returned it will cause a SIGBUS */ int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page); #ifdef CONFIG_NUMA + /* + * set_policy() op must add a reference to any non-NULL @new mempolicy + * to hold the policy upon return. Caller should pass NULL @new to + * remove a policy and fall back to surrounding context--i.e. do not + * install a MPOL_DEFAULT policy, nor the task or system default + * mempolicy. + */ int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new); + + /* + * get_policy() op must add reference [mpol_get()] to any mempolicy + * at (vma,addr). If no [shared/vma] mempolicy exists at that addr, + * get_policy() op must return NULL--i.e., do not "fallback" to task + * or system default policy. + */ struct mempolicy *(*get_policy)(struct vm_area_struct *vma, unsigned long addr); int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from, -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting 2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn @ 2007-10-12 17:30 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-12 17:30 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 3/4] Mem Policy: Fixup Interleave Policy Reference Counting 2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn 2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn 2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn @ 2007-10-12 15:49 ` Lee Schermerhorn 2007-10-12 17:33 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn 3 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, mel, clameter, eric.whitney PATCH 3/4 Mempolicy: Fixup Interleave Policy Reference Counting Against: 2.6.23-rc8-mm2 Separated from multi-issue patch 2/2 In the memory policy reference counting cleanup patch, I missed one path that needs to unreference the memory policy. After computing the target node for interleave policy, we need to drop the reference if the policy is not the system default nor the current task's policy. In huge_zonelist(), I was unconditionally unref'ing the policy in the interleave path, even when it was a policy that didn't need it. Fix this! Note: I investigated moving the check for "policy_needs_unref" to the mpol_free() wrapper, but this led to nasty circular header dependencies. If we wanted to make mpol_free() an external function, rather than a static inline, I could do this and remove several checks. I'd still need to keep an explicit check in alloc_page_vma() if we want to use a tail-call for the fast path. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> mm/mempolicy.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) Index: Linux/mm/mempolicy.c =================================================================== --- Linux.orig/mm/mempolicy.c 2007-10-12 10:48:03.000000000 -0400 +++ Linux/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 @@ -1262,18 +1262,21 @@ struct zonelist *huge_zonelist(struct vm { struct mempolicy *pol = get_vma_policy(current, vma, addr); struct zonelist *zl; + int policy_needs_unref = (pol != &default_policy && \ + pol != current->mempolicy); *mpol = NULL; /* probably no unref needed */ if (pol->policy == MPOL_INTERLEAVE) { unsigned nid; nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT); - __mpol_free(pol); /* finished with pol */ + if (unlikely(policy_needs_unref)) + __mpol_free(pol); /* finished with pol */ return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags); } zl = zonelist_policy(GFP_HIGHUSER, pol); - if (unlikely(pol != &default_policy && pol != current->mempolicy)) { + if (unlikely(policy_needs_unref)) { if (pol->policy != MPOL_BIND) __mpol_free(pol); /* finished with pol */ else @@ -1325,6 +1328,9 @@ alloc_page_vma(gfp_t gfp, struct vm_area { struct mempolicy *pol = get_vma_policy(current, vma, addr); struct zonelist *zl; + int policy_needs_unref = (pol != &default_policy && \ + pol != current->mempolicy); + cpuset_update_task_memory_state(); @@ -1332,10 +1338,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area unsigned nid; nid = interleave_nid(pol, vma, addr, PAGE_SHIFT); + if (unlikely(policy_needs_unref)) + __mpol_free(pol); return alloc_page_interleave(gfp, 0, nid); } zl = zonelist_policy(gfp, pol); - if (pol != &default_policy && pol != current->mempolicy) { + if (unlikely(policy_needs_unref)) { /* * slow path: ref counted policy -- shared or vma */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 3/4] Mem Policy: Fixup Interleave Policy Reference Counting 2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn @ 2007-10-12 17:33 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-12 17:33 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, mel, eric.whitney On Fri, 12 Oct 2007, Lee Schermerhorn wrote: > Note: I investigated moving the check for "policy_needs_unref" > to the mpol_free() wrapper, but this led to nasty circular header > dependencies. If we wanted to make mpol_free() an external > function, rather than a static inline, I could do this and > remove several checks. I'd still need to keep an explicit > check in alloc_page_vma() if we want to use a tail-call for > the fast path. At a mininum we need to somehow encapsulate these checks that may now have to be done in multiple place. This is going to be ugly because it adds a lot of special casing to policy handling. Is there some way to put smarts into mpol_get to deal with this? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn ` (2 preceding siblings ...) 2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn @ 2007-10-12 15:49 ` Lee Schermerhorn 2007-10-12 17:57 ` Christoph Lameter 3 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw) To: akpm; +Cc: linux-mm, ak, eric.whitney, clameter, mel PATCH 4/4 Mempolicy: Fixup Fallback for Default Shmem Policy Against: 2.6.23-rc8-mm2 Separated from previous multi-issue patch 2/2 get_vma_policy() was not handling fallback to task policy correctly when the get_policy() vm_op returns NULL. The NULL overwrites the 'pol' variable that was holding the fallback task mempolicy. So, it was falling back directly to system default policy. Fix get_vma_policy() to use only non-NULL policy returned from the vma get_policy op and indicate that this policy does not need another ref count. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> mm/mempolicy.c | 16 +++++++++++----- 1 file changed, 11 insertions(+), 5 deletions(-) Index: Linux/mm/mempolicy.c =================================================================== --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400 @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = task->mempolicy; - int shared_pol = 0; + int pol_needs_ref = (task != current); if (vma) { if (vma->vm_ops && vma->vm_ops->get_policy) { - pol = vma->vm_ops->get_policy(vma, addr); - shared_pol = 1; /* if pol non-NULL, add ref below */ + struct mempolicy *vpol = vma->vm_ops->get_policy(vma, + addr); + if (vpol) { + pol = vpol; + pol_needs_ref = 0; /* get_policy() added ref */ + } } else if (vma->vm_policy && - vma->vm_policy->policy != MPOL_DEFAULT) + vma->vm_policy->policy != MPOL_DEFAULT) { pol = vma->vm_policy; + pol_needs_ref++; + } } if (!pol) pol = &default_policy; - else if (!shared_pol && pol != current->mempolicy) + else if (pol_needs_ref) mpol_get(pol); /* vma or other task's policy */ return pol; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn @ 2007-10-12 17:57 ` Christoph Lameter 2007-10-15 19:34 ` Christoph Lameter 2007-10-23 17:32 ` Lee Schermerhorn 0 siblings, 2 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-12 17:57 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Fri, 12 Oct 2007, Lee Schermerhorn wrote: > get_vma_policy() was not handling fallback to task policy correctly > when the get_policy() vm_op returns NULL. The NULL overwrites > the 'pol' variable that was holding the fallback task mempolicy. > So, it was falling back directly to system default policy. > > Fix get_vma_policy() to use only non-NULL policy returned from > the vma get_policy op and indicate that this policy does not need > another ref count. I still think there must be a thinko here. The function seems to be currently coded with the assumption that get_policy always returns a policy. That policy may be the default policy?? If it returns NULL then the tasks policy is applied to shmem segment. I though we wanted a consistent application of policies to shmem segments? Now one task or another may determine placement. I still have no idea what your warrant is for being sure that the object continues to exist before increasing the policy refcount in get_vma_policy()? What pins the shared policy before we get the refcount? Some more concerns below: > Index: Linux/mm/mempolicy.c > =================================================================== > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400 > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy > struct vm_area_struct *vma, unsigned long addr) > { > struct mempolicy *pol = task->mempolicy; > - int shared_pol = 0; > + int pol_needs_ref = (task != current); If get_vma_policy is called from the numa_maps handler then we have taken a refcount on the task struct. So this should be int pol_needs_ref = 0; > > if (vma) { > if (vma->vm_ops && vma->vm_ops->get_policy) { > - pol = vma->vm_ops->get_policy(vma, addr); > - shared_pol = 1; /* if pol non-NULL, add ref below */ > + struct mempolicy *vpol = vma->vm_ops->get_policy(vma, > + addr); > + if (vpol) { > + pol = vpol; > + pol_needs_ref = 0; /* get_policy() added ref */ > + } > } else if (vma->vm_policy && > - vma->vm_policy->policy != MPOL_DEFAULT) > + vma->vm_policy->policy != MPOL_DEFAULT) { > pol = vma->vm_policy; > + pol_needs_ref++; Why do we need a ref here for a vma policy? The policy is pinned through the ref to the task structure. > + } > } > if (!pol) > pol = &default_policy; > - else if (!shared_pol && pol != current->mempolicy) > + else if (pol_needs_ref) > mpol_get(pol); /* vma or other task's policy */ > return pol; The mpol_get() here looks wrong. get_vma_policy determines the current policy. The policy must already be pinned by increasing the refcount or use in a certain task before get_vma_policy is ever called. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-12 17:57 ` Christoph Lameter @ 2007-10-15 19:34 ` Christoph Lameter 2007-10-23 16:15 ` Lee Schermerhorn 2007-10-23 17:32 ` Lee Schermerhorn 1 sibling, 1 reply; 14+ messages in thread From: Christoph Lameter @ 2007-10-15 19:34 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Fri, 12 Oct 2007, Christoph Lameter wrote: > > Index: Linux/mm/mempolicy.c > > =================================================================== > > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 > > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400 > > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy > > struct vm_area_struct *vma, unsigned long addr) > > { > > struct mempolicy *pol = task->mempolicy; > > - int shared_pol = 0; > > + int pol_needs_ref = (task != current); > > If get_vma_policy is called from the numa_maps handler then we have taken > a refcount on the task struct. > > So this should be > int pol_needs_ref = 0; Argh. Refcount is not it. We have taken the mmap_sem lock because we are scanning though the pages. This avoids issues for the vma policies that can only be set when a writelock was taken on mmap_sem. However, mmap_sem is not taken when setting task->mempolicy. Taking mmap_sem there would solve the issue (we have discussed this before). You cannot reliably take a refcount on a foreign task structs mempolicy since the task may just be in the process of switching policies. You could increment the refcount and then the other task frees the structure. I think we need something like this: Index: linux-2.6/mm/mempolicy.c =================================================================== --- linux-2.6.orig/mm/mempolicy.c 2007-10-15 12:32:45.000000000 -0700 +++ linux-2.6/mm/mempolicy.c 2007-10-15 12:33:56.000000000 -0700 @@ -468,11 +468,13 @@ long do_set_mempolicy(int mode, nodemask new = mpol_new(mode, nodes); if (IS_ERR(new)) return PTR_ERR(new); + down_read(¤t->mm->mmap_sem); mpol_free(current->mempolicy); current->mempolicy = new; mpol_set_task_struct_flag(); if (new && new->policy == MPOL_INTERLEAVE) current->il_next = first_node(new->v.nodes); + up_read(¤t->mm->mmap_sem); return 0; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-15 19:34 ` Christoph Lameter @ 2007-10-23 16:15 ` Lee Schermerhorn 2007-10-23 16:23 ` Christoph Lameter 0 siblings, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-23 16:15 UTC (permalink / raw) To: Christoph Lameter; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Mon, 2007-10-15 at 12:34 -0700, Christoph Lameter wrote: > On Fri, 12 Oct 2007, Christoph Lameter wrote: > > > > Index: Linux/mm/mempolicy.c > > > =================================================================== > > > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 > > > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400 > > > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy > > > struct vm_area_struct *vma, unsigned long addr) > > > { > > > struct mempolicy *pol = task->mempolicy; > > > - int shared_pol = 0; > > > + int pol_needs_ref = (task != current); > > > > If get_vma_policy is called from the numa_maps handler then we have taken > > a refcount on the task struct. > > > > So this should be > > int pol_needs_ref = 0; > > Argh. Refcount is not it. We have taken the mmap_sem lock > because we are scanning though the pages. This avoids issues for the vma > policies that can only be set when a writelock was taken on mmap_sem. > > However, mmap_sem is not taken when setting task->mempolicy. Taking > mmap_sem there would solve the issue (we have discussed this before). > > You cannot reliably take a refcount on a foreign task structs mempolicy > since the task may just be in the process of switching policies. You could > increment the refcount and then the other task frees the structure. > > I think we need something like this: > > Index: linux-2.6/mm/mempolicy.c > =================================================================== > --- linux-2.6.orig/mm/mempolicy.c 2007-10-15 12:32:45.000000000 -0700 > +++ linux-2.6/mm/mempolicy.c 2007-10-15 12:33:56.000000000 -0700 > @@ -468,11 +468,13 @@ long do_set_mempolicy(int mode, nodemask > new = mpol_new(mode, nodes); > if (IS_ERR(new)) > return PTR_ERR(new); > + down_read(¤t->mm->mmap_sem); > mpol_free(current->mempolicy); > current->mempolicy = new; > mpol_set_task_struct_flag(); > if (new && new->policy == MPOL_INTERLEAVE) > current->il_next = first_node(new->v.nodes); > + up_read(¤t->mm->mmap_sem); > return 0; > } > Christoph: just getting back to this. You sent two messages commented about this patch. I'm not sure whether this one supercedes the previous one or adds to it. So, I'll address the points in your other comment separately. Re: this patch: I can see how we need to grab the mmap_sem during do_set_mempolicy() to coordinate with the numa_maps display. However, shouldn't we use {down|up}_write() here to obtain exclusive access with respect to numa_maps ? Lee -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-23 16:15 ` Lee Schermerhorn @ 2007-10-23 16:23 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-23 16:23 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Tue, 23 Oct 2007, Lee Schermerhorn wrote: > Christoph: just getting back to this. You sent two messages commented > about this patch. I'm not sure whether this one supercedes the previous > one or adds to it. So, I'll address the points in your other comment > separately. It supersedes one earlier comment. > Re: this patch: I can see how we need to grab the mmap_sem during > do_set_mempolicy() to coordinate with the numa_maps display. However, > shouldn't we use {down|up}_write() here to obtain exclusive access with > respect to numa_maps ? Right. We must obtain a writelock. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-12 17:57 ` Christoph Lameter 2007-10-15 19:34 ` Christoph Lameter @ 2007-10-23 17:32 ` Lee Schermerhorn 2007-10-24 13:09 ` Christoph Lameter 1 sibling, 1 reply; 14+ messages in thread From: Lee Schermerhorn @ 2007-10-23 17:32 UTC (permalink / raw) To: Christoph Lameter; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Fri, 2007-10-12 at 10:57 -0700, Christoph Lameter wrote: > On Fri, 12 Oct 2007, Lee Schermerhorn wrote: > > > get_vma_policy() was not handling fallback to task policy correctly > > when the get_policy() vm_op returns NULL. The NULL overwrites > > the 'pol' variable that was holding the fallback task mempolicy. > > So, it was falling back directly to system default policy. > > > > Fix get_vma_policy() to use only non-NULL policy returned from > > the vma get_policy op and indicate that this policy does not need > > another ref count. > > I still think there must be a thinko here. The function seems to be > currently coded with the assumption that get_policy always returns a > policy. That policy may be the default policy?? My assumption is that the get_policy vm_op should either return a [non-NULL] mempolicy corresponding to the specified address with the ref count elevated for the caller, or NULL. Never the default policy. Fallback will be handled by get_vma_policy(). I was thinking that we HAD to do the fallback in get_vma_policy() to get the reference counting correct for show_numa_maps(). But, based on your other mail, I agree that we don't need to reference count another task's task mempolicy, if we take the mmap_sem for write in do_set_mempolicy(). However, I still think that fallback is best handled in one place--for consistency :-). > > If it returns NULL then the tasks policy is applied to shmem segment. I > though we wanted a consistent application of policies to shmem segments? > Now one task or another may determine placement. OK. To address this, one must consider "when can the get_policy() vm op" return a NULL? To answer the question, first note that there are [currently] a couple of configurations to consider: 1) direct shared mapping [MAP_SHARED] of tmpfs backed storage, or SysV shm segments without the SHM_HUGETLB flag. Either of these will get you the shmem vm_ops--indirectly via the shm vm_ops in the case of a SysV shm segment. The shmem {set|get}_policy ops support different policies on different ranges of the segment {actually on ranges of the tmpfs file backing the segment} on a page granularity. Now, before any of my changes, if you never install a mempolicy on any range of a shmem segment, the get_policy() op will return NULL. This causes get_vma_policy() fall back to task or system default policy, as appropriate. Always has. If you apply mempolicy, using mbind() or the libnuma wrapper equivalent, to a subset of the shmem segment and then query the policy on any other part of the segment not affected by the mempolicy, again the get_policy() op will return NULL and again we'll fall back to task or system default policy, as appropriate. 2) A SysV shm segment with the SHM_HUGETLB option will use the hugetlbfs vm_ops indirectly via the shm vm_ops. However, the hugetlbfs vm_ops do not support shared policy in the same sense as shmem segments. This is primarily because the hugetlbfs vm_ops do not specify any {set| get}_policy() operations. [I have a patch to address this in my "shared policy" series that we've discussed in the past. I hope to get back to that series, eventually.] When the file system backing the shm segment [hugetlbfs file, in this case] does not support the get_policy() op, shm_get_policy() will just return the VMA policy for the specified (vma,address). Again, this can be NULL. Before the patch under discussion, shm_get_policy() would fall back to the task policy if the vma policy was NULL. My patch changes this to just return NULL and let get_vma_policy() do the fallback. This makes it consistent with the shmem {get|set}_policy ops. Note that I also made shm_get_policy() take a reference on any non-NULL vma policy--again to be consistent with the shmem get_policy behavior. And, in case you're wondering, shmem get_policy() MUST take the extra reference therein while holding the spin lock on the shared policy red-black tree because another task mapping the shmem segment could change the policy at any time. So, my "model" is: the get_policy() op must return a non-NULL policy with elevated reference count or NULL so that get_vma_policy() can depend on consistent behavior; and a NULL return from the get_policy() op means "fall back to surrounding context" just as for vma policy. I think this is "consistent" behavior, for some definition thereof. > > I still have no idea what your warrant is for being sure that the object > continues to exist before increasing the policy refcount in > get_vma_policy()? What pins the shared policy before we get the refcount? For shmem shared policy, the rb-tree spin lock protects the policy while we take the reference. To be consistent with this, I require that the shm get_policy op does the same when falling back to vma policy for shm file systems that don't support get_policy() ops--only hugetlbfs at this time. We don't need to add a ref to the current task's policy, because it can't change while we're using it--as long as we don't try to cache it across system calls. We don't need to add a ref to system default policy because we never free it. I thought we had to ref count other task's policies, including their vma policies, for show_numa_map() because they could change at any time. In your other mail, you showed how this isn't necessary because the task calling show_numa_maps() is holding the target task's mmap_sem for read. By patching do_set_mempolicy() to take the mmap_sem [for write!] we can close this window w/o an extra reference. However, there may be an issue with this. Read further. The current task's vma policies, although subject to change by other threads/tasks sharing the mm_struct, are protected by the mmap_sem() while we take the reference, as you've pointed out in other mail. Why take the extra ref? Back in June/July, we [you, Andi, myself] thought that this was required for allocating under bind policy with the custom zonelist because the allocation could sleep. Now, if we hold the mmap_sem over the allocation, we can probably dispense with the extra reference on [non-shared] vma policies as well. However, we still need to unref shared policies which one could consider a subclass of vma policies. With these recent patches and the prior mempolicy ref count patches, we could assume that all policies except the system default and the current task's mempolicy needed unref upon return from get_vma_policy(). If we don't take an extra ref on other task's mempolicy and non-shared vma policy, then we need to be able to differentiate truly shared policies when we're done with them so that we can unref them. How about a funky flag in the higher order policy bits, like the MPOL_CONTEXT flag in my cpuset-independent interleave patch, to indicate shmem-style shared policy. If the reasoning about mmap_sem above is correct, and we only need to hold refs on shmem shared policy, we can dispense with all of this extra reference counting and only unref the shared policies. Thoughts? > Some more concerns below: > > > Index: Linux/mm/mempolicy.c > > =================================================================== > > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400 > > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400 > > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy > > struct vm_area_struct *vma, unsigned long addr) > > { > > struct mempolicy *pol = task->mempolicy; > > - int shared_pol = 0; > > + int pol_needs_ref = (task != current); > > If get_vma_policy is called from the numa_maps handler then we have taken > a refcount on the task struct. > > So this should be > int pol_needs_ref = 0; If we can pare down the extra refs to just shmem shared policy, I agree. Otherwise, I'd like to keep the behavior such that only the current task's policy and system default policy don't get extra refs. This will keep the checks fairly simple on unref at the expense of an unneeded extra ref/unref for other task's policy--i.e., numa_maps which shouldn't be a path that we need to optimize, I think. > > > > > if (vma) { > > if (vma->vm_ops && vma->vm_ops->get_policy) { > > - pol = vma->vm_ops->get_policy(vma, addr); > > - shared_pol = 1; /* if pol non-NULL, add ref below */ > > + struct mempolicy *vpol = vma->vm_ops->get_policy(vma, > > + addr); > > + if (vpol) { > > + pol = vpol; > > + pol_needs_ref = 0; /* get_policy() added ref */ > > + } > > } else if (vma->vm_policy && > > - vma->vm_policy->policy != MPOL_DEFAULT) > > + vma->vm_policy->policy != MPOL_DEFAULT) { > > pol = vma->vm_policy; > > + pol_needs_ref++; > > Why do we need a ref here for a vma policy? The policy is pinned through > the ref to the task structure. You mean the mmap_sem, right? If this is the case, we apparently missed this point when we discussed it back in June/July. This was the one of the main points of that discussion. > > > + } > > } > > if (!pol) > > pol = &default_policy; > > - else if (!shared_pol && pol != current->mempolicy) > > + else if (pol_needs_ref) > > mpol_get(pol); /* vma or other task's policy */ > > return pol; > > The mpol_get() here looks wrong. get_vma_policy determines the > current policy. The policy must already be pinned by increasing the > refcount or use in a certain task before get_vma_policy is ever called. I see what you're saying. A quick look through the cscope indicates that the mmap_sem is held over faults, even when allocation sleeps. If this is true, again, we only need extra refs on shmem style shared policies. I'll rework the patches on that assumption, including backing out some of the recent extra reference counting, to see what they look like. I'll also document the mempolicy stability assumptions on which the reworked code is based. Later, Lee > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy 2007-10-23 17:32 ` Lee Schermerhorn @ 2007-10-24 13:09 ` Christoph Lameter 0 siblings, 0 replies; 14+ messages in thread From: Christoph Lameter @ 2007-10-24 13:09 UTC (permalink / raw) To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel On Tue, 23 Oct 2007, Lee Schermerhorn wrote: > > I still think there must be a thinko here. The function seems to be > > currently coded with the assumption that get_policy always returns a > > policy. That policy may be the default policy?? > > My assumption is that the get_policy vm_op should either return a > [non-NULL] mempolicy corresponding to the specified address with the ref > count elevated for the caller, or NULL. Never the default policy. > Fallback will be handled by get_vma_policy(). Ok. > So, my "model" is: the get_policy() op must return a non-NULL policy > with elevated reference count or NULL so that get_vma_policy() can > depend on consistent behavior; and a NULL return from the get_policy() > op means "fall back to surrounding context" just as for vma policy. > > I think this is "consistent" behavior, for some definition thereof. I still have concerns about ting the refcount. The get_policy() method may take a refcount if it can ensure that the object is not vanishing from under us. But I would think that a refcount needs to be taken when the possibility is created for a certain vma to reference a policy via get_vma_policy and not when get_vma_policy itself runs. > > I still have no idea what your warrant is for being sure that the object > > continues to exist before increasing the policy refcount in > > get_vma_policy()? What pins the shared policy before we get the refcount? > > For shmem shared policy, the rb-tree spin lock protects the policy while > we take the reference. To be consistent with this, I require that the > shm get_policy op does the same when falling back to vma policy for shm > file systems that don't support get_policy() ops--only hugetlbfs at this > time. The rb tree lock is always taken when we run get_vma_policy()? You mean you can take the lock while the get_policy is run? This will make get_vma_policy even heavier? > The current task's vma policies, although subject to change by other > threads/tasks sharing the mm_struct, are protected by the mmap_sem() > while we take the reference, as you've pointed out in other mail. Why > take the extra ref? Back in June/July, we [you, Andi, myself] thought > that this was required for allocating under bind policy with the custom > zonelist because the allocation could sleep. Now, if we hold the > mmap_sem over the allocation, we can probably dispense with the extra > reference on [non-shared] vma policies as well. Right. > However, we still need to unref shared policies which one could consider > a subclass of vma policies. With these recent patches and the prior > mempolicy ref count patches, we could assume that all policies except > the system default and the current task's mempolicy needed unref upon > return from get_vma_policy(). If we don't take an extra ref on other > task's mempolicy and non-shared vma policy, then we need to be able to > differentiate truly shared policies when we're done with them so that we > can unref them. If you take the reference when a vma is established then you can avoid dropping the refcount on the hot paths? > How about a funky flag in the higher order policy bits, like the > MPOL_CONTEXT flag in my cpuset-independent interleave patch, to indicate > shmem-style shared policy. If the reasoning about mmap_sem above is > correct, and we only need to hold refs on shmem shared policy, we can > dispense with all of this extra reference counting and only unref the > shared policies. Maybe. Would need to be further fleshed out. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2007-10-24 13:09 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn 2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn 2007-10-12 17:29 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn 2007-10-12 17:30 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn 2007-10-12 17:33 ` Christoph Lameter 2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn 2007-10-12 17:57 ` Christoph Lameter 2007-10-15 19:34 ` Christoph Lameter 2007-10-23 16:15 ` Lee Schermerhorn 2007-10-23 16:23 ` Christoph Lameter 2007-10-23 17:32 ` Lee Schermerhorn 2007-10-24 13:09 ` Christoph Lameter
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).