* [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes
@ 2007-10-12 15:48 Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
` (3 more replies)
0 siblings, 4 replies; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:48 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, eric.whitney, clameter, mel
PATCH 0/4 Memory Policy: More Reference Counting and Fallback Fixes
While testing huge pages using shm segments created with the SHM_HUGETLB flag,
came across additional problems with memory policy reference counting. Some
of the problems were introduced by myself in my previous mempolicy ref counting
patch, and some were just paths that I missed in my inspection and testing.
This resend separates the 2nd patch of the previous posting into 3 separate
fixes for 3 separate, but interrelated, issues. At Christoph Lameter's
request. Rebuilt and cursory testing on x86_64 numa platform only.
These 4 patches are based against 2.6.23-rc8-mm2, but should also probably
go to the .23-stable tree when it opens.
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver
2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn
@ 2007-10-12 15:49 ` Lee Schermerhorn
2007-10-12 17:29 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn
` (2 subsequent siblings)
3 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, mel, eric.whitney, clameter
PATCH 1/4 Fix memory policy usage in pci driver
Against: 2.6.23-rc8-mm2
In an attempt to ensure memory allocation from the local node,
the pci driver temporarily replaces the current task's memory
policy with the system default policy. Trying to be a good
citizen, the driver then call's mpol_get() on the new policy.
When it's finished probing, it undoes the '_get by calling
mpol_free() [on the system default policy] and then restores
the current task's saved mempolicy.
A couple of issues here:
1) it's never necessary to set a task's mempolicy to the
system default policy in order to get system default
allocation behavior. Simply set the current task's
mempolicy to NULL and allocations will fall back to
system default policy.
2) we should never [need to] call mpol_free() on the system
default policy. [I plan on trapping this with a VM_BUG_ON()
in a subsequent patch.]
This patch removes the calls to mpol_get() and mpol_free()
and uses NULL for the temporary task mempolicy to effect
default allocation behavior.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
drivers/pci/pci-driver.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
Index: Linux/drivers/pci/pci-driver.c
===================================================================
--- Linux.orig/drivers/pci/pci-driver.c 2007-10-09 14:31:57.000000000 -0400
+++ Linux/drivers/pci/pci-driver.c 2007-10-09 14:43:57.000000000 -0400
@@ -177,13 +177,11 @@ static int pci_call_probe(struct pci_dri
set_cpus_allowed(current, node_to_cpumask(node));
/* And set default memory allocation policy */
oldpol = current->mempolicy;
- current->mempolicy = &default_policy;
- mpol_get(current->mempolicy);
+ current->mempolicy = NULL; /* fall back to system default policy */
#endif
error = drv->probe(dev, id);
#ifdef CONFIG_NUMA
set_cpus_allowed(current, oldmask);
- mpol_free(current->mempolicy);
current->mempolicy = oldpol;
#endif
return error;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting
2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
@ 2007-10-12 15:49 ` Lee Schermerhorn
2007-10-12 17:30 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn
3 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, clameter, ak, eric.whitney, mel
PATCH 2/4 Mem Policy: fix reference counting for SHM_HUGETLB segments
Against: 2.6.23-rc8-mm2
Separated from previous multi-issue patch 2/2
NOTE: w/o this fix, one will BUG-out on 2nd page fault to a
SHM_HUGETLB segment with memory policy applied via mbind().
get_vma_policy() assumes that shared policies are referenced by
the get_policy() vm_op, if any. This is true for shmem_get_policy()
but not for shm_get_policy() when the "backing file" does not
support a get_policy() vm_op. The latter is the case for SHM_HUGETLB
segments. Because get_vma_policy() expects the get_policy() op to
have added a ref, it doesn't do so itself. This results in
premature freeing of the policy. Add the mpol_get() to the
shm_get_policy() op when the backing file doesn't support shared
policies.
Further, shm_get_policy() was falling back to current task's task
policy if the backing file did not support get_policy() vm_op and
the vma policy was null. This is not valid when get_vma_policy() is
called from show_numa_map() as task != current. Also, this did
not match the behavior of the shmem_get_policy() vm_op which did
NOT fall back to task policy. So, modify shm_get_policy() NOT to
fall back to current->mempolicy.
Document mempolicy return value reference semantics assumed by
the changes discussed above for the set_ and get_policy vm_ops
in <linux/mm.h>--where the prototypes are defined.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Index: Linux/ipc/shm.c
===================================================================
--- Linux.orig/ipc/shm.c 2007-10-10 14:58:12.000000000 -0400
+++ Linux/ipc/shm.c 2007-10-10 14:59:13.000000000 -0400
@@ -263,10 +263,10 @@ static struct mempolicy *shm_get_policy(
if (sfd->vm_ops->get_policy)
pol = sfd->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy)
+ else if (vma->vm_policy) {
pol = vma->vm_policy;
- else
- pol = current->mempolicy;
+ mpol_get(pol); /* get_vma_policy() assumes this */
+ }
return pol;
}
#endif
Index: Linux/include/linux/mm.h
===================================================================
--- Linux.orig/include/linux/mm.h 2007-10-10 14:58:12.000000000 -0400
+++ Linux/include/linux/mm.h 2007-10-11 14:07:43.000000000 -0400
@@ -173,7 +173,21 @@ struct vm_operations_struct {
* writable, if an error is returned it will cause a SIGBUS */
int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
#ifdef CONFIG_NUMA
+ /*
+ * set_policy() op must add a reference to any non-NULL @new mempolicy
+ * to hold the policy upon return. Caller should pass NULL @new to
+ * remove a policy and fall back to surrounding context--i.e. do not
+ * install a MPOL_DEFAULT policy, nor the task or system default
+ * mempolicy.
+ */
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+
+ /*
+ * get_policy() op must add reference [mpol_get()] to any mempolicy
+ * at (vma,addr). If no [shared/vma] mempolicy exists at that addr,
+ * get_policy() op must return NULL--i.e., do not "fallback" to task
+ * or system default policy.
+ */
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr);
int (*migrate)(struct vm_area_struct *vma, const nodemask_t *from,
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 3/4] Mem Policy: Fixup Interleave Policy Reference Counting
2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn
@ 2007-10-12 15:49 ` Lee Schermerhorn
2007-10-12 17:33 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn
3 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, mel, clameter, eric.whitney
PATCH 3/4 Mempolicy: Fixup Interleave Policy Reference Counting
Against: 2.6.23-rc8-mm2
Separated from multi-issue patch 2/2
In the memory policy reference counting cleanup patch, I missed
one path that needs to unreference the memory policy. After
computing the target node for interleave policy, we need to
drop the reference if the policy is not the system default nor
the current task's policy.
In huge_zonelist(), I was unconditionally unref'ing the policy
in the interleave path, even when it was a policy that didn't
need it. Fix this!
Note: I investigated moving the check for "policy_needs_unref"
to the mpol_free() wrapper, but this led to nasty circular header
dependencies. If we wanted to make mpol_free() an external
function, rather than a static inline, I could do this and
remove several checks. I'd still need to keep an explicit
check in alloc_page_vma() if we want to use a tail-call for
the fast path.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-10-12 10:48:03.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
@@ -1262,18 +1262,21 @@ struct zonelist *huge_zonelist(struct vm
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
struct zonelist *zl;
+ int policy_needs_unref = (pol != &default_policy && \
+ pol != current->mempolicy);
*mpol = NULL; /* probably no unref needed */
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- __mpol_free(pol); /* finished with pol */
+ if (unlikely(policy_needs_unref))
+ __mpol_free(pol); /* finished with pol */
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
zl = zonelist_policy(GFP_HIGHUSER, pol);
- if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
+ if (unlikely(policy_needs_unref)) {
if (pol->policy != MPOL_BIND)
__mpol_free(pol); /* finished with pol */
else
@@ -1325,6 +1328,9 @@ alloc_page_vma(gfp_t gfp, struct vm_area
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
struct zonelist *zl;
+ int policy_needs_unref = (pol != &default_policy && \
+ pol != current->mempolicy);
+
cpuset_update_task_memory_state();
@@ -1332,10 +1338,12 @@ alloc_page_vma(gfp_t gfp, struct vm_area
unsigned nid;
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
+ if (unlikely(policy_needs_unref))
+ __mpol_free(pol);
return alloc_page_interleave(gfp, 0, nid);
}
zl = zonelist_policy(gfp, pol);
- if (pol != &default_policy && pol != current->mempolicy) {
+ if (unlikely(policy_needs_unref)) {
/*
* slow path: ref counted policy -- shared or vma
*/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn
` (2 preceding siblings ...)
2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn
@ 2007-10-12 15:49 ` Lee Schermerhorn
2007-10-12 17:57 ` Christoph Lameter
3 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-12 15:49 UTC (permalink / raw)
To: akpm; +Cc: linux-mm, ak, eric.whitney, clameter, mel
PATCH 4/4 Mempolicy: Fixup Fallback for Default Shmem Policy
Against: 2.6.23-rc8-mm2
Separated from previous multi-issue patch 2/2
get_vma_policy() was not handling fallback to task policy correctly
when the get_policy() vm_op returns NULL. The NULL overwrites
the 'pol' variable that was holding the fallback task mempolicy.
So, it was falling back directly to system default policy.
Fix get_vma_policy() to use only non-NULL policy returned from
the vma get_policy op and indicate that this policy does not need
another ref count.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
mm/mempolicy.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400
@@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
- int shared_pol = 0;
+ int pol_needs_ref = (task != current);
if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
- pol = vma->vm_ops->get_policy(vma, addr);
- shared_pol = 1; /* if pol non-NULL, add ref below */
+ struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
+ addr);
+ if (vpol) {
+ pol = vpol;
+ pol_needs_ref = 0; /* get_policy() added ref */
+ }
} else if (vma->vm_policy &&
- vma->vm_policy->policy != MPOL_DEFAULT)
+ vma->vm_policy->policy != MPOL_DEFAULT) {
pol = vma->vm_policy;
+ pol_needs_ref++;
+ }
}
if (!pol)
pol = &default_policy;
- else if (!shared_pol && pol != current->mempolicy)
+ else if (pol_needs_ref)
mpol_get(pol); /* vma or other task's policy */
return pol;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
@ 2007-10-12 17:29 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-12 17:29 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, mel, eric.whitney
I already acked this one twice.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting
2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn
@ 2007-10-12 17:30 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-12 17:30 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 3/4] Mem Policy: Fixup Interleave Policy Reference Counting
2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn
@ 2007-10-12 17:33 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-12 17:33 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, mel, eric.whitney
On Fri, 12 Oct 2007, Lee Schermerhorn wrote:
> Note: I investigated moving the check for "policy_needs_unref"
> to the mpol_free() wrapper, but this led to nasty circular header
> dependencies. If we wanted to make mpol_free() an external
> function, rather than a static inline, I could do this and
> remove several checks. I'd still need to keep an explicit
> check in alloc_page_vma() if we want to use a tail-call for
> the fast path.
At a mininum we need to somehow encapsulate these checks that may now have
to be done in multiple place. This is going to be ugly because it adds
a lot of special casing to policy handling.
Is there some way to put smarts into mpol_get to deal with this?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn
@ 2007-10-12 17:57 ` Christoph Lameter
2007-10-15 19:34 ` Christoph Lameter
2007-10-23 17:32 ` Lee Schermerhorn
0 siblings, 2 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-12 17:57 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Fri, 12 Oct 2007, Lee Schermerhorn wrote:
> get_vma_policy() was not handling fallback to task policy correctly
> when the get_policy() vm_op returns NULL. The NULL overwrites
> the 'pol' variable that was holding the fallback task mempolicy.
> So, it was falling back directly to system default policy.
>
> Fix get_vma_policy() to use only non-NULL policy returned from
> the vma get_policy op and indicate that this policy does not need
> another ref count.
I still think there must be a thinko here. The function seems to be
currently coded with the assumption that get_policy always returns a
policy. That policy may be the default policy??
If it returns NULL then the tasks policy is applied to shmem segment. I
though we wanted a consistent application of policies to shmem segments?
Now one task or another may determine placement.
I still have no idea what your warrant is for being sure that the object
continues to exist before increasing the policy refcount in
get_vma_policy()? What pins the shared policy before we get the refcount?
Some more concerns below:
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
> +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400
> @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy
> struct vm_area_struct *vma, unsigned long addr)
> {
> struct mempolicy *pol = task->mempolicy;
> - int shared_pol = 0;
> + int pol_needs_ref = (task != current);
If get_vma_policy is called from the numa_maps handler then we have taken
a refcount on the task struct.
So this should be
int pol_needs_ref = 0;
>
> if (vma) {
> if (vma->vm_ops && vma->vm_ops->get_policy) {
> - pol = vma->vm_ops->get_policy(vma, addr);
> - shared_pol = 1; /* if pol non-NULL, add ref below */
> + struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
> + addr);
> + if (vpol) {
> + pol = vpol;
> + pol_needs_ref = 0; /* get_policy() added ref */
> + }
> } else if (vma->vm_policy &&
> - vma->vm_policy->policy != MPOL_DEFAULT)
> + vma->vm_policy->policy != MPOL_DEFAULT) {
> pol = vma->vm_policy;
> + pol_needs_ref++;
Why do we need a ref here for a vma policy? The policy is pinned through
the ref to the task structure.
> + }
> }
> if (!pol)
> pol = &default_policy;
> - else if (!shared_pol && pol != current->mempolicy)
> + else if (pol_needs_ref)
> mpol_get(pol); /* vma or other task's policy */
> return pol;
The mpol_get() here looks wrong. get_vma_policy determines the
current policy. The policy must already be pinned by increasing the
refcount or use in a certain task before get_vma_policy is ever called.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-12 17:57 ` Christoph Lameter
@ 2007-10-15 19:34 ` Christoph Lameter
2007-10-23 16:15 ` Lee Schermerhorn
2007-10-23 17:32 ` Lee Schermerhorn
1 sibling, 1 reply; 14+ messages in thread
From: Christoph Lameter @ 2007-10-15 19:34 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Fri, 12 Oct 2007, Christoph Lameter wrote:
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400
> > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > struct mempolicy *pol = task->mempolicy;
> > - int shared_pol = 0;
> > + int pol_needs_ref = (task != current);
>
> If get_vma_policy is called from the numa_maps handler then we have taken
> a refcount on the task struct.
>
> So this should be
> int pol_needs_ref = 0;
Argh. Refcount is not it. We have taken the mmap_sem lock
because we are scanning though the pages. This avoids issues for the vma
policies that can only be set when a writelock was taken on mmap_sem.
However, mmap_sem is not taken when setting task->mempolicy. Taking
mmap_sem there would solve the issue (we have discussed this before).
You cannot reliably take a refcount on a foreign task structs mempolicy
since the task may just be in the process of switching policies. You could
increment the refcount and then the other task frees the structure.
I think we need something like this:
Index: linux-2.6/mm/mempolicy.c
===================================================================
--- linux-2.6.orig/mm/mempolicy.c 2007-10-15 12:32:45.000000000 -0700
+++ linux-2.6/mm/mempolicy.c 2007-10-15 12:33:56.000000000 -0700
@@ -468,11 +468,13 @@ long do_set_mempolicy(int mode, nodemask
new = mpol_new(mode, nodes);
if (IS_ERR(new))
return PTR_ERR(new);
+ down_read(¤t->mm->mmap_sem);
mpol_free(current->mempolicy);
current->mempolicy = new;
mpol_set_task_struct_flag();
if (new && new->policy == MPOL_INTERLEAVE)
current->il_next = first_node(new->v.nodes);
+ up_read(¤t->mm->mmap_sem);
return 0;
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-15 19:34 ` Christoph Lameter
@ 2007-10-23 16:15 ` Lee Schermerhorn
2007-10-23 16:23 ` Christoph Lameter
0 siblings, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-23 16:15 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Mon, 2007-10-15 at 12:34 -0700, Christoph Lameter wrote:
> On Fri, 12 Oct 2007, Christoph Lameter wrote:
>
> > > Index: Linux/mm/mempolicy.c
> > > ===================================================================
> > > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
> > > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400
> > > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy
> > > struct vm_area_struct *vma, unsigned long addr)
> > > {
> > > struct mempolicy *pol = task->mempolicy;
> > > - int shared_pol = 0;
> > > + int pol_needs_ref = (task != current);
> >
> > If get_vma_policy is called from the numa_maps handler then we have taken
> > a refcount on the task struct.
> >
> > So this should be
> > int pol_needs_ref = 0;
>
> Argh. Refcount is not it. We have taken the mmap_sem lock
> because we are scanning though the pages. This avoids issues for the vma
> policies that can only be set when a writelock was taken on mmap_sem.
>
> However, mmap_sem is not taken when setting task->mempolicy. Taking
> mmap_sem there would solve the issue (we have discussed this before).
>
> You cannot reliably take a refcount on a foreign task structs mempolicy
> since the task may just be in the process of switching policies. You could
> increment the refcount and then the other task frees the structure.
>
> I think we need something like this:
>
> Index: linux-2.6/mm/mempolicy.c
> ===================================================================
> --- linux-2.6.orig/mm/mempolicy.c 2007-10-15 12:32:45.000000000 -0700
> +++ linux-2.6/mm/mempolicy.c 2007-10-15 12:33:56.000000000 -0700
> @@ -468,11 +468,13 @@ long do_set_mempolicy(int mode, nodemask
> new = mpol_new(mode, nodes);
> if (IS_ERR(new))
> return PTR_ERR(new);
> + down_read(¤t->mm->mmap_sem);
> mpol_free(current->mempolicy);
> current->mempolicy = new;
> mpol_set_task_struct_flag();
> if (new && new->policy == MPOL_INTERLEAVE)
> current->il_next = first_node(new->v.nodes);
> + up_read(¤t->mm->mmap_sem);
> return 0;
> }
>
Christoph: just getting back to this. You sent two messages commented
about this patch. I'm not sure whether this one supercedes the previous
one or adds to it. So, I'll address the points in your other comment
separately.
Re: this patch: I can see how we need to grab the mmap_sem during
do_set_mempolicy() to coordinate with the numa_maps display. However,
shouldn't we use {down|up}_write() here to obtain exclusive access with
respect to numa_maps ?
Lee
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-23 16:15 ` Lee Schermerhorn
@ 2007-10-23 16:23 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-23 16:23 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Tue, 23 Oct 2007, Lee Schermerhorn wrote:
> Christoph: just getting back to this. You sent two messages commented
> about this patch. I'm not sure whether this one supercedes the previous
> one or adds to it. So, I'll address the points in your other comment
> separately.
It supersedes one earlier comment.
> Re: this patch: I can see how we need to grab the mmap_sem during
> do_set_mempolicy() to coordinate with the numa_maps display. However,
> shouldn't we use {down|up}_write() here to obtain exclusive access with
> respect to numa_maps ?
Right. We must obtain a writelock.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-12 17:57 ` Christoph Lameter
2007-10-15 19:34 ` Christoph Lameter
@ 2007-10-23 17:32 ` Lee Schermerhorn
2007-10-24 13:09 ` Christoph Lameter
1 sibling, 1 reply; 14+ messages in thread
From: Lee Schermerhorn @ 2007-10-23 17:32 UTC (permalink / raw)
To: Christoph Lameter; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Fri, 2007-10-12 at 10:57 -0700, Christoph Lameter wrote:
> On Fri, 12 Oct 2007, Lee Schermerhorn wrote:
>
> > get_vma_policy() was not handling fallback to task policy correctly
> > when the get_policy() vm_op returns NULL. The NULL overwrites
> > the 'pol' variable that was holding the fallback task mempolicy.
> > So, it was falling back directly to system default policy.
> >
> > Fix get_vma_policy() to use only non-NULL policy returned from
> > the vma get_policy op and indicate that this policy does not need
> > another ref count.
>
> I still think there must be a thinko here. The function seems to be
> currently coded with the assumption that get_policy always returns a
> policy. That policy may be the default policy??
My assumption is that the get_policy vm_op should either return a
[non-NULL] mempolicy corresponding to the specified address with the ref
count elevated for the caller, or NULL. Never the default policy.
Fallback will be handled by get_vma_policy().
I was thinking that we HAD to do the fallback in get_vma_policy() to get
the reference counting correct for show_numa_maps(). But, based on your
other mail, I agree that we don't need to reference count another task's
task mempolicy, if we take the mmap_sem for write in do_set_mempolicy().
However, I still think that fallback is best handled in one place--for
consistency :-).
>
> If it returns NULL then the tasks policy is applied to shmem segment. I
> though we wanted a consistent application of policies to shmem segments?
> Now one task or another may determine placement.
OK. To address this, one must consider "when can the get_policy() vm
op" return a NULL? To answer the question, first note that there are
[currently] a couple of configurations to consider:
1) direct shared mapping [MAP_SHARED] of tmpfs backed storage, or SysV
shm segments without the SHM_HUGETLB flag. Either of these will get you
the shmem vm_ops--indirectly via the shm vm_ops in the case of a SysV
shm segment. The shmem {set|get}_policy ops support different policies
on different ranges of the segment {actually on ranges of the tmpfs file
backing the segment} on a page granularity. Now, before any of my
changes, if you never install a mempolicy on any range of a shmem
segment, the get_policy() op will return NULL. This causes
get_vma_policy() fall back to task or system default policy, as
appropriate. Always has.
If you apply mempolicy, using mbind() or the libnuma wrapper equivalent,
to a subset of the shmem segment and then query the policy on any other
part of the segment not affected by the mempolicy, again the
get_policy() op will return NULL and again we'll fall back to task or
system default policy, as appropriate.
2) A SysV shm segment with the SHM_HUGETLB option will use the hugetlbfs
vm_ops indirectly via the shm vm_ops. However, the hugetlbfs vm_ops do
not support shared policy in the same sense as shmem segments. This is
primarily because the hugetlbfs vm_ops do not specify any {set|
get}_policy() operations. [I have a patch to address this in my "shared
policy" series that we've discussed in the past. I hope to get back to
that series, eventually.]
When the file system backing the shm segment [hugetlbfs file, in this
case] does not support the get_policy() op, shm_get_policy() will just
return the VMA policy for the specified (vma,address). Again, this can
be NULL. Before the patch under discussion, shm_get_policy() would fall
back to the task policy if the vma policy was NULL. My patch changes
this to just return NULL and let get_vma_policy() do the fallback. This
makes it consistent with the shmem {get|set}_policy ops.
Note that I also made shm_get_policy() take a reference on any non-NULL
vma policy--again to be consistent with the shmem get_policy behavior.
And, in case you're wondering, shmem get_policy() MUST take the extra
reference therein while holding the spin lock on the shared policy
red-black tree because another task mapping the shmem segment could
change the policy at any time.
So, my "model" is: the get_policy() op must return a non-NULL policy
with elevated reference count or NULL so that get_vma_policy() can
depend on consistent behavior; and a NULL return from the get_policy()
op means "fall back to surrounding context" just as for vma policy.
I think this is "consistent" behavior, for some definition thereof.
>
> I still have no idea what your warrant is for being sure that the object
> continues to exist before increasing the policy refcount in
> get_vma_policy()? What pins the shared policy before we get the refcount?
For shmem shared policy, the rb-tree spin lock protects the policy while
we take the reference. To be consistent with this, I require that the
shm get_policy op does the same when falling back to vma policy for shm
file systems that don't support get_policy() ops--only hugetlbfs at this
time.
We don't need to add a ref to the current task's policy, because it
can't change while we're using it--as long as we don't try to cache it
across system calls. We don't need to add a ref to system default
policy because we never free it.
I thought we had to ref count other task's policies, including their vma
policies, for show_numa_map() because they could change at any time. In
your other mail, you showed how this isn't necessary because the task
calling show_numa_maps() is holding the target task's mmap_sem for read.
By patching do_set_mempolicy() to take the mmap_sem [for write!] we can
close this window w/o an extra reference. However, there may be an
issue with this. Read further.
The current task's vma policies, although subject to change by other
threads/tasks sharing the mm_struct, are protected by the mmap_sem()
while we take the reference, as you've pointed out in other mail. Why
take the extra ref? Back in June/July, we [you, Andi, myself] thought
that this was required for allocating under bind policy with the custom
zonelist because the allocation could sleep. Now, if we hold the
mmap_sem over the allocation, we can probably dispense with the extra
reference on [non-shared] vma policies as well.
However, we still need to unref shared policies which one could consider
a subclass of vma policies. With these recent patches and the prior
mempolicy ref count patches, we could assume that all policies except
the system default and the current task's mempolicy needed unref upon
return from get_vma_policy(). If we don't take an extra ref on other
task's mempolicy and non-shared vma policy, then we need to be able to
differentiate truly shared policies when we're done with them so that we
can unref them.
How about a funky flag in the higher order policy bits, like the
MPOL_CONTEXT flag in my cpuset-independent interleave patch, to indicate
shmem-style shared policy. If the reasoning about mmap_sem above is
correct, and we only need to hold refs on shmem shared policy, we can
dispense with all of this extra reference counting and only unref the
shared policies.
Thoughts?
> Some more concerns below:
>
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2007-10-12 10:50:05.000000000 -0400
> > +++ Linux/mm/mempolicy.c 2007-10-12 10:52:46.000000000 -0400
> > @@ -1112,19 +1112,25 @@ static struct mempolicy * get_vma_policy
> > struct vm_area_struct *vma, unsigned long addr)
> > {
> > struct mempolicy *pol = task->mempolicy;
> > - int shared_pol = 0;
> > + int pol_needs_ref = (task != current);
>
> If get_vma_policy is called from the numa_maps handler then we have taken
> a refcount on the task struct.
>
> So this should be
> int pol_needs_ref = 0;
If we can pare down the extra refs to just shmem shared policy, I agree.
Otherwise, I'd like to keep the behavior such that only the current
task's policy and system default policy don't get extra refs. This will
keep the checks fairly simple on unref at the expense of an unneeded
extra ref/unref for other task's policy--i.e., numa_maps which shouldn't
be a path that we need to optimize, I think.
>
> >
> > if (vma) {
> > if (vma->vm_ops && vma->vm_ops->get_policy) {
> > - pol = vma->vm_ops->get_policy(vma, addr);
> > - shared_pol = 1; /* if pol non-NULL, add ref below */
> > + struct mempolicy *vpol = vma->vm_ops->get_policy(vma,
> > + addr);
> > + if (vpol) {
> > + pol = vpol;
> > + pol_needs_ref = 0; /* get_policy() added ref */
> > + }
> > } else if (vma->vm_policy &&
> > - vma->vm_policy->policy != MPOL_DEFAULT)
> > + vma->vm_policy->policy != MPOL_DEFAULT) {
> > pol = vma->vm_policy;
> > + pol_needs_ref++;
>
> Why do we need a ref here for a vma policy? The policy is pinned through
> the ref to the task structure.
You mean the mmap_sem, right? If this is the case, we apparently missed
this point when we discussed it back in June/July. This was the one of
the main points of that discussion.
>
> > + }
> > }
> > if (!pol)
> > pol = &default_policy;
> > - else if (!shared_pol && pol != current->mempolicy)
> > + else if (pol_needs_ref)
> > mpol_get(pol); /* vma or other task's policy */
> > return pol;
>
> The mpol_get() here looks wrong. get_vma_policy determines the
> current policy. The policy must already be pinned by increasing the
> refcount or use in a certain task before get_vma_policy is ever called.
I see what you're saying. A quick look through the cscope indicates
that the mmap_sem is held over faults, even when allocation sleeps. If
this is true, again, we only need extra refs on shmem style shared
policies. I'll rework the patches on that assumption, including backing
out some of the recent extra reference counting, to see what they look
like. I'll also document the mempolicy stability assumptions on which
the reworked code is based.
Later,
Lee
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy
2007-10-23 17:32 ` Lee Schermerhorn
@ 2007-10-24 13:09 ` Christoph Lameter
0 siblings, 0 replies; 14+ messages in thread
From: Christoph Lameter @ 2007-10-24 13:09 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: akpm, linux-mm, ak, eric.whitney, mel
On Tue, 23 Oct 2007, Lee Schermerhorn wrote:
> > I still think there must be a thinko here. The function seems to be
> > currently coded with the assumption that get_policy always returns a
> > policy. That policy may be the default policy??
>
> My assumption is that the get_policy vm_op should either return a
> [non-NULL] mempolicy corresponding to the specified address with the ref
> count elevated for the caller, or NULL. Never the default policy.
> Fallback will be handled by get_vma_policy().
Ok.
> So, my "model" is: the get_policy() op must return a non-NULL policy
> with elevated reference count or NULL so that get_vma_policy() can
> depend on consistent behavior; and a NULL return from the get_policy()
> op means "fall back to surrounding context" just as for vma policy.
>
> I think this is "consistent" behavior, for some definition thereof.
I still have concerns about ting the refcount. The get_policy() method may
take a refcount if it can ensure that the object is not vanishing from
under us. But I would think that a refcount needs to be taken when the
possibility is created for a certain vma to reference a policy via
get_vma_policy and not when get_vma_policy itself runs.
> > I still have no idea what your warrant is for being sure that the object
> > continues to exist before increasing the policy refcount in
> > get_vma_policy()? What pins the shared policy before we get the refcount?
>
> For shmem shared policy, the rb-tree spin lock protects the policy while
> we take the reference. To be consistent with this, I require that the
> shm get_policy op does the same when falling back to vma policy for shm
> file systems that don't support get_policy() ops--only hugetlbfs at this
> time.
The rb tree lock is always taken when we run get_vma_policy()? You mean
you can take the lock while the get_policy is run? This will make
get_vma_policy even heavier?
> The current task's vma policies, although subject to change by other
> threads/tasks sharing the mm_struct, are protected by the mmap_sem()
> while we take the reference, as you've pointed out in other mail. Why
> take the extra ref? Back in June/July, we [you, Andi, myself] thought
> that this was required for allocating under bind policy with the custom
> zonelist because the allocation could sleep. Now, if we hold the
> mmap_sem over the allocation, we can probably dispense with the extra
> reference on [non-shared] vma policies as well.
Right.
> However, we still need to unref shared policies which one could consider
> a subclass of vma policies. With these recent patches and the prior
> mempolicy ref count patches, we could assume that all policies except
> the system default and the current task's mempolicy needed unref upon
> return from get_vma_policy(). If we don't take an extra ref on other
> task's mempolicy and non-shared vma policy, then we need to be able to
> differentiate truly shared policies when we're done with them so that we
> can unref them.
If you take the reference when a vma is established then you can avoid
dropping the refcount on the hot paths?
> How about a funky flag in the higher order policy bits, like the
> MPOL_CONTEXT flag in my cpuset-independent interleave patch, to indicate
> shmem-style shared policy. If the reasoning about mmap_sem above is
> correct, and we only need to hold refs on shmem shared policy, we can
> dispense with all of this extra reference counting and only unref the
> shared policies.
Maybe. Would need to be further fleshed out.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2007-10-24 13:09 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-12 15:48 [PATCH/RFC 0/4] More Mempolicy Reference Counting Fixes Lee Schermerhorn
2007-10-12 15:49 ` [PATCH/RFC 1/4] Mem Policy: fix mempolicy usage in pci driver Lee Schermerhorn
2007-10-12 17:29 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 2/4] Mem Policy: Fixup Shm and Interleave Policy Reference Counting Lee Schermerhorn
2007-10-12 17:30 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 3/4] Mem Policy: Fixup " Lee Schermerhorn
2007-10-12 17:33 ` Christoph Lameter
2007-10-12 15:49 ` [PATCH/RFC 4/4] Mem Policy: Fixup Fallback for Default Shmem Policy Lee Schermerhorn
2007-10-12 17:57 ` Christoph Lameter
2007-10-15 19:34 ` Christoph Lameter
2007-10-23 16:15 ` Lee Schermerhorn
2007-10-23 16:23 ` Christoph Lameter
2007-10-23 17:32 ` Lee Schermerhorn
2007-10-24 13:09 ` Christoph Lameter
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).