* [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
[not found] <20070830185053.22619.96398.sendpatchset@localhost>
@ 2007-09-17 19:32 ` Lee Schermerhorn
2007-09-17 19:37 ` Christoph Lameter
2007-09-17 22:28 ` Andi Kleen
0 siblings, 2 replies; 7+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 19:32 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm, ak, clameter, eric.whitney, Mel Gorman
Here is the 23-rc6 version of the patch. Andi considers it a high
priority bug fix for .23. I'm a bit uncomfortable with this, this late
in the 23 cycle. I've not heard of problems w/o this patch, but then,
maybe no one notices if they leak a memory policy struct now and then,
or occasionally allocate memory on the wrong node because they used a
prematurely freed memory policy.
If a lot of other mem policy patches don't go in before .23 goes final,
this patch should apply easily to .23 if we were to defer it to one of
the subsequent stable releases.
Lee
--------------------------------
PATCH Memory Policy: fix reference counting
Against 2.6.23-rc6
This patch proposes fixes to the reference counting of memory policy
in the page allocation paths and in show_numa_map(). Extracted from
my "Memory Policy Cleanups and Enhancements" series as stand-alone.
Shared policy lookup [shmem] has always added a reference to the
policy, but this was never unrefed after page allocation or after
formatting the numa map data.
Default system policy should not require additional ref counting,
nor should the current task's task policy. However, show_numa_map()
calls get_vma_policy() to examine what may be [likely is] another
task's policy. The latter case needs protection against freeing
of the policy.
This patch adds a reference count to a mempolicy returned by
get_vma_policy() when the policy is a vma policy or another
task's mempolicy. Again, shared policy is already reference
counted on lookup. A matching "unref" [__mpol_free()] is performed
in alloc_page_vma() for shared and vma policies, and in
show_numa_map() for shared and another task's mempolicy.
We can call __mpol_free() directly, saving an admittedly
inexpensive inline NULL test, because we know we have a non-NULL
policy.
Handling policy ref counts for hugepages is a bit trickier.
huge_zonelist() returns a zone list that might come from a
shared or vma 'BIND policy. In this case, we should hold the
reference until after the huge page allocation in
dequeue_huge_page(). The patch modifies huge_zonelist() to
return a pointer to the mempolicy if it needs to be unref'd
after allocation.
Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
               w/o patch            w/ refcount patch
             Avg   Std Devn         Avg   Std Devn
Real:     100.59       0.38      100.63       0.43
User:    1209.60       0.37     1209.91       0.31
System:    81.52       0.42       81.64       0.34
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
 include/linux/mempolicy.h |    4 +-
 mm/hugetlb.c              |    4 +-
 mm/mempolicy.c            |   79 ++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 75 insertions(+), 12 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2007-09-12 11:17:19.000000000 -0400
+++ Linux/mm/mempolicy.c 2007-09-17 14:48:39.000000000 -0400
@@ -1077,21 +1077,37 @@ asmlinkage long compat_sys_mbind(compat_
#endif
-/* Return effective policy for a VMA */
+/*
+ * get_vma_policy(@task, @vma, @addr)
+ * @task - task for fallback if vma policy == default
+ * @vma - virtual memory area whose policy is sought
+ * @addr - address in @vma for shared policy lookup
+ *
+ * Returns effective policy for a VMA at specified address.
+ * Falls back to @task or system default policy, as necessary.
+ * Returned policy has extra reference count if shared, vma,
+ * or some other task's policy [show_numa_maps() can pass
+ * @task != current]. It is the caller's responsibility to
+ * free the reference in these cases.
+ */
static struct mempolicy * get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
+ int shared_pol = 0;
if (vma) {
- if (vma->vm_ops && vma->vm_ops->get_policy)
+ if (vma->vm_ops && vma->vm_ops->get_policy) {
pol = vma->vm_ops->get_policy(vma, addr);
- else if (vma->vm_policy &&
+ shared_pol = 1; /* if pol non-NULL, add ref below */
+ } else if (vma->vm_policy &&
vma->vm_policy->policy != MPOL_DEFAULT)
pol = vma->vm_policy;
}
if (!pol)
pol = &default_policy;
+ else if (!shared_pol && pol != current->mempolicy)
+ mpol_get(pol); /* vma or other task's policy */
return pol;
}
@@ -1207,19 +1223,45 @@ static inline unsigned interleave_nid(st
}
#ifdef CONFIG_HUGETLBFS
-/* Return a zonelist suitable for a huge page allocation. */
+/*
+ * huge_zonelist(@vma, @addr, @gfp_flags, @mpol)
+ * @vma = virtual memory area whose policy is sought
+ * @addr = address in @vma for shared policy lookup and interleave policy
+ * @gfp_flags = for requested zone
+ * @mpol = pointer to mempolicy pointer for reference counted 'BIND policy
+ *
+ * Returns a zonelist suitable for a huge page allocation.
+ * If the effective policy is 'BIND, returns pointer to policy's zonelist.
+ * If it is also a policy for which get_vma_policy() returns an extra
+ * reference, we must hold that reference until after allocation.
+ * In that case, return policy via @mpol so hugetlb allocation can drop
+ * the reference. For non-'BIND referenced policies, we can/do drop the
+ * reference here, so the caller doesn't need to know about the special case
+ * for default and current task policy.
+ */
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
- gfp_t gfp_flags)
+ gfp_t gfp_flags, struct mempolicy **mpol)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
+ *mpol = NULL; /* probably no unref needed */
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;
nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
+ __mpol_free(pol); /* finished with pol */
return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
}
- return zonelist_policy(GFP_HIGHUSER, pol);
+
+ zl = zonelist_policy(GFP_HIGHUSER, pol);
+ if (unlikely(pol != &default_policy && pol != current->mempolicy)) {
+ if (pol->policy != MPOL_BIND)
+ __mpol_free(pol); /* finished with pol */
+ else
+ *mpol = pol; /* unref needed after allocation */
+ }
+ return zl;
}
#endif
@@ -1264,6 +1306,7 @@ struct page *
alloc_page_vma(gfp_t gfp, struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
+ struct zonelist *zl;
cpuset_update_task_memory_state();
@@ -1273,7 +1316,19 @@ alloc_page_vma(gfp_t gfp, struct vm_area
nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ zl = zonelist_policy(gfp, pol);
+ if (pol != &default_policy && pol != current->mempolicy) {
+ /*
+ * slow path: ref counted policy -- shared or vma
+ */
+ struct page *page = __alloc_pages(gfp, 0, zl);
+ __mpol_free(pol);
+ return page;
+ }
+ /*
+ * fast path: default or task policy
+ */
+ return __alloc_pages(gfp, 0, zl);
}
/**
@@ -1872,6 +1927,7 @@ int show_numa_map(struct seq_file *m, vo
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mempolicy *pol;
int n;
char buffer[50];
@@ -1882,8 +1938,13 @@ int show_numa_map(struct seq_file *m, vo
if (!md)
return 0;
- mpol_to_str(buffer, sizeof(buffer),
- get_vma_policy(priv->task, vma, vma->vm_start));
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol);
+ /*
+ * unref shared or other task's mempolicy
+ */
+ if (pol != &default_policy && pol != current->mempolicy)
+ __mpol_free(pol);
seq_printf(m, "%08lx %s", vma->vm_start, buffer);
Index: Linux/include/linux/mempolicy.h
===================================================================
--- Linux.orig/include/linux/mempolicy.h 2007-09-12 11:17:19.000000000 -0400
+++ Linux/include/linux/mempolicy.h 2007-09-17 12:25:27.000000000 -0400
@@ -159,7 +159,7 @@ extern void mpol_fix_fork_child_flag(str
extern struct mempolicy default_policy;
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags);
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol);
extern unsigned slab_node(struct mempolicy *policy);
extern enum zone_type policy_zone;
@@ -256,7 +256,7 @@ static inline void mpol_fix_fork_child_f
#define set_cpuset_being_rebound(x) do {} while (0)
static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
- unsigned long addr, gfp_t gfp_flags)
+ unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
{
return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
}
Index: Linux/mm/hugetlb.c
===================================================================
--- Linux.orig/mm/hugetlb.c 2007-09-12 11:17:19.000000000 -0400
+++ Linux/mm/hugetlb.c 2007-09-17 12:24:24.000000000 -0400
@@ -71,8 +71,9 @@ static struct page *dequeue_huge_page(st
{
int nid;
struct page *page = NULL;
+ struct mempolicy *mpol;
struct zonelist *zonelist = huge_zonelist(vma, address,
- htlb_alloc_mask);
+ htlb_alloc_mask, &mpol);
struct zone **z;
for (z = zonelist->zones; *z; z++) {
@@ -87,6 +88,7 @@ static struct page *dequeue_huge_page(st
break;
}
}
+ mpol_free(mpol); /* unref if mpol !NULL */
return page;
}
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 19:32 ` [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
@ 2007-09-17 19:37 ` Christoph Lameter
2007-09-17 20:19 ` Lee Schermerhorn
2007-09-17 22:25 ` Andi Kleen
2007-09-17 22:28 ` Andi Kleen
1 sibling, 2 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-09-17 19:37 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-kernel, akpm, ak, eric.whitney, Mel Gorman
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> Here is the 23-rc6 version of the patch. Andi considers it a high
> priority bug fix for .23. I'm a bit uncomfortable with this, this late
> in the 23 cycle. I've not heard of problems w/o this patch, but then,
> maybe no one notices if they leak a memory policy struct now and then,
> or occasionally allocate memory on the wrong node because they used a
> prematurely freed memory policy.
The patch does require concurrent increments and decrements in the main
fault path. The potential is to create another bouncing cacheline for
concurrent faults. This looks like it would cause a performance issue.
> Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
>
>                w/o patch            w/ refcount patch
>              Avg   Std Devn         Avg   Std Devn
> Real:     100.59       0.38      100.63       0.43
> User:    1209.60       0.37     1209.91       0.31
> System:    81.52       0.42       81.64       0.34
Single threaded build? I would suggest trying to fault memory concurrently
from multiple processors. You may not see this on a kernel build even if
it is run with -j16, because concurrent faults are rare.
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 19:37 ` Christoph Lameter
@ 2007-09-17 20:19 ` Lee Schermerhorn
2007-09-17 21:23 ` Christoph Lameter
2007-09-17 22:25 ` Andi Kleen
1 sibling, 1 reply; 7+ messages in thread
From: Lee Schermerhorn @ 2007-09-17 20:19 UTC (permalink / raw)
To: Christoph Lameter; +Cc: linux-kernel, akpm, ak, eric.whitney, Mel Gorman
On Mon, 2007-09-17 at 12:37 -0700, Christoph Lameter wrote:
> On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
>
> > Here is the 23-rc6 version of the patch. Andi considers it a high
> > priority bug fix for .23. I'm a bit uncomfortable with this, this late
> > in the 23 cycle. I've not heard of problems w/o this patch, but then,
> > maybe no one notices if they leak a memory policy struct now and then,
> > or occasionally allocate memory on the wrong node because they used a
> > prematurely freed memory policy.
>
> The patch does require concurrent increments and decrements in the main
> fault path. The potential is to create another bouncing cacheline for
> concurrent faults. This looks like it would cause a performance issue.
Only for vma policy, right? show_numa_maps() isn't a performance path,
and shared policies are already reference counted--just not unref'd!
>
> > Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
> >
> >                w/o patch            w/ refcount patch
> >              Avg   Std Devn         Avg   Std Devn
> > Real:     100.59       0.38      100.63       0.43
> > User:    1209.60       0.37     1209.91       0.31
> > System:    81.52       0.42       81.64       0.34
>
> Single threaded build? I would suggest trying to fault memory concurrently
> from multiple processors. You may not see this on a kernel build even if
> it is run with -j16, because concurrent faults are rare.
Well, it was a 32-way parallel build [-j32] on a 16-cpu system--my usual
build method. But, I'm guessing that all of the build tools are single
threaded and all using default policy, so no reference counting is
needed.
I'm taking a look at your 'pft' program, and I'll try that.
I do have some ideas for enhancements to memtoy to test vma policies in
a multi-threaded task. I have the basic multi-threading infrastructure
that binds threads to cpus, allocates node local stacks, thread state
structs, ... in my mmtrace tool that I can probably hack for use in
memtoy to provoke cacheline bouncing of the mem policy. But, if pft
does the trick, I won't rush the memtoy enhancements...
Meanwhile, we do have a mem policy ref counting bug in the mainline.
Later,
Lee
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 20:19 ` Lee Schermerhorn
@ 2007-09-17 21:23 ` Christoph Lameter
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-09-17 21:23 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-kernel, akpm, ak, eric.whitney, Mel Gorman
On Mon, 17 Sep 2007, Lee Schermerhorn wrote:
> Only for vma policy, right? show_numa_maps() isn't a performance path,
> and shared policies are already reference counted--just not unref'd!
Right.
> I do have some ideas for enhancements to memtoy to test vma policies in
> a multi-threaded task. I have the basic multi-threading infrastructure
> that binds threads to cpus, allocates node local stacks, thread state
> structs, ... in my mmtrace tool that I can probably hack for use in
> memtoy to provoke cacheline bouncing of the mem policy. But, if pft
> does the trick, I won't rush the memtoy enhancments...
Well, pft is old and limited in what it can do. I'd be glad if you could
put it into memtoy. Then it may perhaps be useful in the future.
> Meanwhile, we do have a mem policy ref counting bug in the mainline.
But we have had this ref counting issue forever with no ill effect. Memory
policies were designed to have almost no overhead for the default
allocation paths. Incrementing and decrementing refcounters makes that
design no longer light weight as it was intended to be.
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 19:37 ` Christoph Lameter
2007-09-17 20:19 ` Lee Schermerhorn
@ 2007-09-17 22:25 ` Andi Kleen
2007-09-18 19:30 ` Christoph Lameter
1 sibling, 1 reply; 7+ messages in thread
From: Andi Kleen @ 2007-09-17 22:25 UTC (permalink / raw)
To: Christoph Lameter
Cc: Lee Schermerhorn, linux-kernel, akpm, eric.whitney, Mel Gorman
> The patch does require concurrent increments and decrements in the main
> fault path. The potential is to create another bouncing cacheline for
> concurrent faults. This looks like it would cause a performance issue.
While that may be true, correctness is always more important than performance.
So I think this is the right thing for .23. Any performance improvements,
if needed, can come later.
-Andi
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 19:32 ` [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting Lee Schermerhorn
2007-09-17 19:37 ` Christoph Lameter
@ 2007-09-17 22:28 ` Andi Kleen
1 sibling, 0 replies; 7+ messages in thread
From: Andi Kleen @ 2007-09-17 22:28 UTC (permalink / raw)
To: Lee Schermerhorn; +Cc: linux-kernel, akpm, clameter, eric.whitney, Mel Gorman
> Handling policy ref counts for hugepages is a bit trickier.
> huge_zonelist() returns a zone list that might come from a
> shared or vma 'BIND policy. In this case, we should hold the
> reference until after the huge page allocation in
> dequeue_huge_page(). The patch modifies huge_zonelist() to
> return a pointer to the mempolicy if it needs to be unref'd
> after allocation.
Acked-by: Andi Kleen <ak@suse.de>
Andrew, can you please queue that for .23?
-Andi
* Re: [PATCH] 2.6.23-rc6: Fix NUMA Memory Policy Reference Counting
2007-09-17 22:25 ` Andi Kleen
@ 2007-09-18 19:30 ` Christoph Lameter
0 siblings, 0 replies; 7+ messages in thread
From: Christoph Lameter @ 2007-09-18 19:30 UTC (permalink / raw)
To: Andi Kleen; +Cc: Lee Schermerhorn, linux-kernel, akpm, eric.whitney, Mel Gorman
On Tue, 18 Sep 2007, Andi Kleen wrote:
> > The patch does require concurrent increments and decrements in the main
> > fault path. The potential is to create another bouncing cacheline for
> > concurrent faults. This looks like it would cause a performance issue.
>
> While that may be true, correctness is always more important than performance.
> So I think this is the right thing for .23. Any performance improvements,
> if needed, can come later.
Is there any real problem as a result of the refcounting issues? It seems
that performance improvements are not possible unless one modifies how
memory policies work, using RCU or so.
OK, the effect is limited: get_vma_policy() only increments refcounts
for foreign or shared policies. So it seems that it is fine if one stays
away from shmem.