* [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
@ 2006-03-18 20:40 Paul Jackson
2006-03-20 7:07 ` Andrew Morton
2006-03-22 16:33 ` Andi Kleen
0 siblings, 2 replies; 6+ messages in thread
From: Paul Jackson @ 2006-03-18 20:40 UTC (permalink / raw)
To: Andrew Morton
Cc: Simon.Derr, linux-kernel, Nick Piggin, Andi Kleen, Paul Jackson,
Dave Hansen, Ingo Molnar, Christoph Lameter
>From Christoph Lameter <clameter@sgi.com>
Make alloc_pages_node() ignore cpusets.
Currently alloc_pages_node() obeys cpusets. If you ask for a page
on a node outside the current tasks cpuset, you will be forced to
take a page within your cpuset instead.
Several kernel mechanisms use alloc_pages_node(), directly or
indirectly, including the numa-aware slab allocator, various device
and bus controller drivers, the page migration facility, the hugetlb
allocator, node local data, numa aware block io scheduler, per-node
mmtimers, per-node network buffers, per-node oprofile buffers, memory
pools, some netfilter counters, and any other caller of kmalloc_node()
or vmalloc_node().
These mechanisms are expecting to get memory on the node they asked
for, regardless of user imposed cpuset memory placement constraints.
This patch adds a __GFP_NOCPUSET flag to disable cpuset memory
placement. It is set in alloc_pages_node() and checked in
__cpuset_zone_allowed(). The routine alloc_pages_node() is the
common routine that all node-specific allocation calls resolve to, and
__cpuset_zone_allowed() is called from the hook beneath __alloc_pages()
to enforce cpuset memory constraints.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
---
Andrew,
This is needed for memory migration to work if it is invoked from
a task that is cpuset memory constrained. Without this, writing a
cpusets 'mems' file (when its memory_migrate flag is set '1') from
a task that is in some limited cpuset (not all memory nodes allowed)
causes the migration to go to the memory nodes in that writing tasks
cpuset, not to the requested memory nodes in the 'mems' value written.
I recommend it as a fix in 2.6.16. -pj
include/linux/gfp.h | 6 ++++++
kernel/cpuset.c | 2 ++
2 files changed, 8 insertions(+)
--- 2.6.16-rc6.orig/include/linux/gfp.h 2006-03-13 20:19:30.000000000 -0800
+++ 2.6.16-rc6/include/linux/gfp.h 2006-03-17 21:52:03.000000000 -0800
@@ -47,6 +47,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NOCPUSET ((__force gfp_t)0x40000u)/* Ignore cpuset constraints */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -113,6 +114,11 @@ static inline struct page *alloc_pages_n
/* Unknown node is current node */
if (nid < 0)
nid = numa_node_id();
+ /*
+ * Specified (or implied by nid < 0) node overrides cpuset placement.
+ * Various slab, page and device node specific allocations need this.
+ */
+ gfp_mask |= __GFP_NOCPUSET;
return __alloc_pages(gfp_mask, order,
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
--- 2.6.16-rc6.orig/kernel/cpuset.c 2006-03-13 20:19:36.000000000 -0800
+++ 2.6.16-rc6/kernel/cpuset.c 2006-03-17 21:52:18.000000000 -0800
@@ -2164,6 +2164,8 @@ int __cpuset_zone_allowed(struct zone *z
node = z->zone_pgdat->node_id;
if (node_isset(node, current->mems_allowed))
return 1;
+ if (gfp_mask & __GFP_NOCPUSET)
+ return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
return 0;
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.650.933.1373, 1.925.600.0401
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
2006-03-18 20:40 [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints Paul Jackson
@ 2006-03-20 7:07 ` Andrew Morton
2006-03-20 8:30 ` Paul Jackson
2006-03-22 16:33 ` Andi Kleen
1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2006-03-20 7:07 UTC (permalink / raw)
To: Paul Jackson
Cc: Simon.Derr, linux-kernel, nickpiggin, ak, pj, haveblue, mingo,
clameter
Paul Jackson <pj@sgi.com> wrote:
>
> @@ -113,6 +114,11 @@ static inline struct page *alloc_pages_n
> /* Unknown node is current node */
> if (nid < 0)
> nid = numa_node_id();
> + /*
> + * Specified (or implied by nid < 0) node overrides cpuset placement.
> + * Various slab, page and device node specific allocations need this.
> + */
> + gfp_mask |= __GFP_NOCPUSET;
You're kidding. This adds more code to every page allocation on every
machine, cpusets or not.
I stuck this on top:
--- devel/include/linux/gfp.h~cpuset-alloc_pages_node-overrides-cpuset-constraints-speedup 2006-03-19 23:01:04.000000000 -0800
+++ devel-akpm/include/linux/gfp.h 2006-03-19 23:01:04.000000000 -0800
@@ -47,7 +47,11 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#ifdef CONFIG_CPUSETS
#define __GFP_NOCPUSET ((__force gfp_t)0x40000u)/* Ignore cpuset constraints */
+#else
+#define __GFP_NOCPUSET ((__force gfp_t)0u)
+#endif
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
_
But it's a bit ugly.
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
2006-03-20 7:07 ` Andrew Morton
@ 2006-03-20 8:30 ` Paul Jackson
2006-03-20 15:34 ` Christoph Lameter
0 siblings, 1 reply; 6+ messages in thread
From: Paul Jackson @ 2006-03-20 8:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Simon.Derr, linux-kernel, nickpiggin, ak, haveblue, mingo,
clameter
> You're kidding. This adds more code to every page allocation on every
> machine, cpusets or not.
>
> I stuck this on top:
>
> ...
>
> But it's a bit ugly.
The code path we care about for this __GFP_NOCPUSET flag is for memory
migration. When it says it wants memory allocated on a certain node,
it really wants it there.
While I consider it a bug in the cpuset implementation that any kernel
alloc_pages_node(), kmalloc_node() or vmalloc_node() call has the
requested node ignored if it falls outside the current tasks cpuset,
that's not a priority bug in my view. It's been that way for a year
(since cpusets went in), and no one has noticed.
Christoph - I suspect that the following patch, instead of the one
we sent, would meet with greater approval from Andrew.
The patch below just adds the __GFP_NOCPUSET flag on the memory
migration code path, where we need it. That code path is not
performance critical.
We can deal with the long standing bug in cpusets, where it overrides
all alloc_pages_node() calls, some other day.
What think you of this, Christoph? Should we send it to Andrew?
include/linux/gfp.h | 1 +
kernel/cpuset.c | 2 ++
mm/migrate.c | 5 +++--
3 files changed, 6 insertions(+), 2 deletions(-)
--- 2.6.16-rc6-mm2.orig/include/linux/gfp.h 2006-03-18 13:07:51.000000000 -0800
+++ 2.6.16-rc6-mm2/include/linux/gfp.h 2006-03-19 23:55:19.000000000 -0800
@@ -47,6 +47,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NOCPUSET ((__force gfp_t)0x40000u)/* Ignore cpuset constraints */
#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
--- 2.6.16-rc6-mm2.orig/kernel/cpuset.c 2006-03-18 13:15:04.000000000 -0800
+++ 2.6.16-rc6-mm2/kernel/cpuset.c 2006-03-19 23:52:42.000000000 -0800
@@ -2209,6 +2209,8 @@ int __cpuset_zone_allowed(struct zone *z
node = z->zone_pgdat->node_id;
if (node_isset(node, current->mems_allowed))
return 1;
+ if (gfp_mask & __GFP_NOCPUSET)
+ return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
return 0;
--- 2.6.16-rc6-mm2.orig/mm/migrate.c 2006-03-18 13:12:53.000000000 -0800
+++ 2.6.16-rc6-mm2/mm/migrate.c 2006-03-20 00:24:36.000000000 -0800
@@ -614,12 +614,13 @@ redo:
* a certain old page is moved to so we cannot
* specify the correct address.
*/
- page = alloc_page_vma(GFP_HIGHUSER, vma,
+ page = alloc_page_vma(GFP_HIGHUSER|__GFP_NOCPUSET, vma,
offset + vma->vm_start);
offset += PAGE_SIZE;
}
else
- page = alloc_pages_node(dest, GFP_HIGHUSER, 0);
+ page = alloc_pages_node(dest,
+ GFP_HIGHUSER|__GFP_NOCPUSET, 0);
if (!page) {
err = -ENOMEM;
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
2006-03-20 8:30 ` Paul Jackson
@ 2006-03-20 15:34 ` Christoph Lameter
0 siblings, 0 replies; 6+ messages in thread
From: Christoph Lameter @ 2006-03-20 15:34 UTC (permalink / raw)
To: Paul Jackson
Cc: Andrew Morton, Simon.Derr, linux-kernel, nickpiggin, ak, haveblue,
mingo
On Mon, 20 Mar 2006, Paul Jackson wrote:
> The patch below just adds the __GFP_NOCPUSET flag on the memory
> migration code path, where we need it. That code path is not
> performance critical.
>
> We can deal with the long standing bug in cpusets, where it overrides
> all alloc_pages_node() calls, some other day.
>
> What think you of this, Christoph? Should we send it to Andrew?
It does not fix the general problem that alloc_pages_node must not respect
cpusets. Slab, device drivers etc etc are still screwed up with this
patch.
We noticed this first in page migration but I think the other areas are
much more problematic.
The earlier patch is correct. What Andrew wanted is an optimization
not a general change in the patch. And
#ifdef CPUSETS
#endif
around the ORing of __GFP_NOCPUSET may suffice.
Maybe you got a better idea?
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
2006-03-18 20:40 [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints Paul Jackson
2006-03-20 7:07 ` Andrew Morton
@ 2006-03-22 16:33 ` Andi Kleen
2006-03-22 18:05 ` Paul Jackson
1 sibling, 1 reply; 6+ messages in thread
From: Andi Kleen @ 2006-03-22 16:33 UTC (permalink / raw)
To: Paul Jackson
Cc: Andrew Morton, Simon.Derr, linux-kernel, Nick Piggin, Dave Hansen,
Ingo Molnar, Christoph Lameter
On Saturday 18 March 2006 21:40, Paul Jackson wrote:
> --- 2.6.16-rc6.orig/kernel/cpuset.c 2006-03-13 20:19:36.000000000 -0800
> +++ 2.6.16-rc6/kernel/cpuset.c 2006-03-17 21:52:18.000000000 -0800
> @@ -2164,6 +2164,8 @@ int __cpuset_zone_allowed(struct zone *z
> node = z->zone_pgdat->node_id;
> if (node_isset(node, current->mems_allowed))
> return 1;
> + if (gfp_mask & __GFP_NOCPUSET)
> + return 1;
> if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
> return 0;
Faster would be if (gfp_mask & (__GFP_NOCPUSET|__GFP_HARDWALL)) { /* sort them out */ }
That would only give a single test for the common case of no such flag set.
-Andi
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints
2006-03-22 16:33 ` Andi Kleen
@ 2006-03-22 18:05 ` Paul Jackson
0 siblings, 0 replies; 6+ messages in thread
From: Paul Jackson @ 2006-03-22 18:05 UTC (permalink / raw)
To: Andi Kleen
Cc: akpm, Simon.Derr, linux-kernel, nickpiggin, haveblue, mingo,
clameter
Andi wrote:
> Faster would be if (gfp_mask & (__GFP_NOCPUSET|__GFP_HARDWALL)) { /* sort them out */ }
Yup - good point. However ...
Note that I am already off the fast path here, having peeled off the
interrupt and node inside cpuset cases above. After these checks for
__GFP_NOCPUSET or __GFP_HARDWALL, only the rare case of having to
look outside a cpuset with no free memory for essential kernel memory
remains.
Look at this entire routine:
int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
{
int node; /* node that zone z is on */
const struct cpuset *cs; /* current cpuset ancestors */
int allowed; /* is allocation in zone z allowed? */
if (in_interrupt())
return 1;
node = z->zone_pgdat->node_id;
if (node_isset(node, current->mems_allowed))
return 1;
if (gfp_mask & __GFP_NOCPUSET)
return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
return 0;
if (current->flags & PF_EXITING) /* Let dying task have memory */
return 1;
/* Not hardwall and node outside mems_allowed: scan up cpusets */
mutex_lock(&callback_mutex);
task_lock(current);
cs = nearest_exclusive_ancestor(current->cpuset);
task_unlock(current);
allowed = node_isset(node, cs->mems_allowed);
mutex_unlock(&callback_mutex);
return allowed;
}
Notice that if neither of the __GFP_NOCPUSET or __GFP_HARDWALL flag
tests fire, then the code hits a mutex, spinlock and subroutine call.
Also notice that the __GFP_NOCPUSET case is the most important case of
those at or below that check. Any alloc_pages_node, zmalloc_node or
kmalloc_node call that requires a node outside the cpuset hits that
case. Only tasks that have used up all the available memory in their
cpuset commonly get past here, to the __GFP_HARDWALL case, which is not
a case worth optimizing at the expense of more important code paths.
So ... actually ... I suspect that doing:
if (gfp_mask & __GFP_NOCPUSET)
return 1;
if (gfp_mask & __GFP_HARDWALL)
return 0;
is faster than doing:
if (gfp_mask & (__GFP_NOCPUSET|__GFP_HARDWALL)) {
if (gfp_mask & __GFP_NOCPUSET)
return 1;
if (gfp_mask & __GFP_HARDWALL)
return 0;
}
because the first of these two gets to the relatively more important
case of __GFP_NOCPUSET faster.
Too bad the patch didn't show a little more context.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-03-22 18:06 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-03-18 20:40 [PATCH] Cpuset: alloc_pages_node overrides cpuset constraints Paul Jackson
2006-03-20 7:07 ` Andrew Morton
2006-03-20 8:30 ` Paul Jackson
2006-03-20 15:34 ` Christoph Lameter
2006-03-22 16:33 ` Andi Kleen
2006-03-22 18:05 ` Paul Jackson
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.