[PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8


* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8
@ 2007-09-28 14:23 Mel Gorman
  2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:23 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

Hi Andrew,

This is the one-zonelist patchset again. There were multiple collisions
with patches in -mm like the policy cleanups, policy refcounting, the memory
controller patches and OOM killer changes. The functionality of the code has
not changed since the last release. I'm still hoping to merge this to -mm
when it is considered a bit more stable.

I've added David Rientjes to the cc as the OOM-zone-locking code is affected
by this patchset now and I want to be sure I didn't accidentally break it. The
changes to try_set_zone_oom() are the most important here. I believe the
code is equivalent but a second opinion would not hurt.

Changelog since V7
  o Rebase to 2.6.23-rc8-mm2

Changelog since V6
  o Fix build bug in relation to memory controller combined with one-zonelist
  o Use while() instead of a stupid looking for()
  o Instead of encoding zone index information in a pointer, this version
    introduces a structure that stores a zone pointer and its index 

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones whereas other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask, which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The fifth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodelists
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.
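
As a rough illustration of the filtering described above, the following is a
userspace sketch and not kernel code. Every sketch_ name is hypothetical, the
nodemask is modelled as a plain bitmask limited to 32 nodes, and a NULL mask
is assumed to mean "no node restriction". It shows a single NULL-terminated
zone list being filtered by both a highest zone index and an optional
nodemask, with a one-node mask standing in for a __GFP_THISNODE-style request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone: only the fields the filter needs. */
struct sketch_zone {
	int idx;	/* zone index, higher means a "higher" zone */
	int nid;	/* node the zone belongs to */
};

/* Hypothetical nodemask: one bit per node, at most 32 nodes. */
typedef unsigned int sketch_nodemask_t;

/* Build a mask with only nid set, as a THISNODE-style request might. */
static inline sketch_nodemask_t sketch_nodemask_of(int nid)
{
	assert(nid >= 0 && nid < 32);
	return 1u << nid;
}

/* A zone passes if its index is low enough and its node is in the mask. */
static inline int sketch_zone_allowed(const struct sketch_zone *zone,
				      int highidx,
				      const sketch_nodemask_t *nodes)
{
	if (zone->idx > highidx)
		return 0;
	if (nodes && !(*nodes & (1u << zone->nid)))
		return 0;
	return 1;
}

/* Return the first zone in the list that passes the filter, or NULL. */
static inline struct sketch_zone *
sketch_first_allowed(struct sketch_zone **zones, int highidx,
		     const sketch_nodemask_t *nodes)
{
	assert(zones != NULL);
	for (; *zones; zones++)
		if (sketch_zone_allowed(*zones, highidx, nodes))
			return *zones;
	return NULL;
}
```

In this model the same list serves every caller. Only the (highidx, nodes)
pair changes between an ordinary allocation and a node-restricted one.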

Performance results varied depending on the machine configuration but usually
showed small gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.23-rc3-mm1. The test machines are a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
@ 2007-09-28 14:23 ` Mel Gorman
  2007-09-28 14:24 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:23 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

The allocator deals with zonelists which indicate the order in which zones
should be targeted for an allocation. Similarly, direct reclaim of pages
iterates over an array of zones. For consistency, this patch converts direct
reclaim to use a zonelist. No functionality is changed by this patch. This
simplifies zonelist iterators in the next patch.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---

 include/linux/swap.h |    2 +-
 mm/page_alloc.c      |    2 +-
 mm/vmscan.c          |   21 ++++++++++++---------
 3 files changed, 14 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/include/linux/swap.h linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/swap.h
--- linux-2.6.23-rc8-mm2-clean/include/linux/swap.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/swap.h	2007-09-28 15:48:35.000000000 +0100
@@ -185,7 +185,7 @@ extern void move_tail_pages(void);
 extern void swap_setup(void);
 
 /* linux/mm/vmscan.c */
-extern unsigned long try_to_free_pages(struct zone **zones, int order,
+extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask);
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem,
 							gfp_t gfp_mask);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/mm/page_alloc.c linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-clean/mm/page_alloc.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c	2007-09-28 15:48:35.000000000 +0100
@@ -1668,7 +1668,7 @@ nofail_alloc:
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	did_some_progress = try_to_free_pages(zonelist->zones, order, gfp_mask);
+	did_some_progress = try_to_free_pages(zonelist, order, gfp_mask);
 
 	p->reclaim_state = NULL;
 	p->flags &= ~PF_MEMALLOC;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-clean/mm/vmscan.c linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/vmscan.c
--- linux-2.6.23-rc8-mm2-clean/mm/vmscan.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/vmscan.c	2007-09-28 15:48:35.000000000 +0100
@@ -1204,10 +1204,11 @@ static unsigned long shrink_zone(int pri
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
  */
-static unsigned long shrink_zones(int priority, struct zone **zones,
+static unsigned long shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	sc->all_unreclaimable = 1;
@@ -1245,8 +1246,8 @@ static unsigned long shrink_zones(int pr
  * holds filesystem locks which prevent writeout this might not work, and the
  * allocation attempt will fail.
  */
-static unsigned long do_try_to_free_pages(struct zone **zones, gfp_t gfp_mask,
-					  struct scan_control *sc)
+static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
+					gfp_t gfp_mask, struct scan_control *sc)
 {
 	int priority;
 	int ret = 0;
@@ -1254,6 +1255,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
+	struct zone **zones = zonelist->zones;
 	int i;
 
 	count_vm_event(ALLOCSTALL);
@@ -1272,7 +1274,7 @@ static unsigned long do_try_to_free_page
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token();
-		nr_reclaimed += shrink_zones(priority, zones, sc);
+		nr_reclaimed += shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups
@@ -1330,7 +1332,8 @@ out:
 	return ret;
 }
 
-unsigned long try_to_free_pages(struct zone **zones, int order, gfp_t gfp_mask)
+unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
+								gfp_t gfp_mask)
 {
 	struct scan_control sc = {
 		.gfp_mask = gfp_mask,
@@ -1343,7 +1346,7 @@ unsigned long try_to_free_pages(struct z
 		.isolate_pages = isolate_pages_global,
 	};
 
-	return do_try_to_free_pages(zones, gfp_mask, &sc);
+	return do_try_to_free_pages(zonelist, gfp_mask, &sc);
 }
 
 #ifdef CONFIG_CGROUP_MEM_CONT
@@ -1362,12 +1365,12 @@ unsigned long try_to_free_mem_cgroup_pag
 		.isolate_pages = mem_cgroup_isolate_pages,
 	};
 	int node;
-	struct zone **zones;
+	struct zonelist *zonelist;
 	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
 	for_each_online_node(node) {
-		zones = NODE_DATA(node)->node_zonelists[target_zone].zones;
-		if (do_try_to_free_pages(zones, sc.gfp_mask, &sc))
+		zonelist = &NODE_DATA(node)->node_zonelists[target_zone];
+		if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 			return 1;
 	}
 	return 0;


* [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
  2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
@ 2007-09-28 14:24 ` Mel Gorman
  2007-09-28 14:24 ` [PATCH 3/6] Use two zonelist that are filtered by " Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:24 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

This patch introduces a node_zonelist() helper function. It is used to look up
the appropriate zonelist given a node and a GFP mask. The patch on its own is
a cleanup but it helps clarify parts of the one-zonelist-per-node patchset. If
necessary, it can be merged with the next patch in this set without problems.
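
The lookup can be modelled in userspace roughly as follows. This is a sketch
under stated assumptions, not the kernel implementation: the sketch_ types,
the flag value and the node count are all hypothetical, and only the shape of
the helper (index 0 for the fallback list, index 1 for the node-local list)
follows the gfp.h hunk in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants, chosen for illustration only. */
#define SKETCH_GFP_THISNODE	0x1u
#define SKETCH_MAX_ZONELISTS	2
#define SKETCH_MAX_NODES	4

/* Hypothetical stand-ins for struct zonelist and pg_data_t. */
struct sketch_zonelist {
	int placeholder;	/* the real structure holds the zone entries */
};

struct sketch_pgdat {
	struct sketch_zonelist node_zonelists[SKETCH_MAX_ZONELISTS];
};

static struct sketch_pgdat sketch_node_data[SKETCH_MAX_NODES];

/* 0 selects the zonelist with fallback, 1 the node-local one. */
static inline int sketch_gfp_zonelist(unsigned int flags)
{
	return (flags & SKETCH_GFP_THISNODE) ? 1 : 0;
}

/* Callers pass a node id and flags instead of indexing the array by hand. */
static inline struct sketch_zonelist *
sketch_node_zonelist(int nid, unsigned int flags)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	return sketch_node_data[nid].node_zonelists +
	       sketch_gfp_zonelist(flags);
}
```

Keeping the index calculation in one helper means later changes to how
zonelists are laid out only need to touch that helper, not every caller.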

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 drivers/char/sysrq.c      |    3 +--
 fs/buffer.c               |    6 +++---
 include/linux/gfp.h       |   19 +++++++++++++++++--
 include/linux/mempolicy.h |    2 +-
 mm/mempolicy.c            |    6 +++---
 mm/page_alloc.c           |    3 +--
 mm/slab.c                 |    3 +--
 mm/slub.c                 |    3 +--
 8 files changed, 28 insertions(+), 17 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/drivers/char/sysrq.c linux-2.6.23-rc8-mm2-007_node_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/drivers/char/sysrq.c	2007-09-27 14:40:51.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/drivers/char/sysrq.c	2007-09-28 15:48:55.000000000 +0100
@@ -271,8 +271,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL],
-			GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-007_node_zonelist/fs/buffer.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/fs/buffer.c	2007-09-27 14:41:01.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/fs/buffer.c	2007-09-28 15:48:55.000000000 +0100
@@ -369,13 +369,13 @@ void invalidate_bdev(struct block_device
 static void free_more_memory(void)
 {
 	struct zone **zones;
-	pg_data_t *pgdat;
+	int nid;
 
 	wakeup_pdflush(1024);
 	yield();
 
-	for_each_online_pgdat(pgdat) {
-		zones = pgdat->node_zonelists[gfp_zone(GFP_NOFS)].zones;
+	for_each_online_node(nid) {
+		zones = node_zonelist(nid, GFP_NOFS);
 		if (*zones)
 			try_to_free_pages(zones, 0, GFP_NOFS);
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/gfp.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/gfp.h	2007-09-28 15:48:55.000000000 +0100
@@ -157,13 +157,29 @@ static inline gfp_t set_migrateflags(gfp
  * virtual kernel addresses to the allocated page(s).
  */
 
+static inline enum zone_type gfp_zonelist(gfp_t flags)
+{
+	int base = 0;
+
+	if (NUMA_BUILD && (flags & __GFP_THISNODE))
+		base = 1;
+
+	return base;
+}
+
 /*
  * We get the zone list from the current node and the gfp_mask.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
+ * There are two zonelists per node, one for all zones with memory and
+ * one containing just zones from the node the zonelist belongs to.
  *
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
  */
+static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+{
+	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+}
 
 #ifndef HAVE_ARCH_FREE_PAGE
 static inline void arch_free_page(struct page *page, int order) { }
@@ -185,8 +201,7 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order,
-		NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
 }
 
 #ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/include/linux/mempolicy.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/mempolicy.h	2007-09-28 15:48:55.000000000 +0100
@@ -241,7 +241,7 @@ static inline void mpol_fix_fork_child_f
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
  		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
 {
-	return NODE_DATA(0)->node_zonelists + gfp_zone(gfp_flags);
+	return node_zonelist(0, gfp_flags);
 }
 
 static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/mempolicy.c linux-2.6.23-rc8-mm2-007_node_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/mempolicy.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/mm/mempolicy.c	2007-09-28 15:48:55.000000000 +0100
@@ -1155,7 +1155,7 @@ static struct zonelist *zonelist_policy(
 		nd = 0;
 		BUG();
 	}
-	return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+	return node_zonelist(nd, gfp);
 }
 
 /* Do dynamic interleaving for a process */
@@ -1269,7 +1269,7 @@ struct zonelist *huge_zonelist(struct vm
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
 		__mpol_free(pol);		/* finished with pol */
-		return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_flags);
+		return node_zonelist(nid, gfp_flags);
 	}
 
 	zl = zonelist_policy(GFP_HIGHUSER, pol);
@@ -1291,7 +1291,7 @@ static struct page *alloc_page_interleav
 	struct zonelist *zl;
 	struct page *page;
 
-	zl = NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
+	zl = node_zonelist(nid, gfp);
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zl->zones[0])
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-007_node_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/page_alloc.c	2007-09-28 15:48:35.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/mm/page_alloc.c	2007-09-28 15:48:55.000000000 +0100
@@ -1816,10 +1816,9 @@ EXPORT_SYMBOL(free_pages);
 static unsigned int nr_free_zone_pages(int offset)
 {
 	/* Just pick one node, since fallback list is circular */
-	pg_data_t *pgdat = NODE_DATA(numa_node_id());
 	unsigned int sum = 0;
 
-	struct zonelist *zonelist = pgdat->node_zonelists + offset;
+	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
 	struct zone **zonep = zonelist->zones;
 	struct zone *zone;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/slab.c linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slab.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/slab.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slab.c	2007-09-28 15:48:55.000000000 +0100
@@ -3246,8 +3246,7 @@ static void *fallback_alloc(struct kmem_
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-			->node_zonelists[gfp_zone(flags)];
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/slub.c linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slub.c
--- linux-2.6.23-rc8-mm2-005_freepages_zonelist/mm/slub.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slub.c	2007-09-28 15:48:55.000000000 +0100
@@ -1303,8 +1303,7 @@ static struct page *get_any_partial(stru
 	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
 		return NULL;
 
-	zonelist = &NODE_DATA(slab_node(current->mempolicy))
-					->node_zonelists[gfp_zone(flags)];
+	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	for (z = zonelist->zones; *z; z++) {
 		struct kmem_cache_node *n;
 


* [PATCH 3/6] Use two zonelist that are filtered by GFP mask
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
  2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
  2007-09-28 14:24 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Mel Gorman
@ 2007-09-28 14:24 ` Mel Gorman
  2007-09-28 14:24 ` [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:24 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

Currently a node has a number of zonelists, one for each zone type in the
system and a second set for GFP_THISNODE allocations. Based on the zones allowed
by a gfp mask, one of these zonelists is selected. All of these zonelists
consume memory and occupy cache lines.

This patch replaces the multiple zonelists per-node with two zonelists. The
first contains all populated zones in the system and the second contains all
populated zones in the node suitable for GFP_THISNODE allocations. An iterator
macro called for_each_zone_zonelist() is introduced that iterates through each
zone in the zonelist that is allowed by the GFP flags.
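
The iteration pattern can be tried out in userspace with a sketch like the
one below. It is an illustration only: the sketch_ names are hypothetical and
the zone structure is reduced to an index and a node id. The loop shape is
modelled on the for_each_zone_zonelist() macro added to mmzone.h in this
patch, where entries above the highest allowed zone index are skipped and a
NULL entry terminates the list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct zone. */
struct sketch_zone {
	int idx;	/* zone index, e.g. a HIGHMEM-like zone above NORMAL */
	int nid;	/* owning node */
};

/* Skip entries above highidx; stops at the NULL terminator. */
static inline struct sketch_zone **
sketch_next_zone(struct sketch_zone **z, int highidx)
{
	while (*z && (*z)->idx > highidx)
		z++;
	return z;
}

/* zone receives the current entry; z is the cursor, already advanced. */
#define sketch_for_each_zone(zone, z, zones, highidx)			\
	for (z = sketch_next_zone(zones, highidx), zone = *z++;		\
	     zone;							\
	     z = sketch_next_zone(z, highidx), zone = *z++)

/* Count the zones in a NULL-terminated list usable at highidx. */
static inline int sketch_count_allowed(struct sketch_zone **zones,
				       int highidx)
{
	struct sketch_zone *zone, **z;
	int n = 0;

	assert(zones != NULL);
	sketch_for_each_zone(zone, z, zones, highidx)
		n++;
	return n;
}
```

For a list whose zone indices are 2, 1, 0, 2, 1, a highidx of 1 visits three
entries and a highidx of 0 visits one, without any second list being built.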

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---

 arch/parisc/mm/init.c  |   11 +-
 fs/buffer.c            |    6 +
 include/linux/gfp.h    |   17 +---
 include/linux/mmzone.h |   65 +++++++++++-----
 mm/hugetlb.c           |    8 +-
 mm/oom_kill.c          |    8 +-
 mm/page_alloc.c        |  169 +++++++++++++++++++-------------------------
 mm/slab.c              |    8 +-
 mm/slub.c              |    8 +-
 mm/vmscan.c            |   21 ++---
 10 files changed, 159 insertions(+), 162 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/arch/parisc/mm/init.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/arch/parisc/mm/init.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/arch/parisc/mm/init.c	2007-09-25 01:33:10.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/arch/parisc/mm/init.c	2007-09-28 15:49:16.000000000 +0100
@@ -599,15 +599,18 @@ void show_mem(void)
 #ifdef CONFIG_DISCONTIGMEM
 	{
 		struct zonelist *zl;
-		int i, j, k;
+		int i, j;
 
 		for (i = 0; i < npmem_ranges; i++) {
+			zl = node_zonelist(i, 0);
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				zl = NODE_DATA(i)->node_zonelists + j;
+				struct zone **z;
+				struct zone *zone;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
-				for (k = 0; zl->zones[k] != NULL; k++) 
-					printk("[%ld/%s] ", zone_to_nid(zl->zones[k]), zl->zones[k]->name);
+				for_each_zone_zonelist(zone, z, zl, j)
+					printk("[%d/%s] ", zone_to_nid(zone),
+								zone->name);
 				printk("\n");
 			}
 		}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/fs/buffer.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/fs/buffer.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/fs/buffer.c	2007-09-28 15:49:16.000000000 +0100
@@ -375,9 +375,11 @@ static void free_more_memory(void)
 	yield();
 
 	for_each_online_node(nid) {
-		zones = node_zonelist(nid, GFP_NOFS);
+		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+						gfp_zone(GFP_NOFS));
 		if (*zones)
-			try_to_free_pages(zones, 0, GFP_NOFS);
+			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
+						GFP_NOFS);
 	}
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/gfp.h	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/gfp.h	2007-09-28 15:49:16.000000000 +0100
@@ -119,29 +119,22 @@ static inline int allocflags_to_migratet
 
 static inline enum zone_type gfp_zone(gfp_t flags)
 {
-	int base = 0;
-
-#ifdef CONFIG_NUMA
-	if (flags & __GFP_THISNODE)
-		base = MAX_NR_ZONES;
-#endif
-
 #ifdef CONFIG_ZONE_DMA
 	if (flags & __GFP_DMA)
-		return base + ZONE_DMA;
+		return ZONE_DMA;
 #endif
 #ifdef CONFIG_ZONE_DMA32
 	if (flags & __GFP_DMA32)
-		return base + ZONE_DMA32;
+		return ZONE_DMA32;
 #endif
 	if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
 			(__GFP_HIGHMEM | __GFP_MOVABLE))
-		return base + ZONE_MOVABLE;
+		return ZONE_MOVABLE;
 #ifdef CONFIG_HIGHMEM
 	if (flags & __GFP_HIGHMEM)
-		return base + ZONE_HIGHMEM;
+		return ZONE_HIGHMEM;
 #endif
-	return base + ZONE_NORMAL;
+	return ZONE_NORMAL;
 }
 
 static inline gfp_t set_migrateflags(gfp_t gfp, gfp_t migrate_flags)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/mmzone.h linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/mmzone.h
--- linux-2.6.23-rc8-mm2-007_node_zonelist/include/linux/mmzone.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/mmzone.h	2007-09-28 15:49:16.000000000 +0100
@@ -395,10 +395,10 @@ static inline int zone_is_oom_locked(con
  * The NUMA zonelists are doubled becausse we need zonelists that restrict the
  * allocations to a single node for GFP_THISNODE.
  *
- * [0 .. MAX_NR_ZONES -1] 		: Zonelists with fallback
- * [MAZ_NR_ZONES ... MAZ_ZONELISTS -1]  : No fallback (GFP_THISNODE)
+ * [0]	: Zonelist with fallback
+ * [1]	: No fallback (GFP_THISNODE)
  */
-#define MAX_ZONELISTS (2 * MAX_NR_ZONES)
+#define MAX_ZONELISTS 2
 
 
 /*
@@ -466,7 +466,7 @@ struct zonelist_cache {
 	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
 };
 #else
-#define MAX_ZONELISTS MAX_NR_ZONES
+#define MAX_ZONELISTS 1
 struct zonelist_cache;
 #endif
 
@@ -488,24 +488,6 @@ struct zonelist {
 #endif
 };
 
-#ifdef CONFIG_NUMA
-/*
- * Only custom zonelists like MPOL_BIND need to be filtered as part of
- * policies. As described in the comment for struct zonelist_cache, these
- * zonelists will not have a zlcache so zlcache_ptr will not be set. Use
- * that to determine if the zonelists needs to be filtered or not.
- */
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
-	return !zonelist->zlcache_ptr;
-}
-#else
-static inline int alloc_should_filter_zonelist(struct zonelist *zonelist)
-{
-	return 0;
-}
-#endif /* CONFIG_NUMA */
-
 #ifdef CONFIG_ARCH_POPULATES_NODE_MAP
 struct node_active_region {
 	unsigned long start_pfn;
@@ -734,6 +716,45 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+/* Returns the first zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+					enum zone_type highest_zoneidx)
+{
+	struct zone **z;
+
+	/* Find the first suitable zone to use for the allocation */
+	z = zonelist->zones;
+	while (*z && zone_idx(*z) > highest_zoneidx)
+		z++;
+
+	return z;
+}
+
+/* Returns the next zone at or below highest_zoneidx in a zonelist */
+static inline struct zone **next_zones_zonelist(struct zone **z,
+					enum zone_type highest_zoneidx)
+{
+	/* Find the next suitable zone to use for the allocation */
+	while (*z && zone_idx(*z) > highest_zoneidx)
+		z++;
+
+	return z;
+}
+
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates though all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
+		zone;							\
+		z = next_zones_zonelist(z, highidx), zone = *z++)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/hugetlb.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/hugetlb.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/hugetlb.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/hugetlb.c	2007-09-28 15:49:16.000000000 +0100
@@ -74,11 +74,11 @@ static struct page *dequeue_huge_page(st
 	struct mempolicy *mpol;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol);
-	struct zone **z;
+	struct zone *zone, **z;
 
-	for (z = zonelist->zones; *z; z++) {
-		nid = zone_to_nid(*z);
-		if (cpuset_zone_allowed_softwall(*z, htlb_alloc_mask) &&
+	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
+		nid = zone_to_nid(zone);
+		if (cpuset_zone_allowed_softwall(zone, htlb_alloc_mask) &&
 		    !list_empty(&hugepage_freelists[nid])) {
 			page = list_entry(hugepage_freelists[nid].next,
 					  struct page, lru);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/oom_kill.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/oom_kill.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/oom_kill.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/oom_kill.c	2007-09-28 15:49:16.000000000 +0100
@@ -181,12 +181,14 @@ static inline enum oom_constraint constr
 						    gfp_t gfp_mask)
 {
 #ifdef CONFIG_NUMA
+	struct zone *zone;
 	struct zone **z;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	nodemask_t nodes = node_states[N_HIGH_MEMORY];
 
-	for (z = zonelist->zones; *z; z++)
-		if (cpuset_zone_allowed_softwall(*z, gfp_mask))
-			node_clear(zone_to_nid(*z), nodes);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		if (cpuset_zone_allowed_softwall(zone, gfp_mask))
+			node_clear(zone_to_nid(zone), nodes);
 		else
 			return CONSTRAINT_CPUSET;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/page_alloc.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/page_alloc.c	2007-09-28 15:49:16.000000000 +0100
@@ -1421,41 +1421,28 @@ static void zlc_mark_zone_full(struct zo
  */
 static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist, int alloc_flags)
+		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zone **z;
 	struct page *page = NULL;
-	int classzone_idx = zone_idx(zonelist->zones[0]);
+	int classzone_idx;
 	struct zone *zone;
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	enum zone_type highest_zoneidx = -1; /* Gets set for policy zonelists */
+
+	z = first_zones_zonelist(zonelist, high_zoneidx);
+	classzone_idx = zone_idx(*z);
 
 zonelist_scan:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	z = zonelist->zones;
-
-	do {
-		/*
-		 * In NUMA, this could be a policy zonelist which contains
-		 * zones that may not be allowed by the current gfp_mask.
-		 * Check the zone is allowed by the current flags
-		 */
-		if (unlikely(alloc_should_filter_zonelist(zonelist))) {
-			if (highest_zoneidx == -1)
-				highest_zoneidx = gfp_zone(gfp_mask);
-			if (zone_idx(*z) > highest_zoneidx)
-				continue;
-		}
-
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		zone = *z;
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				goto try_next_zone;
@@ -1489,7 +1476,7 @@ try_next_zone:
 			zlc_active = 1;
 			did_zlc_setup = 1;
 		}
-	} while (*(++z) != NULL);
+	}
 
 	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
 		/* Disable zlc cache for second zonelist scan */
@@ -1563,6 +1550,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 		struct zonelist *zonelist)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	struct zone **z;
 	struct page *page;
 	struct reclaim_state reclaim_state;
@@ -1588,7 +1576,7 @@ restart:
 	}
 
 	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
 
@@ -1632,7 +1620,8 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+	page = get_page_from_freelist(gfp_mask, order, zonelist,
+						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
 
@@ -1645,7 +1634,7 @@ rebalance:
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
 			page = get_page_from_freelist(gfp_mask, order,
-				zonelist, ALLOC_NO_WATERMARKS);
+				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
 			if (gfp_mask & __GFP_NOFAIL) {
@@ -1680,7 +1669,7 @@ nofail_alloc:
 
 	if (likely(did_some_progress)) {
 		page = get_page_from_freelist(gfp_mask, order,
-						zonelist, alloc_flags);
+					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
@@ -1696,7 +1685,7 @@ nofail_alloc:
 		 * under heavy pressure.
 		 */
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-				zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist);
 			goto got_pg;
@@ -1815,14 +1804,16 @@ EXPORT_SYMBOL(free_pages);
 
 static unsigned int nr_free_zone_pages(int offset)
 {
+	enum zone_type high_zoneidx = MAX_NR_ZONES - 1;
+	struct zone **z;
+	struct zone *zone;
+
 	/* Just pick one node, since fallback list is circular */
 	unsigned int sum = 0;
 
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
-	struct zone **zonep = zonelist->zones;
-	struct zone *zone;
 
-	for (zone = *zonep++; zone; zone = *zonep++) {
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		unsigned long size = zone->present_pages;
 		unsigned long high = zone->pages_high;
 		if (size > high)
@@ -2181,17 +2172,15 @@ static int find_next_best_node(int node,
  */
 static void build_zonelists_in_node_order(pg_data_t *pgdat, int node)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		for (j = 0; zonelist->zones[j] != NULL; j++)
-			;
- 		j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelists[0];
+	for (j = 0; zonelist->zones[j] != NULL; j++)
+		;
+	j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -2199,15 +2188,12 @@ static void build_zonelists_in_node_orde
  */
 static void build_thisnode_zonelists(pg_data_t *pgdat)
 {
-	enum zone_type i;
 	int j;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + MAX_NR_ZONES + i;
-		j = build_zonelists_node(pgdat, zonelist, 0, i);
-		zonelist->zones[j] = NULL;
-	}
+	zonelist = &pgdat->node_zonelists[1];
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
+	zonelist->zones[j] = NULL;
 }
 
 /*
@@ -2220,27 +2206,24 @@ static int node_order[MAX_NUMNODES];
 
 static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
 {
-	enum zone_type i;
 	int pos, j, node;
 	int zone_type;		/* needs to be signed */
 	struct zone *z;
 	struct zonelist *zonelist;
 
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		pos = 0;
-		for (zone_type = i; zone_type >= 0; zone_type--) {
-			for (j = 0; j < nr_nodes; j++) {
-				node = node_order[j];
-				z = &NODE_DATA(node)->node_zones[zone_type];
-				if (populated_zone(z)) {
-					zonelist->zones[pos++] = z;
-					check_highest_zone(zone_type);
-				}
+	zonelist = &pgdat->node_zonelists[0];
+	pos = 0;
+	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
+		for (j = 0; j < nr_nodes; j++) {
+			node = node_order[j];
+			z = &NODE_DATA(node)->node_zones[zone_type];
+			if (populated_zone(z)) {
+				zonelist->zones[pos++] = z;
+				check_highest_zone(zone_type);
 			}
 		}
-		zonelist->zones[pos] = NULL;
 	}
+	zonelist->zones[pos] = NULL;
 }
 
 static int default_zonelist_order(void)
@@ -2367,19 +2350,15 @@ static void build_zonelists(pg_data_t *p
 /* Construct the zonelist performance cache - see further mmzone.h */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
-		struct zonelist_cache *zlc;
-		struct zone **z;
+	struct zonelist *zonelist;
+	struct zonelist_cache *zlc;
+	struct zone **z;
 
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
-		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-		for (z = zonelist->zones; *z; z++)
-			zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
-	}
+	zonelist = &pgdat->node_zonelists[0];
+	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
+	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
+	for (z = zonelist->zones; *z; z++)
+		zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
 }
 
 
@@ -2393,45 +2372,43 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int node, local_node;
-	enum zone_type i,j;
+	enum zone_type j;
+	struct zonelist *zonelist;
 
 	local_node = pgdat->node_id;
-	for (i = 0; i < MAX_NR_ZONES; i++) {
-		struct zonelist *zonelist;
 
-		zonelist = pgdat->node_zonelists + i;
+	zonelist = &pgdat->node_zonelists[0];
+	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
 
- 		j = build_zonelists_node(pgdat, zonelist, 0, i);
- 		/*
- 		 * Now we build the zonelist so that it contains the zones
- 		 * of all the other nodes.
- 		 * We don't want to pressure a particular node, so when
- 		 * building the zones for node N, we make sure that the
- 		 * zones coming right after the local ones are those from
- 		 * node N+1 (modulo N)
- 		 */
-		for (node = local_node + 1; node < MAX_NUMNODES; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
-		for (node = 0; node < local_node; node++) {
-			if (!node_online(node))
-				continue;
-			j = build_zonelists_node(NODE_DATA(node), zonelist, j, i);
-		}
-
-		zonelist->zones[j] = NULL;
+	/*
+	 * Now we build the zonelist so that it contains the zones
+	 * of all the other nodes.
+	 * We don't want to pressure a particular node, so when
+	 * building the zones for node N, we make sure that the
+	 * zones coming right after the local ones are those from
+	 * node N+1 (modulo N)
+	 */
+	for (node = local_node + 1; node < MAX_NUMNODES; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
 	}
+	for (node = 0; node < local_node; node++) {
+		if (!node_online(node))
+			continue;
+		j = build_zonelists_node(NODE_DATA(node), zonelist, j,
+							MAX_NR_ZONES - 1);
+	}
+
+	zonelist->zones[j] = NULL;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	int i;
-
-	for (i = 0; i < MAX_NR_ZONES; i++)
-		pgdat->node_zonelists[i].zlcache_ptr = NULL;
+	pgdat->node_zonelists[0].zlcache_ptr = NULL;
+	pgdat->node_zonelists[1].zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slab.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slab.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slab.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slab.c	2007-09-28 15:49:16.000000000 +0100
@@ -3240,6 +3240,8 @@ static void *fallback_alloc(struct kmem_
 	struct zonelist *zonelist;
 	gfp_t local_flags;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
 	int nid;
 
@@ -3254,10 +3256,10 @@ retry:
 	 * Look through allowed nodes for objects available
 	 * from existing per node queues.
 	 */
-	for (z = zonelist->zones; *z && !obj; z++) {
-		nid = zone_to_nid(*z);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+		nid = zone_to_nid(zone);
 
-		if (cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (cpuset_zone_allowed_hardwall(zone, flags) &&
 			cache->nodelists[nid] &&
 			cache->nodelists[nid]->free_objects)
 				obj = ____cache_alloc_node(cache,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slub.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slub.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/slub.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slub.c	2007-09-28 15:49:16.000000000 +0100
@@ -1280,6 +1280,8 @@ static struct page *get_any_partial(stru
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
 	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
 
 	/*
@@ -1304,12 +1306,12 @@ static struct page *get_any_partial(stru
 		return NULL;
 
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
-	for (z = zonelist->zones; *z; z++) {
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		struct kmem_cache_node *n;
 
-		n = get_node(s, zone_to_nid(*z));
+		n = get_node(s, zone_to_nid(zone));
 
-		if (n && cpuset_zone_allowed_hardwall(*z, flags) &&
+		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
 				n->nr_partial > MIN_PARTIAL) {
 			page = get_partial_node(n);
 			if (page)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-007_node_zonelist/mm/vmscan.c linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmscan.c
--- linux-2.6.23-rc8-mm2-007_node_zonelist/mm/vmscan.c	2007-09-28 15:48:35.000000000 +0100
+++ linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmscan.c	2007-09-28 15:49:16.000000000 +0100
@@ -1208,13 +1208,11 @@ static unsigned long shrink_zones(int pr
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
-	struct zone **zones = zonelist->zones;
-	int i;
+	struct zone **z;
+	struct zone *zone;
 
 	sc->all_unreclaimable = 1;
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
+	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
 		if (!populated_zone(zone))
 			continue;
 
@@ -1255,14 +1253,13 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
-	struct zone **zones = zonelist->zones;
-	int i;
+	struct zone **z;
+	struct zone *zone;
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 
 	count_vm_event(ALLOCSTALL);
 
-	for (i = 0; zones[i] != NULL; i++) {
-		struct zone *zone = zones[i];
-
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
 
@@ -1321,9 +1318,7 @@ out:
 	 */
 	if (priority < 0)
 		priority = 0;
-	for (i = 0; zones[i] != 0; i++) {
-		struct zone *zone = zones[i];
-
+	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
 		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
 			continue;
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
                   ` (2 preceding siblings ...)
  2007-09-28 14:24 ` [PATCH 3/6] Use two zonelist that are filtered by " Mel Gorman
@ 2007-09-28 14:24 ` Mel Gorman
  2007-10-17  3:22   ` David Rientjes
  2007-09-28 14:25 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  2007-09-28 14:25 ` [PATCH 6/6] Use one zonelist that is filtered by nodemask Mel Gorman
  5 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:24 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

Filtering zonelists requires very frequent use of zone_idx(). This is costly
as it involves a lookup of another structure and a subtraction operation. As
the zone_idx is often required, it should be quickly accessible.  The node
idx could also be stored here if accessing zone->node turns out to be
significant, which may be the case on workloads where nodemasks are heavily
used.

This patch introduces a struct zoneref to store a zone pointer and a zone
index.  The zonelist then consists of an array of these struct zonerefs,
which are looked up as necessary. Helpers are provided for accessing the
zone index as well as the node index.

[kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---
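As an illustration only, the idea described above can be sketched in
userspace C. This is not the kernel code: struct pgdat, skip_high_zones(),
count_zones() and zoneref_demo() are hypothetical stand-ins, and struct zone
is cut down to the single field needed. The sketch assumes the list is
terminated by an entry with a NULL zone and a zone_idx of 0, which is what
lets the skip loop stop without a separate NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Enumerators have type int in C; the names mirror the kernel's zone types. */
enum { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

struct pgdat;				/* hypothetical stand-in for pg_data_t */

struct zone {
	struct pgdat *zone_pgdat;	/* back-pointer to the owning node */
};

struct pgdat {
	struct zone node_zones[MAX_NR_ZONES];
};

/* The lookup being cached: a dereference plus a pointer subtraction. */
static inline int zone_idx(const struct zone *zone)
{
	return (int)(zone - zone->zone_pgdat->node_zones);
}

struct zoneref {
	struct zone *zone;	/* pointer to the actual zone */
	int zone_idx;		/* cached zone_idx(zone) */
};

/* Record the zone and compute its index once, when the list is built. */
static inline void encode_zoneref(struct zone *zone, struct zoneref *zoneref)
{
	zoneref->zone = zone;
	zoneref->zone_idx = zone_idx(zone);
}

/* Skip entries above highest_zoneidx; the terminator has zone_idx 0. */
static inline struct zoneref *skip_high_zones(struct zoneref *z,
					      int highest_zoneidx)
{
	while (z->zone_idx > highest_zoneidx)
		z++;
	return z;
}

/* Walk the list using only the cached index, never touching struct zone. */
static inline int count_zones(struct zoneref *refs, int highest_zoneidx)
{
	struct zoneref *z = skip_high_zones(refs, highest_zoneidx);
	int n = 0;

	while (z->zone) {
		n++;
		z = skip_high_zones(z + 1, highest_zoneidx);
	}
	return n;
}

/* Build a one-node list, highest zone first, and count the usable zones. */
int zoneref_demo(int highest_zoneidx)
{
	static struct pgdat node0;
	struct zoneref refs[MAX_NR_ZONES + 1];
	int i, nr = 0;

	for (i = 0; i < MAX_NR_ZONES; i++)
		node0.node_zones[i].zone_pgdat = &node0;

	for (i = MAX_NR_ZONES - 1; i >= 0; i--)
		encode_zoneref(&node0.node_zones[i], &refs[nr++]);
	refs[nr].zone = NULL;		/* terminator: NULL zone, index 0 */
	refs[nr].zone_idx = 0;

	assert(refs[0].zone_idx == ZONE_HIGHMEM);
	return count_zones(refs, highest_zoneidx);
}
```

In this sketch, calling zoneref_demo(ZONE_NORMAL) from a main() skips the
HIGHMEM entry and counts two zones, while zoneref_demo(ZONE_HIGHMEM) counts
all three.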

 arch/parisc/mm/init.c  |    2 -
 fs/buffer.c            |    6 ++--
 include/linux/mmzone.h |   64 +++++++++++++++++++++++++++++++++++++-------
 include/linux/oom.h    |    4 +-
 kernel/cpuset.c        |    4 +-
 mm/hugetlb.c           |    3 +-
 mm/mempolicy.c         |   35 ++++++++++++++----------
 mm/oom_kill.c          |   45 +++++++++++++++---------------
 mm/page_alloc.c        |   59 ++++++++++++++++++++--------------------
 mm/slab.c              |    2 -
 mm/slub.c              |    2 -
 mm/vmscan.c            |    7 ++--
 mm/vmstat.c            |    5 ++-
 13 files changed, 145 insertions(+), 93 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/arch/parisc/mm/init.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/arch/parisc/mm/init.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/arch/parisc/mm/init.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/arch/parisc/mm/init.c	2007-09-28 15:49:39.000000000 +0100
@@ -604,7 +604,7 @@ void show_mem(void)
 		for (i = 0; i < npmem_ranges; i++) {
 			zl = node_zonelist(i);
 			for (j = 0; j < MAX_NR_ZONES; j++) {
-				struct zone **z;
+				struct zoneref *z;
 				struct zone *zone;
 
 				printk("Zone list for zone %d on node %d: ", j, i);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/fs/buffer.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/fs/buffer.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c	2007-09-28 15:49:39.000000000 +0100
@@ -368,16 +368,16 @@ void invalidate_bdev(struct block_device
  */
 static void free_more_memory(void)
 {
-	struct zone **zones;
+	struct zoneref *zrefs;
 	int nid;
 
 	wakeup_pdflush(1024);
 	yield();
 
 	for_each_online_node(nid) {
-		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
 						gfp_zone(GFP_NOFS));
-		if (*zones)
+		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/mmzone.h linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/mmzone.h	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-28 15:49:39.000000000 +0100
@@ -471,6 +471,15 @@ struct zonelist_cache;
 #endif
 
 /*
+ * This struct contains information about a zone in a zonelist. It is stored
+ * here to avoid dereferences into large structures and lookups of tables
+ */
+struct zoneref {
+	struct zone *zone;	/* Pointer to actual zone */
+	int zone_idx;		/* zone_idx(zoneref->zone) */
+};
+
+/*
  * One allocation request operates on a zonelist. A zonelist
  * is a list of zones, the first one is the 'goal' of the
  * allocation, the other zones are fallback zones, in decreasing
@@ -478,11 +487,18 @@ struct zonelist_cache;
  *
  * If zlcache_ptr is not NULL, then it is just the address of zlcache,
  * as explained above.  If zlcache_ptr is NULL, there is no zlcache.
+ *
+ * To speed the reading of the zonelist, the zonerefs contain the zone index
+ * of the entry being read. Helper functions to access information given
+ * a struct zoneref are
+ *
+ * zonelist_zone()	- Return the struct zone * for an entry in _zonerefs
+ * zonelist_zone_idx()	- Return the index of the zone for an entry
+ * zonelist_node_idx()	- Return the index of the node for an entry
  */
-
 struct zonelist {
 	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
-	struct zone *zones[MAX_ZONES_PER_ZONELIST + 1];      // NULL delimited
+	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
 #ifdef CONFIG_NUMA
 	struct zonelist_cache zlcache;			     // optional ...
 #endif
@@ -716,26 +732,52 @@ extern struct zone *next_zone(struct zon
 	     zone;					\
 	     zone = next_zone(zone))
 
+static inline struct zone *zonelist_zone(struct zoneref *zoneref)
+{
+	return zoneref->zone;
+}
+
+static inline int zonelist_zone_idx(struct zoneref *zoneref)
+{
+	return zoneref->zone_idx;
+}
+
+static inline int zonelist_node_idx(struct zoneref *zoneref)
+{
+#ifdef CONFIG_NUMA
+	/* zone_to_nid not available in this context */
+	return zoneref->zone->node;
+#else
+	return 0;
+#endif /* CONFIG_NUMA */
+}
+
+static inline void encode_zoneref(struct zone *zone, struct zoneref *zoneref)
+{
+	zoneref->zone = zone;
+	zoneref->zone_idx = zone_idx(zone);
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **first_zones_zonelist(struct zonelist *zonelist,
+static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx)
 {
-	struct zone **z;
+	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
-	z = zonelist->zones;
-	while (*z && zone_idx(*z) > highest_zoneidx)
+	z = zonelist->_zonerefs;
+	while (zonelist_zone_idx(z) > highest_zoneidx)
 		z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
-static inline struct zone **next_zones_zonelist(struct zone **z,
+static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
 					enum zone_type highest_zoneidx)
 {
 	/* Find the next suitable zone to use for the allocation */
-	while (*z && zone_idx(*z) > highest_zoneidx)
+	while (zonelist_zone_idx(z) > highest_zoneidx)
 		z++;
 
 	return z;
@@ -751,9 +793,11 @@ static inline struct zone **next_zones_z
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx), zone = *z++;	\
+	for (z = first_zones_zonelist(zlist, highidx),			\
+					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx), zone = *z++)
+		z = next_zones_zonelist(z, highidx),			\
+					zone = zonelist_zone(z++))
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/oom.h linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/oom.h
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/include/linux/oom.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/oom.h	2007-09-28 15:49:39.000000000 +0100
@@ -23,8 +23,8 @@ enum oom_constraint {
 	CONSTRAINT_MEMORY_POLICY,
 };
 
-extern int try_set_zone_oom(struct zonelist *zonelist);
-extern void clear_zonelist_oom(struct zonelist *zonelist);
+extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
+extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
 
 extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
 extern int register_oom_notifier(struct notifier_block *nb);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/kernel/cpuset.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/kernel/cpuset.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 15:49:39.000000000 +0100
@@ -1525,8 +1525,8 @@ int cpuset_zonelist_valid_mems_allowed(s
 {
 	int i;
 
-	for (i = 0; zl->zones[i]; i++) {
-		int nid = zone_to_nid(zl->zones[i]);
+	for (i = 0; zl->_zonerefs[i].zone; i++) {
+		int nid = zonelist_node_idx(&zl->_zonerefs[i]);
 
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/hugetlb.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/hugetlb.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/hugetlb.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/hugetlb.c	2007-09-28 15:49:39.000000000 +0100
@@ -74,7 +74,8 @@ static struct page *dequeue_huge_page(st
 	struct mempolicy *mpol;
 	struct zonelist *zonelist = huge_zonelist(vma, address,
 					htlb_alloc_mask, &mpol);
-	struct zone *zone, **z;
+	struct zone *zone;
+	struct zoneref *z;
 
 	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1) {
 		nid = zone_to_nid(zone);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/mempolicy.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/mempolicy.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c	2007-09-28 15:49:39.000000000 +0100
@@ -157,7 +157,7 @@ static struct zonelist *bind_zonelist(no
 		for_each_node_mask(nd, *nodes) { 
 			struct zone *z = &NODE_DATA(nd)->node_zones[k];
 			if (z->present_pages > 0) 
-				zl->zones[num++] = z;
+				encode_zoneref(z, &zl->_zonerefs[num++]);
 		}
 		if (k == 0)
 			break;
@@ -167,7 +167,7 @@ static struct zonelist *bind_zonelist(no
 		kfree(zl);
 		return ERR_PTR(-EINVAL);
 	}
-	zl->zones[num] = NULL;
+	zl->_zonerefs[num].zone = NULL;
 	return zl;
 }
 
@@ -489,9 +489,11 @@ static void get_zonemask(struct mempolic
 	nodes_clear(*nodes);
 	switch (p->policy) {
 	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->zones[i]; i++)
-			node_set(zone_to_nid(p->v.zonelist->zones[i]),
-				*nodes);
+		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
+			struct zoneref *zref;
+			zref = &p->v.zonelist->_zonerefs[i];
+			node_set(zonelist_node_idx(zref), *nodes);
+		}
 		break;
 	case MPOL_DEFAULT:
 		break;
@@ -1184,12 +1186,13 @@ unsigned slab_node(struct mempolicy *pol
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND:
+	case MPOL_BIND: {
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zone_to_nid(policy->v.zonelist->zones[0]);
+		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+	}
 
 	case MPOL_PREFERRED:
 		if (policy->v.preferred_node >= 0)
@@ -1293,7 +1296,7 @@ static struct page *alloc_page_interleav
 
 	zl = node_zonelist(nid, gfp);
 	page = __alloc_pages(gfp, order, zl);
-	if (page && page_zone(page) == zl->zones[0])
+	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
 	return page;
 }
@@ -1430,10 +1433,14 @@ int __mpol_equal(struct mempolicy *a, st
 		return a->v.preferred_node == b->v.preferred_node;
 	case MPOL_BIND: {
 		int i;
-		for (i = 0; a->v.zonelist->zones[i]; i++)
-			if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
+			struct zone *za, *zb;
+			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
+			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
+			if (za != zb)
 				return 0;
-		return b->v.zonelist->zones[i] == NULL;
+		}
+		return b->v.zonelist->_zonerefs[i].zone == NULL;
 	}
 	default:
 		BUG();
@@ -1752,12 +1759,12 @@ static void mpol_rebind_policy(struct me
 		break;
 	case MPOL_BIND: {
 		nodemask_t nodes;
-		struct zone **z;
+		struct zoneref *z;
 		struct zonelist *zonelist;
 
 		nodes_clear(nodes);
-		for (z = pol->v.zonelist->zones; *z; z++)
-			node_set(zone_to_nid(*z), nodes);
+		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
+			node_set(zonelist_node_idx(z), nodes);
 		nodes_remap(tmp, nodes, *mpolmask, *newmask);
 		nodes = tmp;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/oom_kill.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/oom_kill.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/oom_kill.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/oom_kill.c	2007-09-28 15:49:39.000000000 +0100
@@ -182,7 +182,7 @@ static inline enum oom_constraint constr
 {
 #ifdef CONFIG_NUMA
 	struct zone *zone;
-	struct zone **z;
+	struct zoneref *z;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	nodemask_t nodes = node_states[N_HIGH_MEMORY];
 
@@ -425,29 +425,29 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifie
  * if a parallel OOM killing is already taking place that includes a zone in
  * the zonelist.  Otherwise, locks all zones in the zonelist and returns 1.
  */
-int try_set_zone_oom(struct zonelist *zonelist)
+int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 {
-	struct zone **z;
+	struct zoneref *z;
+	struct zone *zone;
 	int ret = 1;
 
-	z = zonelist->zones;
-
 	spin_lock(&zone_scan_mutex);
-	do {
-		if (zone_is_oom_locked(*z)) {
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		if (zone_is_oom_locked(zone)) {
 			ret = 0;
 			goto out;
 		}
-	} while (*(++z) != NULL);
+	}
+
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		/*
+		 * Lock each zone in the zonelist under zone_scan_mutex so a
+		 * parallel invocation of try_set_zone_oom() doesn't succeed
+		 * when it shouldn't.
+		 */
+		zone_set_flag(zone, ZONE_OOM_LOCKED);
+	}
 
-	/*
-	 * Lock each zone in the zonelist under zone_scan_mutex so a parallel
-	 * invocation of try_set_zone_oom() doesn't succeed when it shouldn't.
-	 */
-	z = zonelist->zones;
-	do {
-		zone_set_flag(*z, ZONE_OOM_LOCKED);
-	} while (*(++z) != NULL);
 out:
 	spin_unlock(&zone_scan_mutex);
 	return ret;
@@ -458,16 +458,15 @@ out:
  * allocation attempts with zonelists containing them may now recall the OOM
  * killer, if necessary.
  */
-void clear_zonelist_oom(struct zonelist *zonelist)
+void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
 {
-	struct zone **z;
-
-	z = zonelist->zones;
+	struct zoneref *z;
+	struct zone *zone;
 
 	spin_lock(&zone_scan_mutex);
-	do {
-		zone_clear_flag(*z, ZONE_OOM_LOCKED);
-	} while (*(++z) != NULL);
+	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) {
+		zone_clear_flag(zone, ZONE_OOM_LOCKED);
+	}
 	spin_unlock(&zone_scan_mutex);
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/page_alloc.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/page_alloc.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c	2007-09-28 15:49:39.000000000 +0100
@@ -1360,7 +1360,7 @@ static nodemask_t *zlc_setup(struct zone
  * We are low on memory in the second scan, and should leave no stone
  * unturned looking for a free page.
  */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 						nodemask_t *allowednodes)
 {
 	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
@@ -1371,7 +1371,7 @@ static int zlc_zone_worth_trying(struct 
 	if (!zlc)
 		return 1;
 
-	i = z - zonelist->zones;
+	i = z - zonelist->_zonerefs;
 	n = zlc->z_to_n[i];
 
 	/* This zone is worth trying if it is allowed but not full */
@@ -1383,7 +1383,7 @@ static int zlc_zone_worth_trying(struct 
  * zlc->fullzones, so that subsequent attempts to allocate a page
  * from that zone don't waste time re-examining it.
  */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
 	int i;				/* index of *z in zonelist zones */
@@ -1392,7 +1392,7 @@ static void zlc_mark_zone_full(struct zo
 	if (!zlc)
 		return;
 
-	i = z - zonelist->zones;
+	i = z - zonelist->_zonerefs;
 
 	set_bit(i, zlc->fullzones);
 }
@@ -1404,13 +1404,13 @@ static nodemask_t *zlc_setup(struct zone
 	return NULL;
 }
 
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zone **z,
+static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
 				nodemask_t *allowednodes)
 {
 	return 1;
 }
 
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zone **z)
+static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
 {
 }
 #endif	/* CONFIG_NUMA */
@@ -1423,7 +1423,7 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
-	struct zone **z;
+	struct zoneref *z;
 	struct page *page = NULL;
 	int classzone_idx;
 	struct zone *zone;
@@ -1432,7 +1432,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
 	z = first_zones_zonelist(zonelist, high_zoneidx);
-	classzone_idx = zone_idx(*z);
+	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
 	/*
@@ -1551,7 +1551,8 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
-	struct zone **z;
+	struct zoneref *z;
+	struct zone *zone;
 	struct page *page;
 	struct reclaim_state reclaim_state;
 	struct task_struct *p = current;
@@ -1565,9 +1566,9 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 		return NULL;
 
 restart:
-	z = zonelist->zones;  /* the list of zones suitable for gfp_mask */
+	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */
 
-	if (unlikely(*z == NULL)) {
+	if (unlikely(!z->zone)) {
 		/*
 		 * Happens if we have an empty zonelist as a result of
 		 * GFP_THISNODE being used on a memoryless node
@@ -1591,8 +1592,8 @@ restart:
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
-	for (z = zonelist->zones; *z; z++)
-		wakeup_kswapd(*z, order);
+	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+		wakeup_kswapd(zone, order);
 
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
@@ -1673,7 +1674,7 @@ nofail_alloc:
 		if (page)
 			goto got_pg;
 	} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
-		if (!try_set_zone_oom(zonelist)) {
+		if (!try_set_zone_oom(zonelist, gfp_mask)) {
 			schedule_timeout_uninterruptible(1);
 			goto restart;
 		}
@@ -1687,18 +1688,18 @@ nofail_alloc:
 		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
-			clear_zonelist_oom(zonelist);
+			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
 		}
 
 		/* The OOM killer will not help higher order allocs so fail */
 		if (order > PAGE_ALLOC_COSTLY_ORDER) {
-			clear_zonelist_oom(zonelist);
+			clear_zonelist_oom(zonelist, gfp_mask);
 			goto nopage;
 		}
 
 		out_of_memory(zonelist, gfp_mask, order);
-		clear_zonelist_oom(zonelist);
+		clear_zonelist_oom(zonelist, gfp_mask);
 		goto restart;
 	}
 
@@ -1805,7 +1806,7 @@ EXPORT_SYMBOL(free_pages);
 static unsigned int nr_free_zone_pages(int offset)
 {
 	enum zone_type high_zoneidx = MAX_NR_ZONES - 1;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 
 	/* Just pick one node, since fallback list is circular */
@@ -2000,7 +2001,7 @@ static int build_zonelists_node(pg_data_
 		zone_type--;
 		zone = pgdat->node_zones + zone_type;
 		if (populated_zone(zone)) {
-			zonelist->zones[nr_zones++] = zone;
+			encode_zoneref(zone, &zonelist->_zonerefs[nr_zones++]);
 			check_highest_zone(zone_type);
 		}
 
@@ -2176,11 +2177,11 @@ static void build_zonelists_in_node_orde
 	struct zonelist *zonelist;
 
 	zonelist = &pgdat->node_zonelists[0];
-	for (j = 0; zonelist->zones[j] != NULL; j++)
+	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
 		;
 	j = build_zonelists_node(NODE_DATA(node), zonelist, j,
 							MAX_NR_ZONES - 1);
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
 }
 
 /*
@@ -2193,7 +2194,7 @@ static void build_thisnode_zonelists(pg_
 
 	zonelist = &pgdat->node_zonelists[1];
 	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
 }
 
 /*
@@ -2218,12 +2219,12 @@ static void build_zonelists_in_zone_orde
 			node = node_order[j];
 			z = &NODE_DATA(node)->node_zones[zone_type];
 			if (populated_zone(z)) {
-				zonelist->zones[pos++] = z;
+				encode_zoneref(z, &zonelist->_zonerefs[pos++]);
 				check_highest_zone(zone_type);
 			}
 		}
 	}
-	zonelist->zones[pos] = NULL;
+	zonelist->_zonerefs[pos].zone = NULL;
 }
 
 static int default_zonelist_order(void)
@@ -2300,7 +2301,7 @@ static void build_zonelists(pg_data_t *p
 	/* initialize zonelists */
 	for (i = 0; i < MAX_ZONELISTS; i++) {
 		zonelist = pgdat->node_zonelists + i;
-		zonelist->zones[0] = NULL;
+		zonelist->_zonerefs[0].zone = NULL;
 	}
 
 	/* NUMA-aware ordering of nodes */
@@ -2352,13 +2353,13 @@ static void build_zonelist_cache(pg_data
 {
 	struct zonelist *zonelist;
 	struct zonelist_cache *zlc;
-	struct zone **z;
+	struct zoneref *z;
 
 	zonelist = &pgdat->node_zonelists[0];
 	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
 	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-	for (z = zonelist->zones; *z; z++)
-		zlc->z_to_n[z - zonelist->zones] = zone_to_nid(*z);
+	for (z = zonelist->_zonerefs; z->zone; z++)
+		zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
 }
 
 
@@ -2401,7 +2402,7 @@ static void build_zonelists(pg_data_t *p
 							MAX_NR_ZONES - 1);
 	}
 
-	zonelist->zones[j] = NULL;
+	zonelist->_zonerefs[j].zone = NULL;
 }
 
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slab.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/slab.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slab.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/slab.c	2007-09-28 15:49:39.000000000 +0100
@@ -3239,7 +3239,7 @@ static void *fallback_alloc(struct kmem_
 {
 	struct zonelist *zonelist;
 	gfp_t local_flags;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	void *obj = NULL;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slub.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/slub.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/slub.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/slub.c	2007-09-28 15:49:39.000000000 +0100
@@ -1279,7 +1279,7 @@ static struct page *get_any_partial(stru
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(flags);
 	struct page *page;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmscan.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/vmscan.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmscan.c	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/vmscan.c	2007-09-28 15:49:39.000000000 +0100
@@ -1208,7 +1208,7 @@ static unsigned long shrink_zones(int pr
 					struct scan_control *sc)
 {
 	unsigned long nr_reclaimed = 0;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 
 	sc->all_unreclaimable = 1;
@@ -1253,7 +1253,7 @@ static unsigned long do_try_to_free_page
 	unsigned long nr_reclaimed = 0;
 	struct reclaim_state *reclaim_state = current->reclaim_state;
 	unsigned long lru_pages = 0;
-	struct zone **z;
+	struct zoneref *z;
 	struct zone *zone;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 
@@ -1361,10 +1361,9 @@ unsigned long try_to_free_mem_cgroup_pag
 	};
 	int node;
 	struct zonelist *zonelist;
-	int target_zone = gfp_zone(GFP_HIGHUSER_MOVABLE);
 
 	for_each_online_node(node) {
-		zonelist = &NODE_DATA(node)->node_zonelists[target_zone];
+		zonelist = &NODE_DATA(node)->node_zonelists[0];
 		if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 			return 1;
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmstat.c linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/vmstat.c
--- linux-2.6.23-rc8-mm2-010_use_two_zonelists/mm/vmstat.c	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/vmstat.c	2007-09-28 15:49:39.000000000 +0100
@@ -365,11 +365,12 @@ void refresh_cpu_vm_stats(int cpu)
  */
 void zone_statistics(struct zonelist *zonelist, struct zone *z)
 {
-	if (z->zone_pgdat == zonelist->zones[0]->zone_pgdat) {
+	if (z->zone_pgdat == zonelist_zone(&zonelist->_zonerefs[0])->zone_pgdat) {
 		__inc_zone_state(z, NUMA_HIT);
 	} else {
 		__inc_zone_state(z, NUMA_MISS);
-		__inc_zone_state(zonelist->zones[0], NUMA_FOREIGN);
+		__inc_zone_state(zonelist_zone(&zonelist->_zonerefs[0]),
+								NUMA_FOREIGN);
 	}
 	if (z->node == numa_node_id())
 		__inc_zone_state(z, NUMA_LOCAL);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx
  2007-09-28 14:24 ` [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx Mel Gorman
@ 2007-10-17  3:22   ` David Rientjes
  0 siblings, 0 replies; 35+ messages in thread
From: David Rientjes @ 2007-10-17  3:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, kamezawa.hiroyu,
	clameter

On Fri, 28 Sep 2007, Mel Gorman wrote:

> 
> Filtering zonelists requires very frequent use of zone_idx(). This is costly
> as it involves a lookup of another structure and a subtraction operation. As
> the zone_idx is often required, it should be quickly accessible.  The node
> idx could also be stored here if it was found that accessing zone->node is
> significant which may be the case on workloads where nodemasks are heavily
> used.
> 
> This patch introduces a struct zoneref to store a zone pointer and a zone
> index.  The zonelist then consists of an array of these struct zonerefs which
> are looked up as necessary. Helpers are given for accessing the zone index
> as well as the node index.
> 
> [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <clameter@sgi.com>

OOM locking looks good, thanks.

Acked-by: David Rientjes <rientjes@google.com>
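
For readers following the quoted changelog, here is a minimal userspace sketch of the idea: cache the zone index next to the zone pointer so that filtering does not have to recompute it. All names ending in _model, and MODEL_NR_ZONES, are hypothetical stand-ins and not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_ZONES 3	/* hypothetical; stands in for MAX_NR_ZONES */

struct pgdat_model;

struct zone_model {
	struct pgdat_model *pgdat;	/* owning node */
};

struct pgdat_model {
	int node_id;
	struct zone_model zones[MODEL_NR_ZONES];
};

/* Point every zone back at its owning node. */
static void pgdat_model_init(struct pgdat_model *pgdat, int node_id)
{
	int i;

	pgdat->node_id = node_id;
	for (i = 0; i < MODEL_NR_ZONES; i++)
		pgdat->zones[i].pgdat = pgdat;
}

/* The slow way: follow zone -> node, then subtract the array base. */
static int zone_idx_slow(const struct zone_model *zone)
{
	return (int)(zone - zone->pgdat->zones);
}

/* A zonelist entry that carries the precomputed index. */
struct zoneref_model {
	struct zone_model *zone;
	int zone_idx;
};

/* Pay for the lookup once, when the zonelist is built. */
static void encode_zoneref_model(struct zone_model *zone,
				 struct zoneref_model *ref)
{
	assert(zone != NULL && ref != NULL);
	ref->zone = zone;
	ref->zone_idx = zone_idx_slow(zone);
}

/* Cheap accessor used while filtering. */
static int zoneref_zone_idx(const struct zoneref_model *ref)
{
	return ref->zone_idx;
}

/* The node index is still derived through the zone in this sketch. */
static int zoneref_node_idx(const struct zoneref_model *ref)
{
	return ref->zone->pgdat->node_id;
}
```

The trade-off is one extra int per zonelist entry in exchange for avoiding the dereference and subtraction on every filtering step.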


* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
                   ` (3 preceding siblings ...)
  2007-09-28 14:24 ` [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx Mel Gorman
@ 2007-09-28 14:25 ` Mel Gorman
  2007-09-28 15:37   ` Lee Schermerhorn
  2007-09-28 14:25 ` [PATCH 6/6] Use one zonelist that is filtered by nodemask Mel Gorman
  5 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:25 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

The MPOL_BIND policy creates a zonelist that is used for allocations belonging
to that thread that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A side benefit is that
allocations using MPOL_BIND now use the local-node-ordered zonelist instead
of a custom node-id-ordered zonelist.
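
As a rough illustration of the filtering described above, the following userspace sketch walks one shared zonelist and skips entries that are either above the allowed zone index or outside a node bitmask. The types and the names nodemask_model_t, zref_in_mask, next_allowed and count_allowed are assumptions made for this example. Unlike the helpers in the patch, the sketch tests for the NULL terminator explicitly rather than relying on the terminator's zone index.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these types mimic, but are not, the kernel's. */
struct zone {
	int node;		/* node this zone belongs to */
};

struct zoneref {
	struct zone *zone;	/* NULL terminates the list */
	int zone_idx;		/* cached zone index */
};

typedef unsigned long nodemask_model_t;	/* one bit per node, hypothetical */

static int zref_in_mask(const struct zoneref *z, const nodemask_model_t *nodes)
{
	return (int)((*nodes >> z->zone->node) & 1UL);
}

/*
 * Skip entries whose zone index is too high or whose node is not in
 * the mask. A NULL mask means "no node filtering". Stops at the
 * terminating entry.
 */
static struct zoneref *next_allowed(struct zoneref *z,
				    const nodemask_model_t *nodes,
				    int highest_zoneidx)
{
	while (z->zone &&
	       (z->zone_idx > highest_zoneidx ||
		(nodes && !zref_in_mask(z, nodes))))
		z++;
	return z;
}

/* Count the zones a filtered walk would visit. */
static int count_allowed(struct zoneref *list, const nodemask_model_t *nodes,
			 int highest_zoneidx)
{
	struct zoneref *z;
	int n = 0;

	for (z = next_allowed(list, nodes, highest_zoneidx); z->zone;
	     z = next_allowed(z + 1, nodes, highest_zoneidx))
		n++;
	return n;
}
```

Because the mask only removes entries, the surviving zones keep the order of the underlying list, which is why a bind policy would inherit the local-node ordering.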

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   58 +++++++++++++---
 kernel/cpuset.c           |   18 +----
 mm/mempolicy.c            |  144 +++++++++++------------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 131 insertions(+), 142 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c	2007-09-28 15:49:57.000000000 +0100
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-27 14:41:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h	2007-09-28 15:49:57.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h	2007-09-28 15:49:16.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
@@ -184,6 +184,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h	2007-09-28 15:49:57.000000000 +0100
@@ -64,9 +64,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h	2007-09-28 15:49:57.000000000 +0100
@@ -758,47 +758,85 @@ static inline void encode_zoneref(struct
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
 	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
 	z = zonelist->_zonerefs;
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
 					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
 					zone = zonelist_zone(z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c	2007-09-28 15:49:57.000000000 +0100
@@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersect(nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c	2007-09-28 15:49:57.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				encode_zoneref(z, &zl->_zonerefs[num++]);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zonerefs[num].zone = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1131,6 +1103,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1143,11 +1127,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1191,7 +1170,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1349,7 +1334,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	/*
 	 * fast path:  default or task policy
 	 */
-	return __alloc_pages(gfp, 0, zl);
+	return __alloc_pages_nodemask(gfp, 0, zl, nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1406,14 +1391,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1427,21 +1404,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1453,8 +1421,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1745,6 +1711,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1757,32 +1725,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1845,9 +1787,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c	2007-09-28 15:49:57.000000000 +0100
@@ -1420,7 +1420,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1431,7 +1431,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
@@ -1439,7 +1439,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1545,9 +1546,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1576,7 +1577,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1621,7 +1622,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1634,7 +1635,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1669,7 +1670,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1685,8 +1686,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
@@ -1739,6 +1741,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-28 14:25 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
@ 2007-09-28 15:37   ` Lee Schermerhorn
  2007-09-28 18:28     ` Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Lee Schermerhorn @ 2007-09-28 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, linux-kernel, linux-mm, rientjes, kamezawa.hiroyu, clameter

Still need to fix 'nodes_intersect' -> 'nodes_intersects'.  See below.
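
For context, the check in question only asks whether two node bitmaps share at least one set bit. A minimal userspace sketch of that test follows; nodemask_model_t, MODEL_MAX_NODES and the model_* helpers are hypothetical and exist only for this example. As far as I recall, the kernel's nodes_intersects() macro takes nodemask values rather than pointers, so a pointer argument would need dereferencing, but treat that as an assumption to verify against nodemask.h.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MODEL_MAX_NODES 128	/* hypothetical node limit */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS ((MODEL_MAX_NODES + BITS_PER_WORD - 1) / BITS_PER_WORD)

typedef struct {
	unsigned long bits[MODEL_WORDS];
} nodemask_model_t;

/* Mark one node as present in the mask. */
static void model_node_set(int node, nodemask_model_t *mask)
{
	assert(node >= 0 && node < MODEL_MAX_NODES);
	mask->bits[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
}

/* Return 1 if the two masks have at least one node in common. */
static int model_nodes_intersects(const nodemask_model_t *a,
				  const nodemask_model_t *b)
{
	size_t i;

	for (i = 0; i < MODEL_WORDS; i++)
		if (a->bits[i] & b->bits[i])
			return 1;
	return 0;
}
```

A bind mask that does not intersect the allowed mask would leave no usable node, which is the case the cpuset check is there to detect.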

On Fri, 2007-09-28 at 15:25 +0100, Mel Gorman wrote:
> The MPOL_BIND policy creates a zonelist that is used for allocations belonging
> to that thread that can use the policy_zone. As the per-node zonelist is
> already being filtered based on a zone id, this patch adds a version of
> __alloc_pages() that takes a nodemask for further filtering. This eliminates
> the need for MPOL_BIND to create a custom zonelist. A side benefit is that
> allocations using MPOL_BIND now use the local-node-ordered zonelist instead
> of a custom node-id-ordered zonelist.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <clameter@sgi.com>
> ---
> 
>  fs/buffer.c               |    2 
>  include/linux/cpuset.h    |    4 -
>  include/linux/gfp.h       |    4 +
>  include/linux/mempolicy.h |    3 
>  include/linux/mmzone.h    |   58 +++++++++++++---
>  kernel/cpuset.c           |   18 +----
>  mm/mempolicy.c            |  144 +++++++++++------------------------------
>  mm/page_alloc.c           |   40 +++++++----
>  8 files changed, 131 insertions(+), 142 deletions(-)
> 
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c	2007-09-28 15:49:39.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c	2007-09-28 15:49:57.000000000 +0100
> @@ -376,7 +376,7 @@ static void free_more_memory(void)
>  
>  	for_each_online_node(nid) {
>  		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
> -						gfp_zone(GFP_NOFS));
> +						NULL, gfp_zone(GFP_NOFS));
>  		if (zrefs->zone)
>  			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
>  						GFP_NOFS);
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-27 14:41:05.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h	2007-09-28 15:49:57.000000000 +0100
> @@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
>  void cpuset_update_task_memory_state(void);
>  #define cpuset_nodes_subset_current_mems_allowed(nodes) \
>  		nodes_subset((nodes), current->mems_allowed)
> -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
> +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
>  
>  extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
>  extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +103,7 @@ static inline void cpuset_init_current_m
>  static inline void cpuset_update_task_memory_state(void) {}
>  #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
>  
> -static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> +static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
>  {
>  	return 1;
>  }
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h	2007-09-28 15:49:16.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> @@ -184,6 +184,10 @@ static inline void arch_alloc_page(struc
>  extern struct page *
>  FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
>  
> +extern struct page *
> +FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
> +				struct zonelist *, nodemask_t *nodemask));
> +
>  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
>  						unsigned int order)
>  {
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-28 15:48:55.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h	2007-09-28 15:49:57.000000000 +0100
> @@ -64,9 +64,8 @@ struct mempolicy {
>  	atomic_t refcnt;
>  	short policy; 	/* See MPOL_* above */
>  	union {
> -		struct zonelist  *zonelist;	/* bind */
>  		short 		 preferred_node; /* preferred */
> -		nodemask_t	 nodes;		/* interleave */
> +		nodemask_t	 nodes;		/* interleave/bind */
>  		/* undefined for default */
>  	} v;
>  	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-28 15:49:39.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h	2007-09-28 15:49:57.000000000 +0100
> @@ -758,47 +758,85 @@ static inline void encode_zoneref(struct
>  	zoneref->zone_idx = zone_idx(zone);
>  }
>  
> +static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
> +{
> +#ifdef CONFIG_NUMA
> +	return node_isset(zonelist_node_idx(zref), *nodes);
> +#else
> +	return 1;
> +#endif /* CONFIG_NUMA */
> +}
> +
>  /* Returns the first zone at or below highest_zoneidx in a zonelist */
>  static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> +					nodemask_t *nodes,
>  					enum zone_type highest_zoneidx)
>  {
>  	struct zoneref *z;
>  
>  	/* Find the first suitable zone to use for the allocation */
>  	z = zonelist->_zonerefs;
> -	while (zonelist_zone_idx(z) > highest_zoneidx)
> -		z++;
> +	if (likely(nodes == NULL))
> +		while (zonelist_zone_idx(z) > highest_zoneidx)
> +			z++;
> +	else
> +		while (zonelist_zone_idx(z) > highest_zoneidx ||
> +				(z->zone && !zref_in_nodemask(z, nodes)))
> +			z++;
>  
>  	return z;
>  }
>  
>  /* Returns the next zone at or below highest_zoneidx in a zonelist */
>  static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
> +					nodemask_t *nodes,
>  					enum zone_type highest_zoneidx)
>  {
> -	/* Find the next suitable zone to use for the allocation */
> -	while (zonelist_zone_idx(z) > highest_zoneidx)
> -		z++;
> +	/*
> +	 * Find the next suitable zone to use for the allocation.
> +	 * Only filter based on nodemask if it's set
> +	 */
> +	if (likely(nodes == NULL))
> +		while (zonelist_zone_idx(z) > highest_zoneidx)
> +			z++;
> +	else
> +		while (zonelist_zone_idx(z) > highest_zoneidx ||
> +				(z->zone && !zref_in_nodemask(z, nodes)))
> +			z++;
>  
>  	return z;
>  }
>  
>  /**
> - * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> + * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
>   * @zone - The current zone in the iterator
>   * @z - The current pointer within zonelist->zones being iterated
>   * @zlist - The zonelist being iterated
>   * @highidx - The zone index of the highest zone to return
> + * @nodemask - Nodemask allowed by the allocator
>   *
> - * This iterator iterates though all zones at or below a given zone index.
> + * This iterator iterates though all zones at or below a given zone index and
> + * within a given nodemask
>   */
> -#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> -	for (z = first_zones_zonelist(zlist, highidx),			\
> +#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
> +	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
>  					zone = zonelist_zone(z++);	\
>  		zone;							\
> -		z = next_zones_zonelist(z, highidx),			\
> +		z = next_zones_zonelist(z, nodemask, highidx),		\
>  					zone = zonelist_zone(z++))
>  
> +/**
> + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> + * @zone - The current zone in the iterator
> + * @z - The current pointer within zonelist->zones being iterated
> + * @zlist - The zonelist being iterated
> + * @highidx - The zone index of the highest zone to return
> + *
> + * This iterator iterates though all zones at or below a given zone index.
> + */
> +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> +	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
> +
>  #ifdef CONFIG_SPARSEMEM
>  #include <asm/sparsemem.h>
>  #endif
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 15:49:39.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c	2007-09-28 15:49:57.000000000 +0100
> @@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
>  }
>  
>  /**
> - * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
> - * @zl: the zonelist to be checked
> + * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed
> + * @nodemask: the nodemask to be checked
>   *
> - * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
> + * Are any of the nodes in the nodemask allowed in current->mems_allowed?
>   */
> -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
>  {
> -	int i;
> -
> -	for (i = 0; zl->_zonerefs[i].zone; i++) {
> -		int nid = zonelist_node_idx(zl->_zonerefs[i]);
> -
> -		if (node_isset(nid, current->mems_allowed))
> -			return 1;
> -	}
> -	return 0;
> +	return nodes_intersect(nodemask, current->mems_allowed);
                 ^^^^^^^^^^^^^^^ -- should be nodes_intersects, I think.
>  }
>  
>  /*
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c	2007-09-28 15:49:39.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c	2007-09-28 15:49:57.000000000 +0100
> @@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
>   	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
>  }
>  
> -/* Generate a custom zonelist for the BIND policy. */
> -static struct zonelist *bind_zonelist(nodemask_t *nodes)
> +/* Check that the nodemask contains at least one populated zone */
> +static int is_valid_nodemask(nodemask_t *nodemask)
>  {
> -	struct zonelist *zl;
> -	int num, max, nd;
> -	enum zone_type k;
> +	int nd, k;
>  
> -	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
> -	max++;			/* space for zlcache_ptr (see mmzone.h) */
> -	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
> -	if (!zl)
> -		return ERR_PTR(-ENOMEM);
> -	zl->zlcache_ptr = NULL;
> -	num = 0;
> -	/* First put in the highest zones from all nodes, then all the next 
> -	   lower zones etc. Avoid empty zones because the memory allocator
> -	   doesn't like them. If you implement node hot removal you
> -	   have to fix that. */
> -	k = MAX_NR_ZONES - 1;
> -	while (1) {
> -		for_each_node_mask(nd, *nodes) { 
> -			struct zone *z = &NODE_DATA(nd)->node_zones[k];
> -			if (z->present_pages > 0) 
> -				encode_zoneref(z, &zl->_zonerefs[num++]);
> -		}
> -		if (k == 0)
> -			break;
> -		k--;
> -	}
> -	if (num == 0) {
> -		kfree(zl);
> -		return ERR_PTR(-EINVAL);
> +	/* Check that there is something useful in this mask */
> +	k = policy_zone;
> +
> +	for_each_node_mask(nd, *nodemask) {
> +		struct zone *z = &NODE_DATA(nd)->node_zones[k];
> +		if (z->present_pages > 0)
> +			return 1;
>  	}
> -	zl->_zonerefs[num].zone = NULL;
> -	return zl;
> +
> +	return 0;
>  }
>  
>  /* Create a new policy */
> @@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
>  			policy->v.preferred_node = -1;
>  		break;
>  	case MPOL_BIND:
> -		policy->v.zonelist = bind_zonelist(nodes);
> -		if (IS_ERR(policy->v.zonelist)) {
> -			void *error_code = policy->v.zonelist;
> +		if (!is_valid_nodemask(nodes)) {
>  			kmem_cache_free(policy_cache, policy);
> -			return error_code;
> +			return ERR_PTR(-EINVAL);
>  		}
> +		policy->v.nodes = *nodes;
>  		break;
>  	}
>  	policy->policy = mode;
> @@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
>  /* Fill a zone bitmap for a policy */
>  static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
>  {
> -	int i;
> -
>  	nodes_clear(*nodes);
>  	switch (p->policy) {
> -	case MPOL_BIND:
> -		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
> -			struct zoneref *zref;
> -			zref = &p->v.zonelist->_zonerefs[i];
> -			node_set(zonelist_node_idx(zref), *nodes);
> -		}
> -		break;
>  	case MPOL_DEFAULT:
>  		break;
> +	case MPOL_BIND:
> +		/* Fall through */
>  	case MPOL_INTERLEAVE:
>  		*nodes = p->v.nodes;
>  		break;
> @@ -1131,6 +1103,18 @@ static struct mempolicy * get_vma_policy
>  	return pol;
>  }
>  
> +/* Return a nodemask representing a mempolicy */
> +static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
> +{
> +	/* Lower zones don't get a nodemask applied for MPOL_BIND */
> +	if (unlikely(policy->policy == MPOL_BIND &&
> +			gfp_zone(gfp) >= policy_zone &&
> +			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
> +		return &policy->v.nodes;
> +
> +	return NULL;
> +}
> +
>  /* Return a zonelist representing a mempolicy */
>  static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
>  {
> @@ -1143,11 +1127,6 @@ static struct zonelist *zonelist_policy(
>  			nd = numa_node_id();
>  		break;
>  	case MPOL_BIND:
> -		/* Lower zones don't get a policy applied */
> -		/* Careful: current->mems_allowed might have moved */
> -		if (gfp_zone(gfp) >= policy_zone)
> -			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
> -				return policy->v.zonelist;
>  		/*FALL THROUGH*/
>  	case MPOL_INTERLEAVE: /* should not happen */
>  	case MPOL_DEFAULT:
> @@ -1191,7 +1170,13 @@ unsigned slab_node(struct mempolicy *pol
>  		 * Follow bind policy behavior and start allocation at the
>  		 * first node.
>  		 */
> -		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
> +		struct zonelist *zonelist;
> +		struct zoneref *z;
> +		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
> +		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
> +		z = first_zones_zonelist(zonelist, &policy->v.nodes,
> +							highest_zoneidx);
> +		return zonelist_node_idx(z);
>  	}
>  
>  	case MPOL_PREFERRED:
> @@ -1349,7 +1334,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
>  	/*
>  	 * fast path:  default or task policy
>  	 */
> -	return __alloc_pages(gfp, 0, zl);
> +	return __alloc_pages_nodemask(gfp, 0, zl, nodemask_policy(gfp, pol));
>  }
>  
>  /**
> @@ -1406,14 +1391,6 @@ struct mempolicy *__mpol_copy(struct mem
>  	}
>  	*new = *old;
>  	atomic_set(&new->refcnt, 1);
> -	if (new->policy == MPOL_BIND) {
> -		int sz = ksize(old->v.zonelist);
> -		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
> -		if (!new->v.zonelist) {
> -			kmem_cache_free(policy_cache, new);
> -			return ERR_PTR(-ENOMEM);
> -		}
> -	}
>  	return new;
>  }
>  
> @@ -1427,21 +1404,12 @@ int __mpol_equal(struct mempolicy *a, st
>  	switch (a->policy) {
>  	case MPOL_DEFAULT:
>  		return 1;
> +	case MPOL_BIND:
> +		/* Fall through */
>  	case MPOL_INTERLEAVE:
>  		return nodes_equal(a->v.nodes, b->v.nodes);
>  	case MPOL_PREFERRED:
>  		return a->v.preferred_node == b->v.preferred_node;
> -	case MPOL_BIND: {
> -		int i;
> -		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
> -			struct zone *za, *zb;
> -			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
> -			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
> -			if (za != zb)
> -				return 0;
> -		}
> -		return b->v.zonelist->_zonerefs[i].zone == NULL;
> -	}
>  	default:
>  		BUG();
>  		return 0;
> @@ -1453,8 +1421,6 @@ void __mpol_free(struct mempolicy *p)
>  {
>  	if (!atomic_dec_and_test(&p->refcnt))
>  		return;
> -	if (p->policy == MPOL_BIND)
> -		kfree(p->v.zonelist);
>  	p->policy = MPOL_DEFAULT;
>  	kmem_cache_free(policy_cache, p);
>  }
> @@ -1745,6 +1711,8 @@ static void mpol_rebind_policy(struct me
>  	switch (pol->policy) {
>  	case MPOL_DEFAULT:
>  		break;
> +	case MPOL_BIND:
> +		/* Fall through */
>  	case MPOL_INTERLEAVE:
>  		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
>  		pol->v.nodes = tmp;
> @@ -1757,32 +1725,6 @@ static void mpol_rebind_policy(struct me
>  						*mpolmask, *newmask);
>  		*mpolmask = *newmask;
>  		break;
> -	case MPOL_BIND: {
> -		nodemask_t nodes;
> -		struct zoneref *z;
> -		struct zonelist *zonelist;
> -
> -		nodes_clear(nodes);
> -		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
> -			node_set(zonelist_node_idx(z), nodes);
> -		nodes_remap(tmp, nodes, *mpolmask, *newmask);
> -		nodes = tmp;
> -
> -		zonelist = bind_zonelist(&nodes);
> -
> -		/* If no mem, then zonelist is NULL and we keep old zonelist.
> -		 * If that old zonelist has no remaining mems_allowed nodes,
> -		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
> -		 */
> -
> -		if (!IS_ERR(zonelist)) {
> -			/* Good - got mem - substitute new zonelist */
> -			kfree(pol->v.zonelist);
> -			pol->v.zonelist = zonelist;
> -		}
> -		*mpolmask = *newmask;
> -		break;
> -	}
>  	default:
>  		BUG();
>  		break;
> @@ -1845,9 +1787,7 @@ static inline int mpol_to_str(char *buff
>  		break;
>  
>  	case MPOL_BIND:
> -		get_zonemask(pol, &nodes);
> -		break;
> -
> +		/* Fall through */
>  	case MPOL_INTERLEAVE:
>  		nodes = pol->v.nodes;
>  		break;
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c
> --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c	2007-09-28 15:49:39.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c	2007-09-28 15:49:57.000000000 +0100
> @@ -1420,7 +1420,7 @@ static void zlc_mark_zone_full(struct zo
>   * a page.
>   */
>  static struct page *
> -get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> +get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
>  		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
>  {
>  	struct zoneref *z;
> @@ -1431,7 +1431,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
>  	int zlc_active = 0;		/* set if using zonelist_cache */
>  	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
>  
> -	z = first_zones_zonelist(zonelist, high_zoneidx);
> +	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
>  	classzone_idx = zonelist_zone_idx(z);
>  
>  zonelist_scan:
> @@ -1439,7 +1439,8 @@ zonelist_scan:
>  	 * Scan zonelist, looking for a zone with enough free.
>  	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  	 */
> -	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> +						high_zoneidx, nodemask) {
>  		if (NUMA_BUILD && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> @@ -1545,9 +1546,9 @@ static void set_page_owner(struct page *
>  /*
>   * This is the 'heart' of the zoned buddy allocator.
>   */
> -struct page * fastcall
> -__alloc_pages(gfp_t gfp_mask, unsigned int order,
> -		struct zonelist *zonelist)
> +static struct page *
> +__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
> +			struct zonelist *zonelist, nodemask_t *nodemask)
>  {
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> @@ -1576,7 +1577,7 @@ restart:
>  		return NULL;
>  	}
>  
> -	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> +	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
>  			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
>  	if (page)
>  		goto got_pg;
> @@ -1621,7 +1622,7 @@ restart:
>  	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
>  	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
>  	 */
> -	page = get_page_from_freelist(gfp_mask, order, zonelist,
> +	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  						high_zoneidx, alloc_flags);
>  	if (page)
>  		goto got_pg;
> @@ -1634,7 +1635,7 @@ rebalance:
>  		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
>  nofail_alloc:
>  			/* go through the zonelist yet again, ignoring mins */
> -			page = get_page_from_freelist(gfp_mask, order,
> +			page = get_page_from_freelist(gfp_mask, nodemask, order,
>  				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
>  			if (page)
>  				goto got_pg;
> @@ -1669,7 +1670,7 @@ nofail_alloc:
>  		drain_all_local_pages();
>  
>  	if (likely(did_some_progress)) {
> -		page = get_page_from_freelist(gfp_mask, order,
> +		page = get_page_from_freelist(gfp_mask, nodemask, order,
>  					zonelist, high_zoneidx, alloc_flags);
>  		if (page)
>  			goto got_pg;
> @@ -1685,8 +1686,9 @@ nofail_alloc:
>  		 * a parallel oom killing, we must fail if we're still
>  		 * under heavy pressure.
>  		 */
> -		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
> -			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
> +		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
> +			order, zonelist, high_zoneidx,
> +			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
>  		if (page) {
>  			clear_zonelist_oom(zonelist, gfp_mask);
>  			goto got_pg;
> @@ -1739,6 +1741,20 @@ got_pg:
>  	return page;
>  }
>  
> +struct page * fastcall
> +__alloc_pages(gfp_t gfp_mask, unsigned int order,
> +		struct zonelist *zonelist)
> +{
> +	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> +}
> +
> +struct page * fastcall
> +__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> +		struct zonelist *zonelist, nodemask_t *nodemask)
> +{
> +	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
> +}
> +
>  EXPORT_SYMBOL(__alloc_pages);
>  
>  /*
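
To make the filtering concrete, here is a minimal userspace sketch of walking a single node-ordered zonelist with both a zone-index ceiling and an optional nodemask, in the spirit of first_zones_zonelist()/next_zones_zonelist() above. Everything prefixed toy_ is hypothetical: the struct layout, the 64-bit mask and the explicit 'valid' terminator flag are simplifications for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified zoneref. valid == 0 marks the end of the list. */
struct toy_zoneref {
	int valid;	/* stands in for a non-NULL zone pointer */
	int node;	/* node the zone belongs to (must be < 64 here) */
	int zone_idx;	/* zone type index; higher means "higher" zone */
};

/*
 * Skip entries above highest_zoneidx and, if nodes is non-NULL, entries
 * whose node is not set in *nodes. Returns the first acceptable entry,
 * or the terminator if none is left.
 */
static const struct toy_zoneref *
toy_next_zoneref(const struct toy_zoneref *z, const uint64_t *nodes,
		 int highest_zoneidx)
{
	while (z->valid &&
	       (z->zone_idx > highest_zoneidx ||
		(nodes && !(*nodes & (UINT64_C(1) << z->node)))))
		z++;
	return z;
}

/* Record the node of every acceptable entry, in zonelist order. */
static int toy_collect_nodes(const struct toy_zoneref *zl,
			     const uint64_t *nodes, int highest_zoneidx,
			     int *out, int max)
{
	const struct toy_zoneref *z;
	int n = 0;

	z = toy_next_zoneref(zl, nodes, highest_zoneidx);
	while (z->valid && n < max) {
		out[n++] = z->node;
		z = toy_next_zoneref(z + 1, nodes, highest_zoneidx);
	}
	return n;
}
```

With a list ordered for local node 2 (node 2, then 0, then 1) and a mask of nodes {0,1}, the walk returns node 0 and then node 1. The result follows the zonelist's own ordering, and no separate per-policy list has to be built.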


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-28 15:37   ` Lee Schermerhorn
@ 2007-09-28 18:28     ` Mel Gorman
  2007-09-28 18:38       ` Paul Jackson
  2007-09-28 21:03       ` Lee Schermerhorn
  0 siblings, 2 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 18:28 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: akpm, linux-kernel, linux-mm, rientjes, kamezawa.hiroyu, clameter

On (28/09/07 11:37), Lee Schermerhorn didst pronounce:
> Still need to fix 'nodes_intersect' -> 'nodes_intersects'.  See below.
> 
> On Fri, 2007-09-28 at 15:25 +0100, Mel Gorman wrote:
> > The MPOL_BIND policy creates a zonelist that is used for allocations belonging
> > to that thread that can use the policy_zone. As the per-node zonelist is
> > already being filtered based on a zone id, this patch adds a version of
> > __alloc_pages() that takes a nodemask for further filtering. This eliminates
> > the need for MPOL_BIND to create a custom zonelist. A positive benefit of
> > this is that allocations using MPOL_BIND now use the local-node-ordered
> > zonelist instead of a custom node-id-ordered zonelist.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Christoph Lameter <clameter@sgi.com>
> > ---
> > 
> >  fs/buffer.c               |    2 
> >  include/linux/cpuset.h    |    4 -
> >  include/linux/gfp.h       |    4 +
> >  include/linux/mempolicy.h |    3 
> >  include/linux/mmzone.h    |   58 +++++++++++++---
> >  kernel/cpuset.c           |   18 +----
> >  mm/mempolicy.c            |  144 +++++++++++------------------------------
> >  mm/page_alloc.c           |   40 +++++++----
> >  8 files changed, 131 insertions(+), 142 deletions(-)
> > 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c	2007-09-28 15:49:39.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c	2007-09-28 15:49:57.000000000 +0100
> > @@ -376,7 +376,7 @@ static void free_more_memory(void)
> >  
> >  	for_each_online_node(nid) {
> >  		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
> > -						gfp_zone(GFP_NOFS));
> > +						NULL, gfp_zone(GFP_NOFS));
> >  		if (zrefs->zone)
> >  			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
> >  						GFP_NOFS);
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-27 14:41:05.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h	2007-09-28 15:49:57.000000000 +0100
> > @@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
> >  void cpuset_update_task_memory_state(void);
> >  #define cpuset_nodes_subset_current_mems_allowed(nodes) \
> >  		nodes_subset((nodes), current->mems_allowed)
> > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
> > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
> >  
> >  extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> >  extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
> > @@ -103,7 +103,7 @@ static inline void cpuset_init_current_m
> >  static inline void cpuset_update_task_memory_state(void) {}
> >  #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
> >  
> > -static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> > +static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> >  {
> >  	return 1;
> >  }
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h	2007-09-28 15:49:16.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> > @@ -184,6 +184,10 @@ static inline void arch_alloc_page(struc
> >  extern struct page *
> >  FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
> >  
> > +extern struct page *
> > +FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
> > +				struct zonelist *, nodemask_t *nodemask));
> > +
> >  static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
> >  						unsigned int order)
> >  {
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-28 15:48:55.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h	2007-09-28 15:49:57.000000000 +0100
> > @@ -64,9 +64,8 @@ struct mempolicy {
> >  	atomic_t refcnt;
> >  	short policy; 	/* See MPOL_* above */
> >  	union {
> > -		struct zonelist  *zonelist;	/* bind */
> >  		short 		 preferred_node; /* preferred */
> > -		nodemask_t	 nodes;		/* interleave */
> > +		nodemask_t	 nodes;		/* interleave/bind */
> >  		/* undefined for default */
> >  	} v;
> >  	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-28 15:49:39.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h	2007-09-28 15:49:57.000000000 +0100
> > @@ -758,47 +758,85 @@ static inline void encode_zoneref(struct
> >  	zoneref->zone_idx = zone_idx(zone);
> >  }
> >  
> > +static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	return node_isset(zonelist_node_idx(zref), *nodes);
> > +#else
> > +	return 1;
> > +#endif /* CONFIG_NUMA */
> > +}
> > +
> >  /* Returns the first zone at or below highest_zoneidx in a zonelist */
> >  static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
> > +					nodemask_t *nodes,
> >  					enum zone_type highest_zoneidx)
> >  {
> >  	struct zoneref *z;
> >  
> >  	/* Find the first suitable zone to use for the allocation */
> >  	z = zonelist->_zonerefs;
> > -	while (zonelist_zone_idx(z) > highest_zoneidx)
> > -		z++;
> > +	if (likely(nodes == NULL))
> > +		while (zonelist_zone_idx(z) > highest_zoneidx)
> > +			z++;
> > +	else
> > +		while (zonelist_zone_idx(z) > highest_zoneidx ||
> > +				(z->zone && !zref_in_nodemask(z, nodes)))
> > +			z++;
> >  
> >  	return z;
> >  }
> >  
> >  /* Returns the next zone at or below highest_zoneidx in a zonelist */
> >  static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
> > +					nodemask_t *nodes,
> >  					enum zone_type highest_zoneidx)
> >  {
> > -	/* Find the next suitable zone to use for the allocation */
> > -	while (zonelist_zone_idx(z) > highest_zoneidx)
> > -		z++;
> > +	/*
> > +	 * Find the next suitable zone to use for the allocation.
> > +	 * Only filter based on nodemask if it's set
> > +	 */
> > +	if (likely(nodes == NULL))
> > +		while (zonelist_zone_idx(z) > highest_zoneidx)
> > +			z++;
> > +	else
> > +		while (zonelist_zone_idx(z) > highest_zoneidx ||
> > +				(z->zone && !zref_in_nodemask(z, nodes)))
> > +			z++;
> >  
> >  	return z;
> >  }
> >  
> >  /**
> > - * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > + * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
> >   * @zone - The current zone in the iterator
> >   * @z - The current pointer within zonelist->zones being iterated
> >   * @zlist - The zonelist being iterated
> >   * @highidx - The zone index of the highest zone to return
> > + * @nodemask - Nodemask allowed by the allocator
> >   *
> > - * This iterator iterates though all zones at or below a given zone index.
> > + * This iterator iterates though all zones at or below a given zone index and
> > + * within a given nodemask
> >   */
> > -#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > -	for (z = first_zones_zonelist(zlist, highidx),			\
> > +#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
> > +	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
> >  					zone = zonelist_zone(z++);	\
> >  		zone;							\
> > -		z = next_zones_zonelist(z, highidx),			\
> > +		z = next_zones_zonelist(z, nodemask, highidx),		\
> >  					zone = zonelist_zone(z++))
> >  
> > +/**
> > + * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
> > + * @zone - The current zone in the iterator
> > + * @z - The current pointer within zonelist->zones being iterated
> > + * @zlist - The zonelist being iterated
> > + * @highidx - The zone index of the highest zone to return
> > + *
> > + * This iterator iterates though all zones at or below a given zone index.
> > + */
> > +#define for_each_zone_zonelist(zone, z, zlist, highidx) \
> > +	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
> > +
> >  #ifdef CONFIG_SPARSEMEM
> >  #include <asm/sparsemem.h>
> >  #endif
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c
> > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 15:49:39.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c	2007-09-28 15:49:57.000000000 +0100
> > @@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
> >  }
> >  
> >  /**
> > - * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
> > - * @zl: the zonelist to be checked
> > + * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed
> > + * @nodemask: the nodemask to be checked
> >   *
> > - * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
> > + * Are any of the nodes in the nodemask allowed in current->mems_allowed?
> >   */
> > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> >  {
> > -	int i;
> > -
> > -	for (i = 0; zl->_zonerefs[i].zone; i++) {
> > -		int nid = zonelist_node_idx(zl->_zonerefs[i]);
> > -
> > -		if (node_isset(nid, current->mems_allowed))
> > -			return 1;
> > -	}
> > -	return 0;
> > +	return nodes_intersect(nodemask, current->mems_allowed);
>                  ^^^^^^^^^^^^^^^ -- should be nodes_intersects, I think.

Crap, you're right, I missed the warning about implicit declarations. I
apologise. This is the corrected version:

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/fs/buffer.c	2007-09-28 19:23:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c	2007-09-28 19:23:14.000000000 +0100
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-28 19:22:22.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/cpuset.h	2007-09-28 19:23:14.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/gfp.h	2007-09-28 19:22:56.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 19:23:14.000000000 +0100
@@ -184,6 +184,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-28 19:22:46.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h	2007-09-28 19:23:14.000000000 +0100
@@ -64,9 +64,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-28 19:23:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h	2007-09-28 19:23:14.000000000 +0100
@@ -758,47 +758,85 @@ static inline void encode_zoneref(struct
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
 	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
 	z = zonelist->_zonerefs;
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
 					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
 					zone = zonelist_zone(z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 19:23:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c	2007-09-28 19:27:01.000000000 +0100
@@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/mempolicy.c	2007-09-28 19:23:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c	2007-09-28 19:23:14.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				encode_zoneref(z, &zl->_zonerefs[num++]);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zonerefs[num].zone = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1131,6 +1103,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1143,11 +1127,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1191,7 +1170,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1349,7 +1334,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	/*
 	 * fast path:  default or task policy
 	 */
-	return __alloc_pages(gfp, 0, zl);
+	return __alloc_pages_nodemask(gfp, 0, zl, nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1406,14 +1391,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1427,21 +1404,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1453,8 +1421,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1745,6 +1711,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1757,32 +1725,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1845,9 +1787,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/mm/page_alloc.c	2007-09-28 19:23:05.000000000 +0100
+++ linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c	2007-09-28 19:23:14.000000000 +0100
@@ -1420,7 +1420,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1431,7 +1431,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
@@ -1439,7 +1439,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1545,9 +1546,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1576,7 +1577,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1621,7 +1622,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1634,7 +1635,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1669,7 +1670,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1685,8 +1686,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
@@ -1739,6 +1741,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*
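
For readers following the mmzone.h hunk above, here is a minimal userspace
sketch of the filtering walk, assuming a nodemask that fits in one 64-bit
word. Every model_* name is hypothetical and stands in for the kernel types;
unlike the patch, the sketch tests for the terminating entry explicitly
instead of relying on the terminator's zone index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct zoneref; not the kernel definition. */
struct model_zoneref {
	int populated;	/* 0 marks the terminating entry (zone == NULL) */
	int zone_idx;	/* index of the zone within its node */
	int node;	/* node the zone belongs to */
};

/* Assumption: a nodemask is modelled as a single 64-bit word. */
typedef uint64_t model_nodemask;

static int model_in_nodemask(const struct model_zoneref *z,
				const model_nodemask *nodes)
{
	assert(z->node >= 0 && z->node < 64);
	return (int)((*nodes >> z->node) & 1u);
}

/*
 * Skip entries whose zone index is too high or, when a nodemask is
 * supplied, whose node is not in it. A NULL mask means "do not filter".
 */
static const struct model_zoneref *
model_next_zoneref(const struct model_zoneref *z,
			const model_nodemask *nodes, int highest_zoneidx)
{
	while (z->populated &&
	       (z->zone_idx > highest_zoneidx ||
		(nodes != NULL && !model_in_nodemask(z, nodes))))
		z++;
	return z;
}

/* Count the entries the iterator would visit. */
static int model_count_zones(const struct model_zoneref *list,
				const model_nodemask *nodes,
				int highest_zoneidx)
{
	const struct model_zoneref *z;
	int n = 0;

	z = model_next_zoneref(list, nodes, highest_zoneidx);
	while (z->populated) {
		n++;
		z = model_next_zoneref(z + 1, nodes, highest_zoneidx);
	}
	return n;
}
```

Passing NULL as the mask reproduces the unfiltered walk, which is the same
relationship the patch sets up between for_each_zone_zonelist() and
for_each_zone_zonelist_nodemask().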

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-28 18:28     ` Mel Gorman
@ 2007-09-28 18:38       ` Paul Jackson
  2007-09-28 21:03       ` Lee Schermerhorn
  1 sibling, 0 replies; 35+ messages in thread
From: Paul Jackson @ 2007-09-28 18:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Lee.Schermerhorn, akpm, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

Mel replied to Lee:
> > > +	return nodes_intersect(nodemask, current->mems_allowed);
> >                  ^^^^^^^^^^^^^^^ -- should be nodes_intersects, I think.
> 
> Crap, you're right, I missed the warning about implicit declarations. I
> apologise. This is the corrected version

I found myself making that same error, saying 'nodes_intersect' instead
of 'nodes_intersects' the other day.  And I might be the one who invented
that name ;).

This would probably be too noisy and too little gain to do on the
Linux kernel, but if this were just a little private project of my own,
I'd be running a script over the whole thing, modifying all 30 or so
instances of bitmap_intersects, cpus_intersects and nodes_intersects so
as to remove the final 's' character.
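
As a reminder of what the helper being discussed computes, here is a small
userspace sketch, assuming a one-word mask. The model_* names are
hypothetical and are not the kernel macros. It also shows the second half of
the fix in the corrected patch: the caller holds a pointer, so it has to
dereference it before handing the mask to the by-value helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical one-word stand-in for nodemask_t. */
typedef struct {
	uint64_t bits;
} model_nodemask_t;

/* Non-zero when the two masks share at least one node. */
static int model_nodes_intersects(model_nodemask_t a, model_nodemask_t b)
{
	return (a.bits & b.bits) != 0;
}

/*
 * Mirrors the shape of the corrected function: the mask arrives by
 * pointer and is dereferenced for the by-value helper.
 */
static int model_nodemask_valid_mems_allowed(const model_nodemask_t *nodemask,
						model_nodemask_t mems_allowed)
{
	assert(nodemask != 0);
	return model_nodes_intersects(*nodemask, mems_allowed);
}
```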

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-28 18:28     ` Mel Gorman
  2007-09-28 18:38       ` Paul Jackson
@ 2007-09-28 21:03       ` Lee Schermerhorn
  1 sibling, 0 replies; 35+ messages in thread
From: Lee Schermerhorn @ 2007-09-28 21:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, linux-kernel, linux-mm, rientjes, kamezawa.hiroyu, clameter

On Fri, 2007-09-28 at 19:28 +0100, Mel Gorman wrote:
> On (28/09/07 11:37), Lee Schermerhorn didst pronounce:
> > Still need to fix 'nodes_intersect' -> 'nodes_intersects'.  See below.
> > 
<snip>
> > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c
> > > --- linux-2.6.23-rc8-mm2-020_zoneid_zonelist/kernel/cpuset.c	2007-09-28 15:49:39.000000000 +0100
> > > +++ linux-2.6.23-rc8-mm2-030_filter_nodemask/kernel/cpuset.c	2007-09-28 15:49:57.000000000 +0100
> > > @@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
> > >  }
> > >  
> > >  /**
> > > - * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
> > > - * @zl: the zonelist to be checked
> > > + * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed
> > > + * @nodemask: the nodemask to be checked
> > >   *
> > > - * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
> > > + * Are any of the nodes in the nodemask allowed in current->mems_allowed?
> > >   */
> > > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> > > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> > >  {
> > > -	int i;
> > > -
> > > -	for (i = 0; zl->_zonerefs[i].zone; i++) {
> > > -		int nid = zonelist_node_idx(zl->_zonerefs[i]);
> > > -
> > > -		if (node_isset(nid, current->mems_allowed))
> > > -			return 1;
> > > -	}
> > > -	return 0;
> > > +	return nodes_intersect(nodemask, current->mems_allowed);
> >                  ^^^^^^^^^^^^^^^ -- should be nodes_intersects, I think.
> 
> Crap, you're right, I missed the warning about implicit declarations. I
> apologise. This is the corrected version

Mel:  

When I'm rebasing a patch series, I use a little script [shell function,
actually] to make just the sources modified by each patch, before moving
on to the next.  That way, I have fewer log messages to look at, and
warnings and such jump out so I can fix them while I'm at the patch that
causes them.  That's how I caught this one.  Here's the script, in case
you're interested:

--------------------------

#qm - quilt make -- attempt to compile all .c's in patch
# Note:  some files might not compile if they wouldn't build in 
# the current config.
qm()
{
#	__in_ktree qm || return

	make silentoldconfig; # in case patch changes a Kconfig*

	quilt files | \
	while read file xxx
	do
		ftype=${file##*.}
		if [[ "$ftype" != "c" ]]
		then
			# echo "Skipping $file" >&2
			continue
		fi
		f=${file%.*}
		echo "make $f.o" >&2
		make $f.o
	done
}

---------------------------

This is part of a larger set of quilt wrappers that, being basically
lazy, I use to reduce typing.   I've commented out one dependency on
other parts of the "environment".  To use this, I build an unpatched
kernel before starting the rebase, so that the .config and all of the
pieces are in place for the incremental makes.  

Works for me...

Later,
Lee





^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
                   ` (4 preceding siblings ...)
  2007-09-28 14:25 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
@ 2007-09-28 14:25 ` Mel Gorman
  2007-10-09  1:11   ` Nishanth Aravamudan
  5 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-28 14:25 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
to use memory only from a node local to the CPU. As we can now filter the
zonelist based on a nodemask, we filter the standard node zonelist for zones
on the local node when GFP_THISNODE is specified.

When GFP_THISNODE is used, a temporary nodemask is created with only the
node local to the CPU set. This allows us to eliminate the second zonelist.
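
A rough userspace sketch of the idea, assuming a one-word nodemask: the
single per-node zonelist is filtered through a temporary mask holding only
the local node, which yields what the dedicated no-fallback zonelist used to
hold. MODEL_GFP_THISNODE and the model_* names are hypothetical and do not
come from this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_GFP_THISNODE 0x1u	/* hypothetical flag value */

/* Hypothetical stand-in for one zonelist entry. */
struct model_zone {
	int node;
	int zone_idx;
};

/*
 * Collect the zones usable for a request into out[] and return how many
 * there are. Without the flag every node is allowed; with it, a temporary
 * mask containing only local_node filters the same zonelist.
 */
static size_t model_allowed_zones(const struct model_zone *zonelist,
					size_t len, unsigned int flags,
					int local_node,
					const struct model_zone **out)
{
	uint64_t mask = UINT64_MAX;	/* default: all nodes allowed */
	size_t i, n = 0;

	assert(local_node >= 0 && local_node < 64);
	if (flags & MODEL_GFP_THISNODE)
		mask = UINT64_C(1) << local_node;

	for (i = 0; i < len; i++) {
		assert(zonelist[i].node >= 0 && zonelist[i].node < 64);
		if ((mask >> zonelist[i].node) & 1u)
			out[n++] = &zonelist[i];
	}
	return n;
}
```

With a fallback-ordered list covering nodes 0 and 1, a request with the flag
set and local_node 0 keeps only the node 0 entries, in their original order.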

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---

 drivers/char/sysrq.c      |    2 -
 fs/buffer.c               |    5 +--
 include/linux/gfp.h       |   20 ++-----------
 include/linux/mempolicy.h |    2 -
 include/linux/mmzone.h    |   14 ---------
 mm/mempolicy.c            |    8 ++---
 mm/page_alloc.c           |   61 ++++++++++++++++++++++-------------------
 mm/slab.c                 |    2 -
 mm/slub.c                 |    2 -
 mm/vmscan.c               |    2 -
 10 files changed, 50 insertions(+), 68 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/drivers/char/sysrq.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/drivers/char/sysrq.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/drivers/char/sysrq.c	2007-09-28 15:48:55.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/drivers/char/sysrq.c	2007-09-28 15:54:13.000000000 +0100
@@ -271,7 +271,7 @@ static struct sysrq_key_op sysrq_term_op
 
 static void moom_callback(struct work_struct *ignored)
 {
-	out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+	out_of_memory(node_zonelist(0), GFP_KERNEL, 0);
 }
 
 static DECLARE_WORK(moom_work, moom_callback);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/fs/buffer.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/fs/buffer.c	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/fs/buffer.c	2007-09-28 15:54:13.000000000 +0100
@@ -375,11 +375,10 @@ static void free_more_memory(void)
 	yield();
 
 	for_each_online_node(nid) {
-		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
+		zrefs = first_zones_zonelist(node_zonelist(nid),
 						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
-			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-						GFP_NOFS);
+			try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS);
 	}
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-09-28 15:55:03.000000000 +0100
@@ -150,28 +150,16 @@ static inline gfp_t set_migrateflags(gfp
  * virtual kernel addresses to the allocated page(s).
  */
 
-static inline enum zone_type gfp_zonelist(gfp_t flags)
-{
-	int base = 0;
-
-	if (NUMA_BUILD && (flags & __GFP_THISNODE))
-		base = 1;
-
-	return base;
-}
-
 /*
- * We get the zone list from the current node and the gfp_mask.
+ * We get the zone list based on a node ID as there is one zone list per node.
  * This zone list contains a maximum of MAXNODES*MAX_NR_ZONES zones.
- * There are two zonelists per node, one for all zones with memory and
- * one containing just zones from the node the zonelist belongs to.
  *
  * For the normal case of non-DISCONTIGMEM systems the NODE_DATA() gets
  * optimized to &contig_page_data at compile-time.
  */
-static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
+static inline struct zonelist *node_zonelist(int nid)
 {
-	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
+	return &NODE_DATA(nid)->node_zonelist;
 }
 
 #ifndef HAVE_ARCH_FREE_PAGE
@@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
-	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
+	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
 }
 
 #ifdef CONFIG_NUMA
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/mempolicy.h
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mempolicy.h	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/mempolicy.h	2007-09-28 15:54:13.000000000 +0100
@@ -240,7 +240,7 @@ static inline void mpol_fix_fork_child_f
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
  		unsigned long addr, gfp_t gfp_flags, struct mempolicy **mpol)
 {
-	return node_zonelist(0, gfp_flags);
+	return node_zonelist(0);
 }
 
 static inline int do_migrate_pages(struct mm_struct *mm,
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/mmzone.h
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/mmzone.h	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/mmzone.h	2007-09-28 15:54:13.000000000 +0100
@@ -390,17 +390,6 @@ static inline int zone_is_oom_locked(con
 #define MAX_ZONES_PER_ZONELIST (MAX_NUMNODES * MAX_NR_ZONES)
 
 #ifdef CONFIG_NUMA
-
-/*
- * The NUMA zonelists are doubled becausse we need zonelists that restrict the
- * allocations to a single node for GFP_THISNODE.
- *
- * [0]	: Zonelist with fallback
- * [1]	: No fallback (GFP_THISNODE)
- */
-#define MAX_ZONELISTS 2
-
-
 /*
  * We cache key information from each zonelist for smaller cache
  * footprint when scanning for free pages in get_page_from_freelist().
@@ -466,7 +455,6 @@ struct zonelist_cache {
 	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
 };
 #else
-#define MAX_ZONELISTS 1
 struct zonelist_cache;
 #endif
 
@@ -531,7 +519,7 @@ extern struct page *mem_map;
 struct bootmem_data;
 typedef struct pglist_data {
 	struct zone node_zones[MAX_NR_ZONES];
-	struct zonelist node_zonelists[MAX_ZONELISTS];
+	struct zonelist node_zonelist;
 	int nr_zones;
 #ifdef CONFIG_FLAT_NODE_MEM_MAP
 	struct page *node_mem_map;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/mempolicy.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/mempolicy.c	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/mempolicy.c	2007-09-28 15:54:13.000000000 +0100
@@ -1136,7 +1136,7 @@ static struct zonelist *zonelist_policy(
 		nd = 0;
 		BUG();
 	}
-	return node_zonelist(nd, gfp);
+	return node_zonelist(nd);
 }
 
 /* Do dynamic interleaving for a process */
@@ -1173,7 +1173,7 @@ unsigned slab_node(struct mempolicy *pol
 		struct zonelist *zonelist;
 		struct zoneref *z;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
-		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
 		z = first_zones_zonelist(zonelist, &policy->v.nodes,
 							highest_zoneidx);
 		return zonelist_node_idx(z);
@@ -1257,7 +1257,7 @@ struct zonelist *huge_zonelist(struct vm
 
 		nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
 		__mpol_free(pol);		/* finished with pol */
-		return node_zonelist(nid, gfp_flags);
+		return node_zonelist(nid);
 	}
 
 	zl = zonelist_policy(GFP_HIGHUSER, pol);
@@ -1279,7 +1279,7 @@ static struct page *alloc_page_interleav
 	struct zonelist *zl;
 	struct page *page;
 
-	zl = node_zonelist(nid, gfp);
+	zl = node_zonelist(nid);
 	page = __alloc_pages(gfp, order, zl);
 	if (page && page_zone(page) == zonelist_zone(&zl->_zonerefs[0]))
 		inc_zone_page_state(page, NUMA_INTERLEAVE_HIT);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/page_alloc.c	2007-09-28 15:49:57.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/page_alloc.c	2007-09-28 15:54:13.000000000 +0100
@@ -1741,10 +1741,33 @@ got_pg:
 	return page;
 }
 
+static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
+{
+	/* Build a nodemask for just this node */
+	int nid = numa_node_id();
+
+	nodes_clear(*nodemask);
+	node_set(nid, *nodemask);
+
+	return nodemask;
+}
+
 struct page * fastcall
 __alloc_pages(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist)
 {
+	/*
+	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
+	 * cost of allocating on the stack or the stack usage becomes
+	 * noticeable, allocate the nodemasks per node at boot or compile time
+	 */
+	if (unlikely(gfp_mask & __GFP_THISNODE)) {
+		nodemask_t nodemask;
+
+		return __alloc_pages_internal(gfp_mask, order,
+				zonelist, nodemask_thisnode(&nodemask));
+	}
+
 	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
 }
 
@@ -1752,6 +1775,9 @@ struct page * fastcall
 __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		struct zonelist *zonelist, nodemask_t *nodemask)
 {
+	/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
+	WARN_ON(gfp_mask & __GFP_THISNODE);
+
 	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
 }
 
@@ -1828,7 +1854,7 @@ static unsigned int nr_free_zone_pages(i
 	/* Just pick one node, since fallback list is circular */
 	unsigned int sum = 0;
 
-	struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);
+	struct zonelist *zonelist = node_zonelist(numa_node_id());
 
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		unsigned long size = zone->present_pages;
@@ -2192,7 +2218,7 @@ static void build_zonelists_in_node_orde
 	int j;
 	struct zonelist *zonelist;
 
-	zonelist = &pgdat->node_zonelists[0];
+	zonelist = &pgdat->node_zonelist;
 	for (j = 0; zonelist->_zonerefs[j].zone != NULL; j++)
 		;
 	j = build_zonelists_node(NODE_DATA(node), zonelist, j,
@@ -2201,19 +2227,6 @@ static void build_zonelists_in_node_orde
 }
 
 /*
- * Build gfp_thisnode zonelists
- */
-static void build_thisnode_zonelists(pg_data_t *pgdat)
-{
-	int j;
-	struct zonelist *zonelist;
-
-	zonelist = &pgdat->node_zonelists[1];
-	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
-	zonelist->_zonerefs[j].zone = NULL;
-}
-
-/*
  * Build zonelists ordered by zone and nodes within zones.
  * This results in conserving DMA zone[s] until all Normal memory is
  * exhausted, but results in overflowing to remote node while memory
@@ -2228,7 +2241,7 @@ static void build_zonelists_in_zone_orde
 	struct zone *z;
 	struct zonelist *zonelist;
 
-	zonelist = &pgdat->node_zonelists[0];
+	zonelist = &pgdat->node_zonelist;
 	pos = 0;
 	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
 		for (j = 0; j < nr_nodes; j++) {
@@ -2308,17 +2321,14 @@ static void set_zonelist_order(void)
 static void build_zonelists(pg_data_t *pgdat)
 {
 	int j, node, load;
-	enum zone_type i;
 	nodemask_t used_mask;
 	int local_node, prev_node;
 	struct zonelist *zonelist;
 	int order = current_zonelist_order;
 
 	/* initialize zonelists */
-	for (i = 0; i < MAX_ZONELISTS; i++) {
-		zonelist = pgdat->node_zonelists + i;
-		zonelist->_zonerefs[0].zone = NULL;
-	}
+	zonelist = &pgdat->node_zonelist;
+	zonelist->_zonerefs[0].zone = NULL;
 
 	/* NUMA-aware ordering of nodes */
 	local_node = pgdat->node_id;
@@ -2360,8 +2370,6 @@ static void build_zonelists(pg_data_t *p
 		/* calculate node order -- i.e., DMA last! */
 		build_zonelists_in_zone_order(pgdat, j);
 	}
-
-	build_thisnode_zonelists(pgdat);
 }
 
 /* Construct the zonelist performance cache - see further mmzone.h */
@@ -2371,7 +2379,7 @@ static void build_zonelist_cache(pg_data
 	struct zonelist_cache *zlc;
 	struct zoneref *z;
 
-	zonelist = &pgdat->node_zonelists[0];
+	zonelist = &pgdat->node_zonelist;
 	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
 	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
 	for (z = zonelist->_zonerefs; z->zone; z++)
@@ -2394,7 +2402,7 @@ static void build_zonelists(pg_data_t *p
 
 	local_node = pgdat->node_id;
 
-	zonelist = &pgdat->node_zonelists[0];
+	zonelist = &pgdat->node_zonelist;
 	j = build_zonelists_node(pgdat, zonelist, 0, MAX_NR_ZONES - 1);
 
 	/*
@@ -2424,8 +2432,7 @@ static void build_zonelists(pg_data_t *p
 /* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
 static void build_zonelist_cache(pg_data_t *pgdat)
 {
-	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-	pgdat->node_zonelists[1].zlcache_ptr = NULL;
+	pgdat->node_zonelist.zlcache_ptr = NULL;
 }
 
 #endif	/* CONFIG_NUMA */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/slab.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/slab.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/slab.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/slab.c	2007-09-28 15:54:13.000000000 +0100
@@ -3248,7 +3248,7 @@ static void *fallback_alloc(struct kmem_
 	if (flags & __GFP_THISNODE)
 		return NULL;
 
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+	zonelist = node_zonelist(slab_node(current->mempolicy));
 	local_flags = flags & (GFP_CONSTRAINT_MASK|GFP_RECLAIM_MASK);
 
 retry:
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/slub.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/slub.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/slub.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/slub.c	2007-09-28 15:54:13.000000000 +0100
@@ -1305,7 +1305,7 @@ static struct page *get_any_partial(stru
 	if (!s->defrag_ratio || get_cycles() % 1024 > s->defrag_ratio)
 		return NULL;
 
-	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
+	zonelist = node_zonelist(slab_node(current->mempolicy));
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		struct kmem_cache_node *n;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/vmscan.c linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/vmscan.c
--- linux-2.6.23-rc8-mm2-030_filter_nodemask/mm/vmscan.c	2007-09-28 15:49:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/vmscan.c	2007-09-28 15:54:13.000000000 +0100
@@ -1363,7 +1363,7 @@ unsigned long try_to_free_mem_cgroup_pag
 	struct zonelist *zonelist;
 
 	for_each_online_node(node) {
-		zonelist = &NODE_DATA(node)->node_zonelists[0];
+		zonelist = &NODE_DATA(node)->node_zonelist;
 		if (do_try_to_free_pages(zonelist, sc.gfp_mask, &sc))
 			return 1;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-09-28 14:25 ` [PATCH 6/6] Use one zonelist that is filtered by nodemask Mel Gorman
@ 2007-10-09  1:11   ` Nishanth Aravamudan
  2007-10-09  1:56     ` Christoph Lameter
  2007-10-09 15:40     ` Mel Gorman
  0 siblings, 2 replies; 35+ messages in thread
From: Nishanth Aravamudan @ 2007-10-09  1:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> 
> Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> to use memory only from a node local to the CPU. As we can now filter the
> zonelist based on a nodemask, we filter the standard node zonelist for zones
> on the local node when GFP_THISNODE is specified.
> 
> When GFP_THISNODE is used, a temporary nodemask is created with only the
> node local to the CPU set. This allows us to eliminate the second zonelist.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Christoph Lameter <clameter@sgi.com>

<snip>

> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-09-28 15:55:03.000000000 +0100

[Reordering the chunks to make my comments a little more logical]

<snip>

> -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> +static inline struct zonelist *node_zonelist(int nid)
>  {
> -	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> +	return &NODE_DATA(nid)->node_zonelist;
>  }
> 
>  #ifndef HAVE_ARCH_FREE_PAGE
> @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
>  	if (nid < 0)
>  		nid = numa_node_id();
> 
> -	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> +	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
>  }

This is alloc_pages_node(), and converting the nid to a zonelist means
that lower levels (specifically __alloc_pages() here) are not aware of
nids, as far as I can tell. This isn't a change, I just want to make
sure I understand...

<snip>

>  struct page * fastcall
>  __alloc_pages(gfp_t gfp_mask, unsigned int order,
>  		struct zonelist *zonelist)
>  {
> +	/*
> +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> +	 * cost of allocating on the stack or the stack usage becomes
> +	 * noticable, allocate the nodemasks per node at boot or compile time
> +	 */
> +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> +		nodemask_t nodemask;
> +
> +		return __alloc_pages_internal(gfp_mask, order,
> +				zonelist, nodemask_thisnode(&nodemask));
> +	}
> +
>  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
>  }

<snip>

So alloc_pages_node() calls here and for THISNODE allocations, we go ask
nodemask_thisnode() for a nodemask...

> +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> +{
> +	/* Build a nodemask for just this node */
> +	int nid = numa_node_id();
> +
> +	nodes_clear(*nodemask);
> +	node_set(nid, *nodemask);
> +
> +	return nodemask;
> +}

<snip>

And nodemask_thisnode() always gives us a nodemask with only the node
the current process is running on set, I think?

That seems really wrong -- and would explain what Lee was seeing while
using my patches for the hugetlb pool allocator to use THISNODE
allocations. All the allocations would end up coming from whatever node
the process happened to be running on. This obviously messes up hugetlb
accounting, as I rely on THISNODE requests returning NULL if they go
off-node.
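
A minimal userspace sketch of the behaviour described above, not kernel
code. The mock_ types and functions and the running_node variable are
assumptions made up for illustration; only the shape of
nodemask_thisnode() follows the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, at most 64 nodes. */
typedef uint64_t mock_nodemask_t;

/* Hypothetical stand-in for the node of the CPU we are running on. */
static int running_node = 0;

static int mock_numa_node_id(void)
{
	return running_node;
}

/* Same shape as nodemask_thisnode() in the patch: no caller-supplied nid. */
static mock_nodemask_t *mock_nodemask_thisnode(mock_nodemask_t *nodemask)
{
	int nid = mock_numa_node_id();

	assert(nid >= 0 && nid < 64);
	*nodemask = 0;                          /* plays the role of nodes_clear() */
	*nodemask |= (mock_nodemask_t)1 << nid; /* plays the role of node_set() */

	return nodemask;
}

/*
 * Mimics alloc_pages_node(nid, __GFP_THISNODE, ...): the nid only selects
 * a zonelist, so the mask used for filtering never sees it.
 * Returns the single node the allocation would be restricted to.
 */
static int mock_thisnode_target(int requested_nid)
{
	mock_nodemask_t mask;
	int node;

	(void)requested_nid; /* lost once it has been turned into a zonelist */
	mock_nodemask_thisnode(&mask);

	for (node = 0; node < 64; node++)
		if (mask & ((mock_nodemask_t)1 << node))
			return node;
	return -1;
}
```

With running_node set to 1, a request for node 3 is restricted to node 1
in this model, which is the off-node behaviour I am worried about.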

I'm not sure how this would be fixed, as __alloc_pages() no longer has
the nid to set in the mask.

Am I wrong in my analysis?

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09  1:11   ` Nishanth Aravamudan
@ 2007-10-09  1:56     ` Christoph Lameter
  2007-10-09  3:17       ` Nishanth Aravamudan
  2007-10-09 15:40     ` Mel Gorman
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-10-09  1:56 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Mel Gorman, akpm, Lee.Schermerhorn, linux-kernel, linux-mm,
	rientjes, kamezawa.hiroyu

On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:

> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> >  		struct zonelist *zonelist)
> >  {
> > +	/*
> > +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +	 * cost of allocating on the stack or the stack usage becomes
> > +	 * noticable, allocate the nodemasks per node at boot or compile time
> > +	 */
> > +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +		nodemask_t nodemask;
> > +
> > +		return __alloc_pages_internal(gfp_mask, order,
> > +				zonelist, nodemask_thisnode(&nodemask));
> > +	}
> > +
> >  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> >  }
> 
> <snip>
> 
> So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> nodemask_thisnode() for a nodemask...

Hmmmm... nodemask_thisnode needs to be passed the zonelist.

> And nodemask_thisnode() always gives us a nodemask with only the node
> the current process is running on set, I think?

Right.

 
> That seems really wrong -- and would explain what Lee was seeing while
> using my patches for the hugetlb pool allocator to use THISNODE
> allocations. All the allocations would end up coming from whatever node
> the process happened to be running on. This obviously messes up hugetlb
> accounting, as I rely on THISNODE requests returning NULL if they go
> off-node.
> 
> I'm not sure how this would be fixed, as __alloc_pages() no longer has
> the nid to set in the mask.
> 
> Am I wrong in my analysis?

No you are right on target. The thisnode function must determine the node 
from the first zone of the zonelist.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09  1:56     ` Christoph Lameter
@ 2007-10-09  3:17       ` Nishanth Aravamudan
  0 siblings, 0 replies; 35+ messages in thread
From: Nishanth Aravamudan @ 2007-10-09  3:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, akpm, Lee.Schermerhorn, linux-kernel, linux-mm,
	rientjes, kamezawa.hiroyu

On 08.10.2007 [18:56:05 -0700], Christoph Lameter wrote:
> On Mon, 8 Oct 2007, Nishanth Aravamudan wrote:
> 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >  		struct zonelist *zonelist)
> > >  {
> > > +	/*
> > > +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +	 * cost of allocating on the stack or the stack usage becomes
> > > +	 * noticable, allocate the nodemasks per node at boot or compile time
> > > +	 */
> > > +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > +		nodemask_t nodemask;
> > > +
> > > +		return __alloc_pages_internal(gfp_mask, order,
> > > +				zonelist, nodemask_thisnode(&nodemask));
> > > +	}
> > > +
> > >  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > <snip>
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> 
> Hmmmm... nodemask_thisnode needs to be passed the zonelist.
> 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> 
> Right.
> 
> 
> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> 
> No you are right on target. The thisnode function must determine the
> node from the first zone of the zonelist.

It seems like I would use zonelist_node_idx() for this, along the lines of:

	static nodemask_t *nodemask_thisnode(nodemask_t *nodemask,
		struct zonelist *zonelist)
	{
		int nid = zonelist_node_idx(&zonelist->_zonerefs[0]);
		/* Build a nodemask for just this node */
		nodes_clear(*nodemask);
		node_set(nid, *nodemask);

		return nodemask;
	}

But I think I need to check that zonelist->_zonerefs->zone is non-NULL, given
this definition of zonelist_node_idx():

	static inline int zonelist_node_idx(struct zoneref *zoneref)
	{
	#ifdef CONFIG_NUMA
		/* zone_to_nid not available in this context */
		return zoneref->zone->node;
	#else
		return 0;
	#endif /* CONFIG_NUMA */
	}

and this comment in __alloc_pages_internal():

	....
	z = zonelist->_zonerefs;  /* the list of zones suitable for gfp_mask */

	if (unlikely(!z->zone)) {
		/*
		 * Happens if we have an empty zonelist as a result of
		 * GFP_THISNODE being used on a memoryless node
		 */
		return NULL;
	}
	...

It seems like zoneref->zone may be NULL in zonelist_node_idx()? Maybe
someone else should look into resolving this :)
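
For what it's worth, a guarded variant might look roughly like the
following userspace sketch. The mock_ types, the 64-node limit and the
name mock_nodemask_firstzone() are assumptions for illustration only;
the real zoneref and nodemask helpers are not used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types, sized for illustration. */
typedef uint64_t mock_nodemask_t;

struct mock_zone {
	int node;
};

struct mock_zoneref {
	struct mock_zone *zone;
};

struct mock_zonelist {
	struct mock_zoneref _zonerefs[4]; /* terminated by a NULL zone */
};

/*
 * Build a mask containing only the node of the first zone in the zonelist.
 * Returns NULL when the zonelist is empty, so the NULL zone pointer is
 * never dereferenced.
 */
static mock_nodemask_t *mock_nodemask_firstzone(mock_nodemask_t *nodemask,
					const struct mock_zonelist *zonelist)
{
	const struct mock_zoneref *z = zonelist->_zonerefs;

	if (!z->zone)
		return NULL;

	assert(z->zone->node >= 0 && z->zone->node < 64);
	*nodemask = (mock_nodemask_t)1 << z->zone->node;

	return nodemask;
}
```

Note that a caller could not simply hand a NULL result on to
__alloc_pages_internal(), since the patch passes NULL there to mean "no
nodemask"; it would presumably have to fail the allocation instead.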

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09  1:11   ` Nishanth Aravamudan
  2007-10-09  1:56     ` Christoph Lameter
@ 2007-10-09 15:40     ` Mel Gorman
  2007-10-09 16:25       ` Nishanth Aravamudan
                         ` (2 more replies)
  1 sibling, 3 replies; 35+ messages in thread
From: Mel Gorman @ 2007-10-09 15:40 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

First, sorry for being so slow to respond. I was getting ill towards the end
of last week and am worse now. Brain is in total mush as a result. Thanks
Lee for finding this problem and thanks to Nish for investigating it properly.

Comments and candidate fix to one zonelist are below.

On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > 
> > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > to use memory only from a node local to the CPU. As we can now filter the
> > zonelist based on a nodemask, we filter the standard node zonelist for zones
> > on the local node when GFP_THISNODE is specified.
> > 
> > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > node local to the CPU set. This allows us to eliminate the second zonelist.
> > 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Christoph Lameter <clameter@sgi.com>
> 
> <snip>
> 
> > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-09-28 15:55:03.000000000 +0100
> 
> [Reordering the chunks to make my comments a little more logical]
> 
> <snip>
> 
> > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > +static inline struct zonelist *node_zonelist(int nid)
> >  {
> > -	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > +	return &NODE_DATA(nid)->node_zonelist;
> >  }
> > 
> >  #ifndef HAVE_ARCH_FREE_PAGE
> > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> >  	if (nid < 0)
> >  		nid = numa_node_id();
> > 
> > -	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > +	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> >  }
> 
> This is alloc_pages_node(), and converting the nid to a zonelist means
> that lower levels (specifically __alloc_pages() here) are not aware of
> nids, as far as I can tell.

Yep, this is correct.

> This isn't a change, I just want to make
> sure I understand...
> 
> <snip>
> 
> >  struct page * fastcall
> >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> >  		struct zonelist *zonelist)
> >  {
> > +	/*
> > +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > +	 * cost of allocating on the stack or the stack usage becomes
> > +	 * noticable, allocate the nodemasks per node at boot or compile time
> > +	 */
> > +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > +		nodemask_t nodemask;
> > +
> > +		return __alloc_pages_internal(gfp_mask, order,
> > +				zonelist, nodemask_thisnode(&nodemask));
> > +	}
> > +
> >  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> >  }
> 
> <snip>
> 
> So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> nodemask_thisnode() for a nodemask...
> 

Also correct.

> > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > +{
> > +	/* Build a nodemask for just this node */
> > +	int nid = numa_node_id();
> > +
> > +	nodes_clear(*nodemask);
> > +	node_set(nid, *nodemask);
> > +
> > +	return nodemask;
> > +}
> 
> <snip>
> 
> And nodemask_thisnode() always gives us a nodemask with only the node
> the current process is running on set, I think?
> 

Yes, I interpreted THISNODE to mean "this node I am running on". Callers
seemed to expect this but the memoryless patches need it to mean "this node
I am running on, unless I specify a node, in which case I mean that node".

> That seems really wrong -- and would explain what Lee was seeing while
> using my patches for the hugetlb pool allocator to use THISNODE
> allocations. All the allocations would end up coming from whatever node
> the process happened to be running on. This obviously messes up hugetlb
> accounting, as I rely on THISNODE requests returning NULL if they go
> off-node.
> 
> I'm not sure how this would be fixed, as __alloc_pages() no longer has
> the nid to set in the mask.
> 
> Am I wrong in my analysis?
> 

No, you seem to be right on the ball. Can you review the following patch
please and determine if it fixes the problem in a satisfactory manner? I
think it does and your tests seemed to give proper values with this patch
applied but brain no worky work and a second opinion is needed.

====
Subject: Use specified node ID with GFP_THISNODE if available

It had been assumed that __GFP_THISNODE meant allocating from the local
node and only the local node. However, users of alloc_pages_node() may also
specify GFP_THISNODE. In this case, only the specified node should be used.
This patch will allocate pages only from the requested node when GFP_THISNODE
is used with alloc_pages_node().

[nacc@us.ibm.com: Detailed analysis of problem]
Found-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>

--- 
 include/linux/gfp.h |   10 ++++++++++
 mm/page_alloc.c     |    8 +++-----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h linux-2.6.23-rc8-mm2-050_memoryless_fix/include/linux/gfp.h
--- linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-10-09 13:52:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-050_memoryless_fix/include/linux/gfp.h	2007-10-09 14:17:06.000000000 +0100
@@ -175,6 +175,7 @@ FASTCALL(__alloc_pages(gfp_t, unsigned i
 extern struct page *
 FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
 				struct zonelist *, nodemask_t *nodemask));
+extern nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask);
 
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
@@ -186,6 +187,15 @@ static inline struct page *alloc_pages_n
 	if (nid < 0)
 		nid = numa_node_id();
 
+	/* Use a temporary nodemask for __GFP_THISNODE allocations */
+	if (unlikely(gfp_mask & __GFP_THISNODE)) {
+		nodemask_t nodemask;
+
+		return __alloc_pages_nodemask(gfp_mask, order,
+				node_zonelist(nid),
+				nodemask_thisnode(nid, &nodemask));
+	}
+
 	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/page_alloc.c linux-2.6.23-rc8-mm2-050_memoryless_fix/mm/page_alloc.c
--- linux-2.6.23-rc8-mm2-040_use_one_zonelist/mm/page_alloc.c	2007-10-09 13:52:39.000000000 +0100
+++ linux-2.6.23-rc8-mm2-050_memoryless_fix/mm/page_alloc.c	2007-10-09 14:15:18.000000000 +0100
@@ -1741,11 +1741,9 @@ got_pg:
 	return page;
 }
 
-static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
+/* Creates a nodemask suitable for GFP_THISNODE allocations */
+nodemask_t *nodemask_thisnode(int nid, nodemask_t *nodemask)
 {
-	/* Build a nodemask for just this node */
-	int nid = numa_node_id();
-
 	nodes_clear(*nodemask);
 	node_set(nid, *nodemask);
 
@@ -1765,7 +1763,7 @@ __alloc_pages(gfp_t gfp_mask, unsigned i
 		nodemask_t nodemask;
 
 		return __alloc_pages_internal(gfp_mask, order,
-				zonelist, nodemask_thisnode(&nodemask));
+			zonelist, nodemask_thisnode(numa_node_id(), &nodemask));
 	}
 
 	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09 15:40     ` Mel Gorman
@ 2007-10-09 16:25       ` Nishanth Aravamudan
  2007-10-09 18:47         ` Christoph Lameter
  2007-10-09 18:12       ` Nishanth Aravamudan
  2007-10-10 15:53       ` Lee Schermerhorn
  2 siblings, 1 reply; 35+ messages in thread
From: Nishanth Aravamudan @ 2007-10-09 16:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
> First, sorry for being so slow to respond. I was getting ill towards the end
> of last week and am worse now. Brain is in total mush as a result. Thanks
> Lee for finding this problem and thanks to Nish for investigating it properly.
> 
> Comments and candidate fix to one zonelist are below.
> 
> On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > > 
> > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > > to use memory only from a node local to the CPU. As we can now filter the
> > > zonelist based on a nodemask, we filter the standard node zonelist for zones
> > > on the local node when GFP_THISNODE is specified.
> > > 
> > > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > > node local to the CPU set. This allows us to eliminate the second zonelist.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Christoph Lameter <clameter@sgi.com>
> > 
> > <snip>
> > 
> > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-09-28 15:55:03.000000000 +0100
> > 
> > [Reordering the chunks to make my comments a little more logical]
> > 
> > <snip>
> > 
> > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > +static inline struct zonelist *node_zonelist(int nid)
> > >  {
> > > -	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > > +	return &NODE_DATA(nid)->node_zonelist;
> > >  }
> > > 
> > >  #ifndef HAVE_ARCH_FREE_PAGE
> > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> > >  	if (nid < 0)
> > >  		nid = numa_node_id();
> > > 
> > > -	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > > +	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> > >  }
> > 
> > This is alloc_pages_node(), and converting the nid to a zonelist means
> > that lower levels (specifically __alloc_pages() here) are not aware of
> > nids, as far as I can tell.
> 
> Yep, this is correct.
> 
> > This isn't a change, I just want to make
> > sure I understand...
> > 
> > <snip>
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >  		struct zonelist *zonelist)
> > >  {
> > > +	/*
> > > +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +	 * cost of allocating on the stack or the stack usage becomes
> > > +	 * noticable, allocate the nodemasks per node at boot or compile time
> > > +	 */
> > > +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > +		nodemask_t nodemask;
> > > +
> > > +		return __alloc_pages_internal(gfp_mask, order,
> > > +				zonelist, nodemask_thisnode(&nodemask));
> > > +	}
> > > +
> > >  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > <snip>
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> > 
> 
> Also correct.
> 
> > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > > +{
> > > +	/* Build a nodemask for just this node */
> > > +	int nid = numa_node_id();
> > > +
> > > +	nodes_clear(*nodemask);
> > > +	node_set(nid, *nodemask);
> > > +
> > > +	return nodemask;
> > > +}
> > 
> > <snip>
> > 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> > 
> 
> Yes, I interpreted THISNODE to mean "this node I am running on".
> Callers seemed to expect this but the memoryless needs it to be "this
> node I am running on unless I specify a node in which case I mean that
> node.".

I think that is only true (THISNODE = local node) if the callpath is not
via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
it depends on whether the nid parameter is -1 (in which case it is also
the local node) or anything else (in which case it is the nid specified).
Ah, reading further along, that's exactly what your changelog indicates
too :)

> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> > 
> 
> No, you seem to be right on the ball. Can you review the following patch
> please and determine if it fixes the problem in a satisfactory manner? I
> think it does and your tests seemed to give proper values with this patch
> applied but brain no worky work and a second opinion is needed.
> 
> ====
> Subject: Use specified node ID with GFP_THISNODE if available
> 
> It had been assumed that __GFP_THISNODE meant allocating from the local
> node and only the local node. However, users of alloc_pages_node() may also
> specify GFP_THISNODE. In this case, only the specified node should be used.
> This patch will allocate pages only from the requested node when GFP_THISNODE
> is used with alloc_pages_node().

I will throw this into my tests and see if it fixes things. It looks
like it should.

Thanks,
Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09 16:25       ` Nishanth Aravamudan
@ 2007-10-09 18:47         ` Christoph Lameter
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Lameter @ 2007-10-09 18:47 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Mel Gorman, akpm, Lee.Schermerhorn, linux-kernel, linux-mm,
	rientjes, kamezawa.hiroyu

On Tue, 9 Oct 2007, Nishanth Aravamudan wrote:

> > > And nodemask_thisnode() always gives us a nodemask with only the node
> > > the current process is running on set, I think?
> > > 
> > 
> > Yes, I interpreted THISNODE to mean "this node I am running on".
> > Callers seemed to expect this but the memoryless needs it to be "this
> > node I am running on unless I specify a node in which case I mean that
> > node.".
> 
> I think that is only true (THISNODE = local node) if the callpath is not
> via alloc_pages_node(). If the callpath is via alloc_pages_node(), then
> it depends on whether the nid parameter is -1 (in which case it is also
> the local node) or anything else (in which case it is the nid specified).
> Ah, reading further along, that's exactly what your changelog indicates
> too :)

Right. THISNODE means the node we are on or the node that we indicated we 
want to allocate from. 
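
The semantics agreed on above can be modelled in user space. This is only
a sketch under stated assumptions, not the kernel code: the model_* names,
the MODEL_GFP_THISNODE flag and the 64-bit mask type are hypothetical
stand-ins for numa_node_id(), __GFP_THISNODE and nodemask_t, and a return
value of 0 is used here to mean "no nodemask filtering".

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_NODES    64      /* hypothetical limit for this model */
#define MODEL_GFP_THISNODE 0x1u    /* stand-in for __GFP_THISNODE */

typedef uint64_t model_nodemask_t; /* one bit per node; stand-in for nodemask_t */

/* Resolve the node an allocation is aimed at: a negative nid means "local". */
static int model_target_node(int nid, int local_nid)
{
	return nid < 0 ? local_nid : nid;
}

/*
 * Return the mask to filter the zonelist with, or 0 for "no filtering".
 * With THISNODE set, only the target node is allowed, whether it was
 * named explicitly or defaulted to the local node.
 */
static model_nodemask_t model_thisnode_mask(int nid, int local_nid,
					    unsigned int flags)
{
	int target = model_target_node(nid, local_nid);

	assert(target >= 0 && target < MODEL_MAX_NODES);
	if (!(flags & MODEL_GFP_THISNODE))
		return 0;
	return (model_nodemask_t)1 << target;
}
```

With the caller running on node 2, passing nid == -1 yields a mask with
only bit 2 set, while passing nid == 3 yields a mask with only bit 3 set.
The second case is the one the hugetlb pool allocator relies on.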


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09 15:40     ` Mel Gorman
  2007-10-09 16:25       ` Nishanth Aravamudan
@ 2007-10-09 18:12       ` Nishanth Aravamudan
  2007-10-10 15:53       ` Lee Schermerhorn
  2 siblings, 0 replies; 35+ messages in thread
From: Nishanth Aravamudan @ 2007-10-09 18:12 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On 09.10.2007 [16:40:53 +0100], Mel Gorman wrote:
> First, sorry for being so slow to respond. I was getting ill towards the end
> of last week and am worse now. Brain is in total mush as a result. Thanks
> Lee for finding this problem and thanks to Nish for investigating it properly.
> 
> Comments and candidate fix to one zonelist are below.
> 
> On (08/10/07 18:11), Nishanth Aravamudan didst pronounce:
> > On 28.09.2007 [15:25:27 +0100], Mel Gorman wrote:
> > > 
> > > Two zonelists exist so that GFP_THISNODE allocations will be guaranteed
> > > to use memory only from a node local to the CPU. As we can now filter the
> > > zonelist based on a nodemask, we filter the standard node zonelist for zones
> > > on the local node when GFP_THISNODE is specified.
> > > 
> > > When GFP_THISNODE is used, a temporary nodemask is created with only the
> > > node local to the CPU set. This allows us to eliminate the second zonelist.
> > > 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Christoph Lameter <clameter@sgi.com>
> > 
> > <snip>
> > 
> > > diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h
> > > --- linux-2.6.23-rc8-mm2-030_filter_nodemask/include/linux/gfp.h	2007-09-28 15:49:57.000000000 +0100
> > > +++ linux-2.6.23-rc8-mm2-040_use_one_zonelist/include/linux/gfp.h	2007-09-28 15:55:03.000000000 +0100
> > 
> > [Reordering the chunks to make my comments a little more logical]
> > 
> > <snip>
> > 
> > > -static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > > +static inline struct zonelist *node_zonelist(int nid)
> > >  {
> > > -	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
> > > +	return &NODE_DATA(nid)->node_zonelist;
> > >  }
> > > 
> > >  #ifndef HAVE_ARCH_FREE_PAGE
> > > @@ -198,7 +186,7 @@ static inline struct page *alloc_pages_n
> > >  	if (nid < 0)
> > >  		nid = numa_node_id();
> > > 
> > > -	return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
> > > +	return __alloc_pages(gfp_mask, order, node_zonelist(nid));
> > >  }
> > 
> > This is alloc_pages_node(), and converting the nid to a zonelist means
> > that lower levels (specifically __alloc_pages() here) are not aware of
> > nids, as far as I can tell.
> 
> Yep, this is correct.
> 
> > This isn't a change, I just want to make
> > sure I understand...
> > 
> > <snip>
> > 
> > >  struct page * fastcall
> > >  __alloc_pages(gfp_t gfp_mask, unsigned int order,
> > >  		struct zonelist *zonelist)
> > >  {
> > > +	/*
> > > +	 * Use a temporary nodemask for __GFP_THISNODE allocations. If the
> > > +	 * cost of allocating on the stack or the stack usage becomes
> > > +	 * noticable, allocate the nodemasks per node at boot or compile time
> > > +	 */
> > > +	if (unlikely(gfp_mask & __GFP_THISNODE)) {
> > > +		nodemask_t nodemask;
> > > +
> > > +		return __alloc_pages_internal(gfp_mask, order,
> > > +				zonelist, nodemask_thisnode(&nodemask));
> > > +	}
> > > +
> > >  	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
> > >  }
> > 
> > <snip>
> > 
> > So alloc_pages_node() calls here and for THISNODE allocations, we go ask
> > nodemask_thisnode() for a nodemask...
> > 
> 
> Also correct.
> 
> > > +static nodemask_t *nodemask_thisnode(nodemask_t *nodemask)
> > > +{
> > > +	/* Build a nodemask for just this node */
> > > +	int nid = numa_node_id();
> > > +
> > > +	nodes_clear(*nodemask);
> > > +	node_set(nid, *nodemask);
> > > +
> > > +	return nodemask;
> > > +}
> > 
> > <snip>
> > 
> > And nodemask_thisnode() always gives us a nodemask with only the node
> > the current process is running on set, I think?
> > 
> 
> Yes, I interpreted THISNODE to mean "this node I am running on". Callers
> seemed to expect this but the memoryless needs it to be "this node I am
> running on unless I specify a node in which case I mean that node.".
> 
> > That seems really wrong -- and would explain what Lee was seeing while
> > using my patches for the hugetlb pool allocator to use THISNODE
> > allocations. All the allocations would end up coming from whatever node
> > the process happened to be running on. This obviously messes up hugetlb
> > accounting, as I rely on THISNODE requests returning NULL if they go
> > off-node.
> > 
> > I'm not sure how this would be fixed, as __alloc_pages() no longer has
> > the nid to set in the mask.
> > 
> > Am I wrong in my analysis?
> > 
> 
> No, you seem to be right on the ball. Can you review the following patch
> please and determine if it fixes the problem in a satisfactory manner? I
> think it does and your tests seemed to give proper values with this patch
> applied but brain no worky work and a second opinion is needed.
> 
> ====
> Subject: Use specified node ID with GFP_THISNODE if available
> 
> It had been assumed that __GFP_THISNODE meant allocating from the local
> node and only the local node. However, users of alloc_pages_node() may also
> specify GFP_THISNODE. In this case, only the specified node should be used.
> This patch will allocate pages only from the requested node when GFP_THISNODE
> is used with alloc_pages_node().
> 
> [nacc@us.ibm.com: Detailed analysis of problem]
> Found-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Mel, seems to fix the problem here. Nice job. Feel free to add:

Tested-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>

Thanks,
Nish


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-09 15:40     ` Mel Gorman
  2007-10-09 16:25       ` Nishanth Aravamudan
  2007-10-09 18:12       ` Nishanth Aravamudan
@ 2007-10-10 15:53       ` Lee Schermerhorn
  2007-10-10 16:05         ` Nishanth Aravamudan
  2007-10-10 16:09         ` Mel Gorman
  2 siblings, 2 replies; 35+ messages in thread
From: Lee Schermerhorn @ 2007-10-10 15:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nishanth Aravamudan, akpm, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
<snip>
> ====
> Subject: Use specified node ID with GFP_THISNODE if available
> 
> It had been assumed that __GFP_THISNODE meant allocating from the local
> node and only the local node. However, users of alloc_pages_node() may also
> specify GFP_THISNODE. In this case, only the specified node should be used.
> This patch will allocate pages only from the requested node when GFP_THISNODE
> is used with alloc_pages_node().
> 
> [nacc@us.ibm.com: Detailed analysis of problem]
> Found-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
<snip>

Mel:  I applied this patch [to your v8 series--the most recent, I
think?] and it does fix the problem.  However, now I'm tripping over
this warning in __alloc_pages_nodemask:

	/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
	WARN_ON(gfp_mask & __GFP_THISNODE);

for each huge page allocated.  Rather slow as my console is a virtual
serial line and the warning includes the stack traceback.

I think we want to just drop this warning, but maybe you have a tighter
condition that you want to warn about?

Lee


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-10 15:53       ` Lee Schermerhorn
@ 2007-10-10 16:05         ` Nishanth Aravamudan
  2007-10-10 16:09         ` Mel Gorman
  1 sibling, 0 replies; 35+ messages in thread
From: Nishanth Aravamudan @ 2007-10-10 16:05 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On 10.10.2007 [11:53:40 -0400], Lee Schermerhorn wrote:
> On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
> <snip>
> > ====
> > Subject: Use specified node ID with GFP_THISNODE if available
> > 
> > It had been assumed that __GFP_THISNODE meant allocating from the local
> > node and only the local node. However, users of alloc_pages_node() may also
> > specify GFP_THISNODE. In this case, only the specified node should be used.
> > This patch will allocate pages only from the requested node when GFP_THISNODE
> > is used with alloc_pages_node().
> > 
> > [nacc@us.ibm.com: Detailed analysis of problem]
> > Found-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> <snip>
> 
> Mel:  I applied this patch [to your v8 series--the most recent, I
> think?] and it does fix the problem.  However, now I'm tripping over
> this warning in __alloc_pages_nodemask:
> 
> 	/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
> 	WARN_ON(gfp_mask & __GFP_THISNODE);
> 
> for each huge page allocated.  Rather slow as my console is a virtual
> serial line and the warning includes the stack traceback.
> 
> I think we want to just drop this warning, but maybe you have a tighter
> condition that you want to warn about?

Sigh, sorry Mel. I see this too on my box. I purely checked the
functionality and didn't think to check the logs, as the tests worked :/

I think it's quite clear that the WARN_ON() makes no sense now, since
alloc_pages_node() now calls __alloc_pages_nodemask().

-Nish

-- 
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/6] Use one zonelist that is filtered by nodemask
  2007-10-10 15:53       ` Lee Schermerhorn
  2007-10-10 16:05         ` Nishanth Aravamudan
@ 2007-10-10 16:09         ` Mel Gorman
  1 sibling, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-10-10 16:09 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Nishanth Aravamudan, akpm, linux-kernel, linux-mm, rientjes,
	kamezawa.hiroyu, clameter

On (10/10/07 11:53), Lee Schermerhorn didst pronounce:
> On Tue, 2007-10-09 at 16:40 +0100, Mel Gorman wrote:
> <snip>
> > ====
> > Subject: Use specified node ID with GFP_THISNODE if available
> > 
> > It had been assumed that __GFP_THISNODE meant allocating from the local
> > node and only the local node. However, users of alloc_pages_node() may also
> > specify GFP_THISNODE. In this case, only the specified node should be used.
> > This patch will allocate pages only from the requested node when GFP_THISNODE
> > is used with alloc_pages_node().
> > 
> > [nacc@us.ibm.com: Detailed analysis of problem]
> > Found-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > 
> <snip>
> 
> Mel:  I applied this patch [to your v8 series--the most recent, I
> think?] and it does fix the problem.  However, now I'm tripping over
> this warning in __alloc_pages_nodemask:
> 
> 	/* Specifying both __GFP_THISNODE and nodemask is stupid. Warn user */
> 	WARN_ON(gfp_mask & __GFP_THISNODE);
> 
> for each huge page allocated.  Rather slow as my console is a virtual
> serial line and the warning includes the stack traceback.
> 
> I think we want to just drop this warning, but maybe you have a tighter
> condition that you want to warn about?
> 

I should drop the warning. The nature of the comment and the WARN_ON was
rooted in my belief that "THISNODE means this node I am running on" and the
warning was defensive programming just in case the assumption was broken. Now
we know the assumption was wrong and the warning is bogus.

Thanks Lee.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9
@ 2007-11-09 14:32 Mel Gorman
  2007-11-09 14:34 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-11-09 14:32 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	nacc, kamezawa.hiroyu, clameter

This is basically a rebase to the broken-out -mm tree. Since v8, two fixes
have been applied for problems that showed up during testing. Most machines
I test -mm on are failing to boot for a variety of reasons, but the patches
appeared to work fine on the two machines that did boot.

Changelog since V8
  o Rebase to 2.6.24-rc2
  o Added ack for the OOM changes
  o Behave correctly when GFP_THISNODE and a node ID are specified
  o Clear up warning over type of nodes_intersects() function

Changelog since V7
  o Rebase to 2.6.23-rc8-mm2

Changelog since V6
  o Fix build bug in relation to memory controller combined with one-zonelist
  o Use while() instead of a stupid looking for()
  o Instead of encoding zone index information in a pointer, this version
    introduces a structure that stores a zone pointer and its index 

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces helper macros for retrieving the zone and node
indices of entries in a zonelist.

The fifth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodemasks
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.
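
As a rough illustration of the idea, the following user-space sketch walks
one list of (zone, index) pairs and skips entries above the highest zone
index a request may use. It is a model under stated assumptions, not the
code in the patches: every model_* name and the three-zone layout are
hypothetical, and the terminator is tested explicitly here rather than
relying on its zone index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone indices, lowest to highest, for this model only. */
enum model_zone_type { MODEL_ZONE_DMA, MODEL_ZONE_NORMAL, MODEL_ZONE_HIGHMEM };

struct model_zone { const char *name; };

/* A zone pointer stored together with its index. */
struct model_zoneref {
	struct model_zone *zone;	/* NULL terminates the list */
	int zone_idx;
};

static struct model_zone model_dma = { "DMA" };
static struct model_zone model_normal = { "Normal" };
static struct model_zone model_highmem = { "HighMem" };

/* The single per-node list, highest zone first. */
static struct model_zoneref model_zonelist[] = {
	{ &model_highmem, MODEL_ZONE_HIGHMEM },
	{ &model_normal,  MODEL_ZONE_NORMAL },
	{ &model_dma,     MODEL_ZONE_DMA },
	{ NULL, 0 },
};

/* Advance to the first entry at or below highest_idx, or the terminator. */
static struct model_zoneref *model_next_zone(struct model_zoneref *z,
					     int highest_idx)
{
	while (z->zone && z->zone_idx > highest_idx)
		z++;
	return z;
}

/* Count the zones in the list that a request limited to highest_idx can use. */
static int model_count_usable(struct model_zoneref *list, int highest_idx)
{
	struct model_zoneref *z;
	int n = 0;

	assert(list != NULL);
	for (z = model_next_zone(list, highest_idx); z->zone;
	     z = model_next_zone(z + 1, highest_idx))
		n++;
	return n;
}
```

A request that may use the highest zone sees all three entries, while a
request limited to MODEL_ZONE_NORMAL starts at the second entry of the same
list. No second list is needed for the more restricted request.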

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

This is the range of performance losses/gains when running against
2.6.23-rc3-mm1. The test machines are a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-11-09 14:32 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9 Mel Gorman
@ 2007-11-09 14:34 ` Mel Gorman
  2008-02-29  5:01   ` Paul Jackson
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-11-09 14:34 UTC (permalink / raw)
  To: akpm
  Cc: Lee.Schermerhorn, Mel Gorman, linux-kernel, linux-mm, rientjes,
	nacc, kamezawa.hiroyu, clameter

The MPOL_BIND policy creates a zonelist that is used for allocations belonging
to that thread that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A further benefit of
this is that allocations using MPOL_BIND now use the local-node-ordered
zonelist instead of a custom node-id-ordered zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <clameter@sgi.com>
---
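
The ordering benefit described above can be seen in a small user-space
model. This is a sketch under stated assumptions and not the patch itself:
the model_* names, the 64-bit mask type and the distance ordering of node
1's zonelist (node 1, then 2, then 0) are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t model_nodemask_t;	/* one bit per node; stand-in for nodemask_t */

struct model_zoneref {
	int present;	/* 0 terminates the list (stands in for a NULL zone) */
	int zone_idx;	/* index of the zone type */
	int nid;	/* node that owns the zone */
};

static int model_in_nodemask(const struct model_zoneref *z,
			     const model_nodemask_t *nodes)
{
	assert(z->nid >= 0 && z->nid < 64);
	return (int)((*nodes >> z->nid) & 1);
}

/* Skip entries above highest_idx or, when nodes is non-NULL, outside it. */
static const struct model_zoneref *
model_next_zone(const struct model_zoneref *z, const model_nodemask_t *nodes,
		int highest_idx)
{
	while (z->present &&
	       (z->zone_idx > highest_idx ||
		(nodes && !model_in_nodemask(z, nodes))))
		z++;
	return z;
}

/* Node 1's zonelist in assumed distance order: node 1, then 2, then 0. */
static const struct model_zoneref model_node1_zonelist[] = {
	{ 1, 1, 1 }, { 1, 0, 1 },
	{ 1, 1, 2 }, { 1, 0, 2 },
	{ 1, 1, 0 }, { 1, 0, 0 },
	{ 0, 0, 0 },
};

/* Return the node of the first usable zone, or -1 if there is none. */
static int model_first_node(const model_nodemask_t *nodes, int highest_idx)
{
	const struct model_zoneref *z =
		model_next_zone(model_node1_zonelist, nodes, highest_idx);

	return z->present ? z->nid : -1;
}
```

With a bind mask of nodes {0, 2}, a task on node 1 gets node 2 first in this
model because it is the nearest allowed node. A list ordered purely by node
id would have tried node 0 first.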

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   58 +++++++++++++---
 kernel/cpuset.c           |   18 +----
 mm/mempolicy.c            |  144 +++++++++++------------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 131 insertions(+), 142 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/fs/buffer.c linux-2.6.24-rc1-mm-030_filter_nodemask/fs/buffer.c
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/fs/buffer.c	2007-11-08 19:18:27.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/fs/buffer.c	2007-11-08 19:21:22.000000000 +0000
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/cpuset.h	2007-10-24 04:50:57.000000000 +0100
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/cpuset.h	2007-11-08 19:21:22.000000000 +0000
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +103,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/gfp.h	2007-11-08 19:11:18.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/gfp.h	2007-11-08 19:21:22.000000000 +0000
@@ -184,6 +184,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/mempolicy.h	2007-11-08 19:08:12.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/mempolicy.h	2007-11-08 19:21:22.000000000 +0000
@@ -64,9 +64,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/include/linux/mmzone.h	2007-11-08 19:18:27.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/include/linux/mmzone.h	2007-11-08 19:21:22.000000000 +0000
@@ -755,47 +755,85 @@ static inline void encode_zoneref(struct
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
 	struct zoneref *z;
 
 	/* Find the first suitable zone to use for the allocation */
 	z = zonelist->_zonerefs;
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates though all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
 					zone = zonelist_zone(z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
 					zone = zonelist_zone(z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates though all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.24-rc1-mm-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/kernel/cpuset.c	2007-11-08 19:18:27.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/kernel/cpuset.c	2007-11-08 19:21:22.000000000 +0000
@@ -1868,22 +1868,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.24-rc1-mm-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/mm/mempolicy.c	2007-11-08 19:18:27.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/mm/mempolicy.c	2007-11-08 19:21:22.000000000 +0000
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				encode_zoneref(z, &zl->_zonerefs[num++]);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zonerefs[num].zone = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1148,6 +1120,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1160,11 +1144,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1208,7 +1187,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1366,7 +1351,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	/*
 	 * fast path:  default or task policy
 	 */
-	return __alloc_pages(gfp, 0, zl);
+	return __alloc_pages_nodemask(gfp, 0, zl, nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1423,14 +1408,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1444,21 +1421,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1470,8 +1438,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1762,6 +1728,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1774,32 +1742,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1862,9 +1804,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.24-rc1-mm-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.24-rc1-mm-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.24-rc1-mm-020_zoneid_zonelist/mm/page_alloc.c	2007-11-08 19:18:27.000000000 +0000
+++ linux-2.6.24-rc1-mm-030_filter_nodemask/mm/page_alloc.c	2007-11-08 19:21:23.000000000 +0000
@@ -1399,7 +1399,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1410,7 +1410,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
@@ -1418,7 +1418,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1524,9 +1525,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1555,7 +1556,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1600,7 +1601,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1613,7 +1614,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1648,7 +1649,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1664,8 +1665,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page) {
 			clear_zonelist_oom(zonelist, gfp_mask);
 			goto got_pg;
@@ -1718,6 +1720,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-11-09 14:34 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
@ 2008-02-29  5:01   ` Paul Jackson
  2008-02-29 14:49     ` Lee Schermerhorn
  0 siblings, 1 reply; 35+ messages in thread
From: Paul Jackson @ 2008-02-29  5:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: akpm, Lee.Schermerhorn, linux-kernel, linux-mm, rientjes, nacc,
	kamezawa.hiroyu, clameter

Mel wrote:
> A positive benefit of
> this is that allocations using MPOL_BIND now use the local-node-ordered
> zonelist instead of a custom node-id-ordered zonelist.

Could you update the now obsolete documentation (perhaps just delete
the no longer correct remark):

Documentation/vm/numa_memory_policy.txt:

        MPOL_BIND:  This mode specifies that memory must come from the
        set of nodes specified by the policy.

            The memory policy APIs do not specify an order in which the nodes
            will be searched.  However, unlike "local allocation", the Bind
            policy does not consider the distance between the nodes.  Rather,
            allocations will fallback to the nodes specified by the policy in
            order of numeric node id.  Like everything in Linux, this is subject
            to change.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2008-02-29  5:01   ` Paul Jackson
@ 2008-02-29 14:49     ` Lee Schermerhorn
  0 siblings, 0 replies; 35+ messages in thread
From: Lee Schermerhorn @ 2008-02-29 14:49 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Mel Gorman, akpm, linux-kernel, linux-mm, rientjes, nacc,
	kamezawa.hiroyu, clameter

On Thu, 2008-02-28 at 23:01 -0600, Paul Jackson wrote:
> Mel wrote:
> > A positive benefit of
> > this is that allocations using MPOL_BIND now use the local-node-ordered
> > zonelist instead of a custom node-id-ordered zonelist.
> 
> Could you update the now obsolete documentation (perhaps just delete
> the no longer correct remark):
> 
> Documentation/vm/numa_memory_policy.txt:
> 
>         MPOL_BIND:  This mode specifies that memory must come from the
>         set of nodes specified by the policy.
> 
>             The memory policy APIs do not specify an order in which the nodes
>             will be searched.  However, unlike "local allocation", the Bind
>             policy does not consider the distance between the nodes.  Rather,
>             allocations will fallback to the nodes specified by the policy in
>             order of numeric node id.  Like everything in Linux, this is subject
>             to change.
> 

Yes, will do.  

Thanks, Lee


* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7
@ 2007-09-13 17:52 Mel Gorman
  2007-09-13 17:53 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-13 17:52 UTC (permalink / raw)
  To: Lee.Schermerhorn
  Cc: Mel Gorman, linux-kernel, linux-mm, kamezawa.hiroyu, clameter

Hi Lee,

This is the patchset I would like tested. It has Kamezawa-san's approach of
using a structure instead of pointer packing. While it consumes more cache,
as Christoph pointed out, it should be an easier starting point to optimise
once workloads are identified that can show performance gains/regressions. The
pointer packing is a potential optimisation but once in place, it's difficult
to alter again.
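
To make the trade-off concrete, here is a minimal userspace sketch of the two
representations. It is not the kernel code: the struct zone stand-in, the mask
width and the helper names (make_zoneref, pack_zone, packed_zone,
packed_zone_idx) are all assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct zone; only its alignment matters here. */
struct zone {
	_Alignas(8) int id;
};

/* Structure approach: the zone index is stored next to the pointer. */
struct zoneref {
	struct zone *zone;
	int zone_idx;
};

static struct zoneref make_zoneref(struct zone *zone, int zone_idx)
{
	struct zoneref ref = { zone, zone_idx };
	return ref;
}

/* 8-byte alignment leaves the three low pointer bits free (assumption). */
#define ZONE_IDX_MASK ((uintptr_t)0x7)

/* Pointer-packing approach: hide the index in the low pointer bits. */
static uintptr_t pack_zone(struct zone *zone, int zone_idx)
{
	assert(((uintptr_t)zone & ZONE_IDX_MASK) == 0);
	assert(zone_idx >= 0 && zone_idx <= (int)ZONE_IDX_MASK);
	return (uintptr_t)zone | (uintptr_t)zone_idx;
}

/* Recover the zone pointer by masking the index bits off again. */
static struct zone *packed_zone(uintptr_t packed)
{
	return (struct zone *)(packed & ~ZONE_IDX_MASK);
}

/* Recover the zone index from the low bits. */
static int packed_zone_idx(uintptr_t packed)
{
	return (int)(packed & ZONE_IDX_MASK);
}
```

The structure costs extra bytes per entry but every access is a plain load.
The packed form is smaller but needs a mask on every dereference and depends
on the alignment of struct zone, which is what makes it awkward to change
later.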

Please let me know how it works out for you.

Changelog since V6
  o Fix build bug in relation to memory controller combined with one-zonelist
  o Use while() instead of a stupid looking for()
  o Instead of encoding zone index information in a pointer, this version
    introduces a structure that stores a zone pointer and its index

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodelists
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.
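
As a rough illustration of what "one zonelist that is filtered" means, the
sketch below walks a single list and skips entries above the highest zone
index the caller may use. It is a simplified userspace model, not the kernel
code: the three-zone enum, the reduced struct zone and the helper names
(skip_high_zones, count_usable_zones) are assumptions. The list is assumed to
end with an entry whose zone is NULL and whose index is 0.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified zone types, lowest first (assumption for this sketch). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

struct zone {
	int node;
	enum zone_type idx;
};

struct zoneref {
	struct zone *zone;
	int zone_idx;
};

/*
 * Advance to the first entry at or below highest_zoneidx. The terminator
 * has zone_idx 0, so the loop always stops there at the latest.
 */
static struct zoneref *skip_high_zones(struct zoneref *z, int highest_zoneidx)
{
	while (z->zone_idx > highest_zoneidx)
		z++;
	return z;
}

/* Count the entries of one shared list that a given request may use. */
static int count_usable_zones(struct zoneref *list, int highest_zoneidx)
{
	int n = 0;
	struct zoneref *z = skip_high_zones(list, highest_zoneidx);

	while (z->zone) {
		n++;
		z = skip_high_zones(z + 1, highest_zoneidx);
	}
	return n;
}
```

The same list serves every request type. Only the highest_zoneidx argument
differs between callers, so no per-GFP copy of the list is needed.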

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

This is the range of performance losses/gains when running against
2.6.23-rc3-mm1. The set of test machines is a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-13 17:52 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7 Mel Gorman
@ 2007-09-13 17:53 ` Mel Gorman
  0 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-13 17:53 UTC (permalink / raw)
  To: Lee.Schermerhorn
  Cc: Mel Gorman, linux-kernel, linux-mm, kamezawa.hiroyu, clameter

The MPOL_BIND policy creates a zonelist that is used for allocations belonging
to that thread that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A positive benefit of
this is that allocations using MPOL_BIND now use the local-node-ordered
zonelist instead of a custom node-id-ordered zonelist.
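
The ordering benefit can be seen in a small userspace model. This is only a
sketch under stated assumptions: the nodemask is reduced to an unsigned long
bitmap (sketch_nodemask_t), struct zone carries nothing but a node id, and
next_allowed() and visit_order() are hypothetical helpers that mirror the
filtering idea rather than the kernel functions themselves.

```c
#include <assert.h>
#include <stddef.h>

struct zone {
	int node;
};

struct zoneref {
	struct zone *zone;
	int zone_idx;
};

/* Bit n set means node n is allowed (simplified stand-in for nodemask_t). */
typedef unsigned long sketch_nodemask_t;

/* A NULL mask means no node filtering at all. */
static int zref_allowed(const struct zoneref *z, const sketch_nodemask_t *nodes)
{
	return nodes == NULL || ((*nodes >> z->zone->node) & 1UL);
}

/* Skip entries that are too high or outside the mask; stop at the terminator. */
static struct zoneref *next_allowed(struct zoneref *z,
				    const sketch_nodemask_t *nodes,
				    int highest_zoneidx)
{
	while (z->zone_idx > highest_zoneidx ||
	       (z->zone && !zref_allowed(z, nodes)))
		z++;
	return z;
}

/* Record the node ids in the order the filtered walk visits them. */
static int visit_order(struct zoneref *list, const sketch_nodemask_t *nodes,
		       int highest_zoneidx, int *out, int max)
{
	int n = 0;
	struct zoneref *z = next_allowed(list, nodes, highest_zoneidx);

	while (z->zone && n < max) {
		out[n++] = z->zone->node;
		z = next_allowed(z + 1, nodes, highest_zoneidx);
	}
	return n;
}
```

If the local node's list is ordered by distance as nodes 2, 3, 0, 1 and the
policy allows nodes 0 and 3, the walk visits node 3 before node 0. The list's
own ordering is kept and the mask only removes entries.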

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   62 +++++++++++++----
 kernel/cpuset.c           |   18 +----
 mm/mempolicy.c            |  145 ++++++++++++-----------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 133 insertions(+), 145 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c	2007-09-13 11:57:44.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c	2007-09-13 11:57:52.000000000 +0100
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h	2007-09-13 11:57:52.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h	2007-09-13 11:57:36.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h	2007-09-13 11:57:52.000000000 +0100
@@ -185,6 +185,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-13 11:57:27.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h	2007-09-13 11:57:52.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-13 12:00:25.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h	2007-09-13 12:15:04.000000000 +0100
@@ -724,32 +724,72 @@ static inline void encode_zoneref(struct
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	struct zoneref *z;
+	struct zoneref *z = zonelist->_zonerefs;
 
-	for (z = zonelist->_zonerefs;
-			zonelist_zone_idx(z) > highest_zoneidx;
-			z++)
-		;
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	while (zonelist_zone_idx(z) > highest_zoneidx)
-		z++;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		while (zonelist_zone_idx(z) > highest_zoneidx)
+			z++;
+	else
+		while (zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes)))
+			z++;
 
 	return z;
 }
 
 /**
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
+ *
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
+ */
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
+					zone = zonelist_zone(z++);	\
+		zone;							\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
+					zone = zonelist_zone(z++))
+
+/**
  * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
@@ -759,11 +799,7 @@ static inline struct zoneref *next_zones
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
-					zone = zonelist_zone(z++);	\
-		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
-					zone = zonelist_zone(z++))
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c	2007-09-13 11:57:44.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c	2007-09-13 11:57:52.000000000 +0100
@@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c	2007-09-13 11:57:44.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c	2007-09-13 13:43:14.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				encode_zoneref(z, &zl->_zonerefs[num++]);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zonerefs[num].zone = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1106,6 +1078,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1118,11 +1102,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1166,7 +1145,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1285,7 +1270,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, 0,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1342,14 +1328,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1363,21 +1341,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1389,8 +1358,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1681,6 +1648,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1693,32 +1662,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1781,9 +1724,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c	2007-09-13 13:45:54.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c	2007-09-13 13:46:13.000000000 +0100
@@ -1419,7 +1419,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1430,7 +1430,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
@@ -1438,7 +1438,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1544,9 +1545,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1575,7 +1576,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1620,7 +1621,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1633,7 +1634,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1668,7 +1669,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1679,8 +1680,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1728,6 +1730,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .


* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6
@ 2007-09-12 21:04 Mel Gorman
  2007-09-12 21:06 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-12 21:04 UTC (permalink / raw)
  To: Lee.Schermerhorn, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, linux-kernel, linux-mm

Kamezawa-san,

This version implements your idea for storing a zone pointer and zone_idx
in a structure within the zonelist instead of encoding information in a
pointer. It has worked out quite well. The performance is comparable on
the tests I've run, with gains/losses similar to those I've seen with pointer
packing, but this code may be easier to understand. However, the zonelist
has doubled in size and consumes more cache lines.
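
As a rough illustration of the trade-off described above, the two
representations could be sketched in userspace C as follows. This is only a
sketch under assumptions: `struct zone` here is a stand-in, the two-bit
`ZONE_IDX_MASK` and the helpers `pack_zone()`, `packed_zone()` and
`packed_idx()` are hypothetical names that do not come from the patches, and
only the `zone` and `zone_idx` fields of `struct zoneref` mirror the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's struct zone; aligned so low address bits are 0. */
struct zone {
	_Alignas(8) int dummy;
};

/* Hypothetical: the two low bits of the pointer carry the zone index. */
#define ZONE_IDX_MASK ((uintptr_t)0x3)

/* Pointer packing: fold the zone index into the low bits of the pointer. */
static inline uintptr_t pack_zone(struct zone *zone, unsigned int idx)
{
	assert(((uintptr_t)zone & ZONE_IDX_MASK) == 0);
	assert(idx <= ZONE_IDX_MASK);
	return (uintptr_t)zone | idx;
}

/* Every access has to mask the index bits off again. */
static inline struct zone *packed_zone(uintptr_t p)
{
	return (struct zone *)(p & ~ZONE_IDX_MASK);
}

static inline unsigned int packed_idx(uintptr_t p)
{
	return (unsigned int)(p & ZONE_IDX_MASK);
}

/* Structure approach: an explicit pair, larger but with no masking. */
struct zoneref {
	struct zone *zone;
	int zone_idx;
};
```

The explicit pair is larger than a single packed word per entry, which is
where the extra zonelist size comes from, in exchange for dropping the mask
on every access.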

I did not put the node_idx into the structure as it was not clear that there
was a real gain from doing that as the node ID is so rarely used. However,
it would be trivial to add if it could be demonstrated to be of real benefit
on workloads that make heavy use of nodemasks. I do not have an appropriate
test environment for measuring that but perhaps someone else does. If they are
willing to check it out, I'll roll a suitable patch.

Any opinions on whether the slight gain in apparent performance in kernbench
is worth the cacheline? It's very difficult to craft a benchmark that notices
the extra line being used so this could be a hand-waving issue.

Changelog since V6
  o Instead of encoding zone index information in a pointer, this version
    introduces a structure that stores a zone pointer and its index 

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.
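
The idea of filtering one zonelist by GFP flags can be sketched as below.
This is a simplified userspace model rather than the patch code: the `name`
field used as a list terminator and the helpers `next_usable()` and
`count_usable()` are made up for illustration, and the highest usable zone
index is passed in directly instead of being derived from a gfp mask.

```c
#include <assert.h>
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

/* Simplified zonelist entry; a NULL name terminates the list. */
struct zoneref {
	const char *name;
	enum zone_type zone_idx;
};

/* Skip entries whose zone index is above what the allocation may use. */
const struct zoneref *next_usable(const struct zoneref *z,
				  enum zone_type highest)
{
	while (z->name && z->zone_idx > highest)
		z++;
	return z;
}

/* Count the zones an allocation limited to 'highest' would visit. */
int count_usable(const struct zoneref *zonelist, enum zone_type highest)
{
	const struct zoneref *z;
	int n = 0;

	for (z = next_usable(zonelist, highest); z->name;
	     z = next_usable(z + 1, highest))
		n++;
	return n;
}
```

With a list ordered HighMem, Normal, DMA, an allocation restricted to
ZONE_NORMAL simply skips the first entry; no second, shorter zonelist needs
to be stored for it.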

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces filtering of the zonelists based on a nodemask.
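
A minimal sketch of nodemask filtering, assuming a much simplified model:
`nodemask_t` is reduced to a single unsigned long with one bit per node, a
negative `node` terminates the list, and `node_allowed()` and
`first_allowed()` are hypothetical helpers, not the functions added by the
patch. As in the patch, a NULL nodemask means no node filtering.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: one bit per node. */
typedef unsigned long nodemask_t;

/* Simplified zonelist entry; node < 0 terminates the list. */
struct zoneref {
	int node;
	int zone_idx;
};

/* A NULL mask allows every node. */
static int node_allowed(int node, const nodemask_t *nodes)
{
	return nodes == NULL || ((*nodes >> node) & 1UL);
}

/* First entry at or below highest_zoneidx that is also in the nodemask. */
const struct zoneref *first_allowed(const struct zoneref *z,
				    const nodemask_t *nodes,
				    int highest_zoneidx)
{
	while (z->node >= 0 &&
	       (z->zone_idx > highest_zoneidx ||
		!node_allowed(z->node, nodes)))
		z++;
	return z;
}
```

Because the list being filtered is the local node's distance-ordered
zonelist, the first match is the nearest allowed node rather than the
allowed node with the lowest id.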

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodemasks
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.

Performance results varied depending on the machine configuration but
usually showed small gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

This is the range of performance losses/gains measured against
2.6.23-rc3-mm1. The test machines are a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-12 21:04 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6 Mel Gorman
@ 2007-09-12 21:06 ` Mel Gorman
  2007-09-12 21:23   ` Christoph Lameter
  2007-09-13 15:49   ` Lee Schermerhorn
  0 siblings, 2 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-12 21:06 UTC (permalink / raw)
  To: Lee.Schermerhorn, kamezawa.hiroyu, clameter
  Cc: Mel Gorman, linux-kernel, linux-mm

The MPOL_BIND policy creates a custom zonelist that is used for a thread's
allocations that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A benefit of this is
that allocations using MPOL_BIND now use the local-node-ordered zonelist
instead of a custom node-id-ordered zonelist.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   65 ++++++++++++++----
 kernel/cpuset.c           |   18 +----
 mm/mempolicy.c            |  145 ++++++++++++-----------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 136 insertions(+), 145 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c	2007-09-12 16:05:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c	2007-09-12 16:05:44.000000000 +0100
@@ -376,7 +376,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zrefs = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-						gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (zrefs->zone)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
 						GFP_NOFS);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h	2007-09-12 16:05:44.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h	2007-09-12 16:05:27.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h	2007-09-12 16:05:44.000000000 +0100
@@ -185,6 +185,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-12 16:05:18.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h	2007-09-12 16:05:44.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-12 16:05:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h	2007-09-12 17:54:51.000000000 +0100
@@ -724,32 +724,75 @@ static inline void encode_zoneref(struct
 	zoneref->zone_idx = zone_idx(zone);
 }
 
+static inline int zref_in_nodemask(struct zoneref *zref, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_node_idx(zref), *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	struct zoneref *z;
+	struct zoneref *z = zonelist->_zonerefs;
 
-	for (z = zonelist->_zonerefs;
-			zonelist_zone_idx(z) > highest_zoneidx;
-			z++)
-		;
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(z) > highest_zoneidx;
+				z++)
+			;
+	else
+		for (; zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes));
+				z++)
+			;
 
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline struct zoneref *next_zones_zonelist(struct zoneref *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	for (; zonelist_zone_idx(z) > highest_zoneidx; z++)
-		;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(z) > highest_zoneidx; z++)
+			;
+	else
+		for (; zonelist_zone_idx(z) > highest_zoneidx ||
+				(z->zone && !zref_in_nodemask(z, nodes));
+				z++)
+			;
 
 	return z;
 }
 
 /**
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
+ *
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
+ */
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
+					zone = zonelist_zone(z++);	\
+		zone;							\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
+					zone = zonelist_zone(z++))
+
+/**
  * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
@@ -759,11 +802,7 @@ static inline struct zoneref *next_zones
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
-					zone = zonelist_zone(z++);	\
-		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
-					zone = zonelist_zone(z++))
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c	2007-09-12 16:05:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c	2007-09-12 16:05:44.000000000 +0100
@@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zonerefs[i].zone; i++) {
-		int nid = zonelist_node_idx(zl->_zonerefs[i]);
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c	2007-09-12 16:05:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c	2007-09-12 16:17:30.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				encode_zoneref(z, &zl->_zonerefs[num++]);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zonerefs[num].zone = NULL;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,12 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
-
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zoneref *zref;
-			zref = &p->v.zonelist->_zonerefs[i];
-			node_set(zonelist_node_idx(zref), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1106,6 +1078,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1118,11 +1102,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1166,7 +1145,13 @@ unsigned slab_node(struct mempolicy *pol
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zonelist_node_idx(policy->v.zonelist->_zonerefs);
+		struct zonelist *zonelist;
+		struct zoneref *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zonelist_node_idx(z);
 	}
 
 	case MPOL_PREFERRED:
@@ -1285,7 +1270,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, 0,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1342,14 +1328,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1363,21 +1341,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zonerefs[i].zone; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(&a->v.zonelist->_zonerefs[i]);
-			zb = zonelist_zone(&b->v.zonelist->_zonerefs[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zonerefs[i].zone == NULL;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1389,8 +1358,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1681,6 +1648,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1693,32 +1662,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		struct zoneref *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zonerefs; z->zone; z++)
-			node_set(zonelist_node_idx(z), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1781,9 +1724,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c	2007-09-12 16:05:35.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c	2007-09-12 16:05:44.000000000 +0100
@@ -1419,7 +1419,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	struct zoneref *z;
@@ -1430,7 +1430,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(z);
 
 zonelist_scan:
@@ -1438,7 +1438,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1544,9 +1545,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1575,7 +1576,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1620,7 +1621,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1633,7 +1634,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1668,7 +1669,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1679,8 +1680,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1728,6 +1730,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-12 21:06 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
@ 2007-09-12 21:23   ` Christoph Lameter
  2007-09-13 10:25     ` Mel Gorman
  2007-09-13 15:49   ` Lee Schermerhorn
  1 sibling, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-09-12 21:23 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Lee.Schermerhorn, kamezawa.hiroyu, linux-kernel, linux-mm

On Wed, 12 Sep 2007, Mel Gorman wrote:

> -			z++)
> -		;
> +	if (likely(nodes == NULL))
> +		for (; zonelist_zone_idx(z) > highest_zoneidx;
> +				z++)
> +			;
> +	else
> +		for (; zonelist_zone_idx(z) > highest_zoneidx ||
> +				(z->zone && !zref_in_nodemask(z, nodes));
> +				z++)
> +			;
>  

Minor nitpick here: "for (;" should become "for ( ;" to have correct 
whitespace. However, it would be clearer to use a while here.

while (zonelist_zone_idx(z) > highest_zoneidx)
		z++;

etc.
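
As a rough illustration of the while-based form suggested above, here is a
userspace sketch rather than the kernel code. The toy_* names are
hypothetical stand-ins for struct zoneref and its helpers, and the nodemask
is assumed to fit in a single unsigned long:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a zonelist entry; not the kernel's struct zoneref. */
struct toy_zoneref {
	int present;	/* 0 marks the list terminator */
	int zone_idx;	/* zone index; 0 on the terminator */
	int node;	/* node id the zone belongs to */
};

/* Assumption: the nodemask is modelled as one unsigned long bitmap. */
static int toy_ref_in_nodemask(const struct toy_zoneref *z, unsigned long nodes)
{
	return (nodes >> z->node) & 1UL;
}

/* Return the first entry at or below highest_zoneidx, optionally in nodes. */
static const struct toy_zoneref *
toy_first_zones(const struct toy_zoneref *z, const unsigned long *nodes,
		int highest_zoneidx)
{
	assert(z != NULL);

	if (nodes == NULL) {
		/* Common case: filter on the zone index only */
		while (z->zone_idx > highest_zoneidx)
			z++;
	} else {
		/* Also skip real entries whose node is not in the mask */
		while (z->zone_idx > highest_zoneidx ||
				(z->present && !toy_ref_in_nodemask(z, *nodes)))
			z++;
	}

	return z;
}
```

Both loops stop on the terminator because its zone index is 0 and it is
never tested against the nodemask.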


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-12 21:23   ` Christoph Lameter
@ 2007-09-13 10:25     ` Mel Gorman
  0 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-13 10:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Lee.Schermerhorn, kamezawa.hiroyu, linux-kernel, linux-mm

On (12/09/07 14:23), Christoph Lameter didst pronounce:
> On Wed, 12 Sep 2007, Mel Gorman wrote:
> 
> > -			z++)
> > -		;
> > +	if (likely(nodes == NULL))
> > +		for (; zonelist_zone_idx(z) > highest_zoneidx;
> > +				z++)
> > +			;
> > +	else
> > +		for (; zonelist_zone_idx(z) > highest_zoneidx ||
> > +				(z->zone && !zref_in_nodemask(z, nodes));
> > +				z++)
> > +			;
> >  
> 
> Minor nitpick here: "for (;" should become "for ( ;" to have correct 
> whitespace. However, it would be clearer to use a while here.
> 
> while (zonelist_zone_idx(z) > highest_zoneidx)
> 		z++;
> 
> etc.

Good point. I'll clean it up and retest. Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-12 21:06 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  2007-09-12 21:23   ` Christoph Lameter
@ 2007-09-13 15:49   ` Lee Schermerhorn
  1 sibling, 0 replies; 35+ messages in thread
From: Lee Schermerhorn @ 2007-09-13 15:49 UTC (permalink / raw)
  To: Mel Gorman; +Cc: kamezawa.hiroyu, clameter, linux-kernel, linux-mm

On Wed, 2007-09-12 at 22:06 +0100, Mel Gorman wrote:
> The MPOL_BIND policy creates a zonelist that is used for allocations belonging
> to that thread that can use the policy_zone. As the per-node zonelist is
> already being filtered based on a zone id, this patch adds a version of
> __alloc_pages() that takes a nodemask for further filtering. This eliminates
> the need for MPOL_BIND to create a custom zonelist. A positive benefit of
> this is that allocations using MPOL_BIND now use the local-node-ordered
> zonelist instead of a custom node-id-ordered zonelist.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
> 
>  fs/buffer.c               |    2 
>  include/linux/cpuset.h    |    4 -
>  include/linux/gfp.h       |    4 +
>  include/linux/mempolicy.h |    3 
>  include/linux/mmzone.h    |   65 ++++++++++++++----
>  kernel/cpuset.c           |   18 +----
>  mm/mempolicy.c            |  145 ++++++++++++-----------------------------
>  mm/page_alloc.c           |   40 +++++++----
>  8 files changed, 136 insertions(+), 145 deletions(-)
<snip>
> diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
> --- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c	2007-09-12 16:05:35.000000000 +0100
> +++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c	2007-09-12 16:05:44.000000000 +0100
> @@ -1516,22 +1516,14 @@ nodemask_t cpuset_mems_allowed(struct ta
>  }
>  
>  /**
> - * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
> - * @zl: the zonelist to be checked
> + * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed
> + * @nodemask: the nodemask to be checked
>   *
> - * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
> + * Are any of the nodes in the nodemask allowed in current->mems_allowed?
>   */
> -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
>  {
> -	int i;
> -
> -	for (i = 0; zl->_zonerefs[i].zone; i++) {
> -		int nid = zonelist_node_idx(zl->_zonerefs[i]);
> -
> -		if (node_isset(nid, current->mems_allowed))
> -			return 1;
> -	}
> -	return 0;
> +	return nodes_intersect(nodemask, current->mems_allowed);
                 nodes_intersects(*nodemask, ... 
>  }
>  
>  /*
<snip>

Still prepping for test.

Lee


* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend)
@ 2007-09-11 21:30 Mel Gorman
  2007-09-11 21:31 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-11 21:30 UTC (permalink / raw)
  To: Lee.Schermerhorn, akpm, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

(Sorry for the resend, I mucked up the TO: line in the earlier sending)

This is the latest version of one-zonelist and it should be solid enough
for wider testing. To briefly summarise, the patchset replaces multiple
zonelists-per-node with one zonelist that is filtered based on nodemask and
GFP flags. I've dropped the patch that replaces inline functions with macros
from the end as it obscures the code for something that may or may not be a
performance benefit on older compilers. If we see performance regressions that
might have something to do with it, the patch is trivial to bring forward.

Andrew, please merge to -mm for wider testing and consideration for merging
to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE
and MPOL_BIND.

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodelists
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.
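
The filtering described above can be sketched in userspace as follows. This
is only an illustration under stated assumptions: the toy_* names are
hypothetical, the nodemask is a single unsigned long, and the
__GFP_THISNODE case is modelled simply as a mask with one bit set:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a zonelist entry; not the kernel's definition. */
struct toy_zoneref {
	int present;	/* 0 marks the list terminator */
	int zone_idx;	/* zone index within its node */
	int node;	/* node id the zone belongs to */
};

/*
 * Walk one zonelist and collect the entries that pass both filters:
 * zone index at or below highest_zoneidx and, if nodes is non-NULL,
 * node present in the mask. Returns the number of entries stored.
 */
static size_t toy_walk_zonelist(const struct toy_zoneref *zl,
				const unsigned long *nodes,
				int highest_zoneidx,
				const struct toy_zoneref **out,
				size_t max_out)
{
	size_t n = 0;

	assert(zl != NULL && out != NULL);

	for (; zl->present; zl++) {
		if (zl->zone_idx > highest_zoneidx)
			continue;	/* zone too high for this request */
		if (nodes && !((*nodes >> zl->node) & 1UL))
			continue;	/* node excluded by the mask */
		if (n < max_out)
			out[n++] = zl;
	}

	return n;
}
```

With a NULL mask only the zone index filters the list. A mask with a single
bit set restricts the same list to one node, which is the role the second
zonelist played before.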

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

This is the range of performance losses/gains when running against
2.6.23-rc3-mm1. The test machines are a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-11 21:30 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend) Mel Gorman
@ 2007-09-11 21:31 ` Mel Gorman
  0 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-11 21:31 UTC (permalink / raw)
  To: Lee.Schermerhorn, akpm, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The MPOL_BIND policy creates a zonelist that is used for allocations belonging
to that thread that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A positive benefit of
this is that allocations using MPOL_BIND now use the local-node-ordered
zonelist instead of a custom node-id-ordered zonelist.
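
A small userspace sketch of the ordering difference, under assumptions: the
node orderings below are made up for illustration, the nodemask is a single
unsigned long, and toy_first_allowed_node() is a hypothetical helper, not a
kernel function:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the first node in node_order that is set in the allowed mask,
 * or -1 if none is. node_order models the order in which a zonelist
 * visits nodes.
 */
static int toy_first_allowed_node(const int *node_order, size_t n,
				unsigned long allowed)
{
	size_t i;

	assert(node_order != NULL);

	for (i = 0; i < n; i++)
		if ((allowed >> node_order[i]) & 1UL)
			return node_order[i];

	return -1;
}
```

For a task on node 2 bound to nodes {0, 2}, a local-node-ordered list such
as {2, 0, 1} yields node 2 first, while a node-id-ordered list {0, 1, 2}
yields node 0 first.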


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/buffer.c               |    4 -
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   61 ++++++++++++++---
 kernel/cpuset.c           |   19 +----
 mm/mempolicy.c            |  144 +++++++++++------------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 136 insertions(+), 143 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c	2007-09-10 16:06:39.000000000 +0100
@@ -376,10 +376,10 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-							gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (*zones)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-							GFP_NOFS);
+						GFP_NOFS);
 	}
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h	2007-09-10 16:06:39.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h	2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h	2007-09-10 16:06:39.000000000 +0100
@@ -185,6 +185,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h	2007-09-10 16:06:39.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h	2007-09-11 13:43:04.000000000 +0100
@@ -718,14 +718,29 @@ static inline unsigned long encode_zone_
 	return encoded;
 }
 
+static inline int zone_in_nodemask(unsigned long zone_addr, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_zone(zone_addr)->node, *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	unsigned long *z;
+	unsigned long *z = zonelist->_zones;
 
-	for (z = zonelist->_zones;
-			zonelist_zone_idx(*z) > highest_zoneidx;
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(*z) > highest_zoneidx;
+			z++)
+		;
+	else
+		for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+				(*z && !zone_in_nodemask(*z, nodes));
 			z++)
 		;
 
@@ -734,31 +749,55 @@ static inline unsigned long *first_zones
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *next_zones_zonelist(unsigned long *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
-		;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
+			;
+	else
+		for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+				(*z && !zone_in_nodemask(*z, nodes));
+			z++)
+			;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates though all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
 					zone = zonelist_zone(*z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
 					zone = zonelist_zone(*z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates though all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c	2007-09-10 16:06:39.000000000 +0100
@@ -1516,22 +1516,17 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. curremt mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
+	int nid;
+	nodemask_t tmp;
 
-	for (i = 0; zl->_zones[i]; i++) {
-		int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersect(nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c	2007-09-10 16:06:39.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				zl->_zones[num++] = encode_zone_idx(z);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zones[num] = 0;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,13 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
 
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zones[i]; i++) {
-			struct zone *zone;
-			zone = zonelist_zone(p->v.zonelist->_zones[i]);
-			node_set(zone_to_nid(zone), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1106,6 +1079,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1118,11 +1103,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1167,8 +1147,12 @@ unsigned slab_node(struct mempolicy *pol
 		 * first node.
 		 */
 		struct zonelist *zonelist;
-		zonelist = policy->v.zonelist;
-		return zone_to_nid(zonelist_zone(zonelist->_zones[0]));
+		unsigned long *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zone_to_nid(zonelist_zone(*z));
 	}
 
 	case MPOL_PREFERRED:
@@ -1287,7 +1271,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, 0,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1344,14 +1329,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1365,21 +1342,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zones[i]; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(a->v.zonelist->_zones[i]);
-			zb = zonelist_zone(b->v.zonelist->_zones[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zones[i] == 0;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1391,8 +1359,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1683,6 +1649,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1695,32 +1663,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		unsigned long *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zones; *z; z++)
-			node_set(zone_to_nid(zonelist_zone(*z)), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1783,9 +1725,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c	2007-09-10 16:06:39.000000000 +0100
@@ -1419,7 +1419,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	unsigned long *z;
@@ -1430,7 +1430,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(*z);
 
 zonelist_scan:
@@ -1438,7 +1438,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1544,9 +1545,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1575,7 +1576,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1620,7 +1621,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1633,7 +1634,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1668,7 +1669,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1679,8 +1680,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1728,6 +1730,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5
@ 2007-09-11 15:19 Mel Gorman
  2007-09-11 15:21 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-09-11 15:19 UTC (permalink / raw)
  To: apw; +Cc: Mel Gorman, linux-kernel, linux-mm

This is the latest version of one-zonelist and it should be solid enough
for wider testing. To briefly summarise, the patchset replaces multiple
zonelists-per-node with one zonelist that is filtered based on nodemask and
GFP flags. I've dropped the patch that replaces inline functions with macros
from the end as it obscures the code for something that may or may not be a
performance benefit on older compilers. If we see performance regressions that
might have something to do with it, the patch is trivial to bring forward.

Andrew, please merge to -mm for wider testing and consideration for merging
to mainline. Minimally, it gets rid of the hack in relation to ZONE_MOVABLE
and MPOL_BIND.

Changelog since V5
  o Rebase to 2.6.23-rc4-mm1
  o Drop patch that replaces inline functions with macros

Changelog since V4
  o Rebase to -mm kernel. Host of memoryless patches collisions dealt with
  o Do not call wakeup_kswapd() for every zone in a zonelist
  o Dropped the FASTCALL removal
  o Have cursor in iterator advance earlier
  o Use nodes_and in cpuset_nodes_valid_mems_allowed()
  o Use defines instead of inlines, noticeably better performance on gcc-3.4
    No difference on later compilers such as gcc 4.1
  o Dropped gfp_skip patch until it is proven to be of benefit. Tests are
    currently inconclusive but it definitely consumes at least one cache
    line

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test
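
The V2 and V3 entries above mention embedding zone_idx information in the
zonelist pointers and storing _zones as unsigned long. As a rough
illustration of that general technique only, here is a minimal sketch that
packs a small index into the low bits of an aligned pointer. It assumes the
zone structure is aligned so that the low two bits of its address are zero.
The names encode_zone(), decode_zone(), decode_zone_idx() and ZONE_IDX_BITS
are hypothetical; the patch's own helpers are encode_zone_idx(),
zonelist_zone() and zonelist_zone_idx(), whose bodies are not shown in this
mail.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct zone, aligned so the low bits are free. */
struct zone {
	_Alignas(8) int node;
	unsigned long present_pages;
};

/* Assumption: two low bits are enough to hold the zone index. */
#define ZONE_IDX_BITS 2
#define ZONE_IDX_MASK ((uintptr_t)((1u << ZONE_IDX_BITS) - 1))

/* Pack the zone index into the unused low bits of the zone address. */
static inline uintptr_t encode_zone(const struct zone *zone, unsigned int idx)
{
	uintptr_t addr = (uintptr_t)zone;

	assert((addr & ZONE_IDX_MASK) == 0);	/* alignment assumption holds */
	assert(idx <= ZONE_IDX_MASK);		/* index fits in the spare bits */
	return addr | idx;
}

/* Recover the zone pointer by masking the index bits off again. */
static inline struct zone *decode_zone(uintptr_t encoded)
{
	return (struct zone *)(encoded & ~ZONE_IDX_MASK);
}

/* Recover the zone index without dereferencing the zone itself. */
static inline unsigned int decode_zone_idx(uintptr_t encoded)
{
	return (unsigned int)(encoded & ZONE_IDX_MASK);
}
```

The point of such a scheme is that an iterator can reject a zone by index
without touching the zone structure, which is where the cache footprint
saving would come from.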

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.
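
To make the idea of one zonelist filtered by GFP flags concrete, the
following is a simplified userspace sketch, not the kernel code. It assumes
the GFP flags have already been reduced to a highest usable zone index (the
role gfp_zone() plays in the patches) and that each zonelist entry carries
its zone index. The struct zone_ref layout and the next_usable() and
count_usable() helpers are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE };

/* Hypothetical zonelist entry: owning node, zone index, end-of-list marker. */
struct zone_ref {
	int node;
	enum zone_type idx;
	int valid;		/* 0 terminates the list */
};

/* Skip entries whose zone index is above what the allocation may use. */
static const struct zone_ref *next_usable(const struct zone_ref *z,
					  enum zone_type highest)
{
	while (z->valid && z->idx > highest)
		z++;
	return z;
}

/* Walk the single zonelist, counting only zones at or below 'highest'. */
static size_t count_usable(const struct zone_ref *zonelist,
			   enum zone_type highest)
{
	size_t n = 0;
	const struct zone_ref *z;

	assert(zonelist != NULL);
	for (z = next_usable(zonelist, highest); z->valid;
	     z = next_usable(z + 1, highest))
		n++;
	return n;
}
```

One list therefore serves every allocation type: a request limited to
ZONE_NORMAL simply skips the higher entries instead of consulting a
separate, shorter zonelist.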

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones where other places use zonelist. The second patch introduces
a helper function node_zonelist() for looking up the appropriate zonelist
for a GFP mask which simplifies patches later in the set.

The third patch replaces multiple zonelists with two zonelists that are
filtered. The two zonelists are due to the fact that the memoryless patchset
introduces a second set of zonelists for __GFP_THISNODE.

The fourth patch introduces filtering of the zonelists based on a nodemask.

The final patch replaces the two zonelists with one zonelist. A nodemask is
created when __GFP_THISNODE is specified to filter the list. The nodemasks
could be pre-allocated with one-per-node but it's not clear that __GFP_THISNODE
is used often enough to be worth the effort.

Performance results varied depending on the machine configuration but were
usually small performance gains. In real workloads the gain/loss will depend
on how much the userspace portion of the benchmark benefits from having more
cache available due to reduced referencing of zonelists.

These are the ranges of performance losses/gains when running against
2.6.23-rc3-mm1. The test machines are a mix of i386, x86_64 and ppc64,
both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.67% to  3.05%
Elapsed   time on Kernbench: -0.25% to  2.96%
page_test from aim9:         -6.98% to  5.60%
brk_test  from aim9:         -3.94% to  4.11%
fork_test from aim9:         -5.72% to  4.14%
exec_test from aim9:         -1.02% to  1.56%

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-09-11 15:19 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman
@ 2007-09-11 15:21 ` Mel Gorman
  0 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-09-11 15:21 UTC (permalink / raw)
  To: apw; +Cc: Mel Gorman, linux-kernel, linux-mm

The MPOL_BIND policy creates a zonelist that is used for allocations belonging
to that thread that can use the policy_zone. As the per-node zonelist is
already being filtered based on a zone id, this patch adds a version of
__alloc_pages() that takes a nodemask for further filtering. This eliminates
the need for MPOL_BIND to create a custom zonelist. A benefit of this is
that allocations using MPOL_BIND now use the local-node-ordered zonelist
instead of a custom node-id-ordered zonelist.
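
As a rough userspace illustration of the filtering described above (not the
kernel implementation), the sketch below walks a node's local zonelist and
skips entries that are either above the allowed zone index or outside an
optional nodemask. It assumes a nodemask fits in one unsigned long and that
a NULL mask means no node filtering, mirroring the NULL that __alloc_pages()
passes in this patch. The struct zl_entry layout, MAX_NODES, and the
node_allowed(), next_zone() and collect_nodes() helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 32	/* assumption: mask fits in one unsigned long */

/* Hypothetical zonelist entry: owning node, zone index, end-of-list marker. */
struct zl_entry {
	int node;
	int idx;
	int valid;	/* 0 terminates the list */
};

/* A NULL nodemask means "no node filtering". */
static int node_allowed(int node, const unsigned long *nodemask)
{
	assert(node >= 0 && node < MAX_NODES);
	return nodemask == NULL || ((*nodemask >> node) & 1UL);
}

/* Advance to the next entry passing both the zone-index and node filters. */
static const struct zl_entry *next_zone(const struct zl_entry *z,
					const unsigned long *nodemask,
					int highest_idx)
{
	while (z->valid &&
	       (z->idx > highest_idx || !node_allowed(z->node, nodemask)))
		z++;
	return z;
}

/* Record the node of each matching zone, in zonelist order. */
static size_t collect_nodes(const struct zl_entry *zonelist,
			    const unsigned long *nodemask, int highest_idx,
			    int *out, size_t max)
{
	size_t n = 0;
	const struct zl_entry *z = next_zone(zonelist, nodemask, highest_idx);

	while (z->valid && n < max) {
		out[n++] = z->node;
		z = next_zone(z + 1, nodemask, highest_idx);
	}
	return n;
}
```

With a zonelist built for node 2 and ordered 2, 1, 0, binding to nodes 0
and 1 visits node 1 before node 0. The order comes from the local zonelist
rather than from node ids, which is the ordering benefit the description
refers to.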


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/buffer.c               |    4 -
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   61 ++++++++++++++---
 kernel/cpuset.c           |   19 +----
 mm/mempolicy.c            |  144 +++++++++++------------------------------
 mm/page_alloc.c           |   40 +++++++----
 8 files changed, 136 insertions(+), 143 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/fs/buffer.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/fs/buffer.c	2007-09-10 16:06:39.000000000 +0100
@@ -376,10 +376,10 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zones = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
-							gfp_zone(GFP_NOFS));
+						NULL, gfp_zone(GFP_NOFS));
 		if (*zones)
 			try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0,
-							GFP_NOFS);
+						GFP_NOFS);
 	}
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/cpuset.h	2007-09-10 09:29:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/cpuset.h	2007-09-10 16:06:39.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -102,7 +102,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/gfp.h	2007-09-10 16:06:22.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/gfp.h	2007-09-10 16:06:39.000000000 +0100
@@ -185,6 +185,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mempolicy.h	2007-09-10 16:06:13.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mempolicy.h	2007-09-10 16:06:39.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/include/linux/mmzone.h	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/include/linux/mmzone.h	2007-09-11 13:43:04.000000000 +0100
@@ -718,14 +718,29 @@ static inline unsigned long encode_zone_
 	return encoded;
 }
 
+static inline int zone_in_nodemask(unsigned long zone_addr, nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_zone(zone_addr)->node, *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	unsigned long *z;
+	unsigned long *z = zonelist->_zones;
 
-	for (z = zonelist->_zones;
-			zonelist_zone_idx(*z) > highest_zoneidx;
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(*z) > highest_zoneidx;
+			z++)
+		;
+	else
+		for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+				(*z && !zone_in_nodemask(*z, nodes));
 			z++)
 		;
 
@@ -734,31 +749,55 @@ static inline unsigned long *first_zones
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *next_zones_zonelist(unsigned long *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	/* Find the next suitable zone to use for the allocation */
-	for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
-		;
+	/*
+	 * Find the next suitable zone to use for the allocation.
+	 * Only filter based on nodemask if it's set
+	 */
+	if (likely(nodes == NULL))
+		for (; zonelist_zone_idx(*z) > highest_zoneidx; z++)
+			;
+	else
+		for (; zonelist_zone_idx(*z) > highest_zoneidx ||
+				(*z && !zone_in_nodemask(*z, nodes));
+			z++)
+			;
 
 	return z;
 }
 
 /**
- * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
  * @zlist - The zonelist being iterated
  * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
  *
- * This iterator iterates though all zones at or below a given zone index.
+ * This iterator iterates through all zones at or below a given zone index and
+ * within a given nodemask
  */
-#define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx),			\
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx),	\
 					zone = zonelist_zone(*z++);	\
 		zone;							\
-		z = next_zones_zonelist(z, highidx),			\
+		z = next_zones_zonelist(z, nodemask, highidx),		\
 					zone = zonelist_zone(*z++))
 
+/**
+ * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ *
+ * This iterator iterates through all zones at or below a given zone index.
+ */
+#define for_each_zone_zonelist(zone, z, zlist, highidx) \
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/kernel/cpuset.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/kernel/cpuset.c	2007-09-10 16:06:39.000000000 +0100
@@ -1516,22 +1516,17 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
+	int nid;
+	nodemask_t tmp;
 
-	for (i = 0; zl->_zones[i]; i++) {
-		int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
-
-		if (node_isset(nid, current->mems_allowed))
-			return 1;
-	}
-	return 0;
+	return nodes_intersects(*nodemask, current->mems_allowed);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/mempolicy.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/mempolicy.c	2007-09-10 16:06:39.000000000 +0100
@@ -134,41 +134,21 @@ static int mpol_check_policy(int mode, n
  	return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
-	k = MAX_NR_ZONES - 1;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				zl->_zones[num++] = encode_zone_idx(z);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	/* Check that there is something useful in this mask */
+	k = policy_zone;
+
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zones[num] = 0;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -201,12 +181,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -484,19 +463,13 @@ static long do_set_mempolicy(int mode, n
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
 
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zones[i]; i++) {
-			struct zone *zone;
-			zone = zonelist_zone(p->v.zonelist->_zones[i]);
-			node_set(zone_to_nid(zone), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1106,6 +1079,18 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static inline nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied for MPOL_BIND */
+	if (unlikely(policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1118,11 +1103,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1167,8 +1147,12 @@ unsigned slab_node(struct mempolicy *pol
 		 * first node.
 		 */
 		struct zonelist *zonelist;
-		zonelist = policy->v.zonelist;
-		return zone_to_nid(zonelist_zone(zonelist->_zones[0]));
+		unsigned long *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zone_to_nid(zonelist_zone(*z));
 	}
 
 	case MPOL_PREFERRED:
@@ -1287,7 +1271,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, 0,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1344,14 +1329,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1365,21 +1342,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zones[i]; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(a->v.zonelist->_zones[i]);
-			zb = zonelist_zone(b->v.zonelist->_zones[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zones[i] == 0;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1391,8 +1359,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1683,6 +1649,8 @@ static void mpol_rebind_policy(struct me
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1695,32 +1663,6 @@ static void mpol_rebind_policy(struct me
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		unsigned long *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zones; *z; z++)
-			node_set(zone_to_nid(zonelist_zone(*z)), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1783,9 +1725,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc4-mm1-020_zoneid_zonelist/mm/page_alloc.c	2007-09-10 16:06:31.000000000 +0100
+++ linux-2.6.23-rc4-mm1-030_filter_nodemask/mm/page_alloc.c	2007-09-10 16:06:39.000000000 +0100
@@ -1419,7 +1419,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	unsigned long *z;
@@ -1430,7 +1430,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone_idx = zonelist_zone_idx(*z);
 
 zonelist_scan:
@@ -1438,7 +1438,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1544,9 +1545,9 @@ static void set_page_owner(struct page *
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
-struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+static struct page *
+__alloc_pages_internal(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1575,7 +1576,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1620,7 +1621,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1633,7 +1634,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1668,7 +1669,7 @@ nofail_alloc:
 		drain_all_local_pages();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1679,8 +1680,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1728,6 +1730,20 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, NULL);
+}
+
+struct page * fastcall
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist, nodemask_t *nodemask)
+{
+	return __alloc_pages_internal(gfp_mask, order, zonelist, nodemask);
+}
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4
@ 2007-08-17 20:16 Mel Gorman
  2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-08-17 20:16 UTC (permalink / raw)
  To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The biggest changes are altering the embedding of zone IDs so that the type is
unsigned long instead of struct zone *, and removing the MPOL_BIND-specific
zonelists in favour of filtering based on node data. The biggest concern is the
last patch where FASTCALL doesn't appear to do the right thing in all cases.

Changelog since V3
  o Fix compile error in the parisc change
  o Calculate gfp_zone only once in __alloc_pages
  o Calculate classzone_idx properly in get_page_from_freelist
  o Alter check so that zone id embedded may still be used on UP
  o Use Kamezawa-san's suggestion for skipping zones in zonelist
  o Add __alloc_pages_nodemask() to filter zonelist based on a nodemask. This
    removes the need for MPOL_BIND to have a custom zonelist
  o Move zonelist iterators and helpers to mm.h
  o Change _zones from struct zone * to unsigned long

Changelog since V2
  o shrink_zones() uses zonelist instead of zonelist->zones
  o hugetlb uses zonelist iterator
  o zone_idx information is embedded in zonelist pointers
  o replace NODE_DATA(nid)->node_zonelist with node_zonelist(nid)

Changelog since V1
  o Break up the patch into 3 patches
  o Introduce iterators for zonelists
  o Performance regression test

The following patches replace multiple zonelists per node with one zonelist
that is filtered based on the GFP flags. The patches as a set fix a bug
with regard to the use of MPOL_BIND and ZONE_MOVABLE. With this patchset,
the MPOL_BIND will apply to the two highest zones when the highest zone
is ZONE_MOVABLE. This should be considered as an alternative fix for the
MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that
filters only custom zonelists. As a bonus, the patchset reduces the cache
footprint of the kernel and should improve performance in a number of cases.
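
As a rough illustration of the "one zonelist that is filtered" idea, here is a
minimal userspace sketch. It is not the kernel code: struct zone_ref,
next_allowed() and count_allowed() are hypothetical names invented for this
example, and the zone types are a simplified stand-in for the kernel's enum.
The point is only that a single list ordered by preference can serve every
allocation type if entries above the requested zone index are skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified zone types, ordered lowest to highest (assumption). */
enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, ZONE_MOVABLE, MAX_NR_ZONES };

/* Hypothetical zonelist entry; valid == 0 terminates the list. */
struct zone_ref {
	int node;
	enum zone_type idx;
	int valid;
};

/* Return the first entry at or after 'z' whose zone index is allowed. */
static const struct zone_ref *next_allowed(const struct zone_ref *z,
					   enum zone_type highest_zoneidx)
{
	assert(z != NULL);
	while (z->valid && z->idx > highest_zoneidx)
		z++;
	return z;
}

/* Count the zones an allocation limited to 'highest_zoneidx' may use. */
static int count_allowed(const struct zone_ref *zonelist,
			 enum zone_type highest_zoneidx)
{
	int n = 0;
	const struct zone_ref *z;

	for (z = next_allowed(zonelist, highest_zoneidx); z->valid;
	     z = next_allowed(z + 1, highest_zoneidx))
		n++;
	return n;
}
```

With a list holding HIGHMEM, NORMAL and DMA zones, an allocation limited to
ZONE_NORMAL walks the same list as a HIGHMEM allocation but skips the
HIGHMEM entries, so no second per-type list is needed.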

The first patch cleans up an inconsistency where direct reclaim uses
zonelist->zones while other places use zonelist.

The second patch replaces multiple zonelists with one zonelist that is
filtered.

The final patch is a fix that depends on the previous two patches. The
patch changes policy zone so that the MPOL_BIND policy gets applied
to the two highest populated zones when the highest populated zone is
ZONE_MOVABLE. Otherwise, MPOL_BIND only applies to the highest populated zone.

The patches passed regression tests with numactltest. Performance results
varied depending on the machine configuration but were usually small
performance gains. The new algorithm relies heavily on the implementation
of zone_idx which is currently pretty expensive. Experiments to optimise
this have shown significant improvements for this algorithm, but that work is
beyond the scope of this patchset. Due to the nature of the change, the results
for other people are likely to vary - it'll usually win but occasionally lose.

In real workloads the gain/loss will depend on how much the userspace
portion of the benchmark benefits from having more cache available due
to reduced referencing of zonelists. I expect it'll be more noticeable on
x86_64 with many zones than on IA64 which typically would only have one
active zonelist-per-node.

This is the range of performance losses/gains I found when running against
2.6.23-rc1-mm2. The test machines are a mix of i386, x86_64 and
ppc64, both NUMA and non-NUMA.

Total CPU time on Kernbench: -0.02% to  0.27%
Elapsed   time on Kernbench: -0.21% to  1.26%
page_test from aim9:         -3.41% to  3.90%
brk_test  from aim9:         -0.20% to 40.94%
fork_test from aim9:         -0.42% to  4.59%
exec_test from aim9:         -0.78% to  1.95%
Size reduction of pg_data_t:  0     to  7808 bytes (depends on alignment)

The TBench figures were too variable between runs to draw conclusions from but
there didn't appear to be any regressions there. The hackbench results for both
sockets and pipes were within noise. I haven't gone through lmbench.

These three patches are a standalone set which address the MPOL_BIND problem
with ZONE_MOVABLE as well as reducing memory usage and in many cases the
cache footprint of the kernel.  They should be considered as a bug fix due to
the MPOL_BIND fixup.

If these patches are accepted, the follow-on work would entail:

o Encode zone_id in the zonelist pointers to avoid zone_idx() (Christoph's idea)
o If zone_id works out, eliminate z_to_n from the zonelist cache as unnecessary
o Remove bind_zonelist() (Patch in progress, very messy right now)
o Eliminate policy_zone (Trickier)

Comments?
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman
@ 2007-08-17 20:18 ` Mel Gorman
  2007-08-17 21:29   ` Christoph Lameter
  0 siblings, 1 reply; 35+ messages in thread
From: Mel Gorman @ 2007-08-17 20:18 UTC (permalink / raw)
  To: Lee.Schermerhorn, ak, clameter; +Cc: Mel Gorman, linux-kernel, linux-mm

The MPOL_BIND policy creates a custom zonelist that is used for any of the
thread's allocations that can use the policy_zone. As the zonelist is already being
filtered based on a zone id, this patch adds a version of __alloc_pages()
that takes a nodemask for further filtering. This eliminates the need for
MPOL_BIND to create a custom zonelist. The practical upside of this is that
allocations using MPOL_BIND should now use nodes closer to the running CPU
first instead of using nodes in numeric order.
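
To illustrate the "closer nodes first" effect, here is a small userspace
model. It is a sketch under stated assumptions, not the patch itself:
nodemask_model_t is a single unsigned long standing in for nodemask_t, and
struct zone_ref, entry_allowed() and first_usable() are hypothetical names.
The zonelist is assumed to be the running node's own list, already sorted by
distance, and a NULL nodemask means "no node filtering" as in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zonelist entry; valid == 0 terminates the list. */
struct zone_ref {
	int node;      /* owning node id */
	int zone_idx;  /* zone index within the node */
	int valid;
};

/* Stand-in for nodemask_t: bit n set means node n is allowed (assumption). */
typedef unsigned long nodemask_model_t;

static int entry_allowed(const struct zone_ref *z,
			 const nodemask_model_t *nodes, int highest_zoneidx)
{
	if (z->zone_idx > highest_zoneidx)
		return 0;
	/* A NULL nodemask means "do not filter by node". */
	if (nodes == NULL)
		return 1;
	return (int)((*nodes >> z->node) & 1UL);
}

/* First usable entry at or after 'z', or the terminator if none. */
static const struct zone_ref *first_usable(const struct zone_ref *z,
					   const nodemask_model_t *nodes,
					   int highest_zoneidx)
{
	assert(z != NULL);
	while (z->valid && !entry_allowed(z, nodes, highest_zoneidx))
		z++;
	return z;
}
```

If the running CPU is on node 2 and its zonelist orders nodes as 2, 3, 0, a
bind mask of {0, 3} yields node 3 first because the list order is preserved,
whereas a custom zonelist built in numeric order would have tried node 0.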

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/buffer.c               |    2 
 include/linux/cpuset.h    |    4 -
 include/linux/gfp.h       |    4 +
 include/linux/mempolicy.h |    3 
 include/linux/mmzone.h    |   59 +++++++++++++---
 kernel/cpuset.c           |   16 +---
 mm/mempolicy.c            |  145 +++++++++++------------------------------
 mm/page_alloc.c           |   34 ++++++---
 8 files changed, 128 insertions(+), 139 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/fs/buffer.c linux-2.6.23-rc3-030_filter_nodemask/fs/buffer.c
--- linux-2.6.23-rc3-020_gfpskip/fs/buffer.c	2007-08-17 16:36:04.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/fs/buffer.c	2007-08-17 16:56:36.000000000 +0100
@@ -355,7 +355,7 @@ static void free_more_memory(void)
 
 	for_each_online_node(nid) {
 		zones = first_zones_zonelist(node_zonelist(nid),
-			gfp_zone(GFP_NOFS));
+			NULL, gfp_zone(GFP_NOFS));
 		if (*zones)
 			try_to_free_pages(node_zonelist(nid), 0, GFP_NOFS);
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/cpuset.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/cpuset.h
--- linux-2.6.23-rc3-020_gfpskip/include/linux/cpuset.h	2007-08-13 05:25:24.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/cpuset.h	2007-08-17 16:56:36.000000000 +0100
@@ -28,7 +28,7 @@ void cpuset_init_current_mems_allowed(vo
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask);
 
 extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
 extern int __cpuset_zone_allowed_hardwall(struct zone *z, gfp_t gfp_mask);
@@ -98,7 +98,7 @@ static inline void cpuset_init_current_m
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
-static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+static inline int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
 	return 1;
 }
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/gfp.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h
--- linux-2.6.23-rc3-020_gfpskip/include/linux/gfp.h	2007-08-17 16:35:55.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/gfp.h	2007-08-17 16:56:36.000000000 +0100
@@ -141,6 +141,10 @@ static inline void arch_alloc_page(struc
 extern struct page *
 FASTCALL(__alloc_pages(gfp_t, unsigned int, struct zonelist *));
 
+extern struct page *
+FASTCALL(__alloc_pages_nodemask(gfp_t, unsigned int,
+				struct zonelist *, nodemask_t *nodemask));
+
 static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
 						unsigned int order)
 {
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/mempolicy.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/mempolicy.h
--- linux-2.6.23-rc3-020_gfpskip/include/linux/mempolicy.h	2007-08-17 16:35:55.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/mempolicy.h	2007-08-17 16:56:36.000000000 +0100
@@ -63,9 +63,8 @@ struct mempolicy {
 	atomic_t refcnt;
 	short policy; 	/* See MPOL_* above */
 	union {
-		struct zonelist  *zonelist;	/* bind */
 		short 		 preferred_node; /* preferred */
-		nodemask_t	 nodes;		/* interleave */
+		nodemask_t	 nodes;		/* interleave/bind */
 		/* undefined for default */
 	} v;
 	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h linux-2.6.23-rc3-030_filter_nodemask/include/linux/mmzone.h
--- linux-2.6.23-rc3-020_gfpskip/include/linux/mmzone.h	2007-08-17 16:56:20.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/include/linux/mmzone.h	2007-08-17 17:31:05.000000000 +0100
@@ -696,6 +696,16 @@ static inline struct zonelist *node_zone
 	return &NODE_DATA(nid)->node_zonelist;
 }
 
+static inline int zone_in_nodemask(unsigned long zone_addr,
+				nodemask_t *nodes)
+{
+#ifdef CONFIG_NUMA
+	return node_isset(zonelist_zone(zone_addr)->node, *nodes);
+#else
+	return 1;
+#endif /* CONFIG_NUMA */
+}
+
 static inline unsigned long *zonelist_gfp_skip(struct zonelist *zonelist,
 					enum zone_type highest_zoneidx)
 {
@@ -704,26 +714,57 @@ static inline unsigned long *zonelist_gf
 
 /* Returns the first zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *first_zones_zonelist(struct zonelist *zonelist,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	unsigned long *z;
-	for (z = zonelist_gfp_skip(zonelist, highest_zoneidx);
-		zonelist_zone_idx(*z) > highest_zoneidx;
-		z++);
+	unsigned long *z = zonelist_gfp_skip(zonelist, highest_zoneidx);
+
+	/* Only filter based on the nodemask if it's set */
+	if (likely(nodes == NULL))
+		for (;zonelist_zone_idx(*z) > highest_zoneidx;
+			z++);
+	else
+		for (;zonelist_zone_idx(*z) > highest_zoneidx ||
+				!zone_in_nodemask(*z, nodes);
+			z++);
 	return z;
 }
 
 /* Returns the next zone at or below highest_zoneidx in a zonelist */
 static inline unsigned long *next_zones_zonelist(unsigned long *z,
+					nodemask_t *nodes,
 					enum zone_type highest_zoneidx)
 {
-	for (++z;
-		zonelist_zone_idx(*z) > highest_zoneidx;
-		z++);
+	z++;
+
+	/* Only filter based on the nodemask if it's set */
+	if (likely(nodes == NULL))
+		for (;zonelist_zone_idx(*z) > highest_zoneidx;
+			z++);
+	else
+		for (;zonelist_zone_idx(*z) > highest_zoneidx ||
+				!zone_in_nodemask(*z, nodes);
+			z++);
 	return z;
 }
 
 /**
+ * for_each_zone_zonelist_nodemask - helper macro to iterate over valid zones in a zonelist at or below a given zone index and within a nodemask
+ * @zone - The current zone in the iterator
+ * @z - The current pointer within zonelist->zones being iterated
+ * @zlist - The zonelist being iterated
+ * @highidx - The zone index of the highest zone to return
+ * @nodemask - Nodemask allowed by the allocator
+ *
+ * This iterator iterates though all zones at or below a given zone index and
+ * within a given nodemask
+ */
+#define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+	for (z = first_zones_zonelist(zlist, nodemask, highidx), zone = zonelist_zone(*z); \
+		zone;							\
+		z = next_zones_zonelist(z, nodemask, highidx), zone = zonelist_zone(*z))
+
+/**
  * for_each_zone_zonelist - helper macro to iterate over valid zones in a zonelist at or below a given zone index
  * @zone - The current zone in the iterator
  * @z - The current pointer within zonelist->zones being iterated
@@ -733,9 +774,7 @@ static inline unsigned long *next_zones_
  * This iterator iterates though all zones at or below a given zone index.
  */
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
-	for (z = first_zones_zonelist(zlist, highidx), zone = zonelist_zone(*z); \
-		zone; \
-		z = next_zones_zonelist(z, highidx), zone = zonelist_zone(*z))
+	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/kernel/cpuset.c linux-2.6.23-rc3-030_filter_nodemask/kernel/cpuset.c
--- linux-2.6.23-rc3-020_gfpskip/kernel/cpuset.c	2007-08-17 16:36:04.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/kernel/cpuset.c	2007-08-17 16:56:36.000000000 +0100
@@ -2327,21 +2327,19 @@ nodemask_t cpuset_mems_allowed(struct ta
 }
 
 /**
- * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
- * @zl: the zonelist to be checked
+ * cpuset_nodemask_valid_mems_allowed - check nodemask vs. current mems_allowed
+ * @nodemask: the nodemask to be checked
  *
- * Are any of the nodes on zonelist zl allowed in current->mems_allowed?
+ * Are any of the nodes in the nodemask allowed in current->mems_allowed?
  */
-int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
+int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
 {
-	int i;
-
-	for (i = 0; zl->_zones[i]; i++) {
-		int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
+	int nid;
 
+	for_each_node_mask(nid, *nodemask)
 		if (node_isset(nid, current->mems_allowed))
 			return 1;
-	}
+
 	return 0;
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c linux-2.6.23-rc3-030_filter_nodemask/mm/mempolicy.c
--- linux-2.6.23-rc3-020_gfpskip/mm/mempolicy.c	2007-08-17 16:55:31.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/mm/mempolicy.c	2007-08-17 17:00:07.000000000 +0100
@@ -131,43 +131,20 @@ static int mpol_check_policy(int mode, n
 	return nodes_subset(*nodes, node_online_map) ? 0 : -EINVAL;
 }
 
-/* Generate a custom zonelist for the BIND policy. */
-static struct zonelist *bind_zonelist(nodemask_t *nodes)
+/* Check that the nodemask contains at least one populated zone */
+static int is_valid_nodemask(nodemask_t *nodemask)
 {
-	struct zonelist *zl;
-	int num, max, nd;
-	enum zone_type k;
+	int nd, k;
 
-	max = 1 + MAX_NR_ZONES * nodes_weight(*nodes);
-	max++;			/* space for zlcache_ptr (see mmzone.h) */
-	max += sizeof(unsigned short) * MAX_NR_ZONES;	/* gfp_skip */
-	zl = kmalloc(sizeof(struct zone *) * max, GFP_KERNEL);
-	if (!zl)
-		return ERR_PTR(-ENOMEM);
-	zl->zlcache_ptr = NULL;
-	memset(zl->gfp_skip, 0, sizeof(zl->gfp_skip));
-	num = 0;
-	/* First put in the highest zones from all nodes, then all the next 
-	   lower zones etc. Avoid empty zones because the memory allocator
-	   doesn't like them. If you implement node hot removal you
-	   have to fix that. */
+	/* Check that there is something useful in this mask */
 	k = policy_zone;
-	while (1) {
-		for_each_node_mask(nd, *nodes) { 
-			struct zone *z = &NODE_DATA(nd)->node_zones[k];
-			if (z->present_pages > 0) 
-				zl->_zones[num++] = encode_zone_idx(z);
-		}
-		if (k == 0)
-			break;
-		k--;
-	}
-	if (num == 0) {
-		kfree(zl);
-		return ERR_PTR(-EINVAL);
+	for_each_node_mask(nd, *nodemask) {
+		struct zone *z = &NODE_DATA(nd)->node_zones[k];
+		if (z->present_pages > 0)
+			return 1;
 	}
-	zl->_zones[num] = 0;
-	return zl;
+
+	return 0;
 }
 
 /* Create a new policy */
@@ -198,12 +175,11 @@ static struct mempolicy *mpol_new(int mo
 			policy->v.preferred_node = -1;
 		break;
 	case MPOL_BIND:
-		policy->v.zonelist = bind_zonelist(nodes);
-		if (IS_ERR(policy->v.zonelist)) {
-			void *error_code = policy->v.zonelist;
+		if (!is_valid_nodemask(nodes)) {
 			kmem_cache_free(policy_cache, policy);
-			return error_code;
+			return ERR_PTR(-EINVAL);
 		}
+		policy->v.nodes = *nodes;
 		break;
 	}
 	policy->policy = mode;
@@ -481,19 +457,13 @@ long do_set_mempolicy(int mode, nodemask
 /* Fill a zone bitmap for a policy */
 static void get_zonemask(struct mempolicy *p, nodemask_t *nodes)
 {
-	int i;
 
 	nodes_clear(*nodes);
 	switch (p->policy) {
-	case MPOL_BIND:
-		for (i = 0; p->v.zonelist->_zones[i]; i++) {
-			struct zone *zone;
-			zone = zonelist_zone(p->v.zonelist->_zones[i]);
-			node_set(zone_to_nid(zone), *nodes);
-		}
-		break;
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		*nodes = p->v.nodes;
 		break;
@@ -1094,6 +1064,17 @@ static struct mempolicy * get_vma_policy
 	return pol;
 }
 
+/* Return a nodemask representing a mempolicy */
+static nodemask_t *nodemask_policy(gfp_t gfp, struct mempolicy *policy)
+{
+	/* Lower zones don't get a nodemask applied  for MPOL_BIND */
+	if (policy->policy == MPOL_BIND &&
+			gfp_zone(gfp) >= policy_zone &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+		return &policy->v.nodes;
+
+	return NULL;
+}
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
@@ -1106,11 +1087,6 @@ static struct zonelist *zonelist_policy(
 			nd = numa_node_id();
 		break;
 	case MPOL_BIND:
-		/* Lower zones don't get a policy applied */
-		/* Careful: current->mems_allowed might have moved */
-		if (gfp_zone(gfp) >= policy_zone)
-			if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
-				return policy->v.zonelist;
 		/*FALL THROUGH*/
 	case MPOL_INTERLEAVE: /* should not happen */
 	case MPOL_DEFAULT:
@@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
-	case MPOL_BIND:
+	case MPOL_BIND: {
 		/*
 		 * Follow bind policy behavior and start allocation at the
 		 * first node.
 		 */
-		return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0]));
+		struct zonelist *zonelist;
+		unsigned long *z;
+		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
+		z = first_zones_zonelist(zonelist, &policy->v.nodes,
+							highest_zoneidx);
+		return zone_to_nid(zonelist_zone(*z));
+	}
 
 	case MPOL_PREFERRED:
 		if (policy->v.preferred_node >= 0)
@@ -1272,7 +1255,8 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 		nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
 		return alloc_page_interleave(gfp, 0, nid);
 	}
-	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+	return __alloc_pages_nodemask(gfp, 0,
+			zonelist_policy(gfp, pol), nodemask_policy(gfp, pol));
 }
 
 /**
@@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem
 	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
-	if (new->policy == MPOL_BIND) {
-		int sz = ksize(old->v.zonelist);
-		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
-		if (!new->v.zonelist) {
-			kmem_cache_free(policy_cache, new);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
 	return new;
 }
 
@@ -1351,21 +1327,12 @@ int __mpol_equal(struct mempolicy *a, st
 	switch (a->policy) {
 	case MPOL_DEFAULT:
 		return 1;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		return nodes_equal(a->v.nodes, b->v.nodes);
 	case MPOL_PREFERRED:
 		return a->v.preferred_node == b->v.preferred_node;
-	case MPOL_BIND: {
-		int i;
-		for (i = 0; a->v.zonelist->_zones[i]; i++) {
-			struct zone *za, *zb;
-			za = zonelist_zone(a->v.zonelist->_zones[i]);
-			zb = zonelist_zone(b->v.zonelist->_zones[i]);
-			if (za != zb)
-				return 0;
-		}
-		return b->v.zonelist->_zones[i] == 0;
-	}
 	default:
 		BUG();
 		return 0;
@@ -1377,8 +1344,6 @@ void __mpol_free(struct mempolicy *p)
 {
 	if (!atomic_dec_and_test(&p->refcnt))
 		return;
-	if (p->policy == MPOL_BIND)
-		kfree(p->v.zonelist);
 	p->policy = MPOL_DEFAULT;
 	kmem_cache_free(policy_cache, p);
 }
@@ -1668,6 +1633,8 @@ void mpol_rebind_policy(struct mempolicy
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
+	case MPOL_BIND:
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
@@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy
 						*mpolmask, *newmask);
 		*mpolmask = *newmask;
 		break;
-	case MPOL_BIND: {
-		nodemask_t nodes;
-		unsigned long *z;
-		struct zonelist *zonelist;
-
-		nodes_clear(nodes);
-		for (z = pol->v.zonelist->_zones; *z; z++)
-			node_set(zone_to_nid(zonelist_zone(*z)), nodes);
-		nodes_remap(tmp, nodes, *mpolmask, *newmask);
-		nodes = tmp;
-
-		zonelist = bind_zonelist(&nodes);
-
-		/* If no mem, then zonelist is NULL and we keep old zonelist.
-		 * If that old zonelist has no remaining mems_allowed nodes,
-		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
-		 */
-
-		if (!IS_ERR(zonelist)) {
-			/* Good - got mem - substitute new zonelist */
-			kfree(pol->v.zonelist);
-			pol->v.zonelist = zonelist;
-		}
-		*mpolmask = *newmask;
-		break;
-	}
 	default:
 		BUG();
 		break;
@@ -1768,9 +1709,7 @@ static inline int mpol_to_str(char *buff
 		break;
 
 	case MPOL_BIND:
-		get_zonemask(pol, &nodes);
-		break;
-
+		/* Fall through */
 	case MPOL_INTERLEAVE:
 		nodes = pol->v.nodes;
 		break;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c
--- linux-2.6.23-rc3-020_gfpskip/mm/page_alloc.c	2007-08-17 16:55:31.000000000 +0100
+++ linux-2.6.23-rc3-030_filter_nodemask/mm/page_alloc.c	2007-08-17 17:00:27.000000000 +0100
@@ -1147,7 +1147,7 @@ static void zlc_mark_zone_full(struct zo
  * a page.
  */
 static struct page *
-get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
+get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
 		struct zonelist *zonelist, int high_zoneidx, int alloc_flags)
 {
 	unsigned long *z;
@@ -1159,7 +1159,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 
-	z = first_zones_zonelist(zonelist, high_zoneidx);
+	z = first_zones_zonelist(zonelist, nodemask, high_zoneidx);
 	classzone = zonelist_zone(*z);
 	classzone_idx = zonelist_zone_idx(*z);
 
@@ -1168,7 +1168,8 @@ zonelist_scan:
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+						high_zoneidx, nodemask) {
 		if (NUMA_BUILD && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
@@ -1222,8 +1223,8 @@ try_next_zone:
  * This is the 'heart' of the zoned buddy allocator.
  */
 struct page * fastcall
-__alloc_pages(gfp_t gfp_mask, unsigned int order,
-		struct zonelist *zonelist)
+__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
+			struct zonelist *zonelist, nodemask_t *nodemask)
 {
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1248,7 +1249,7 @@ restart:
 		return NULL;
 	}
 
-	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
+	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
 			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);
 	if (page)
 		goto got_pg;
@@ -1293,7 +1294,7 @@ restart:
 	 * Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
 	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
 	 */
-	page = get_page_from_freelist(gfp_mask, order, zonelist,
+	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 						high_zoneidx, alloc_flags);
 	if (page)
 		goto got_pg;
@@ -1306,7 +1307,7 @@ rebalance:
 		if (!(gfp_mask & __GFP_NOMEMALLOC)) {
 nofail_alloc:
 			/* go through the zonelist yet again, ignoring mins */
-			page = get_page_from_freelist(gfp_mask, order,
+			page = get_page_from_freelist(gfp_mask, nodemask, order,
 				zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
 			if (page)
 				goto got_pg;
@@ -1338,7 +1339,7 @@ nofail_alloc:
 	cond_resched();
 
 	if (likely(did_some_progress)) {
-		page = get_page_from_freelist(gfp_mask, order,
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
 					zonelist, high_zoneidx, alloc_flags);
 		if (page)
 			goto got_pg;
@@ -1349,8 +1350,9 @@ nofail_alloc:
 		 * a parallel oom killing, we must fail if we're still
 		 * under heavy pressure.
 		 */
-		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
-			zonelist, high_zoneidx, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+		page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
+			order, zonelist, high_zoneidx,
+			ALLOC_WMARK_HIGH|ALLOC_CPUSET);
 		if (page)
 			goto got_pg;
 
@@ -1394,6 +1396,14 @@ got_pg:
 	return page;
 }
 
+struct page * fastcall
+__alloc_pages(gfp_t gfp_mask, unsigned int order,
+		struct zonelist *zonelist)
+{
+	return __alloc_pages_nodemask(gfp_mask, order, zonelist, NULL);
+}
+
+
 EXPORT_SYMBOL(__alloc_pages);
 
 /*
@@ -2055,7 +2065,7 @@ static void build_zonelist_gfpskip(pg_da
 
 	for (target = 0; target < MAX_NR_ZONES; target++) {
 		unsigned long *z;
-		z = first_zones_zonelist(zl, target);
+		z = first_zones_zonelist(zl, NULL, target);
 		zl->gfp_skip[target] = z - zl->_zones;
 	}
 }


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
@ 2007-08-17 21:29   ` Christoph Lameter
  2007-08-21  9:12     ` Mel Gorman
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Lameter @ 2007-08-17 21:29 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm

On Fri, 17 Aug 2007, Mel Gorman wrote:

> @@ -696,6 +696,16 @@ static inline struct zonelist *node_zone
>  	return &NODE_DATA(nid)->node_zonelist;
>  }
>  
> +static inline int zone_in_nodemask(unsigned long zone_addr,
> +				nodemask_t *nodes)
> +{
> +#ifdef CONFIG_NUMA
> +	return node_isset(zonelist_zone(zone_addr)->node, *nodes);
> +#else
> +	return 1;
> +#endif /* CONFIG_NUMA */
> +}
> +

This is dereferencing the zone in a filtering operation. I wonder if
we could encode the node in the zone_addr as well? x86_64 aligns zones on
page boundaries. So we have 10 bits left after taking 2 for the zone id.
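
A rough userspace sketch of what such an encoding could look like, assuming
4096-byte alignment so that the low 12 bits of the address are free. The
names encode_zone(), decode_zone_id(), decode_node_id() and
decode_zone_addr() are made up for illustration and are not taken from the
patchset.

```c
#include <assert.h>
#include <stdint.h>

#define ZONE_ID_BITS  2u   /* "2 for the zone id" */
#define NODE_ID_BITS  10u  /* the "10 bits left" on a page-aligned address */
#define ZONE_ID_MASK  (((uintptr_t)1 << ZONE_ID_BITS) - 1)
#define NODE_ID_MASK  (((uintptr_t)1 << NODE_ID_BITS) - 1)
#define LOW_BITS_MASK (((uintptr_t)1 << (ZONE_ID_BITS + NODE_ID_BITS)) - 1)

/* Pack a zone id and node id into the unused low bits of the address. */
static uintptr_t encode_zone(uintptr_t zone_addr, unsigned zone_id,
			     unsigned node_id)
{
	assert((zone_addr & LOW_BITS_MASK) == 0);  /* must be page aligned */
	assert(zone_id <= ZONE_ID_MASK && node_id <= NODE_ID_MASK);
	return zone_addr | ((uintptr_t)node_id << ZONE_ID_BITS) | zone_id;
}

static unsigned decode_zone_id(uintptr_t encoded)
{
	return (unsigned)(encoded & ZONE_ID_MASK);
}

static unsigned decode_node_id(uintptr_t encoded)
{
	return (unsigned)((encoded >> ZONE_ID_BITS) & NODE_ID_MASK);
}

static uintptr_t decode_zone_addr(uintptr_t encoded)
{
	return encoded & ~LOW_BITS_MASK;
}
```

With this layout the nodemask check would only need decode_node_id() on the
zonelist entry and would not have to touch the zone structure itself.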

> -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
>  {
> -	int i;
> -
> -	for (i = 0; zl->_zones[i]; i++) {
> -		int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
> +	int nid;
>  
> +	for_each_node_mask(nid, *nodemask)
>  		if (node_isset(nid, current->mems_allowed))
>  			return 1;
> -	}
> +
>  	return 0;

Hmmm... This is equivalent to

nodemask_t temp;

nodes_and(temp, *nodemask, current->mems_allowed);
return !nodes_empty(temp);

which avoids the loop over all nodes.
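
The equivalence can be checked with a small userspace model in which a mask
is a single 64-bit word. This is only a sketch: the real nodemask_t is a
multi-word bitmap, and any_allowed_loop(), any_allowed_and() and
any_allowed() are hypothetical names used here for illustration.

```c
#include <assert.h>

#define MODEL_MAX_NODES 64

/* Loop form: test each node set in 'nodemask' against 'allowed'. */
static int any_allowed_loop(unsigned long long nodemask,
			    unsigned long long allowed)
{
	int nid;

	for (nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (!((nodemask >> nid) & 1ull))
			continue;
		if ((allowed >> nid) & 1ull)
			return 1;
	}
	return 0;
}

/* AND form: one bitwise intersection followed by an emptiness test. */
static int any_allowed_and(unsigned long long nodemask,
			   unsigned long long allowed)
{
	unsigned long long temp = nodemask & allowed;

	return temp != 0;
}

/* Use the AND form and assert that the loop form agrees with it. */
static int any_allowed(unsigned long long nodemask,
		       unsigned long long allowed)
{
	int result = any_allowed_and(nodemask, allowed);

	assert(result == any_allowed_loop(nodemask, allowed));
	return result;
}
```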

> -	}
> -	if (num == 0) {
> -		kfree(zl);
> -		return ERR_PTR(-EINVAL);
> +	for_each_node_mask(nd, *nodemask) {
> +		struct zone *z = &NODE_DATA(nd)->node_zones[k];
> +		if (z->present_pages > 0)
> +			return 1;

Here you could use an and with the N_HIGH_MEMORY or N_NORMAL_MEMORY 
nodemask.

> @@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol
>  	case MPOL_INTERLEAVE:
>  		return interleave_nodes(policy);
>  
> -	case MPOL_BIND:
> +	case MPOL_BIND: {

No { } needed.

>  		/*
>  		 * Follow bind policy behavior and start allocation at the
>  		 * first node.
>  		 */
> -		return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0]));
> +		struct zonelist *zonelist;
> +		unsigned long *z;
> +		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
> +		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
> +		z = first_zones_zonelist(zonelist, &policy->v.nodes,
> +							highest_zoneidx);
> +		return zone_to_nid(zonelist_zone(*z));
> +	}
>  
>  	case MPOL_PREFERRED:
>  		if (policy->v.preferred_node >= 0)

> @@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem
>  	}
>  	*new = *old;
>  	atomic_set(&new->refcnt, 1);
> -	if (new->policy == MPOL_BIND) {
> -		int sz = ksize(old->v.zonelist);
> -		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
> -		if (!new->v.zonelist) {
> -			kmem_cache_free(policy_cache, new);
> -			return ERR_PTR(-ENOMEM);
> -		}
> -	}
>  	return new;

That is a good optimization.

> @@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy
>  						*mpolmask, *newmask);
>  		*mpolmask = *newmask;
>  		break;
> -	case MPOL_BIND: {
> -		nodemask_t nodes;
> -		unsigned long *z;
> -		struct zonelist *zonelist;
> -
> -		nodes_clear(nodes);
> -		for (z = pol->v.zonelist->_zones; *z; z++)
> -			node_set(zone_to_nid(zonelist_zone(*z)), nodes);
> -		nodes_remap(tmp, nodes, *mpolmask, *newmask);
> -		nodes = tmp;
> -
> -		zonelist = bind_zonelist(&nodes);
> -
> -		/* If no mem, then zonelist is NULL and we keep old zonelist.
> -		 * If that old zonelist has no remaining mems_allowed nodes,
> -		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
> -		 */
> -
> -		if (!IS_ERR(zonelist)) {
> -			/* Good - got mem - substitute new zonelist */
> -			kfree(pol->v.zonelist);
> -			pol->v.zonelist = zonelist;
> -		}
> -		*mpolmask = *newmask;
> -		break;
> -	}

Simply dropped? We still need to recalculate the node_mask depending on 
the new cpuset environment!



* Re: [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask
  2007-08-17 21:29   ` Christoph Lameter
@ 2007-08-21  9:12     ` Mel Gorman
  0 siblings, 0 replies; 35+ messages in thread
From: Mel Gorman @ 2007-08-21  9:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Lee.Schermerhorn, ak, linux-kernel, linux-mm

On (17/08/07 14:29), Christoph Lameter didst pronounce:
> On Fri, 17 Aug 2007, Mel Gorman wrote:
> 
> > @@ -696,6 +696,16 @@ static inline struct zonelist *node_zone
> >  	return &NODE_DATA(nid)->node_zonelist;
> >  }
> >  
> > +static inline int zone_in_nodemask(unsigned long zone_addr,
> > +				nodemask_t *nodes)
> > +{
> > +#ifdef CONFIG_NUMA
> > +	return node_isset(zonelist_zone(zone_addr)->node, *nodes);
> > +#else
> > +	return 1;
> > +#endif /* CONFIG_NUMA */
> > +}
> > +
> 
> This is dereferencing the zone in a filtering operation. I wonder if
> we could encode the node in the zone_addr as well? x86_64 aligns zones on
> page boundaries. So we have 10 bits left after taking 2 for the zone id.
> 

I had considered it but not gotten around to an implementation. A quick
look shows that it is likely to be a win on x86_64 and ppc64 as in those
places NODES_SHIFT is small enough to fit into the lower bits of the
zone addresses. It does not appear to be the case on IA-64 though. The
INTERNODE_CACHE_SHIFT will be around 7 but the NODES_SHIFT defaults to
10 so it will not fit.

I'll try it out anyway.
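
To make the bit budget concrete, here is a minimal userspace sketch of the
encoding being discussed. It is not the patch's code. It assumes zones are
aligned to 2^12 bytes (the page-aligned x86_64 case mentioned above), takes
2 bits for the zone index and uses the remaining 10 for the node id. All
toy_* names and TOY_* constants are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: every zone structure is aligned to 1 << TOY_ALIGN_SHIFT. */
#define TOY_ALIGN_SHIFT 12
#define TOY_ZONE_BITS   2
#define TOY_NODE_BITS   (TOY_ALIGN_SHIFT - TOY_ZONE_BITS)

#define TOY_LOW_MASK    (((uintptr_t)1 << TOY_ALIGN_SHIFT) - 1)
#define TOY_ZONE_MASK   (((uintptr_t)1 << TOY_ZONE_BITS) - 1)
#define TOY_NODE_MASK   (((uintptr_t)1 << TOY_NODE_BITS) - 1)

struct toy_zone {
	int node;
};

/* Pack the zone index and node id into the unused low bits of the address. */
static uintptr_t toy_encode(struct toy_zone *z, unsigned zone_idx, unsigned nid)
{
	uintptr_t addr = (uintptr_t)z;

	assert((addr & TOY_LOW_MASK) == 0);	/* alignment leaves the bits free */
	assert(zone_idx <= TOY_ZONE_MASK);
	assert(nid <= TOY_NODE_MASK);

	return addr | ((uintptr_t)nid << TOY_ZONE_BITS) | zone_idx;
}

/* Recover the zone pointer by clearing the low bits. */
static struct toy_zone *toy_zone_of(uintptr_t enc)
{
	return (struct toy_zone *)(enc & ~TOY_LOW_MASK);
}

static unsigned toy_zone_idx(uintptr_t enc)
{
	return (unsigned)(enc & TOY_ZONE_MASK);
}

/* Node id without dereferencing the zone, which is the point of the idea. */
static unsigned toy_nid(uintptr_t enc)
{
	return (unsigned)((enc >> TOY_ZONE_BITS) & TOY_NODE_MASK);
}

/* Does a node id of nodes_shift bits fit next to the zone index? */
static int toy_node_fits(unsigned align_shift, unsigned zone_bits,
			 unsigned nodes_shift)
{
	return align_shift >= zone_bits + nodes_shift;
}
```

With these numbers toy_node_fits(12, 2, 10) holds, while an alignment shift
of 7 with a NODES_SHIFT of 10 (the IA-64 figures above) does not fit.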

> > -int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> > +int cpuset_nodemask_valid_mems_allowed(nodemask_t *nodemask)
> >  {
> > -	int i;
> > -
> > -	for (i = 0; zl->_zones[i]; i++) {
> > -		int nid = zone_to_nid(zonelist_zone(zl->_zones[i]));
> > +	int nid;
> >  
> > +	for_each_node_mask(nid, *nodemask)
> >  		if (node_isset(nid, current->mems_allowed))
> >  			return 1;
> > -	}
> > +
> >  	return 0;
> 
> Hmmm... This is equivalent to
> 
> nodemask_t temp;
> 
> nodes_and(temp, *nodemask, current->mems_allowed);
> return !nodes_empty(temp);
> 
> which avoids the loop over all nodes.
> 

Good point. I've replaced the code with your version.
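
For anyone checking the equivalence, a small userspace sketch follows. It
assumes a nodemask that fits in a single unsigned long, which the kernel's
nodemask_t does not in general. The toy_* names are hypothetical and only
mirror the shape of the two versions.

```c
#include <assert.h>

/* Assumption: at most one word of nodes; the real nodemask_t is a bitmap. */
typedef unsigned long toy_nodemask_t;
#define TOY_MAX_NODES ((unsigned)(8 * sizeof(toy_nodemask_t)))

/* Original shape: walk each node set in mask and test it against allowed. */
static int toy_valid_loop(toy_nodemask_t mask, toy_nodemask_t allowed)
{
	unsigned nid;

	for (nid = 0; nid < TOY_MAX_NODES; nid++) {
		if (!(mask & (1UL << nid)))
			continue;	/* node not in the policy mask */
		if (allowed & (1UL << nid))
			return 1;
	}
	return 0;
}

/* Suggested shape: intersect the masks and test for emptiness. */
static int toy_valid_and(toy_nodemask_t mask, toy_nodemask_t allowed)
{
	toy_nodemask_t temp = mask & allowed;

	return temp != 0;
}

/* Helper for experimenting: both versions must agree on any input. */
static void toy_check_equivalent(toy_nodemask_t mask, toy_nodemask_t allowed)
{
	assert(toy_valid_loop(mask, allowed) == toy_valid_and(mask, allowed));
}
```

Both return 1 exactly when the two masks share at least one set bit, so the
per-node loop can be replaced by a mask intersection.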

> > -	}
> > -	if (num == 0) {
> > -		kfree(zl);
> > -		return ERR_PTR(-EINVAL);
> > +	for_each_node_mask(nd, *nodemask) {
> > +		struct zone *z = &NODE_DATA(nd)->node_zones[k];
> > +		if (z->present_pages > 0)
> > +			return 1;
> 
> Here you could use an and with the N_HIGH_MEMORY or N_NORMAL_MEMORY 
> nodemask.
> 

I'm basing against 2.6.23-rc3 at the moment. I'll add an additional
patch later to use the N_HIGH_MEMORY and N_NORMAL_MEMORY nodemasks.

> > @@ -1149,12 +1125,19 @@ unsigned slab_node(struct mempolicy *pol
> >  	case MPOL_INTERLEAVE:
> >  		return interleave_nodes(policy);
> >  
> > -	case MPOL_BIND:
> > +	case MPOL_BIND: {
> 
> No { } needed.
> 
> >  		/*
> >  		 * Follow bind policy behavior and start allocation at the
> >  		 * first node.
> >  		 */
> > -		return zone_to_nid(zonelist_zone(policy->v.zonelist->_zones[0]));
> > +		struct zonelist *zonelist;
> > +		unsigned long *z;

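Without the {}, it would fail to compile here: the declarations directly
follow the case label, and a label must be followed by a statement, not a
declaration.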
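
A standalone illustration of the brace requirement, assuming a C89/C99-era
dialect (C23 and some newer compilers relax the rule). The enum and function
names are hypothetical and only mimic the shape of slab_node().

```c
#include <assert.h>

enum toy_policy { TOY_INTERLEAVE, TOY_BIND, TOY_PREFERRED };

static int toy_slab_node(enum toy_policy policy, int first_allowed,
			 int preferred)
{
	switch (policy) {
	case TOY_INTERLEAVE:
		return 0;

	case TOY_BIND: {
		/*
		 * A declaration directly after "case TOY_BIND:" is not a
		 * statement, so it needs its own block. Removing the braces
		 * makes older compilers reject the declaration of nid.
		 */
		int nid = first_allowed;

		assert(nid >= 0);	/* sketch assumes a non-empty bind mask */
		return nid;
	}

	case TOY_PREFERRED:
		return preferred;
	}
	return -1;
}
```

The alternative would be to hoist the declarations to the top of the
function, at the cost of widening their scope.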

> > +		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
> > +		zonelist = &NODE_DATA(numa_node_id())->node_zonelist;
> > +		z = first_zones_zonelist(zonelist, &policy->v.nodes,
> > +							highest_zoneidx);
> > +		return zone_to_nid(zonelist_zone(*z));
> > +	}
> >  
> >  	case MPOL_PREFERRED:
> >  		if (policy->v.preferred_node >= 0)
> 
> > @@ -1330,14 +1314,6 @@ struct mempolicy *__mpol_copy(struct mem
> >  	}
> >  	*new = *old;
> >  	atomic_set(&new->refcnt, 1);
> > -	if (new->policy == MPOL_BIND) {
> > -		int sz = ksize(old->v.zonelist);
> > -		new->v.zonelist = kmemdup(old->v.zonelist, sz, GFP_KERNEL);
> > -		if (!new->v.zonelist) {
> > -			kmem_cache_free(policy_cache, new);
> > -			return ERR_PTR(-ENOMEM);
> > -		}
> > -	}
> >  	return new;
> 
> That is a good optimization.
> 

Thanks

> > @@ -1680,32 +1647,6 @@ void mpol_rebind_policy(struct mempolicy
> >  						*mpolmask, *newmask);
> >  		*mpolmask = *newmask;
> >  		break;
> > -	case MPOL_BIND: {
> > -		nodemask_t nodes;
> > -		unsigned long *z;
> > -		struct zonelist *zonelist;
> > -
> > -		nodes_clear(nodes);
> > -		for (z = pol->v.zonelist->_zones; *z; z++)
> > -			node_set(zone_to_nid(zonelist_zone(*z)), nodes);
> > -		nodes_remap(tmp, nodes, *mpolmask, *newmask);
> > -		nodes = tmp;
> > -
> > -		zonelist = bind_zonelist(&nodes);
> > -
> > -		/* If no mem, then zonelist is NULL and we keep old zonelist.
> > -		 * If that old zonelist has no remaining mems_allowed nodes,
> > -		 * then zonelist_policy() will "FALL THROUGH" to MPOL_DEFAULT.
> > -		 */
> > -
> > -		if (!IS_ERR(zonelist)) {
> > -			/* Good - got mem - substitute new zonelist */
> > -			kfree(pol->v.zonelist);
> > -			pol->v.zonelist = zonelist;
> > -		}
> > -		*mpolmask = *newmask;
> > -		break;
> > -	}
> 
> Simply dropped? We still need to recalculate the node_mask depending on 
> the new cpuset environment!
> 

It's not simply dropped. The previous patch chunk made the MPOL_BIND case
fall through to take the same action as MPOL_INTERLEAVE. Is that wrong?
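
For readers without the earlier hunk in front of them, the fall-through can
be pictured roughly as below. This is a userspace sketch, not the patch:
toy_remap() is a simplified stand-in for nodes_remap(), nodemasks are a
single unsigned long, and every toy_* name is made up for illustration.

```c
#include <assert.h>

typedef unsigned long toy_nodemask_t;
#define TOY_MAX_NODES ((unsigned)(8 * sizeof(toy_nodemask_t)))

enum toy_mode { TOY_MPOL_DEFAULT, TOY_MPOL_INTERLEAVE, TOY_MPOL_BIND };

struct toy_policy {
	enum toy_mode mode;
	toy_nodemask_t nodes;
};

/* Number of set bits in m. */
static unsigned toy_weight(toy_nodemask_t m)
{
	unsigned w = 0;

	for (; m; m &= m - 1)
		w++;
	return w;
}

/* Position of the n-th (0-based) set bit; caller ensures n < weight. */
static unsigned toy_nth_set_bit(toy_nodemask_t m, unsigned n)
{
	unsigned pos;

	for (pos = 0; pos < TOY_MAX_NODES; pos++) {
		if (!((m >> pos) & 1UL))
			continue;
		if (n == 0)
			return pos;
		n--;
	}
	return TOY_MAX_NODES;
}

/* Simplified remap: n-th set bit of old maps to (n % w)-th set bit of new. */
static toy_nodemask_t toy_remap(toy_nodemask_t src, toy_nodemask_t old,
				toy_nodemask_t new_mask)
{
	toy_nodemask_t dst = 0;
	unsigned w = toy_weight(new_mask);
	unsigned pos;

	if (old == 0 || new_mask == 0 || old == new_mask)
		return src;

	for (pos = 0; pos < TOY_MAX_NODES; pos++) {
		toy_nodemask_t bit = 1UL << pos;
		unsigned n;

		if (!(src & bit))
			continue;
		if (!(old & bit)) {
			dst |= bit;	/* outside old: maps to itself */
			continue;
		}
		n = toy_weight(old & (bit - 1));	/* ordinal within old */
		dst |= 1UL << toy_nth_set_bit(new_mask, n % w);
	}
	return dst;
}

/* With only a nodemask to maintain, BIND can share INTERLEAVE's remap. */
static void toy_rebind(struct toy_policy *pol, toy_nodemask_t *mpolmask,
		       const toy_nodemask_t *newmask)
{
	switch (pol->mode) {
	case TOY_MPOL_DEFAULT:
		break;
	case TOY_MPOL_BIND:
		/* fall through */
	case TOY_MPOL_INTERLEAVE:
		pol->nodes = toy_remap(pol->nodes, *mpolmask, *newmask);
		*mpolmask = *newmask;
		break;
	}
}
```

The point of the sketch is that once MPOL_BIND stores a nodemask rather than
a private zonelist, recalculating it for a new cpuset is the same remap that
MPOL_INTERLEAVE already performs.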

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab



end of thread, other threads:[~2008-02-29 14:49 UTC | newest]

Thread overview: 35+ messages
2007-09-28 14:23 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v8 Mel Gorman
2007-09-28 14:23 ` [PATCH 1/6] Use zonelists instead of zones when direct reclaiming pages Mel Gorman
2007-09-28 14:24 ` [PATCH 2/6] Introduce node_zonelist() for accessing the zonelist for a GFP mask Mel Gorman
2007-09-28 14:24 ` [PATCH 3/6] Use two zonelist that are filtered by " Mel Gorman
2007-09-28 14:24 ` [PATCH 4/6] Have zonelist contains structs with both a zone pointer and zone_idx Mel Gorman
2007-10-17  3:22   ` David Rientjes
2007-09-28 14:25 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-09-28 15:37   ` Lee Schermerhorn
2007-09-28 18:28     ` Mel Gorman
2007-09-28 18:38       ` Paul Jackson
2007-09-28 21:03       ` Lee Schermerhorn
2007-09-28 14:25 ` [PATCH 6/6] Use one zonelist that is filtered by nodemask Mel Gorman
2007-10-09  1:11   ` Nishanth Aravamudan
2007-10-09  1:56     ` Christoph Lameter
2007-10-09  3:17       ` Nishanth Aravamudan
2007-10-09 15:40     ` Mel Gorman
2007-10-09 16:25       ` Nishanth Aravamudan
2007-10-09 18:47         ` Christoph Lameter
2007-10-09 18:12       ` Nishanth Aravamudan
2007-10-10 15:53       ` Lee Schermerhorn
2007-10-10 16:05         ` Nishanth Aravamudan
2007-10-10 16:09         ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2007-11-09 14:32 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v9 Mel Gorman
2007-11-09 14:34 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2008-02-29  5:01   ` Paul Jackson
2008-02-29 14:49     ` Lee Schermerhorn
2007-09-13 17:52 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v7 Mel Gorman
2007-09-13 17:53 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-09-12 21:04 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v6 Mel Gorman
2007-09-12 21:06 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-09-12 21:23   ` Christoph Lameter
2007-09-13 10:25     ` Mel Gorman
2007-09-13 15:49   ` Lee Schermerhorn
2007-09-11 21:30 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 (resend) Mel Gorman
2007-09-11 21:31 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-09-11 15:19 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v5 Mel Gorman
2007-09-11 15:21 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-08-17 20:16 [PATCH 0/6] Use one zonelist per node instead of multiple zonelists v4 Mel Gorman
2007-08-17 20:18 ` [PATCH 5/6] Filter based on a nodemask as well as a gfp_mask Mel Gorman
2007-08-17 21:29   ` Christoph Lameter
2007-08-21  9:12     ` Mel Gorman
