linux-mm.kvack.org archive mirror
* [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
@ 2006-09-07 19:03 Mel Gorman
  2006-09-07 19:04 ` [PATCH 1/8] Add __GFP_EASYRCLM flag and update callers Mel Gorman
                   ` (8 more replies)
  0 siblings, 9 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:03 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

This is the latest version of anti-fragmentation based on sub-zones (previously
called list-based anti-fragmentation), built on top of 2.6.18-rc5-mm1. After
the last release, it was decided that the scheme should be implemented with
zones to avoid affecting the page allocator hot paths. However, at the VM
Summit it was made clear that zones may not be the right answer either, as
zones have their own issues. Hence, this is a reintroduction of the first
approach.

The purpose of these patches is to reduce external fragmentation by grouping
pages of related types together. The objective is that when page reclaim
occurs, there is a greater chance that large contiguous blocks of pages will
be free. Note that this is not defragmentation, which would obtain contiguous
blocks by moving pages around.

These patches work by categorising allocations by their reclaimability (a
short usage sketch follows the list):

EasyReclaimable - These are userspace pages that are easily reclaimable. This
	flag is set when it is known that the pages can be trivially reclaimed
	by writing them out to swap or syncing with backing storage.

KernelReclaimable - These are allocations for some kernel caches that are
	reclaimable or allocations that are known to be very short-lived.

KernelNonReclaimable - These are pages allocated by the kernel that are not
	trivially reclaimed. For example, the memory allocated for a
	loaded module would be in this category. By default, allocations are
	considered to be of this type.
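
To make the tagging concrete, here is a stand-alone, user-space sketch of how
a caller would request each category via the gfp flags. The flag values mirror
patches 1 and 7 below; the 0x10u/0xd0u stand-ins for GFP_HIGHUSER/GFP_KERNEL
are made up for the example and are not the real values.

#include <stdio.h>

typedef unsigned int gfp_t;

#define __GFP_KERNRCLM   ((gfp_t)0x80000u)	/* KernelReclaimable (patch 7) */
#define __GFP_EASYRCLM   ((gfp_t)0x100000u)	/* EasyReclaimable (patches 1 and 7) */
#define GFP_RECLAIM_MASK (__GFP_KERNRCLM | __GFP_EASYRCLM)

/* Clear any existing reclaim hint and apply the requested one */
static gfp_t set_rclmflags(gfp_t gfp, gfp_t reclaim_flags)
{
	return (gfp & ~GFP_RECLAIM_MASK) | reclaim_flags;
}

int main(void)
{
	/* 0x10u and 0xd0u are stand-ins, not the real GFP_HIGHUSER/GFP_KERNEL */
	gfp_t user_page = set_rclmflags(0x10u, __GFP_EASYRCLM);
	gfp_t dentry    = set_rclmflags(0xd0u, __GFP_KERNRCLM);

	printf("user page tagged easy-reclaim: %d\n", !!(user_page & __GFP_EASYRCLM));
	printf("dentry tagged kern-reclaim:    %d\n", !!(dentry & __GFP_KERNRCLM));
	/* Anything left untagged defaults to KernelNonReclaimable */
	return 0;
}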

Instead of one MAX_ORDER-sized array of free lists in struct free_area,
there is one free list for each type of reclaimability. Once a 2^MAX_ORDER
block of pages is split for a type of allocation, it is added to the free
lists for that type, in effect reserving it. Hence, over time, pages of the
different types tend to be clustered together. When a page is allocated, the
page flags are updated with a value indicating its type of reclaimability so
that it is placed on the correct list when freed.
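
The following user-space sketch models that arrangement. The constants and
field names mirror patch 2 below, but the list handling is reduced to a
singly-linked list so the example is self-contained and compiles on its own.

#include <stdio.h>

#define MAX_ORDER   11
#define RCLM_NORCLM 0		/* default: not easily reclaimed */
#define RCLM_EASY   1		/* userspace pages, trivially reclaimed */
#define RCLM_TYPES  2		/* patch 7 adds a third type, RCLM_KERN */

struct modeled_page {
	int easyrclm;			/* models the PG_easyrclm page flag */
	struct modeled_page *next;	/* models the page->lru linkage */
};

struct free_area {
	struct modeled_page *free_list[RCLM_TYPES];	/* one list per type */
	unsigned long nr_free;
};

static struct free_area free_area[MAX_ORDER];

/* Freeing consults the flag that was set at allocation time to pick a list */
static void free_one_page(struct modeled_page *page, unsigned int order)
{
	int rclmtype = page->easyrclm ? RCLM_EASY : RCLM_NORCLM;

	page->next = free_area[order].free_list[rclmtype];
	free_area[order].free_list[rclmtype] = page;
	free_area[order].nr_free++;
}

int main(void)
{
	struct modeled_page user_page = { .easyrclm = 1 };
	struct modeled_page slab_page = { .easyrclm = 0 };

	free_one_page(&user_page, 0);
	free_one_page(&slab_page, 0);

	printf("order-0 RCLM_EASY list head:   %p\n",
	       (void *)free_area[0].free_list[RCLM_EASY]);
	printf("order-0 RCLM_NORCLM list head: %p\n",
	       (void *)free_area[0].free_list[RCLM_NORCLM]);
	return 0;
}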

When the preferred freelists are exhausted, the largest possible block is
taken from an alternative list. Buddies that are split off that large block
are placed on the freelists of the preferred allocation type to mitigate
fragmentation.
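
The decision itself is small; this stand-alone sketch models the fallback
policy of __rmqueue_fallback() from patch 2: search the other type's lists
from the largest order down and, when a large block gets split, give the
left-over buddies to the preferred type. The nr_free array and the values in
main() are invented for the example.

#include <stdio.h>

#define MAX_ORDER 11

/* Returns the order actually taken, or -1 if the other type has nothing */
static int fallback_order(const unsigned long other_nr_free[MAX_ORDER],
			  unsigned int order, int *give_buddies_to_preferred)
{
	int current_order;

	for (current_order = MAX_ORDER - 1; current_order >= (int)order;
	     current_order--) {
		if (!other_nr_free[current_order])
			continue;
		/*
		 * When splitting a large block, the unused buddies are
		 * placed on the preferred type's lists so that the block
		 * changes ownership rather than staying fragmented.
		 */
		*give_buddies_to_preferred = current_order >= MAX_ORDER / 2;
		return current_order;
	}
	return -1;
}

int main(void)
{
	unsigned long other[MAX_ORDER] = { [3] = 2, [MAX_ORDER - 1] = 1 };
	int to_preferred = 0;
	int order = fallback_order(other, 2, &to_preferred);

	printf("took an order-%d block; buddies go to preferred lists: %d\n",
	       order, to_preferred);
	return 0;
}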

This implementation makes a best effort at keeping fragmentation low in all
zones. To be effective, min_free_kbytes needs to be set to roughly 10% of
physical memory (10% was found by experimentation and may be workload
dependent). Lowering that value would require anti-fragmentation to be
significantly more invasive, so it is best to find out what sorts of
workloads still cause fragmentation before taking further steps.

Our tests show that about 60-70% of physical memory can be allocated on
a desktop after a few days of uptime. In benchmarks and stress tests, we
find that 80% of memory is available as contiguous blocks at the end of
the test. For comparison, a standard kernel was getting < 1% of memory as
large pages on a desktop and about 8-12% of memory as large pages at the
end of stress tests.

Performance tests are within 0.1% for kbuild on a number of test machines.
aim9 is usually within 1%, except on x86_64 where aim9 results are unreliable.
I have never been able to show it, but it is possible that the main allocator
path is adversely affected by anti-fragmentation and that this may be exposed
by different compilers or benchmarks. If any regressions are detected due to
anti-fragmentation, it can simply be disabled via the kernel configuration,
and I'd appreciate a report detailing the regression and how to trigger it.

Following this email are 8 patches.  The early patches introduce the split
between user and kernel allocations.  Later patches introduce a further split
for kernel allocations, into KernRclm and KernNoRclm.  Note that although
the early patches consume an additional page flag, later patches reuse the
suspend bits, releasing that bit again.

Comments?
-- 
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


* [PATCH 1/8] Add __GFP_EASYRCLM flag and update callers
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
@ 2006-09-07 19:04 ` Mel Gorman
  2006-09-07 19:04 ` [PATCH 2/8] Split the free lists into kernel and user parts Mel Gorman
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:04 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

This patch adds a flag __GFP_EASYRCLM.  Allocations using the __GFP_EASYRCLM
flag are expected to be easily reclaimed by syncing with backing storage (be
it a file or swap) or cleaning the buffers and discarding.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 fs/block_dev.c          |    3 ++-
 fs/buffer.c             |    3 ++-
 fs/compat.c             |    3 ++-
 fs/exec.c               |    3 ++-
 fs/inode.c              |    3 ++-
 include/asm-i386/page.h |    4 +++-
 include/linux/gfp.h     |   12 +++++++++++-
 include/linux/highmem.h |    4 +++-
 mm/memory.c             |    8 ++++++--
 mm/swap_state.c         |    4 +++-
 10 files changed, 36 insertions(+), 11 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/fs/block_dev.c linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/block_dev.c
--- linux-2.6.18-rc5-mm1-clean/fs/block_dev.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/block_dev.c	2006-09-04 18:36:09.000000000 +0100
@@ -380,7 +380,8 @@ struct block_device *bdget(dev_t dev)
 		inode->i_rdev = dev;
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
-		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+		mapping_set_gfp_mask(&inode->i_data,
+				set_rclmflags(GFP_USER, __GFP_EASYRCLM));
 		inode->i_data.backing_dev_info = &default_backing_dev_info;
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/fs/buffer.c linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/buffer.c
--- linux-2.6.18-rc5-mm1-clean/fs/buffer.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/buffer.c	2006-09-04 18:36:09.000000000 +0100
@@ -986,7 +986,8 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	page = find_or_create_page(inode->i_mapping, index,
+				   set_rclmflags(GFP_NOFS, __GFP_EASYRCLM));
 	if (!page)
 		return NULL;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/fs/compat.c linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/compat.c
--- linux-2.6.18-rc5-mm1-clean/fs/compat.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/compat.c	2006-09-04 18:36:09.000000000 +0100
@@ -1419,7 +1419,8 @@ static int compat_copy_strings(int argc,
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(set_rclmflags(GFP_HIGHUSER,
+							__GFP_EASYRCLM));
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/fs/exec.c linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/exec.c
--- linux-2.6.18-rc5-mm1-clean/fs/exec.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/exec.c	2006-09-04 18:36:09.000000000 +0100
@@ -238,7 +238,8 @@ static int copy_strings(int argc, char _
 			page = bprm->page[i];
 			new = 0;
 			if (!page) {
-				page = alloc_page(GFP_HIGHUSER);
+				page = alloc_page(set_rclmflags(GFP_HIGHUSER,
+							__GFP_EASYRCLM));
 				bprm->page[i] = page;
 				if (!page) {
 					ret = -ENOMEM;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/fs/inode.c linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/inode.c
--- linux-2.6.18-rc5-mm1-clean/fs/inode.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/fs/inode.c	2006-09-04 18:36:09.000000000 +0100
@@ -145,7 +145,8 @@ static struct inode *alloc_inode(struct 
 		mapping->a_ops = &empty_aops;
  		mapping->host = inode;
 		mapping->flags = 0;
-		mapping_set_gfp_mask(mapping, GFP_HIGHUSER);
+		mapping_set_gfp_mask(mapping,
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM));
 		mapping->assoc_mapping = NULL;
 		mapping->backing_dev_info = &default_backing_dev_info;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/include/asm-i386/page.h linux-2.6.18-rc5-mm1-001_antifrag_flags/include/asm-i386/page.h
--- linux-2.6.18-rc5-mm1-clean/include/asm-i386/page.h	2006-08-28 04:41:48.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/include/asm-i386/page.h	2006-09-04 18:36:09.000000000 +0100
@@ -35,7 +35,9 @@
 #define clear_user_page(page, vaddr, pg)	clear_page(page)
 #define copy_user_page(to, from, vaddr, pg)	copy_page(to, from)
 
-#define alloc_zeroed_user_highpage(vma, vaddr) alloc_page_vma(GFP_HIGHUSER | __GFP_ZERO, vma, vaddr)
+#define alloc_zeroed_user_highpage(vma, vaddr) \
+	alloc_page_vma(set_rclmflags(GFP_HIGHUSER|__GFP_ZERO, __GFP_EASYRCLM),\
+								vma, vaddr)
 #define __HAVE_ARCH_ALLOC_ZEROED_USER_HIGHPAGE
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/include/linux/gfp.h linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/gfp.h
--- linux-2.6.18-rc5-mm1-clean/include/linux/gfp.h	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/gfp.h	2006-09-04 18:36:09.000000000 +0100
@@ -46,6 +46,7 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
+#define __GFP_EASYRCLM	((__force gfp_t)0x80000u) /* Easily reclaimed page */
 
 #define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
@@ -54,7 +55,11 @@ struct vm_area_struct;
 #define GFP_LEVEL_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS| \
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
-			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE)
+			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
+			__GFP_EASYRCLM)
+
+/* This mask makes up all the RCLM-related flags */
+#define GFP_RECLAIM_MASK (__GFP_EASYRCLM)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
@@ -93,6 +98,11 @@ static inline enum zone_type gfp_zone(gf
 	return ZONE_NORMAL;
 }
 
+static inline gfp_t set_rclmflags(gfp_t gfp, gfp_t reclaim_flags)
+{
+	return (gfp & ~(GFP_RECLAIM_MASK)) | reclaim_flags;
+}
+
 /*
  * There is only one page-allocator function, and two main namespaces to
  * it. The alloc_page*() variants return 'struct page *' and as such
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/include/linux/highmem.h linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/highmem.h
--- linux-2.6.18-rc5-mm1-clean/include/linux/highmem.h	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/highmem.h	2006-09-04 18:36:09.000000000 +0100
@@ -61,7 +61,9 @@ static inline void clear_user_highpage(s
 static inline struct page *
 alloc_zeroed_user_highpage(struct vm_area_struct *vma, unsigned long vaddr)
 {
-	struct page *page = alloc_page_vma(GFP_HIGHUSER, vma, vaddr);
+	struct page *page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, vaddr);
 
 	if (page)
 		clear_user_highpage(page, vaddr);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/mm/memory.c linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/memory.c
--- linux-2.6.18-rc5-mm1-clean/mm/memory.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/memory.c	2006-09-04 18:36:09.000000000 +0100
@@ -1562,7 +1562,9 @@ gotten:
 		if (!new_page)
 			goto oom;
 	} else {
-		new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+		new_page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, address);
 		if (!new_page)
 			goto oom;
 		cow_user_page(new_page, old_page, address);
@@ -2177,7 +2179,9 @@ retry:
 
 			if (unlikely(anon_vma_prepare(vma)))
 				goto oom;
-			page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+			page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, address);
 			if (!page)
 				goto oom;
 			copy_user_highpage(page, new_page, address);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-clean/mm/swap_state.c linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/swap_state.c
--- linux-2.6.18-rc5-mm1-clean/mm/swap_state.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/swap_state.c	2006-09-04 18:36:09.000000000 +0100
@@ -343,7 +343,9 @@ struct page *read_swap_cache_async(swp_e
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
+			new_page = alloc_page_vma(
+				set_rclmflags(GFP_HIGHUSER, __GFP_EASYRCLM),
+				vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}


* [PATCH 2/8] Split the free lists into kernel and user parts
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
  2006-09-07 19:04 ` [PATCH 1/8] Add __GFP_EASYRCLM flag and update callers Mel Gorman
@ 2006-09-07 19:04 ` Mel Gorman
  2006-09-08  7:54   ` Peter Zijlstra
  2006-09-07 19:04 ` [PATCH 3/8] Split the per-cpu " Mel Gorman
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:04 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

This patch adds the core of the anti-fragmentation strategy. It works by
grouping related allocation types together. The idea is that large groups of
pages that may be reclaimed are placed near each other. Each free list in
zone->free_area is split into RCLM_TYPES lists.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 include/linux/mmzone.h     |   10 +++
 include/linux/page-flags.h |    7 ++
 mm/page_alloc.c            |  109 +++++++++++++++++++++++++++++++---------
 3 files changed, 102 insertions(+), 24 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/mmzone.h linux-2.6.18-rc5-mm1-002_fragcore/include/linux/mmzone.h
--- linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/mmzone.h	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-002_fragcore/include/linux/mmzone.h	2006-09-04 18:37:59.000000000 +0100
@@ -24,8 +24,16 @@
 #endif
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
+#define RCLM_NORCLM 0
+#define RCLM_EASY   1
+#define RCLM_TYPES  2
+
+#define for_each_rclmtype_order(type, order) \
+	for (order = 0; order < MAX_ORDER; order++) \
+		for (type = 0; type < RCLM_TYPES; type++)
+
 struct free_area {
-	struct list_head	free_list;
+	struct list_head	free_list[RCLM_TYPES];
 	unsigned long		nr_free;
 };
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/page-flags.h linux-2.6.18-rc5-mm1-002_fragcore/include/linux/page-flags.h
--- linux-2.6.18-rc5-mm1-001_antifrag_flags/include/linux/page-flags.h	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-002_fragcore/include/linux/page-flags.h	2006-09-04 18:37:59.000000000 +0100
@@ -92,6 +92,7 @@
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 #define PG_readahead		20	/* Reminder to do readahead */
+#define PG_easyrclm		21	/* Page is an easy reclaim block */
 
 
 #if (BITS_PER_LONG > 32)
@@ -254,6 +255,12 @@
 #define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
 #define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
 
+#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
+#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
+#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
+#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
+#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/page_alloc.c linux-2.6.18-rc5-mm1-002_fragcore/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-001_antifrag_flags/mm/page_alloc.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-002_fragcore/mm/page_alloc.c	2006-09-04 18:37:59.000000000 +0100
@@ -133,6 +133,16 @@ static unsigned long __initdata dma_rese
   unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
+static inline int get_pageblock_type(struct page *page)
+{
+	return (PageEasyRclm(page) != 0);
+}
+
+static inline int gfpflags_to_rclmtype(unsigned long gfp_flags)
+{
+	return ((gfp_flags & __GFP_EASYRCLM) != 0);
+}
+
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
 {
@@ -402,11 +412,13 @@ static inline void __free_one_page(struc
 {
 	unsigned long page_idx;
 	int order_size = 1 << order;
+	int rclmtype = get_pageblock_type(page);
 
 	if (unlikely(PageCompound(page)))
 		destroy_compound_page(page, order);
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
+	__SetPageEasyRclm(page);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
 	VM_BUG_ON(bad_range(zone, page));
@@ -414,7 +426,6 @@ static inline void __free_one_page(struc
 	zone->free_pages += order_size;
 	while (order < MAX_ORDER-1) {
 		unsigned long combined_idx;
-		struct free_area *area;
 		struct page *buddy;
 
 		buddy = __page_find_buddy(page, page_idx, order);
@@ -422,8 +433,7 @@ static inline void __free_one_page(struc
 			break;		/* Move the buddy up one level. */
 
 		list_del(&buddy->lru);
-		area = zone->free_area + order;
-		area->nr_free--;
+		zone->free_area[order].nr_free--;
 		rmv_page_order(buddy);
 		combined_idx = __find_combined_index(page_idx, order);
 		page = page + (combined_idx - page_idx);
@@ -431,7 +441,7 @@ static inline void __free_one_page(struc
 		order++;
 	}
 	set_page_order(page, order);
-	list_add(&page->lru, &zone->free_area[order].free_list);
+	list_add(&page->lru, &zone->free_area[order].free_list[rclmtype]);
 	zone->free_area[order].nr_free++;
 }
 
@@ -567,7 +577,8 @@ void fastcall __init __free_pages_bootme
  * -- wli
  */
 static inline void expand(struct zone *zone, struct page *page,
- 	int low, int high, struct free_area *area)
+ 	int low, int high, struct free_area *area,
+	int rclmtype)
 {
 	unsigned long size = 1 << high;
 
@@ -576,7 +587,7 @@ static inline void expand(struct zone *z
 		high--;
 		size >>= 1;
 		VM_BUG_ON(bad_range(zone, &page[size]));
-		list_add(&page[size].lru, &area->free_list);
+		list_add(&page[size].lru, &area->free_list[rclmtype]);
 		area->nr_free++;
 		set_page_order(&page[size], high);
 	}
@@ -627,31 +638,80 @@ static int prep_new_page(struct page *pa
 	return 0;
 }
 
+/* Remove an element from the buddy allocator from the fallback list */
+static struct page *__rmqueue_fallback(struct zone *zone, int order,
+							gfp_t gfp_flags)
+{
+	struct free_area * area;
+	int current_order;
+	struct page *page;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
+
+	/* Find the largest possible block of pages in the other list */
+	rclmtype = !rclmtype;
+	for (current_order = MAX_ORDER-1; current_order >= order;
+						--current_order) {
+		area = &(zone->free_area[current_order]);
+ 		if (list_empty(&area->free_list[rclmtype]))
+ 			continue;
+
+		page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
+		area->nr_free--;
+
+		/*
+		 * If breaking a large block of pages, place the buddies
+		 * on the preferred allocation list
+		 */
+		if (unlikely(current_order >= MAX_ORDER / 2))
+			rclmtype = !rclmtype;
+
+		/* Remove the page from the freelists */
+		list_del(&page->lru);
+		rmv_page_order(page);
+		zone->free_pages -= 1UL << order;
+		expand(zone, page, order, current_order, area, rclmtype);
+		return page;
+	}
+
+	return NULL;
+}
+
 /* 
  * Do the hard work of removing an element from the buddy allocator.
  * Call me with the zone->lock already held.
  */
-static struct page *__rmqueue(struct zone *zone, unsigned int order)
+static struct page *__rmqueue(struct zone *zone, unsigned int order,
+						gfp_t gfp_flags)
 {
 	struct free_area * area;
 	unsigned int current_order;
 	struct page *page;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
 
+	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
-		area = zone->free_area + current_order;
-		if (list_empty(&area->free_list))
+		area = &(zone->free_area[current_order]);
+		if (list_empty(&area->free_list[rclmtype]))
 			continue;
 
-		page = list_entry(area->free_list.next, struct page, lru);
+		page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
 		list_del(&page->lru);
 		rmv_page_order(page);
 		area->nr_free--;
 		zone->free_pages -= 1UL << order;
-		expand(zone, page, order, current_order, area);
-		return page;
+		expand(zone, page, order, current_order, area, rclmtype);
+		goto got_page;
 	}
 
-	return NULL;
+	page = __rmqueue_fallback(zone, order, gfp_flags);
+
+got_page:
+	if (unlikely(rclmtype == RCLM_NORCLM) && page)
+		__ClearPageEasyRclm(page);
+
+	return page;
 }
 
 /* 
@@ -660,13 +720,14 @@ static struct page *__rmqueue(struct zon
  * Returns the number of new pages which were placed at *list.
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order, 
-			unsigned long count, struct list_head *list)
+			unsigned long count, struct list_head *list,
+			gfp_t gfp_flags)
 {
 	int i;
 	
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order);
+		struct page *page = __rmqueue(zone, order, gfp_flags);
 		if (unlikely(page == NULL))
 			break;
 		list_add_tail(&page->lru, list);
@@ -741,7 +802,7 @@ void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
 	unsigned long flags;
-	int order;
+	int order, t;
 	struct list_head *curr;
 
 	if (!zone->spanned_pages)
@@ -758,14 +819,15 @@ void mark_free_pages(struct zone *zone)
 				ClearPageNosaveFree(page);
 		}
 
-	for (order = MAX_ORDER - 1; order >= 0; --order)
-		list_for_each(curr, &zone->free_area[order].free_list) {
+	for_each_rclmtype_order(t, order) {
+		list_for_each(curr, &zone->free_area[order].free_list[t]) {
 			unsigned long i;
 
 			pfn = page_to_pfn(list_entry(curr, struct page, lru));
 			for (i = 0; i < (1UL << order); i++)
 				SetPageNosaveFree(pfn_to_page(pfn + i));
 		}
+	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
@@ -864,7 +926,7 @@ again:
 		local_irq_save(flags);
 		if (!pcp->count) {
 			pcp->count += rmqueue_bulk(zone, 0,
-						pcp->batch, &pcp->list);
+				pcp->batch, &pcp->list, gfp_flags);
 			if (unlikely(!pcp->count))
 				goto failed;
 		}
@@ -873,7 +935,7 @@ again:
 		pcp->count--;
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order);
+		page = __rmqueue(zone, order, gfp_flags);
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -1782,6 +1844,7 @@ void __meminit memmap_init_zone(unsigned
 		init_page_count(page);
 		reset_page_mapcount(page);
 		SetPageReserved(page);
+		SetPageEasyRclm(page);
 		INIT_LIST_HEAD(&page->lru);
 #ifdef WANT_PAGE_VIRTUAL
 		/* The shift won't overflow because ZONE_NORMAL is below 4G. */
@@ -1797,9 +1860,9 @@ void __meminit memmap_init_zone(unsigned
 void zone_init_free_lists(struct pglist_data *pgdat, struct zone *zone,
 				unsigned long size)
 {
-	int order;
-	for (order = 0; order < MAX_ORDER ; order++) {
-		INIT_LIST_HEAD(&zone->free_area[order].free_list);
+	int order, rclmtype;
+	for_each_rclmtype_order(rclmtype, order) {
+		INIT_LIST_HEAD(&zone->free_area[order].free_list[rclmtype]);
 		zone->free_area[order].nr_free = 0;
 	}
 }


* [PATCH 3/8] Split the per-cpu lists into kernel and user parts
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
  2006-09-07 19:04 ` [PATCH 1/8] Add __GFP_EASYRCLM flag and update callers Mel Gorman
  2006-09-07 19:04 ` [PATCH 2/8] Split the free lists into kernel and user parts Mel Gorman
@ 2006-09-07 19:04 ` Mel Gorman
  2006-09-07 19:05 ` [PATCH 4/8] Add a configure option for anti-fragmentation Mel Gorman
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:04 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

The freelists for each allocation type can slowly become fragmented due to
the per-cpu lists. Consider the following sequence of events:

1. A 2^(MAX_ORDER-1) block is reserved for __GFP_EASYRCLM pages
2. An order-0 page is allocated from the newly reserved block
3. The page is freed and placed on the per-cpu list
4. alloc_page() is called with GFP_KERNEL as the gfp_mask
5. The per-cpu list is used to satisfy the allocation

This results in a kernel page sitting in the middle of an RCLM_EASY region.
Over long periods of time, this means the anti-fragmentation scheme slowly
degrades to the standard allocator.

This patch divides the per-cpu lists into RCLM_TYPES lists.
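
As a stand-alone illustration of the fix, the sketch below models the split
per-cpu cache: a freed page goes onto the list for its own type, and an
allocation only ever looks at the list for its type, so the GFP_KERNEL
request in step 4 can no longer be satisfied by the RCLM_EASY page freed in
step 3. The struct and function names are invented for the model; the real
changes to struct per_cpu_pages are in the diff below.

#include <stdio.h>

#define RCLM_NORCLM 0
#define RCLM_EASY   1
#define RCLM_TYPES  2

struct modeled_pcp {
	int counts[RCLM_TYPES];	/* one count per reclaim type */
	int high;		/* drain threshold, unchanged by the patch */
	int batch;		/* bulk alloc/free size, unchanged */
};

/* A freed page is counted against the list matching its pageblock type */
static void pcp_free(struct modeled_pcp *pcp, int rclmtype)
{
	pcp->counts[rclmtype]++;
}

/* An allocation only consults the list for its own type */
static int pcp_alloc(struct modeled_pcp *pcp, int rclmtype)
{
	if (!pcp->counts[rclmtype])
		return -1;	/* fall back to the buddy lists instead */
	pcp->counts[rclmtype]--;
	return 0;
}

int main(void)
{
	struct modeled_pcp pcp = { .high = 6, .batch = 1 };

	pcp_free(&pcp, RCLM_EASY);		/* step 3: page freed to the pcp */
	printf("GFP_KERNEL served from that page: %s\n",
	       pcp_alloc(&pcp, RCLM_NORCLM) ? "no (good)" : "yes");
	return 0;
}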


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 include/linux/mmzone.h |   16 +++++++++--
 mm/page_alloc.c        |   63 +++++++++++++++++++++++++++-----------------
 mm/vmstat.c            |    4 +-
 3 files changed, 56 insertions(+), 27 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-002_fragcore/include/linux/mmzone.h linux-2.6.18-rc5-mm1-003_percpu/include/linux/mmzone.h
--- linux-2.6.18-rc5-mm1-002_fragcore/include/linux/mmzone.h	2006-09-04 18:37:59.000000000 +0100
+++ linux-2.6.18-rc5-mm1-003_percpu/include/linux/mmzone.h	2006-09-04 18:39:39.000000000 +0100
@@ -28,6 +28,8 @@
 #define RCLM_EASY   1
 #define RCLM_TYPES  2
 
+#define for_each_rclmtype(type) \
+	for (type = 0; type < RCLM_TYPES; type++)
 #define for_each_rclmtype_order(type, order) \
 	for (order = 0; order < MAX_ORDER; order++) \
 		for (type = 0; type < RCLM_TYPES; type++)
@@ -77,10 +79,10 @@ enum zone_stat_item {
 	NR_VM_ZONE_STAT_ITEMS };
 
 struct per_cpu_pages {
-	int count;		/* number of pages in the list */
+	int counts[RCLM_TYPES];	/* number of pages in the list */
 	int high;		/* high watermark, emptying needed */
 	int batch;		/* chunk size for buddy add/remove */
-	struct list_head list;	/* the list of pages */
+	struct list_head list[RCLM_TYPES];	/* the list of pages */
 };
 
 struct per_cpu_pageset {
@@ -91,6 +93,16 @@ struct per_cpu_pageset {
 #endif
 } ____cacheline_aligned_in_smp;
 
+static inline int pcp_count(struct per_cpu_pages *pcp)
+{
+	int rclmtype, count = 0;
+
+	for_each_rclmtype(rclmtype)
+		count += pcp->counts[rclmtype];
+
+	return count;
+}
+
 #ifdef CONFIG_NUMA
 #define zone_pcp(__z, __cpu) ((__z)->pageset[(__cpu)])
 #else
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-002_fragcore/mm/page_alloc.c linux-2.6.18-rc5-mm1-003_percpu/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-002_fragcore/mm/page_alloc.c	2006-09-04 18:37:59.000000000 +0100
+++ linux-2.6.18-rc5-mm1-003_percpu/mm/page_alloc.c	2006-09-04 18:39:39.000000000 +0100
@@ -745,7 +745,7 @@ static int rmqueue_bulk(struct zone *zon
  */
 void drain_node_pages(int nodeid)
 {
-	int i;
+	int i, pindex;
 	enum zone_type z;
 	unsigned long flags;
 
@@ -761,10 +761,14 @@ void drain_node_pages(int nodeid)
 			struct per_cpu_pages *pcp;
 
 			pcp = &pset->pcp[i];
-			if (pcp->count) {
+			if (pcp_count(pcp)) {
 				local_irq_save(flags);
-				free_pages_bulk(zone, pcp->count, &pcp->list, 0);
-				pcp->count = 0;
+				for_each_rclmtype(pindex) {
+					free_pages_bulk(zone,
+							pcp->counts[pindex],
+							&pcp->list[pindex], 0);
+					pcp->counts[pindex] = 0;
+				}
 				local_irq_restore(flags);
 			}
 		}
@@ -777,7 +781,7 @@ static void __drain_pages(unsigned int c
 {
 	unsigned long flags;
 	struct zone *zone;
-	int i;
+	int i, pindex;
 
 	for_each_zone(zone) {
 		struct per_cpu_pageset *pset;
@@ -788,8 +792,13 @@ static void __drain_pages(unsigned int c
 
 			pcp = &pset->pcp[i];
 			local_irq_save(flags);
-			free_pages_bulk(zone, pcp->count, &pcp->list, 0);
-			pcp->count = 0;
+			for_each_rclmtype(pindex) {
+				free_pages_bulk(zone,
+						pcp->counts[pindex],
+						&pcp->list[pindex], 0);
+
+				pcp->counts[pindex] = 0;
+			}
 			local_irq_restore(flags);
 		}
 	}
@@ -851,6 +860,7 @@ void drain_local_pages(void)
 static void fastcall free_hot_cold_page(struct page *page, int cold)
 {
 	struct zone *zone = page_zone(page);
+	int pindex = get_pageblock_type(page);
 	struct per_cpu_pages *pcp;
 	unsigned long flags;
 
@@ -866,11 +876,11 @@ static void fastcall free_hot_cold_page(
 	pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
 	local_irq_save(flags);
 	__count_vm_event(PGFREE);
-	list_add(&page->lru, &pcp->list);
-	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		free_pages_bulk(zone, pcp->batch, &pcp->list, 0);
-		pcp->count -= pcp->batch;
+	list_add(&page->lru, &pcp->list[pindex]);
+	pcp->counts[pindex]++;
+	if (pcp->counts[pindex] >= pcp->high) {
+		free_pages_bulk(zone, pcp->batch, &pcp->list[pindex], 0);
+		pcp->counts[pindex] -= pcp->batch;
 	}
 	local_irq_restore(flags);
 	put_cpu();
@@ -916,6 +926,7 @@ static struct page *buffered_rmqueue(str
 	struct page *page;
 	int cold = !!(gfp_flags & __GFP_COLD);
 	int cpu;
+	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
 
 again:
 	cpu  = get_cpu();
@@ -924,15 +935,15 @@ again:
 
 		pcp = &zone_pcp(zone, cpu)->pcp[cold];
 		local_irq_save(flags);
-		if (!pcp->count) {
-			pcp->count += rmqueue_bulk(zone, 0,
-				pcp->batch, &pcp->list, gfp_flags);
-			if (unlikely(!pcp->count))
+		if (!pcp->counts[rclmtype]) {
+			pcp->counts[rclmtype] += rmqueue_bulk(zone, 0,
+				pcp->batch, &pcp->list[rclmtype], gfp_flags);
+			if (unlikely(!pcp->counts[rclmtype]))
 				goto failed;
 		}
-		page = list_entry(pcp->list.next, struct page, lru);
+		page = list_entry(pcp->list[rclmtype].next, struct page, lru);
 		list_del(&page->lru);
-		pcp->count--;
+		pcp->counts[rclmtype]--;
 	} else {
 		spin_lock_irqsave(&zone->lock, flags);
 		page = __rmqueue(zone, order, gfp_flags);
@@ -1480,7 +1491,7 @@ void show_free_areas(void)
 					temperature ? "cold" : "hot",
 					pageset->pcp[temperature].high,
 					pageset->pcp[temperature].batch,
-					pageset->pcp[temperature].count);
+					pcp_count(&pageset->pcp[temperature]));
 		}
 	}
 
@@ -1921,20 +1932,26 @@ static int __cpuinit zone_batchsize(stru
 inline void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
 {
 	struct per_cpu_pages *pcp;
+	int rclmtype;
 
 	memset(p, 0, sizeof(*p));
 
 	pcp = &p->pcp[0];		/* hot */
-	pcp->count = 0;
+	for_each_rclmtype(rclmtype) {
+		pcp->counts[rclmtype] = 0;
+		INIT_LIST_HEAD(&pcp->list[rclmtype]);
+	}
 	pcp->high = 6 * batch;
 	pcp->batch = max(1UL, 1 * batch);
-	INIT_LIST_HEAD(&pcp->list);
+	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 
 	pcp = &p->pcp[1];		/* cold*/
-	pcp->count = 0;
+	for_each_rclmtype(rclmtype) {
+		pcp->counts[rclmtype] = 0;
+		INIT_LIST_HEAD(&pcp->list[rclmtype]);
+	}
 	pcp->high = 2 * batch;
 	pcp->batch = max(1UL, batch/2);
-	INIT_LIST_HEAD(&pcp->list);
 }
 
 /*
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-002_fragcore/mm/vmstat.c linux-2.6.18-rc5-mm1-003_percpu/mm/vmstat.c
--- linux-2.6.18-rc5-mm1-002_fragcore/mm/vmstat.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-003_percpu/mm/vmstat.c	2006-09-04 18:39:39.000000000 +0100
@@ -562,7 +562,7 @@ static int zoneinfo_show(struct seq_file
 
 			pageset = zone_pcp(zone, i);
 			for (j = 0; j < ARRAY_SIZE(pageset->pcp); j++) {
-				if (pageset->pcp[j].count)
+				if (pcp_count(&pageset->pcp[j]))
 					break;
 			}
 			if (j == ARRAY_SIZE(pageset->pcp))
@@ -574,7 +574,7 @@ static int zoneinfo_show(struct seq_file
 					   "\n              high:  %i"
 					   "\n              batch: %i",
 					   i, j,
-					   pageset->pcp[j].count,
+					   pcp_count(&pageset->pcp[j]),
 					   pageset->pcp[j].high,
 					   pageset->pcp[j].batch);
 			}


* [PATCH 4/8] Add a configure option for anti-fragmentation
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (2 preceding siblings ...)
  2006-09-07 19:04 ` [PATCH 3/8] Split the per-cpu " Mel Gorman
@ 2006-09-07 19:05 ` Mel Gorman
  2006-09-07 19:05 ` [PATCH 5/8] Drain per-cpu lists when high-order allocations fail Mel Gorman
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

The anti-fragmentation strategy has memory overhead. This patch allows
the strategy to be disabled for small-memory systems or when it is known
that a workload suffers because of it. It also serves to show where the
anti-fragmentation strategy interacts with the standard buddy allocator.


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Joel Schopp <jschopp@austin.ibm.com>
---

 init/Kconfig    |   14 ++++++++++++++
 mm/page_alloc.c |   20 ++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-003_percpu/init/Kconfig linux-2.6.18-rc5-mm1-004_configurable/init/Kconfig
--- linux-2.6.18-rc5-mm1-003_percpu/init/Kconfig	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-004_configurable/init/Kconfig	2006-09-04 18:41:13.000000000 +0100
@@ -478,6 +478,20 @@ config SLOB
 	default !SLAB
 	bool
 
+config PAGEALLOC_ANTIFRAG
+ 	bool "Avoid fragmentation in the page allocator"
+ 	def_bool n
+ 	help
+ 	  The standard allocator will fragment memory over time which means
+ 	  that high order allocations will fail even if kswapd is running. If
+ 	  this option is set, the allocator will try and group page types into
+ 	  two groups, kernel and easy reclaimable. The gain is a best effort
+ 	  attempt at lowering fragmentation which a few workloads care about.
+ 	  The loss is a more complex allocator that may perform slower. If
+	  you are interested in working with large pages, say Y and set
+	  /proc/sys/vm/min_free_kbytes to be 10% of physical memory. Otherwise
+ 	  say N
+
 menu "Loadable module support"
 
 config MODULES
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-003_percpu/mm/page_alloc.c linux-2.6.18-rc5-mm1-004_configurable/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-003_percpu/mm/page_alloc.c	2006-09-04 18:39:39.000000000 +0100
+++ linux-2.6.18-rc5-mm1-004_configurable/mm/page_alloc.c	2006-09-04 18:41:13.000000000 +0100
@@ -133,6 +133,7 @@ static unsigned long __initdata dma_rese
   unsigned long __initdata arch_zone_highest_possible_pfn[MAX_NR_ZONES];
 #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
 static inline int get_pageblock_type(struct page *page)
 {
 	return (PageEasyRclm(page) != 0);
@@ -142,6 +143,17 @@ static inline int gfpflags_to_rclmtype(u
 {
 	return ((gfp_flags & __GFP_EASYRCLM) != 0);
 }
+#else
+static inline int get_pageblock_type(struct page *page)
+{
+	return RCLM_NORCLM;
+}
+
+static inline int gfpflags_to_rclmtype(unsigned long gfp_flags)
+{
+	return RCLM_NORCLM;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 
 #ifdef CONFIG_DEBUG_VM
 static int page_outside_zone_boundaries(struct zone *zone, struct page *page)
@@ -638,6 +650,7 @@ static int prep_new_page(struct page *pa
 	return 0;
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -676,6 +689,13 @@ static struct page *__rmqueue_fallback(s
 
 	return NULL;
 }
+#else
+static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
+							int rclmtype)
+{
+	return NULL;
+}
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 
 /* 
  * Do the hard work of removing an element from the buddy allocator.


* [PATCH 5/8] Drain per-cpu lists when high-order allocations fail
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (3 preceding siblings ...)
  2006-09-07 19:05 ` [PATCH 4/8] Add a configure option for anti-fragmentation Mel Gorman
@ 2006-09-07 19:05 ` Mel Gorman
  2006-09-07 19:05 ` [PATCH 6/8] Move free pages between lists on steal Mel Gorman
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

Per-cpu pages can accidentally cause fragmentation because they are free but
pinned pages in an otherwise contiguous block.  When this patch is applied,
the per-cpu caches are drained after direct reclaim is entered if the
requested order is greater than 0. It simply reuses the code used by suspend
and CPU hotplug.
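
The following stand-alone toy shows why the drain helps a high-order request:
a page parked in the per-cpu cache is free but pinned, so its buddy on the
free lists cannot merge into an order-1 block until the cache is drained. The
arrays and pfn values are invented for the example; the real hook is the
drain_all_local_pages() call in the diff below.

#include <stdbool.h>
#include <stdio.h>

#define NR_PAGES 8

static bool in_buddy[NR_PAGES];	/* page sits on the buddy free lists */
static bool in_pcp[NR_PAGES];	/* page is free but held in a per-cpu cache */

static bool order1_block_free(unsigned int pfn)
{
	return in_buddy[pfn] && in_buddy[pfn ^ 1];	/* both buddies free */
}

static void drain_all_local_pages(void)
{
	for (unsigned int pfn = 0; pfn < NR_PAGES; pfn++) {
		if (in_pcp[pfn]) {
			in_pcp[pfn] = false;	/* leave the per-cpu cache */
			in_buddy[pfn] = true;	/* back onto the buddy lists */
		}
	}
}

int main(void)
{
	in_buddy[4] = true;	/* pfn 4 was freed straight to the buddy lists */
	in_pcp[5] = true;	/* its buddy, pfn 5, sits in a per-cpu cache */

	printf("order-1 block at pfn 4 before drain: %d\n", order1_block_free(4));
	drain_all_local_pages();	/* what the patch does on failure */
	printf("order-1 block at pfn 4 after drain:  %d\n", order1_block_free(4));
	return 0;
}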

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 Kconfig      |    4 ++++
 page_alloc.c |   32 +++++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-004_configurable/mm/Kconfig linux-2.6.18-rc5-mm1-005_drainpercpu/mm/Kconfig
--- linux-2.6.18-rc5-mm1-004_configurable/mm/Kconfig	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-005_drainpercpu/mm/Kconfig	2006-09-07 19:16:11.000000000 +0100
@@ -242,3 +242,7 @@ config READAHEAD_SMOOTH_AGING
 		- have the danger of readahead thrashing(i.e. memory tight)
 
 	  This feature is only available on non-NUMA systems.
+
+config NEED_DRAIN_PERCPU_PAGES
+	def_bool y
+	depends on PM || HOTPLUG_CPU || PAGEALLOC_ANTIFRAG
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-004_configurable/mm/page_alloc.c linux-2.6.18-rc5-mm1-005_drainpercpu/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-004_configurable/mm/page_alloc.c	2006-09-07 19:14:30.000000000 +0100
+++ linux-2.6.18-rc5-mm1-005_drainpercpu/mm/page_alloc.c	2006-09-07 19:16:38.000000000 +0100
@@ -796,7 +796,7 @@ void drain_node_pages(int nodeid)
 }
 #endif
 
-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
+#ifdef CONFIG_NEED_DRAIN_PERCPU_PAGES
 static void __drain_pages(unsigned int cpu)
 {
 	unsigned long flags;
@@ -823,7 +823,7 @@ static void __drain_pages(unsigned int c
 		}
 	}
 }
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */
+#endif /* CONFIG_NEED_DRAIN_PERCPU_PAGES */
 
 #ifdef CONFIG_PM
 
@@ -860,7 +860,9 @@ void mark_free_pages(struct zone *zone)
 
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
+#endif /* CONFIG_PM */
 
+#if defined(CONFIG_PM) || defined(CONFIG_PAGEALLOC_ANTIFRAG)
 /*
  * Spill all of this CPU's per-cpu pages back into the buddy allocator.
  */
@@ -872,7 +874,28 @@ void drain_local_pages(void)
 	__drain_pages(smp_processor_id());
 	local_irq_restore(flags);	
 }
-#endif /* CONFIG_PM */
+
+void smp_drain_local_pages(void *arg)
+{
+	drain_local_pages();
+}
+
+/*
+ * Spill all the per-cpu pages from all CPUs back into the buddy allocator
+ */
+void drain_all_local_pages(void)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__drain_pages(smp_processor_id());
+	local_irq_restore(flags);
+
+	smp_call_function(smp_drain_local_pages, NULL, 0, 1);
+}
+#else
+void drain_all_local_pages(void) {}
+#endif /* CONFIG_PM || CONFIG_PAGEALLOC_ANTIFRAG */
 
 /*
  * Free a 0-order page
@@ -1232,6 +1255,9 @@ rebalance:
 
 	cond_resched();
 
+	if (order != 0)
+		drain_all_local_pages();
+
 	if (likely(did_some_progress)) {
 		page = get_page_from_freelist(gfp_mask, order,
 						zonelist, alloc_flags);


* [PATCH 6/8] Move free pages between lists on steal
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (4 preceding siblings ...)
  2006-09-07 19:05 ` [PATCH 5/8] Drain per-cpu lists when high-order allocations fail Mel Gorman
@ 2006-09-07 19:05 ` Mel Gorman
  2006-09-07 19:06 ` [PATCH 7/8] Introduce the RCLM_KERN allocation type Mel Gorman
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:05 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

When a fallback occurs, free pages of one allocation type end up stored on
the lists of another. When a large block is stolen, this patch moves all the
free pages within that block to the free lists of the stealing allocation
type.
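
A stand-alone toy of the effect (the real code is move_freepages_block() in
the diff below, which also handles buddy orders and zone boundaries): every
free page still sitting in the stolen MAX_ORDER block is relinked onto the
stealing type's lists. The struct and the initial layout are invented for the
example.

#include <stdio.h>

#define BLOCK_PAGES 8		/* stand-in for MAX_ORDER_NR_PAGES */

enum rclmtype { RCLM_NORCLM, RCLM_EASY };

struct toy_page {
	int is_free;		/* page is on a free list */
	enum rclmtype type;	/* which type's free list it sits on */
};

/* Retag every free page in the block; the real code moves whole buddies */
static int move_freepages(struct toy_page *block, enum rclmtype to)
{
	int moved = 0;

	for (int i = 0; i < BLOCK_PAGES; i++) {
		if (block[i].is_free && block[i].type != to) {
			block[i].type = to;
			moved++;
		}
	}
	return moved;
}

int main(void)
{
	struct toy_page block[BLOCK_PAGES] = {
		[0] = { 1, RCLM_EASY }, [3] = { 1, RCLM_EASY },
		[6] = { 1, RCLM_EASY },
	};

	/* An RCLM_NORCLM request has just stolen a large chunk of this block */
	printf("free pages retagged after the steal: %d\n",
	       move_freepages(block, RCLM_NORCLM));
	return 0;
}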

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 page_alloc.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 75 insertions(+), 9 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-005_drainpercpu/mm/page_alloc.c linux-2.6.18-rc5-mm1-006_movefree/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-005_drainpercpu/mm/page_alloc.c	2006-09-04 18:42:43.000000000 +0100
+++ linux-2.6.18-rc5-mm1-006_movefree/mm/page_alloc.c	2006-09-04 18:44:14.000000000 +0100
@@ -651,6 +651,62 @@ static int prep_new_page(struct page *pa
 }
 
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
+/*
+ * Move the free pages in a range to the free lists of the requested type.
+ * Note that start_page and end_page may not be aligned on a MAX_ORDER_NR_PAGES
+ * boundary. If alignment is required, use move_freepages_block()
+ */
+int move_freepages(struct zone *zone,
+			struct page *start_page, struct page *end_page,
+			int rclmtype)
+{
+	struct page *page;
+	unsigned long order;
+	int blocks_moved = 0;
+
+	BUG_ON(page_zone(start_page) != page_zone(end_page));
+
+	for (page = start_page; page < end_page;) {
+		if (!PageBuddy(page)) {
+			page++;
+			continue;
+		}
+#ifdef CONFIG_HOLES_IN_ZONE
+		if (!pfn_valid(page_to_pfn(page))) {
+			page++;
+			continue;
+		}
+#endif
+
+		order = page_order(page);
+		list_del(&page->lru);
+		list_add(&page->lru,
+			&zone->free_area[order].free_list[rclmtype]);
+		page += 1 << order;
+		blocks_moved++;
+	}
+
+	return blocks_moved;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int rclmtype)
+{
+	unsigned long start_pfn;
+	struct page *start_page, *end_page;
+
+	start_pfn = page_to_pfn(page);
+	start_pfn = ALIGN(start_pfn, MAX_ORDER_NR_PAGES);
+	start_page = pfn_to_page(start_pfn);
+	end_page = start_page + MAX_ORDER_NR_PAGES;
+
+	if (page_zone(page) != page_zone(start_page))
+		start_page = page;
+	if (page_zone(page) != page_zone(end_page))
+		return 0;
+
+	return move_freepages(zone, start_page, end_page, rclmtype);
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -658,10 +714,10 @@ static struct page *__rmqueue_fallback(s
 	struct free_area * area;
 	int current_order;
 	struct page *page;
-	int rclmtype = gfpflags_to_rclmtype(gfp_flags);
+	int start_rclmtype = gfpflags_to_rclmtype(gfp_flags);
+	int rclmtype = !start_rclmtype;
 
 	/* Find the largest possible block of pages in the other list */
-	rclmtype = !rclmtype;
 	for (current_order = MAX_ORDER-1; current_order >= order;
 						--current_order) {
 		area = &(zone->free_area[current_order]);
@@ -672,24 +728,34 @@ static struct page *__rmqueue_fallback(s
 					struct page, lru);
 		area->nr_free--;
 
-		/*
-		 * If breaking a large block of pages, place the buddies
-		 * on the preferred allocation list
-		 */
-		if (unlikely(current_order >= MAX_ORDER / 2))
-			rclmtype = !rclmtype;
-
 		/* Remove the page from the freelists */
 		list_del(&page->lru);
 		rmv_page_order(page);
 		zone->free_pages -= 1UL << order;
 		expand(zone, page, order, current_order, area, rclmtype);
+
+		/* Move free pages between lists if stealing a large block */
+		if (current_order > MAX_ORDER / 2)
+			move_freepages_block(zone, page, start_rclmtype);
+
 		return page;
 	}
 
 	return NULL;
 }
 #else
+int move_freepages(struct zone *zone,
+			struct page *start_page, struct page *end_page,
+			int rclmtype)
+{
+	return 0;
+}
+
+int move_freepages_block(struct zone *zone, struct page *page, int rclmtype)
+{
+	return 0;
+}
+
 static struct page *__rmqueue_fallback(struct zone *zone, unsigned int order,
 							int rclmtype)
 {


* [PATCH 7/8] Introduce the RCLM_KERN allocation type
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (5 preceding siblings ...)
  2006-09-07 19:05 ` [PATCH 6/8] Move free pages between lists on steal Mel Gorman
@ 2006-09-07 19:06 ` Mel Gorman
  2006-09-07 19:06 ` [PATCH 8/8] [DEBUG] Add statistics Mel Gorman
  2006-09-08  0:58 ` [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Andrew Morton
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

Some kernel allocations, such as inode caches, are easily reclaimable, and
these reclaimable kernel allocations are by far the most common type of
kernel allocation. This patch marks such allocations explicitly and tries
to group them together.

As another page bit would normally be required, it was decided to reuse the
suspend-related page bits and make anti-fragmentation and software suspend
mutually exclusive. More details are in the patch.
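
For illustration only, here is one possible way the three types could be
derived from two page bits. The bit numbers and the encoding below are
assumptions made for this sketch, not necessarily what the patch does; refer
to the page-flags.h changes for the real scheme.

#include <stdio.h>

#define PG_kernrclm	13	/* assumed: reuses the old PG_nosave bit */
#define PG_easyrclm	18	/* assumed: reuses the old PG_nosave_free bit */

enum rclmtype { RCLM_NORCLM = 0, RCLM_KERN = 1, RCLM_EASY = 2 };

/* Derive the allocation type from the (assumed) page-flag bits */
static enum rclmtype get_pageblock_type(unsigned long flags)
{
	if (flags & (1UL << PG_easyrclm))
		return RCLM_EASY;
	if (flags & (1UL << PG_kernrclm))
		return RCLM_KERN;
	return RCLM_NORCLM;	/* no bit set: KernNoRclm by default */
}

int main(void)
{
	printf("dentry page -> %d, anon page -> %d, module page -> %d\n",
	       get_pageblock_type(1UL << PG_kernrclm),
	       get_pageblock_type(1UL << PG_easyrclm),
	       get_pageblock_type(0UL));
	return 0;
}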

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 arch/x86_64/kernel/e820.c  |    8 +++++
 fs/buffer.c                |    5 +--
 fs/dcache.c                |    3 +
 fs/ext2/super.c            |    3 +
 fs/ext3/super.c            |    3 +
 fs/jbd/journal.c           |    6 ++-
 fs/jbd/revoke.c            |    6 ++-
 fs/ntfs/inode.c            |    6 ++-
 fs/reiserfs/super.c        |    3 +
 include/linux/gfp.h        |   10 +++---
 include/linux/mmzone.h     |    5 +--
 include/linux/page-flags.h |   60 +++++++++++++++++++++++++++++++------
 init/Kconfig               |    1 
 lib/radix-tree.c           |    6 ++-
 mm/page_alloc.c            |   64 ++++++++++++++++++++++++++--------------
 mm/shmem.c                 |    8 +++--
 net/core/skbuff.c          |    1 
 17 files changed, 144 insertions(+), 54 deletions(-)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/arch/x86_64/kernel/e820.c linux-2.6.18-rc5-mm1-007_kernrclm/arch/x86_64/kernel/e820.c
--- linux-2.6.18-rc5-mm1-006_movefree/arch/x86_64/kernel/e820.c	2006-09-04 18:34:30.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/arch/x86_64/kernel/e820.c	2006-09-04 18:45:50.000000000 +0100
@@ -235,6 +235,13 @@ e820_register_active_regions(int nid, un
 	}
 }
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+static void __init
+e820_mark_nosave_range(unsigned long start, unsigned long end)
+{
+	printk("Nosave not set when anti-frag is enabled");
+}
+#else
 /* Mark pages corresponding to given address range as nosave */
 static void __init
 e820_mark_nosave_range(unsigned long start, unsigned long end)
@@ -250,6 +257,7 @@ e820_mark_nosave_range(unsigned long sta
 		if (pfn_valid(pfn))
 			SetPageNosave(pfn_to_page(pfn));
 }
+#endif
 
 /*
  * Find the ranges of physical addresses that do not correspond to
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/buffer.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/buffer.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/buffer.c	2006-09-04 18:36:09.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/buffer.c	2006-09-04 18:45:50.000000000 +0100
@@ -2642,7 +2642,7 @@ int submit_bh(int rw, struct buffer_head
 	 * from here on down, it's all bio -- do the initial mapping,
 	 * submit_bio -> generic_make_request may further map this bio around
 	 */
-	bio = bio_alloc(GFP_NOIO, 1);
+	bio = bio_alloc(set_rclmflags(GFP_NOIO, __GFP_EASYRCLM), 1);
 
 	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
 	bio->bi_bdev = bh->b_bdev;
@@ -2922,7 +2922,8 @@ static void recalc_bh_state(void)
 	
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags)
 {
-	struct buffer_head *ret = kmem_cache_alloc(bh_cachep, gfp_flags);
+	struct buffer_head *ret = kmem_cache_alloc(bh_cachep,
+				set_rclmflags(gfp_flags, __GFP_KERNRCLM));
 	if (ret) {
 		get_cpu_var(bh_accounting).nr++;
 		recalc_bh_state();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/dcache.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/dcache.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/dcache.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/dcache.c	2006-09-04 18:45:50.000000000 +0100
@@ -853,7 +853,8 @@ struct dentry *d_alloc(struct dentry * p
 	struct dentry *dentry;
 	char *dname;
 
-	dentry = kmem_cache_alloc(dentry_cache, GFP_KERNEL); 
+	dentry = kmem_cache_alloc(dentry_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!dentry)
 		return NULL;
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/ext2/super.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/ext2/super.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/ext2/super.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/ext2/super.c	2006-09-04 18:45:50.000000000 +0100
@@ -140,7 +140,8 @@ static kmem_cache_t * ext2_inode_cachep;
 static struct inode *ext2_alloc_inode(struct super_block *sb)
 {
 	struct ext2_inode_info *ei;
-	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, SLAB_KERNEL);
+	ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep,
+				set_rclmflags(SLAB_KERNEL, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT2_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/ext3/super.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/ext3/super.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/ext3/super.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/ext3/super.c	2006-09-04 18:45:50.000000000 +0100
@@ -444,7 +444,8 @@ static struct inode *ext3_alloc_inode(st
 {
 	struct ext3_inode_info *ei;
 
-	ei = kmem_cache_alloc(ext3_inode_cachep, SLAB_NOFS);
+	ei = kmem_cache_alloc(ext3_inode_cachep,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 #ifdef CONFIG_EXT3_FS_POSIX_ACL
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/jbd/journal.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/jbd/journal.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/jbd/journal.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/jbd/journal.c	2006-09-04 18:45:50.000000000 +0100
@@ -1735,7 +1735,8 @@ static struct journal_head *journal_allo
 #ifdef CONFIG_JBD_DEBUG
 	atomic_inc(&nr_journal_heads);
 #endif
-	ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+	ret = kmem_cache_alloc(journal_head_cache,
+				set_rclmflags(GFP_NOFS, __GFP_KERNRCLM));
 	if (ret == 0) {
 		jbd_debug(1, "out of memory for journal_head\n");
 		if (time_after(jiffies, last_warning + 5*HZ)) {
@@ -1745,7 +1746,8 @@ static struct journal_head *journal_allo
 		}
 		while (ret == 0) {
 			yield();
-			ret = kmem_cache_alloc(journal_head_cache, GFP_NOFS);
+			ret = kmem_cache_alloc(journal_head_cache,
+				set_rclmflags(GFP_NOFS, __GFP_KERNRCLM));
 		}
 	}
 	return ret;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/jbd/revoke.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/jbd/revoke.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/jbd/revoke.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/jbd/revoke.c	2006-09-04 18:45:50.000000000 +0100
@@ -206,7 +206,8 @@ int journal_init_revoke(journal_t *journ
 	while((tmp >>= 1UL) != 0UL)
 		shift++;
 
-	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!journal->j_revoke_table[0])
 		return -ENOMEM;
 	journal->j_revoke = journal->j_revoke_table[0];
@@ -229,7 +230,8 @@ int journal_init_revoke(journal_t *journ
 	for (tmp = 0; tmp < hash_size; tmp++)
 		INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
 
-	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
+	journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache,
+				set_rclmflags(GFP_KERNEL, __GFP_KERNRCLM));
 	if (!journal->j_revoke_table[1]) {
 		kfree(journal->j_revoke_table[0]->hash_table);
 		kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/ntfs/inode.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/ntfs/inode.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/ntfs/inode.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/ntfs/inode.c	2006-09-04 18:45:50.000000000 +0100
@@ -324,7 +324,8 @@ struct inode *ntfs_alloc_big_inode(struc
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_big_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_big_inode_cache,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return VFS_I(ni);
@@ -349,7 +350,8 @@ static inline ntfs_inode *ntfs_alloc_ext
 	ntfs_inode *ni;
 
 	ntfs_debug("Entering.");
-	ni = kmem_cache_alloc(ntfs_inode_cache, SLAB_NOFS);
+	ni = kmem_cache_alloc(ntfs_inode_cache,
+				set_rclmflags(SLAB_NOFS, __GFP_KERNRCLM));
 	if (likely(ni != NULL)) {
 		ni->state = 0;
 		return ni;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/fs/reiserfs/super.c linux-2.6.18-rc5-mm1-007_kernrclm/fs/reiserfs/super.c
--- linux-2.6.18-rc5-mm1-006_movefree/fs/reiserfs/super.c	2006-09-04 18:34:32.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/fs/reiserfs/super.c	2006-09-04 18:45:50.000000000 +0100
@@ -496,7 +496,8 @@ static struct inode *reiserfs_alloc_inod
 {
 	struct reiserfs_inode_info *ei;
 	ei = (struct reiserfs_inode_info *)
-	    kmem_cache_alloc(reiserfs_inode_cachep, SLAB_KERNEL);
+	    kmem_cache_alloc(reiserfs_inode_cachep,
+			    	set_rclmflags(SLAB_KERNEL, __GFP_KERNRCLM));
 	if (!ei)
 		return NULL;
 	return &ei->vfs_inode;
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/include/linux/gfp.h linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/gfp.h
--- linux-2.6.18-rc5-mm1-006_movefree/include/linux/gfp.h	2006-09-04 18:36:09.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/gfp.h	2006-09-04 18:45:50.000000000 +0100
@@ -46,9 +46,10 @@ struct vm_area_struct;
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
 #define __GFP_HARDWALL   ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
 #define __GFP_THISNODE	((__force gfp_t)0x40000u)/* No fallback, no policies */
-#define __GFP_EASYRCLM	((__force gfp_t)0x80000u) /* Easily reclaimed page */
+#define __GFP_KERNRCLM	((__force gfp_t)0x80000u) /* Kernel reclaimable page */
+#define __GFP_EASYRCLM	((__force gfp_t)0x100000u) /* Easily reclaimed page */
 
-#define __GFP_BITS_SHIFT 20	/* Room for 20 __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 21	/* Room for 21 __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /* if you forget to add the bitmask here kernel will crash, period */
@@ -56,10 +57,10 @@ struct vm_area_struct;
 			__GFP_COLD|__GFP_NOWARN|__GFP_REPEAT| \
 			__GFP_NOFAIL|__GFP_NORETRY|__GFP_NO_GROW|__GFP_COMP| \
 			__GFP_NOMEMALLOC|__GFP_HARDWALL|__GFP_THISNODE|\
-			__GFP_EASYRCLM)
+			__GFP_KERNRCLM|__GFP_EASYRCLM)
 
 /* This mask makes up all the RCLM-related flags */
-#define GFP_RECLAIM_MASK (__GFP_EASYRCLM)
+#define GFP_RECLAIM_MASK (__GFP_KERNRCLM|__GFP_EASYRCLM)
 
 /* This equals 0, but use constants in case they ever change */
 #define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
@@ -100,6 +101,7 @@ static inline enum zone_type gfp_zone(gf
 
 static inline gfp_t set_rclmflags(gfp_t gfp, gfp_t reclaim_flags)
 {
+	BUG_ON((gfp & GFP_RECLAIM_MASK) == GFP_RECLAIM_MASK);
 	return (gfp & ~(GFP_RECLAIM_MASK)) | reclaim_flags;
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/include/linux/mmzone.h linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/mmzone.h
--- linux-2.6.18-rc5-mm1-006_movefree/include/linux/mmzone.h	2006-09-04 18:39:39.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/mmzone.h	2006-09-04 18:45:50.000000000 +0100
@@ -25,8 +25,9 @@
 #define MAX_ORDER_NR_PAGES (1 << (MAX_ORDER - 1))
 
 #define RCLM_NORCLM 0
-#define RCLM_EASY   1
-#define RCLM_TYPES  2
+#define RCLM_KERN   1
+#define RCLM_EASY   2
+#define RCLM_TYPES  3
 
 #define for_each_rclmtype(type) \
 	for (type = 0; type < RCLM_TYPES; type++)
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/include/linux/page-flags.h linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/page-flags.h
--- linux-2.6.18-rc5-mm1-006_movefree/include/linux/page-flags.h	2006-09-04 18:37:59.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/include/linux/page-flags.h	2006-09-04 18:45:50.000000000 +0100
@@ -82,18 +82,37 @@
 #define PG_private		11	/* If pagecache, has fs-private data */
 
 #define PG_writeback		12	/* Page is under writeback */
-#define PG_nosave		13	/* Used for system suspend/resume */
 #define PG_compound		14	/* Part of a compound page */
 #define PG_swapcache		15	/* Swap page: swp_entry_t in private */
 
 #define PG_mappedtodisk		16	/* Has blocks allocated on-disk */
 #define PG_reclaim		17	/* To be reclaimed asap */
-#define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
 #define PG_readahead		20	/* Reminder to do readahead */
-#define PG_easyrclm		21	/* Page is an easy reclaim block */
 
+/*
+ * As anti-fragmentation requires two flags, it was best to reuse the suspend
+ * flags and make anti-fragmentation depend on !SOFTWARE_SUSPEND. This works
+ * on the assumption that machines being suspended do not really care about
+ * large contiguous allocations. There are two alternatives to where the
+ * anti-fragmentation flags could be stored
+ *
+ * 1. Use the lower two bits of page->lru and remove direct references to
+ *    page->lru
+ * 2. Use the page->flags of the struct page backing the page storing the
+ *    mem_map
+ *
+ * The first option may be difficult to read. The second option would require
+ * an additional cache line
+ */
+#ifndef CONFIG_PAGEALLOC_ANTIFRAG
+#define PG_nosave		13	/* Used for system suspend/resume */
+#define PG_nosave_free		18	/* Free, should not be written */
+#else
+#define PG_kernrclm		13	/* Page is a kernel reclaim block */
+#define PG_easyrclm		18	/* Page is an easy reclaim block */
+#endif
 
 #if (BITS_PER_LONG > 32)
 /*
@@ -212,6 +231,7 @@
 		ret;							\
 	})
 
+#ifndef CONFIG_PAGEALLOC_ANTIFRAG
 #define PageNosave(page)	test_bit(PG_nosave, &(page)->flags)
 #define SetPageNosave(page)	set_bit(PG_nosave, &(page)->flags)
 #define TestSetPageNosave(page)	test_and_set_bit(PG_nosave, &(page)->flags)
@@ -222,6 +242,34 @@
 #define SetPageNosaveFree(page)	set_bit(PG_nosave_free, &(page)->flags)
 #define ClearPageNosaveFree(page)		clear_bit(PG_nosave_free, &(page)->flags)
 
+#define PageKernRclm(page)	(0)
+#define SetPageKernRclm(page)	do {} while (0)
+#define ClearPageKernRclm(page)	do {} while (0)
+#define __SetPageKernRclm(page)	do {} while (0)
+#define __ClearPageKernRclm(page) do {} while (0)
+
+#define PageEasyRclm(page)	(0)
+#define SetPageEasyRclm(page)	do {} while (0)
+#define ClearPageEasyRclm(page)	do {} while (0)
+#define __SetPageEasyRclm(page)	do {} while (0)
+#define __ClearPageEasyRclm(page) do {} while (0)
+
+#else
+
+#define PageKernRclm(page)	test_bit(PG_kernrclm, &(page)->flags)
+#define SetPageKernRclm(page)	set_bit(PG_kernrclm, &(page)->flags)
+#define ClearPageKernRclm(page)	clear_bit(PG_kernrclm, &(page)->flags)
+#define __SetPageKernRclm(page)	__set_bit(PG_kernrclm, &(page)->flags)
+#define __ClearPageKernRclm(page) __clear_bit(PG_kernrclm, &(page)->flags)
+
+#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
+#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
+#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
+#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
+#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
+
+
 #define PageBuddy(page)		test_bit(PG_buddy, &(page)->flags)
 #define __SetPageBuddy(page)	__set_bit(PG_buddy, &(page)->flags)
 #define __ClearPageBuddy(page)	__clear_bit(PG_buddy, &(page)->flags)
@@ -255,12 +303,6 @@
 #define SetPageReadahead(page)	set_bit(PG_readahead, &(page)->flags)
 #define TestClearPageReadahead(page) test_and_clear_bit(PG_readahead, &(page)->flags)
 
-#define PageEasyRclm(page)	test_bit(PG_easyrclm, &(page)->flags)
-#define SetPageEasyRclm(page)	set_bit(PG_easyrclm, &(page)->flags)
-#define ClearPageEasyRclm(page)	clear_bit(PG_easyrclm, &(page)->flags)
-#define __SetPageEasyRclm(page)	__set_bit(PG_easyrclm, &(page)->flags)
-#define __ClearPageEasyRclm(page) __clear_bit(PG_easyrclm, &(page)->flags)
-
 struct page;	/* forward declaration */
 
 int test_clear_page_dirty(struct page *page);
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/init/Kconfig linux-2.6.18-rc5-mm1-007_kernrclm/init/Kconfig
--- linux-2.6.18-rc5-mm1-006_movefree/init/Kconfig	2006-09-04 18:41:13.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/init/Kconfig	2006-09-04 18:45:50.000000000 +0100
@@ -491,6 +491,7 @@ config PAGEALLOC_ANTIFRAG
 	  you are interested in working with large pages, say Y and set
 	  /proc/sys/vm/min_free_bytes to be 10% of physical memory. Otherwise
  	  say N
+	depends on !SOFTWARE_SUSPEND
 
 menu "Loadable module support"
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/lib/radix-tree.c linux-2.6.18-rc5-mm1-007_kernrclm/lib/radix-tree.c
--- linux-2.6.18-rc5-mm1-006_movefree/lib/radix-tree.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/lib/radix-tree.c	2006-09-04 18:45:50.000000000 +0100
@@ -93,7 +93,8 @@ radix_tree_node_alloc(struct radix_tree_
 	struct radix_tree_node *ret;
 	gfp_t gfp_mask = root_gfp_mask(root);
 
-	ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+	ret = kmem_cache_alloc(radix_tree_node_cachep,
+					set_rclmflags(gfp_mask, __GFP_KERNRCLM));
 	if (ret == NULL && !(gfp_mask & __GFP_WAIT)) {
 		struct radix_tree_preload *rtp;
 
@@ -137,7 +138,8 @@ int radix_tree_preload(gfp_t gfp_mask)
 	rtp = &__get_cpu_var(radix_tree_preloads);
 	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
 		preempt_enable();
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_alloc(radix_tree_node_cachep,
+					set_rclmflags(gfp_mask, __GFP_KERNRCLM));
 		if (node == NULL)
 			goto out;
 		preempt_disable();
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/mm/page_alloc.c linux-2.6.18-rc5-mm1-007_kernrclm/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-006_movefree/mm/page_alloc.c	2006-09-04 18:44:14.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/mm/page_alloc.c	2006-09-04 18:45:50.000000000 +0100
@@ -136,12 +136,16 @@ static unsigned long __initdata dma_rese
 #ifdef CONFIG_PAGEALLOC_ANTIFRAG
 static inline int get_pageblock_type(struct page *page)
 {
-	return (PageEasyRclm(page) != 0);
+	return ((PageEasyRclm(page) != 0) << 1) | (PageKernRclm(page) != 0);
 }
 
 static inline int gfpflags_to_rclmtype(unsigned long gfp_flags)
 {
-	return ((gfp_flags & __GFP_EASYRCLM) != 0);
+	gfp_t badflags = (__GFP_EASYRCLM | __GFP_KERNRCLM);
+	WARN_ON((gfp_flags & badflags) == badflags);
+
+	return (((gfp_flags & __GFP_EASYRCLM) != 0) << 1) |
+		((gfp_flags & __GFP_KERNRCLM) != 0);
 }
 #else
 static inline int get_pageblock_type(struct page *page)
@@ -431,6 +435,7 @@ static inline void __free_one_page(struc
 
 	page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
 	__SetPageEasyRclm(page);
+	__ClearPageKernRclm(page);
 
 	VM_BUG_ON(page_idx & (order_size - 1));
 	VM_BUG_ON(bad_range(zone, page));
@@ -707,6 +712,12 @@ int move_freepages_block(struct zone *zo
 	return move_freepages(zone, start_page, end_page, rclmtype);
 }
 
+static int fallbacks[RCLM_TYPES][RCLM_TYPES] = {
+	{ RCLM_NORCLM, RCLM_KERN,   RCLM_EASY  }, /* RCLM_NORCLM Fallback */
+	{ RCLM_KERN,   RCLM_NORCLM, RCLM_EASY  }, /* RCLM_KERN Fallback */
+	{ RCLM_EASY,   RCLM_KERN,   RCLM_NORCLM}  /* RCLM_EASY Fallback */
+};
+
 /* Remove an element from the buddy allocator from the fallback list */
 static struct page *__rmqueue_fallback(struct zone *zone, int order,
 							gfp_t gfp_flags)
@@ -715,30 +726,36 @@ static struct page *__rmqueue_fallback(s
 	int current_order;
 	struct page *page;
 	int start_rclmtype = gfpflags_to_rclmtype(gfp_flags);
-	int rclmtype = !start_rclmtype;
+	int rclmtype, i;
 
 	/* Find the largest possible block of pages in the other list */
 	for (current_order = MAX_ORDER-1; current_order >= order;
 						--current_order) {
-		area = &(zone->free_area[current_order]);
- 		if (list_empty(&area->free_list[rclmtype]))
- 			continue;
+		for (i = 0; i < RCLM_TYPES; i++) {
+			rclmtype = fallbacks[start_rclmtype][i];
 
-		page = list_entry(area->free_list[rclmtype].next,
-					struct page, lru);
-		area->nr_free--;
+			area = &(zone->free_area[current_order]);
+			if (list_empty(&area->free_list[rclmtype]))
+				continue;
 
-		/* Remove the page from the freelists */
-		list_del(&page->lru);
-		rmv_page_order(page);
-		zone->free_pages -= 1UL << order;
-		expand(zone, page, order, current_order, area, rclmtype);
+			page = list_entry(area->free_list[rclmtype].next,
+					struct page, lru);
+			area->nr_free--;
 
-		/* Move free pages between lists if stealing a large block */
-		if (current_order > MAX_ORDER / 2)
-			move_freepages_block(zone, page, start_rclmtype);
+			/* Remove the page from the freelists */
+			list_del(&page->lru);
+			rmv_page_order(page);
+			zone->free_pages -= 1UL << order;
+			expand(zone, page, order, current_order, area,
+							start_rclmtype);
+
+			/* Move free pages between lists for large blocks */
+			if (current_order >= MAX_ORDER / 2)
+				move_freepages_block(zone, page,
+							start_rclmtype);
 
-		return page;
+			return page;
+		}
 	}
 
 	return NULL;
@@ -794,9 +811,12 @@ static struct page *__rmqueue(struct zon
 	page = __rmqueue_fallback(zone, order, gfp_flags);
 
 got_page:
-	if (unlikely(rclmtype == RCLM_NORCLM) && page)
+	if (unlikely(rclmtype != RCLM_EASY) && page)
 		__ClearPageEasyRclm(page);
 
+	if (rclmtype == RCLM_KERN && page)
+		SetPageKernRclm(page);
+
 	return page;
 }
 
@@ -891,7 +911,7 @@ static void __drain_pages(unsigned int c
 }
 #endif /* CONFIG_DRAIN_PERCPU_PAGES */
 
-#ifdef CONFIG_PM
+#ifdef CONFIG_SOFTWARE_SUSPEND
 void mark_free_pages(struct zone *zone)
 {
 	unsigned long pfn, max_zone_pfn;
@@ -2052,7 +2072,7 @@ inline void setup_pageset(struct per_cpu
 		pcp->counts[rclmtype] = 0;
 		INIT_LIST_HEAD(&pcp->list[rclmtype]);
 	}
-	pcp->high = 6 * batch;
+	pcp->high = 3 * batch;
 	pcp->batch = max(1UL, 1 * batch);
 	INIT_LIST_HEAD(&pcp->list[RCLM_EASY]);
 
@@ -2061,7 +2081,7 @@ inline void setup_pageset(struct per_cpu
 		pcp->counts[rclmtype] = 0;
 		INIT_LIST_HEAD(&pcp->list[rclmtype]);
 	}
-	pcp->high = 2 * batch;
+	pcp->high = batch;
 	pcp->batch = max(1UL, batch/2);
 }
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/mm/shmem.c linux-2.6.18-rc5-mm1-007_kernrclm/mm/shmem.c
--- linux-2.6.18-rc5-mm1-006_movefree/mm/shmem.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/mm/shmem.c	2006-09-04 18:45:50.000000000 +0100
@@ -91,7 +91,8 @@ static inline struct page *shmem_dir_all
 	 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
 	 * might be reconsidered if it ever diverges from PAGE_SIZE.
 	 */
-	return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
+	return alloc_pages(set_rclmflags(gfp_mask, __GFP_KERNRCLM),
+						PAGE_CACHE_SHIFT-PAGE_SHIFT);
 }
 
 static inline void shmem_dir_free(struct page *page)
@@ -968,7 +969,8 @@ shmem_alloc_page(gfp_t gfp, struct shmem
 	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx);
 	pvma.vm_pgoff = idx;
 	pvma.vm_end = PAGE_SIZE;
-	page = alloc_page_vma(gfp | __GFP_ZERO, &pvma, 0);
+	page = alloc_page_vma(set_rclmflags(gfp | __GFP_ZERO, __GFP_KERNRCLM),
+								&pvma, 0);
 	mpol_free(pvma.vm_policy);
 	return page;
 }
@@ -988,7 +990,7 @@ shmem_swapin(struct shmem_inode_info *in
 static inline struct page *
 shmem_alloc_page(gfp_t gfp,struct shmem_inode_info *info, unsigned long idx)
 {
-	return alloc_page(gfp | __GFP_ZERO);
+	return alloc_page(set_rclmflags(gfp | __GFP_ZERO, __GFP_KERNRCLM));
 }
 #endif
 
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-006_movefree/net/core/skbuff.c linux-2.6.18-rc5-mm1-007_kernrclm/net/core/skbuff.c
--- linux-2.6.18-rc5-mm1-006_movefree/net/core/skbuff.c	2006-09-04 18:34:33.000000000 +0100
+++ linux-2.6.18-rc5-mm1-007_kernrclm/net/core/skbuff.c	2006-09-04 18:45:50.000000000 +0100
@@ -148,6 +148,7 @@ struct sk_buff *__alloc_skb(unsigned int
 	u8 *data;
 
 	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
+	gfp_mask |= __GFP_KERNRCLM;
 
 	/* Get the HEAD */
 	skb = kmem_cache_alloc(cache, gfp_mask & ~__GFP_DMA);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 8/8] [DEBUG] Add statistics
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (6 preceding siblings ...)
  2006-09-07 19:06 ` [PATCH 7/8] Introduce the RCLM_KERN allocation type Mel Gorman
@ 2006-09-07 19:06 ` Mel Gorman
  2006-09-08  0:58 ` [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Andrew Morton
  8 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-07 19:06 UTC (permalink / raw)
  To: linux-mm, linux-kernel; +Cc: Mel Gorman

This patch is strictly debug only. With static markers from SystemTap (what is
the current story with these?) or any other type of static marking of probe
points, this could be replaced by a relatively trivial script. Until such
static probes exist, this patch outputs some information to /proc/buddyinfo
that may help explain what went wrong if the anti-fragmentation strategy fails.
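
For reference, the extra output appended to /proc/buddyinfo looks like
the following (the counter values here are purely illustrative):

Fallback counts
KernNoRclm:       12
KernRclm:          3
EasyRclm:          0

Split counts
KernNoRclm:      240
KernRclm:         31
EasyRclm:        187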


Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---

 page_alloc.c |   20 ++++++++++++++++++++
 vmstat.c     |   16 ++++++++++++++++
 2 files changed, 36 insertions(+)

diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-007_kernrclm/mm/page_alloc.c linux-2.6.18-rc5-mm1-009_stats/mm/page_alloc.c
--- linux-2.6.18-rc5-mm1-007_kernrclm/mm/page_alloc.c	2006-09-04 18:45:50.000000000 +0100
+++ linux-2.6.18-rc5-mm1-009_stats/mm/page_alloc.c	2006-09-04 18:47:33.000000000 +0100
@@ -56,6 +56,10 @@ unsigned long totalram_pages __read_most
 unsigned long totalreserve_pages __read_mostly;
 long nr_swap_pages;
 int percpu_pagelist_fraction;
+int split_count[RCLM_TYPES];
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+int fallback_counts[RCLM_TYPES];
+#endif
 
 static void __free_pages_ok(struct page *page, unsigned int order);
 
@@ -742,6 +746,12 @@ static struct page *__rmqueue_fallback(s
 					struct page, lru);
 			area->nr_free--;
 
+			/* Account for a MAX_ORDER block being split */
+			if (current_order == MAX_ORDER - 1 &&
+					order < MAX_ORDER - 1) {
+				split_count[start_rclmtype]++;
+			}
+
 			/* Remove the page from the freelists */
 			list_del(&page->lru);
 			rmv_page_order(page);
@@ -754,6 +764,12 @@ static struct page *__rmqueue_fallback(s
 				move_freepages_block(zone, page,
 							start_rclmtype);
 
+			/* Account for fallbacks */
+			if (order < MAX_ORDER - 1 &&
+					current_order != MAX_ORDER - 1) {
+				fallback_counts[start_rclmtype]++;
+			}
+
 			return page;
 		}
 	}
@@ -804,6 +820,10 @@ static struct page *__rmqueue(struct zon
 		rmv_page_order(page);
 		area->nr_free--;
 		zone->free_pages -= 1UL << order;
+
+		if (current_order == MAX_ORDER - 1 && order < MAX_ORDER - 1)
+			split_count[rclmtype]++;
+
 		expand(zone, page, order, current_order, area, rclmtype);
 		goto got_page;
 	}
diff -rup -X /usr/src/patchset-0.6/bin//dontdiff linux-2.6.18-rc5-mm1-007_kernrclm/mm/vmstat.c linux-2.6.18-rc5-mm1-009_stats/mm/vmstat.c
--- linux-2.6.18-rc5-mm1-007_kernrclm/mm/vmstat.c	2006-09-04 18:39:39.000000000 +0100
+++ linux-2.6.18-rc5-mm1-009_stats/mm/vmstat.c	2006-09-04 18:47:33.000000000 +0100
@@ -13,6 +13,11 @@
 #include <linux/module.h>
 #include <linux/cpu.h>
 
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+extern int split_count[RCLM_TYPES];
+extern int fallback_counts[RCLM_TYPES];
+#endif
+
 void __get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free, struct pglist_data *pgdat)
 {
@@ -427,6 +432,17 @@ static int frag_show(struct seq_file *m,
 		spin_unlock_irqrestore(&zone->lock, flags);
 		seq_putc(m, '\n');
 	}
+#ifdef CONFIG_PAGEALLOC_ANTIFRAG
+	seq_printf(m, "Fallback counts\n");
+	seq_printf(m, "KernNoRclm: %8d\n", fallback_counts[RCLM_NORCLM]);
+	seq_printf(m, "KernRclm:   %8d\n", fallback_counts[RCLM_KERN]);
+	seq_printf(m, "EasyRclm:   %8d\n", fallback_counts[RCLM_EASY]);
+
+	seq_printf(m, "\nSplit counts\n");
+	seq_printf(m, "KernNoRclm: %8d\n", split_count[RCLM_NORCLM]);
+	seq_printf(m, "KernRclm:   %8d\n", split_count[RCLM_KERN]);
+	seq_printf(m, "EasyRclm:   %8d\n", split_count[RCLM_EASY]);
+#endif /* CONFIG_PAGEALLOC_ANTIFRAG */
 	return 0;
 }
 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
                   ` (7 preceding siblings ...)
  2006-09-07 19:06 ` [PATCH 8/8] [DEBUG] Add statistics Mel Gorman
@ 2006-09-08  0:58 ` Andrew Morton
  2006-09-08  8:30   ` Peter Zijlstra
  2006-09-08  8:36   ` Mel Gorman
  8 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2006-09-08  0:58 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
Mel Gorman <mel@csn.ul.ie> wrote:

> When a page is allocated, the page-flags
> are updated with a value indicating it's type of reclaimability so that it
> is placed on the correct list on free.

We're getting awful tight on page-flags.

Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
there to work out whether a particular page is within a MAX_ORDER clump of
easy-reclaimable pages?
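
Something like this rough sketch, say - zone->rclm_bitmap and the helper
names are made up here, with one bit per MAX_ORDER_NR_PAGES block:

/*
 * Hypothetical per-zone bitmap: one bit per MAX_ORDER_NR_PAGES block,
 * set when the block is reserved for easy-reclaimable allocations.
 * zone->rclm_bitmap would need to be allocated at zone init time.
 */
static inline unsigned long rclm_bitidx(struct zone *zone, struct page *page)
{
	return (page_to_pfn(page) - zone->zone_start_pfn) >> (MAX_ORDER - 1);
}

static inline int pageblock_easyrclm(struct zone *zone, struct page *page)
{
	return test_bit(rclm_bitidx(zone, page), zone->rclm_bitmap);
}

static inline void set_pageblock_easyrclm(struct zone *zone, struct page *page)
{
	set_bit(rclm_bitidx(zone, page), zone->rclm_bitmap);
}

The free path could then test the bit instead of a page flag.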


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/8] Split the free lists into kernel and user parts
  2006-09-07 19:04 ` [PATCH 2/8] Split the free lists into kernel and user parts Mel Gorman
@ 2006-09-08  7:54   ` Peter Zijlstra
  2006-09-08  9:20     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2006-09-08  7:54 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, linux-kernel

Hi Mel,

Looking good, some small nits follow.

On Thu, 2006-09-07 at 20:04 +0100, Mel Gorman wrote:

> +#define for_each_rclmtype_order(type, order) \
> +	for (order = 0; order < MAX_ORDER; order++) \
> +		for (type = 0; type < RCLM_TYPES; type++)

It seems odd to me that you have the for loops in reverse order of the
arguments.

> +static inline int get_pageblock_type(struct page *page)
> +{
> +	return (PageEasyRclm(page) != 0);
> +}

I find the naming a little odd; I would have expected something like:
get_page_blocktype() or thereabouts, since you're getting a page
attribute.

> +static inline int gfpflags_to_rclmtype(unsigned long gfp_flags)
> +{
> +	return ((gfp_flags & __GFP_EASYRCLM) != 0);
> +}

gfp_t argument?
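
i.e. presumably just the same function taking gfp_t:

static inline int gfpflags_to_rclmtype(gfp_t gfp_flags)
{
	return ((gfp_flags & __GFP_EASYRCLM) != 0);
}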



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-08  0:58 ` [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Andrew Morton
@ 2006-09-08  8:30   ` Peter Zijlstra
  2006-09-08  9:24     ` Mel Gorman
  2006-09-08  8:36   ` Mel Gorman
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2006-09-08  8:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mel Gorman, linux-mm, linux-kernel

On Thu, 2006-09-07 at 17:58 -0700, Andrew Morton wrote:
> On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > When a page is allocated, the page-flags
> > are updated with a value indicating it's type of reclaimability so that it
> > is placed on the correct list on free.
> 
> We're getting awful tight on page-flags.
> 
> Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
> of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
> there to work out whether a particular page is within a MAX_ORDER clump of
> easy-reclaimable pages?

That would not actually work; the fallback allocation path can move
blocks smaller than MAX_ORDER to another reclaim type.

But yeah, page flags are getting tight; perhaps Rafael can use his
recently introduced bitmaps to rid us of the swsusp flags?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-08  0:58 ` [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Andrew Morton
  2006-09-08  8:30   ` Peter Zijlstra
@ 2006-09-08  8:36   ` Mel Gorman
  2006-09-08 13:06     ` Peter Zijlstra
  1 sibling, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2006-09-08  8:36 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel

On Thu, 7 Sep 2006, Andrew Morton wrote:

> On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
> Mel Gorman <mel@csn.ul.ie> wrote:
>
>> When a page is allocated, the page-flags
>> are updated with a value indicating it's type of reclaimability so that it
>> is placed on the correct list on free.
>
> We're getting awful tight on page-flags.
>

Yeah, I know :(

> Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
> of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
> there to work out whether a particular page is within a MAX_ORDER clump of
> easy-reclaimable pages?
>

An early version of the patches created such a bitmap and it was heavily 
resisted for two reasons. It put more pressure on the cache and it needed 
to be resized during hot-add and hot-remove. It was the latter issue 
people had more problems with. However, I can reimplement it if people 
want to take a look. As I see it currently, there are five choices that 
could be taken to avoid using an additional pageflag

1. Re-use existing page flags. This is what I currently do in a later
    patch for the software suspend flags
    pros: Straight-forward implementation, appears to use no additional flags
    cons: When swsusp stops using the flags, anti-frag takes them right back
          Makes anti-frag mutually exclusive with swsusp

2. Create a per-zone bitmap for every MAX_ORDER block
    pros: Straight-forward implementation initially
    cons: Needs resizing during hotadd which could get complicated
          Bit more cache pressure

3. Use the low two bits of page->lru
    pros: Uses existing struct page field
    cons: It's a bit funky looking

4. Use the page->flags of the struct page backing the pages used
    for the memmap.
    pros: Similar to the bitmap idea except with less hotadd problems
    cons: Bit more cache pressure

5. Add an additional field page->hintsflags used for non-critical flags.
    There are patches out there like guest page hinting that want to
    consume flags but not for any vital purpose and usually for machines
    that have ample amounts of memory. For these features, add an
    additional page->hintsflags
    pros: Straight-forward to implement
    cons: Increases struct page size for some kernel features.

I am leaning towards option 3 because it uses no additional memory but I'm 
not sure how people feel about using pointer magic like this.
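
To give a flavour of option 3, a rough and untested sketch is below. The
mask and helper names are invented, and any list operation that touches
page->lru while the bits are set would have to mask them off first:

/*
 * Option 3 sketch (illustrative names only): keep the reclaim type in
 * the low two bits of page->lru.next.  list_head pointers are word
 * aligned so the low two bits are otherwise unused, but list_add() and
 * list_del() on pages carrying these bits would need masking variants.
 */
#define PAGE_RCLMTYPE_MASK	0x3UL

static inline void set_page_rclmtype(struct page *page, int rclmtype)
{
	unsigned long next = (unsigned long)page->lru.next;

	page->lru.next = (struct list_head *)
		((next & ~PAGE_RCLMTYPE_MASK) | (unsigned long)rclmtype);
}

static inline int get_page_rclmtype(struct page *page)
{
	return (unsigned long)page->lru.next & PAGE_RCLMTYPE_MASK;
}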

Any opinions?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/8] Split the free lists into kernel and user parts
  2006-09-08  7:54   ` Peter Zijlstra
@ 2006-09-08  9:20     ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-08  9:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-mm, linux-kernel

On Fri, 8 Sep 2006, Peter Zijlstra wrote:

> Hi Mel,
>
> Looking good, some small nits follow.
>
> On Thu, 2006-09-07 at 20:04 +0100, Mel Gorman wrote:
>
>> +#define for_each_rclmtype_order(type, order) \
>> +	for (order = 0; order < MAX_ORDER; order++) \
>> +		for (type = 0; type < RCLM_TYPES; type++)
>
> It seems odd to me that you have the for loops in reverse order of the
> arguments.
>

I'll fix that.

>> +static inline int get_pageblock_type(struct page *page)
>> +{
>> +	return (PageEasyRclm(page) != 0);
>> +}
>
> I find the naming a little odd; I would have expected something like:
> get_page_blocktype() or thereabouts, since you're getting a page
> attribute.
>

This is a throwback to an early version where I used a bitmap with one 
bit per MAX_ORDER_NR_PAGES block of pages. Many pages in a block shared 
one bit - hence get_pageblock_type(). The name is now stupid. I'll 
fix it.

>> +static inline int gfpflags_to_rclmtype(unsigned long gfp_flags)
>> +{
>> +	return ((gfp_flags & __GFP_EASYRCLM) != 0);
>> +}
>
> gfp_t argument?
>

doh, yes, it should be gfp_t

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-08  8:30   ` Peter Zijlstra
@ 2006-09-08  9:24     ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-08  9:24 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, 8 Sep 2006, Peter Zijlstra wrote:

> On Thu, 2006-09-07 at 17:58 -0700, Andrew Morton wrote:
>> On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
>> Mel Gorman <mel@csn.ul.ie> wrote:
>>
>>> When a page is allocated, the page-flags
>>> are updated with a value indicating it's type of reclaimability so that it
>>> is placed on the correct list on free.
>>
>> We're getting awful tight on page-flags.
>>
>> Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
>> of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
>> there to work out whether a particular page is within a MAX_ORDER clump of
>> easy-reclaimable pages?
>
> That would not actually work; the fallback allocation path can move
> blocks smaller than MAX_ORDER to another reclaim type.
>

Believe it or not, it may be desirable to have a whole block represented 
by one or two bits. If a fallback allocation occurs and I move blocks 
between lists, I want pages that are freed later to go to the new list as 
well. Currently that doesn't happen because the flags are set per-page, but 
it used to happen in an early version of anti-frag.
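
To illustrate, if the type were tracked per block the free path could ask
the block rather than the page. A rough sketch, where
get_pageblock_rclmtype() is a hypothetical per-block lookup:

/*
 * Rough sketch only: free a page to whatever list currently owns its
 * block.  get_pageblock_rclmtype() is hypothetical and would be backed
 * by the one-or-two-bits-per-block state, however it ends up stored.
 */
static inline void free_page_to_blocktype(struct zone *zone,
					struct page *page, int order)
{
	int rclmtype = get_pageblock_rclmtype(zone, page);
	struct free_area *area = &zone->free_area[order];

	list_add(&page->lru, &area->free_list[rclmtype]);
	area->nr_free++;
}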

> But yeah, page flags are getting tight; perhaps Rafael can use his
> recently introduced bitmaps to rid us of the swsusp flags?
>
>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-08  8:36   ` Mel Gorman
@ 2006-09-08 13:06     ` Peter Zijlstra
  2006-09-08 13:16       ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2006-09-08 13:06 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, 2006-09-08 at 09:36 +0100, Mel Gorman wrote:
> On Thu, 7 Sep 2006, Andrew Morton wrote:
> 
> > On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> >> When a page is allocated, the page-flags
> >> are updated with a value indicating it's type of reclaimability so that it
> >> is placed on the correct list on free.
> >
> > We're getting awful tight on page-flags.
> >
> 
> Yeah, I know :(
> 
> > Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
> > of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
> > there to work out whether a particular page is within a MAX_ORDER clump of
> > easy-reclaimable pages?
> >
> 
> An early version of the patches created such a bitmap and it was heavily 
> resisted for two reasons. It put more pressure on the cache and it needed 
> to be resized during hot-add and hot-remove. It was the latter issue 
> people had more problems with. However, I can reimplement it if people 
> want to take a look. As I see it currently, there are five choices that 
> could be taken to avoid using an additional pageflag
> 
> 1. Re-use existing page flags. This is what I currently do in a later
>     patch for the software suspend flags
>     pros: Straight-forward implementation, appears to use no additional flags
>     cons: When swsusp stops using the flags, anti-frag takes them right back
>           Makes anti-frag mutually exclusive with swsusp
> 
> 2. Create a per-zone bitmap for every MAX_ORDER block
>     pros: Straight-forward implementation initially
>     cons: Needs resizing during hotadd which could get complicated
>           Bit more cache pressure
> 
> 3. Use the low two bits of page->lru
>     pros: Uses existing struct page field
>     cons: It's a bit funky looking
> 
> 4. Use the page->flags of the struct page backing the pages used
>     for the memmap.
>     pros: Similar to the bitmap idea except with less hotadd problems
>     cons: Bit more cache pressure
> 
> 5. Add an additional field page->hintsflags used for non-critical flags.
>     There are patches out there like guest page hinting that want to
>     consume flags but not for any vital purpose and usually for machines
>     that have ample amounts of memory. For these features, add an
>     additional page->hintsflags
>     pros: Straight-forward to implement
>     cons: Increases struct page size for some kernel features.
> 
> I am leaning towards option 3 because it uses no additional memory but I'm 
> not sure how people feel about using pointer magic like this.
> 
> Any opinions?

If, as you stated in a previous mail, you'd like to have flags per
MAX_ORDER block, you'd already have to suffer the extra cache pressure.
In that case I vote for 4.

Otherwise 3 sounds doable; we already hide PAGE_MAPPING_ANON in a
pointer, so hiding flags is not new to struct page. It's just a question
of how good the implementation will look; I hope you'll not have to
visit all the list ops.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/8] Avoiding fragmentation with subzone groupings v25
  2006-09-08 13:06     ` Peter Zijlstra
@ 2006-09-08 13:16       ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2006-09-08 13:16 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, 8 Sep 2006, Peter Zijlstra wrote:

> On Fri, 2006-09-08 at 09:36 +0100, Mel Gorman wrote:
>> On Thu, 7 Sep 2006, Andrew Morton wrote:
>>
>>> On Thu,  7 Sep 2006 20:03:42 +0100 (IST)
>>> Mel Gorman <mel@csn.ul.ie> wrote:
>>>
>>>> When a page is allocated, the page-flags
>>>> are updated with a value indicating it's type of reclaimability so that it
>>>> is placed on the correct list on free.
>>>
>>> We're getting awful tight on page-flags.
>>>
>>
>> Yeah, I know :(
>>
>>> Would it be possible to avoid adding the flag?  Say, have a per-zone bitmap
>>> of size (zone->present_pages/(1<<MAX_ORDER)) bits, then do a lookup in
>>> there to work out whether a particular page is within a MAX_ORDER clump of
>>> easy-reclaimable pages?
>>>
>>
>> An early version of the patches created such a bitmap and it was heavily
>> resisted for two reasons. It put more pressure on the cache and it needed
>> to be resized during hot-add and hot-remove. It was the latter issue
>> people had more problems with. However, I can reimplement it if people
>> want to take a look. As I see it currently, there are five choices that
>> could be taken to avoid using an additional pageflag
>>
>> 1. Re-use existing page flags. This is what I currently do in a later
>>     patch for the software suspend flags
>>     pros: Straight-forward implementation, appears to use no additional flags
>>     cons: When swsusp stops using the flags, anti-frag takes them right back
>>           Makes anti-frag mutually exclusive with swsusp
>>
>> 2. Create a per-zone bitmap for every MAX_ORDER block
>>     pros: Straight-forward implementation initially
>>     cons: Needs resizing during hotadd which could get complicated
>>           Bit more cache pressure
>>
>> 3. Use the low two bits of page->lru
>>     pros: Uses existing struct page field
>>     cons: It's a bit funky looking
>>
>> 4. Use the page->flags of the struct page backing the pages used
>>     for the memmap.
>>     pros: Similar to the bitmap idea except with less hotadd problems
>>     cons: Bit more cache pressure
>>
>> 5. Add an additional field page->hintsflags used for non-critical flags.
>>     There are patches out there like guest page hinting that want to
>>     consume flags but not for any vital purpose and usually for machines
>>     that have ample amounts of memory. For these features, add an
>>     additional page->hintsflags
>>     pros: Straight-forward to implement
>>     cons: Increases struct page size for some kernel features.
>>
>> I am leaning towards option 3 because it uses no additional memory but I'm
>> not sure how people feel about using pointer magic like this.
>>
>> Any opinions?
>
> If, as you stated in a previous mail, you'd like to have flags per
> MAX_ORDER block, you'd already have to suffer the extra cache pressure.
> In that case I vote for 4.
>

Originally, I wanted flags per MAX_ORDER block but I no longer have data 
on whether this is a good idea or not. It could turn out that we steal 
back and forth a lot when pageblock flags are used.

> Otherwise 3 sounds doable; we already hide PAGE_MAPPING_ANON in a
> pointer, so hiding flags is not new to struct page. It's just a question
> of how good the implementation will look; I hope you'll not have to
> visit all the list ops.
>

One way to find out for sure! I reckon I'll go off and implement options 3 
and 4 as add-on patches that avoid the use of page->flags and see what 
they look like. As you said, pointer magic in struct page is not new.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2006-09-08 13:16 UTC | newest]

Thread overview: 17+ messages
2006-09-07 19:03 [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Mel Gorman
2006-09-07 19:04 ` [PATCH 1/8] Add __GFP_EASYRCLM flag and update callers Mel Gorman
2006-09-07 19:04 ` [PATCH 2/8] Split the free lists into kernel and user parts Mel Gorman
2006-09-08  7:54   ` Peter Zijlstra
2006-09-08  9:20     ` Mel Gorman
2006-09-07 19:04 ` [PATCH 3/8] Split the per-cpu " Mel Gorman
2006-09-07 19:05 ` [PATCH 4/8] Add a configure option for anti-fragmentation Mel Gorman
2006-09-07 19:05 ` [PATCH 5/8] Drain per-cpu lists when high-order allocations fail Mel Gorman
2006-09-07 19:05 ` [PATCH 6/8] Move free pages between lists on steal Mel Gorman
2006-09-07 19:06 ` [PATCH 7/8] Introduce the RCLM_KERN allocation type Mel Gorman
2006-09-07 19:06 ` [PATCH 8/8] [DEBUG] Add statistics Mel Gorman
2006-09-08  0:58 ` [PATCH 0/8] Avoiding fragmentation with subzone groupings v25 Andrew Morton
2006-09-08  8:30   ` Peter Zijlstra
2006-09-08  9:24     ` Mel Gorman
2006-09-08  8:36   ` Mel Gorman
2006-09-08 13:06     ` Peter Zijlstra
2006-09-08 13:16       ` Mel Gorman
