* [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-03-30  9:14 [PATCH 0/14] Memory Compaction v6 Mel Gorman
@ 2010-03-30  9:14 ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-03-30  9:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
The unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest, but not necessarily, so the index is
exported on a per-order and per-zone basis via /proc/unusable_index.
The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
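
As an illustration of the arithmetic only, the index can be sketched in
userspace from the per-order free block counts of one zone (one row of
/proc/buddyinfo). This is a minimal sketch, not the code in this patch:
the helper name unusable_index_x1000 and the NR_ORDERS constant are
assumptions made for the example, and NR_ORDERS is assumed to equal 11
to match the eleven columns in the sample output.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 11 orders, matching the sample output */
#define NR_ORDERS 11

/*
 * nr_free[i] is the number of free blocks of order i in a single zone.
 * Returns the unusable free space index scaled by 1000 (0..1000), so
 * 333 corresponds to 0.333.
 */
static unsigned int unusable_index_x1000(const unsigned long nr_free[NR_ORDERS],
					 unsigned int order)
{
	uint64_t free_pages = 0;
	uint64_t usable_pages = 0;
	unsigned int i;

	assert(order < NR_ORDERS);

	for (i = 0; i < NR_ORDERS; i++) {
		uint64_t pages = (uint64_t)nr_free[i] << i;

		/* Every free block contributes its base pages */
		free_pages += pages;

		/* Only blocks of at least the target order are usable */
		if (i >= order)
			usable_pages += pages;
	}

	/* No free memory is interpreted as all free memory being unusable */
	if (free_pages == 0)
		return 1000;

	return (unsigned int)((free_pages - usable_pages) * 1000 / free_pages);
}
```

A caller would print the result in the same style as the proc file with
something like printf("%u.%03u", v / 1000, v % 1000).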
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 74d2605..e87775a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,6 +453,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -651,6 +652,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
+> cat /proc/unusable_index
+Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7f760cb..2fb4986 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = 0;
+	info->free_blocks_suitable = 0;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long blocks;
+
+		/* Count number of free blocks */
+		blocks = zone->free_area[order].nr_free;
+		info->free_blocks_total += blocks;
+
+		/* Count free base pages */
+		info->free_pages += blocks << order;
+
+		/* Count the suitable free blocks */
+		if (order >= suitable_order)
+			info->free_blocks_suitable += blocks <<
+						(order - suitable_order);
+	}
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 0/14] Memory Compaction v7
@ 2010-04-02 16:02 Mel Gorman
  2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
                   ` (14 more replies)
  0 siblings, 15 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
The only change is relatively minor and is around the migration of unmapped
PageSwapCache pages. Specifically, it's not safe to access anon_vma for
these pages when remapping after migration completes so the last patch
makes sure we don't.
Are there any further obstacles to merging?
Changelog since V6
  o Avoid accessing anon_vma when migrating unmapped PageSwapCache pages
Changelog since V5
  o Rebase to mmotm-2010-03-24-14-48
  o Add more reviewed-by's
  o Correct one spelling in vmstat.c and some leader clarifications
  o Split the LRU isolation modes into a separate path
  o Correct a NID change
  o Call migrate_prep less frequently
  o Remove unnecessary inlining
  o Do not interfere with memory hot-remove
  o Do not compact for orders <= PAGE_ALLOC_COSTLY_ORDER
  o page_mapped instead of page_mapcount and allow swapcache to migrate
  o Avoid too many pages being isolated for migration
  o Handle PageSwapCache pages during migration
Changelog since V4
  o Remove unnecessary check for PageLRU and PageUnevictable
  o Fix isolated accounting
  o Close race window between page_mapcount and rcu_read_lock
  o Added a lot more Reviewed-by tags
Changelog since V3
  o Document sysfs entries (subsequently, merged independently)
  o COMPACTION should depend on MMU
  o Comment updates
  o Ensure proc/sysfs triggering of compaction fully completes
  o Rename anon_vma refcount to external_refcount
  o Rebase to mmotm on top of 2.6.34-rc1
Changelog since V2
  o Move unusable and fragmentation indices to separate proc files
  o Express indices as being between 0 and 1
  o Update copyright notice for compaction.c
  o Avoid infinite loop when split free page fails
  o Init compact_resume at least once (impacted x86 testing)
  o Fewer pages are isolated during compaction.
  o LRU lists are no longer rotated when page is busy
  o NR_ISOLATED_* is updated to avoid isolating too many pages
  o Update zone LRU stats correctly when isolating pages
  o Reference count anon_vma instead of insufficient locking with
    use-after-free races in memory compaction
  o Watch for unmapped anon pages during migration
  o Remove unnecessary parameters on a few functions
  o Add Reviewed-by's. Note that I didn't add the Acks and Reviewed
    for the proc patches as they have been split out into separate
    files and I don't know if the Acks are still valid.
Changelog since V1
  o Update help blurb on CONFIG_MIGRATION
  o Max unusable free space index is 100, not 1000
  o Move blockpfn forward properly during compaction
  o Cleanup CONFIG_COMPACTION vs CONFIG_MIGRATION confusion
  o Permissions on /proc and /sys files should be 0200
  o Reduce verbosity
  o Compact all nodes when triggered via /proc
  o Add per-node compaction via sysfs
  o Move defer_compaction out-of-line
  o Fix lock oddities in rmap_walk_anon
  o Add documentation
This patchset is a memory compaction mechanism that reduces external
memory fragmentation by moving GFP_MOVABLE pages into a smaller number of
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was slub
"defragmentation" (really a form of targeted reclaim). Hence, this is called
"compaction" to distinguish it from other forms of defragmentation.
In this implementation, a full compaction run involves two scanners operating
within a zone - a migration and a free scanner. The migration scanner
starts at the beginning of a zone and finds all movable pages within one
pageblock_nr_pages-sized area and isolates them on a migratepages list. The
free scanner begins at the end of the zone and searches on a per-area
basis for enough free pages to migrate all the pages on the migratepages
list. As each area is respectively migrated or exhausted of free pages,
the scanners are advanced one area.  A compaction run completes within a
zone when the two scanners meet.
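
The two-scanner idea can be sketched with a toy model. This is not the
code in mm/compaction.c: it treats every page individually instead of
working a pageblock at a time, it has no migratepages list, and the
names page_state and compact_zone_toy are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page states for the toy model */
enum page_state { PAGE_FREE, PAGE_MOVABLE, PAGE_UNMOVABLE };

/*
 * Toy compaction of a "zone" with one array slot per page. The
 * migration scanner walks up from the start looking for movable pages,
 * the free scanner walks down from the end looking for free pages, and
 * the run finishes when the two scanners meet. Returns the number of
 * pages moved.
 */
static size_t compact_zone_toy(enum page_state *zone, size_t nr_pages)
{
	size_t migrate_pos = 0;
	size_t free_pos;
	size_t moved = 0;

	assert(zone != NULL || nr_pages == 0);
	if (nr_pages == 0)
		return 0;

	free_pos = nr_pages - 1;

	while (migrate_pos < free_pos) {
		/* Advance the migration scanner past anything not movable */
		if (zone[migrate_pos] != PAGE_MOVABLE) {
			migrate_pos++;
			continue;
		}

		/* Move the free scanner back until it finds a free page */
		if (zone[free_pos] != PAGE_FREE) {
			free_pos--;
			continue;
		}

		/* "Migrate" the movable page into the free slot */
		zone[free_pos] = PAGE_MOVABLE;
		zone[migrate_pos] = PAGE_FREE;
		moved++;
		migrate_pos++;
		free_pos--;
	}

	return moved;
}
```

After a run, movable pages are packed towards the end of the array and
the free pages towards the start, with unmovable pages left where they
were.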
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be triggered
explicitly by writing any value to /proc/sys/vm/compact_memory and compacting
all of memory. It can be triggered on a per-node basis by writing any
value to /sys/devices/system/node/nodeN/compact where N is the node ID to
be compacted. When a process fails to allocate a high-order page, it may
compact memory in an attempt to satisfy the allocation instead of entering
direct reclaim. Explicit compaction does not finish until the two scanners
meet and direct compaction ends if a suitable page becomes available that
would meet watermarks.
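
For the two explicit triggers, a userspace helper might look like the
sketch below. The two paths are the ones named above; the helper names
write_trigger and compact_node are assumptions for the example, and the
caller is assumed to have permission to write to the files.

```c
#include <assert.h>
#include <stdio.h>

/* Path named in the cover letter for compacting all of memory */
#define COMPACT_ALL_PATH "/proc/sys/vm/compact_memory"

/*
 * Write an arbitrary value to a trigger file. Any value is expected to
 * start compaction. Returns 0 on success and -1 on failure.
 */
static int write_trigger(const char *path)
{
	FILE *f;
	int ret = 0;

	assert(path != NULL);

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs("1\n", f) == EOF)
		ret = -1;

	/* The write may only be flushed, and fail, at close time */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}

/* Compact a single node through its sysfs file */
static int compact_node(int nid)
{
	char path[64];

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/compact", nid);
	return write_trigger(path);
}
```

Calling write_trigger(COMPACT_ALL_PATH) would request compaction of all
nodes, while compact_node(0) would request it for node 0 only.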
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
	patch, it's possible to use anon_vma after free if the caller is
	not holding a VMA or mmap_sem for the pages in question. While
	there should be no existing user that causes this problem,
	it's a requirement for memory compaction to be stable. The patch
	is at the start of the series for bisection reasons.
Patch 2 skips over anon pages during migration that are no longer mapped
	because there still appeared to be a small window between when
	a page was isolated and migration started during which anon_vma
	could disappear.
Patch 3 merges the KSM and migrate counts. It could be merged with patch 1
	but would be slightly harder to review.
Patch 4 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 5 exports an "unusable free space index" via /proc/unusable_index. It's
	a measure of external fragmentation that takes the size of the
	allocation request into account. It can also be calculated from
	userspace so can be dropped if requested
Patch 6 exports a "fragmentation index" which only has meaning when an
	allocation request fails. It determines if an allocation failure
	would be due to a lack of memory or external fragmentation.
Patch 7 moves the definition for LRU isolation modes for use by compaction
Patch 8 is the compaction mechanism although it's unreachable at this point
Patch 9 adds a means of compacting all of memory with a proc trigger
Patch 10 adds a means of compacting a specific node with a sysfs trigger
Patch 11 adds "direct compaction" before "direct reclaim" if it is
	determined there is a good chance of success.
Patch 12 adds a sysctl that allows tuning of the threshold at which the
	kernel will compact or direct reclaim
Patch 13 temporarily disables compaction if an allocation failure occurs
	after compaction.
Patch 14 allows the migration of PageSwapCache pages. This patch was not
	as straight-forward as rmap_walk and migration needed extra
	smarts to avoid problems under heavy memory pressure. It's possible
	that memory hot-remove could be affected.
Testing of compaction was in three stages.  For the test, debugging, preempt,
the sleep watchdog and lockdep were all enabled but nothing nasty popped
out. min_free_kbytes was tuned as recommended by hugeadm to help fragmentation
avoidance and high-order allocations. It was tested on X86, X86-64 and PPC64.
The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
	a) hugeadm --create-global-mounts
	b) hugeadm --pool-pages-max DEFAULT:8G
	c) hugeadm --set-recommended-min_free_kbytes
	d) hugeadm --set-recommended-shmmax
	The min_free_kbytes here is important. Anti-fragmentation works best
	when pageblocks don't mix. hugeadm knows how to calculate a value that
	will significantly reduce the worst of external-fragmentation-related
	events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
	a) Start updatedb
	b) Create in parallel X files of pagesize*128 in size. Wait
	   until files are created. By parallel, I mean that 4096 instances
	   of dd were launched, one after the other using &. The crude
	   objective being to mix filesystem metadata allocations with
	   the buffer cache.
	c) Delete every second file so that pageblocks are likely to
	   have holes
	d) kill updatedb if it's still running
	At this point, the system is quiet, memory is full but it's full with
	clean filesystem metadata and clean buffer cache that is unmapped.
	This is readily migrated or discarded so you'd expect lumpy reclaim
	to have no significant advantage over compaction but this is at
	the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
	   Measure how long it took, how successful it was, how many
	   direct reclaims took place and how many compactions. Note
	   the compaction figures might not fully add up as compactions
	   can take place for orders other than the hugepage size
X86                             vanilla         compaction
Final page count:                   915                916 (attempted 1002)
Total pages reclaimed:            88872               2942
X86-64                          vanilla         compaction
Final page count:                   901                902 (attempted 1002)
Total pages reclaimed:           137573              50655
PPC64                           vanilla         compaction
Final page count:                    89                 92 (attempted 110)
Total pages reclaimed:            84727               9345
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that far fewer pages
were reclaimed in all cases reducing the amount of IO required to satisfy
a huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel compiles
are started to fill memory with a pressured mix of unmovable and movable
allocations. During this, an attempt is made to allocate 90% of memory
as huge pages - one at a time with small delays between attempts to avoid
flooding the IO queue.
                                             vanilla   compaction
Percentage of request allocated X86               96           99
Percentage of request allocated X86-64            96           98
Percentage of request allocated PPC64             51           79
Success rates are a little higher, particularly on PPC64 with the larger
huge pages. What is most interesting is the latency when allocating huge
pages.
X86:    http://www.csn.ul.ie/~mel/postings/compaction-20100402/highalloc-interlatency-arnold-compaction-stress-v7r12-mean.ps
X86_64: http://www.csn.ul.ie/~mel/postings/compaction-20100402/highalloc-interlatency-hydra-compaction-stress-v7r12-mean.ps
PPC64: http://www.csn.ul.ie/~mel/postings/compaction-20100402/highalloc-interlatency-powyah-compaction-stress-v7r12-mean.ps
X86 latency is reduced the least but it depends heavily on the HIGHMEM
zone to allocate many of its huge pages which is a relatively straight-forward
job. X86-64 and PPC64 both show reductions in average time taken to allocate
huge pages. It is not reduced to zero because the system is under enough
memory pressure that reclaim is still required for some of the allocations.
Also enlightening are the "stddev" files in the same directory. Each
of them shows that the variance between allocation times is heavily reduced.
 Documentation/ABI/testing/sysfs-devices-node |    7 +
 Documentation/filesystems/proc.txt           |   25 +-
 Documentation/sysctl/vm.txt                  |   29 ++-
 drivers/base/node.c                          |    3 +
 include/linux/compaction.h                   |   81 ++++
 include/linux/mm.h                           |    1 +
 include/linux/mmzone.h                       |    7 +
 include/linux/rmap.h                         |   27 +-
 include/linux/swap.h                         |    6 +
 include/linux/vmstat.h                       |    2 +
 kernel/sysctl.c                              |   25 ++
 mm/Kconfig                                   |   18 +-
 mm/Makefile                                  |    1 +
 mm/compaction.c                              |  589 ++++++++++++++++++++++++++
 mm/ksm.c                                     |    4 +-
 mm/migrate.c                                 |   48 ++-
 mm/page_alloc.c                              |   73 ++++
 mm/rmap.c                                    |   10 +-
 mm/vmscan.c                                  |    5 -
 mm/vmstat.c                                  |  218 ++++++++++
 20 files changed, 1145 insertions(+), 34 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c
* [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
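
The lifetime rule this introduces can be sketched with a single-threaded
toy model. It is only an illustration of the rule: the real code uses
atomic_t and anon_vma->lock, and the names toy_anon_vma, toy_get, toy_put
and toy_unlink_vma are invented for the example. The object is released
only when both the VMA list is empty and no migration reference is held,
whichever of the two happens last.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct anon_vma */
struct toy_anon_vma {
	int nr_vmas;		/* stands in for the anon_vma->head list */
	int migrate_refcount;	/* references held by page migration */
	bool freed;		/* set where the real code would free */
};

/* Take a reference before migration starts using the anon_vma */
static void toy_get(struct toy_anon_vma *av)
{
	assert(!av->freed);
	av->migrate_refcount++;
}

/* Mirrors the unlink check: free only if no reference is still held */
static void toy_unlink_vma(struct toy_anon_vma *av)
{
	assert(av->nr_vmas > 0);
	av->nr_vmas--;
	if (av->nr_vmas == 0 && av->migrate_refcount == 0)
		av->freed = true;
}

/* Drop the reference; the last user cleans up if the list is empty */
static void toy_put(struct toy_anon_vma *av)
{
	assert(av->migrate_refcount > 0);
	av->migrate_refcount--;
	if (av->migrate_refcount == 0 && av->nr_vmas == 0)
		av->freed = true;
}
```

With a reference held, unlinking the last VMA leaves the object alive and
the later toy_put() releases it. Without a reference, the unlink releases
it immediately, as before this patch.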
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/rmap.h |   23 +++++++++++++++++++++++
 mm/migrate.c         |   12 ++++++++++++
 mm/rmap.c            |   10 +++++-----
 3 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d25bd22..567d43f 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
 #ifdef CONFIG_KSM
 	atomic_t ksm_refcount;
 #endif
+#ifdef CONFIG_MIGRATION
+	atomic_t migrate_refcount;
+#endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
 	 * mm_take_all_locks() _after_ taking the above lock. So the
@@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
 	return 0;
 }
 #endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+	atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+	return 0;
+}
+#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/migrate.c b/mm/migrate.c
index 6903abf..06e6316 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -542,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
+	struct anon_vma *anon_vma = NULL;
 
 	if (!newpage)
 		return -ENOMEM;
@@ -598,6 +599,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+		anon_vma = page_anon_vma(page);
+		atomic_inc(&anon_vma->migrate_refcount);
 	}
 
 	/*
@@ -637,6 +640,15 @@ skip_unmap:
 	if (rc)
 		remove_migration_ptes(page, page);
 rcu_unlock:
+
+	/* Drop an anon_vma reference if we took one */
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+		int empty = list_empty(&anon_vma->head);
+		spin_unlock(&anon_vma->lock);
+		if (empty)
+			anon_vma_free(anon_vma);
+	}
+
 	if (rcu_locked)
 		rcu_read_unlock();
 uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index fcd593c..578d0fe 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,7 +248,8 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+					!migrate_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,6 +274,7 @@ static void anon_vma_ctor(void *data)
 
 	spin_lock_init(&anon_vma->lock);
 	ksm_refcount_init(anon_vma);
+	migrate_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
@@ -1338,10 +1340,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	/*
 	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
 	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem, which also gave the necessary guarantee
-	 * (that this anon_vma's slab has not already been destroyed).
-	 * This needs to be reviewed later: avoiding page_lock_anon_vma()
-	 * is risky, and currently limits the usefulness of rmap_walk().
+	 * are holding mmap_sem. Users without mmap_sem are required to
+	 * take a reference count to prevent the anon_vma disappearing
 	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
-- 
1.6.5
* [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
  2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-02 16:02 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
rmap_walk_anon() was triggering errors in memory compaction that look like
use-after-free errors. The problem is that between the page being isolated
from the LRU and rcu_read_lock() being taken, the mapcount of the page
drops to 0 and the anon_vma gets freed. This can happen during memory
compaction if pages being migrated belong to a process that exits before
migration completes. Hence, the use-after-free race looks like
 1. Page isolated for migration
 2. Process exits
 3. page_mapcount(page) drops to zero so anon_vma is no longer reliable
 4. unmap_and_move() takes the rcu_lock but the anon_vma is already garbage
 5. call try_to_unmap, looks up the anon_vma and "locks" it but the lock
    is garbage.
This patch checks the mapcount after the rcu lock is taken. If the
mapcount is zero, the anon_vma is assumed to be freed and no further
action is taken.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/migrate.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 06e6316..5c5c1bd 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -599,6 +599,17 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	if (PageAnon(page)) {
 		rcu_read_lock();
 		rcu_locked = 1;
+
+		/*
+		 * If the page has no mappings any more, just bail. An
+		 * unmapped anon page is likely to be freed soon but worse,
+		 * it's possible its anon_vma disappeared between when
+		 * the page was isolated and when we reached here while
+		 * the RCU lock was not held
+		 */
+		if (!page_mapped(page))
+			goto rcu_unlock;
+
 		anon_vma = page_anon_vma(page);
 		atomic_inc(&anon_vma->migrate_refcount);
 	}
-- 
1.6.5
* [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
  2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
  2010-04-02 16:02 ` [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/rmap.h |   50 ++++++++++++++++++--------------------------------
 mm/ksm.c             |    4 ++--
 mm/migrate.c         |    4 ++--
 mm/rmap.c            |    6 ++----
 4 files changed, 24 insertions(+), 40 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 567d43f..7721674 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
  */
 struct anon_vma {
 	spinlock_t lock;	/* Serialize access to vma list */
-#ifdef CONFIG_KSM
-	atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
-	atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+	/*
+	 * The external_refcount is taken by either KSM or page migration
+	 * to take a reference to an anon_vma when there is no
+	 * guarantee that the vma of page tables will exist for
+	 * the duration of the operation. A caller that takes
+	 * the reference is responsible for clearing up the
+	 * anon_vma if they are the last user on release
+	 */
+	atomic_t external_refcount;
 #endif
 	/*
 	 * NOTE: the LSB of the head.next is set by
@@ -64,46 +70,26 @@ struct anon_vma_chain {
 };
 
 #ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
-	atomic_set(&anon_vma->ksm_refcount, 0);
+	atomic_set(&anon_vma->external_refcount, 0);
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
-	return atomic_read(&anon_vma->ksm_refcount);
+	return atomic_read(&anon_vma->external_refcount);
 }
 #else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_external_refcount_init(struct anon_vma *anon_vma)
 {
 }
 
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_external_refcount(struct anon_vma *anon_vma)
 {
 	return 0;
 }
 #endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-	atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
-	return 0;
-}
-#endif /* CONFIG_MIGRATE */
 
 static inline struct anon_vma *page_anon_vma(struct page *page)
 {
diff --git a/mm/ksm.c b/mm/ksm.c
index 8cdfc2a..3666d43 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
 			  struct anon_vma *anon_vma)
 {
 	rmap_item->anon_vma = anon_vma;
-	atomic_inc(&anon_vma->ksm_refcount);
+	atomic_inc(&anon_vma->external_refcount);
 }
 
 static void drop_anon_vma(struct rmap_item *rmap_item)
 {
 	struct anon_vma *anon_vma = rmap_item->anon_vma;
 
-	if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+	if (atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index 5c5c1bd..35aad2a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -611,7 +611,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 			goto rcu_unlock;
 
 		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->migrate_refcount);
+		atomic_inc(&anon_vma->external_refcount);
 	}
 
 	/*
@@ -653,7 +653,7 @@ skip_unmap:
 rcu_unlock:
 
 	/* Drop an anon_vma reference if we took one */
-	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
 		int empty = list_empty(&anon_vma->head);
 		spin_unlock(&anon_vma->lock);
 		if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 578d0fe..af35b75 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
 	list_del(&anon_vma_chain->same_anon_vma);
 
 	/* We must garbage collect the anon_vma if it's empty */
-	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
-					!migrate_refcount(anon_vma);
+	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
 	spin_unlock(&anon_vma->lock);
 
 	if (empty)
@@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
 	struct anon_vma *anon_vma = data;
 
 	spin_lock_init(&anon_vma->lock);
-	ksm_refcount_init(anon_vma);
-	migrate_refcount_init(anon_vma);
+	anonvma_external_refcount_init(anon_vma);
 	INIT_LIST_HEAD(&anon_vma->head);
 }
 
-- 
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (2 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA so it makes sense.
As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/Kconfig |   18 +++++++++++++++---
 1 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 9c61158..4fd75a0 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
 	default "4"
 
 #
+# support for memory compaction
+config COMPACTION
+	bool "Allow for memory compaction"
+	def_bool y
+	select MIGRATION
+	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	help
+	  Allows the compaction of memory for the allocation of huge pages.
+
+#
 # support for page migration
 #
 config MIGRATION
@@ -180,9 +190,11 @@ config MIGRATION
 	depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
 	help
 	  Allows the migration of the physical location of pages of processes
-	  while the virtual addresses are not changed. This is useful for
-	  example on NUMA systems to put pages nearer to the processors accessing
-	  the page.
+	  while the virtual addresses are not changed. This is useful in
+	  two situations. The first is on NUMA systems to put pages nearer
+	  to the processors accessing them. The second is when allocating huge
+	  pages as migration can relocate pages to satisfy a huge page
+	  allocation instead of reclaiming.
 
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
-- 
1.6.5
* [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (3 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. The huge page size will usually
be the size of interest, but not always, so the index is exported on a
per-order and per-zone basis via /proc/unusable_index.
The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
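As a rough illustration of the arithmetic, here is a minimal userspace
sketch, assuming the per-order free block counts are already available in
an array (as /proc/buddyinfo reports them). The name unusable_index_sketch
and the NR_ORDERS constant are assumptions made for the example; the result
is scaled by 1000, like the value the patch prints to three decimal places.

```c
#include <assert.h>
#include <stdint.h>

#define NR_ORDERS 11	/* assumption: stands in for MAX_ORDER */

/*
 * Unusable free space index for an allocation of 2^order pages,
 * scaled by 1000. nr_free[i] is the number of free blocks of 2^i pages.
 */
static int unusable_index_sketch(const unsigned long nr_free[NR_ORDERS],
				 unsigned int order)
{
	uint64_t free_pages = 0, usable_pages = 0;
	unsigned int i;

	assert(order < NR_ORDERS);

	for (i = 0; i < NR_ORDERS; i++) {
		uint64_t pages = (uint64_t)nr_free[i] << i;

		free_pages += pages;
		/* Only blocks at least as large as the request are usable */
		if (i >= order)
			usable_pages += pages;
	}

	/* No free memory is treated as all free memory being unusable */
	if (free_pages == 0)
		return 1000;

	return (int)((free_pages - usable_pages) * 1000 / free_pages);
}
```

For example, with 4 free order-0 blocks, 2 free order-1 blocks and 1 free
order-2 block there are 12 free pages. An order-1 request cannot use the 4
order-0 pages, so the scaled index is 4 * 1000 / 12 = 333, i.e. 0.333.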
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   13 ++++-
 mm/vmstat.c                        |  120 ++++++++++++++++++++++++++++++++++++
 2 files changed, 132 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 74d2605..e87775a 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,6 +453,7 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -610,7 +611,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -651,6 +652,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
+> cat /proc/unusable_index
+Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7f760cb..2fb4986 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,6 +453,106 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+
+struct contig_page_info {
+	unsigned long free_pages;
+	unsigned long free_blocks_total;
+	unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+				unsigned int suitable_order,
+				struct contig_page_info *info)
+{
+	unsigned int order;
+
+	info->free_pages = 0;
+	info->free_blocks_total = 0;
+	info->free_blocks_suitable = 0;
+
+	for (order = 0; order < MAX_ORDER; order++) {
+		unsigned long blocks;
+
+		/* Count number of free blocks */
+		blocks = zone->free_area[order].nr_free;
+		info->free_blocks_total += blocks;
+
+		/* Count free base pages */
+		info->free_pages += blocks << order;
+
+		/* Count the suitable free blocks */
+		if (order >= suitable_order)
+			info->free_blocks_suitable += blocks <<
+						(order - suitable_order);
+	}
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5
* [PATCH 06/14] Export fragmentation index via /proc/extfrag_index
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (4 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 07/14] Move definition for LRU isolation modes to a header Mel Gorman
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(values towards 1). The huge page size will usually be the size of interest,
but not always, so the index is exported on a per-order and per-zone basis
via /proc/extfrag_index.
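The calculation can be tried out in userspace with a minimal sketch such as
the following. It assumes the three counters have already been gathered
(free pages, total free blocks, and free blocks large enough for the
request); fragmentation_index_sketch is a name chosen for the example, and
the result is scaled by 1000 as in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fragmentation index for a request of 2^order pages, scaled by 1000.
 * Returns -1000 when a suitable free block exists, because the index
 * only has meaning when the allocation would fail.
 */
static int fragmentation_index_sketch(unsigned int order,
				      uint64_t free_pages,
				      uint64_t free_blocks_total,
				      uint64_t free_blocks_suitable)
{
	uint64_t requested;

	assert(order < 63);
	requested = 1ULL << order;

	if (free_blocks_total == 0)
		return 0;

	if (free_blocks_suitable)
		return -1000;

	return 1000 - (int)((1000 + free_pages * 1000 / requested) /
			    free_blocks_total);
}
```

For example, 1024 free pages held entirely as order-0 blocks give a scaled
index of 1000 - (1000 + 2000) / 1024 = 998 for an order-9 request: enough
memory is free for two such allocations, but it is scattered.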
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/filesystems/proc.txt |   14 ++++++-
 mm/vmstat.c                        |   82 ++++++++++++++++++++++++++++++++++++
 2 files changed, 95 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e87775a..c041638 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -422,6 +422,7 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -611,7 +612,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -662,6 +663,17 @@ value between 0 and 1. The higher the value, the more of free memory is
 unusable and by implication, the worse the external fragmentation is. This
 can be expressed as a percentage by multiplying by 100.
 
+> cat /proc/extfrag_index
+Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1 such as
+in many of the examples above states that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation failed due to a lack of memory. A value tending
+towards 1 implies it failed due to external fragmentation.
+
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2fb4986..351e491 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/vmstat.h>
 #include <linux/sched.h>
+#include <linux/math64.h>
 
 #ifdef CONFIG_VM_EVENT_COUNTERS
 DEFINE_PER_CPU(struct vm_event_state, vm_event_states) = {{0}};
@@ -553,6 +554,67 @@ static int unusable_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+	unsigned long requested = 1UL << order;
+
+	if (!info->free_blocks_total)
+		return 0;
+
+	/* Fragmentation index only makes sense when a request would fail */
+	if (info->free_blocks_suitable)
+		return -1000;
+
+	/*
+	 * Index is between 0 and 1 so return within 3 decimal places
+	 *
+	 * 0 => allocation would fail due to lack of memory
+	 * 1 => allocation would fail due to fragmentation
+	 */
+	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -722,6 +784,25 @@ static const struct file_operations unusable_file_ops = {
 	.release	= seq_release,
 };
 
+static const struct seq_operations extfrag_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open		= extfrag_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1067,6 +1148,7 @@ static int __init setup_vmstat(void)
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
 	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
-- 
1.6.5
* [PATCH 07/14] Move definition for LRU isolation modes to a header
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (5 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Currently, vmscan.c defines the isolation modes for
__isolate_lru_page(). Memory compaction needs access to these modes for
isolating pages for migration.  This patch exports them.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
---
 include/linux/swap.h |    5 +++++
 mm/vmscan.c          |    5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1f59d93..986b12d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -238,6 +238,11 @@ static inline void lru_cache_add_active_file(struct page *page)
 	__lru_cache_add(page, LRU_ACTIVE_FILE);
 }
 
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
+#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
+
 /* linux/mm/vmscan.c */
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 					gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..ef89600 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -839,11 +839,6 @@ keep:
 	return nr_reclaimed;
 }
 
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0	/* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1	/* Isolate active pages. */
-#define ISOLATE_BOTH 2		/* Isolate both active and inactive pages. */
-
 /*
  * Attempt to remove the specified page from its LRU.  Only take this page
  * if it is of the appropriate PageActive status.  Pages which are being
-- 
1.6.5
* [PATCH 08/14] Memory compaction core
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (6 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 07/14] Move definition for LRU isolation modes to a header Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-08 16:59   ` Mel Gorman
  2010-04-02 16:02 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas, consuming the free pages within them and making them available to
the migration scanner. The pages isolated for migration are then migrated
to the newly isolated free pages.
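The two-scanner idea can be shown with a toy model. This is a deliberate
simplification and not the code in this patch: the zone is a character
array with one character per page ('M' movable, 'F' free, 'U' unmovable),
the scanners step one page at a time instead of one pageblock at a time,
and pages are moved immediately rather than being isolated onto lists
first. compact_toy_zone is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy compaction pass over `zone`. Movable pages found by the migration
 * scanner (walking up from the bottom) are moved into free slots found by
 * the free scanner (walking down from the top). The pass ends when the
 * scanners meet. Returns the number of pages moved.
 */
static size_t compact_toy_zone(char *zone, size_t nr_pages)
{
	size_t migrate_scan = 0;	/* next page to inspect, going up */
	size_t free_scan = nr_pages;	/* one past the next page, going down */
	size_t moved = 0;

	assert(zone != NULL);

	while (migrate_scan < free_scan) {
		if (zone[migrate_scan] != 'M') {
			migrate_scan++;
			continue;
		}

		/* Walk the free scanner down to the next free page */
		while (free_scan > migrate_scan + 1 &&
		       zone[free_scan - 1] != 'F')
			free_scan--;

		/* Scanners met without finding a free target */
		if (free_scan <= migrate_scan + 1)
			break;

		/* "Migrate" the page to the free slot near the top */
		zone[free_scan - 1] = 'M';
		zone[migrate_scan] = 'F';
		free_scan--;
		migrate_scan++;
		moved++;
	}

	return moved;
}
```

Starting from the layout "MFUMFFMF", the pass moves two pages and leaves
"FFUFFMMM": the movable pages end up together at the top, the unmovable
page stays where it was, and the free pages at the bottom become more
contiguous.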
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |    9 +
 include/linux/mm.h         |    1 +
 include/linux/swap.h       |    1 +
 include/linux/vmstat.h     |    1 +
 mm/Makefile                |    1 +
 mm/compaction.c            |  379 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   39 +++++
 mm/vmstat.c                |    5 +
 8 files changed, 436 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/compaction.h
 create mode 100644 mm/compaction.c
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..dbebe58
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,9 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE	0
+#define COMPACT_PARTIAL		1
+#define COMPACT_COMPLETE	2
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3b473a..f920815 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -335,6 +335,7 @@ void put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 986b12d..cf8bba7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
 };
 
 #define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
 
 #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
 #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 117f0dd..56e4b44 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
 obj-$(CONFIG_FS_XIP) += filemap_xip.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
 obj-$(CONFIG_SMP) += percpu.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..4041209
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,379 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include <linux/backing-dev.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+	struct list_head freepages;	/* List of free pages to migrate to */
+	struct list_head migratepages;	/* List of pages being migrated */
+	unsigned long nr_freepages;	/* Number of isolated free pages */
+	unsigned long nr_migratepages;	/* Number of pages to migrate */
+	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long migrate_pfn;	/* isolate_migratepages search base */
+
+	/* Account for isolated anon and file pages */
+	unsigned long nr_anon;
+	unsigned long nr_file;
+
+	struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+	struct page *page, *next;
+	int count = 0;
+
+	list_for_each_entry_safe(page, next, freelist, lru) {
+		list_del(&page->lru);
+		__free_page(page);
+		count++;
+	}
+
+	return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+				unsigned long blockpfn,
+				struct list_head *freelist)
+{
+	unsigned long zone_end_pfn, end_pfn;
+	int total_isolated = 0;
+
+	/* Get the last PFN we should scan for free pages at */
+	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	end_pfn = blockpfn + pageblock_nr_pages;
+	if (end_pfn > zone_end_pfn)
+		end_pfn = zone_end_pfn;
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++) {
+		struct page *page;
+		int isolated, i;
+
+		if (!pfn_valid_within(blockpfn))
+			continue;
+
+		page = pfn_to_page(blockpfn);
+		if (!PageBuddy(page))
+			continue;
+
+		/* Found a free page, break it into order-0 pages */
+		isolated = split_free_page(page);
+		total_isolated += isolated;
+		for (i = 0; i < isolated; i++) {
+			list_add(&page->lru, freelist);
+			page++;
+		}
+
+		/* If a page was split, advance to the end of it */
+		if (isolated)
+			blockpfn += isolated - 1;
+	}
+
+	return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+
+	int migratetype = get_pageblock_migratetype(page);
+
+	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
+	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
+		return 0;
+
+	/* If the page is a large free page, then allow migration */
+	if (PageBuddy(page) && page_order(page) >= pageblock_order)
+		return 1;
+
+	/* If the block is MIGRATE_MOVABLE, allow migration */
+	if (migratetype == MIGRATE_MOVABLE)
+		return 1;
+
+	/* Otherwise skip the block */
+	return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+				struct compact_control *cc)
+{
+	struct page *page;
+	unsigned long high_pfn, low_pfn, pfn;
+	unsigned long flags;
+	int nr_freepages = cc->nr_freepages;
+	struct list_head *freelist = &cc->freepages;
+
+	pfn = cc->free_pfn;
+	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+	high_pfn = low_pfn;
+
+	/*
+	 * Isolate free pages until enough are available to migrate the
+	 * pages on cc->migratepages. We stop searching if the migrate
+	 * and free page scanners meet or enough free pages are isolated.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+					pfn -= pageblock_nr_pages) {
+		int isolated;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		/*
+		 * Check for overlapping nodes/zones. It's possible on some
+		 * configurations to have a setup like
+		 * node0 node1 node0
+		 * i.e. it's possible that all pages within a zone's range of
+		 * pages do not belong to a single zone.
+		 */
+		page = pfn_to_page(pfn);
+		if (page_zone(page) != zone)
+			continue;
+
+		/* Check the block is suitable for migration */
+		if (!suitable_migration_target(page))
+			continue;
+
+		/* Found a block suitable for isolating free pages from */
+		isolated = isolate_freepages_block(zone, pfn, freelist);
+		nr_freepages += isolated;
+
+		/*
+		 * Record the highest PFN we isolated pages from. When next
+		 * looking for free pages, the search will restart here as
+		 * page migration may have returned some pages to the allocator
+		 */
+		if (isolated)
+			high_pfn = max(high_pfn, pfn);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+
+	cc->free_pfn = high_pfn;
+	cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone */
+static void acct_isolated(struct zone *zone, struct compact_control *cc)
+{
+	struct page *page;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+
+	list_for_each_entry(page, &cc->migratepages, lru) {
+		int lru = page_lru_base_type(page);
+		count[lru]++;
+	}
+
+	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/* Similar to reclaim, but different enough that they don't share logic */
+static int too_many_isolated(struct zone *zone)
+{
+
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+					zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+					zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+					struct compact_control *cc)
+{
+	unsigned long low_pfn, end_pfn;
+	struct list_head *migratelist;
+
+	low_pfn = cc->migrate_pfn;
+	migratelist = &cc->migratepages;
+
+	/* Do not scan outside zone boundaries */
+	if (low_pfn < zone->zone_start_pfn)
+		low_pfn = zone->zone_start_pfn;
+
+	/* Setup to scan one block but not past where we are migrating to */
+	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+	/* Do not cross the free scanner or scan within a memory hole */
+	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+		cc->migrate_pfn = end_pfn;
+		return 0;
+	}
+
+	/* Do not isolate the world */
+	while (unlikely(too_many_isolated(zone))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		if (fatal_signal_pending(current))
+			return 0;
+	}
+
+	/* Time to isolate some pages for migration */
+	spin_lock_irq(&zone->lru_lock);
+	for (; low_pfn < end_pfn; low_pfn++) {
+		struct page *page;
+		if (!pfn_valid_within(low_pfn))
+			continue;
+
+		/* Get the page and skip if free */
+		page = pfn_to_page(low_pfn);
+		if (PageBuddy(page)) {
+			low_pfn += (1 << page_order(page)) - 1;
+			continue;
+		}
+
+		/* Try to isolate the page */
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+			del_page_from_lru_list(zone, page, page_lru(page));
+			list_add(&page->lru, migratelist);
+			mem_cgroup_del_lru(page);
+			cc->nr_migratepages++;
+		}
+
+		/* Avoid isolating too much */
+		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+			break;
+	}
+
+	acct_isolated(zone, cc);
+
+	spin_unlock_irq(&zone->lru_lock);
+	cc->migrate_pfn = low_pfn;
+
+	return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+					unsigned long data,
+					int **result)
+{
+	struct compact_control *cc = (struct compact_control *)data;
+	struct page *freepage;
+
+	/* Isolate free pages if necessary */
+	if (list_empty(&cc->freepages)) {
+		isolate_freepages(cc->zone, cc);
+
+		if (list_empty(&cc->freepages))
+			return NULL;
+	}
+
+	freepage = list_entry(cc->freepages.next, struct page, lru);
+	list_del(&freepage->lru);
+	cc->nr_freepages--;
+
+	return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+	int nr_migratepages = 0;
+	int nr_freepages = 0;
+	struct page *page;
+	list_for_each_entry(page, &cc->migratepages, lru)
+		nr_migratepages++;
+	list_for_each_entry(page, &cc->freepages, lru)
+		nr_freepages++;
+
+	cc->nr_migratepages = nr_migratepages;
+	cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+						struct compact_control *cc)
+{
+	if (fatal_signal_pending(current))
+		return COMPACT_PARTIAL;
+
+	/* Compaction run completes if the migrate and free scanner meet */
+	if (cc->free_pfn <= cc->migrate_pfn)
+		return COMPACT_COMPLETE;
+
+	return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+	int ret = COMPACT_INCOMPLETE;
+
+	/* Setup to move all movable pages to the end of the zone */
+	cc->migrate_pfn = zone->zone_start_pfn;
+	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	migrate_prep();
+
+	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+		unsigned long nr_migrate, nr_remaining;
+		if (!isolate_migratepages(zone, cc))
+			continue;
+
+		nr_migrate = cc->nr_migratepages;
+		migrate_pages(&cc->migratepages, compaction_alloc,
+						(unsigned long)cc, 0);
+		update_nr_listpages(cc);
+		nr_remaining = cc->nr_migratepages;
+
+		count_vm_event(COMPACTBLOCKS);
+		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+		if (nr_remaining)
+			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+		/* Release LRU pages not migrated */
+		if (!list_empty(&cc->migratepages)) {
+			putback_lru_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+		}
+
+	}
+
+	/* Release free pages and check accounting */
+	cc->nr_freepages -= release_freepages(&cc->freepages);
+	VM_BUG_ON(cc->nr_freepages != 0);
+
+	return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 624cba4..3cf947d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	unsigned long watermark;
+	struct zone *zone;
+
+	BUG_ON(!PageBuddy(page));
+
+	zone = page_zone(page);
+	order = page_order(page);
+
+	/* Obey watermarks or the system could deadlock */
+	watermark = low_wmark_pages(zone) + (1 << order);
+	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		return 0;
+
+	/* Remove page from free list */
+	list_del(&page->lru);
+	zone->free_area[order].nr_free--;
+	rmv_page_order(page);
+	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+
+	if (order >= pageblock_order - 1) {
+		struct page *endpage = page + (1 << order) - 1;
+		for (; page < endpage; page += pageblock_nr_pages)
+			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+	}
+
+	return 1 << order;
+}
+
+/*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
  * or two.
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 351e491..3a69b48 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -892,6 +892,11 @@ static const char * const vmstat_text[] = {
 	"allocstall",
 
 	"pgrotated",
+
+	"compact_blocks_moved",
+	"compact_pages_moved",
+	"compact_pagemigrate_failed",
+
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
 	"htlb_buddy_alloc_fail",
-- 
1.6.5
* [PATCH 09/14] Add /proc trigger for memory compaction
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (7 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
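
As a rough illustration of how a job scheduler might use the trigger, the
userspace sketch below writes an arbitrary value to the file. The helper
name trigger_compaction() is made up for this example and is not part of
the patch. It assumes a kernel built with CONFIG_COMPACTION and a caller
with permission to write the file (mode 0200).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of this patch: write an arbitrary value
 * to a compaction trigger file such as /proc/sys/vm/compact_memory.
 * Returns 0 on success or a negative errno value on failure.
 */
int trigger_compaction(const char *path)
{
	int fd, err = 0;

	assert(path != NULL);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	/* The written value is unused; any write requests compaction */
	if (write(fd, "1\n", 2) < 0)
		err = -errno;

	close(fd);
	return err;
}
```

A scheduler could call trigger_compaction("/proc/sys/vm/compact_memory")
shortly before starting a job that depends on huge pages, and treat a
negative return value as "trigger not available".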
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
---
 Documentation/sysctl/vm.txt |   11 +++++++
 include/linux/compaction.h  |    6 ++++
 kernel/sysctl.c             |   10 +++++++
 mm/compaction.c             |   62 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 88 insertions(+), 1 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 56366a5..803c018 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
 Currently, these files are in /proc/sys/vm:
 
 - block_dump
+- compact_memory
 - dirty_background_bytes
 - dirty_background_ratio
 - dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 ==============================================================
 
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
 dirty_background_bytes
 
 Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index dbebe58..fef591b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -6,4 +6,10 @@
 #define COMPACT_PARTIAL		1
 #define COMPACT_COMPLETE	2
 
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compact_memory;
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 455f394..3838928 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -53,6 +53,7 @@
 #include <linux/slow-work.h>
 #include <linux/perf_event.h>
 #include <linux/kprobes.h>
+#include <linux/compaction.h>
 
 #include <asm/uaccess.h>
 #include <asm/processor.h>
@@ -1102,6 +1103,15 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= drop_caches_sysctl_handler,
 	},
+#ifdef CONFIG_COMPACTION
+	{
+		.procname	= "compact_memory",
+		.data		= &sysctl_compact_memory,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_compaction_handler,
+	},
+#endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 4041209..615b811 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
 #include <linux/compaction.h>
 #include <linux/mm_inline.h>
 #include <linux/backing-dev.h>
+#include <linux/sysctl.h>
 #include "internal.h"
 
 /*
@@ -322,7 +323,7 @@ static void update_nr_listpages(struct compact_control *cc)
 	cc->nr_freepages = nr_freepages;
 }
 
-static inline int compact_finished(struct zone *zone,
+static int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
 	if (fatal_signal_pending(current))
@@ -377,3 +378,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+	int zoneid;
+	pg_data_t *pgdat;
+	struct zone *zone;
+
+	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
+		return -EINVAL;
+	pgdat = NODE_DATA(nid);
+
+	/* Flush pending updates to the LRU lists */
+	lru_add_drain_all();
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct compact_control cc;
+
+		zone = &pgdat->node_zones[zoneid];
+		if (!populated_zone(zone))
+			continue;
+
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
+
+		compact_zone(zone, &cc);
+
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
+	}
+
+	return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+	int nid;
+
+	for_each_online_node(nid)
+		compact_node(nid);
+
+	return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		return compact_nodes();
+
+	return 0;
+}
-- 
1.6.5
* [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (8 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.
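
For illustration only, a scheduler that compacts just the nodes a job will
run on needs the path of the per-node file. The sketch below builds it
from a node id, using the path documented in sysfs-devices-node in this
patch. The helper name node_compact_path() is hypothetical and exists only
in this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of this patch: format the path of the
 * per-node compaction trigger for node @nid into @buf.
 * Returns the length of the path, or -1 if @nid is negative or the
 * buffer is too small to hold the whole path.
 */
int node_compact_path(char *buf, size_t len, int nid)
{
	int n;

	assert(buf != NULL);

	if (nid < 0)
		return -1;

	n = snprintf(buf, len, "/sys/devices/system/node/node%d/compact", nid);
	if (n < 0 || (size_t)n >= len)
		return -1;

	return n;
}
```

Writing any value to the resulting path compacts every zone in that node.
The file is only created when CONFIG_COMPACTION, CONFIG_SYSFS and
CONFIG_NUMA are all set, so a caller should expect the path to be absent
on other configurations.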
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/ABI/testing/sysfs-devices-node |    7 +++++++
 drivers/base/node.c                          |    3 +++
 include/linux/compaction.h                   |   16 ++++++++++++++++
 mm/compaction.c                              |   23 +++++++++++++++++++++++
 4 files changed, 49 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-devices-node
diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..453a210
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,7 @@
+What:		/sys/devices/system/node/nodeX/compact
+Date:		February 2010
+Contact:	Mel Gorman <mel@csn.ul.ie>
+Description:
+		When this file is written to, all memory within that node
+		will be compacted. When it completes, memory will be freed
+		into blocks which have as many contiguous pages as possible
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 93b3ac6..07cdcc6 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
 #include <linux/cpu.h>
 #include <linux/device.h>
 #include <linux/swap.h>
+#include <linux/compaction.h>
 
 static struct sysdev_class_attribute *node_state_attrs[];
 
@@ -245,6 +246,8 @@ int register_node(struct node *node, int num, struct node *parent)
 		scan_unevictable_register_node(node);
 
 		hugetlb_register_node(node);
+
+		compaction_register_node(node);
 	}
 	return error;
 }
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index fef591b..c4ab05f 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -12,4 +12,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
 #endif /* CONFIG_COMPACTION */
 
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
 #endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 615b811..b058bae 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
 #include <linux/mm_inline.h>
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
+#include <linux/sysfs.h>
 #include "internal.h"
 
 /*
@@ -437,3 +438,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 
 	return 0;
 }
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+			struct sysdev_attribute *attr,
+			const char *buf, size_t count)
+{
+	compact_node(dev->id);
+
+	return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+	return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+	sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
-- 
1.6.5
* [PATCH 11/14] Direct compact when a high-order allocation fails
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (9 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:06   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation.  With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.
Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.
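
The distinction between "low memory" and "external fragmentation" is made
with the fragmentation index. The userspace sketch below reproduces the
arithmetic of the final return statement of __fragmentation_index() so the
scale is easier to see. The struct contig_info type and frag_index() name
are stand-ins invented for this example, and the two early returns are
assumptions inferred from the comment in try_to_compact_pages(), not copied
from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct contig_page_info */
struct contig_info {
	unsigned long free_pages;		/* free pages in the zone */
	unsigned long free_blocks_total;	/* free blocks of any order */
	unsigned long free_blocks_suitable;	/* free blocks >= the order */
};

/* Index on a 0..1000 scale, or -1 if a suitable free block exists */
int frag_index(unsigned int order, const struct contig_info *info)
{
	unsigned long long requested = 1ULL << order;

	assert(info != NULL);

	/* Assumption: a zone with no free blocks reports index 0 */
	if (!info->free_blocks_total)
		return 0;

	/* Assumption: -1 means the allocation might already succeed */
	if (info->free_blocks_suitable)
		return -1;

	/* Same arithmetic as the return statement in the patch */
	return (int)(1000 - (1000 + info->free_pages * 1000ULL / requested) /
				info->free_blocks_total);
}
```

With 1024 free pages and an order-9 request, 1024 separate order-0 blocks
give an index of 998 (failure is due to fragmentation), while four order-8
blocks give 250, which is below the cut-off of 500 used in this patch, so
that zone would be skipped and reclaim used instead.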
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
---
 include/linux/compaction.h |   20 ++++++--
 include/linux/vmstat.h     |    1 +
 mm/compaction.c            |  117 ++++++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c            |   31 ++++++++++++
 mm/vmstat.c                |   15 +++++-
 5 files changed, 178 insertions(+), 6 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index c4ab05f..faa3faf 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,15 +1,27 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
-/* Return values for compact_zone() */
-#define COMPACT_INCOMPLETE	0
-#define COMPACT_PARTIAL		1
-#define COMPACT_COMPLETE	2
+/* Return values for compact_zone() and try_to_compact_pages() */
+#define COMPACT_SKIPPED		0
+#define COMPACT_INCOMPLETE	1
+#define COMPACT_PARTIAL		2
+#define COMPACT_COMPLETE	3
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	return COMPACT_INCOMPLETE;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 56e4b44..b4b4d34 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/compaction.c b/mm/compaction.c
index b058bae..e8ef511 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -35,6 +35,8 @@ struct compact_control {
 	unsigned long nr_anon;
 	unsigned long nr_file;
 
+	unsigned int order;		/* order a direct compactor needs */
+	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
 };
 
@@ -327,6 +329,9 @@ static void update_nr_listpages(struct compact_control *cc)
 static int compact_finished(struct zone *zone,
 						struct compact_control *cc)
 {
+	unsigned int order;
+	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
@@ -334,6 +339,24 @@ static int compact_finished(struct zone *zone,
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
 
+	/* Compaction run is not finished if the watermark is not met */
+	if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+		return COMPACT_INCOMPLETE;
+
+	if (cc->order == -1)
+		return COMPACT_INCOMPLETE;
+
+	/* Direct compactor: Is a suitable page free? */
+	for (order = cc->order; order < MAX_ORDER; order++) {
+		/* Job done if page is free of the right migratetype */
+		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+			return COMPACT_PARTIAL;
+
+		/* Job done if allocation would set block type */
+		if (order >= pageblock_order && zone->free_area[order].nr_free)
+			return COMPACT_PARTIAL;
+	}
+
 	return COMPACT_INCOMPLETE;
 }
 
@@ -379,6 +402,99 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	return ret;
 }
 
+static unsigned long compact_zone_order(struct zone *zone,
+						int order, gfp_t gfp_mask)
+{
+	struct compact_control cc = {
+		.nr_freepages = 0,
+		.nr_migratepages = 0,
+		.order = order,
+		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.zone = zone,
+	};
+	INIT_LIST_HEAD(&cc.freepages);
+	INIT_LIST_HEAD(&cc.migratepages);
+
+	return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+			int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+	int may_enter_fs = gfp_mask & __GFP_FS;
+	int may_perform_io = gfp_mask & __GFP_IO;
+	unsigned long watermark;
+	struct zoneref *z;
+	struct zone *zone;
+	int rc = COMPACT_SKIPPED;
+
+	/*
+	 * Check whether it is worth even starting compaction. The order check is
+	 * made because an assumption is made that the page allocator can satisfy
+	 * the "cheaper" orders without taking special steps
+	 */
+	if (order <= PAGE_ALLOC_COSTLY_ORDER || !may_enter_fs || !may_perform_io)
+		return rc;
+
+	count_vm_event(COMPACTSTALL);
+
+	/* Compact each zone in the list */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+								nodemask) {
+		int fragindex;
+		int status;
+
+		/*
+		 * Watermarks for order-0 must be met for compaction. Note
+		 * the 2UL. This is because during migration, copies of
+		 * pages need to be allocated and for a short time, the
+		 * footprint is higher
+		 */
+		watermark = low_wmark_pages(zone) + (2UL << order);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+			continue;
+
+		/*
+		 * fragmentation index determines if allocation failures are
+		 * due to low memory or external fragmentation
+		 *
+		 * index of -1 implies allocations might succeed depending
+		 * 	on watermarks
+		 * index towards 0 implies failure is due to lack of memory
+		 * index towards 1000 implies failure is due to fragmentation
+		 *
+		 * Only compact if a failure would be due to fragmentation.
+		 */
+		fragindex = fragmentation_index(zone, order);
+		if (fragindex >= 0 && fragindex <= 500)
+			continue;
+
+		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+			rc = COMPACT_PARTIAL;
+			break;
+		}
+
+		status = compact_zone_order(zone, order, gfp_mask);
+		rc = max(status, rc);
+
+		if (zone_watermark_ok(zone, order, watermark, 0, 0))
+			break;
+	}
+
+	return rc;
+}
+
+
 /* Compact all zones within a node */
 static int compact_node(int nid)
 {
@@ -403,6 +519,7 @@ static int compact_node(int nid)
 		cc.nr_freepages = 0;
 		cc.nr_migratepages = 0;
 		cc.zone = zone;
+		cc.order = -1;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3cf947d..7a2e4a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
 #include <linux/memory.h>
+#include <linux/compaction.h>
 #include <trace/events/kmem.h>
 #include <linux/ftrace_event.h>
 
@@ -1768,6 +1769,36 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
+	/* Try memory compaction for high-order allocations before reclaim */
+	if (order) {
+		*did_some_progress = try_to_compact_pages(zonelist,
+						order, gfp_mask, nodemask);
+		if (*did_some_progress != COMPACT_SKIPPED) {
+
+			/* Page migration frees to the PCP lists but we want merging */
+			drain_pages(get_cpu());
+			put_cpu();
+
+			page = get_page_from_freelist(gfp_mask, nodemask,
+					order, zonelist, high_zoneidx,
+					alloc_flags, preferred_zone,
+					migratetype);
+			if (page) {
+				__count_vm_event(COMPACTSUCCESS);
+				return page;
+			}
+
+			/*
+			 * It's bad if a compaction run occurs and fails.
+			 * The most likely reason is that pages exist,
+			 * but not enough to satisfy watermarks.
+			 */
+			count_vm_event(COMPACTFAIL);
+
+			cond_resched();
+		}
+	}
+
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a69b48..2780a36 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
  * The value can be used to determine if page reclaim or compaction
  * should be used
  */
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+int __fragmentation_index(unsigned int order, struct contig_page_info *info)
 {
 	unsigned long requested = 1UL << order;
 
@@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
 }
 
+/* Same as __fragmentation_index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+	struct contig_page_info info;
+
+	fill_contig_page_info(zone, order, &info);
+	return __fragmentation_index(order, &info);
+}
 
 static void extfrag_show_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
@@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
 				zone->name);
 	for (order = 0; order < MAX_ORDER; ++order) {
 		fill_contig_page_info(zone, order, &info);
-		index = fragmentation_index(order, &info);
+		index = __fragmentation_index(order, &info);
 		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
 	}
 
@@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
+	"compact_stall",
+	"compact_fail",
+	"compact_success",
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
-- 
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (10 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:06   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
The kernel applies some heuristics when deciding if memory should be
compacted or reclaimed to satisfy a high-order allocation. One of these
is based on the fragmentation index. If the index is 500 or below, memory
will not be compacted. This choice is arbitrary and not based on data. To
help optimise the system and set a sensible default for this value, this
patch adds a sysctl extfrag_threshold. The kernel will only compact
memory if the fragmentation index is above the extfrag_threshold.
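As a rough illustration of the heuristic described above, the following
userspace C sketch recomputes the index with the same integer arithmetic as
the return statement of __fragmentation_index() and applies the threshold
test used in try_to_compact_pages(). The struct frag_info type and the names
frag_index() and threshold_skips_zone() are hypothetical stand-ins, not
kernel interfaces. The two early returns are assumptions inferred from the
documentation text (0 when nothing is free, -1 when a suitable block already
exists).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the counters the index is built from. */
struct frag_info {
	uint64_t free_pages;           /* total free pages in the zone */
	uint64_t free_blocks_total;    /* free blocks of any order */
	uint64_t free_blocks_suitable; /* free blocks >= requested order */
};

/*
 * Index scaled to 0..1000 for a request of 2^order pages.
 * Assumption: 0 is returned when there are no free blocks and -1 when
 * a suitable block exists, as the documentation text suggests.
 */
int frag_index(unsigned int order, const struct frag_info *info)
{
	uint64_t requested;

	assert(order < 32); /* keep the shift well defined in this sketch */
	requested = 1ULL << order;

	if (info->free_blocks_total == 0)
		return 0;
	if (info->free_blocks_suitable)
		return -1;

	/* Same integer arithmetic as the patch's return statement. */
	return 1000 - (int)((1000 + info->free_pages * 1000ULL / requested)
			    / info->free_blocks_total);
}

/* Non-zero when the zone would be skipped, i.e. 0 <= index <= threshold. */
int threshold_skips_zone(int fragindex, int threshold)
{
	return fragindex >= 0 && fragindex <= threshold;
}
```

With 64 free pages held as 64 order-0 blocks, an order-4 request gives an
index of 922, so the zone is not skipped under the default threshold of 500.
With 16 free pages in four order-2 blocks the index is exactly 500 and the
zone is skipped.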
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |   18 ++++++++++++++++--
 include/linux/compaction.h  |    3 +++
 kernel/sysctl.c             |   15 +++++++++++++++
 mm/compaction.c             |   12 +++++++++++-
 4 files changed, 45 insertions(+), 3 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 803c018..878b1b4 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -27,6 +27,7 @@ Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirty_writeback_centisecs
 - drop_caches
+- extfrag_threshold
 - hugepages_treat_as_movable
 - hugetlb_shm_group
 - laptop_mode
@@ -131,8 +132,7 @@ out to disk.  This tunable expresses the interval between those wakeups, in
 
 Setting this to zero disables periodic writeback altogether.
 
-==============================================================
-
+============================================================== 
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, dentries and
@@ -150,6 +150,20 @@ user should run `sync' first.
 
 ==============================================================
 
+extfrag_threshold
+
+This parameter affects whether the kernel will compact memory or direct
+reclaim to satisfy a high-order allocation. /proc/extfrag_index shows what
+the fragmentation index for each order is in each zone in the system. Values
+tending towards 0 imply allocations would fail due to lack of memory,
+values towards 1000 imply failures are due to fragmentation and -1 implies
+that the allocation will succeed as long as watermarks are met.
+
+The kernel will not compact memory in a zone if the
+fragmentation index is <= extfrag_threshold. The default value is 500.
+
+==============================================================
+
 hugepages_treat_as_movable
 
 This parameter is only useful when kernelcore= is specified at boot time to
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index faa3faf..ae98afc 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -11,6 +11,9 @@
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
+extern int sysctl_extfrag_threshold;
+extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos);
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 3838928..b8f292e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -243,6 +243,11 @@ static int min_sched_shares_ratelimit = 100000; /* 100 usec */
 static int max_sched_shares_ratelimit = NSEC_PER_SEC; /* 1 second */
 #endif
 
+#ifdef CONFIG_COMPACTION
+static int min_extfrag_threshold = 0;
+static int max_extfrag_threshold = 1000;
+#endif
+
 static struct ctl_table kern_table[] = {
 	{
 		.procname	= "sched_child_runs_first",
@@ -1111,6 +1116,16 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0200,
 		.proc_handler	= sysctl_compaction_handler,
 	},
+	{
+		.procname	= "extfrag_threshold",
+		.data		= &sysctl_extfrag_threshold,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_extfrag_handler,
+		.extra1		= &min_extfrag_threshold,
+		.extra2		= &max_extfrag_threshold,
+	},
+
 #endif /* CONFIG_COMPACTION */
 	{
 		.procname	= "min_free_kbytes",
diff --git a/mm/compaction.c b/mm/compaction.c
index e8ef511..3bb65d7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -418,6 +418,8 @@ static unsigned long compact_zone_order(struct zone *zone,
 	return compact_zone(zone, &cc);
 }
 
+int sysctl_extfrag_threshold = 500;
+
 /**
  * try_to_compact_pages - Direct compact to satisfy a high-order allocation
  * @zonelist: The zonelist used for the current allocation
@@ -476,7 +478,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		 * Only compact if a failure would be due to fragmentation.
 		 */
 		fragindex = fragmentation_index(zone, order);
-		if (fragindex >= 0 && fragindex <= 500)
+		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
 			continue;
 
 		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
@@ -556,6 +558,14 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int sysctl_extfrag_handler(struct ctl_table *table, int write,
+			void __user *buffer, size_t *length, loff_t *ppos)
+{
+	proc_dointvec_minmax(table, write, buffer, length, ppos);
+
+	return 0;
+}
+
 #if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
 ssize_t sysfs_compact_node(struct sys_device *dev,
 			struct sysdev_attribute *attr,
-- 
1.6.5
* [PATCH 13/14] Do not compact within a preferred zone after a compaction failure
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (11 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-07  0:06   ` Andrew Morton
  2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
  2010-04-06 14:47 ` [PATCH 0/14] Memory Compaction v7 Tarkan Erimer
  14 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
The fragmentation index may indicate that a failure is due to external
fragmentation but after a compaction run completes, it is still possible
for an allocation to fail. There are two obvious reasons why:
  o Page migration cannot move all pages so fragmentation remains
  o A suitable page may exist but watermarks are not met
In the event of compaction followed by an allocation failure, this patch
defers further compaction in the zone for a period of time. The zone that
is deferred is the first zone in the zonelist - i.e. the preferred zone.
To defer compaction in the other zones, the information would need to be
stored in the zonelist or implemented similar to the zonelist_cache.
This would impact the fast-paths and is not justified at this time.
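The deferral scheme can be modelled in userspace roughly as follows. This is
only a sketch: struct zone_model, tick_before() and the *_model() functions
are hypothetical names, and the explicit "now" argument stands in for
jiffies. The comparison follows the same wraparound-tolerant idea as the
kernel's time_before().

```c
#include <assert.h>

/* Hypothetical stand-in for the one field the patch adds to struct zone. */
struct zone_model {
	unsigned long compact_resume; /* 0 means "not initialised yet" */
};

/*
 * True if tick a is before tick b, tolerating counter wraparound.
 * Relies on the usual two's-complement conversion to long.
 */
int tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Record the tick at which compaction may be attempted again. */
void defer_compaction_model(struct zone_model *zone, unsigned long resume)
{
	zone->compact_resume = resume;
}

/* Non-zero while compaction in this zone should still be avoided. */
int compaction_deferred_model(struct zone_model *zone, unsigned long now)
{
	/* First use: initialise and allow compaction. */
	if (!zone->compact_resume) {
		zone->compact_resume = now;
		return 0;
	}

	return tick_before(now, zone->compact_resume);
}
```

Since there are HZ ticks per second, a deferral of HZ/50 ticks corresponds to
about 20 milliseconds whatever HZ is configured to.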
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
 include/linux/mmzone.h     |    7 +++++++
 mm/page_alloc.c            |    5 ++++-
 3 files changed, 46 insertions(+), 1 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index ae98afc..2a02719 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+	/*
+	 * This function is called when compaction fails to result in a page
+	 * allocation success. This is somewhat unsatisfactory as the failure
+	 * to compact has nothing to do with time and everything to do with
+	 * the requested order, the number of free pages and watermarks. How
+	 * to wait on that is more unclear, but the answer would apply to
+	 * other areas where the VM waits based on time.
+	 */
+	zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	/* init once if necessary */
+	if (unlikely(!zone->compact_resume)) {
+		zone->compact_resume = jiffies;
+		return 0;
+	}
+
+	return time_before(jiffies, zone->compact_resume);
+}
+
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -25,6 +51,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	return COMPACT_INCOMPLETE;
 }
 
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+	return 1;
+}
+
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..bde879b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -321,6 +321,13 @@ struct zone {
 	unsigned long		*pageblock_flags;
 #endif /* CONFIG_SPARSEMEM */
 
+#ifdef CONFIG_COMPACTION
+	/*
+	 * If a compaction fails, do not try compaction again until
+	 * jiffies is after the value of compact_resume
+	 */
+	unsigned long		compact_resume;
+#endif
 
 	ZONE_PADDING(_pad1_)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a2e4a2..66823bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1770,7 +1770,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	cond_resched();
 
 	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
+	if (order && !compaction_deferred(preferred_zone)) {
 		*did_some_progress = try_to_compact_pages(zonelist,
 						order, gfp_mask, nodemask);
 		if (*did_some_progress != COMPACT_SKIPPED) {
@@ -1795,6 +1795,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 			 */
 			count_vm_event(COMPACTFAIL);
 
+			/* On failure, avoid compaction for a short time. */
+			defer_compaction(preferred_zone, jiffies + HZ/50);
+
 			cond_resched();
 		}
 	}
-- 
1.6.5
* [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (12 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
@ 2010-04-02 16:02 ` Mel Gorman
  2010-04-06  6:54   ` KAMEZAWA Hiroyuki
                     ` (2 more replies)
  2010-04-06 14:47 ` [PATCH 0/14] Memory Compaction v7 Tarkan Erimer
  14 siblings, 3 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-02 16:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, Mel Gorman, linux-kernel, linux-mm
PageAnon pages that are unmapped may or may not have an anon_vma so are
not currently migrated. However, a swap cache page can be migrated and
fits this description. This patch identifies swap cache pages and allows
them to be migrated but ensures that no attempt is made to remap the pages,
as remapping could potentially access an already freed anon_vma.
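The decision this patch makes for a PageAnon page can be summarised by the
small sketch below. It is a userspace model only: enum anon_action,
classify_anon_page() and needs_remap() are hypothetical names, and the two
integer flags stand in for page_mapped() and PageSwapCache().

```c
#include <assert.h>

/* Possible outcomes for an anonymous page in unmap_and_move(). */
enum anon_action {
	ANON_SKIP,             /* unmapped, not swap cache: bail out */
	ANON_MIGRATE_NO_REMAP, /* unmapped swap cache: migrate, never remap */
	ANON_MIGRATE_PINNED    /* mapped: take an anon_vma reference, remap */
};

enum anon_action classify_anon_page(int mapped, int in_swapcache)
{
	if (!mapped) {
		if (!in_swapcache)
			return ANON_SKIP;
		/* The anon_vma may already be freed, so it is not touched. */
		return ANON_MIGRATE_NO_REMAP;
	}

	/* Mapped: the anon_vma is pinned so it exists for the remap. */
	return ANON_MIGRATE_PINNED;
}

/* Whether migration PTEs are rewritten once the move has been tried. */
int needs_remap(enum anon_action action)
{
	/* Skipped pages jump past the remap step entirely. */
	assert(action != ANON_SKIP);
	return action == ANON_MIGRATE_PINNED;
}
```

In this model the remap_swapcache flag in the patch is simply
needs_remap() evaluated once and passed down to move_to_new_page().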
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   47 ++++++++++++++++++++++++++++++-----------------
 1 files changed, 30 insertions(+), 17 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 35aad2a..0356e64 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
  *   < 0 - error code
  *  == 0 - success
  */
-static int move_to_new_page(struct page *newpage, struct page *page)
+static int move_to_new_page(struct page *newpage, struct page *page,
+						int remap_swapcache)
 {
 	struct address_space *mapping;
 	int rc;
@@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
 	else
 		rc = fallback_migrate_page(mapping, newpage, page);
 
-	if (!rc)
-		remove_migration_ptes(page, newpage);
-	else
+	if (rc) {
 		newpage->mapping = NULL;
+	} else {
+		if (remap_swapcache) 
+			remove_migration_ptes(page, newpage);
+	}
 
 	unlock_page(newpage);
 
@@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
+	int remap_swapcache = 1;
 	int rcu_locked = 0;
 	int charge = 0;
 	struct mem_cgroup *mem = NULL;
@@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		rcu_read_lock();
 		rcu_locked = 1;
 
-		/*
-		 * If the page has no mappings any more, just bail. An
-		 * unmapped anon page is likely to be freed soon but worse,
-		 * it's possible its anon_vma disappeared between when
-		 * the page was isolated and when we reached here while
-		 * the RCU lock was not held
-		 */
-		if (!page_mapped(page))
-			goto rcu_unlock;
+		/* Determine how to safely use anon_vma */
+		if (!page_mapped(page)) {
+			if (!PageSwapCache(page))
+				goto rcu_unlock;
 
-		anon_vma = page_anon_vma(page);
-		atomic_inc(&anon_vma->external_refcount);
+			/*
+			 * We cannot be sure that the anon_vma of an unmapped
+			 * swapcache page is safe to use. In this case, the
+			 * swapcache page gets migrated but the pages are not
+			 * remapped
+			 */
+			remap_swapcache = 0;
+		} else { 
+			/*
+			 * Take a reference count on the anon_vma if the
+			 * page is mapped so that it is guaranteed to
+			 * exist when the page is remapped later
+			 */
+			anon_vma = page_anon_vma(page);
+			atomic_inc(&anon_vma->external_refcount);
+		}
 	}
 
 	/*
@@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page);
+		rc = move_to_new_page(newpage, page, remap_swapcache);
 
-	if (rc)
+	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
 rcu_unlock:
 
-- 
1.6.5
* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
@ 2010-04-06  6:54   ` KAMEZAWA Hiroyuki
  2010-04-06 15:37   ` Minchan Kim
  2010-04-07  0:06   ` Andrew Morton
  2 siblings, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-06  6:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:48 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated but ensures that no attempt is made to remap the pages,
> as remapping could potentially access an already freed anon_vma.
> 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Seems nice to me.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
* Re: [PATCH 0/14] Memory Compaction v7
  2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
                   ` (13 preceding siblings ...)
  2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
@ 2010-04-06 14:47 ` Tarkan Erimer
  2010-04-06 15:00   ` Mel Gorman
  14 siblings, 1 reply; 56+ messages in thread
From: Tarkan Erimer @ 2010-04-06 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm
Hi Mel,
On Friday 02 April 2010 07:02:34 pm Mel Gorman wrote:
> The only change is relatively minor and is around the migration of unmapped
> PageSwapCache pages. Specifically, it's not safe to access anon_vma for
> these pages when remapping after migration completes so the last patch
> makes sure we don't.
> 
> Are there any further obstacles to merging?
> 
Which kernel version or versions are these patches applicable to?
I tried 2.6.33.2 and 2.6.34-rc3 without success.
Tarkan
* Re: [PATCH 0/14] Memory Compaction v7
  2010-04-06 14:47 ` [PATCH 0/14] Memory Compaction v7 Tarkan Erimer
@ 2010-04-06 15:00   ` Mel Gorman
  2010-04-06 15:03     ` Tarkan Erimer
  0 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-06 15:00 UTC (permalink / raw)
  To: Tarkan Erimer
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:47:16PM +0300, Tarkan Erimer wrote:
> On Friday 02 April 2010 07:02:34 pm Mel Gorman wrote:
> > The only change is relatively minor and is around the migration of unmapped
> > PageSwapCache pages. Specifically, it's not safe to access anon_vma for
> > these pages when remapping after migration completes so the last patch
> > makes sure we don't.
> > 
> > Are there any further obstacles to merging?
> > 
> 
> Which kernel version or versions are these patches applicable to?
> I tried 2.6.33.2 and 2.6.34-rc3 without success.
> 
It's based on Andrew's tree mmotm-2010-03-24-14-48.
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 0/14] Memory Compaction v7
  2010-04-06 15:00   ` Mel Gorman
@ 2010-04-06 15:03     ` Tarkan Erimer
  0 siblings, 0 replies; 56+ messages in thread
From: Tarkan Erimer @ 2010-04-06 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, linux-kernel, linux-mm
On Tuesday 06 April 2010 06:00:36 pm Mel Gorman wrote:
> On Tue, Apr 06, 2010 at 05:47:16PM +0300, Tarkan Erimer wrote:
> > On Friday 02 April 2010 07:02:34 pm Mel Gorman wrote:
> > > The only change is relatively minor and is around the migration of
> > > unmapped PageSwapCache pages. Specifically, it's not safe to access
> > > anon_vma for these pages when remapping after migration completes so
> > > the last patch makes sure we don't.
> > > 
> > > Are there any further obstacles to merging?
> > 
> > Which kernel version or versions are these patches applicable to?
> > I tried 2.6.33.2 and 2.6.34-rc3 without success.
> 
> It's based on Andrew's tree mmotm-2010-03-24-14-48.
OK. Thanks for the reply. 
Tarkan
* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
  2010-04-06  6:54   ` KAMEZAWA Hiroyuki
@ 2010-04-06 15:37   ` Minchan Kim
  2010-04-07  0:06   ` Andrew Morton
  2 siblings, 0 replies; 56+ messages in thread
From: Minchan Kim @ 2010-04-06 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Sat, Apr 3, 2010 at 1:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies swap cache pages and allows
> them to be migrated but ensures that no attempt is made to remap the pages,
> as remapping could potentially access an already freed anon_vma.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Thanks for your effort, Mel.
-- 
Kind regards,
Minchan Kim
* Re: [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-07  0:31     ` KAMEZAWA Hiroyuki
@ 2010-04-06 21:56       ` Andrew Morton
  2010-04-07  1:19         ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-06 21:56 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Wed, 7 Apr 2010 09:31:48 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> A cgroup which controls placement of memory is cpuset.
err, yes, that.
> One idea is per cpuset. But per-node seems ok.
Which is superior?
Which maps best onto the way systems are used (and onto ways in which
we _intend_ that systems be used)?
Is the physical node really the best unit-of-administration?  And is
direct access to physical nodes the best means by which admins will
manage things?
* Re: [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating
  2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07  9:56     ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:35 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
> 
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
> 
The code didn't exactly bend over backwards making itself easy for
others to understand...
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index d25bd22..567d43f 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -29,6 +29,9 @@ struct anon_vma {
>  #ifdef CONFIG_KSM
>  	atomic_t ksm_refcount;
>  #endif
> +#ifdef CONFIG_MIGRATION
> +	atomic_t migrate_refcount;
> +#endif
Some documentation here describing the need for this thing and its
runtime semantics would be appropriate.
>  	/*
>  	 * NOTE: the LSB of the head.next is set by
>  	 * mm_take_all_locks() _after_ taking the above lock. So the
> @@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
>  	return 0;
>  }
>  #endif /* CONFIG_KSM */
> +#ifdef CONFIG_MIGRATION
> +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> +{
> +	atomic_set(&anon_vma->migrate_refcount, 0);
> +}
> +
> +static inline int migrate_refcount(struct anon_vma *anon_vma)
> +{
> +	return atomic_read(&anon_vma->migrate_refcount);
> +}
> +#else
> +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> +{
> +}
> +
> +static inline int migrate_refcount(struct anon_vma *anon_vma)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_MIGRATION */
>  
>  static inline struct anon_vma *page_anon_vma(struct page *page)
>  {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6903abf..06e6316 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -542,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	int rcu_locked = 0;
>  	int charge = 0;
>  	struct mem_cgroup *mem = NULL;
> +	struct anon_vma *anon_vma = NULL;
>  
>  	if (!newpage)
>  		return -ENOMEM;
> @@ -598,6 +599,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	if (PageAnon(page)) {
>  		rcu_read_lock();
>  		rcu_locked = 1;
> +		anon_vma = page_anon_vma(page);
> +		atomic_inc(&anon_vma->migrate_refcount);
So no helper function for this.  I guess a grep for `migrate_refcount'
will find it OK.
Can this count ever have a value > 1?   I guess so..
>  	}
>  
>  	/*
> @@ -637,6 +640,15 @@ skip_unmap:
>  	if (rc)
>  		remove_migration_ptes(page, page);
>  rcu_unlock:
> +
> +	/* Drop an anon_vma reference if we took one */
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> +		int empty = list_empty(&anon_vma->head);
> +		spin_unlock(&anon_vma->lock);
> +		if (empty)
> +			anon_vma_free(anon_vma);
> +	}
> +
So...  Why shouldn't this be testing ksm_refcount() too?
Can we consolidate ksm_refcount and migrate_refcount into, err, `refcount'?
>  	if (rcu_locked)
>  		rcu_read_unlock();
>  uncharge:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index fcd593c..578d0fe 100644
* Re: [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration
  2010-04-02 16:02 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07  0:10     ` Rik van Riel
  2010-04-07 10:01     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:37 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
> 
> ...
>
> @@ -26,11 +26,17 @@
>   */
>  struct anon_vma {
>  	spinlock_t lock;	/* Serialize access to vma list */
> -#ifdef CONFIG_KSM
> -	atomic_t ksm_refcount;
> -#endif
> -#ifdef CONFIG_MIGRATION
> -	atomic_t migrate_refcount;
> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> +
> +	/*
> +	 * The external_refcount is taken by either KSM or page migration
> +	 * to take a reference to an anon_vma when there is no
> +	 * guarantee that the vma of page tables will exist for
> +	 * the duration of the operation. A caller that takes
> +	 * the reference is responsible for clearing up the
> +	 * anon_vma if they are the last user on release
> +	 */
> +	atomic_t external_refcount;
>  #endif
hah.
> @@ -653,7 +653,7 @@ skip_unmap:
>  rcu_unlock:
>  
>  	/* Drop an anon_vma reference if we took one */
> -	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
>  		int empty = list_empty(&anon_vma->head);
>  		spin_unlock(&anon_vma->lock);
>  		if (empty)
So we now _do_ test ksm_refcount.  Perhaps that fixed a bug added in [1/14]?
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 578d0fe..af35b75 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
>  	list_del(&anon_vma_chain->same_anon_vma);
>  
>  	/* We must garbage collect the anon_vma if it's empty */
> -	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> -					!migrate_refcount(anon_vma);
> +	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
>  	spin_unlock(&anon_vma->lock);
>  
>  	if (empty)
> @@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
>  	struct anon_vma *anon_vma = data;
>  
>  	spin_lock_init(&anon_vma->lock);
> -	ksm_refcount_init(anon_vma);
> -	migrate_refcount_init(anon_vma);
> +	anonvma_external_refcount_init(anon_vma);
What a mouthful.  Can we do s/external_//g?
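For readers following the refcount discussion, the shared-counter lifetime
rule quoted above can be sketched in userspace C11 as below. This is a
single-threaded model under stated assumptions, not the kernel code: struct
anon_vma_model, ext_get(), ext_put() and unlink_vma() are hypothetical names,
nr_vmas stands in for the anon_vma's vma list, and the anon_vma->lock that
atomic_dec_and_lock() takes in the real patch is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of an anon_vma with one shared external refcount. */
struct anon_vma_model {
	atomic_int external_refcount; /* held by KSM or page migration */
	int nr_vmas;                  /* stands in for the vma list length */
};

/* Take an external reference, as migration does before unmapping. */
void ext_get(struct anon_vma_model *av)
{
	atomic_fetch_add(&av->external_refcount, 1);
}

/*
 * Drop an external reference. Returns true when the caller was the last
 * external user and no vmas remain, i.e. the caller must free the object.
 */
bool ext_put(struct anon_vma_model *av)
{
	int old = atomic_fetch_sub(&av->external_refcount, 1);

	assert(old > 0); /* a put without a matching get is a bug */
	return old == 1 && av->nr_vmas == 0;
}

/*
 * Remove one vma. Returns true when the object is now empty and has no
 * external users, mirroring the check in anon_vma_unlink().
 */
bool unlink_vma(struct anon_vma_model *av)
{
	assert(av->nr_vmas > 0);
	av->nr_vmas--;
	return av->nr_vmas == 0 &&
	       atomic_load(&av->external_refcount) == 0;
}
```

The point of the model is that whichever side finishes last, the unlink path
or the external user, is the one that frees the object.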
* Re: [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-04-02 16:02 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07 10:22     ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:38 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> being able to hot-remove memory. The main users of page migration such as
> sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> only beneficial on NUMA so it makes sense.
> 
> As memory compaction will operate within a zone and is useful on both NUMA
> and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> user selects CONFIG_COMPACTION as an option.
> 
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
>  	default "4"
>  
>  #
> +# support for memory compaction
> +config COMPACTION
> +	bool "Allow for memory compaction"
> +	def_bool y
> +	select MIGRATION
> +	depends on EXPERIMENTAL && HUGETLBFS && MMU
> +	help
> +	  Allows the compaction of memory for the allocation of huge pages.
Seems strange to depend on hugetlbfs.  Perhaps depending on
HUGETLB_PAGE would be more logical.
But hang on.  I wanna use compaction to make my order-4 wireless skb
allocations work better!  Why do you hate me?
* Re: [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-04-02 16:02 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07 10:35     ` Mel Gorman
  2010-04-13 12:42     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:39 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
I'd suggest /proc/sys/vm/unusable_index.  I don't know how pagetypeinfo
found its way into the top-level dir.
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
> 
> ...
> 
> +> cat /proc/unusable_index
> +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> +
> +The unusable free space index measures how much of the available free
> +memory cannot be used to satisfy an allocation of a given size and is a
> +value between 0 and 1. The higher the value, the more of free memory is
> +unusable and by implication, the worse the external fragmentation is. This
> +can be expressed as a percentage by multiplying by 100.
That's going to hurt my brain.  Why didn't it report usable free blocks?
Also, the index is scaled by the actual amount of free memory in the
zones, yes?  So to work out how many order-N pages are available you
first need to know how many free pages there are?
Seems complicated.
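For reference, the arithmetic in the quoted fill_contig_page_info() and unusable_free_index() can be reproduced in userspace. The following is only a sketch under stated assumptions: the per-order free counts are passed in as a plain array (for instance as parsed from /proc/buddyinfo), NR_ORDERS is assumed to be 11 to match the sample output, and the names fill_info(), unusable_index_milli() and usable_blocks() are made up for illustration. usable_blocks() goes in the reverse direction asked about above, and it does need the free page count as an input; its result is approximate because the index is truncated to three decimal places.

```c
#include <stdint.h>

#define NR_ORDERS 11	/* assumption: matches the 11 columns in the sample output */

struct contig_info {
	uint64_t free_pages;		/* free base pages in the zone */
	uint64_t free_blocks_suitable;	/* free blocks, counted in units of the target order */
};

/* nr_free[i] is the number of free blocks of order i */
static struct contig_info fill_info(const uint64_t nr_free[NR_ORDERS],
				    unsigned int suitable_order)
{
	struct contig_info info = { 0, 0 };

	for (unsigned int order = 0; order < NR_ORDERS; order++) {
		info.free_pages += nr_free[order] << order;
		if (order >= suitable_order)
			info.free_blocks_suitable +=
				nr_free[order] << (order - suitable_order);
	}
	return info;
}

/* Unusable index scaled by 1000, so 333 stands for 0.333 */
static unsigned int unusable_index_milli(unsigned int order,
					 const struct contig_info *info)
{
	if (info->free_pages == 0)
		return 1000;
	return (unsigned int)(((info->free_pages -
		(info->free_blocks_suitable << order)) * 1000ULL) /
		info->free_pages);
}

/* Approximate number of usable order-N blocks, given the index and free pages */
static uint64_t usable_blocks(unsigned int order, unsigned int index_milli,
			      uint64_t free_pages)
{
	uint64_t unusable_pages = (free_pages * index_milli) / 1000;

	return (free_pages - unusable_pages) >> order;
}
```

With 4 order-0, 2 order-1 and 1 order-2 free blocks (12 free pages), the order-1 index comes out as 0.333 and the order-2 index as 0.666.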
>  
> +
> +struct contig_page_info {
> +	unsigned long free_pages;
> +	unsigned long free_blocks_total;
> +	unsigned long free_blocks_suitable;
> +};
> +
> +/*
> + * Calculate the number of free pages in a zone, how many contiguous
> + * pages are free and how many are large enough to satisfy an allocation of
> + * the target size. Note that this function makes no attempt to estimate
> + * how many suitable free blocks there *might* be if MOVABLE pages were
> + * migrated. Calculating that is possible, but expensive and can be
> + * figured out from userspace
> + */
> +static void fill_contig_page_info(struct zone *zone,
> +				unsigned int suitable_order,
> +				struct contig_page_info *info)
> +{
> +	unsigned int order;
> +
> +	info->free_pages = 0;
> +	info->free_blocks_total = 0;
> +	info->free_blocks_suitable = 0;
> +
> +	for (order = 0; order < MAX_ORDER; order++) {
> +		unsigned long blocks;
> +
> +		/* Count number of free blocks */
> +		blocks = zone->free_area[order].nr_free;
> +		info->free_blocks_total += blocks;
> +
> +		/* Count free base pages */
> +		info->free_pages += blocks << order;
> +
> +		/* Count the suitable free blocks */
> +		if (order >= suitable_order)
> +			info->free_blocks_suitable += blocks <<
> +						(order - suitable_order);
> +	}
> +}
> +
> +/*
> + * Return an index indicating how much of the available free memory is
> + * unusable for an allocation of the requested size.
> + */
> +static int unusable_free_index(unsigned int order,
> +				struct contig_page_info *info)
> +{
> +	/* No free memory is interpreted as all free memory is unusable */
> +	if (info->free_pages == 0)
> +		return 1000;
> +
> +	/*
> +	 * Index should be a value between 0 and 1. Return a value to 3
> +	 * decimal places.
> +	 *
> +	 * 0 => no fragmentation
> +	 * 1 => high fragmentation
> +	 */
> +	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
> +
> +}
> +
> +static void unusable_show_print(struct seq_file *m,
> +					pg_data_t *pgdat, struct zone *zone)
> +{
> +	unsigned int order;
> +	int index;
> +	struct contig_page_info info;
> +
> +	seq_printf(m, "Node %d, zone %8s ",
> +				pgdat->node_id,
> +				zone->name);
> +	for (order = 0; order < MAX_ORDER; ++order) {
> +		fill_contig_page_info(zone, order, &info);
> +		index = unusable_free_index(order, &info);
> +		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> +	}
> +
> +	seq_putc(m, '\n');
> +}
> +
> +/*
> + * Display unusable free space index
> + * XXX: Could be a lot more efficient, but it's not a critical path
> + */
> +static int unusable_show(struct seq_file *m, void *arg)
> +{
> +	pg_data_t *pgdat = (pg_data_t *)arg;
> +
> +	/* check memoryless node */
> +	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
> +		return 0;
> +
> +	walk_zones_in_node(m, pgdat, unusable_show_print);
> +
> +	return 0;
> +}
> +
>  static void pagetypeinfo_showfree_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
>  {
> @@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
>  	.release	= seq_release,
>  };
>  
> +static const struct seq_operations unusable_op = {
> +	.start	= frag_start,
> +	.next	= frag_next,
> +	.stop	= frag_stop,
> +	.show	= unusable_show,
> +};
> +
> +static int unusable_open(struct inode *inode, struct file *file)
> +{
> +	return seq_open(file, &unusable_op);
> +}
> +
> +static const struct file_operations unusable_file_ops = {
> +	.open		= unusable_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= seq_release,
> +};
> +
>  #ifdef CONFIG_ZONE_DMA
>  #define TEXT_FOR_DMA(xx) xx "_dma",
>  #else
> @@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
>  #ifdef CONFIG_PROC_FS
>  	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
>  	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
> +	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
>  	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
>  	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
>  #endif
All this code will be bloat for most people, I suspect.  Can we find a
suitable #ifdef wrapper to keep my cellphone happy?
* Re: [PATCH 06/14] Export fragmentation index via /proc/extfrag_index
  2010-04-02 16:02 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07 10:46     ` Mel Gorman
  2010-04-13 12:43     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:40 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1).  For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/extfrag_index
(/proc/sys/vm?)
Like unusable_index, this seems awfully specialised.  Perhaps we could
hide it under CONFIG_MEL, or even put it in debugfs with the intention
of removing it in 6 or 12 months time.  Either way, it's hard to
justify permanently adding this stuff to every kernel in the world?
I have a suspicion that all the info in unusable_index and
extfrag_index could be computed from userspace using /proc/kpageflags
(and perhaps a bit of dmesg-diddling to find the zones).  If that can't
be done today, I bet it'd be pretty easy to arrange for it.
* Re: [PATCH 08/14] Memory compaction core
  2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07 15:21     ` Mel Gorman
  2010-04-08 16:59   ` Mel Gorman
  1 sibling, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:42 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
> 
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
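The two-scanner scheme described above can be illustrated with a toy model. This is a sketch only, not the patch's code: the zone is a flat array of slots, there are no pageblocks, no batching and no locking, and the names (enum slot, toy_compact()) are invented for the example. The migration scanner walks up from the bottom looking for movable pages, the free scanner walks down from the top looking for free slots, and the run ends when the two meet.

```c
#include <stddef.h>

enum slot { SLOT_FREE, SLOT_MOVABLE, SLOT_PINNED };

/* Toy compaction of one "zone"; returns the number of pages moved */
static size_t toy_compact(enum slot *zone, size_t nr)
{
	size_t migrate_pos = 0;	/* migration scanner, walks upwards */
	size_t free_pos = nr;	/* free scanner, one past its next candidate */
	size_t moved = 0;

	while (migrate_pos < free_pos) {
		if (zone[migrate_pos] != SLOT_MOVABLE) {
			migrate_pos++;
			continue;
		}

		/* Walk the free scanner down to the next free slot */
		while (free_pos > migrate_pos && zone[free_pos - 1] != SLOT_FREE)
			free_pos--;
		if (free_pos <= migrate_pos)
			break;	/* the scanners met */

		/* "Migrate" the page towards the end of the zone */
		zone[free_pos - 1] = SLOT_MOVABLE;
		zone[migrate_pos] = SLOT_FREE;
		free_pos--;
		migrate_pos++;
		moved++;
	}
	return moved;
}
```

Starting from movable, free, movable, pinned, free, free, the model ends with three contiguous free slots at the bottom, the pinned slot untouched and both movable pages at the top.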
> 
>
> ...
>
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,9 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE	0
> +#define COMPACT_PARTIAL		1
> +#define COMPACT_COMPLETE	2
Confused.  "incomplete" and "partial" are synonyms.  Please fully
document these here.
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f3b473a..f920815 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -335,6 +335,7 @@ void put_page(struct page *page);
>  void put_pages_list(struct list_head *pages);
>  
>  void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>  
>  /*
>   * Compound pages have a destructor function.  Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 986b12d..cf8bba7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -151,6 +151,7 @@ enum {
>  };
>  
>  #define SWAP_CLUSTER_MAX 32
> +#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
Why?  What are the implications of this decision?  How was it arrived
at?  What might one expect if one were to alter COMPACT_CLUSTER_MAX?
>  #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
>  #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 117f0dd..56e4b44 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>  		KSWAPD_SKIP_CONGESTION_WAIT,
>  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> +		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
>  #ifdef CONFIG_HUGETLB_PAGE
>  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
>  #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
>
> ...
>
> +static int release_freepages(struct list_head *freelist)
> +{
> +	struct page *page, *next;
> +	int count = 0;
> +
> +	list_for_each_entry_safe(page, next, freelist, lru) {
> +		list_del(&page->lru);
> +		__free_page(page);
> +		count++;
> +	}
> +
> +	return count;
> +}
I'm kinda surprised that we don't already have a function to do this.
An `unsigned' return value would make more sense.  Perhaps even
`unsigned long', unless there's something else here which would prevent
that absurd corner-case.
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
> +				unsigned long blockpfn,
> +				struct list_head *freelist)
> +{
> +	unsigned long zone_end_pfn, end_pfn;
> +	int total_isolated = 0;
> +
> +	/* Get the last PFN we should scan for free pages at */
> +	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	end_pfn = blockpfn + pageblock_nr_pages;
> +	if (end_pfn > zone_end_pfn)
> +		end_pfn = zone_end_pfn;
	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
I find that easier to follow, dunno how others feel.
> +	/* Isolate free pages. This assumes the block is valid */
What does "This assumes the block is valid" mean?  The code checks
pfn_valid_within()..
> +	for (; blockpfn < end_pfn; blockpfn++) {
> +		struct page *page;
> +		int isolated, i;
> +
> +		if (!pfn_valid_within(blockpfn))
> +			continue;
> +
> +		page = pfn_to_page(blockpfn);
hm.  pfn_to_page() isn't exactly cheap in some memory models.  I wonder
if there was some partial result we could have locally cached across
the entire loop.
> +		if (!PageBuddy(page))
> +			continue;
> +
> +		/* Found a free page, break it into order-0 pages */
> +		isolated = split_free_page(page);
> +		total_isolated += isolated;
> +		for (i = 0; i < isolated; i++) {
> +			list_add(&page->lru, freelist);
> +			page++;
> +		}
> +
> +		/* If a page was split, advance to the end of it */
> +		if (isolated)
> +			blockpfn += isolated - 1;
> +	}
Strange.  Having just busted a pageblock_order-sized higher-order page
into order-0 pages, the loop goes on and inspects the remaining
(2^pageblock_order - 1) pages, presumably to no effect.  Perhaps
	for (; blockpfn < end_pfn; blockpfn++) {
should be
	for (; blockpfn < end_pfn; blockpfn += pageblock_nr_pages) {
or somesuch.
btw, is the whole pageblock_order thing as sucky as it seems?  If I
want my VM to be oriented to making order-4-skb-allocations work, I
need to tune it that way, to coopt something the hugepage fetishists
added?  What if I need order-4 skb's _and_ hugepages?
> +	return total_isolated;
> +}
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
`bool'?
> +{
> +
> +	int migratetype = get_pageblock_migratetype(page);
> +
> +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> +		return 0;
> +
> +	/* If the page is a large free page, then allow migration */
> +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> +		return 1;
> +
> +	/* If the block is MIGRATE_MOVABLE, allow migration */
> +	if (migratetype == MIGRATE_MOVABLE)
> +		return 1;
> +
> +	/* Otherwise skip the block */
> +	return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
"and then isolate them"?
> + */
> +static void isolate_freepages(struct zone *zone,
> +				struct compact_control *cc)
> +{
> +	struct page *page;
> +	unsigned long high_pfn, low_pfn, pfn;
> +	unsigned long flags;
> +	int nr_freepages = cc->nr_freepages;
> +	struct list_head *freelist = &cc->freepages;
> +
> +	pfn = cc->free_pfn;
> +	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> +	high_pfn = low_pfn;
> +
> +	/*
> +	 * Isolate free pages until enough are available to migrate the
> +	 * pages on cc->migratepages. We stop searching if the migrate
> +	 * and free page scanners meet or enough free pages are isolated.
> +	 */
> +	spin_lock_irqsave(&zone->lock, flags);
> +	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> +					pfn -= pageblock_nr_pages) {
> +		int isolated;
> +
> +		if (!pfn_valid(pfn))
> +			continue;
> +
> +		/* 
> +		 * Check for overlapping nodes/zones. It's possible on some
> +		 * configurations to have a setup like
> +		 * node0 node1 node0
> +		 * i.e. it's possible that all pages within a zones range of
> +		 * pages do not belong to a single zone.
> +		 */
> +		page = pfn_to_page(pfn);
> +		if (page_zone(page) != zone)
> +			continue;
Well.  This code checks each pfn it touches, but
isolate_freepages_block() doesn't do this - isolate_freepages_block()
happily blunders across a contiguous span of pageframes, assuming that
all those pages are valid, and within the same zone.
> +		/* Check the block is suitable for migration */
> +		if (!suitable_migration_target(page))
> +			continue;
> +
> +		/* Found a block suitable for isolating free pages from */
> +		isolated = isolate_freepages_block(zone, pfn, freelist);
> +		nr_freepages += isolated;
> +
> +		/*
> +		 * Record the highest PFN we isolated pages from. When next
> +		 * looking for free pages, the search will restart here as
> +		 * page migration may have returned some pages to the allocator
> +		 */
> +		if (isolated)
> +			high_pfn = max(high_pfn, pfn);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
For how long can this loop hold off interrupts?
> +	cc->free_pfn = high_pfn;
> +	cc->nr_freepages = nr_freepages;
> +}
> +
> +/* Update the number of anon and file isolated pages in the zone */
> +static void acct_isolated(struct zone *zone, struct compact_control *cc)
> +{
> +	struct page *page;
> +	unsigned int count[NR_LRU_LISTS] = { 0, };
> +
> +	list_for_each_entry(page, &cc->migratepages, lru) {
> +		int lru = page_lru_base_type(page);
> +		count[lru]++;
> +	}
> +
> +	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> +	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> +}
> +
> +/* Similar to reclaim, but different enough that they don't share logic */
yeah, but what does it do?
> +static int too_many_isolated(struct zone *zone)
> +{
> +
> +	unsigned long inactive, isolated;
> +
> +	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> +					zone_page_state(zone, NR_INACTIVE_ANON);
> +	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> +					zone_page_state(zone, NR_ISOLATED_ANON);
> +
> +	return isolated > inactive;
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> +					struct compact_control *cc)
> +{
> +	unsigned long low_pfn, end_pfn;
> +	struct list_head *migratelist;
> +
> +	low_pfn = cc->migrate_pfn;
> +	migratelist = &cc->migratepages;
> +
> +	/* Do not scan outside zone boundaries */
> +	if (low_pfn < zone->zone_start_pfn)
> +		low_pfn = zone->zone_start_pfn;
Can this happen?
Use max()?
> +	/* Setup to scan one block but not past where we are migrating to */
what?
> +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> +
> +	/* Do not cross the free scanner or scan within a memory hole */
> +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> +		cc->migrate_pfn = end_pfn;
> +		return 0;
> +	}
> +
> +	/* Do not isolate the world */
Needs (much) more explanation, please.
> +	while (unlikely(too_many_isolated(zone))) {
> +		congestion_wait(BLK_RW_ASYNC, HZ/10);
... why did it do this?  Quite a head-scratcher.
> +		if (fatal_signal_pending(current))
> +			return 0;
> +	}
> +
> +	/* Time to isolate some pages for migration */
> +	spin_lock_irq(&zone->lru_lock);
> +	for (; low_pfn < end_pfn; low_pfn++) {
> +		struct page *page;
> +		if (!pfn_valid_within(low_pfn))
> +			continue;
> +
> +		/* Get the page and skip if free */
> +		page = pfn_to_page(low_pfn);
> +		if (PageBuddy(page)) {
> +			low_pfn += (1 << page_order(page)) - 1;
> +			continue;
> +		}
> +
> +		/* Try isolate the page */
> +		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> +			del_page_from_lru_list(zone, page, page_lru(page));
> +			list_add(&page->lru, migratelist);
> +			mem_cgroup_del_lru(page);
> +			cc->nr_migratepages++;
> +		}
> +
> +		/* Avoid isolating too much */
> +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> +			break;
This test could/should be moved inside the preceding `if' block.  Or,
better, simply do
		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
			continue;	/* comment goes here */
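The two loop shapes can be compared in a small userspace model. This is a sketch under assumptions: toy_isolate() is a hypothetical stand-in for __isolate_lru_page() that returns 0 on success, TOY_CLUSTER_MAX stands in for COMPACT_CLUSTER_MAX, and the counter is assumed to start below the cap, as it does when a fresh batch begins.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CLUSTER_MAX 3	/* stand-in for COMPACT_CLUSTER_MAX */

/* Hypothetical stand-in for __isolate_lru_page(): 0 means success */
static int toy_isolate(bool isolatable)
{
	return isolatable ? 0 : -1;
}

/* Shape of the quoted patch: the cap is tested on every iteration */
static size_t scan_original(const bool *pages, size_t nr,
			    unsigned int *nr_isolated)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (toy_isolate(pages[i]) == 0)
			(*nr_isolated)++;

		if (*nr_isolated == TOY_CLUSTER_MAX)
			break;
	}
	return i;
}

/* Suggested shape: bail out early, test the cap only after a success */
static size_t scan_early_continue(const bool *pages, size_t nr,
				  unsigned int *nr_isolated)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (toy_isolate(pages[i]) != 0)
			continue;	/* nothing isolated, count is unchanged */

		(*nr_isolated)++;
		if (*nr_isolated == TOY_CLUSTER_MAX)
			break;
	}
	return i;
}
```

Both functions stop at the same index and isolate the same number of pages as long as the count starts below the cap; the second form simply avoids re-testing a value that cannot have changed.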
> +	}
> +
> +	acct_isolated(zone, cc);
> +
> +	spin_unlock_irq(&zone->lru_lock);
> +	cc->migrate_pfn = low_pfn;
> +
> +	return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> +					unsigned long data,
> +					int **result)
> +{
> +	struct compact_control *cc = (struct compact_control *)data;
> +	struct page *freepage;
> +
> +	/* Isolate free pages if necessary */
> +	if (list_empty(&cc->freepages)) {
> +		isolate_freepages(cc->zone, cc);
> +
> +		if (list_empty(&cc->freepages))
> +			return NULL;
> +	}
> +
> +	freepage = list_entry(cc->freepages.next, struct page, lru);
> +	list_del(&freepage->lru);
> +	cc->nr_freepages--;
> +
> +	return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> +	int nr_migratepages = 0;
> +	int nr_freepages = 0;
> +	struct page *page;
newline here please.
> +	list_for_each_entry(page, &cc->migratepages, lru)
> +		nr_migratepages++;
> +	list_for_each_entry(page, &cc->freepages, lru)
> +		nr_freepages++;
> +
> +	cc->nr_migratepages = nr_migratepages;
> +	cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> +						struct compact_control *cc)
> +{
> +	if (fatal_signal_pending(current))
> +		return COMPACT_PARTIAL;
ah-hah!  So maybe we meant COMPACT_INTERRUPTED.
> +	/* Compaction run completes if the migrate and free scanner meet */
> +	if (cc->free_pfn <= cc->migrate_pfn)
> +		return COMPACT_COMPLETE;
> +
> +	return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> +	int ret = COMPACT_INCOMPLETE;
> +
> +	/* Setup to move all movable pages to the end of the zone */
> +	cc->migrate_pfn = zone->zone_start_pfn;
> +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> +	cc->free_pfn &= ~(pageblock_nr_pages-1);
If zone->spanned_pages is much much larger than zone->present_pages,
this code will suck rather a bit.  Is there a reason why that can never
happen?
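To make the concern concrete, the setup arithmetic can be tried in isolation. This is a rough sketch, assuming a pageblock of 512 pages (the real value is configuration dependent) and using made-up helper names. It shows that the distance the two scanners have to cover is derived from spanned_pages alone, so holes inside the span still cost scanning time.

```c
#define TOY_PAGEBLOCK_NR_PAGES 512UL	/* assumption; must be a power of two */

/* Where the free scanner starts: end of the span, rounded down to a pageblock */
static unsigned long free_scanner_start(unsigned long zone_start_pfn,
					unsigned long spanned_pages)
{
	unsigned long pfn = zone_start_pfn + spanned_pages;

	return pfn & ~(TOY_PAGEBLOCK_NR_PAGES - 1);
}

/* Number of whole pageblocks between the two scanners' starting points */
static unsigned long pageblocks_to_scan(unsigned long zone_start_pfn,
					unsigned long spanned_pages)
{
	return (free_scanner_start(zone_start_pfn, spanned_pages) -
		zone_start_pfn) / TOY_PAGEBLOCK_NR_PAGES;
}
```

Nothing in this calculation depends on how many pages in the span are actually present.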
> +	migrate_prep();
> +
> +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
<stares at that for a while>
Perhaps
	while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
would be clearer.  That would make the definition-site initialisation
of `ret' unneeded too.
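One thing worth checking before swapping the loop form is that the two are not strictly equivalent. The sketch below is a userspace model with invented names (struct toy_cc, toy_finished(), toy_pass()); it leaves out the fatal-signal case and treats each pass as advancing the migration scanner by one. The `for' form runs the body once before the first completion check, while the `while' form checks first.

```c
/* Values as in the quoted include/linux/compaction.h */
enum {
	COMPACT_INCOMPLETE = 0,	/* scanners have not met, keep going */
	COMPACT_PARTIAL = 1,	/* stopped early (fatal signal in the patch) */
	COMPACT_COMPLETE = 2	/* migrate and free scanners met */
};

struct toy_cc {
	unsigned long migrate_pfn;
	unsigned long free_pfn;
	unsigned int passes;	/* how many times the loop body ran */
};

static int toy_finished(const struct toy_cc *cc)
{
	if (cc->free_pfn <= cc->migrate_pfn)
		return COMPACT_COMPLETE;
	return COMPACT_INCOMPLETE;
}

/* Stand-in for one isolate-and-migrate pass */
static void toy_pass(struct toy_cc *cc)
{
	cc->migrate_pfn++;
	cc->passes++;
}

/* Loop shape from the quoted patch */
static int run_for_form(struct toy_cc *cc)
{
	int ret = COMPACT_INCOMPLETE;

	for (; ret == COMPACT_INCOMPLETE; ret = toy_finished(cc))
		toy_pass(cc);
	return ret;
}

/* Loop shape suggested in the review */
static int run_while_form(struct toy_cc *cc)
{
	int ret;

	while ((ret = toy_finished(cc)) == COMPACT_INCOMPLETE)
		toy_pass(cc);
	return ret;
}
```

When the scanners start apart, both forms run the same number of passes. When they already coincide on entry, the `for' form still performs one pass and the `while' form performs none, which may or may not matter for the real code.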
> +		unsigned long nr_migrate, nr_remaining;
newline please.
> +		if (!isolate_migratepages(zone, cc))
> +			continue;
Boy, this looks like an infinite loop waiting to happen.  Are you sure?
Suppose we hit a pageblock-sized string of !pfn_valid() pfn's, for
example.  Worried.
> +		nr_migrate = cc->nr_migratepages;
> +		migrate_pages(&cc->migratepages, compaction_alloc,
> +						(unsigned long)cc, 0);
> +		update_nr_listpages(cc);
> +		nr_remaining = cc->nr_migratepages;
> +
> +		count_vm_event(COMPACTBLOCKS);
> +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> +		if (nr_remaining)
> +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> +
> +		/* Release LRU pages not migrated */
> +		if (!list_empty(&cc->migratepages)) {
> +			putback_lru_pages(&cc->migratepages);
> +			cc->nr_migratepages = 0;
> +		}
> +
> +	}
> +
> +	/* Release free pages and check accounting */
> +	cc->nr_freepages -= release_freepages(&cc->freepages);
> +	VM_BUG_ON(cc->nr_freepages != 0);
> +
> +	return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 624cba4..3cf947d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
>  }
>  
>  /*
> + * Similar to split_page except the page is already free. As this is only
> + * being used for migration, the migratetype of the block also changes.
> + */
> +int split_free_page(struct page *page)
> +{
> +	unsigned int order;
> +	unsigned long watermark;
> +	struct zone *zone;
> +
> +	BUG_ON(!PageBuddy(page));
> +
> +	zone = page_zone(page);
> +	order = page_order(page);
> +
> +	/* Obey watermarks or the system could deadlock */
> +	watermark = low_wmark_pages(zone) + (1 << order);
> +	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +		return 0;
OK, there is no way in which the code-reader can work out why this is
here.  What deadlock?
> +	/* Remove page from free list */
> +	list_del(&page->lru);
> +	zone->free_area[order].nr_free--;
> +	rmv_page_order(page);
> +	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> +	/* Split into individual pages */
> +	set_page_refcounted(page);
> +	split_page(page, order);
> +
> +	if (order >= pageblock_order - 1) {
> +		struct page *endpage = page + (1 << order) - 1;
> +		for (; page < endpage; page += pageblock_nr_pages)
> +			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> +	}
> +
> +	return 1 << order;
> +}
> +
> +/*
>   * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
>   * we cheat by calling it from here, in the order > 0 path.  Saves a branch
>   * or two.
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 351e491..3a69b48 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -892,6 +892,11 @@ static const char * const vmstat_text[] = {
>  	"allocstall",
>  
>  	"pgrotated",
> +
> +	"compact_blocks_moved",
> +	"compact_pages_moved",
> +	"compact_pagemigrate_failed",
Should we present these on CONFIG_COMPACTION=n kernels?
Does all this code really need to iterate across individual pfn's like
this?  We can use the buddy structures to go straight to all of a
zone's order-N free pages, can't we?  Wouldn't that save a whole heap
of fruitless linear searching?
* Re: [PATCH 09/14] Add /proc trigger for memory compaction
  2010-04-02 16:02 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07 15:39     ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:43 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file,
Might be better if "when the number 1 is written...".  That permits you
to add 2, 3 and 4 later on.
> all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
> 
Ick.  The days of multi-user computers seem to have passed.
> ...
>
> +/* Compact all zones within a node */
> +static int compact_node(int nid)
> +{
> +	int zoneid;
> +	pg_data_t *pgdat;
> +	struct zone *zone;
> +
> +	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> +		return -EINVAL;
> +	pgdat = NODE_DATA(nid);
> +
> +	/* Flush pending updates to the LRU lists */
> +	lru_add_drain_all();
> +
> +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> +		struct compact_control cc;
> +
> +		zone = &pgdat->node_zones[zoneid];
> +		if (!populated_zone(zone))
> +			continue;
> +
> +		cc.nr_freepages = 0;
> +		cc.nr_migratepages = 0;
> +		cc.zone = zone;
It would be better to do
	struct compact_control cc = {
		.nr_freepages = 0,
		etc
because if you later add more fields to compact_control, everything
else works by magick.  That's served us pretty well with
writeback_control, scan_control, etc.
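The property being relied on is that, in C99, members left out of a designated initializer are zero-initialised. A minimal userspace sketch follows; struct toy_compact_control is a hypothetical stand-in whose field list is only guessed from the quoted code, and toy_list_head mimics the shape of the kernel's list_head.

```c
#include <stddef.h>

struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* Hypothetical stand-in for struct compact_control */
struct toy_compact_control {
	struct toy_list_head freepages;
	struct toy_list_head migratepages;
	unsigned long nr_freepages;
	unsigned long nr_migratepages;
	unsigned long free_pfn;
	unsigned long migrate_pfn;
	const void *zone;
};

static struct toy_compact_control make_cc(const void *zone)
{
	struct toy_compact_control cc = {
		.zone = zone,	/* every member not named here becomes zero */
	};

	return cc;
}
```

The list heads would still need their INIT_LIST_HEAD() calls afterwards, since an empty kernel list points at itself rather than at NULL.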
	
> +		INIT_LIST_HEAD(&cc.freepages);
> +		INIT_LIST_HEAD(&cc.migratepages);
> +
> +		compact_zone(zone, &cc);
> +
> +		VM_BUG_ON(!list_empty(&cc.freepages));
> +		VM_BUG_ON(!list_empty(&cc.migratepages));
> +	}
> +
> +	return 0;
> +}
> +
> +/* Compact all nodes in the system */
> +static int compact_nodes(void)
> +{
> +	int nid;
> +
> +	for_each_online_node(nid)
> +		compact_node(nid);
What if a node goes offline?
> +	return COMPACT_COMPLETE;
> +}
> +
>
> ...
* Re: [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-02 16:02 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
@ 2010-04-07  0:05   ` Andrew Morton
  2010-04-07  0:31     ` KAMEZAWA Hiroyuki
  2010-04-07 15:42     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:44 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention is that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
Would it make more sense if this was a per-memcg thing rather than a
per-node thing?
* Re: [PATCH 11/14] Direct compact when a high-order allocation fails
  2010-04-02 16:02 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-04-07  0:06   ` Andrew Morton
  2010-04-07 16:06     ` Mel Gorman
  2010-04-07 18:29     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:45 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation.  With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
Does this work?
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
So someone else can get in and steal it.  How is that resolved?
Please expound upon the relationship between the icky pageblock_order
and the caller's desired allocation order here.  The compaction design
seems fairly fixated upon pageblock_order - what happens if the caller
wanted something larger than pageblock_order?  The
less-than-pageblock_order case seems pretty obvious, although perhaps
wasteful?
>
> ...
>
> +static unsigned long compact_zone_order(struct zone *zone,
> +						int order, gfp_t gfp_mask)
> +{
> +	struct compact_control cc = {
> +		.nr_freepages = 0,
> +		.nr_migratepages = 0,
> +		.order = order,
> +		.migratetype = allocflags_to_migratetype(gfp_mask),
> +		.zone = zone,
> +	};
yeah, like that.
> +	INIT_LIST_HEAD(&cc.freepages);
> +	INIT_LIST_HEAD(&cc.migratepages);
> +
> +	return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> +	int may_enter_fs = gfp_mask & __GFP_FS;
> +	int may_perform_io = gfp_mask & __GFP_IO;
> +	unsigned long watermark;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	int rc = COMPACT_SKIPPED;
> +
> +	/*
> +	 * Check whether it is worth even starting compaction. The order check is
> +	 * made because an assumption is made that the page allocator can satisfy
> +	 * the "cheaper" orders without taking special steps
> +	 */
> +	if (order <= PAGE_ALLOC_COSTLY_ORDER 
Was that a correct decision?  If we perform compaction when smaller
allocation attempts fail, will the kernel get better, or worse?
And how do we save my order-4-allocating wireless driver?  That would
require that kswapd perform the compaction for me, perhaps?
> || !may_enter_fs || !may_perform_io)
Would be nice to add some comments explaining this a bit more. 
Compaction doesn't actually perform IO, nor enter filesystems, does it?
> +		return rc;
> +
> +	count_vm_event(COMPACTSTALL);
> +
> +	/* Compact each zone in the list */
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> +								nodemask) {
> +		int fragindex;
> +		int status;
> +
> +		/*
> +		 * Watermarks for order-0 must be met for compaction. Note
> +		 * the 2UL. This is because during migration, copies of
> +		 * pages need to be allocated and for a short time, the
> +		 * footprint is higher
> +		 */
> +		watermark = low_wmark_pages(zone) + (2UL << order);
> +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> +			continue;
ooh, so that starts to explain split_free_page().  But
split_free_page() didn't do the 2UL thing.
Surely these things are racy?  So we'll deadlock less often :(
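For reference, the arithmetic in the quoted watermark check can be sketched in plain C. This assumes only what the quoted line shows; compaction_watermark() is a hypothetical helper name, and the low watermark is passed in rather than read from a zone. Since 2UL << order equals 2 * (1UL << order), the extra headroom is twice the size of the requested block:

```c
/*
 * Pages that must be free at order-0 before compaction is attempted:
 * the zone's low watermark plus twice the requested block size, to
 * cover the temporary copies made during migration.
 */
unsigned long compaction_watermark(unsigned long low_wmark, unsigned int order)
{
	return low_wmark + (2UL << order);
}
```

For an order-9 request and a low watermark of 1000 pages this gives 1000 + 1024 = 2024 pages.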
> +		/*
> +		 * fragmentation index determines if allocation failures are
> +		 * due to low memory or external fragmentation
> +		 *
> +		 * index of -1 implies allocations might succeed depending
> +		 * 	on watermarks
> +		 * index towards 0 implies failure is due to lack of memory
> +		 * index towards 1000 implies failure is due to fragmentation
> +		 *
> +		 * Only compact if a failure would be due to fragmentation.
> +		 */
> +		fragindex = fragmentation_index(zone, order);
> +		if (fragindex >= 0 && fragindex <= 500)
> +			continue;
> +
> +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> +			rc = COMPACT_PARTIAL;
> +			break;
> +		}
Why are we doing all this handwavy stuff?  Why not just try a
compaction run and see if it worked?  That would be more
robust/reliable, surely?
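A user-space sketch of the heuristic under discussion, assuming the formula quoted from mm/vmstat.c later in this message and the thresholds in the comment above. struct frag_info, frag_index() and zone_skipped() are hypothetical stand-ins, and the two early returns are assumptions, since that part of the function is elided in the quoted diff:

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct contig_page_info. */
struct frag_info {
	uint64_t free_pages;		/* total free pages in the zone */
	uint64_t free_blocks_total;	/* free blocks of any order */
	uint64_t free_blocks_suitable;	/* free blocks >= the requested order */
};

/* Returns -1 if a suitable block exists, otherwise an index scaled by 1000. */
int frag_index(unsigned int order, const struct frag_info *info)
{
	uint64_t requested = 1ULL << order;

	if (info->free_blocks_total == 0)
		return 0;	/* assumption: nothing free means low memory */
	if (info->free_blocks_suitable)
		return -1;	/* assumption: matches the "index of -1" comment */

	/* Same shape as the quoted expression, with plain 64-bit division. */
	return 1000 - (int)((1000 + info->free_pages * 1000 / requested) /
			    info->free_blocks_total);
}

/* Mirrors the quoted test: zones with an index in [0, 500] are skipped. */
int zone_skipped(int fragindex)
{
	return fragindex >= 0 && fragindex <= 500;
}
```

With 1024 free order-0 pages and an order-9 request this gives 998 (fragmentation), while two free order-8 blocks give 0 (too little memory), so only the first zone would go on to be compacted.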
> +		status = compact_zone_order(zone, order, gfp_mask);
> +		rc = max(status, rc);
> +
> +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> +			break;
> +	}
> +
> +	return rc;
> +}
> +
> +
>  /* Compact all zones within a node */
>  static int compact_node(int nid)
>  {
>
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
>   * The value can be used to determine if page reclaim or compaction
>   * should be used
>   */
> -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
>  {
>  	unsigned long requested = 1UL << order;
>  
> @@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
>  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
>  }
>  
> +/* Same as __fragmentation index but allocs contig_page_info on stack */
> +int fragmentation_index(struct zone *zone, unsigned int order)
> +{
> +	struct contig_page_info info;
> +
> +	fill_contig_page_info(zone, order, &info);
> +	return __fragmentation_index(order, &info);
> +}
>  
>  static void extfrag_show_print(struct seq_file *m,
>  					pg_data_t *pgdat, struct zone *zone)
> @@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
>  				zone->name);
>  	for (order = 0; order < MAX_ORDER; ++order) {
>  		fill_contig_page_info(zone, order, &info);
> -		index = fragmentation_index(order, &info);
> +		index = __fragmentation_index(order, &info);
>  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
>  	}
>  
> @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
>  	"compact_blocks_moved",
>  	"compact_pages_moved",
>  	"compact_pagemigrate_failed",
> +	"compact_stall",
> +	"compact_fail",
> +	"compact_success",
CONFIG_COMPACTION=n?
>
> ...
>
* Re: [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed
  2010-04-02 16:02 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
@ 2010-04-07  0:06   ` Andrew Morton
  2010-04-07 16:11     ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:46 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> The kernel applies some heuristics when deciding if memory should be
> compacted or reclaimed to satisfy a high-order allocation. One of these
> is based on the fragmentation. If the index is below 500, memory will
> not be compacted. This choice is arbitrary and not based on data. To
> help optimise the system and set a sensible default for this value, this
> patch adds a sysctl extfrag_threshold. The kernel will only compact
> memory if the fragmentation index is above the extfrag_threshold.
Was this the most robust, reliable, no-2am-phone-calls thing we could
have done?
What about, say, just doing a bit of both until something worked?  For
extra smarts we could remember what worked best last time, and make
ourselves more likely to try that next time.
Or whatever, but extfrag_threshold must die!  And replacing it with a
hardwired constant doesn't count ;)
* Re: [PATCH 13/14] Do not compact within a preferred zone after a compaction failure
  2010-04-02 16:02 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
@ 2010-04-07  0:06   ` Andrew Morton
  2010-04-07  0:55     ` Andrea Arcangeli
  2010-04-07 16:32     ` Mel Gorman
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:47 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> The fragmentation index may indicate that a failure is due to external
> fragmentation but after a compaction run completes, it is still possible
> for an allocation to fail. There are two obvious reasons as to why
> 
>   o Page migration cannot move all pages so fragmentation remains
>   o A suitable page may exist but watermarks are not met
> 
> In the event of compaction followed by an allocation failure, this patch
> defers further compaction in the zone for a period of time. The zone that
> is deferred is the first zone in the zonelist - i.e. the preferred zone.
> To defer compaction in the other zones, the information would need to be
> stored in the zonelist or implemented similar to the zonelist_cache.
> This would impact the fast-paths and is not justified at this time.
> 
Your patch, it sucks!
> ---
>  include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h     |    7 +++++++
>  mm/page_alloc.c            |    5 ++++-
>  3 files changed, 46 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index ae98afc..2a02719 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask);
> +
> +/* defer_compaction - Do not compact within a zone until a given time */
> +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> +{
> +	/*
> +	 * This function is called when compaction fails to result in a page
> +	 * allocation success. This is somewhat unsatisfactory as the failure
> +	 * to compact has nothing to do with time and everything to do with
> +	 * the requested order, the number of free pages and watermarks. How
> +	 * to wait on that is more unclear, but the answer would apply to
> +	 * other areas where the VM waits based on time.
> +	 */
c'mon, let's not make this rod for our backs.
The "A suitable page may exist but watermarks are not met" case can be
addressed by testing the watermarks up-front, surely?
I bet the "Page migration cannot move all pages so fragmentation
remains" case can be addressed by setting some metric in the zone, and
suitably modifying that as a result of ongoing activity.  To tell the
zone "hey, compaction might be worth trying now".  That sucks too, but not
so much.
Or something.  Putting a wallclock-based throttle on it like this
really does reduce the usefulness of the whole feature.
Internet: "My application works OK on a hard disk but fails when I use an SSD!". 
akpm: "Tell Mel!"
* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
  2010-04-06  6:54   ` KAMEZAWA Hiroyuki
  2010-04-06 15:37   ` Minchan Kim
@ 2010-04-07  0:06   ` Andrew Morton
  2010-04-07 16:49     ` Mel Gorman
  2 siblings, 1 reply; 56+ messages in thread
From: Andrew Morton @ 2010-04-07  0:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri,  2 Apr 2010 17:02:48 +0100
Mel Gorman <mel@csn.ul.ie> wrote:
> PageAnon pages that are unmapped may or may not have an anon_vma so are
> not currently migrated. However, a swap cache page can be migrated and
> fits this description. This patch identifies page swap caches and allows
> them to be migrated but ensures that no attempt is made to remap the pages
> that would potentially try to access an already freed anon_vma.
> 
> ...
>
> @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
>   *   < 0 - error code
>   *  == 0 - success
>   */
> -static int move_to_new_page(struct page *newpage, struct page *page)
> +static int move_to_new_page(struct page *newpage, struct page *page,
> +						int remap_swapcache)
You're not a fan of `bool'.
>  {
>  	struct address_space *mapping;
>  	int rc;
> @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
>  	else
>  		rc = fallback_migrate_page(mapping, newpage, page);
>  
> -	if (!rc)
> -		remove_migration_ptes(page, newpage);
> -	else
> +	if (rc) {
>  		newpage->mapping = NULL;
> +	} else {
> +		if (remap_swapcache) 
> +			remove_migration_ptes(page, newpage);
> +	}
>  
>  	unlock_page(newpage);
>  
> @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  	int rc = 0;
>  	int *result = NULL;
>  	struct page *newpage = get_new_page(page, private, &result);
> +	int remap_swapcache = 1;
>  	int rcu_locked = 0;
>  	int charge = 0;
>  	struct mem_cgroup *mem = NULL;
> @@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  		rcu_read_lock();
>  		rcu_locked = 1;
>  
> -		/*
> -		 * If the page has no mappings any more, just bail. An
> -		 * unmapped anon page is likely to be freed soon but worse,
> -		 * it's possible its anon_vma disappeared between when
> -		 * the page was isolated and when we reached here while
> -		 * the RCU lock was not held
> -		 */
> -		if (!page_mapped(page))
> -			goto rcu_unlock;
> +		/* Determine how to safely use anon_vma */
> +		if (!page_mapped(page)) {
> +			if (!PageSwapCache(page))
> +				goto rcu_unlock;
>  
> -		anon_vma = page_anon_vma(page);
> -		atomic_inc(&anon_vma->external_refcount);
> +			/*
> +			 * We cannot be sure that the anon_vma of an unmapped
> +			 * swapcache page is safe to use.
Why not?  A full explanation here would be nice.
> 			   In this case, the
> +			 * swapcache page gets migrated but the pages are not
> +			 * remapped
> +			 */
> +			remap_swapcache = 0;
> +		} else { 
> +			/*
> +			 * Take a reference count on the anon_vma if the
> +			 * page is mapped so that it is guaranteed to
> +			 * exist when the page is remapped later
> +			 */
> +			anon_vma = page_anon_vma(page);
> +			atomic_inc(&anon_vma->external_refcount);
> +		}
>  	}
>  
>  	/*
> @@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
>  
>  skip_unmap:
>  	if (!page_mapped(page))
> -		rc = move_to_new_page(newpage, page);
> +		rc = move_to_new_page(newpage, page, remap_swapcache);
>  
> -	if (rc)
> +	if (rc && remap_swapcache)
>  		remove_migration_ptes(page, page);
>  rcu_unlock:
* Re: [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07  0:10     ` Rik van Riel
  2010-04-07 10:01     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Rik van Riel @ 2010-04-07  0:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, linux-kernel, linux-mm
On 04/06/2010 08:05 PM, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:37 +0100
> Mel Gorman<mel@csn.ul.ie>  wrote:
>> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
>> +
>> +	/*
>> +	 * The external_refcount is taken by either KSM or page migration
>> +	 * to take a reference to an anon_vma when there is no
>> +	 * guarantee that the vma of page tables will exist for
>> +	 * the duration of the operation. A caller that takes
>> +	 * the reference is responsible for clearing up the
>> +	 * anon_vma if they are the last user on release
>> +	 */
>> +	atomic_t external_refcount;
>>   #endif
>
> hah.
>> +	anonvma_external_refcount_init(anon_vma);
>
> What a mouthful.  Can we do s/external_//g?
For the function, sure.
However, I believe it would be good to keep the variable
inside the anon_vma as "external_refcount", because the
VMAs attached to the anon_vma take a reference by being
on the list (and leave the refcount alone).
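A single-threaded sketch of the rule described here, assuming a toy structure in place of the real anon_vma. Attached VMAs are counted by list membership (nr_vmas stands in for the list), external users such as KSM or page migration take external_refcount, and the object may be freed only when both are gone. All toy_* names are hypothetical, and the anon_vma->lock that the kernel takes via atomic_dec_and_lock() is deliberately left out:

```c
#include <stdatomic.h>
#include <stdbool.h>

struct toy_anon_vma {
	int nr_vmas;			/* stands in for the anon_vma->head list */
	atomic_int external_refcount;	/* taken by KSM or migration, not by VMAs */
};

/* An external user pins the structure. */
void toy_get(struct toy_anon_vma *av)
{
	atomic_fetch_add(&av->external_refcount, 1);
}

/* Returns true if the caller dropped the last reference and must free. */
bool toy_put(struct toy_anon_vma *av)
{
	/* atomic_fetch_sub() returns the old value; 1 means we were last. */
	if (atomic_fetch_sub(&av->external_refcount, 1) != 1)
		return false;
	return av->nr_vmas == 0;
}

/* A VMA leaves the list; returns true if the structure can be freed now. */
bool toy_unlink_vma(struct toy_anon_vma *av)
{
	av->nr_vmas--;
	return av->nr_vmas == 0 && atomic_load(&av->external_refcount) == 0;
}
```

If the last VMA is unlinked while an external reference is still held, toy_unlink_vma() reports false and the eventual toy_put() is the one that reports the structure as freeable.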
* Re: [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07  0:31     ` KAMEZAWA Hiroyuki
  2010-04-06 21:56       ` Andrew Morton
  2010-04-07 15:42     ` Mel Gorman
  1 sibling, 1 reply; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-07  0:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, 6 Apr 2010 17:05:59 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri,  2 Apr 2010 17:02:44 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a per-node sysfs file called compact. When the file is
> > written to, each zone in that node is compacted. The intention is that this
> > would be used by something like a job scheduler in a batch system before
> > a job starts so that the job can allocate the maximum number of
> > hugepages without significant start-up cost.
> 
> Would it make more sense if this was a per-memcg thing rather than a
> per-node thing?
memcg has no relationship with the placement of memory (for now).
It just controls the amount of memory.
So memcg has no relationship with compaction.
The cgroup that controls the placement of memory is cpuset.
One idea is to make this per-cpuset, but per-node seems OK.
Thanks,
-Kame
* Re: [PATCH 13/14] Do not compact within a preferred zone after a compaction failure
  2010-04-07  0:06   ` Andrew Morton
@ 2010-04-07  0:55     ` Andrea Arcangeli
  2010-04-07 16:32     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2010-04-07  0:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:16PM -0700, Andrew Morton wrote:
> > ---
> >  include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
> >  include/linux/mmzone.h     |    7 +++++++
> >  mm/page_alloc.c            |    5 ++++-
> >  3 files changed, 46 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index ae98afc..2a02719 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> >  extern int fragmentation_index(struct zone *zone, unsigned int order);
> >  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >  			int order, gfp_t gfp_mask, nodemask_t *mask);
> > +
> > +/* defer_compaction - Do not compact within a zone until a given time */
> > +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> > +{
> > +	/*
> > +	 * This function is called when compaction fails to result in a page
> > +	 * allocation success. This is somewhat unsatisfactory as the failure
> > +	 * to compact has nothing to do with time and everything to do with
> > +	 * the requested order, the number of free pages and watermarks. How
> > +	 * to wait on that is more unclear, but the answer would apply to
> > +	 * other areas where the VM waits based on time.
> > +	 */
> 
> c'mon, let's not make this rod for our backs.
Actually I skipped this one in the unified tree (I'm running both
patchsets at the same time as I write this and I should have tweaked
it so that the defrag sysfs control in transparent hugepage turns
memory compaction on and off, plus I embedded the
set_recommended_min_free_kbytes() code inside huge_memory.c
initialization). I merged the whole V7 except the above. It
didn't pass my threshold either, partly because it only checks 1 jiffy,
which is random and too short to matter.
* Re: [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-06 21:56       ` Andrew Morton
@ 2010-04-07  1:19         ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 56+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-04-07  1:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Andrea Arcangeli, Christoph Lameter, Adam Litke,
	Avi Kivity, David Rientjes, Minchan Kim, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, 6 Apr 2010 17:56:01 -0400
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Wed, 7 Apr 2010 09:31:48 +0900 KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > A cgroup which controls placement of memory is cpuset.
> 
> err, yes, that.
> 
> > One idea is per cpuset. But per-node seems ok.
> 
> Which is superior?
> 
> Which maps best onto the way systems are used (and onto ways in which
> we _intend_ that systems be used)?
> 
Nodes have a hugepage interface now.
[root@bluextal qemu-kvm-0.12.3]# ls /sys/devices/system/node/node0/hugepages/
hugepages-2048kB
So a per-node knob is straightforward.
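A hedged sketch of how a job scheduler might drive such a per-node knob from C. The path /sys/devices/system/node/nodeN/compact is an assumption based on the patch description and the hugepages directory shown above, and the value written ("1") is also an assumption. node_compact_path() and trigger_node_compaction() are hypothetical helper names:

```c
#include <stdio.h>

/* Build the assumed sysfs path for node nid; -1 if buf is too small. */
int node_compact_path(int nid, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/node/node%d/compact", nid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write to the node's compact file; 0 on success, -1 on any failure. */
int trigger_node_compaction(int nid)
{
	char path[64];
	FILE *f;
	int ok;

	if (node_compact_path(nid, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;	/* no such node, no such file, or no permission */

	ok = fputs("1\n", f) >= 0;	/* assumed trigger value */
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A scheduler would call trigger_node_compaction() for each node a job is bound to before the job starts allocating hugepages, and treat -1 as "not supported here".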
> Is the physical node really the best unit-of-administration?  And is
> direct access to physical nodes the best means by which admins will
> manage things?
These days we tend to use a setup tool (such as libcgroup) to manage
cpuset and the like.
Considering control through userland support software, I think per-node
is not bad. Per-cpuset would also require users to mount cpuset.
(At present, most of my customers don't use cpuset.)
Thanks,
-Kame
* Re: [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07  9:56     ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07  9:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:20PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:35 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> > locking an anon_vma and it does not appear to have sufficient locking to
> > ensure the anon_vma does not disappear from under it.
> > 
> > This patch copies an approach used by KSM to take a reference on the
> > anon_vma while pages are being migrated. This should prevent rmap_walk()
> > running into nasty surprises later because anon_vma has been freed.
> > 
> 
> The code didn't exactly bend over backwards making itself easy for
> others to understand...
> 
anon_vma in general is not perfectly straight-forward. I clarify the
situation somewhat in Patch 3/14.
> > 
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index d25bd22..567d43f 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -29,6 +29,9 @@ struct anon_vma {
> >  #ifdef CONFIG_KSM
> >  	atomic_t ksm_refcount;
> >  #endif
> > +#ifdef CONFIG_MIGRATION
> > +	atomic_t migrate_refcount;
> > +#endif
> 
> Some documentation here describing the need for this thing and its
> runtime semantics would be appropriate.
> 
Will come to that in Patch 3.
> >  	/*
> >  	 * NOTE: the LSB of the head.next is set by
> >  	 * mm_take_all_locks() _after_ taking the above lock. So the
> > @@ -81,6 +84,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
> >  	return 0;
> >  }
> >  #endif /* CONFIG_KSM */
> > +#ifdef CONFIG_MIGRATION
> > +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > +{
> > +	atomic_set(&anon_vma->migrate_refcount, 0);
> > +}
> > +
> > +static inline int migrate_refcount(struct anon_vma *anon_vma)
> > +{
> > +	return atomic_read(&anon_vma->migrate_refcount);
> > +}
> > +#else
> > +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > +{
> > +}
> > +
> > +static inline int migrate_refcount(struct anon_vma *anon_vma)
> > +{
> > +	return 0;
> > +}
> > +#endif /* CONFIG_MIGRATION */
> >  
> >  static inline struct anon_vma *page_anon_vma(struct page *page)
> >  {
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 6903abf..06e6316 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -542,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	int rcu_locked = 0;
> >  	int charge = 0;
> >  	struct mem_cgroup *mem = NULL;
> > +	struct anon_vma *anon_vma = NULL;
> >  
> >  	if (!newpage)
> >  		return -ENOMEM;
> > @@ -598,6 +599,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	if (PageAnon(page)) {
> >  		rcu_read_lock();
> >  		rcu_locked = 1;
> > +		anon_vma = page_anon_vma(page);
> > +		atomic_inc(&anon_vma->migrate_refcount);
> 
> So no helper function for this.  I guess a grep for `migrate_refcount'
> will find it OK.
> 
It will, again I will expand on this in my response on patch 3.
> Can this count ever have a value > 1?   I guess so..
> 
KSM and migration could both conceivably take a refcount.
> >  	}
> >  
> >  	/*
> > @@ -637,6 +640,15 @@ skip_unmap:
> >  	if (rc)
> >  		remove_migration_ptes(page, page);
> >  rcu_unlock:
> > +
> > +	/* Drop an anon_vma reference if we took one */
> > +	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > +		int empty = list_empty(&anon_vma->head);
> > +		spin_unlock(&anon_vma->lock);
> > +		if (empty)
> > +			anon_vma_free(anon_vma);
> > +	}
> > +
> 
> So...  Why shouldn't this be testing ksm_refcount() too?
> 
It will in patch 3.
> Can we consolidate ksm_refcount and migrate_refcount into, err, `refcount'?
> 
Will expand on this again in the response to patch 3.
> >  	if (rcu_locked)
> >  		rcu_read_unlock();
> >  uncharge:
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index fcd593c..578d0fe 100644
> 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration
  2010-04-07  0:05   ` Andrew Morton
  2010-04-07  0:10     ` Rik van Riel
@ 2010-04-07 10:01     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 10:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:28PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:37 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > For clarity of review, KSM and page migration have separate refcounts on
> > the anon_vma. While clear, this is a waste of memory. This patch gets
> > KSM and page migration to share their toys in a spirit of harmony.
> > 
> > ...
> >
> > @@ -26,11 +26,17 @@
> >   */
> >  struct anon_vma {
> >  	spinlock_t lock;	/* Serialize access to vma list */
> > -#ifdef CONFIG_KSM
> > -	atomic_t ksm_refcount;
> > -#endif
> > -#ifdef CONFIG_MIGRATION
> > -	atomic_t migrate_refcount;
> > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > +
> > +	/*
> > +	 * The external_refcount is taken by either KSM or page migration
> > +	 * to take a reference to an anon_vma when there is no
> > +	 * guarantee that the vma of page tables will exist for
> > +	 * the duration of the operation. A caller that takes
> > +	 * the reference is responsible for clearing up the
> > +	 * anon_vma if they are the last user on release
> > +	 */
> > +	atomic_t external_refcount;
> >  #endif
> 
> hah.
> 
hah indeed. There is a very strong case for merging patch 1 and 3 into
the same patch. They were kept separate because the combined patch was
going to be tricky to review. The expansion of the comment in patch 3
was to avoid a full explanation that was then edited in a later patch.
> > @@ -653,7 +653,7 @@ skip_unmap:
> >  rcu_unlock:
> >  
> >  	/* Drop an anon_vma reference if we took one */
> > -	if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > +	if (anon_vma && atomic_dec_and_lock(&anon_vma->external_refcount, &anon_vma->lock)) {
> >  		int empty = list_empty(&anon_vma->head);
> >  		spin_unlock(&anon_vma->lock);
> >  		if (empty)
> 
> So we now _do_ test ksm_refcount.  Perhaps that fixed a bug added in [1/14]
> 
Would you like to make patch 3 patch 2 instead and then merge them when
going upstream?
As it is you are right in that there could be a bug if just 1 was merged
but not 3 because both refcounts are not taken. I could fix up patch 1
but a merge would make a lot more sense.
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 578d0fe..af35b75 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -248,8 +248,7 @@ static void anon_vma_unlink(struct anon_vma_chain *anon_vma_chain)
> >  	list_del(&anon_vma_chain->same_anon_vma);
> >  
> >  	/* We must garbage collect the anon_vma if it's empty */
> > -	empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> > -					!migrate_refcount(anon_vma);
> > +	empty = list_empty(&anon_vma->head) && !anonvma_external_refcount(anon_vma);
> >  	spin_unlock(&anon_vma->lock);
> >  
> >  	if (empty)
> > @@ -273,8 +272,7 @@ static void anon_vma_ctor(void *data)
> >  	struct anon_vma *anon_vma = data;
> >  
> >  	spin_lock_init(&anon_vma->lock);
> > -	ksm_refcount_init(anon_vma);
> > -	migrate_refcount_init(anon_vma);
> > +	anonvma_external_refcount_init(anon_vma);
> 
> What a mouthful.  Can we do s/external_//g?
> 
We could, but it would be misleading.
anon_vma has an explicit and implicit refcount. The implicit reference
is a VMA being on the anon_vma list. The explicit count is
external_refcount. Just "refcount" implies that it is properly reference
counted which is not the case. Someone looking at memory.c might
conclude that there is a refcounting bug because just the list is
checked.
Now, the right thing to do here is to get rid of implicit reference
counting. Peter Zijlstra has posted an RFC patch series on mm preempt
and the first two patches of that cover using proper reference counting.
When/if that gets merged, a rename from external_refcount to refcount
would be appropriate.
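To make the implicit versus explicit distinction concrete, here is a minimal
user-space sketch. It is not the kernel code: struct model_anon_vma and the
model_* helpers are hypothetical names, locking is left out, and the list of
VMAs is reduced to a counter. It only shows that the anon_vma can be freed
once the list is empty and external_refcount is zero, which is why a check of
the list alone can look like a refcounting bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model; names are illustrative, not kernel API. */
struct model_anon_vma {
	unsigned int nr_vmas;           /* implicit references: VMAs on the list */
	unsigned int external_refcount; /* explicit references: KSM or migration */
	bool freed;                     /* set once the object would be released */
};

/* Release only when neither kind of reference remains. */
static void model_try_free(struct model_anon_vma *av)
{
	if (av->nr_vmas == 0 && av->external_refcount == 0)
		av->freed = true;
}

/* Dropping a VMA from the list drops an implicit reference. */
static void model_unlink_vma(struct model_anon_vma *av)
{
	assert(av->nr_vmas > 0);
	av->nr_vmas--;
	model_try_free(av);
}

/* KSM or migration pins the anon_vma explicitly. */
static void model_get_external(struct model_anon_vma *av)
{
	av->external_refcount++;
}

/* The last external user is responsible for cleaning up. */
static void model_put_external(struct model_anon_vma *av)
{
	assert(av->external_refcount > 0);
	av->external_refcount--;
	model_try_free(av);
}
```

In this model, unlinking the last VMA while an external reference is held
does not free the object. The final model_put_external() does.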
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07 10:22     ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 10:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:32PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:38 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> > being able to hot-remove memory. The main users of page migration such as
> > sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> > only beneficial on NUMA so it makes sense.
> > 
> > As memory compaction will operate within a zone and is useful on both NUMA
> > and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> > user selects CONFIG_COMPACTION as an option.
> > 
> > ...
> >
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -172,6 +172,16 @@ config SPLIT_PTLOCK_CPUS
> >  	default "4"
> >  
> >  #
> > +# support for memory compaction
> > +config COMPACTION
> > +	bool "Allow for memory compaction"
> > +	def_bool y
> > +	select MIGRATION
> > +	depends on EXPERIMENTAL && HUGETLBFS && MMU
> > +	help
> > +	  Allows the compaction of memory for the allocation of huge pages.
> 
> Seems strange to depend on hugetlbfs.  Perhaps depending on
> HUGETLB_PAGE would be more logical.
> 
Fair point, there is a fix below.
> But hang on.  I wanna use compaction to make my order-4 wireless skb
> allocations work better!  Why do you hate me?
> 
Because I'm a bad person and I hate your hardware. However, because I'm
told being a bad person for the sake of it just isn't the right thing to
do, I'll expand the reasoning :).
For your specific example, the allocation is also depending on GFP_ATOMIC
which migration cannot handle today. Significant plumbing would be needed
there to make it work and I believe at the moment that atomic-safe compaction
would be a subset of full compaction. This is a "future" thing but I'd also
expect you and others to resist it on the grounds that depending on such
high-order atomics for the correct working of the hardware is just a bad plan.
That does not cover other high-order allocs though such as those required for
stacks or the ARM allocation of PGDs. These are below PAGE_ALLOC_COSTLY_ORDER
so compaction will not currently trigger.  Reviews commented that it would
be preferable to limit the orders compaction handles to start with. The
direction I'd like to continue with this in the future is to have something
like __zone_reclaim to handle clean page cache first and moving more towards
integrating lumpy reclaim and compaction. When this is done, the HUGETLB_PAGE
dependency would be removed and the smaller orders will also be compacted.
In the meantime, we continue to discourage high-order allocations and
compaction gets its initial trial run against huge pages.
==== CUT HERE ====
mm,compaction: Have CONFIG_COMPACTION depend on HUGETLB_PAGE instead of HUGETLBFS
There is a strong coupling between HUGETLB_PAGE and HUGETLBFS but in theory
there can be alternative interfaces to huge pages other than HUGETLBFS. This
patch makes CONFIG_COMPACTION depend on the right thing.
This is a fix to the patch "Allow CONFIG_MIGRATION to be set without
CONFIG_NUMA or memory hot-remove" and should be merged together.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/Kconfig |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 4fd75a0..a275a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -177,7 +177,7 @@ config COMPACTION
 	bool "Allow for memory compaction"
 	def_bool y
 	select MIGRATION
-	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	depends on EXPERIMENTAL && HUGETLB_PAGE && MMU
 	help
 	  Allows the compaction of memory for the allocation of huge pages.
 
^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07 10:35     ` Mel Gorman
  2010-04-13 12:42     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 10:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:37PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:39 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Unusable free space index is a measure of external fragmentation that
> > takes the allocation size into account. For the most part, the huge page
> > size will be the size of interest but not necessarily so it is exported
> > on a per-order and per-zone basis via /proc/unusable_index.
> 
> I'd suggest /proc/sys/vm/unusable_index.  I don't know how pagetypeinfo
> found its way into the top-level dir.
> 
For the same reason buddyinfo did - no one complained. It keeps the
fragmentation-related information in the same place but I can move it.
> > The index is a value between 0 and 1. It can be expressed as a
> > percentage by multiplying by 100 as documented in
> > Documentation/filesystems/proc.txt.
> > 
> > ...
> > 
> > +> cat /proc/unusable_index
> > +Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
> > +Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
> > +
> > +The unusable free space index measures how much of the available free
> > +memory cannot be used to satisfy an allocation of a given size and is a
> > +value between 0 and 1. The higher the value, the more of free memory is
> > +unusable and by implication, the worse the external fragmentation is. This
> > +can be expressed as a percentage by multiplying by 100.
> 
> That's going to hurt my brain.  Why didn't it report usable free blocks?
> 
Let's say you are graphing the index on a given order over time. If there
are a large number of frees, there can be a large change in that value
but it does not necessarily tell you how much better or worse the system
is overall.
> Also, the index is scaled by the actual amount of free memory in the
> zones, yes?  So to work out how many order-N pages are available you
> first need to know how many free pages there are?
> 
It depends on what your question is. As I'm interested in fragmentation,
this value gives me information on that. Your question is about how many
pages of a given order can be allocated right now and that can be worked
out from buddyinfo.
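As a rough illustration of working that out from buddyinfo, the sketch below
assumes the per-order free block counts of one zone have already been read
into an array (nr_free[i] is the count for order i). The function name
blocks_available is made up for this example. It applies the same arithmetic
as free_blocks_suitable in the patch: each free block of order i >= N can be
split into 2^(i-N) blocks of order N.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many order-`order` allocations the free lists could satisfy
 * right now. nr_free[i] is the number of free blocks of order i, as shown
 * in one row of /proc/buddyinfo. Hypothetical helper, not kernel API.
 */
static unsigned long blocks_available(const unsigned long *nr_free,
				      size_t nr_orders, unsigned int order)
{
	unsigned long total = 0;
	size_t i;

	assert(nr_free != NULL);

	/* A free block of order i holds 2^(i - order) blocks of the target order */
	for (i = order; i < nr_orders; i++)
		total += nr_free[i] << (i - order);

	return total;
}
```

For a row with counts 4, 2 and 1 for orders 0, 1 and 2, this gives 12
order-0 pages, 4 order-1 blocks and 1 order-2 block.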
> Seems complicated.
> 
> >  
> > +
> > +struct contig_page_info {
> > +	unsigned long free_pages;
> > +	unsigned long free_blocks_total;
> > +	unsigned long free_blocks_suitable;
> > +};
> > +
> > +/*
> > + * Calculate the number of free pages in a zone, how many contiguous
> > + * pages are free and how many are large enough to satisfy an allocation of
> > + * the target size. Note that this function makes no attempt to estimate
> > + * how many suitable free blocks there *might* be if MOVABLE pages were
> > + * migrated. Calculating that is possible, but expensive and can be
> > + * figured out from userspace
> > + */
> > +static void fill_contig_page_info(struct zone *zone,
> > +				unsigned int suitable_order,
> > +				struct contig_page_info *info)
> > +{
> > +	unsigned int order;
> > +
> > +	info->free_pages = 0;
> > +	info->free_blocks_total = 0;
> > +	info->free_blocks_suitable = 0;
> > +
> > +	for (order = 0; order < MAX_ORDER; order++) {
> > +		unsigned long blocks;
> > +
> > +		/* Count number of free blocks */
> > +		blocks = zone->free_area[order].nr_free;
> > +		info->free_blocks_total += blocks;
> > +
> > +		/* Count free base pages */
> > +		info->free_pages += blocks << order;
> > +
> > +		/* Count the suitable free blocks */
> > +		if (order >= suitable_order)
> > +			info->free_blocks_suitable += blocks <<
> > +						(order - suitable_order);
> > +	}
> > +}
> > +
> > +/*
> > + * Return an index indicating how much of the available free memory is
> > + * unusable for an allocation of the requested size.
> > + */
> > +static int unusable_free_index(unsigned int order,
> > +				struct contig_page_info *info)
> > +{
> > +	/* No free memory is interpreted as all free memory is unusable */
> > +	if (info->free_pages == 0)
> > +		return 1000;
> > +
> > +	/*
> > +	 * Index should be a value between 0 and 1. Return a value to 3
> > +	 * decimal places.
> > +	 *
> > +	 * 0 => no fragmentation
> > +	 * 1 => high fragmentation
> > +	 */
> > +	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
> > +
> > +}
> > +
> > +static void unusable_show_print(struct seq_file *m,
> > +					pg_data_t *pgdat, struct zone *zone)
> > +{
> > +	unsigned int order;
> > +	int index;
> > +	struct contig_page_info info;
> > +
> > +	seq_printf(m, "Node %d, zone %8s ",
> > +				pgdat->node_id,
> > +				zone->name);
> > +	for (order = 0; order < MAX_ORDER; ++order) {
> > +		fill_contig_page_info(zone, order, &info);
> > +		index = unusable_free_index(order, &info);
> > +		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> > +	}
> > +
> > +	seq_putc(m, '\n');
> > +}
> > +
> > +/*
> > + * Display unusable free space index
> > + * XXX: Could be a lot more efficient, but it's not a critical path
> > + */
> > +static int unusable_show(struct seq_file *m, void *arg)
> > +{
> > +	pg_data_t *pgdat = (pg_data_t *)arg;
> > +
> > +	/* check memoryless node */
> > +	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
> > +		return 0;
> > +
> > +	walk_zones_in_node(m, pgdat, unusable_show_print);
> > +
> > +	return 0;
> > +}
> > +
> >  static void pagetypeinfo_showfree_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> >  {
> > @@ -603,6 +703,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
> >  	.release	= seq_release,
> >  };
> >  
> > +static const struct seq_operations unusable_op = {
> > +	.start	= frag_start,
> > +	.next	= frag_next,
> > +	.stop	= frag_stop,
> > +	.show	= unusable_show,
> > +};
> > +
> > +static int unusable_open(struct inode *inode, struct file *file)
> > +{
> > +	return seq_open(file, &unusable_op);
> > +}
> > +
> > +static const struct file_operations unusable_file_ops = {
> > +	.open		= unusable_open,
> > +	.read		= seq_read,
> > +	.llseek		= seq_lseek,
> > +	.release	= seq_release,
> > +};
> > +
> >  #ifdef CONFIG_ZONE_DMA
> >  #define TEXT_FOR_DMA(xx) xx "_dma",
> >  #else
> > @@ -947,6 +1066,7 @@ static int __init setup_vmstat(void)
> >  #ifdef CONFIG_PROC_FS
> >  	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
> >  	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
> > +	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
> >  	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
> >  	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
> >  #endif
> 
> All this code will be bloat for most people, I suspect.  Can we find a
> suitable #ifdef wrapper to keep my cellphone happy?
> 
It could. However, this information can also be created from buddyinfo and
I have a perl script that can be adapted to duplicate the output of this
proc file. As there isn't an in-kernel user of this information, it can
also be dropped.
Will I roll a patch that moves the proc entry and makes it a CONFIG option
or will I just remove the file altogether? If I remove it, I can adapt
the perl script and add to the other hugepage-related utilities in
libhugetlbfs.
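For reference, a userspace equivalent does not need much. The following is a
minimal C sketch rather than the perl script mentioned above, and the names
parse_buddyinfo_row and unusable_index_milli are invented for it. It assumes
rows in the usual buddyinfo layout ("Node N, zone NAME count count ...") and
returns the index scaled by 1000, like unusable_free_index() in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse the per-order free block counts from one /proc/buddyinfo row.
 * Returns the number of orders read, or 0 if the row prefix is malformed.
 * Hypothetical helper for illustration.
 */
static size_t parse_buddyinfo_row(const char *line, unsigned long *nr_free,
				  size_t max_orders)
{
	int node, off = 0;
	char zone[32];
	const char *p;
	size_t n = 0;

	assert(line != NULL && nr_free != NULL);

	/* %n records how many characters the "Node N, zone NAME" prefix used */
	if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) < 2)
		return 0;

	p = line + off;
	while (n < max_orders) {
		char *end;
		unsigned long v = strtoul(p, &end, 10);

		if (end == p)	/* no more numbers on the row */
			break;
		nr_free[n++] = v;
		p = end;
	}
	return n;
}

/*
 * Unusable free space index for `order`, scaled by 1000: the fraction of
 * free pages that sit in blocks too small for the request.
 */
static int unusable_index_milli(const unsigned long *nr_free, size_t nr_orders,
				unsigned int order)
{
	unsigned long long free_pages = 0, usable_pages = 0;
	size_t i;

	for (i = 0; i < nr_orders; i++) {
		unsigned long long pages = (unsigned long long)nr_free[i] << i;

		free_pages += pages;
		if (i >= order)
			usable_pages += pages;
	}

	/* No free memory is treated as all free memory being unusable */
	if (free_pages == 0)
		return 1000;

	return (int)((free_pages - usable_pages) * 1000ULL / free_pages);
}
```

With a row whose counts are 4, 2 and 1 there are 12 free pages, so the index
is 0.000 for order 0, 0.333 for order 1 and 0.666 for order 2.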
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 06/14] Export fragmentation index via /proc/extfrag_index
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07 10:46     ` Mel Gorman
  2010-04-13 12:43     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 10:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:42PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:40 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Fragmentation index is a value that makes sense when an allocation of a
> > given size would fail. The index indicates whether an allocation failure is
> > due to a lack of memory (values towards 0) or due to external fragmentation
> > (value towards 1).  For the most part, the huge page size will be the size
> > of interest but not necessarily so it is exported on a per-order and per-zone
> > basis via /proc/extfrag_index
> 
> (/proc/sys/vm?)
> 
It can move.
> Like unusable_index, this seems awfully specialised. 
Except in this case, the fragmentation index is used by the kernel when
deciding in advance whether compaction will do the job or if lumpy
reclaim is required.
I could avoid exposing this to userspace but it would make it harder to
decide what needs to happen with extfrag_threshold later. i.e. does the
threshold need a different value (proc would help gather the data) or
is a new heuristic needed.
> Perhaps we could
> hide it under CONFIG_MEL, or even put it in debugfs with the intention
> of removing it in 6 or 12 months time.  Either way, it's hard to
> justify permanently adding this stuff to every kernel in the world?
> 
Moving it to debugfs would satisfy the requirement of tuning extfrag_threshold
without adding it to every kernel but it could also be just removed.
> 
> I have a suspicion that all the info in unusable_index and
> extfrag_index could be computed from userspace using /proc/kpageflags
It can be computed from buddyinfo. I used a perl script to calculate it
in the past. I exposed the information from in-kernel in these patches so
people would be guaranteed to have the same information as me.
> (and perhaps a bit of dmesg-diddling to find the zones). 
Can be figured out from buddyinfo too.
> If that can't
> be done today, I bet it'd be pretty easy to arrange for it.
> 
It is. Will I just remove the proc files, keep the internal calculation
for fragmentation_index and kick that perl script into shape to produce
the same information from buddyinfo?
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 08/14] Memory compaction core
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07 15:21     ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 15:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:51PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:42 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> > 
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> > 
> >
> > ...
> >
> > --- /dev/null
> > +++ b/include/linux/compaction.h
> > @@ -0,0 +1,9 @@
> > +#ifndef _LINUX_COMPACTION_H
> > +#define _LINUX_COMPACTION_H
> > +
> > +/* Return values for compact_zone() */
> > +#define COMPACT_INCOMPLETE	0
> > +#define COMPACT_PARTIAL		1
> > +#define COMPACT_COMPLETE	2
> 
> Confused.  "incomplete" and "partial" are synonyms.  Please fully
> document these here.
> 
I have a difficulty in that it's hard to give you fixes as they would
span two patches. It might be easiest on you overall if you do a
s/COMPACT_INCOMPLETE/COMPACT_CONTINUE/
on both this patch and the direct compaction patch. I'll then send a follow-on
patch documenting the four defines (later patch adds a fourth) as
/* Return values for compact_zone() and try_to_compact_pages() */
/* compaction didn't start as it was not possible or direct reclaim was more suitable */
#define COMPACT_SKIPPED         0
/* compaction should continue to another pageblock */
#define COMPACT_CONTINUE        1
/* direct compaction partially compacted a zone and there are suitable pages */
#define COMPACT_PARTIAL         2
/* The full zone was compacted */
#define COMPACT_COMPLETE        3
> > +#endif /* _LINUX_COMPACTION_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index f3b473a..f920815 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -335,6 +335,7 @@ void put_page(struct page *page);
> >  void put_pages_list(struct list_head *pages);
> >  
> >  void split_page(struct page *page, unsigned int order);
> > +int split_free_page(struct page *page);
> >  
> >  /*
> >   * Compound pages have a destructor function.  Provide a
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 986b12d..cf8bba7 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -151,6 +151,7 @@ enum {
> >  };
> >  
> >  #define SWAP_CLUSTER_MAX 32
> > +#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
> 
> Why? 
To reduce the amount of time zone locks are held.
> What are the implications of this decision? 
Pro: Latencies are lower, fewer pages are isolated at any given time
Con: There is a wider window during which a parallel allocator can use a
     page within the pageblock being compacted
> How was it arrived at? 
It's somewhat arbitrary, only that reclaim works on similar units and
they share logic on what the correct number of pages to have isolated
from the LRU lists is.
> What might one expect if one were to alter COMPACT_CLUSTER_MAX?
> 
The higher the value, the longer the latency is that the lock is held
during isolation but under very heavy memory pressure, there might be
higher success rates for allocation as the window during which parallel
allocators can allocate pages being compacted is reduced.
The lower the value, the lower the time the lock is held. Fewer pages
will be isolated at any given time.
The main advantage of increasing the value is that it makes it
less likely a parallel allocator will interfere, but that had to be
balanced against the lock hold latency. As we appear to be ok with
the hold time for reclaim, it was reasonable to assume we'd also be ok
with the hold time for compaction.
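The trade-off can be shown with a toy model. This is a sketch under
simplifying assumptions, not the compaction code: it pretends isolating one
page costs one unit of lock hold time and that the lock is dropped between
batches. struct batch_stats and model_isolate are hypothetical names.

```c
#include <assert.h>

/* Hypothetical result of isolating pages in batches under a lock. */
struct batch_stats {
	unsigned long lock_acquisitions;  /* gaps where a parallel allocator can run */
	unsigned long max_pages_per_hold; /* proxy for worst-case lock hold time */
};

/*
 * Isolate total_pages pages, at most `batch` pages per lock hold.
 * `batch` plays the role of COMPACT_CLUSTER_MAX in this model.
 */
static struct batch_stats model_isolate(unsigned long total_pages,
					unsigned long batch)
{
	struct batch_stats s = { 0, 0 };

	assert(batch > 0);

	while (total_pages > 0) {
		unsigned long n = total_pages < batch ? total_pages : batch;

		s.lock_acquisitions++;		/* take the lock for this batch */
		if (n > s.max_pages_per_hold)
			s.max_pages_per_hold = n;
		total_pages -= n;		/* drop the lock, then continue */
	}
	return s;
}
```

In this model, isolating 100 pages with a batch of 32 takes the lock four
times and never holds it for more than 32 pages. With a batch of 128 the lock
is taken once, but held for all 100 pages.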
> >  #define SWAP_MAP_MAX	0x3e	/* Max duplication count, in first swap_map */
> >  #define SWAP_MAP_BAD	0x3f	/* Note pageblock is bad, in first swap_map */
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 117f0dd..56e4b44 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >  		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> >  		KSWAPD_SKIP_CONGESTION_WAIT,
> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > +		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> >  #ifdef CONFIG_HUGETLB_PAGE
> >  		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> >  #endif
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 7a68d2a..ccb1f72 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> >
> > ...
> >
> > +static int release_freepages(struct list_head *freelist)
> > +{
> > +	struct page *page, *next;
> > +	int count = 0;
> > +
> > +	list_for_each_entry_safe(page, next, freelist, lru) {
> > +		list_del(&page->lru);
> > +		__free_page(page);
> > +		count++;
> > +	}
> > +
> > +	return count;
> > +}
> 
> I'm kinda surprised that we don't already have a function to do this.
> 
Subsystems needing lists of free pages would be using mempools.
> An `unsigned' return value would make more sense.  Perhaps even
> `unsigned long', unless there's something else here which would prevent
> that absurd corner-case.
> 
Included in the patch below. The corner-case is impossible. We're
isolating only COMPACT_CLUSTER_MAX and this must be less than
MAX_ORDER_NR_PAGES. However, the return value of the function is used with
an unsigned long.  Technically, it could be unsigned int but page counts
are always in unsigned long so why be surprising.
> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> > +static int isolate_freepages_block(struct zone *zone,
> > +				unsigned long blockpfn,
> > +				struct list_head *freelist)
> > +{
> > +	unsigned long zone_end_pfn, end_pfn;
> > +	int total_isolated = 0;
> > +
> > +	/* Get the last PFN we should scan for free pages at */
> > +	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > +	end_pfn = blockpfn + pageblock_nr_pages;
> > +	if (end_pfn > zone_end_pfn)
> > +		end_pfn = zone_end_pfn;
> 
> 	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
> 
> I find that easier to follow, dunno how others feel.
> 
It looks better. Done.
> > +	/* Isolate free pages. This assumes the block is valid */
> 
> What does "This assumes the block is valid" mean?  The code checks
> pfn_valid_within()..
> 
Typically, a MAX_ORDER_NR_PAGES naturally-aligned block of pages is
considered valid if any one of them returns true for pfn_valid(). The
caller of this function has checked the block with pfn_valid so the
block of pages is "valid".
Some architectures insist on punching holes within a block of
MAX_ORDER_NR_PAGES. These are required to call pfn_valid_within() when
walking a range of PFNs. For architectures without these holes,
pfn_valid_within() is a no-op.
> > +	for (; blockpfn < end_pfn; blockpfn++) {
> > +		struct page *page;
> > +		int isolated, i;
> > +
> > +		if (!pfn_valid_within(blockpfn))
> > +			continue;
> > +
> > +		page = pfn_to_page(blockpfn);
> 
> hm.  pfn_to_page() isn't exactly cheap in some memory models.  I wonder
> if there was some partial result we could have locally cached across
> the entire loop.
> 
Ordinarily, a PFN walker is required to use pfn_to_page() in case it crosses
something like a sparsemem boundary (assuming no VMEMMAP) where there may
be no relationship between the PFN and the struct page location.
In this specific case though, we are within a MAX_ORDER_NR_PAGES block
so it's safe to cache the struct page assuming nothing crazy is
introduced by a memory model.
Done.
> > +		if (!PageBuddy(page))
> > +			continue;
> > +
> > +		/* Found a free page, break it into order-0 pages */
> > +		isolated = split_free_page(page);
> > +		total_isolated += isolated;
> > +		for (i = 0; i < isolated; i++) {
> > +			list_add(&page->lru, freelist);
> > +			page++;
> > +		}
> > +
> > +		/* If a page was split, advance to the end of it */
> > +		if (isolated)
> > +			blockpfn += isolated - 1;
> > +	}
> 
> Strange.  Having just busted a pageblock_order-sized higher-order page
> into order-0 pages
The page being broken up could be any size. It's not necessarily related
to pageblocks.
> , the loop goes on and inspects the remaining
> (1-2^pageblock_order) pages, presumably to no effect.  Perhaps
> 
> 	for (; blockpfn < end_pfn; blockpfn++) {
> 
> should be
> 
> 	for (; blockpfn < end_pfn; blockpfn += pageblock_nr_pages) {
> 
> or somesuch.
> 
That's what the code marked with "If a page was split, advance to the
end of it" is for. It knows how to advance to the end of the buddy page
without accidentally skipping over a page.
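To illustrate how the loop advances, here is a small user-space model. It is
a sketch and not the kernel function: buddy_order[] is a hypothetical array
where an entry >= 0 means "a free buddy block of that order starts at this
PFN" and -1 means anything else (an allocated page, or a tail page of a free
block). The `visited` counter exists only to show how many PFNs the loop
actually inspects.

```c
#include <assert.h>

/*
 * Walk PFNs in [start, end). When a free block is found, account for all
 * of its pages and jump to its last PFN so the loop increment lands on the
 * first PFN after the block. Returns the number of pages isolated.
 */
static unsigned long model_isolate_block(const int *buddy_order,
					 unsigned long start,
					 unsigned long end,
					 unsigned long *visited)
{
	unsigned long total_isolated = 0;
	unsigned long pfn;

	assert(start <= end);
	*visited = 0;

	for (pfn = start; pfn < end; pfn++) {
		unsigned long isolated;

		(*visited)++;

		if (buddy_order[pfn] < 0)
			continue;	/* not the head of a free block */

		/* The block may be of any order, not just pageblock_order */
		isolated = 1UL << buddy_order[pfn];
		total_isolated += isolated;

		/* If a page was split, advance to the end of it */
		pfn += isolated - 1;
	}

	return total_isolated;
}
```

With 16 PFNs holding an order-2 block at PFN 0, an order-0 page at PFN 5 and
an order-3 block at PFN 8, the model isolates 13 pages while inspecting only
6 PFNs. The tail pages of each free block are never looked at.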
> btw, is the whole pageblock_order thing as sucky as it seems?  If I
> want my VM to be oriented to making order-4-skb-allocations work, I
> need to tune it that way, to coopt something the hugepage fetishists
> added?  What if I need order-4 skb's _and_ hugepages?
> 
It's easiest to consider migrating pages to and from in ranges of pageblocks
because that is the granularity anti-frag works on. There is very little gained
by considering a lower boundary. With direct compaction, compact_finished()
is checking on a regular basis whether it's ok to finish compaction early
because the caller is satisfied.
At worst at the moment, more of a pageblock gets compacted than potentially
necessary for an order-4 allocation to succeed. Specifically, one pageblock
will get fully compacted even though only a small amount of it may have been
required. It'd be possible to do such an optimisation, but it'll be a
micro-optimisation and will obscure the logic somewhat.
> > +	return total_isolated;
> > +}
> > +
> > +/* Returns 1 if the page is within a block suitable for migration to */
> > +static int suitable_migration_target(struct page *page)
> 
> `bool'?
> 
Ok.
> > +{
> > +
> > +	int migratetype = get_pageblock_migratetype(page);
> > +
> > +	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
> > +	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
> > +		return 0;
> > +
> > +	/* If the page is a large free page, then allow migration */
> > +	if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > +		return 1;
> > +
> > +	/* If the block is MIGRATE_MOVABLE, allow migration */
> > +	if (migratetype == MIGRATE_MOVABLE)
> > +		return 1;
> > +
> > +	/* Otherwise skip the block */
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Based on information in the current compact_control, find blocks
> > + * suitable for isolating free pages from
> 
> "and then isolate them"?
> 
Correct.
> > + */
> > +static void isolate_freepages(struct zone *zone,
> > +				struct compact_control *cc)
> > +{
> > +	struct page *page;
> > +	unsigned long high_pfn, low_pfn, pfn;
> > +	unsigned long flags;
> > +	int nr_freepages = cc->nr_freepages;
> > +	struct list_head *freelist = &cc->freepages;
> > +
> > +	pfn = cc->free_pfn;
> > +	low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> > +	high_pfn = low_pfn;
> > +
> > +	/*
> > +	 * Isolate free pages until enough are available to migrate the
> > +	 * pages on cc->migratepages. We stop searching if the migrate
> > +	 * and free page scanners meet or enough free pages are isolated.
> > +	 */
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> > +					pfn -= pageblock_nr_pages) {
> > +		int isolated;
> > +
> > +		if (!pfn_valid(pfn))
> > +			continue;
> > +
> > +		/* 
> > +		 * Check for overlapping nodes/zones. It's possible on some
> > +		 * configurations to have a setup like
> > +		 * node0 node1 node0
> > +		 * i.e. it's possible that all pages within a zones range of
> > +		 * pages do not belong to a single zone.
> > +		 */
> > +		page = pfn_to_page(pfn);
> > +		if (page_zone(page) != zone)
> > +			continue;
> 
> Well.  This code checks each pfn it touches, but
> isolate_freepages_block() doesn't do this - isolate_freepages_block()
> happily blunders across a contiguous span of pageframes, assuming that
> all those pages are valid, and within the same zone.
> 
This is walking in strides of pageblock_nr_pages. You only have to call
pfn_valid() once for MAX_ORDER_NR_PAGES but if walking the PFNs within
the block, pfn_valid_within() must be called for each one.
Granted, pageblock_nr_pages != MAX_ORDER_NR_PAGES, but it'd be little
more than a micro-optimisation to identify exactly when the boundary was
crossed and call pfn_valid() a few times less.
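To illustrate the difference between the two kinds of check, here is a
minimal userspace sketch, not kernel code. model_pfn_valid(), walk_blocks(),
struct walk_stats, BLOCK_PAGES and the hole layout are all hypothetical
stand-ins (for pfn_valid()/pfn_valid_within() and pageblock_nr_pages). The
outer walk makes one validity check per block stride; only the inner walk
checks every PFN.

```c
#include <stdbool.h>

#define BLOCK_PAGES 8UL	/* hypothetical stand-in for pageblock_nr_pages */

/*
 * Hypothetical memory map: the block covering PFNs 8-15 is a hole, and
 * PFN 19 is an invalid page inside an otherwise valid block.
 */
static bool model_pfn_valid(unsigned long pfn)
{
	if (pfn >= 8 && pfn < 16)
		return false;
	return pfn != 19;
}

struct walk_stats {
	unsigned long block_checks;	/* one per block stride */
	unsigned long page_checks;	/* one per PFN inside a valid block */
	unsigned long valid_pages;
};

/* Walk [start, end) in block strides, then PFN by PFN inside valid blocks */
static struct walk_stats walk_blocks(unsigned long start, unsigned long end)
{
	struct walk_stats st = {0};

	for (unsigned long pfn = start; pfn < end; pfn += BLOCK_PAGES) {
		st.block_checks++;
		if (!model_pfn_valid(pfn))
			continue;	/* whole block skipped after one check */

		for (unsigned long p = pfn; p < pfn + BLOCK_PAGES && p < end; p++) {
			st.page_checks++;
			if (model_pfn_valid(p))
				st.valid_pages++;
		}
	}

	return st;
}
```

In this model the hole costs a single check, while a valid block costs one
check per page.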
> > +		/* Check the block is suitable for migration */
> > +		if (!suitable_migration_target(page))
> > +			continue;
> > +
> > +		/* Found a block suitable for isolating free pages from */
> > +		isolated = isolate_freepages_block(zone, pfn, freelist);
> > +		nr_freepages += isolated;
> > +
> > +		/*
> > +		 * Record the highest PFN we isolated pages from. When next
> > +		 * looking for free pages, the search will restart here as
> > +		 * page migration may have returned some pages to the allocator
> > +		 */
> > +		if (isolated)
> > +			high_pfn = max(high_pfn, pfn);
> > +	}
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> 
> For how long can this loop hold off interrupts?
> 
Absolute worst case, until it reaches the location of the migration
scanner. As we are isolating pages for migration in units of 32 pages,
it seems unlikely that the migration and free page scanners would be a
substantial distance apart without 32 free pages between them.
> > +	cc->free_pfn = high_pfn;
> > +	cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +/* Update the number of anon and file isolated pages in the zone */
> > +static void acct_isolated(struct zone *zone, struct compact_control *cc)
> > +{
> > +	struct page *page;
> > +	unsigned int count[NR_LRU_LISTS] = { 0, };
> > +
> > +	list_for_each_entry(page, &cc->migratepages, lru) {
> > +		int lru = page_lru_base_type(page);
> > +		count[lru]++;
> > +	}
> > +
> > +	cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > +	cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > +	__mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> > +	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> > +}
> > +
> > +/* Similar to reclaim, but different enough that they don't share logic */
> 
> yeah, but what does it do?
> 
The hint is in the name. It tells you if there are "too many pages
isolated". Included in the patch below.
> > +static int too_many_isolated(struct zone *zone)
> > +{
> > +
> > +	unsigned long inactive, isolated;
> > +
> > +	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
> > +					zone_page_state(zone, NR_INACTIVE_ANON);
> > +	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
> > +					zone_page_state(zone, NR_ISOLATED_ANON);
> > +
> > +	return isolated > inactive;
> > +}
> > +
> > +/*
> > + * Isolate all pages that can be migrated from the block pointed to by
> > + * the migrate scanner within compact_control.
> > + */
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > +					struct compact_control *cc)
> > +{
> > +	unsigned long low_pfn, end_pfn;
> > +	struct list_head *migratelist;
> > +
> > +	low_pfn = cc->migrate_pfn;
> > +	migratelist = &cc->migratepages;
> > +
> > +	/* Do not scan outside zone boundaries */
> > +	if (low_pfn < zone->zone_start_pfn)
> > +		low_pfn = zone->zone_start_pfn;
> 
> Can this happen?
> 
Unlikely, but yes.
> Use max()?
> 
Done, in the first follow-on patch.
> > +	/* Setup to scan one block but not past where we are migrating to */
> 
> what?
> 
What indeed. Changed to "Only scan within a pageblock boundary"
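As a worked example of the boundary arithmetic used for end_pfn and
free_pfn, the following userspace sketch reimplements the rounding.
BLOCK_PAGES, align_up(), align_down() and scan_end() are hypothetical names.
align_up() follows the usual power-of-two definition of the kernel's ALIGN()
macro, and the block size of 512 pages is an assumption for illustration
only.

```c
/* Hypothetical power-of-two block size standing in for pageblock_nr_pages */
#define BLOCK_PAGES 512UL

/* Round x up to a multiple of a; a must be a power of two */
static unsigned long align_up(unsigned long x, unsigned long a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Round x down to a multiple of a, as done for free_pfn in compact_zone() */
static unsigned long align_down(unsigned long x, unsigned long a)
{
	return x & ~(a - 1);
}

/* End of the migrate scan window, computed the same way as end_pfn */
static unsigned long scan_end(unsigned long low_pfn)
{
	return align_up(low_pfn + BLOCK_PAGES, BLOCK_PAGES);
}
```

With these definitions scan_end(0) is 512 and scan_end(100) is 1024, so the
end of the scan window always lands on a block boundary.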
> > +	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > +
> > +	/* Do not cross the free scanner or scan within a memory hole */
> > +	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> > +		cc->migrate_pfn = end_pfn;
> > +		return 0;
> > +	}
> > +
> > +	/* Do not isolate the world */
> 
> Needs (much) more explanation, please.
> 
        /*
         * Ensure that there are not too many pages isolated from the LRU
         * list by either parallel reclaimers or compaction. If there are,
         * delay for some time until fewer pages are isolated
         */
> > +	while (unlikely(too_many_isolated(zone))) {
> > +		congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> ... why did it do this?  Quite a head-scratcher.
> 
The expected cause of too many pages being isolated is parallel reclaimers. Too
many pages isolated implies pages are being cleaned so wait for a period of
time or until IO congestion clears to try again.
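The throttling logic can be modelled in userspace roughly as follows. This
is a sketch under stated assumptions: struct zone_counts, throttle_isolation()
and putback_ten() are hypothetical, the wait callback stands in for
congestion_wait(), and the max_waits cut-off stands in for the
fatal-signal exit.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the per-zone counters the check reads */
struct zone_counts {
	unsigned long inactive_file, inactive_anon;
	unsigned long isolated_file, isolated_anon;
};

/* Same comparison as too_many_isolated() in the patch */
static bool too_many_isolated_model(const struct zone_counts *z)
{
	unsigned long inactive = z->inactive_file + z->inactive_anon;
	unsigned long isolated = z->isolated_file + z->isolated_anon;

	return isolated > inactive;
}

/* Stand-in for congestion_wait(): lets the caller simulate other activity */
typedef void (*wait_fn)(struct zone_counts *z);

/*
 * Wait until fewer pages are isolated. Returns the number of waits, or -1
 * if the model gives up after max_waits (stand-in for a fatal signal).
 */
static int throttle_isolation(struct zone_counts *z, wait_fn wait, int max_waits)
{
	int waits = 0;

	while (too_many_isolated_model(z)) {
		if (waits == max_waits)
			return -1;
		wait(z);
		waits++;
	}

	return waits;
}

/* Example wait: a parallel reclaimer puts back up to 10 isolated file pages */
static void putback_ten(struct zone_counts *z)
{
	unsigned long n = z->isolated_file < 10 ? z->isolated_file : 10;

	z->isolated_file -= n;
	z->inactive_file += n;
}
```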
> > +		if (fatal_signal_pending(current))
> > +			return 0;
> > +	}
> > +
> > +	/* Time to isolate some pages for migration */
> > +	spin_lock_irq(&zone->lru_lock);
> > +	for (; low_pfn < end_pfn; low_pfn++) {
> > +		struct page *page;
> > +		if (!pfn_valid_within(low_pfn))
> > +			continue;
> > +
> > +		/* Get the page and skip if free */
> > +		page = pfn_to_page(low_pfn);
> > +		if (PageBuddy(page)) {
> > +			low_pfn += (1 << page_order(page)) - 1;
> > +			continue;
> > +		}
> > +
> > +		/* Try isolate the page */
> > +		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> > +			del_page_from_lru_list(zone, page, page_lru(page));
> > +			list_add(&page->lru, migratelist);
> > +			mem_cgroup_del_lru(page);
> > +			cc->nr_migratepages++;
> > +		}
> > +
> > +		/* Avoid isolating too much */
> > +		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> > +			break;
> 
> This test could/should be moved inside the preceding `if' block.  Or,
> better, simply do
> 
> 		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
> 			continue;	/* comment goes here */
> 
Done.
> > +	}
> > +
> > +	acct_isolated(zone, cc);
> > +
> > +	spin_unlock_irq(&zone->lru_lock);
> > +	cc->migrate_pfn = low_pfn;
> > +
> > +	return cc->nr_migratepages;
> > +}
> > +
> > +/*
> > + * This is a migrate-callback that "allocates" freepages by taking pages
> > + * from the isolated freelists in the block we are migrating to.
> > + */
> > +static struct page *compaction_alloc(struct page *migratepage,
> > +					unsigned long data,
> > +					int **result)
> > +{
> > +	struct compact_control *cc = (struct compact_control *)data;
> > +	struct page *freepage;
> > +
> > +	/* Isolate free pages if necessary */
> > +	if (list_empty(&cc->freepages)) {
> > +		isolate_freepages(cc->zone, cc);
> > +
> > +		if (list_empty(&cc->freepages))
> > +			return NULL;
> > +	}
> > +
> > +	freepage = list_entry(cc->freepages.next, struct page, lru);
> > +	list_del(&freepage->lru);
> > +	cc->nr_freepages--;
> > +
> > +	return freepage;
> > +}
> > +
> > +/*
> > + * We cannot control nr_migratepages and nr_freepages fully when migration is
> > + * running as migrate_pages() has no knowledge of compact_control. When
> > + * migration is complete, we count the number of pages on the lists by hand.
> > + */
> > +static void update_nr_listpages(struct compact_control *cc)
> > +{
> > +	int nr_migratepages = 0;
> > +	int nr_freepages = 0;
> > +	struct page *page;
> 
> newline here please.
> 
Done
> > +	list_for_each_entry(page, &cc->migratepages, lru)
> > +		nr_migratepages++;
> > +	list_for_each_entry(page, &cc->freepages, lru)
> > +		nr_freepages++;
> > +
> > +	cc->nr_migratepages = nr_migratepages;
> > +	cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +static inline int compact_finished(struct zone *zone,
> > +						struct compact_control *cc)
> > +{
> > +	if (fatal_signal_pending(current))
> > +		return COMPACT_PARTIAL;
> 
> ah-hah!  So maybe we meant COMPACT_INTERRUPTED.
> 
No, although an interruption can be a reason for a partial completion. In
this particular case, it's unfortunate because the caller is unlikely to get
the page requested but it also has received a fatal signal so it probably
doesn't care.
> > +	/* Compaction run completes if the migrate and free scanner meet */
> > +	if (cc->free_pfn <= cc->migrate_pfn)
> > +		return COMPACT_COMPLETE;
> > +
> > +	return COMPACT_INCOMPLETE;
> > +}
> > +
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > +	int ret = COMPACT_INCOMPLETE;
> > +
> > +	/* Setup to move all movable pages to the end of the zone */
> > +	cc->migrate_pfn = zone->zone_start_pfn;
> > +	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > +	cc->free_pfn &= ~(pageblock_nr_pages-1);
> 
> If zone->spanned_pages is much much larger than zone->present_pages,
> this code will suck rather a bit.  Is there a reason why that can never
> happen?
> 
No reason why it can't happen but it's mitigated by only checking one PFN
per pageblock_nr_pages to see if it is valid in isolate_migratepages().
> > +	migrate_prep();
> > +
> > +	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> 
> <stares at that for a while>
> 
> Perhaps
> 
> 	while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
> 
> would be clearer.  That would make the definition-site initialisation
> of `ret' unneeded too.
> 
True.
> > +		unsigned long nr_migrate, nr_remaining;
> 
> newline please.
> 
Done.
> > +		if (!isolate_migratepages(zone, cc))
> > +			continue;
> 
> Boy, this looks like an infinite loop waiting to happen. Are you sure?
Yes, compact_finished() has all the exit conditions.
> Suppose we hit a pageblock-sized string of !pfn_valid() pfn's,
> for example. 
Then the migrate scanner will eventually reach the free scanner and it
will exit.
> Worried.
> 
Can you spot a corner case that is not covered by compact_finished() ?
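The termination argument can be sketched as a small userspace model. It
assumes, as a simplification, that every call to the isolate step advances
the migrate scanner by one block whether or not it finds pages. All names
here (struct scan_state, model_finished(), model_isolate(), model_compact(),
all_holes(), BLOCK_PAGES) are hypothetical.

```c
#include <stdbool.h>

enum { MODEL_INCOMPLETE, MODEL_COMPLETE };

#define BLOCK_PAGES 4UL	/* hypothetical stand-in for pageblock_nr_pages */

struct scan_state {
	unsigned long migrate_pfn;	/* moves up through the zone */
	unsigned long free_pfn;		/* upper bound for the migrate scanner */
};

/* Same exit test as compact_finished(): the scanners have met */
static int model_finished(const struct scan_state *s)
{
	return s->free_pfn <= s->migrate_pfn ? MODEL_COMPLETE : MODEL_INCOMPLETE;
}

/* Advance by one block; report 0 pages isolated if the block is a hole */
static unsigned long model_isolate(struct scan_state *s,
				   bool (*is_hole)(unsigned long))
{
	unsigned long start = s->migrate_pfn;

	s->migrate_pfn = start + BLOCK_PAGES;
	return is_hole(start) ? 0 : BLOCK_PAGES;
}

/* Worst case for the loop: every block is a hole */
static bool all_holes(unsigned long pfn)
{
	(void)pfn;
	return true;
}

/* Returns the number of loop iterations before the scanners meet */
static unsigned long model_compact(struct scan_state *s,
				   bool (*is_hole)(unsigned long))
{
	unsigned long iterations = 0;

	while (model_finished(s) == MODEL_INCOMPLETE) {
		iterations++;
		if (!model_isolate(s, is_hole))
			continue;	/* nothing isolated, but progress was made */
		/* migration of the isolated pages would happen here */
	}

	return iterations;
}
```

Even when every block is a hole, the `continue` path still moves migrate_pfn
forward, so the loop is bounded by the number of blocks between the scanners.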
> > +		nr_migrate = cc->nr_migratepages;
> > +		migrate_pages(&cc->migratepages, compaction_alloc,
> > +						(unsigned long)cc, 0);
> > +		update_nr_listpages(cc);
> > +		nr_remaining = cc->nr_migratepages;
> > +
> > +		count_vm_event(COMPACTBLOCKS);
> > +		count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > +		if (nr_remaining)
> > +			count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > +
> > +		/* Release LRU pages not migrated */
> > +		if (!list_empty(&cc->migratepages)) {
> > +			putback_lru_pages(&cc->migratepages);
> > +			cc->nr_migratepages = 0;
> > +		}
> > +
> > +	}
> > +
> > +	/* Release free pages and check accounting */
> > +	cc->nr_freepages -= release_freepages(&cc->freepages);
> > +	VM_BUG_ON(cc->nr_freepages != 0);
> > +
> > +	return ret;
> > +}
> > +
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 624cba4..3cf947d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1208,6 +1208,45 @@ void split_page(struct page *page, unsigned int order)
> >  }
> >  
> >  /*
> > + * Similar to split_page except the page is already free. As this is only
> > + * being used for migration, the migratetype of the block also changes.
> > + */
> > +int split_free_page(struct page *page)
> > +{
> > +	unsigned int order;
> > +	unsigned long watermark;
> > +	struct zone *zone;
> > +
> > +	BUG_ON(!PageBuddy(page));
> > +
> > +	zone = page_zone(page);
> > +	order = page_order(page);
> > +
> > +	/* Obey watermarks or the system could deadlock */
> > +	watermark = low_wmark_pages(zone) + (1 << order);
> > +	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +		return 0;
> 
> OK, there is no way in which the code-reader can work out why this is
> here.  What deadlock?
> 
It's a general comment on watermarks. Allocators shouldn't allow the
watermarks to be breached so that there are always pages for things like
TIF_MEMDIE. Changed the comment to 
	/* Obey watermarks as if the page was being allocated */
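The check can be read as simple arithmetic. The sketch below is a simplified
model, not the kernel's zone_watermark_ok(): it assumes lowmem reserves and
allocation flags play no part and that the split is allowed only while free
pages stay strictly above the mark. model_split_allowed() and its parameters
are hypothetical names.

```c
#include <stdbool.h>

/*
 * Simplified model: a free page of the given order may be split only if the
 * zone's free pages exceed the low watermark plus the size of the page.
 */
static bool model_split_allowed(unsigned long free_pages,
				unsigned long low_wmark,
				unsigned int order)
{
	unsigned long watermark = low_wmark + (1UL << order);

	return free_pages > watermark;
}
```

For example, with a low watermark of 100 pages and an order-9 page, the
model requires more than 612 free pages before the split goes ahead.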
> > +	/* Remove page from free list */
> > +	list_del(&page->lru);
> > +	zone->free_area[order].nr_free--;
> > +	rmv_page_order(page);
> > +	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > +
> > +	/* Split into individual pages */
> > +	set_page_refcounted(page);
> > +	split_page(page, order);
> > +
> > +	if (order >= pageblock_order - 1) {
> > +		struct page *endpage = page + (1 << order) - 1;
> > +		for (; page < endpage; page += pageblock_nr_pages)
> > +			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > +	}
> > +
> > +	return 1 << order;
> > +}
> > +
> > +/*
> >   * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
> >   * we cheat by calling it from here, in the order > 0 path.  Saves a branch
> >   * or two.
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 351e491..3a69b48 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -892,6 +892,11 @@ static const char * const vmstat_text[] = {
> >  	"allocstall",
> >  
> >  	"pgrotated",
> > +
> > +	"compact_blocks_moved",
> > +	"compact_pages_moved",
> > +	"compact_pagemigrate_failed",
> 
> Should we present these on CONFIG_COMPACTION=n kernels?
> 
To do it would require changes to direct compaction as well. I'll do it
as a patch on top of the series as an incremental change to this patch
will be a mess.
> Does all this code really need to iterate across individual pfn's like
> this?  We can use the buddy structures to go straight to all of a
> zone's order-N free pages, can't we?  Wouldn't that save a whole heap
> of fruitless linear searching?
> 
You could do as you suggest, but it would not reduce scanning. If anything,
it will increase it.
The objective is to move pages into the smallest number of pageblocks. For
that, we want all the free pages within a given range no matter what their
current order in the free lists is. Doing what you suggest would involve
scanning the buddy lists which is potentially more pages than a linear scan
of a range.
Here is a roll-up of the suggestions you made
==== CUT HERE ====
mm,compaction: Various fixes to the patch 'Memory compaction core'
 o Have CONFIG_COMPACTION depend on HUGETLB_PAGE instead of HUGETLBFS
 o Use unsigned long instead of int for page counters
 o Simplify logic in isolate_freepages_block() and isolate_migratepages()
 o Optimise isolate_freepages_block to use a cursor
 o Use bool instead of int for true/false
 o Clarify some comments
 o Improve control flow in isolate_migratepages()
 o Add newlines for clarity
 o Simplify loop in compact_zone
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/Kconfig      |    2 +-
 mm/compaction.c |   81 +++++++++++++++++++++++++++++++-----------------------
 mm/page_alloc.c |    2 +-
 3 files changed, 48 insertions(+), 37 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 4fd75a0..a275a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -177,7 +177,7 @@ config COMPACTION
 	bool "Allow for memory compaction"
 	def_bool y
 	select MIGRATION
-	depends on EXPERIMENTAL && HUGETLBFS && MMU
+	depends on EXPERIMENTAL && HUGETLB_PAGE && MMU
 	help
 	  Allows the compaction of memory for the allocation of huge pages.
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 3bb65d7..38b54e2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -40,10 +40,10 @@ struct compact_control {
 	struct zone *zone;
 };
 
-static int release_freepages(struct list_head *freelist)
+static unsigned long release_freepages(struct list_head *freelist)
 {
 	struct page *page, *next;
-	int count = 0;
+	unsigned long count = 0;
 
 	list_for_each_entry_safe(page, next, freelist, lru) {
 		list_del(&page->lru);
@@ -55,28 +55,33 @@ static int release_freepages(struct list_head *freelist)
 }
 
 /* Isolate free pages onto a private freelist. Must hold zone->lock */
-static int isolate_freepages_block(struct zone *zone,
+static unsigned long isolate_freepages_block(struct zone *zone,
 				unsigned long blockpfn,
 				struct list_head *freelist)
 {
 	unsigned long zone_end_pfn, end_pfn;
 	int total_isolated = 0;
+	struct page *cursor;
 
 	/* Get the last PFN we should scan for free pages at */
 	zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	end_pfn = blockpfn + pageblock_nr_pages;
-	if (end_pfn > zone_end_pfn)
-		end_pfn = zone_end_pfn;
+	end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
 
-	/* Isolate free pages. This assumes the block is valid */
+	/* Find the first usable PFN in the block to initialise page cursor */
 	for (; blockpfn < end_pfn; blockpfn++) {
-		struct page *page;
+		if (pfn_valid_within(blockpfn))
+			break;
+	}
+	cursor = pfn_to_page(blockpfn);
+
+	/* Isolate free pages. This assumes the block is valid */
+	for (; blockpfn < end_pfn; blockpfn++, cursor++) {
 		int isolated, i;
+		struct page *page = cursor;
 
 		if (!pfn_valid_within(blockpfn))
 			continue;
 
-		page = pfn_to_page(blockpfn);
 		if (!PageBuddy(page))
 			continue;
 
@@ -89,38 +94,40 @@ static int isolate_freepages_block(struct zone *zone,
 		}
 
 		/* If a page was split, advance to the end of it */
-		if (isolated)
+		if (isolated) {
 			blockpfn += isolated - 1;
+			cursor += isolated - 1;
+		}
 	}
 
 	return total_isolated;
 }
 
-/* Returns 1 if the page is within a block suitable for migration to */
-static int suitable_migration_target(struct page *page)
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
 {
 
 	int migratetype = get_pageblock_migratetype(page);
 
 	/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
 	if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
-		return 0;
+		return false;
 
 	/* If the page is a large free page, then allow migration */
 	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return 1;
+		return true;
 
 	/* If the block is MIGRATE_MOVABLE, allow migration */
 	if (migratetype == MIGRATE_MOVABLE)
-		return 1;
+		return true;
 
 	/* Otherwise skip the block */
-	return 0;
+	return false;
 }
 
 /*
  * Based on information in the current compact_control, find blocks
- * suitable for isolating free pages from
+ * suitable for isolating free pages from and then isolate them.
  */
 static void isolate_freepages(struct zone *zone,
 				struct compact_control *cc)
@@ -143,7 +150,7 @@ static void isolate_freepages(struct zone *zone,
 	spin_lock_irqsave(&zone->lock, flags);
 	for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
 					pfn -= pageblock_nr_pages) {
-		int isolated;
+		unsigned long isolated;
 
 		if (!pfn_valid(pfn))
 			continue;
@@ -199,7 +206,7 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-static int too_many_isolated(struct zone *zone)
+static bool too_many_isolated(struct zone *zone)
 {
 
 	unsigned long inactive, isolated;
@@ -220,16 +227,12 @@ static unsigned long isolate_migratepages(struct zone *zone,
 					struct compact_control *cc)
 {
 	unsigned long low_pfn, end_pfn;
-	struct list_head *migratelist;
-
-	low_pfn = cc->migrate_pfn;
-	migratelist = &cc->migratepages;
+	struct list_head *migratelist = &cc->migratepages;
 
 	/* Do not scan outside zone boundaries */
-	if (low_pfn < zone->zone_start_pfn)
-		low_pfn = zone->zone_start_pfn;
+	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
 
-	/* Setup to scan one block but not past where we are migrating to */
+	/* Only scan within a pageblock boundary */
 	end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
 
 	/* Do not cross the free scanner or scan within a memory hole */
@@ -238,7 +241,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		return 0;
 	}
 
-	/* Do not isolate the world */
+	/*
+	 * Ensure that there are not too many pages isolated from the LRU
+	 * list by either parallel reclaimers or compaction. If there are,
+	 * delay for some time until fewer pages are isolated
+	 */
 	while (unlikely(too_many_isolated(zone))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
@@ -261,12 +268,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		}
 
 		/* Try isolate the page */
-		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
-			del_page_from_lru_list(zone, page, page_lru(page));
-			list_add(&page->lru, migratelist);
-			mem_cgroup_del_lru(page);
-			cc->nr_migratepages++;
-		}
+		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+			continue;
+
+		/* Successfully isolated */
+		del_page_from_lru_list(zone, page, page_lru(page));
+		list_add(&page->lru, migratelist);
+		mem_cgroup_del_lru(page);
+		cc->nr_migratepages++;
 
 		/* Avoid isolating too much */
 		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
@@ -317,6 +326,7 @@ static void update_nr_listpages(struct compact_control *cc)
 	int nr_migratepages = 0;
 	int nr_freepages = 0;
 	struct page *page;
+
 	list_for_each_entry(page, &cc->migratepages, lru)
 		nr_migratepages++;
 	list_for_each_entry(page, &cc->freepages, lru)
@@ -362,7 +372,7 @@ static int compact_finished(struct zone *zone,
 
 static int compact_zone(struct zone *zone, struct compact_control *cc)
 {
-	int ret = COMPACT_INCOMPLETE;
+	int ret;
 
 	/* Setup to move all movable pages to the end of the zone */
 	cc->migrate_pfn = zone->zone_start_pfn;
@@ -371,8 +381,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	migrate_prep();
 
-	for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+	while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
 		unsigned long nr_migrate, nr_remaining;
+
 		if (!isolate_migratepages(zone, cc))
 			continue;
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 66823bd..08b6306 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1223,7 +1223,7 @@ int split_free_page(struct page *page)
 	zone = page_zone(page);
 	order = page_order(page);
 
-	/* Obey watermarks or the system could deadlock */
+	/* Obey watermarks as if the page was being allocated */
 	watermark = low_wmark_pages(zone) + (1 << order);
 	if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
 		return 0;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: [PATCH 09/14] Add /proc trigger for memory compaction
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-07 15:39     ` Mel Gorman
  2010-04-07 18:27       ` Mel Gorman
  0 siblings, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 15:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:55PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:43 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > value is written to the file,
> 
> Might be better if "when the number 1 is written...".  That permits you
> to add 2, 3 and 4 later on.
> 
Ok.
> > all zones are compacted. The expected user
> > of such a trigger is a job scheduler that prepares the system before the
> > target application runs.
> > 
> 
> Ick.  The days of multi-user computers seem to have passed.
> 
Functionally, they shouldn't even need it. Direct compaction should work
just fine but it's the type of thing a job scheduler might want so it could
easily work out how many huge pages it potentially has in advance for example.
The same information could be figured out if your kpagemap-foo was strong
enough.
It would also be useful for debugging direct compaction in the same way
drop_caches can be useful. i.e. it's rarely the right thing to use but
it can be handy to illustrate a point. I didn't want to write that into
the docs though.
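For a job scheduler or a debugging session, triggering compaction from
userspace could look like the sketch below. trigger_compaction() is a
hypothetical helper; the path is passed in so the function can be exercised
against an ordinary file, and on a real system it would be
/proc/sys/vm/compact_memory (only present with CONFIG_COMPACTION, and
typically writable by root only).

```c
#include <stdio.h>

/*
 * Write "1" to the given trigger file.
 * Returns 0 on success and -1 if the file cannot be opened or written.
 */
static int trigger_compaction(const char *path)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fputs("1\n", f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}
```

A caller would use it as trigger_compaction("/proc/sys/vm/compact_memory")
and treat a non-zero return as "compaction trigger unavailable".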
> > ...
> >
> > +/* Compact all zones within a node */
> > +static int compact_node(int nid)
> > +{
> > +	int zoneid;
> > +	pg_data_t *pgdat;
> > +	struct zone *zone;
> > +
> > +	if (nid < 0 || nid >= nr_node_ids || !node_online(nid))
> > +		return -EINVAL;
> > +	pgdat = NODE_DATA(nid);
> > +
> > +	/* Flush pending updates to the LRU lists */
> > +	lru_add_drain_all();
> > +
> > +	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> > +		struct compact_control cc;
> > +
> > +		zone = &pgdat->node_zones[zoneid];
> > +		if (!populated_zone(zone))
> > +			continue;
> > +
> > +		cc.nr_freepages = 0;
> > +		cc.nr_migratepages = 0;
> > +		cc.zone = zone;
> 
> It would be better to do
> 
> 	struct compact_control cc = {
> 		.nr_freepages = 0,
> 		etc
> 
> because if you later add more fields to compact_control, everything
> else works by magick.  That's served us pretty well with
> writeback_control, scan_control, etc.
> 	
Done in the patch below. It will then collide with a later patch where
order is introduced, but it's a trivial fixup to move the initialisation.
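The benefit of the designated-initialiser form is that any member not named
is zero-initialised, so fields added later start from a known state. A
minimal sketch, using a hypothetical cut-down structure (struct cc_model and
make_cc() are illustrative names, not the real compact_control):

```c
/* Hypothetical cut-down structure, loosely following compact_control */
struct cc_model {
	unsigned long nr_freepages;
	unsigned long nr_migratepages;
	unsigned long free_pfn;
	unsigned long migrate_pfn;
	int order;		/* the kind of field that gets added later */
	const void *zone;
};

static struct cc_model make_cc(const void *zone)
{
	/*
	 * Members not named here (free_pfn, migrate_pfn, order) are
	 * zero-initialised. Note that `zone` is copied at this point, so it
	 * must already hold its final value when the initialiser runs.
	 */
	struct cc_model cc = {
		.nr_freepages = 0,
		.nr_migratepages = 0,
		.zone = zone,
	};

	return cc;
}
```

One consequence of the second point: when the structure is declared inside a
loop, any variable used in the initialiser has to be assigned before the
declaration, not after it.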
> > +		INIT_LIST_HEAD(&cc.freepages);
> > +		INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +		compact_zone(zone, &cc);
> > +
> > +		VM_BUG_ON(!list_empty(&cc.freepages));
> > +		VM_BUG_ON(!list_empty(&cc.migratepages));
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/* Compact all nodes in the system */
> > +static int compact_nodes(void)
> > +{
> > +	int nid;
> > +
> > +	for_each_online_node(nid)
> > +		compact_node(nid);
> 
> What if a node goes offline?
> 
Then it won't be in the online map?
> > +	return COMPACT_COMPLETE;
> > +}
> > +
> >
==== CUT HERE ====
mm,compaction: Tighten up the allowed values for compact_memory and initialisation
This patch updates the documentation on compact_memory to only define 1
as an allowed value in case it needs to be expanded later. It also
changes how a compact_control structure is initialised to avoid
potential trouble in the future.
This is a fix to the patch "Add /proc trigger for memory compaction".
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/sysctl/vm.txt |    9 ++++-----
 mm/compaction.c             |    9 +++++----
 2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 803c018..3b3fa1b 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -67,11 +67,10 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
 
 compact_memory
 
-Available only when CONFIG_COMPACTION is set. When an arbitrary value
-is written to the file, all zones are compacted such that free memory
-is available in contiguous blocks where possible. This can be important
-for example in the allocation of huge pages although processes will also
-directly compact memory as required.
+Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
+all zones are compacted such that free memory is available in contiguous
+blocks where possible. This can be important for example in the allocation of
+huge pages although processes will also directly compact memory as required.
 
 ==============================================================
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 615b811..d9c5733 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -393,15 +393,16 @@ static int compact_node(int nid)
 	lru_add_drain_all();
 
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
-		struct compact_control cc;
+		struct compact_control cc = {
+			.nr_freepages = 0,
+			.nr_migratepages = 0,
+			.zone = zone,
+		};
 
 		zone = &pgdat->node_zones[zoneid];
 		if (!populated_zone(zone))
 			continue;
 
-		cc.nr_freepages = 0;
-		cc.nr_migratepages = 0;
-		cc.zone = zone;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
--
* Re: [PATCH 10/14] Add /sys trigger for per-node memory compaction
  2010-04-07  0:05   ` Andrew Morton
  2010-04-07  0:31     ` KAMEZAWA Hiroyuki
@ 2010-04-07 15:42     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 15:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:59PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:44 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > This patch adds a per-node sysfs file called compact. When the file is
> > written to, each zone in that node is compacted. The intention is that this
> > would be used by something like a job scheduler in a batch system before
> > a job starts so that the job can allocate the maximum number of
> > hugepages without significant start-up cost.
> 
> Would it make more sense if this was a per-memcg thing rather than a
> per-node thing?
> 
Kamezawa Hiroyuki covered this perfectly. memcg doesn't care and while
cpuset might, there are a lot more people working with nodes than there
are with cpuset.
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [PATCH 11/14] Direct compact when a high-order allocation fails
  2010-04-07  0:06   ` Andrew Morton
@ 2010-04-07 16:06     ` Mel Gorman
  2010-04-07 18:29     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 16:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:03PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:45 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation.  With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> 
> Does this work?
> 
Well, yes or there wouldn't be a marked reduction in the latency to allocate
a huge page as linked to in the leader and the difference in allocation
success rates on ppc64 would not be so marked.
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> 
> So someone else can get in and steal it.  How is that resolved?
> 
It isn't; lumpy reclaim has a similar problem. The pages could be captured,
of course, but so far stealing has only been a problem under very heavy
memory pressure.
> Please expound upon the relationship between the icky pageblock_order
> and the caller's desired allocation order here. 
Compaction works on the same units as anti-fragmentation does - the
pageblock_order. It could work on units smaller than that when selecting
pages to migrate from and to, but there would be little advantage for
some additional complexity.
The caller's desired allocation order determines if compaction has
finished or not after a pageblock of pages has been migrated.
> The compaction design
> seems fairly fixated upon pageblock_order - what happens if the caller
> wanted something larger than pageblock_order? 
Then it would get tricky. Selecting for migration stays simple but there would
be additional complexity in finding 2 or more adjacent naturally-aligned
MIGRATE_MOVABLE blocks to migrate to. As pageblock_order is related to the
default huge page size, I'd wonder what caller would be routinely allocating
larger pages?
> The
> less-than-pageblock_order case seems pretty obvious, although perhaps
> wasteful?
> 
compact_finished() could be called more regularly but the waste is minimal. At
worst, a few more pages get migrated that weren't necessary for the caller
to successfully allocate. This is not massively dissimilar to how direct
reclaim can reclaim slightly more pages than necessary.
> >
> > ...
> >
> > +static unsigned long compact_zone_order(struct zone *zone,
> > +						int order, gfp_t gfp_mask)
> > +{
> > +	struct compact_control cc = {
> > +		.nr_freepages = 0,
> > +		.nr_migratepages = 0,
> > +		.order = order,
> > +		.migratetype = allocflags_to_migratetype(gfp_mask),
> > +		.zone = zone,
> > +	};
> 
> yeah, like that.
> 
> > +	INIT_LIST_HEAD(&cc.freepages);
> > +	INIT_LIST_HEAD(&cc.migratepages);
> > +
> > +	return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > +			int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > +	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > +	int may_enter_fs = gfp_mask & __GFP_FS;
> > +	int may_perform_io = gfp_mask & __GFP_IO;
> > +	unsigned long watermark;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	int rc = COMPACT_SKIPPED;
> > +
> > +	/*
> > +	 * Check whether it is worth even starting compaction. The order check is
> > +	 * made because an assumption is made that the page allocator can satisfy
> > +	 * the "cheaper" orders without taking special steps
> > +	 */
> > +	if (order <= PAGE_ALLOC_COSTLY_ORDER 
> 
> Was that a correct decision?  If we perform compaction when smaller
> allocation attempts fail, will the kernel get better, or worse?
> 
I think better but there are concerns about LRU churn and it might encourage
increased use of high-order allocations. The desire is to try compaction out
first with huge pages and move towards lifting this restriction on order later.
> And how do we save my order-4-allocating wireless driver? 
Ultimately, it could perform a subset of compaction that doesn't go to
sleep but migration isn't up to that right now.
> That would
> require that kswapd perform the compaction for me, perhaps?
> 
> > || !may_enter_fs || !may_perform_io)
> 
> Would be nice to add some comments explaining this a bit more. 
> Compaction doesn't actually perform IO, nor enter filesystems, does it?
> 
Compaction doesn't, but migration can and you don't know in advance if
it will need to or not. Migration would itself need to take a GFP mask
of what was and wasn't allowed during the course of migration for these
checks to be moved.
Not impossible, just not done as of this time.
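For reference, the gate being discussed boils down to something like the
following user-space sketch. The flag bit values are made up for the
example and are not the kernel's __GFP_IO/__GFP_FS values, and
direct_compaction_allowed() is a hypothetical helper name; only the shape
of the condition is taken from the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3 /* value used by the kernel */
#define SKETCH_GFP_IO 0x1u        /* illustrative bit, not the kernel's */
#define SKETCH_GFP_FS 0x2u        /* illustrative bit, not the kernel's */

/*
 * Direct compaction is only attempted for costly orders, and only when
 * the caller allows both IO and filesystem entry, because migration may
 * end up needing either of them.
 */
static bool direct_compaction_allowed(int order, unsigned int gfp_mask)
{
	bool may_enter_fs = (gfp_mask & SKETCH_GFP_FS) != 0;
	bool may_perform_io = (gfp_mask & SKETCH_GFP_IO) != 0;

	assert(order >= 0);

	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return false;
	return may_enter_fs && may_perform_io;
}
```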
> > +		return rc;
> > +
> > +	count_vm_event(COMPACTSTALL);
> > +
> > +	/* Compact each zone in the list */
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > +								nodemask) {
> > +		int fragindex;
> > +		int status;
> > +
> > +		/*
> > +		 * Watermarks for order-0 must be met for compaction. Note
> > +		 * the 2UL. This is because during migration, copies of
> > +		 * pages need to be allocated and for a short time, the
> > +		 * footprint is higher
> > +		 */
> > +		watermark = low_wmark_pages(zone) + (2UL << order);
> > +		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > +			continue;
> 
> ooh, so that starts to explain split_free_page().  But
> split_free_page() didn't do the 2UL thing.
> 
No, but split_free_page() knows exactly how much it is removing at that
time. At this point, there is a worst-case expectation that the pages being
migrated from and to are both isolated. At no point should they all be
allocated at any given time but it's not checking against deadlocks.
> Surely these things are racy?  So we'll deadlock less often :(
> 
It won't deadlock; this is only a heuristic that guesses whether compaction
is likely to succeed or not. The watermarks are rechecked every time pages
are taken off the free list.
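To put numbers on the 2UL part, a small sketch of the arithmetic follows.
compaction_watermark() is a hypothetical helper name; the expression
itself is the one from the quoted hunk.

```c
#include <assert.h>

/*
 * Order-0 watermark that must be met before compaction is attempted:
 * the low watermark plus twice the size of the requested allocation,
 * covering the source pages and their copies in the worst case.
 */
static unsigned long compaction_watermark(unsigned long low_wmark,
					  unsigned int order)
{
	assert(order < 8 * sizeof(unsigned long) - 1);
	return low_wmark + (2UL << order);
}
```

For an order-9 request and a low watermark of 1000 pages, this asks for
1000 + 1024 = 2024 free pages before compaction starts.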
> > +		/*
> > +		 * fragmentation index determines if allocation failures are
> > +		 * due to low memory or external fragmentation
> > +		 *
> > +		 * index of -1 implies allocations might succeed depending
> > +		 * 	on watermarks
> > +		 * index towards 0 implies failure is due to lack of memory
> > +		 * index towards 1000 implies failure is due to fragmentation
> > +		 *
> > +		 * Only compact if a failure would be due to fragmentation.
> > +		 */
> > +		fragindex = fragmentation_index(zone, order);
> > +		if (fragindex >= 0 && fragindex <= 500)
> > +			continue;
> > +
> > +		if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > +			rc = COMPACT_PARTIAL;
> > +			break;
> > +		}
> 
> Why are we doing all this handwavy stuff?  Why not just try a
> compaction run and see if it worked? 
Because if that index is not matched, it really is a waste of time to
try compacting. It just won't work but it'll do a full scan of the zone
figuring that out.
> That would be more robust/reliable, surely?
> 
We'll also eventually get a bug report on low-memory situations causing
large amounts of CPU to be consumed in compaction without the pages
being allocated. Granted, we wouldn't get them until compaction was also
working for the lower orders but we'd get the report eventually.
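A user-space sketch of the index calculation may help show why a low
index means compaction is pointless. Only the final expression is taken
from the quoted hunk of __fragmentation_index(); the early-return cases
are elided in the quote and are not reproduced here, so the sketch simply
requires at least one free block. The struct and function names are
assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields of struct contig_page_info used here */
struct contig_info_sketch {
	uint64_t free_pages;        /* total free pages in the zone */
	uint64_t free_blocks_total; /* number of free blocks of any order */
};

/* Returns a value scaled by 1000; higher means more fragmented */
static int frag_index_sketch(unsigned int order,
			     const struct contig_info_sketch *info)
{
	uint64_t requested = 1ULL << order;
	uint64_t scaled;

	assert(info->free_blocks_total > 0);

	scaled = (1000 + info->free_pages * 1000ULL / requested)
			/ info->free_blocks_total;
	return (int)(1000 - (int64_t)scaled);
}
```

With 1024 free pages spread over 1024 order-0 blocks, an order-9 request
gives an index of 998, so fragmentation is to blame and compaction is
worth a try. With 512 free pages in only 4 blocks the index is 500, which
the check in the patch treats as a lack of memory.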
> > +		status = compact_zone_order(zone, order, gfp_mask);
> > +		rc = max(status, rc);
> > +
> > +		if (zone_watermark_ok(zone, order, watermark, 0, 0))
> > +			break;
> > +	}
> > +
> > +	return rc;
> > +}
> > +
> > +
> >  /* Compact all zones within a node */
> >  static int compact_node(int nid)
> >  {
> >
> > ...
> >
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -561,7 +561,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> >   * The value can be used to determine if page reclaim or compaction
> >   * should be used
> >   */
> > -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  {
> >  	unsigned long requested = 1UL << order;
> >  
> > @@ -581,6 +581,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> >  	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
> >  }
> >  
> > +/* Same as __fragmentation index but allocs contig_page_info on stack */
> > +int fragmentation_index(struct zone *zone, unsigned int order)
> > +{
> > +	struct contig_page_info info;
> > +
> > +	fill_contig_page_info(zone, order, &info);
> > +	return __fragmentation_index(order, &info);
> > +}
> >  
> >  static void extfrag_show_print(struct seq_file *m,
> >  					pg_data_t *pgdat, struct zone *zone)
> > @@ -596,7 +604,7 @@ static void extfrag_show_print(struct seq_file *m,
> >  				zone->name);
> >  	for (order = 0; order < MAX_ORDER; ++order) {
> >  		fill_contig_page_info(zone, order, &info);
> > -		index = fragmentation_index(order, &info);
> > +		index = __fragmentation_index(order, &info);
> >  		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> >  	}
> >  
> > @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
> 
> CONFIG_COMPACTION=n?
> 
Yeah, it should be.
> >
> > ...
> >
> 
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed
  2010-04-07  0:06   ` Andrew Morton
@ 2010-04-07 16:11     ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 16:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:13PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:46 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The kernel applies some heuristics when deciding if memory should be
> > compacted or reclaimed to satisfy a high-order allocation. One of these
> > is based on the fragmentation. If the index is below 500, memory will
> > not be compacted. This choice is arbitrary and not based on data. To
> > help optimise the system and set a sensible default for this value, this
> > patch adds a sysctl extfrag_threshold. The kernel will only compact
> > memory if the fragmentation index is above the extfrag_threshold.
> 
> Was this the most robust, reliable, no-2am-phone-calls thing we could
> have done?
> 
> What about, say, just doing a bit of both until something worked? 
I guess you could but that is not a million miles away from what
currently happens.
This heuristic is basically "based on free memory layout, how likely is
compaction to succeed?". It makes a decision based on that. A later
patch then checks if the guess was right. If not, just try direct
reclaim for a bit before trying compaction again.
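Spelled out as code, the decision looks roughly like the sketch below. The
enum and function names are hypothetical, and the 0..1000 range for the
threshold is an assumption matching the scale of the index; the
comparisons mirror the fragindex checks in try_to_compact_pages() with
the hard-coded 500 replaced by the tunable.

```c
#include <assert.h>

enum compact_choice {
	CHOICE_TRY_ALLOC, /* index -1: allocation might already succeed */
	CHOICE_RECLAIM,   /* failure looks like a lack of memory */
	CHOICE_COMPACT,   /* failure looks like external fragmentation */
};

static enum compact_choice choose_action(int fragindex, int extfrag_threshold)
{
	assert(fragindex >= -1 && fragindex <= 1000);
	assert(extfrag_threshold >= 0 && extfrag_threshold <= 1000);

	if (fragindex == -1)
		return CHOICE_TRY_ALLOC;
	if (fragindex <= extfrag_threshold)
		return CHOICE_RECLAIM;
	return CHOICE_COMPACT;
}
```

Raising the threshold makes the kernel more reluctant to compact;
lowering it towards 0 approaches "always try compaction".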
> For
> extra smarts we could remember what worked best last time, and make
> ourselves more likely to try that next time.
> 
With the later patch, this is essentially what we do. Granted, we
remember the opposite: "If the kernel guesses wrong, then don't compact
for a short while before trying again".
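A minimal user-space sketch of that kind of back-off is below. It is not
the code from the later patch: the struct, its fields and both function
names are hypothetical, and "now" stands in for a tick counter such as
jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state for the sketch */
struct zone_sketch {
	unsigned long compact_resume; /* tick at which compaction may resume */
	bool deferred;                /* true while a deferral is active */
};

/* Record that compaction guessed wrong and should pause until 'resume' */
static void defer_compaction_sketch(struct zone_sketch *zone,
				    unsigned long resume)
{
	assert(zone);
	zone->compact_resume = resume;
	zone->deferred = true;
}

/* Returns true while compaction should still be skipped for this zone */
static bool compaction_deferred_sketch(struct zone_sketch *zone,
				       unsigned long now)
{
	assert(zone);

	if (!zone->deferred)
		return false;

	/* Wraparound-tolerant test for "now is before compact_resume" */
	if ((long)(now - zone->compact_resume) < 0)
		return true;

	zone->deferred = false;
	return false;
}
```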
> Or whatever, but extfrag_threshold must die!  And replacing it with a
> hardwired constant doesn't count ;)
> 
I think what you have in mind is "just try compaction every time" but my
concern about that is we'll hit a corner case where a lot of CPU time is
taken scanning zones uselessly. That is what this heuristic and the
back-off logic in a later patch was meant to avoid. I haven't thought of
a better alternative :/
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 13/14] Do not compact within a preferred zone after a compaction failure
  2010-04-07  0:06   ` Andrew Morton
  2010-04-07  0:55     ` Andrea Arcangeli
@ 2010-04-07 16:32     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 16:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:16PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:47 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > The fragmentation index may indicate that a failure is due to external
> > fragmentation but after a compaction run completes, it is still possible
> > for an allocation to fail. There are two obvious reasons as to why
> > 
> >   o Page migration cannot move all pages so fragmentation remains
> >   o A suitable page may exist but watermarks are not met
> > 
> > In the event of compaction followed by an allocation failure, this patch
> > defers further compaction in the zone for a period of time. The zone that
> > is deferred is the first zone in the zonelist - i.e. the preferred zone.
> > To defer compaction in the other zones, the information would need to be
> > stored in the zonelist or implemented similar to the zonelist_cache.
> > This would impact the fast-paths and is not justified at this time.
> > 
> 
> Your patch, it sucks!
> 
> > ---
> >  include/linux/compaction.h |   35 +++++++++++++++++++++++++++++++++++
> >  include/linux/mmzone.h     |    7 +++++++
> >  mm/page_alloc.c            |    5 ++++-
> >  3 files changed, 46 insertions(+), 1 deletions(-)
> > 
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index ae98afc..2a02719 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -18,6 +18,32 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
> >  extern int fragmentation_index(struct zone *zone, unsigned int order);
> >  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> >  			int order, gfp_t gfp_mask, nodemask_t *mask);
> > +
> > +/* defer_compaction - Do not compact within a zone until a given time */
> > +static inline void defer_compaction(struct zone *zone, unsigned long resume)
> > +{
> > +	/*
> > +	 * This function is called when compaction fails to result in a page
> > +	 * allocation success. This is somewhat unsatisfactory as the failure
> > +	 * to compact has nothing to do with time and everything to do with
> > +	 * the requested order, the number of free pages and watermarks. How
> > +	 * to wait on that is more unclear, but the answer would apply to
> > +	 * other areas where the VM waits based on time.
> > +	 */
> 
> c'mon, let's not make this rod for our backs.
> 
> The "A suitable page may exist but watermarks are not met" case can be
> addressed by testing the watermarks up-front, surely?
> 
Nope, because the number of pages free at each order changes before and
after compaction and you don't know by how much in advance. It wouldn't
be appropriate to assume perfect compaction because unmovable and
reclaimable pages are free.
> I bet the "Page migration cannot move all pages so fragmentation
> remains" case can be addressed by setting some metric in the zone, and
> suitably modifying that as a result of ongoing activity.
> To tell the
> zone "hey, compaction might be worth trying now".  That sucks too, but not
> so much.
> 
> Or something.  Putting a wallclock-based throttle on it like this
> really does reduce the usefulness of the whole feature.
> 
When it gets down to it, this patch was about paranoia. If the
heuristics on compaction-avoidance didn't work out, I didn't want
compaction to keep pounding.
That said, this patch would also hide the bug report telling us this happened
and was a mistake. A bug report detailing high oprofile usage in compaction
will be much easier to come across than a report on defer_compaction()
being called too often.
Please drop this patch.
> Internet: "My application works OK on a hard disk but fails when I use an SSD!". 
> 
> akpm: "Tell Mel!"
> 
Mel is in and he is listening.
-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages
  2010-04-07  0:06   ` Andrew Morton
@ 2010-04-07 16:49     ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 16:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:23PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:48 +0100
> Mel Gorman <mel@csn.ul.ie> wrote:
> 
> > PageAnon pages that are unmapped may or may not have an anon_vma so are
> > not currently migrated. However, a swap cache page can be migrated and
> > fits this description. This patch identifies page swap caches and allows
> > them to be migrated but ensures that no attempt is made to remap the pages,
> > as doing so could access an already freed anon_vma.
> > 
> > ...
> >
> > @@ -484,7 +484,8 @@ static int fallback_migrate_page(struct address_space *mapping,
> >   *   < 0 - error code
> >   *  == 0 - success
> >   */
> > -static int move_to_new_page(struct page *newpage, struct page *page)
> > +static int move_to_new_page(struct page *newpage, struct page *page,
> > +						int remap_swapcache)
> 
> You're not a fan of `bool'.
> 
This function existed before compaction and returns an error code rather
than a true/false value.
> >  {
> >  	struct address_space *mapping;
> >  	int rc;
> > @@ -519,10 +520,12 @@ static int move_to_new_page(struct page *newpage, struct page *page)
> >  	else
> >  		rc = fallback_migrate_page(mapping, newpage, page);
> >  
> > -	if (!rc)
> > -		remove_migration_ptes(page, newpage);
> > -	else
> > +	if (rc) {
> >  		newpage->mapping = NULL;
> > +	} else {
> > +		if (remap_swapcache) 
> > +			remove_migration_ptes(page, newpage);
> > +	}
> >  
> >  	unlock_page(newpage);
> >  
> > @@ -539,6 +542,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  	int rc = 0;
> >  	int *result = NULL;
> >  	struct page *newpage = get_new_page(page, private, &result);
> > +	int remap_swapcache = 1;
> >  	int rcu_locked = 0;
> >  	int charge = 0;
> >  	struct mem_cgroup *mem = NULL;
> > @@ -600,18 +604,27 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  		rcu_read_lock();
> >  		rcu_locked = 1;
> >  
> > -		/*
> > -		 * If the page has no mappings any more, just bail. An
> > -		 * unmapped anon page is likely to be freed soon but worse,
> > -		 * it's possible its anon_vma disappeared between when
> > -		 * the page was isolated and when we reached here while
> > -		 * the RCU lock was not held
> > -		 */
> > -		if (!page_mapped(page))
> > -			goto rcu_unlock;
> > +		/* Determine how to safely use anon_vma */
> > +		if (!page_mapped(page)) {
> > +			if (!PageSwapCache(page))
> > +				goto rcu_unlock;
> >  
> > -		anon_vma = page_anon_vma(page);
> > -		atomic_inc(&anon_vma->external_refcount);
> > +			/*
> > +			 * We cannot be sure that the anon_vma of an unmapped
> > +			 * swapcache page is safe to use.
> 
> Why not?  A full explanation here would be nice.
Patch below.
> 
> > 			   In this case, the
> > +			 * swapcache page gets migrated but the pages are not
> > +			 * remapped
> > +			 */
> > +			remap_swapcache = 0;
> > +		} else { 
> > +			/*
> > +			 * Take a reference count on the anon_vma if the
> > +			 * page is mapped so that it is guaranteed to
> > +			 * exist when the page is remapped later
> > +			 */
> > +			anon_vma = page_anon_vma(page);
> > +			atomic_inc(&anon_vma->external_refcount);
> > +		}
> >  	}
> >  
> >  	/*
> > @@ -646,9 +659,9 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> >  
> >  skip_unmap:
> >  	if (!page_mapped(page))
> > -		rc = move_to_new_page(newpage, page);
> > +		rc = move_to_new_page(newpage, page, remap_swapcache);
> >  
> > -	if (rc)
> > +	if (rc && remap_swapcache)
> >  		remove_migration_ptes(page, page);
> >  rcu_unlock:
> 
Patch that updates the comment if you prefer it is as follows
==== CUT HERE ====
mm,compaction: Expand comment on unmapped page swap cache
The comment on the handling of anon_vma for unmapped pages is a bit
sparse. Expand it.
This is a fix to the patch "mm,migration: Allow the migration of
PageSwapCache pages"
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/migrate.c |   12 +++++++++---
 1 files changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 0356e64..281a239 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -611,9 +611,15 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 
 			/*
 			 * We cannot be sure that the anon_vma of an unmapped
-			 * swapcache page is safe to use. In this case, the
-			 * swapcache page gets migrated but the pages are not
-			 * remapped
+			 * swapcache page is safe to use because we don't
+			 * know in advance if the VMA that this page belonged
+			 * to still exists. If the VMA and others sharing the
+			 * data have been freed, then the anon_vma could
+			 * already be invalid.
+			 *
+			 * To avoid this possibility, swapcache pages get
+			 * migrated but are not remapped when migration
+			 * completes
 			 */
 			remap_swapcache = 0;
 		} else { 
^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH 09/14] Add /proc trigger for memory compaction
  2010-04-07 15:39     ` Mel Gorman
@ 2010-04-07 18:27       ` Mel Gorman
  0 siblings, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 18:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
> mm,compaction: Tighten up the allowed values for compact_memory and initialisation
> 
Minor mistake in the initialisation part of the patch
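The ordering problem the fix that follows addresses can be shown with a
small user-space sketch: a designated initializer captures the value of
"zone" at the point of declaration, which is before the per-iteration
lookup has happened. All names here (demo_zone, demo_compact_control,
make_control) are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

struct demo_zone { int id; };

struct demo_compact_control {
	unsigned long nr_freepages;
	unsigned long nr_migratepages;
	struct demo_zone *zone;
};

/* Mirrors the fixed ordering: look the zone up first, then store it */
static struct demo_compact_control
make_control(struct demo_zone *zones, size_t zoneid)
{
	struct demo_compact_control cc = {
		.nr_freepages = 0,
		.nr_migratepages = 0,
		/* .zone is deliberately not set here; it is not known yet */
	};
	struct demo_zone *zone;

	assert(zones);

	zone = &zones[zoneid];
	cc.zone = zone;
	return cc;
}
```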
==== CUT HERE ====
mm,compaction: Initialise cc->zone at the correct time
Init cc->zone after we know what zone we are looking for. This is a fix
to the fix patch "mm,compaction: Tighten up the allowed values for
compact_memory and initialisation"
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/compaction.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index d9c5733..effe57d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -396,13 +396,13 @@ static int compact_node(int nid)
 		struct compact_control cc = {
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
-			.zone = zone,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
 		if (!populated_zone(zone))
 			continue;
 
+		cc.zone = zone;
 		INIT_LIST_HEAD(&cc.freepages);
 		INIT_LIST_HEAD(&cc.migratepages);
 
^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH 11/14] Direct compact when a high-order allocation fails
  2010-04-07  0:06   ` Andrew Morton
  2010-04-07 16:06     ` Mel Gorman
@ 2010-04-07 18:29     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-07 18:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:06:03PM -0700, Andrew Morton wrote:
> > @@ -896,6 +904,9 @@ static const char * const vmstat_text[] = {
> >  	"compact_blocks_moved",
> >  	"compact_pages_moved",
> >  	"compact_pagemigrate_failed",
> > +	"compact_stall",
> > +	"compact_fail",
> > +	"compact_success",
> 
> CONFIG_COMPACTION=n?
> 
This patch goes on top of the series. It looks big but it's mainly
moving code.
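The pattern used is the usual one of a real helper under the config
option and an empty static inline stub otherwise, so the caller needs no
#ifdef of its own. A stripped-down user-space sketch, with
DEMO_CONFIG_COMPACTION, DEMO_MAX_ORDER and the demo_* names all being
hypothetical stand-ins:

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_ORDER 11 /* assumed for the sketch */

struct demo_page { int order; };

#ifdef DEMO_CONFIG_COMPACTION
static unsigned long demo_compact_events; /* stands in for the vmstat counters */
static struct demo_page demo_pool;

/* Real helper: the only place that touches the compaction counters */
static struct demo_page *demo_direct_compact(unsigned int order)
{
	if (!order)
		return NULL;
	demo_compact_events++;
	demo_pool.order = (int)order;
	return &demo_pool;
}
#else
/* Stub: no counters exist, so nothing here may refer to them */
static inline struct demo_page *demo_direct_compact(unsigned int order)
{
	(void)order;
	return NULL;
}
#endif

/* The caller is identical in both configurations */
static struct demo_page *demo_alloc_slowpath(unsigned int order)
{
	struct demo_page *page;

	assert(order < DEMO_MAX_ORDER);

	page = demo_direct_compact(order);
	if (page)
		return page;

	/* The real slow path would go on to direct reclaim here */
	return NULL;
}
```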
==== CUT HERE ====
mm,compaction: Do not display compaction-related stats when !CONFIG_COMPACTION
Although compaction can be disabled from .config, the vmstat entries
still exist. This patch removes the vmstat entries. As page_alloc.c
refers directly to the counters, the patch introduces
__alloc_pages_direct_compact() to isolate use of the counters.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/vmstat.h |    2 +
 mm/page_alloc.c        |   92 ++++++++++++++++++++++++++++++++---------------
 mm/vmstat.c            |    2 +
 3 files changed, 66 insertions(+), 30 deletions(-)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index b4b4d34..7f43ccd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,8 +43,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_COMPACTION
 		COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
 		COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
+#endif
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46f6be4..514cc96 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1756,6 +1756,59 @@ out:
 	return page;
 }
 
+#ifdef CONFIG_COMPACTION
+/* Try memory compaction for high-order allocations before reclaim */
+static struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	int migratetype, unsigned long *did_some_progress)
+{
+	struct page *page;
+
+	if (!order)
+		return NULL;
+
+	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
+								nodemask);
+	if (*did_some_progress != COMPACT_SKIPPED) {
+
+		/* Page migration frees to the PCP lists but we want merging */
+		drain_pages(get_cpu());
+		put_cpu();
+
+		page = get_page_from_freelist(gfp_mask, nodemask,
+				order, zonelist, high_zoneidx,
+				alloc_flags, preferred_zone,
+				migratetype);
+		if (page) {
+			__count_vm_event(COMPACTSUCCESS);
+			return page;
+		}
+
+		/*
+		 * It's bad if compaction run occurs and fails.
+		 * The most likely reason is that pages exist,
+		 * but not enough to satisfy watermarks.
+		 */
+		count_vm_event(COMPACTFAIL);
+
+		cond_resched();
+	}
+
+	return NULL;
+}
+#else
+static inline struct page *
+__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
+	struct zonelist *zonelist, enum zone_type high_zoneidx,
+	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
+	int migratetype, unsigned long *did_some_progress)
+{
+	return NULL;
+}
+#endif /* CONFIG_COMPACTION */
+
 /* The really slow allocator path where we enter direct reclaim */
 static inline struct page *
 __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
@@ -1769,36 +1822,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	cond_resched();
 
-	/* Try memory compaction for high-order allocations before reclaim */
-	if (order) {
-		*did_some_progress = try_to_compact_pages(zonelist,
-						order, gfp_mask, nodemask);
-		if (*did_some_progress != COMPACT_SKIPPED) {
-
-			/* Page migration frees to the PCP lists but we want merging */
-			drain_pages(get_cpu());
-			put_cpu();
-
-			page = get_page_from_freelist(gfp_mask, nodemask,
-					order, zonelist, high_zoneidx,
-					alloc_flags, preferred_zone,
-					migratetype);
-			if (page) {
-				__count_vm_event(COMPACTSUCCESS);
-				return page;
-			}
-
-			/*
-			 * It's bad if compaction run occurs and fails.
-			 * The most likely reason is that pages exist,
-			 * but not enough to satisfy watermarks.
-			 */
-			count_vm_event(COMPACTFAIL);
-
-			cond_resched();
-		}
-	}
-
 	/* We now go into synchronous reclaim */
 	cpuset_memory_pressure_bump();
 	p->flags |= PF_MEMALLOC;
@@ -1972,6 +1995,15 @@ rebalance:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	/* Try direct compaction */
+	page = __alloc_pages_direct_compact(gfp_mask, order,
+					zonelist, high_zoneidx,
+					nodemask,
+					alloc_flags, preferred_zone,
+					migratetype, &did_some_progress);
+	if (page)
+		goto got_pg;
+
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
 					zonelist, high_zoneidx,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2780a36..0a58cbe 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,12 +901,14 @@ static const char * const vmstat_text[] = {
 
 	"pgrotated",
 
+#ifdef CONFIG_COMPACTION
 	"compact_blocks_moved",
 	"compact_pages_moved",
 	"compact_pagemigrate_failed",
 	"compact_stall",
 	"compact_fail",
 	"compact_success",
+#endif
 
 #ifdef CONFIG_HUGETLB_PAGE
 	"htlb_buddy_alloc_success",
^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH 08/14] Memory compaction core
  2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
  2010-04-07  0:05   ` Andrew Morton
@ 2010-04-08 16:59   ` Mel Gorman
  2010-04-08 17:06     ` Andrea Arcangeli
  1 sibling, 1 reply; 56+ messages in thread
From: Mel Gorman @ 2010-04-08 16:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Fri, Apr 02, 2010 at 05:02:42PM +0100, Mel Gorman wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
> 
When merging compaction and transparent huge pages, Andrea spotted and
fixed this problem in his tree but it should go to mmotm as well.
Thanks Andrea.
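For anyone skimming, the scan loop after the fix that follows behaves
like the user-space sketch below: free pages are stepped over one at a
time and their order is never read, so there is nothing that can go stale
without zone->lock. The scan_page struct and count_candidates() are
hypothetical names for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct scan_page {
	bool buddy; /* page is free in the buddy allocator */
	bool lru;   /* page is in use and could be isolated for migration */
};

/*
 * Count migration candidates. Free pages are skipped individually
 * rather than jumping ahead by (1 << order), at the cost of visiting
 * every free page in the range.
 */
static size_t count_candidates(const struct scan_page *pages, size_t nr)
{
	size_t found = 0;

	assert(pages || nr == 0);

	for (size_t pfn = 0; pfn < nr; pfn++) {
		if (pages[pfn].buddy)
			continue;
		if (pages[pfn].lru)
			found++;
	}
	return found;
}
```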
==== CUT HERE ====
mm,compaction: page buddy can go away before reading page_order while isolating pages for migration
From: Andrea Arcangeli <aarcange@redhat.com>
zone->lock isn't held so the optimisation is unsafe. The page could be
allocated between when PageBuddy is checked and page_order() is called. The
scanner will harmlessly walk the other free pages so let's just skip this
optimisation.
This is a fix to the patch "Memory compaction core".
[mel@csn.ul.ie: Expanded the changelog]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/compaction.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index dadad52..4fb33f6 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -262,10 +262,8 @@ static unsigned long isolate_migratepages(struct zone *zone,
 
 		/* Get the page and skip if free */
 		page = pfn_to_page(low_pfn);
-		if (PageBuddy(page)) {
-			low_pfn += (1 << page_order(page)) - 1;
+		if (PageBuddy(page))
 			continue;
-		}
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
^ permalink raw reply related	[flat|nested] 56+ messages in thread
* Re: [PATCH 08/14] Memory compaction core
  2010-04-08 16:59   ` Mel Gorman
@ 2010-04-08 17:06     ` Andrea Arcangeli
  0 siblings, 0 replies; 56+ messages in thread
From: Andrea Arcangeli @ 2010-04-08 17:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Thu, Apr 08, 2010 at 05:59:54PM +0100, Mel Gorman wrote:
> On Fri, Apr 02, 2010 at 05:02:42PM +0100, Mel Gorman wrote:
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> > 
> 
> When merging compaction and transparent huge pages, Andrea spotted and
> fixed this problem in his tree but it should go to mmotm as well.
> 
> Thanks Andrea.
Thanks Mel for submitting this fix!
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [PATCH 05/14] Export unusable free space index via /proc/unusable_index
  2010-04-07  0:05   ` Andrew Morton
  2010-04-07 10:35     ` Mel Gorman
@ 2010-04-13 12:42     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-13 12:42 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:37PM -0700, Andrew Morton wrote:
> <SNIP>
> 
> All this code will be bloat for most people, I suspect.  Can we find a
> suitable #ifdef wrapper to keep my cellphone happy?
> 
==== CUT HERE ====
mm,compaction: Move unusable_index to debugfs
unusable_index can be worked out from userspace but for debugging and tuning
compaction, it'd be best for all users to have the same information. This
patch moves unusable_index to debugfs where it is both easier to configure
out and easier to remove at some future date.
This is a fix to the patch "Export unusable free space index via
/proc/unusable_index"
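As a rough illustration of how the index could be worked out from userspace,
the sketch below recomputes it from the per-order free block counts that
/proc/buddyinfo reports for a zone. The function name
unusable_index_from_counts, the NR_ORDERS constant (assumed to be 11, matching
the 11 columns in the sample output) and the array layout are hypothetical and
not part of this patch. The arithmetic is meant to mirror
unusable_free_index() and returns the index scaled by 1000.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 11 orders (0..10), as in the sample unusable_index output */
#define NR_ORDERS 11

/*
 * nr_free[o] is the number of free blocks of order o in one zone, i.e. one
 * row of /proc/buddyinfo. Returns the unusable free space index for the
 * requested order, scaled by 1000 (0 => none unusable, 1000 => all unusable).
 */
static int unusable_index_from_counts(const unsigned long nr_free[NR_ORDERS],
				      unsigned int order)
{
	uint64_t free_pages = 0;	/* all free pages in the zone */
	uint64_t usable_pages = 0;	/* free pages in blocks >= order */
	unsigned int o;

	assert(order < NR_ORDERS);

	for (o = 0; o < NR_ORDERS; o++) {
		uint64_t pages = (uint64_t)nr_free[o] << o;

		free_pages += pages;
		if (o >= order)
			usable_pages += pages;
	}

	/* No free memory is interpreted as all free memory being unusable */
	if (free_pages == 0)
		return 1000;

	return (int)((free_pages - usable_pages) * 1000 / free_pages);
}
```

For a zone with 8 free order-0 pages and 4 free order-1 blocks, this gives 0
for order 0, 500 (0.500) for order 1 and 1000 (1.000) for order 2.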
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/filesystems/proc.txt |   13 +---
 mm/vmstat.c                        |  183 ++++++++++++++++++++----------------
 2 files changed, 105 insertions(+), 91 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index e87775a..74d2605 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -453,7 +453,6 @@ Table 1-5: Kernel info in /proc
  sys         See chapter 2                                     
  sysvipc     Info of SysVIPC Resources (msg, sem, shm)		(2.4)
  tty	     Info of tty drivers
- unusable_index Additional page allocator information (see text)(2.5)
  uptime      System uptime                                     
  version     Kernel version                                    
  video	     bttv info of video resources			(2.4)
@@ -611,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -652,16 +651,6 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
-> cat /proc/unusable_index
-Node 0, zone      DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
-Node 0, zone   Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
-
-The unusable free space index measures how much of the available free
-memory cannot be used to satisfy an allocation of a given size and is a
-value between 0 and 1. The higher the value, the more of free memory is
-unusable and by implication, the worse the external fragmentation is. This
-can be expressed as a percentage by multiplying by 100.
-
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2fb4986..0dcf08d 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -453,7 +453,6 @@ static int frag_show(struct seq_file *m, void *arg)
 	return 0;
 }
 
-
 struct contig_page_info {
 	unsigned long free_pages;
 	unsigned long free_blocks_total;
@@ -495,64 +494,6 @@ static void fill_contig_page_info(struct zone *zone,
 	}
 }
 
-/*
- * Return an index indicating how much of the available free memory is
- * unusable for an allocation of the requested size.
- */
-static int unusable_free_index(unsigned int order,
-				struct contig_page_info *info)
-{
-	/* No free memory is interpreted as all free memory is unusable */
-	if (info->free_pages == 0)
-		return 1000;
-
-	/*
-	 * Index should be a value between 0 and 1. Return a value to 3
-	 * decimal places.
-	 *
-	 * 0 => no fragmentation
-	 * 1 => high fragmentation
-	 */
-	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
-
-}
-
-static void unusable_show_print(struct seq_file *m,
-					pg_data_t *pgdat, struct zone *zone)
-{
-	unsigned int order;
-	int index;
-	struct contig_page_info info;
-
-	seq_printf(m, "Node %d, zone %8s ",
-				pgdat->node_id,
-				zone->name);
-	for (order = 0; order < MAX_ORDER; ++order) {
-		fill_contig_page_info(zone, order, &info);
-		index = unusable_free_index(order, &info);
-		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
-	}
-
-	seq_putc(m, '\n');
-}
-
-/*
- * Display unusable free space index
- * XXX: Could be a lot more efficient, but it's not a critical path
- */
-static int unusable_show(struct seq_file *m, void *arg)
-{
-	pg_data_t *pgdat = (pg_data_t *)arg;
-
-	/* check memoryless node */
-	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
-		return 0;
-
-	walk_zones_in_node(m, pgdat, unusable_show_print);
-
-	return 0;
-}
-
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -703,25 +644,6 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
-static const struct seq_operations unusable_op = {
-	.start	= frag_start,
-	.next	= frag_next,
-	.stop	= frag_stop,
-	.show	= unusable_show,
-};
-
-static int unusable_open(struct inode *inode, struct file *file)
-{
-	return seq_open(file, &unusable_op);
-}
-
-static const struct file_operations unusable_file_ops = {
-	.open		= unusable_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-};
-
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1066,10 +988,113 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
-	proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
 	return 0;
 }
 module_init(setup_vmstat)
+
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+static struct dentry *extfrag_debug_root;
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+				struct contig_page_info *info)
+{
+	/* No free memory is interpreted as all free memory is unusable */
+	if (info->free_pages == 0)
+		return 1000;
+
+	/*
+	 * Index should be a value between 0 and 1. Return a value to 3
+	 * decimal places.
+	 *
+	 * 0 => no fragmentation
+	 * 1 => high fragmentation
+	 */
+	return div_u64((info->free_pages - (info->free_blocks_suitable << order)) * 1000ULL, info->free_pages);
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = unusable_free_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ *
+ * The unusable free space index measures how much of the available free
+ * memory cannot be used to satisfy an allocation of a given size and is a
+ * value between 0 and 1. The higher the value, the more of free memory is
+ * unusable and by implication, the worse the external fragmentation is. This
+ * can be expressed as a percentage by multiplying by 100.
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	/* check memoryless node */
+	if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+		return 0;
+
+	walk_zones_in_node(m, pgdat, unusable_show_print);
+
+	return 0;
+}
+
+static const struct seq_operations unusable_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+	.open		= unusable_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
+static int __init extfrag_debug_init(void)
+{
+	extfrag_debug_root = debugfs_create_dir("extfrag", NULL);
+	if (!extfrag_debug_root)
+		return -ENOMEM;
+
+	if (!debugfs_create_file("unusable_index", 0444,
+			extfrag_debug_root, NULL, &unusable_file_ops))
+		return -ENOMEM;
+
+	return 0;
+}
+
+module_init(extfrag_debug_init);
+#endif
* Re: [PATCH 06/14] Export fragmentation index via /proc/extfrag_index
  2010-04-07  0:05   ` Andrew Morton
  2010-04-07 10:46     ` Mel Gorman
@ 2010-04-13 12:43     ` Mel Gorman
  1 sibling, 0 replies; 56+ messages in thread
From: Mel Gorman @ 2010-04-13 12:43 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
	David Rientjes, Minchan Kim, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
	Rik van Riel, linux-kernel, linux-mm
On Tue, Apr 06, 2010 at 05:05:42PM -0700, Andrew Morton wrote:
> On Fri,  2 Apr 2010 17:02:40 +0100
> > Fragmentation index is a value that makes sense when an allocation of a
> > given size would fail. The index indicates whether an allocation failure is
> > due to a lack of memory (values towards 0) or due to external fragmentation
> > (value towards 1).  For the most part, the huge page size will be the size
> > of interest but not necessarily so it is exported on a per-order and per-zone
> > basis via /proc/extfrag_index
> 
> (/proc/sys/vm?)
> 
> Like unusable_index, this seems awfully specialised.  Perhaps we could
> hide it under CONFIG_MEL, or even put it in debugfs with the intention
> of removing it in 6 or 12 months time. 
> <SNIP>
==== CUT HERE ====
mm,compaction: Move extfrag_index to debugfs
extfrag_index can be worked out from userspace but for debugging and
tuning compaction, it'd be best for all users to have the same
information. This patch moves extfrag_index to debugfs where it is both
easier to configure out and remove at some future date.
This is a fix to the patch "Export fragmentation index via
/proc/extfrag_index". When merged, it'll collide with the patch "Direct
compact when a high-order allocation fails" but the resolution is
relatively straightforward: preserve the fragmentation_index functions
and delete the proc-related functions as they are now at the bottom of
the file under #ifdef CONFIG_DEBUG_FS.
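As a rough illustration of working the index out from userspace, the sketch
below recomputes it from the per-order free block counts that /proc/buddyinfo
reports for a zone. The function name fragmentation_index_from_counts, the
NR_ORDERS constant (assumed to be 11) and the array layout are hypothetical and
not part of this patch. The final expression follows the return statement of
fragmentation_index(); returning 0 when the zone has no free blocks at all is
an assumption about the part of that function not shown in this diff. Values
are scaled by 1000, with -1000 meaning the allocation would succeed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 11 orders (0..10) */
#define NR_ORDERS 11

/*
 * nr_free[o] is the number of free blocks of order o in one zone, i.e. one
 * row of /proc/buddyinfo. Returns the fragmentation index scaled by 1000.
 */
static int fragmentation_index_from_counts(const unsigned long nr_free[NR_ORDERS],
					   unsigned int order)
{
	uint64_t free_pages = 0;	/* all free pages in the zone */
	uint64_t blocks_total = 0;	/* free blocks of any order */
	uint64_t blocks_suitable = 0;	/* order-sized blocks available */
	uint64_t requested;
	unsigned int o;

	assert(order < NR_ORDERS);
	requested = 1ULL << order;

	for (o = 0; o < NR_ORDERS; o++) {
		blocks_total += nr_free[o];
		free_pages += (uint64_t)nr_free[o] << o;
		if (o >= order)
			blocks_suitable += (uint64_t)nr_free[o] << (o - order);
	}

	/* Assumption: no free blocks at all is reported as 0 (lack of memory) */
	if (blocks_total == 0)
		return 0;

	/* The index is only meaningful when the request would fail */
	if (blocks_suitable)
		return -1000;

	/* Towards 0: lack of memory. Towards 1000: external fragmentation. */
	return 1000 - (int)((1000 + free_pages * 1000 / requested) / blocks_total);
}
```

With 16 free order-0 pages and nothing larger, an order-4 request gives 875
(0.875), pointing at external fragmentation, while an order-0 request gives
-1000 because it would succeed.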
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 Documentation/filesystems/proc.txt |   14 +----
 mm/vmstat.c                        |  110 ++++++++++++++++++------------------
 2 files changed, 57 insertions(+), 67 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 66ebc11..74d2605 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -422,7 +422,6 @@ Table 1-5: Kernel info in /proc
  filesystems Supported filesystems                             
  driver	     Various drivers grouped here, currently rtc (2.4)
  execdomains Execdomains, related to security			(2.4)
- extfrag_index Additional page allocator information (see text) (2.5)
  fb	     Frame Buffer devices				(2.4)
  fs	     File system parameters, currently nfs/exports	(2.4)
  ide         Directory containing info about the IDE subsystem 
@@ -611,7 +610,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
 available in ZONE_NORMAL, etc... 
 
 More information relevant to external fragmentation can be found in
-pagetypeinfo and extfrag_index.
+pagetypeinfo.
 
 > cat /proc/pagetypeinfo
 Page block order: 9
@@ -652,17 +651,6 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
 also be allocatable although a lot of filesystem metadata may have to be
 reclaimed to achieve this.
 
-> cat /proc/extfrag_index
-Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
-Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
-
-The external fragmentation index, is only meaningful if an allocation
-would fail and indicates what the failure is due to. A value of -1 such as
-in many of the examples above states that the allocation would succeed.
-If it would fail, the value is between 0 and 1. A value tending towards
-0 implies the allocation failed due to a lack of memory. A value tending
-towards 1 implies it failed due to external fragmentation.
-
 ..............................................................................
 
 meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 582dc77..f70da05 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -522,40 +522,6 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
 	return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total);
 }
 
-
-static void extfrag_show_print(struct seq_file *m,
-					pg_data_t *pgdat, struct zone *zone)
-{
-	unsigned int order;
-	int index;
-
-	/* Alloc on stack as interrupts are disabled for zone walk */
-	struct contig_page_info info;
-
-	seq_printf(m, "Node %d, zone %8s ",
-				pgdat->node_id,
-				zone->name);
-	for (order = 0; order < MAX_ORDER; ++order) {
-		fill_contig_page_info(zone, order, &info);
-		index = fragmentation_index(order, &info);
-		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
-	}
-
-	seq_putc(m, '\n');
-}
-
-/*
- * Display fragmentation index for orders that allocations would fail for
- */
-static int extfrag_show(struct seq_file *m, void *arg)
-{
-	pg_data_t *pgdat = (pg_data_t *)arg;
-
-	walk_zones_in_node(m, pgdat, extfrag_show_print);
-
-	return 0;
-}
-
 static void pagetypeinfo_showfree_print(struct seq_file *m,
 					pg_data_t *pgdat, struct zone *zone)
 {
@@ -706,25 +672,6 @@ static const struct file_operations pagetypeinfo_file_ops = {
 	.release	= seq_release,
 };
 
-static const struct seq_operations extfrag_op = {
-	.start	= frag_start,
-	.next	= frag_next,
-	.stop	= frag_stop,
-	.show	= extfrag_show,
-};
-
-static int extfrag_open(struct inode *inode, struct file *file)
-{
-	return seq_open(file, &extfrag_op);
-}
-
-static const struct file_operations extfrag_file_ops = {
-	.open		= extfrag_open,
-	.read		= seq_read,
-	.llseek		= seq_lseek,
-	.release	= seq_release,
-};
-
 #ifdef CONFIG_ZONE_DMA
 #define TEXT_FOR_DMA(xx) xx "_dma",
 #else
@@ -1069,7 +1016,6 @@ static int __init setup_vmstat(void)
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
 	proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
-	proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
@@ -1165,6 +1111,58 @@ static const struct file_operations unusable_file_ops = {
 	.release	= seq_release,
 };
 
+static void extfrag_show_print(struct seq_file *m,
+					pg_data_t *pgdat, struct zone *zone)
+{
+	unsigned int order;
+	int index;
+
+	/* Alloc on stack as interrupts are disabled for zone walk */
+	struct contig_page_info info;
+
+	seq_printf(m, "Node %d, zone %8s ",
+				pgdat->node_id,
+				zone->name);
+	for (order = 0; order < MAX_ORDER; ++order) {
+		fill_contig_page_info(zone, order, &info);
+		index = fragmentation_index(order, &info);
+		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+	}
+
+	seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+	pg_data_t *pgdat = (pg_data_t *)arg;
+
+	walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+	return 0;
+}
+
+static const struct seq_operations extfrag_op = {
+	.start	= frag_start,
+	.next	= frag_next,
+	.stop	= frag_stop,
+	.show	= extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+	.open		= extfrag_open,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= seq_release,
+};
+
 static int __init extfrag_debug_init(void)
 {
 	extfrag_debug_root = debugfs_create_dir("extfrag", NULL);
@@ -1175,6 +1173,10 @@ static int __init extfrag_debug_init(void)
 			extfrag_debug_root, NULL, &unusable_file_ops))
 		return -ENOMEM;
 
+	if (!debugfs_create_file("extfrag_index", 0444,
+			extfrag_debug_root, NULL, &extfrag_file_ops))
+		return -ENOMEM;
+
 	return 0;
 }
 
Thread overview: 56+ messages
2010-04-02 16:02 [PATCH 0/14] Memory Compaction v7 Mel Gorman
2010-04-02 16:02 ` [PATCH 01/14] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  9:56     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 02/14] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
2010-04-02 16:02 ` [PATCH 03/14] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  0:10     ` Rik van Riel
2010-04-07 10:01     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 04/14] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:22     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:35     ` Mel Gorman
2010-04-13 12:42     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 06/14] Export fragmentation index via /proc/extfrag_index Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 10:46     ` Mel Gorman
2010-04-13 12:43     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 07/14] Move definition for LRU isolation modes to a header Mel Gorman
2010-04-02 16:02 ` [PATCH 08/14] Memory compaction core Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 15:21     ` Mel Gorman
2010-04-08 16:59   ` Mel Gorman
2010-04-08 17:06     ` Andrea Arcangeli
2010-04-02 16:02 ` [PATCH 09/14] Add /proc trigger for memory compaction Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07 15:39     ` Mel Gorman
2010-04-07 18:27       ` Mel Gorman
2010-04-02 16:02 ` [PATCH 10/14] Add /sys trigger for per-node " Mel Gorman
2010-04-07  0:05   ` Andrew Morton
2010-04-07  0:31     ` KAMEZAWA Hiroyuki
2010-04-06 21:56       ` Andrew Morton
2010-04-07  1:19         ` KAMEZAWA Hiroyuki
2010-04-07 15:42     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 11/14] Direct compact when a high-order allocation fails Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:06     ` Mel Gorman
2010-04-07 18:29     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 12/14] Add a tunable that decides when memory should be compacted and when it should be reclaimed Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:11     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 13/14] Do not compact within a preferred zone after a compaction failure Mel Gorman
2010-04-07  0:06   ` Andrew Morton
2010-04-07  0:55     ` Andrea Arcangeli
2010-04-07 16:32     ` Mel Gorman
2010-04-02 16:02 ` [PATCH 14/14] mm,migration: Allow the migration of PageSwapCache pages Mel Gorman
2010-04-06  6:54   ` KAMEZAWA Hiroyuki
2010-04-06 15:37   ` Minchan Kim
2010-04-07  0:06   ` Andrew Morton
2010-04-07 16:49     ` Mel Gorman
2010-04-06 14:47 ` [PATCH 0/14] Memory Compaction v7 Tarkan Erimer
2010-04-06 15:00   ` Mel Gorman
2010-04-06 15:03     ` Tarkan Erimer
  -- strict thread matches above, loose matches on Subject: below --
2010-03-30  9:14 [PATCH 0/14] Memory Compaction v6 Mel Gorman
2010-03-30  9:14 ` [PATCH 05/14] Export unusable free space index via /proc/unusable_index Mel Gorman