* [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:12 ` KAMEZAWA Hiroyuki
2010-02-19 16:43 ` Rik van Riel
2010-02-18 18:02 ` [PATCH 02/12] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
` (10 subsequent siblings)
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 23 +++++++++++++++++++++++
mm/migrate.c | 12 ++++++++++++
mm/rmap.c | 10 +++++-----
3 files changed, 40 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b019ae6..6b5a1a9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -29,6 +29,9 @@ struct anon_vma {
#ifdef CONFIG_KSM
atomic_t ksm_refcount;
#endif
+#ifdef CONFIG_MIGRATION
+ atomic_t migrate_refcount;
+#endif
/*
* NOTE: the LSB of the head.next is set by
* mm_take_all_locks() _after_ taking the above lock. So the
@@ -61,6 +64,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
return 0;
}
#endif /* CONFIG_KSM */
+#ifdef CONFIG_MIGRATION
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+ atomic_set(&anon_vma->migrate_refcount, 0);
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return atomic_read(&anon_vma->migrate_refcount);
+}
+#else
+static inline void migrate_refcount_init(struct anon_vma *anon_vma)
+{
+}
+
+static inline int migrate_refcount(struct anon_vma *anon_vma)
+{
+ return 0;
+}
+#endif /* CONFIG_MIGRATE */
static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/migrate.c b/mm/migrate.c
index 9a0db5b..63addfa 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -551,6 +551,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
int rcu_locked = 0;
int charge = 0;
struct mem_cgroup *mem = NULL;
+ struct anon_vma *anon_vma = NULL;
if (!newpage)
return -ENOMEM;
@@ -607,6 +608,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
if (PageAnon(page)) {
rcu_read_lock();
rcu_locked = 1;
+ anon_vma = page_anon_vma(page);
+ atomic_inc(&anon_vma->migrate_refcount);
}
/*
@@ -646,6 +649,15 @@ skip_unmap:
if (rc)
remove_migration_ptes(page, page);
rcu_unlock:
+
+ /* Drop an anon_vma reference if we took one */
+ if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+ int empty = list_empty(&anon_vma->head);
+ spin_unlock(&anon_vma->lock);
+ if (empty)
+ anon_vma_free(anon_vma);
+ }
+
if (rcu_locked)
rcu_read_unlock();
uncharge:
diff --git a/mm/rmap.c b/mm/rmap.c
index 278cd27..11ba74a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -172,7 +172,8 @@ void anon_vma_unlink(struct vm_area_struct *vma)
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
+ empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
+ !migrate_refcount(anon_vma);
spin_unlock(&anon_vma->lock);
if (empty)
@@ -185,6 +186,7 @@ static void anon_vma_ctor(void *data)
spin_lock_init(&anon_vma->lock);
ksm_refcount_init(anon_vma);
+ migrate_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}
@@ -1228,10 +1230,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
/*
* Note: remove_migration_ptes() cannot use page_lock_anon_vma()
* because that depends on page_mapped(); but not all its usages
- * are holding mmap_sem, which also gave the necessary guarantee
- * (that this anon_vma's slab has not already been destroyed).
- * This needs to be reviewed later: avoiding page_lock_anon_vma()
- * is risky, and currently limits the usefulness of rmap_walk().
+ * are holding mmap_sem. Users without mmap_sem are required to
+ * take a reference count to prevent the anon_vma disappearing
*/
anon_vma = page_anon_vma(page);
if (!anon_vma)
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating
2010-02-18 18:02 ` [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-02-19 0:12 ` KAMEZAWA Hiroyuki
2010-02-19 13:59 ` Mel Gorman
2010-02-19 16:43 ` Rik van Riel
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:12 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:31 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
>
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
I have no objection to this direction. But after this patch, you can remove
rcu_read_lock()/unlock() in unmap_and_move().
rcu_read_lock() is for guarding against anon_vma replacement.
Thanks,
-Kame
> ---
> include/linux/rmap.h | 23 +++++++++++++++++++++++
> mm/migrate.c | 12 ++++++++++++
> mm/rmap.c | 10 +++++-----
> 3 files changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b019ae6..6b5a1a9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -29,6 +29,9 @@ struct anon_vma {
> #ifdef CONFIG_KSM
> atomic_t ksm_refcount;
> #endif
> +#ifdef CONFIG_MIGRATION
> + atomic_t migrate_refcount;
> +#endif
> /*
> * NOTE: the LSB of the head.next is set by
> * mm_take_all_locks() _after_ taking the above lock. So the
> @@ -61,6 +64,26 @@ static inline int ksm_refcount(struct anon_vma *anon_vma)
> return 0;
> }
> #endif /* CONFIG_KSM */
> +#ifdef CONFIG_MIGRATION
> +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> +{
> + atomic_set(&anon_vma->migrate_refcount, 0);
> +}
> +
> +static inline int migrate_refcount(struct anon_vma *anon_vma)
> +{
> + return atomic_read(&anon_vma->migrate_refcount);
> +}
> +#else
> +static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> +{
> +}
> +
> +static inline int migrate_refcount(struct anon_vma *anon_vma)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_MIGRATE */
>
> static inline struct anon_vma *page_anon_vma(struct page *page)
> {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9a0db5b..63addfa 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -551,6 +551,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> int rcu_locked = 0;
> int charge = 0;
> struct mem_cgroup *mem = NULL;
> + struct anon_vma *anon_vma = NULL;
>
> if (!newpage)
> return -ENOMEM;
> @@ -607,6 +608,8 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> if (PageAnon(page)) {
> rcu_read_lock();
> rcu_locked = 1;
> + anon_vma = page_anon_vma(page);
> + atomic_inc(&anon_vma->migrate_refcount);
> }
>
> /*
> @@ -646,6 +649,15 @@ skip_unmap:
> if (rc)
> remove_migration_ptes(page, page);
> rcu_unlock:
> +
> + /* Drop an anon_vma reference if we took one */
> + if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> + int empty = list_empty(&anon_vma->head);
> + spin_unlock(&anon_vma->lock);
> + if (empty)
> + anon_vma_free(anon_vma);
> + }
> +
> if (rcu_locked)
> rcu_read_unlock();
> uncharge:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 278cd27..11ba74a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -172,7 +172,8 @@ void anon_vma_unlink(struct vm_area_struct *vma)
> list_del(&vma->anon_vma_node);
>
> /* We must garbage collect the anon_vma if it's empty */
> - empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma);
> + empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> + !migrate_refcount(anon_vma);
> spin_unlock(&anon_vma->lock);
>
> if (empty)
> @@ -185,6 +186,7 @@ static void anon_vma_ctor(void *data)
>
> spin_lock_init(&anon_vma->lock);
> ksm_refcount_init(anon_vma);
> + migrate_refcount_init(anon_vma);
> INIT_LIST_HEAD(&anon_vma->head);
> }
>
> @@ -1228,10 +1230,8 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
> /*
> * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
> * because that depends on page_mapped(); but not all its usages
> - * are holding mmap_sem, which also gave the necessary guarantee
> - * (that this anon_vma's slab has not already been destroyed).
> - * This needs to be reviewed later: avoiding page_lock_anon_vma()
> - * is risky, and currently limits the usefulness of rmap_walk().
> + * are holding mmap_sem. Users without mmap_sem are required to
> + * take a reference count to prevent the anon_vma disappearing
> */
> anon_vma = page_anon_vma(page);
> if (!anon_vma)
> --
> 1.6.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating
2010-02-19 0:12 ` KAMEZAWA Hiroyuki
@ 2010-02-19 13:59 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 13:59 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:12:44AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 18:02:31 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> > locking an anon_vma and it does not appear to have sufficient locking to
> > ensure the anon_vma does not disappear from under it.
> >
> > This patch copies an approach used by KSM to take a reference on the
> > anon_vma while pages are being migrated. This should prevent rmap_walk()
> > running into nasty surprises later because anon_vma has been freed.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> I have no objection to this direction. But after this patch, you can remove
> rcu_read_lock()/unlock() in unmap_and_move().
> rcu_read_lock() is for guarding against anon_vma replacement.
>
Thanks. I expected that would be the case but was going to leave at
least one kernel release between when compaction went in and I did that.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating
2010-02-18 18:02 ` [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-02-19 0:12 ` KAMEZAWA Hiroyuki
@ 2010-02-19 16:43 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2010-02-19 16:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
> locking an anon_vma and it does not appear to have sufficient locking to
> ensure the anon_vma does not disappear from under it.
>
> This patch copies an approach used by KSM to take a reference on the
> anon_vma while pages are being migrated. This should prevent rmap_walk()
> running into nasty surprises later because anon_vma has been freed.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 02/12] mm,migration: Do not try to migrate unmapped anonymous pages
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
2010-02-18 18:02 ` [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 16:45 ` Rik van Riel
2010-02-18 18:02 ` [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
` (9 subsequent siblings)
11 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
rmap_walk_anon() was triggering errors in memory compaction that looks like
use-after-free errors in anon_vma. The problem appears to be that between
the page being isolated from the LRU and rcu_read_lock() being taken, the
mapcount of the page dropped to 0 and the anon_vma was freed. This patch
skips the migration of anon pages that are not mapped by anyone.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/migrate.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 63addfa..1ce6a2f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -606,6 +606,16 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
* just care Anon page here.
*/
if (PageAnon(page)) {
+ /*
+ * If the page has no mappings any more, just bail. An
+ * unmapped anon page is likely to be freed soon but worse,
+ * it's possible its anon_vma disappeared between when
+ * the page was isolated and when we reached here while
+ * the RCU lock was not held
+ */
+ if (!page_mapcount(page))
+ goto uncharge;
+
rcu_read_lock();
rcu_locked = 1;
anon_vma = page_anon_vma(page);
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 02/12] mm,migration: Do not try to migrate unmapped anonymous pages
2010-02-18 18:02 ` [PATCH 02/12] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-02-19 16:45 ` Rik van Riel
0 siblings, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2010-02-19 16:45 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> rmap_walk_anon() was triggering errors in memory compaction that looks like
> use-after-free errors in anon_vma. The problem appears to be that between
> the page being isolated from the LRU and rcu_read_lock() being taken, the
> mapcount of the page dropped to 0 and the anon_vma was freed. This patch
> skips the migration of anon pages that are not mapped by anyone.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
2010-02-18 18:02 ` [PATCH 01/12] mm,migration: Take a reference to the anon_vma before migrating Mel Gorman
2010-02-18 18:02 ` [PATCH 02/12] mm,migration: Do not try to migrate unmapped anonymous pages Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:18 ` KAMEZAWA Hiroyuki
` (2 more replies)
2010-02-18 18:02 ` [PATCH 04/12] mm: Document /proc/pagetypeinfo Mel Gorman
` (8 subsequent siblings)
11 siblings, 3 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
For clarity of review, KSM and page migration have separate refcounts on
the anon_vma. While clear, this is a waste of memory. This patch gets
KSM and page migration to share their toys in a spirit of harmony.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/rmap.h | 50 ++++++++++++++++++--------------------------------
mm/ksm.c | 4 ++--
mm/migrate.c | 4 ++--
mm/rmap.c | 6 ++----
4 files changed, 24 insertions(+), 40 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6b5a1a9..55c0e9e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -26,11 +26,17 @@
*/
struct anon_vma {
spinlock_t lock; /* Serialize access to vma list */
-#ifdef CONFIG_KSM
- atomic_t ksm_refcount;
-#endif
-#ifdef CONFIG_MIGRATION
- atomic_t migrate_refcount;
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+
+ /*
+ * The refcount is taken by either KSM or page migration
+ * to take a reference to an anon_vma when there is no
+ * guarantee that the vma of page tables will exist for
+ * the duration of the operation. A caller that takes
+ * the reference is responsible for clearing up the
+ * anon_vma if they are the last user on release
+ */
+ atomic_t refcount;
#endif
/*
* NOTE: the LSB of the head.next is set by
@@ -44,46 +50,26 @@ struct anon_vma {
};
#ifdef CONFIG_MMU
-#ifdef CONFIG_KSM
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
+static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
{
- atomic_set(&anon_vma->ksm_refcount, 0);
+ atomic_set(&anon_vma->refcount, 0);
}
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_refcount(struct anon_vma *anon_vma)
{
- return atomic_read(&anon_vma->ksm_refcount);
+ return atomic_read(&anon_vma->refcount);
}
#else
-static inline void ksm_refcount_init(struct anon_vma *anon_vma)
+static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
{
}
-static inline int ksm_refcount(struct anon_vma *anon_vma)
+static inline int anonvma_refcount(struct anon_vma *anon_vma)
{
return 0;
}
#endif /* CONFIG_KSM */
-#ifdef CONFIG_MIGRATION
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
- atomic_set(&anon_vma->migrate_refcount, 0);
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
- return atomic_read(&anon_vma->migrate_refcount);
-}
-#else
-static inline void migrate_refcount_init(struct anon_vma *anon_vma)
-{
-}
-
-static inline int migrate_refcount(struct anon_vma *anon_vma)
-{
- return 0;
-}
-#endif /* CONFIG_MIGRATE */
static inline struct anon_vma *page_anon_vma(struct page *page)
{
diff --git a/mm/ksm.c b/mm/ksm.c
index 56a0da1..7decf73 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
struct anon_vma *anon_vma)
{
rmap_item->anon_vma = anon_vma;
- atomic_inc(&anon_vma->ksm_refcount);
+ atomic_inc(&anon_vma->refcount);
}
static void drop_anon_vma(struct rmap_item *rmap_item)
{
struct anon_vma *anon_vma = rmap_item->anon_vma;
- if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
+ if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
int empty = list_empty(&anon_vma->head);
spin_unlock(&anon_vma->lock);
if (empty)
diff --git a/mm/migrate.c b/mm/migrate.c
index 1ce6a2f..00777b0 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -619,7 +619,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
rcu_read_lock();
rcu_locked = 1;
anon_vma = page_anon_vma(page);
- atomic_inc(&anon_vma->migrate_refcount);
+ atomic_inc(&anon_vma->refcount);
}
/*
@@ -661,7 +661,7 @@ skip_unmap:
rcu_unlock:
/* Drop an anon_vma reference if we took one */
- if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
+ if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
int empty = list_empty(&anon_vma->head);
spin_unlock(&anon_vma->lock);
if (empty)
diff --git a/mm/rmap.c b/mm/rmap.c
index 11ba74a..96b5905 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -172,8 +172,7 @@ void anon_vma_unlink(struct vm_area_struct *vma)
list_del(&vma->anon_vma_node);
/* We must garbage collect the anon_vma if it's empty */
- empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
- !migrate_refcount(anon_vma);
+ empty = list_empty(&anon_vma->head) && !anonvma_refcount(anon_vma);
spin_unlock(&anon_vma->lock);
if (empty)
@@ -185,8 +184,7 @@ static void anon_vma_ctor(void *data)
struct anon_vma *anon_vma = data;
spin_lock_init(&anon_vma->lock);
- ksm_refcount_init(anon_vma);
- migrate_refcount_init(anon_vma);
+ anonvma_refcount_init(anon_vma);
INIT_LIST_HEAD(&anon_vma->head);
}
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-18 18:02 ` [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-02-19 0:18 ` KAMEZAWA Hiroyuki
2010-02-19 14:05 ` Mel Gorman
2010-02-19 5:09 ` Minchan Kim
2010-02-19 21:42 ` Rik van Riel
2 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:18 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:33 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Nitpick:
I think this refcnt has something different characteristics than other
usual refcnts. Even when refcnt goes down to 0, anon_vma will not be freed.
So, I think some kind of name as temporal_reference_count is better than
simple "refcnt". Then, it will be clearer what this refcnt is for.
Thanks,
-Kame
> ---
> include/linux/rmap.h | 50 ++++++++++++++++++--------------------------------
> mm/ksm.c | 4 ++--
> mm/migrate.c | 4 ++--
> mm/rmap.c | 6 ++----
> 4 files changed, 24 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 6b5a1a9..55c0e9e 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -26,11 +26,17 @@
> */
> struct anon_vma {
> spinlock_t lock; /* Serialize access to vma list */
> -#ifdef CONFIG_KSM
> - atomic_t ksm_refcount;
> -#endif
> -#ifdef CONFIG_MIGRATION
> - atomic_t migrate_refcount;
> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> +
> + /*
> + * The refcount is taken by either KSM or page migration
> + * to take a reference to an anon_vma when there is no
> + * guarantee that the vma of page tables will exist for
> + * the duration of the operation. A caller that takes
> + * the reference is responsible for clearing up the
> + * anon_vma if they are the last user on release
> + */
> + atomic_t refcount;
> #endif
> /*
> * NOTE: the LSB of the head.next is set by
> @@ -44,46 +50,26 @@ struct anon_vma {
> };
>
> #ifdef CONFIG_MMU
> -#ifdef CONFIG_KSM
> -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> {
> - atomic_set(&anon_vma->ksm_refcount, 0);
> + atomic_set(&anon_vma->refcount, 0);
> }
>
> -static inline int ksm_refcount(struct anon_vma *anon_vma)
> +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> {
> - return atomic_read(&anon_vma->ksm_refcount);
> + return atomic_read(&anon_vma->refcount);
> }
> #else
> -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> {
> }
>
> -static inline int ksm_refcount(struct anon_vma *anon_vma)
> +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> {
> return 0;
> }
> #endif /* CONFIG_KSM */
> -#ifdef CONFIG_MIGRATION
> -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> -{
> - atomic_set(&anon_vma->migrate_refcount, 0);
> -}
> -
> -static inline int migrate_refcount(struct anon_vma *anon_vma)
> -{
> - return atomic_read(&anon_vma->migrate_refcount);
> -}
> -#else
> -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> -{
> -}
> -
> -static inline int migrate_refcount(struct anon_vma *anon_vma)
> -{
> - return 0;
> -}
> -#endif /* CONFIG_MIGRATE */
>
> static inline struct anon_vma *page_anon_vma(struct page *page)
> {
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 56a0da1..7decf73 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
> struct anon_vma *anon_vma)
> {
> rmap_item->anon_vma = anon_vma;
> - atomic_inc(&anon_vma->ksm_refcount);
> + atomic_inc(&anon_vma->refcount);
> }
>
> static void drop_anon_vma(struct rmap_item *rmap_item)
> {
> struct anon_vma *anon_vma = rmap_item->anon_vma;
>
> - if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
> + if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> int empty = list_empty(&anon_vma->head);
> spin_unlock(&anon_vma->lock);
> if (empty)
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 1ce6a2f..00777b0 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -619,7 +619,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> rcu_read_lock();
> rcu_locked = 1;
> anon_vma = page_anon_vma(page);
> - atomic_inc(&anon_vma->migrate_refcount);
> + atomic_inc(&anon_vma->refcount);
> }
>
> /*
> @@ -661,7 +661,7 @@ skip_unmap:
> rcu_unlock:
>
> /* Drop an anon_vma reference if we took one */
> - if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> + if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> int empty = list_empty(&anon_vma->head);
> spin_unlock(&anon_vma->lock);
> if (empty)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 11ba74a..96b5905 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -172,8 +172,7 @@ void anon_vma_unlink(struct vm_area_struct *vma)
> list_del(&vma->anon_vma_node);
>
> /* We must garbage collect the anon_vma if it's empty */
> - empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> - !migrate_refcount(anon_vma);
> + empty = list_empty(&anon_vma->head) && !anonvma_refcount(anon_vma);
> spin_unlock(&anon_vma->lock);
>
> if (empty)
> @@ -185,8 +184,7 @@ static void anon_vma_ctor(void *data)
> struct anon_vma *anon_vma = data;
>
> spin_lock_init(&anon_vma->lock);
> - ksm_refcount_init(anon_vma);
> - migrate_refcount_init(anon_vma);
> + anonvma_refcount_init(anon_vma);
> INIT_LIST_HEAD(&anon_vma->head);
> }
>
> --
> 1.6.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-19 0:18 ` KAMEZAWA Hiroyuki
@ 2010-02-19 14:05 ` Mel Gorman
2010-02-19 15:01 ` Christoph Lameter
2010-02-20 3:48 ` KAMEZAWA Hiroyuki
0 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:05 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:18:59AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 18:02:33 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > For clarity of review, KSM and page migration have separate refcounts on
> > the anon_vma. While clear, this is a waste of memory. This patch gets
> > KSM and page migration to share their toys in a spirit of harmony.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Nitpick:
> I think this refcnt has something different characteristics than other
> usual refcnts. Even when refcnt goes down to 0, anon_vma will not be freed.
> So, I think some kind of name as temporal_reference_count is better than
> simple "refcnt". Then, it will be clearer what this refcnt is for.
>
When I read this in a few years, I'll have no idea what "temporal" is
referring to. The holder of this count is a process that does not
necessarily own the page or its mappings, but "remote" has special
meaning as well. "external_count"?
>
> > ---
> > include/linux/rmap.h | 50 ++++++++++++++++++--------------------------------
> > mm/ksm.c | 4 ++--
> > mm/migrate.c | 4 ++--
> > mm/rmap.c | 6 ++----
> > 4 files changed, 24 insertions(+), 40 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 6b5a1a9..55c0e9e 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -26,11 +26,17 @@
> > */
> > struct anon_vma {
> > spinlock_t lock; /* Serialize access to vma list */
> > -#ifdef CONFIG_KSM
> > - atomic_t ksm_refcount;
> > -#endif
> > -#ifdef CONFIG_MIGRATION
> > - atomic_t migrate_refcount;
> > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > +
> > + /*
> > + * The refcount is taken by either KSM or page migration
> > + * to take a reference to an anon_vma when there is no
> > + * guarantee that the vma of page tables will exist for
> > + * the duration of the operation. A caller that takes
> > + * the reference is responsible for clearing up the
> > + * anon_vma if they are the last user on release
> > + */
> > + atomic_t refcount;
> > #endif
> > /*
> > * NOTE: the LSB of the head.next is set by
> > @@ -44,46 +50,26 @@ struct anon_vma {
> > };
> >
> > #ifdef CONFIG_MMU
> > -#ifdef CONFIG_KSM
> > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > {
> > - atomic_set(&anon_vma->ksm_refcount, 0);
> > + atomic_set(&anon_vma->refcount, 0);
> > }
> >
> > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > {
> > - return atomic_read(&anon_vma->ksm_refcount);
> > + return atomic_read(&anon_vma->refcount);
> > }
> > #else
> > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > {
> > }
> >
> > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > {
> > return 0;
> > }
> > #endif /* CONFIG_KSM */
> > -#ifdef CONFIG_MIGRATION
> > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > -{
> > - atomic_set(&anon_vma->migrate_refcount, 0);
> > -}
> > -
> > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > -{
> > - return atomic_read(&anon_vma->migrate_refcount);
> > -}
> > -#else
> > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > -{
> > -}
> > -
> > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > -{
> > - return 0;
> > -}
> > -#endif /* CONFIG_MIGRATE */
> >
> > static inline struct anon_vma *page_anon_vma(struct page *page)
> > {
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 56a0da1..7decf73 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
> > struct anon_vma *anon_vma)
> > {
> > rmap_item->anon_vma = anon_vma;
> > - atomic_inc(&anon_vma->ksm_refcount);
> > + atomic_inc(&anon_vma->refcount);
> > }
> >
> > static void drop_anon_vma(struct rmap_item *rmap_item)
> > {
> > struct anon_vma *anon_vma = rmap_item->anon_vma;
> >
> > - if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
> > + if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > int empty = list_empty(&anon_vma->head);
> > spin_unlock(&anon_vma->lock);
> > if (empty)
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 1ce6a2f..00777b0 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -619,7 +619,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > rcu_read_lock();
> > rcu_locked = 1;
> > anon_vma = page_anon_vma(page);
> > - atomic_inc(&anon_vma->migrate_refcount);
> > + atomic_inc(&anon_vma->refcount);
> > }
> >
> > /*
> > @@ -661,7 +661,7 @@ skip_unmap:
> > rcu_unlock:
> >
> > /* Drop an anon_vma reference if we took one */
> > - if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > + if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > int empty = list_empty(&anon_vma->head);
> > spin_unlock(&anon_vma->lock);
> > if (empty)
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 11ba74a..96b5905 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -172,8 +172,7 @@ void anon_vma_unlink(struct vm_area_struct *vma)
> > list_del(&vma->anon_vma_node);
> >
> > /* We must garbage collect the anon_vma if it's empty */
> > - empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> > - !migrate_refcount(anon_vma);
> > + empty = list_empty(&anon_vma->head) && !anonvma_refcount(anon_vma);
> > spin_unlock(&anon_vma->lock);
> >
> > if (empty)
> > @@ -185,8 +184,7 @@ static void anon_vma_ctor(void *data)
> > struct anon_vma *anon_vma = data;
> >
> > spin_lock_init(&anon_vma->lock);
> > - ksm_refcount_init(anon_vma);
> > - migrate_refcount_init(anon_vma);
> > + anonvma_refcount_init(anon_vma);
> > INIT_LIST_HEAD(&anon_vma->head);
> > }
> >
> > --
> > 1.6.5
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-19 14:05 ` Mel Gorman
@ 2010-02-19 15:01 ` Christoph Lameter
2010-02-20 3:48 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 51+ messages in thread
From: Christoph Lameter @ 2010-02-19 15:01 UTC (permalink / raw)
To: Mel Gorman
Cc: KAMEZAWA Hiroyuki, Andrea Arcangeli, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm, hugh.dickins@tiscali.co.uk
On Fri, 19 Feb 2010, Mel Gorman wrote:
> > Nitpick:
> > I think this refcnt has something different characteristics than other
> > usual refcnts. Even when refcnt goes down to 0, anon_vma will not be freed.
> > So, I think some kind of name as temporal_reference_count is better than
> > simple "refcnt". Then, it will be clearer what this refcnt is for.
> >
>
> When I read this in a few years, I'll have no idea what "temporal" is
> referring to. The holder of this account is by a process that does not
> necessarily own the page or its mappings but "remote" has special
> meaning as well. "external_count" ?
We could think about getting rid of RCU for anon_vmas and use the refcount
for everything. Would make the handling consistent with other users but
will have performance implications.
Hugh what do you say about this?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-19 14:05 ` Mel Gorman
2010-02-19 15:01 ` Christoph Lameter
@ 2010-02-20 3:48 ` KAMEZAWA Hiroyuki
2010-02-20 9:32 ` Mel Gorman
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-20 3:48 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, 19 Feb 2010 14:05:00 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> On Fri, Feb 19, 2010 at 09:18:59AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 18 Feb 2010 18:02:33 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > For clarity of review, KSM and page migration have separate refcounts on
> > > the anon_vma. While clear, this is a waste of memory. This patch gets
> > > KSM and page migration to share their toys in a spirit of harmony.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Nitpick:
> > I think this refcnt has something different characteristics than other
> > usual refcnts. Even when refcnt goes down to 0, anon_vma will not be freed.
> > So, I think some kind of name as temporal_reference_count is better than
> > simple "refcnt". Then, it will be clearer what this refcnt is for.
> >
>
> When I read this in a few years, I'll have no idea what "temporal" is
> referring to. The holder of this account is by a process that does not
> necessarily own the page or its mappings but "remote" has special
> meaning as well. "external_count" ?
>
"external" seems good. My selection of words tends to be bad ;)
Off topic:
But as Christoph says, making this a real reference counter — as in
"if the counter goes down to 0, it's freed" — may be good.
I'm not fully aware of how anon_vma is handled after Rik's anon_vma split(?)
work. So, it may be more complicated than I'm thinking.
Thanks,
-Kame
> >
> > > ---
> > > include/linux/rmap.h | 50 ++++++++++++++++++--------------------------------
> > > mm/ksm.c | 4 ++--
> > > mm/migrate.c | 4 ++--
> > > mm/rmap.c | 6 ++----
> > > 4 files changed, 24 insertions(+), 40 deletions(-)
> > >
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 6b5a1a9..55c0e9e 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -26,11 +26,17 @@
> > > */
> > > struct anon_vma {
> > > spinlock_t lock; /* Serialize access to vma list */
> > > -#ifdef CONFIG_KSM
> > > - atomic_t ksm_refcount;
> > > -#endif
> > > -#ifdef CONFIG_MIGRATION
> > > - atomic_t migrate_refcount;
> > > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > > +
> > > + /*
> > > + * The refcount is taken by either KSM or page migration
> > > + * to take a reference to an anon_vma when there is no
> > > + * guarantee that the vma of page tables will exist for
> > > + * the duration of the operation. A caller that takes
> > > + * the reference is responsible for clearing up the
> > > + * anon_vma if they are the last user on release
> > > + */
> > > + atomic_t refcount;
> > > #endif
> > > /*
> > > * NOTE: the LSB of the head.next is set by
> > > @@ -44,46 +50,26 @@ struct anon_vma {
> > > };
> > >
> > > #ifdef CONFIG_MMU
> > > -#ifdef CONFIG_KSM
> > > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > > {
> > > - atomic_set(&anon_vma->ksm_refcount, 0);
> > > + atomic_set(&anon_vma->refcount, 0);
> > > }
> > >
> > > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > > {
> > > - return atomic_read(&anon_vma->ksm_refcount);
> > > + return atomic_read(&anon_vma->refcount);
> > > }
> > > #else
> > > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > > {
> > > }
> > >
> > > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > > {
> > > return 0;
> > > }
> > > #endif /* CONFIG_KSM */
> > > -#ifdef CONFIG_MIGRATION
> > > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > > -{
> > > - atomic_set(&anon_vma->migrate_refcount, 0);
> > > -}
> > > -
> > > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > > -{
> > > - return atomic_read(&anon_vma->migrate_refcount);
> > > -}
> > > -#else
> > > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > > -{
> > > -}
> > > -
> > > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > > -{
> > > - return 0;
> > > -}
> > > -#endif /* CONFIG_MIGRATE */
> > >
> > > static inline struct anon_vma *page_anon_vma(struct page *page)
> > > {
> > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > index 56a0da1..7decf73 100644
> > > --- a/mm/ksm.c
> > > +++ b/mm/ksm.c
> > > @@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
> > > struct anon_vma *anon_vma)
> > > {
> > > rmap_item->anon_vma = anon_vma;
> > > - atomic_inc(&anon_vma->ksm_refcount);
> > > + atomic_inc(&anon_vma->refcount);
> > > }
> > >
> > > static void drop_anon_vma(struct rmap_item *rmap_item)
> > > {
> > > struct anon_vma *anon_vma = rmap_item->anon_vma;
> > >
> > > - if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
> > > + if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > > int empty = list_empty(&anon_vma->head);
> > > spin_unlock(&anon_vma->lock);
> > > if (empty)
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 1ce6a2f..00777b0 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -619,7 +619,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > > rcu_read_lock();
> > > rcu_locked = 1;
> > > anon_vma = page_anon_vma(page);
> > > - atomic_inc(&anon_vma->migrate_refcount);
> > > + atomic_inc(&anon_vma->refcount);
> > > }
> > >
> > > /*
> > > @@ -661,7 +661,7 @@ skip_unmap:
> > > rcu_unlock:
> > >
> > > /* Drop an anon_vma reference if we took one */
> > > - if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > > + if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > > int empty = list_empty(&anon_vma->head);
> > > spin_unlock(&anon_vma->lock);
> > > if (empty)
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 11ba74a..96b5905 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -172,8 +172,7 @@ void anon_vma_unlink(struct vm_area_struct *vma)
> > > list_del(&vma->anon_vma_node);
> > >
> > > /* We must garbage collect the anon_vma if it's empty */
> > > - empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> > > - !migrate_refcount(anon_vma);
> > > + empty = list_empty(&anon_vma->head) && !anonvma_refcount(anon_vma);
> > > spin_unlock(&anon_vma->lock);
> > >
> > > if (empty)
> > > @@ -185,8 +184,7 @@ static void anon_vma_ctor(void *data)
> > > struct anon_vma *anon_vma = data;
> > >
> > > spin_lock_init(&anon_vma->lock);
> > > - ksm_refcount_init(anon_vma);
> > > - migrate_refcount_init(anon_vma);
> > > + anonvma_refcount_init(anon_vma);
> > > INIT_LIST_HEAD(&anon_vma->head);
> > > }
> > >
> > > --
> > > 1.6.5
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > Please read the FAQ at http://www.tux.org/lkml/
> > >
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-20 3:48 ` KAMEZAWA Hiroyuki
@ 2010-02-20 9:32 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-20 9:32 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Sat, Feb 20, 2010 at 12:48:47PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 19 Feb 2010 14:05:00 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > On Fri, Feb 19, 2010 at 09:18:59AM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Thu, 18 Feb 2010 18:02:33 +0000
> > > Mel Gorman <mel@csn.ul.ie> wrote:
> > >
> > > > For clarity of review, KSM and page migration have separate refcounts on
> > > > the anon_vma. While clear, this is a waste of memory. This patch gets
> > > > KSM and page migration to share their toys in a spirit of harmony.
> > > >
> > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > >
> > > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > >
> > > Nitpick:
> > > I think this refcnt has something different characteristics than other
> > > usual refcnts. Even when refcnt goes down to 0, anon_vma will not be freed.
> > > So, I think some kind of name as temporal_reference_count is better than
> > > simple "refcnt". Then, it will be clearer what this refcnt is for.
> > >
> >
> > When I read this in a few years, I'll have no idea what "temporal" is
> > referring to. The holder of this account is by a process that does not
> > necessarily own the page or its mappings but "remote" has special
> > meaning as well. "external_count" ?
> >
> "external" seems good. My selection of word is tend to be bad ;)
>
Trust me, it's not my strong point either.
> Off topic:
> But as Christoph says, make this as real reference counter as
> "if coutner goes down to 0, it's freed." may be good.
> I'm not fully aware of how anon_vma is copiled after Rik's anon_vma split(?)
> work. So, it may be complicated than I'm thinking.
>
The complexity is one factor. Rik's patches make it different for sure
because it's less clear what needs to be done with the chains. Even if
that is worked around, I'd have concerns about the refcount becoming a
highly contended cache line in some circumstances. I have taken note to
research it as a separate patch.
> > > > mm/rmap.c | 6 ++----
> > > > 4 files changed, 24 insertions(+), 40 deletions(-)
> > > >
> > > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > > index 6b5a1a9..55c0e9e 100644
> > > > --- a/include/linux/rmap.h
> > > > +++ b/include/linux/rmap.h
> > > > @@ -26,11 +26,17 @@
> > > > */
> > > > struct anon_vma {
> > > > spinlock_t lock; /* Serialize access to vma list */
> > > > -#ifdef CONFIG_KSM
> > > > - atomic_t ksm_refcount;
> > > > -#endif
> > > > -#ifdef CONFIG_MIGRATION
> > > > - atomic_t migrate_refcount;
> > > > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > > > +
> > > > + /*
> > > > + * The refcount is taken by either KSM or page migration
> > > > + * to take a reference to an anon_vma when there is no
> > > > + * guarantee that the vma of page tables will exist for
> > > > + * the duration of the operation. A caller that takes
> > > > + * the reference is responsible for clearing up the
> > > > + * anon_vma if they are the last user on release
> > > > + */
> > > > + atomic_t refcount;
> > > > #endif
> > > > /*
> > > > * NOTE: the LSB of the head.next is set by
> > > > @@ -44,46 +50,26 @@ struct anon_vma {
> > > > };
> > > >
> > > > #ifdef CONFIG_MMU
> > > > -#ifdef CONFIG_KSM
> > > > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > > > +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> > > > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > > > {
> > > > - atomic_set(&anon_vma->ksm_refcount, 0);
> > > > + atomic_set(&anon_vma->refcount, 0);
> > > > }
> > > >
> > > > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > > > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > > > {
> > > > - return atomic_read(&anon_vma->ksm_refcount);
> > > > + return atomic_read(&anon_vma->refcount);
> > > > }
> > > > #else
> > > > -static inline void ksm_refcount_init(struct anon_vma *anon_vma)
> > > > +static inline void anonvma_refcount_init(struct anon_vma *anon_vma)
> > > > {
> > > > }
> > > >
> > > > -static inline int ksm_refcount(struct anon_vma *anon_vma)
> > > > +static inline int anonvma_refcount(struct anon_vma *anon_vma)
> > > > {
> > > > return 0;
> > > > }
> > > > #endif /* CONFIG_KSM */
> > > > -#ifdef CONFIG_MIGRATION
> > > > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > > > -{
> > > > - atomic_set(&anon_vma->migrate_refcount, 0);
> > > > -}
> > > > -
> > > > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > > > -{
> > > > - return atomic_read(&anon_vma->migrate_refcount);
> > > > -}
> > > > -#else
> > > > -static inline void migrate_refcount_init(struct anon_vma *anon_vma)
> > > > -{
> > > > -}
> > > > -
> > > > -static inline int migrate_refcount(struct anon_vma *anon_vma)
> > > > -{
> > > > - return 0;
> > > > -}
> > > > -#endif /* CONFIG_MIGRATE */
> > > >
> > > > static inline struct anon_vma *page_anon_vma(struct page *page)
> > > > {
> > > > diff --git a/mm/ksm.c b/mm/ksm.c
> > > > index 56a0da1..7decf73 100644
> > > > --- a/mm/ksm.c
> > > > +++ b/mm/ksm.c
> > > > @@ -318,14 +318,14 @@ static void hold_anon_vma(struct rmap_item *rmap_item,
> > > > struct anon_vma *anon_vma)
> > > > {
> > > > rmap_item->anon_vma = anon_vma;
> > > > - atomic_inc(&anon_vma->ksm_refcount);
> > > > + atomic_inc(&anon_vma->refcount);
> > > > }
> > > >
> > > > static void drop_anon_vma(struct rmap_item *rmap_item)
> > > > {
> > > > struct anon_vma *anon_vma = rmap_item->anon_vma;
> > > >
> > > > - if (atomic_dec_and_lock(&anon_vma->ksm_refcount, &anon_vma->lock)) {
> > > > + if (atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > > > int empty = list_empty(&anon_vma->head);
> > > > spin_unlock(&anon_vma->lock);
> > > > if (empty)
> > > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > > index 1ce6a2f..00777b0 100644
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -619,7 +619,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
> > > > rcu_read_lock();
> > > > rcu_locked = 1;
> > > > anon_vma = page_anon_vma(page);
> > > > - atomic_inc(&anon_vma->migrate_refcount);
> > > > + atomic_inc(&anon_vma->refcount);
> > > > }
> > > >
> > > > /*
> > > > @@ -661,7 +661,7 @@ skip_unmap:
> > > > rcu_unlock:
> > > >
> > > > /* Drop an anon_vma reference if we took one */
> > > > - if (anon_vma && atomic_dec_and_lock(&anon_vma->migrate_refcount, &anon_vma->lock)) {
> > > > + if (anon_vma && atomic_dec_and_lock(&anon_vma->refcount, &anon_vma->lock)) {
> > > > int empty = list_empty(&anon_vma->head);
> > > > spin_unlock(&anon_vma->lock);
> > > > if (empty)
> > > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > > index 11ba74a..96b5905 100644
> > > > --- a/mm/rmap.c
> > > > +++ b/mm/rmap.c
> > > > @@ -172,8 +172,7 @@ void anon_vma_unlink(struct vm_area_struct *vma)
> > > > list_del(&vma->anon_vma_node);
> > > >
> > > > /* We must garbage collect the anon_vma if it's empty */
> > > > - empty = list_empty(&anon_vma->head) && !ksm_refcount(anon_vma) &&
> > > > - !migrate_refcount(anon_vma);
> > > > + empty = list_empty(&anon_vma->head) && !anonvma_refcount(anon_vma);
> > > > spin_unlock(&anon_vma->lock);
> > > >
> > > > if (empty)
> > > > @@ -185,8 +184,7 @@ static void anon_vma_ctor(void *data)
> > > > struct anon_vma *anon_vma = data;
> > > >
> > > > spin_lock_init(&anon_vma->lock);
> > > > - ksm_refcount_init(anon_vma);
> > > > - migrate_refcount_init(anon_vma);
> > > > + anonvma_refcount_init(anon_vma);
> > > > INIT_LIST_HEAD(&anon_vma->head);
> > > > }
> > > >
> > > > --
> > > > 1.6.5
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > > Please read the FAQ at http://www.tux.org/lkml/
> > > >
> > >
> >
> > --
> > Mel Gorman
> > Part-time Phd Student Linux Technology Center
> > University of Limerick IBM Dublin Software Lab
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-18 18:02 ` [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-02-19 0:18 ` KAMEZAWA Hiroyuki
@ 2010-02-19 5:09 ` Minchan Kim
2010-02-19 21:42 ` Rik van Riel
2 siblings, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 5:09 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> For clarity of review, KSM and page migration have separate refcounts on
> the anon_vma. While clear, this is a waste of memory. This patch gets
> KSM and page migration to share their toys in a spirit of harmony.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
When I reviewed your patch [1/12], I had the same thought.
Looks good to me.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-18 18:02 ` [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
2010-02-19 0:18 ` KAMEZAWA Hiroyuki
2010-02-19 5:09 ` Minchan Kim
@ 2010-02-19 21:42 ` Rik van Riel
2010-02-19 21:58 ` Mel Gorman
2 siblings, 1 reply; 51+ messages in thread
From: Rik van Riel @ 2010-02-19 21:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> struct anon_vma {
> spinlock_t lock; /* Serialize access to vma list */
> -#ifdef CONFIG_KSM
> - atomic_t ksm_refcount;
> -#endif
> -#ifdef CONFIG_MIGRATION
> - atomic_t migrate_refcount;
> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
> +
> + /*
> + * The refcount is taken by either KSM or page migration
> + * to take a reference to an anon_vma when there is no
> + * guarantee that the vma of page tables will exist for
> + * the duration of the operation. A caller that takes
> + * the reference is responsible for clearing up the
> + * anon_vma if they are the last user on release
> + */
> + atomic_t refcount;
Calling it just refcount is probably confusing, since
the anon_vma is also referenced by being on the chain
with others.
Maybe "other_refcount", because it is a refcount taken
by things other than VMAs? I am sure there is a better
name possible...
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-19 21:42 ` Rik van Riel
@ 2010-02-19 21:58 ` Mel Gorman
2010-02-20 0:16 ` Rik van Riel
0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 21:58 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 04:42:18PM -0500, Rik van Riel wrote:
> On 02/18/2010 01:02 PM, Mel Gorman wrote:
>
>> struct anon_vma {
>> spinlock_t lock; /* Serialize access to vma list */
>> -#ifdef CONFIG_KSM
>> - atomic_t ksm_refcount;
>> -#endif
>> -#ifdef CONFIG_MIGRATION
>> - atomic_t migrate_refcount;
>> +#if defined(CONFIG_KSM) || defined(CONFIG_MIGRATION)
>> +
>> + /*
>> + * The refcount is taken by either KSM or page migration
>> + * to take a reference to an anon_vma when there is no
>> + * guarantee that the vma of page tables will exist for
>> + * the duration of the operation. A caller that takes
>> + * the reference is responsible for clearing up the
>> + * anon_vma if they are the last user on release
>> + */
>> + atomic_t refcount;
>
> Calling it just refcount is probably confusing, since
> the anon_vma is also referenced by being on the chain
> with others.
>
> Maybe "other_refcount" because it is refcounts taken
> by things other than VMAs? I am sure there is a better
> name possible...
>
external_refcount is about as good as I can think of to explain what's
going on :/
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-19 21:58 ` Mel Gorman
@ 2010-02-20 0:16 ` Rik van Riel
2010-02-20 9:29 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: Rik van Riel @ 2010-02-20 0:16 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/19/2010 04:58 PM, Mel Gorman wrote:
> external_refcount is about as good as I can think of to explain what's
> going on :/
Sounds "good" to me. Much better than giving the wrong
impression that this is the only refcount for the anon_vma.
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration
2010-02-20 0:16 ` Rik van Riel
@ 2010-02-20 9:29 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-20 9:29 UTC (permalink / raw)
To: Rik van Riel
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On Fri, Feb 19, 2010 at 07:16:13PM -0500, Rik van Riel wrote:
> On 02/19/2010 04:58 PM, Mel Gorman wrote:
>
>> external_refcount is about as good as I can think of to explain what's
>> going on :/
>
> Sounds "good" to me. Much better than giving the wrong
> impression that this is the only refcount for the anon_vma.
>
Have renamed it so. If/when this all gets merged, I'll look into what's
required to make this a "real" refcount rather than the existing locking
mechanism. I'm very wary though because even with your anon_vma changes to
avoid excessive sharing, a refcount in there that is used in all paths might
become a hotly contended cache line. i.e. it might look nice, but it might
be a performance hit. It needs to be done carefully and as a separate
series.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 04/12] mm: Document /proc/pagetypeinfo
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (2 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 03/12] mm: Share the anon_vma ref counts between KSM and page migration Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 1:36 ` Minchan Kim
2010-02-19 21:42 ` Rik van Riel
2010-02-18 18:02 ` [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
` (7 subsequent siblings)
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch adds documentation for /proc/pagetypeinfo.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
Documentation/filesystems/proc.txt | 45 +++++++++++++++++++++++++++++++++++-
1 files changed, 44 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 0d07513..1829dfb 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -430,6 +430,7 @@ Table 1-5: Kernel info in /proc
modules List of loaded modules
mounts Mounted filesystems
net Networking info (see text)
+ pagetypeinfo Additional page allocator information (see text) (2.5)
partitions Table of partitions known to the system
pci Deprecated info of PCI bus (new way -> /proc/bus/pci/,
decoupled by lspci (2.4)
@@ -584,7 +585,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ...
Node 0, zone Normal 1 0 0 1 101 8 ...
Node 0, zone HighMem 2 0 0 1 1 0 ...
-Memory fragmentation is a problem under some workloads, and buddyinfo is a
+External fragmentation is a problem under some workloads, and buddyinfo is a
useful tool for helping diagnose these problems. Buddyinfo will give you a
clue as to how big an area you can safely allocate, or why a previous
allocation failed.
@@ -594,6 +595,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...
+More information relevant to external fragmentation can be found in
+pagetypeinfo.
+
+> cat /proc/pagetypeinfo
+Page block order: 9
+Pages per block: 512
+
+Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10
+Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0
+Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2
+Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0
+Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9
+Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0
+Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452
+Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0
+Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0
+
+Number of blocks type Unmovable Reclaimable Movable Reserve Isolate
+Node 0, zone DMA 2 0 5 1 0
+Node 0, zone DMA32 41 6 967 2 0
+
+Fragmentation avoidance in the kernel works by grouping pages of different
+migrate types into the same contiguous regions of memory called page blocks.
+A page block is typically the size of the default hugepage size e.g. 2MB on
+X86-64. By keeping pages grouped based on their ability to move, the kernel
+can reclaim pages within a page block to satisfy a high-order allocation.
+
+The pagetypeinfo begins with information on the size of a page block. It
+then gives the same type of information as buddyinfo except broken down
+by migrate-type and finishes with details on how many page blocks of each
+type exist.
+
+If min_free_kbytes has been tuned correctly (recommendations made by hugeadm
+from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can
+make an estimate of the likely number of huge pages that can be allocated
+at a given point in time. All the "Movable" blocks should be allocatable
+unless memory has been mlock()'d. Some of the Reclaimable blocks should
+also be allocatable although a lot of filesystem metadata may have to be
+reclaimed to achieve this.
+
..............................................................................
meminfo:
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 04/12] mm: Document /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 04/12] mm: Document /proc/pagetypeinfo Mel Gorman
@ 2010-02-19 1:36 ` Minchan Kim
2010-02-19 21:42 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 1:36 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds documentation for /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 04/12] mm: Document /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 04/12] mm: Document /proc/pagetypeinfo Mel Gorman
2010-02-19 1:36 ` Minchan Kim
@ 2010-02-19 21:42 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2010-02-19 21:42 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> This patch adds documentation for /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter<cl@linux-foundation.org>
> Reviewed-by: KOSAKI Motohiro<kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (3 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 04/12] mm: Document /proc/pagetypeinfo Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:21 ` KAMEZAWA Hiroyuki
2010-02-18 18:02 ` [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
` (6 subsequent siblings)
11 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
being able to hot-remove memory. The main users of page migration such as
sys_move_pages(), sys_migrate_pages() and cpuset process migration are
only beneficial on NUMA so it makes sense.
As memory compaction will operate within a zone and is useful on both NUMA
and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
user selects CONFIG_COMPACTION as an option.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
mm/Kconfig | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 17b8947..b1c2781 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -168,17 +168,29 @@ config SPLIT_PTLOCK_CPUS
default "4"
#
+# support for memory compaction
+config COMPACTION
+ bool "Allow for memory compaction"
+ def_bool y
+ select MIGRATION
+ depends on EXPERIMENTAL && HUGETLBFS
+ help
+ Allows the compaction of memory for the allocation of huge pages.
+
+#
# support for page migration
#
config MIGRATION
bool "Page migration"
def_bool y
- depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
+ depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
help
Allows the migration of the physical location of pages of processes
- while the virtual addresses are not changed. This is useful for
- example on NUMA systems to put pages nearer to the processors accessing
- the page.
+ while the virtual addresses are not changed. This is useful in
+ two situations. The first is on NUMA systems to put pages nearer
+ to the processors accessing them. The second is when allocating huge
+ pages as migration can relocate pages to satisfy a huge page
+ allocation instead of reclaiming.
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
2010-02-18 18:02 ` [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-02-19 0:21 ` KAMEZAWA Hiroyuki
2010-02-19 14:09 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:21 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:35 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> being able to hot-remove memory. The main users of page migration such as
> sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> only beneficial on NUMA so it makes sense.
>
> As memory compaction will operate within a zone and is useful on both NUMA
> and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> user selects CONFIG_COMPACTION as an option.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
But see below.
> ---
> mm/Kconfig | 20 ++++++++++++++++----
> 1 files changed, 16 insertions(+), 4 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 17b8947..b1c2781 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -168,17 +168,29 @@ config SPLIT_PTLOCK_CPUS
> default "4"
>
> #
> +# support for memory compaction
> +config COMPACTION
> + bool "Allow for memory compaction"
> + def_bool y
> + select MIGRATION
> + depends on EXPERIMENTAL && HUGETLBFS
> + help
> + Allows the compaction of memory for the allocation of huge pages.
> +
I think
+ depends on MMU
Thanks,
-Kame
> +#
> # support for page migration
> #
> config MIGRATION
> bool "Page migration"
> def_bool y
> - depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
> + depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
> help
> Allows the migration of the physical location of pages of processes
> - while the virtual addresses are not changed. This is useful for
> - example on NUMA systems to put pages nearer to the processors accessing
> - the page.
> + while the virtual addresses are not changed. This is useful in
> + two situations. The first is on NUMA systems to put pages nearer
> + to the processors accessing. The second is when allocating huge
> + pages as migration can relocate pages to satisfy a huge page
> + allocation instead of reclaiming.
>
> config PHYS_ADDR_T_64BIT
> def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> --
> 1.6.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove
2010-02-19 0:21 ` KAMEZAWA Hiroyuki
@ 2010-02-19 14:09 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:09 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:21:11AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 18:02:35 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > CONFIG_MIGRATION currently depends on CONFIG_NUMA or on the architecture
> > being able to hot-remove memory. The main users of page migration such as
> > sys_move_pages(), sys_migrate_pages() and cpuset process migration are
> > only beneficial on NUMA so it makes sense.
> >
> > As memory compaction will operate within a zone and is useful on both NUMA
> > and non-NUMA systems, this patch allows CONFIG_MIGRATION to be set if the
> > user selects CONFIG_COMPACTION as an option.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
> > Reviewed-by: Rik van Riel <riel@redhat.com>
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> But see below.
>
> > ---
> > mm/Kconfig | 20 ++++++++++++++++----
> > 1 files changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 17b8947..b1c2781 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -168,17 +168,29 @@ config SPLIT_PTLOCK_CPUS
> > default "4"
> >
> > #
> > +# support for memory compaction
> > +config COMPACTION
> > + bool "Allow for memory compaction"
> > + def_bool y
> > + select MIGRATION
> > + depends on EXPERIMENTAL && HUGETLBFS
> > + help
> > + Allows the compaction of memory for the allocation of huge pages.
> > +
>
> I think
> + depends on MMU
>
Agreed. Thanks
> > +#
> > # support for page migration
> > #
> > config MIGRATION
> > bool "Page migration"
> > def_bool y
> > - depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE
> > + depends on NUMA || ARCH_ENABLE_MEMORY_HOTREMOVE || COMPACTION
> > help
> > Allows the migration of the physical location of pages of processes
> > - while the virtual addresses are not changed. This is useful for
> > - example on NUMA systems to put pages nearer to the processors accessing
> > - the page.
> > + while the virtual addresses are not changed. This is useful in
> > + two situations. The first is on NUMA systems to put pages nearer
> > + to the processors accessing. The second is when allocating huge
> > + pages as migration can relocate pages to satisfy a huge page
> > + allocation instead of reclaiming.
> >
> > config PHYS_ADDR_T_64BIT
> > def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
> > --
> > 1.6.5
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (4 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 05/12] Allow CONFIG_MIGRATION to be set without CONFIG_NUMA or memory hot-remove Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 1:35 ` Minchan Kim
2010-02-19 21:46 ` Rik van Riel
2010-02-18 18:02 ` [PATCH 07/12] Export fragmentation " Mel Gorman
` (5 subsequent siblings)
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Unusable free space index is a measure of external fragmentation that
takes the allocation size into account. For the most part, the huge page
size will be the size of interest but not necessarily so it is exported
on a per-order and per-zone basis via /proc/unusable_index.
The index is a value between 0 and 1. It can be expressed as a
percentage by multiplying by 100 as documented in
Documentation/filesystems/proc.txt.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 13 ++++-
mm/vmstat.c | 120 ++++++++++++++++++++++++++++++++++++
2 files changed, 132 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 1829dfb..57869d0 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -443,6 +443,7 @@ Table 1-5: Kernel info in /proc
sys See chapter 2
sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4)
tty Info of tty drivers
+ unusable_index Additional page allocator information (see text)(2.5)
uptime System uptime
version Kernel version
video bttv info of video resources (2.4)
@@ -596,7 +597,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in
-pagetypeinfo.
+pagetypeinfo and unusable_index
> cat /proc/pagetypeinfo
Page block order: 9
@@ -637,6 +638,16 @@ unless memory has been mlock()'d. Some of the Reclaimable blocks should
also be allocatable although a lot of filesystem metadata may have to be
reclaimed to achieve this.
+> cat /proc/unusable_index
+Node 0, zone DMA 0.000 0.000 0.000 0.001 0.005 0.013 0.021 0.037 0.037 0.101 0.230
+Node 0, zone Normal 0.000 0.000 0.000 0.001 0.002 0.002 0.005 0.015 0.028 0.028 0.054
+
+The unusable free space index measures how much of the available free
+memory cannot be used to satisfy an allocation of a given size and is a
+value between 0 and 1. The higher the value, the more of free memory is
+unusable and by implication, the worse the external fragmentation is. This
+can be expressed as a percentage by multiplying by 100.
+
..............................................................................
meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6051fba..23f217e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -451,6 +451,106 @@ static int frag_show(struct seq_file *m, void *arg)
return 0;
}
+
+struct contig_page_info {
+ unsigned long free_pages;
+ unsigned long free_blocks_total;
+ unsigned long free_blocks_suitable;
+};
+
+/*
+ * Calculate the number of free pages in a zone, how many contiguous
+ * pages are free and how many are large enough to satisfy an allocation of
+ * the target size. Note that this function makes no attempt to estimate
+ * how many suitable free blocks there *might* be if MOVABLE pages were
+ * migrated. Calculating that is possible, but expensive and can be
+ * figured out from userspace
+ */
+static void fill_contig_page_info(struct zone *zone,
+ unsigned int suitable_order,
+ struct contig_page_info *info)
+{
+ unsigned int order;
+
+ info->free_pages = 0;
+ info->free_blocks_total = 0;
+ info->free_blocks_suitable = 0;
+
+ for (order = 0; order < MAX_ORDER; order++) {
+ unsigned long blocks;
+
+ /* Count number of free blocks */
+ blocks = zone->free_area[order].nr_free;
+ info->free_blocks_total += blocks;
+
+ /* Count free base pages */
+ info->free_pages += blocks << order;
+
+ /* Count the suitable free blocks */
+ if (order >= suitable_order)
+ info->free_blocks_suitable += blocks <<
+ (order - suitable_order);
+ }
+}
+
+/*
+ * Return an index indicating how much of the available free memory is
+ * unusable for an allocation of the requested size.
+ */
+static int unusable_free_index(unsigned int order,
+ struct contig_page_info *info)
+{
+ /* No free memory is interpreted as all free memory is unusable */
+ if (info->free_pages == 0)
+ return 1000;
+
+ /*
+ * Index should be a value between 0 and 1. Return a value to 3
+ * decimal places.
+ *
+ * 0 => no fragmentation
+ * 1 => high fragmentation
+ */
+ return ((info->free_pages - (info->free_blocks_suitable << order)) * 1000) / info->free_pages;
+
+}
+
+static void unusable_show_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+ int index;
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %d, zone %8s ",
+ pgdat->node_id,
+ zone->name);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ index = unusable_free_index(order, &info);
+ seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display unusable free space index
+ * XXX: Could be a lot more efficient, but it's not a critical path
+ */
+static int unusable_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ /* check memoryless node */
+ if (!node_state(pgdat->node_id, N_HIGH_MEMORY))
+ return 0;
+
+ walk_zones_in_node(m, pgdat, unusable_show_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -601,6 +701,25 @@ static const struct file_operations pagetypeinfo_file_ops = {
.release = seq_release,
};
+static const struct seq_operations unusable_op = {
+ .start = frag_start,
+ .next = frag_next,
+ .stop = frag_stop,
+ .show = unusable_show,
+};
+
+static int unusable_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &unusable_op);
+}
+
+static const struct file_operations unusable_file_ops = {
+ .open = unusable_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
#ifdef CONFIG_ZONE_DMA
#define TEXT_FOR_DMA(xx) xx "_dma",
#else
@@ -944,6 +1063,7 @@ static int __init setup_vmstat(void)
#ifdef CONFIG_PROC_FS
proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
+ proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
#endif
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
@ 2010-02-19 1:35 ` Minchan Kim
2010-02-19 21:46 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 1:35 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
>
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
2010-02-19 1:35 ` Minchan Kim
@ 2010-02-19 21:46 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2010-02-19 21:46 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> Unusable free space index is a measure of external fragmentation that
> takes the allocation size into account. For the most part, the huge page
> size will be the size of interest but not necessarily so it is exported
> on a per-order and per-zone basis via /proc/unusable_index.
>
> The index is a value between 0 and 1. It can be expressed as a
> percentage by multiplying by 100 as documented in
> Documentation/filesystems/proc.txt.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 07/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (5 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 06/12] Export unusable free space index via /proc/pagetypeinfo Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 1:59 ` Minchan Kim
2010-02-20 0:16 ` Rik van Riel
2010-02-18 18:02 ` [PATCH 08/12] Memory compaction core Mel Gorman
` (4 subsequent siblings)
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Fragmentation index is a value that makes sense when an allocation of a
given size would fail. The index indicates whether an allocation failure is
due to a lack of memory (values towards 0) or due to external fragmentation
(value towards 1). For the most part, the huge page size will be the size
of interest but not necessarily so it is exported on a per-order and per-zone
basis via /proc/pagetypeinfo.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
Documentation/filesystems/proc.txt | 14 ++++++-
mm/vmstat.c | 81 ++++++++++++++++++++++++++++++++++++
2 files changed, 94 insertions(+), 1 deletions(-)
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 57869d0..24396ab 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -412,6 +412,7 @@ Table 1-5: Kernel info in /proc
filesystems Supported filesystems
driver Various drivers grouped here, currently rtc (2.4)
execdomains Execdomains, related to security (2.4)
+ extfrag_index Additional page allocator information (see text) (2.5)
fb Frame Buffer devices (2.4)
fs File system parameters, currently nfs/exports (2.4)
ide Directory containing info about the IDE subsystem
@@ -597,7 +598,7 @@ ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...
More information relevant to external fragmentation can be found in
-pagetypeinfo and unusable_index
+pagetypeinfo, unusable_index and extfrag_index.
> cat /proc/pagetypeinfo
Page block order: 9
@@ -648,6 +649,17 @@ value between 0 and 1. The higher the value, the more of free memory is
unusable and by implication, the worse the external fragmentation is. This
can be expressed as a percentage by multiplying by 100.
+> cat /proc/extfrag_index
+Node 0, zone DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.00
+Node 0, zone Normal -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 0.954
+
+The external fragmentation index is only meaningful if an allocation
+would fail and indicates what the failure is due to. A value of -1 such as
+in many of the examples above states that the allocation would succeed.
+If it would fail, the value is between 0 and 1. A value tending towards
+0 implies the allocation failed due to a lack of memory. A value tending
+towards 1 implies it failed due to external fragmentation.
+
..............................................................................
meminfo:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 23f217e..fa5975c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -551,6 +551,67 @@ static int unusable_show(struct seq_file *m, void *arg)
return 0;
}
+/*
+ * A fragmentation index only makes sense if an allocation of a requested
+ * size would fail. If that is true, the fragmentation index indicates
+ * whether external fragmentation or a lack of memory was the problem.
+ * The value can be used to determine if page reclaim or compaction
+ * should be used
+ */
+int fragmentation_index(unsigned int order, struct contig_page_info *info)
+{
+ unsigned long requested = 1UL << order;
+
+ if (!info->free_blocks_total)
+ return 0;
+
+ /* Fragmentation index only makes sense when a request would fail */
+ if (info->free_blocks_suitable)
+ return -1000;
+
+ /*
+ * Index is between 0 and 1 so return within 3 decimal places
+ *
+ * 0 => allocation would fail due to lack of memory
+ * 1 => allocation would fail due to fragmentation
+ */
+ return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
+}
+
+
+static void extfrag_show_print(struct seq_file *m,
+ pg_data_t *pgdat, struct zone *zone)
+{
+ unsigned int order;
+ int index;
+
+ /* Alloc on stack as interrupts are disabled for zone walk */
+ struct contig_page_info info;
+
+ seq_printf(m, "Node %d, zone %8s ",
+ pgdat->node_id,
+ zone->name);
+ for (order = 0; order < MAX_ORDER; ++order) {
+ fill_contig_page_info(zone, order, &info);
+ index = fragmentation_index(order, &info);
+ seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+ }
+
+ seq_putc(m, '\n');
+}
+
+/*
+ * Display fragmentation index for orders that allocations would fail for
+ */
+static int extfrag_show(struct seq_file *m, void *arg)
+{
+ pg_data_t *pgdat = (pg_data_t *)arg;
+
+ walk_zones_in_node(m, pgdat, extfrag_show_print);
+
+ return 0;
+}
+
static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
{
@@ -720,6 +781,25 @@ static const struct file_operations unusable_file_ops = {
.release = seq_release,
};
+static const struct seq_operations extfrag_op = {
+ .start = frag_start,
+ .next = frag_next,
+ .stop = frag_stop,
+ .show = extfrag_show,
+};
+
+static int extfrag_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &extfrag_op);
+}
+
+static const struct file_operations extfrag_file_ops = {
+ .open = extfrag_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
#ifdef CONFIG_ZONE_DMA
#define TEXT_FOR_DMA(xx) xx "_dma",
#else
@@ -1064,6 +1144,7 @@ static int __init setup_vmstat(void)
proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
proc_create("pagetypeinfo", S_IRUGO, NULL, &pagetypeinfo_file_ops);
proc_create("unusable_index", S_IRUGO, NULL, &unusable_file_ops);
+ proc_create("extfrag_index", S_IRUGO, NULL, &extfrag_file_ops);
proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
#endif
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 07/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 07/12] Export fragmentation " Mel Gorman
@ 2010-02-19 1:59 ` Minchan Kim
2010-02-20 0:16 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 1:59 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 07/12] Export fragmentation index via /proc/pagetypeinfo
2010-02-18 18:02 ` [PATCH 07/12] Export fragmentation " Mel Gorman
2010-02-19 1:59 ` Minchan Kim
@ 2010-02-20 0:16 ` Rik van Riel
1 sibling, 0 replies; 51+ messages in thread
From: Rik van Riel @ 2010-02-20 0:16 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, linux-kernel, linux-mm
On 02/18/2010 01:02 PM, Mel Gorman wrote:
> Fragmentation index is a value that makes sense when an allocation of a
> given size would fail. The index indicates whether an allocation failure is
> due to a lack of memory (values towards 0) or due to external fragmentation
> (value towards 1). For the most part, the huge page size will be the size
> of interest but not necessarily so it is exported on a per-order and per-zone
> basis via /proc/pagetypeinfo.
>
> Signed-off-by: Mel Gorman<mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
--
All rights reversed.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 08/12] Memory compaction core
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (6 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 07/12] Export fragmentation " Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:37 ` KAMEZAWA Hiroyuki
2010-02-18 18:02 ` [PATCH 09/12] Add /proc trigger for memory compaction Mel Gorman
` (3 subsequent siblings)
11 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch is the core of a mechanism which compacts memory in a zone by
relocating movable pages towards the end of the zone.
A single compaction run involves a migration scanner and a free scanner.
Both scanners operate on pageblock-sized areas in the zone. The migration
scanner starts at the bottom of the zone and searches for all movable pages
within each area, isolating them onto a private list called migratelist.
The free scanner starts at the top of the zone and searches for suitable
areas and consumes the free pages within making them available for the
migration scanner. The pages isolated for migration are then migrated to
the newly isolated free pages.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/linux/compaction.h | 8 +
include/linux/mm.h | 1 +
include/linux/swap.h | 6 +
include/linux/vmstat.h | 1 +
mm/Makefile | 1 +
mm/compaction.c | 347 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 37 +++++
mm/vmscan.c | 5 -
mm/vmstat.c | 5 +
9 files changed, 406 insertions(+), 5 deletions(-)
create mode 100644 include/linux/compaction.h
create mode 100644 mm/compaction.c
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
new file mode 100644
index 0000000..6201371
--- /dev/null
+++ b/include/linux/compaction.h
@@ -0,0 +1,8 @@
+#ifndef _LINUX_COMPACTION_H
+#define _LINUX_COMPACTION_H
+
+/* Return values for compact_zone() */
+#define COMPACT_INCOMPLETE 0
+#define COMPACT_COMPLETE 1
+
+#endif /* _LINUX_COMPACTION_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 60c467b..c2a2ede 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -332,6 +332,7 @@ void put_page(struct page *page);
void put_pages_list(struct list_head *pages);
void split_page(struct page *page, unsigned int order);
+int split_free_page(struct page *page);
/*
* Compound pages have a destructor function. Provide a
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a2602a8..12566ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -151,6 +151,7 @@ enum {
};
#define SWAP_CLUSTER_MAX 32
+#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
#define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
#define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
@@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
__lru_cache_add(page, LRU_ACTIVE_FILE);
}
+/* LRU Isolation modes. */
+#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
+#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
+#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
+
/* linux/mm/vmscan.c */
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index ee03bba..d7f7236 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+ COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/Makefile b/mm/Makefile
index 7a68d2a..ccb1f72 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_COMPACTION) += compaction.o
obj-$(CONFIG_SMP) += percpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
diff --git a/mm/compaction.c b/mm/compaction.c
new file mode 100644
index 0000000..14ba0ac
--- /dev/null
+++ b/mm/compaction.c
@@ -0,0 +1,347 @@
+/*
+ * linux/mm/compaction.c
+ *
+ * Memory compaction for the reduction of external fragmentation. Note that
+ * this heavily depends upon page migration to do all the real heavy
+ * lifting
+ *
+ * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
+ */
+#include <linux/swap.h>
+#include <linux/migrate.h>
+#include <linux/compaction.h>
+#include <linux/mm_inline.h>
+#include "internal.h"
+
+/*
+ * compact_control is used to track pages being migrated and the free pages
+ * they are being migrated to during memory compaction. The free_pfn starts
+ * at the end of a zone and migrate_pfn begins at the start. Movable pages
+ * are moved to the end of a zone during a compaction run and the run
+ * completes when free_pfn <= migrate_pfn
+ */
+struct compact_control {
+ struct list_head freepages; /* List of free pages to migrate to */
+ struct list_head migratepages; /* List of pages being migrated */
+ unsigned long nr_freepages; /* Number of isolated free pages */
+ unsigned long nr_migratepages; /* Number of pages to migrate */
+ unsigned long free_pfn; /* isolate_freepages search base */
+ unsigned long migrate_pfn; /* isolate_migratepages search base */
+
+ /* Account for isolated anon and file pages */
+ unsigned long nr_anon;
+ unsigned long nr_file;
+
+ struct zone *zone;
+};
+
+static int release_freepages(struct list_head *freelist)
+{
+ struct page *page, *next;
+ int count = 0;
+
+ list_for_each_entry_safe(page, next, freelist, lru) {
+ list_del(&page->lru);
+ __free_page(page);
+ count++;
+ }
+
+ return count;
+}
+
+/* Isolate free pages onto a private freelist. Must hold zone->lock */
+static int isolate_freepages_block(struct zone *zone,
+ unsigned long blockpfn,
+ struct list_head *freelist)
+{
+ unsigned long zone_end_pfn, end_pfn;
+ int total_isolated = 0;
+
+ /* Get the last PFN we should scan for free pages at */
+ zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ end_pfn = blockpfn + pageblock_nr_pages;
+ if (end_pfn > zone_end_pfn)
+ end_pfn = zone_end_pfn;
+
+ /* Isolate free pages. This assumes the block is valid */
+ for (; blockpfn < end_pfn; blockpfn++) {
+ struct page *page;
+ int isolated, i;
+
+ if (!pfn_valid_within(blockpfn))
+ continue;
+
+ page = pfn_to_page(blockpfn);
+ if (!PageBuddy(page))
+ continue;
+
+ /* Found a free page, break it into order-0 pages */
+ isolated = split_free_page(page);
+ total_isolated += isolated;
+ for (i = 0; i < isolated; i++) {
+ list_add(&page->lru, freelist);
+ page++;
+ }
+
+ /* If a page was split, advance to the end of it */
+ if (isolated)
+ blockpfn += isolated - 1;
+ }
+
+ return total_isolated;
+}
+
+/* Returns 1 if the page is within a block suitable for migration to */
+static int suitable_migration_target(struct page *page)
+{
+ /* If the page is a large free page, then allow migration */
+ if (PageBuddy(page) && page_order(page) >= pageblock_order)
+ return 1;
+
+ /* If the block is MIGRATE_MOVABLE, allow migration */
+ if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
+ return 1;
+
+ /* Otherwise skip the block */
+ return 0;
+}
+
+/*
+ * Based on information in the current compact_control, find blocks
+ * suitable for isolating free pages from
+ */
+static void isolate_freepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ struct page *page;
+ unsigned long high_pfn, low_pfn, pfn;
+ unsigned long flags;
+ int nr_freepages = cc->nr_freepages;
+ struct list_head *freelist = &cc->freepages;
+
+ pfn = cc->free_pfn;
+ low_pfn = cc->migrate_pfn + pageblock_nr_pages;
+ high_pfn = low_pfn;
+
+ /*
+ * Isolate free pages until enough are available to migrate the
+ * pages on cc->migratepages. We stop searching if the migrate
+ * and free page scanners meet or enough free pages are isolated.
+ */
+ spin_lock_irqsave(&zone->lock, flags);
+ for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
+ pfn -= pageblock_nr_pages) {
+ int isolated;
+
+ if (!pfn_valid(pfn))
+ continue;
+
+ /* Check for overlapping nodes/zones */
+ page = pfn_to_page(pfn);
+ if (page_zone(page) != zone)
+ continue;
+
+ /* Check the block is suitable for migration */
+ if (!suitable_migration_target(page))
+ continue;
+
+ /* Found a block suitable for isolating free pages from */
+ isolated = isolate_freepages_block(zone, pfn, freelist);
+ nr_freepages += isolated;
+
+ /*
+ * Record the highest PFN we isolated pages from. When next
+ * looking for free pages, the search will restart here as
+ * page migration may have returned some pages to the allocator
+ */
+ if (isolated)
+ high_pfn = max(high_pfn, pfn);
+ }
+ spin_unlock_irqrestore(&zone->lock, flags);
+
+ cc->free_pfn = high_pfn;
+ cc->nr_freepages = nr_freepages;
+}
+
+/* Update the number of anon and file isolated pages in the zone) */
+void update_zone_isolated(struct zone *zone, struct compact_control *cc)
+{
+ struct page *page;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+
+ list_for_each_entry(page, &cc->migratepages, lru) {
+ int lru = page_lru_base_type(page);
+ count[lru]++;
+ }
+
+ cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
+}
+
+/*
+ * Isolate all pages that can be migrated from the block pointed to by
+ * the migrate scanner within compact_control.
+ */
+static unsigned long isolate_migratepages(struct zone *zone,
+ struct compact_control *cc)
+{
+ unsigned long low_pfn, end_pfn;
+ struct list_head *migratelist;
+
+ low_pfn = cc->migrate_pfn;
+ migratelist = &cc->migratepages;
+
+ /* Do not scan outside zone boundaries */
+ if (low_pfn < zone->zone_start_pfn)
+ low_pfn = zone->zone_start_pfn;
+
+ /* Setup to scan one block but not past where we are migrating to */
+ end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
+
+ /* Do not cross the free scanner or scan within a memory hole */
+ if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
+ cc->migrate_pfn = end_pfn;
+ return 0;
+ }
+
+ migrate_prep();
+
+ /* Time to isolate some pages for migration */
+ spin_lock_irq(&zone->lru_lock);
+ for (; low_pfn < end_pfn; low_pfn++) {
+ struct page *page;
+ if (!pfn_valid_within(low_pfn))
+ continue;
+
+ /* Get the page and skip if free */
+ page = pfn_to_page(low_pfn);
+ if (PageBuddy(page)) {
+ low_pfn += (1 << page_order(page)) - 1;
+ continue;
+ }
+
+ if (!PageLRU(page) || PageUnevictable(page))
+ continue;
+
+ /* Try isolate the page */
+ if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
+ del_page_from_lru_list(zone, page, page_lru(page));
+ list_add(&page->lru, migratelist);
+ mem_cgroup_del_lru(page);
+ cc->nr_migratepages++;
+ }
+
+ /* Avoid isolating too much */
+ if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
+ break;
+ }
+
+ update_zone_isolated(zone, cc);
+
+ spin_unlock_irq(&zone->lru_lock);
+ cc->migrate_pfn = low_pfn;
+
+ return cc->nr_migratepages;
+}
+
+/*
+ * This is a migrate-callback that "allocates" freepages by taking pages
+ * from the isolated freelists in the block we are migrating to.
+ */
+static struct page *compaction_alloc(struct page *migratepage,
+ unsigned long data,
+ int **result)
+{
+ struct compact_control *cc = (struct compact_control *)data;
+ struct page *freepage;
+
+ VM_BUG_ON(cc == NULL);
+
+ /* Isolate free pages if necessary */
+ if (list_empty(&cc->freepages)) {
+ isolate_freepages(cc->zone, cc);
+
+ if (list_empty(&cc->freepages))
+ return NULL;
+ }
+
+ freepage = list_entry(cc->freepages.next, struct page, lru);
+ list_del(&freepage->lru);
+ cc->nr_freepages--;
+
+ return freepage;
+}
+
+/*
+ * We cannot control nr_migratepages and nr_freepages fully when migration is
+ * running as migrate_pages() has no knowledge of compact_control. When
+ * migration is complete, we count the number of pages on the lists by hand.
+ */
+static void update_nr_listpages(struct compact_control *cc)
+{
+ int nr_migratepages = 0;
+ int nr_freepages = 0;
+ struct page *page;
+ list_for_each_entry(page, &cc->migratepages, lru)
+ nr_migratepages++;
+ list_for_each_entry(page, &cc->freepages, lru)
+ nr_freepages++;
+
+ cc->nr_migratepages = nr_migratepages;
+ cc->nr_freepages = nr_freepages;
+}
+
+static inline int compact_finished(struct zone *zone,
+ struct compact_control *cc)
+{
+ /* Compaction run completes if the migrate and free scanner meet */
+ if (cc->free_pfn <= cc->migrate_pfn)
+ return COMPACT_COMPLETE;
+
+ return COMPACT_INCOMPLETE;
+}
+
+static int compact_zone(struct zone *zone, struct compact_control *cc)
+{
+ int ret = COMPACT_INCOMPLETE;
+
+ /* Setup to move all movable pages to the end of the zone */
+ cc->migrate_pfn = zone->zone_start_pfn;
+ cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
+ cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+ for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+ unsigned long nr_migrate, nr_remaining;
+ if (!isolate_migratepages(zone, cc))
+ continue;
+
+ nr_migrate = cc->nr_migratepages;
+ migrate_pages(&cc->migratepages, compaction_alloc,
+ (unsigned long)cc, 0);
+ update_nr_listpages(cc);
+ nr_remaining = cc->nr_migratepages;
+
+ count_vm_event(COMPACTBLOCKS);
+ count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
+ if (nr_remaining)
+ count_vm_events(COMPACTPAGEFAILED, nr_remaining);
+
+ /* Release LRU pages not migrated */
+ if (!list_empty(&cc->migratepages)) {
+ putback_lru_pages(&cc->migratepages);
+ cc->nr_migratepages = 0;
+ }
+
+ mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
+ mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
+ }
+
+ /* Release free pages and check accounting */
+ cc->nr_freepages -= release_freepages(&cc->freepages);
+ VM_BUG_ON(cc->nr_freepages != 0);
+
+ return ret;
+}
+
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..6d57154 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}
+/* Similar to split_page except the page is already free */
+int split_free_page(struct page *page)
+{
+ unsigned int order;
+ unsigned long watermark;
+ struct zone *zone;
+
+ BUG_ON(!PageBuddy(page));
+
+ zone = page_zone(page);
+ order = page_order(page);
+
+ /* Obey watermarks or the system could deadlock */
+ watermark = low_wmark_pages(zone) + (1 << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ return 0;
+
+ /* Remove page from free list */
+ list_del(&page->lru);
+ zone->free_area[order].nr_free--;
+ rmv_page_order(page);
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
+
+ /* Split into individual pages */
+ set_page_refcounted(page);
+ split_page(page, order);
+
+ /* Set the migratetype on the assumption it's for migration */
+ if (order >= pageblock_order - 1) {
+ struct page *endpage = page + (1 << order) - 1;
+ for (; page < endpage; page += pageblock_nr_pages)
+ set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+ }
+
+ return 1 << order;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..47de19b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -803,11 +803,6 @@ keep:
return nr_reclaimed;
}
-/* LRU Isolation modes. */
-#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
-#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
-#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
-
/*
* Attempt to remove the specified page from its LRU. Only take this page
* if it is of the appropriate PageActive status. Pages which are being
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fa5975c..0a14d22 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -889,6 +889,11 @@ static const char * const vmstat_text[] = {
"allocstall",
"pgrotated",
+
+ "compact_blocks_moved",
+ "compact_pages_moved",
+ "compact_pagemigrate_failed",
+
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
"htlb_buddy_alloc_fail",
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 08/12] Memory compaction core
2010-02-18 18:02 ` [PATCH 08/12] Memory compaction core Mel Gorman
@ 2010-02-19 0:37 ` KAMEZAWA Hiroyuki
2010-02-19 14:15 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:37 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:38 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> This patch is the core of a mechanism which compacts memory in a zone by
> relocating movable pages towards the end of the zone.
>
> A single compaction run involves a migration scanner and a free scanner.
> Both scanners operate on pageblock-sized areas in the zone. The migration
> scanner starts at the bottom of the zone and searches for all movable pages
> within each area, isolating them onto a private list called migratelist.
> The free scanner starts at the top of the zone and searches for suitable
> areas and consumes the free pages within making them available for the
> migration scanner. The pages isolated for migration are then migrated to
> the newly isolated free pages.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/compaction.h | 8 +
> include/linux/mm.h | 1 +
> include/linux/swap.h | 6 +
> include/linux/vmstat.h | 1 +
> mm/Makefile | 1 +
> mm/compaction.c | 347 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 37 +++++
> mm/vmscan.c | 5 -
> mm/vmstat.c | 5 +
> 9 files changed, 406 insertions(+), 5 deletions(-)
> create mode 100644 include/linux/compaction.h
> create mode 100644 mm/compaction.c
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> new file mode 100644
> index 0000000..6201371
> --- /dev/null
> +++ b/include/linux/compaction.h
> @@ -0,0 +1,8 @@
> +#ifndef _LINUX_COMPACTION_H
> +#define _LINUX_COMPACTION_H
> +
> +/* Return values for compact_zone() */
> +#define COMPACT_INCOMPLETE 0
> +#define COMPACT_COMPLETE 1
> +
> +#endif /* _LINUX_COMPACTION_H */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 60c467b..c2a2ede 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -332,6 +332,7 @@ void put_page(struct page *page);
> void put_pages_list(struct list_head *pages);
>
> void split_page(struct page *page, unsigned int order);
> +int split_free_page(struct page *page);
>
> /*
> * Compound pages have a destructor function. Provide a
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index a2602a8..12566ed 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -151,6 +151,7 @@ enum {
> };
>
> #define SWAP_CLUSTER_MAX 32
> +#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
>
> #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
> #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
> @@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> __lru_cache_add(page, LRU_ACTIVE_FILE);
> }
>
> +/* LRU Isolation modes. */
> +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> +
> /* linux/mm/vmscan.c */
> extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> gfp_t gfp_mask, nodemask_t *mask);
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index ee03bba..d7f7236 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/Makefile b/mm/Makefile
> index 7a68d2a..ccb1f72 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> obj-$(CONFIG_FS_XIP) += filemap_xip.o
> obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_COMPACTION) += compaction.o
> obj-$(CONFIG_SMP) += percpu.o
> obj-$(CONFIG_QUICKLIST) += quicklist.o
> obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> diff --git a/mm/compaction.c b/mm/compaction.c
> new file mode 100644
> index 0000000..14ba0ac
> --- /dev/null
> +++ b/mm/compaction.c
> @@ -0,0 +1,347 @@
> +/*
> + * linux/mm/compaction.c
> + *
> + * Memory compaction for the reduction of external fragmentation. Note that
> + * this heavily depends upon page migration to do all the real heavy
> + * lifting
> + *
> + * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
> + */
> +#include <linux/swap.h>
> +#include <linux/migrate.h>
> +#include <linux/compaction.h>
> +#include <linux/mm_inline.h>
> +#include "internal.h"
> +
> +/*
> + * compact_control is used to track pages being migrated and the free pages
> + * they are being migrated to during memory compaction. The free_pfn starts
> + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> + * are moved to the end of a zone during a compaction run and the run
> + * completes when free_pfn <= migrate_pfn
> + */
> +struct compact_control {
> + struct list_head freepages; /* List of free pages to migrate to */
> + struct list_head migratepages; /* List of pages being migrated */
> + unsigned long nr_freepages; /* Number of isolated free pages */
> + unsigned long nr_migratepages; /* Number of pages to migrate */
> + unsigned long free_pfn; /* isolate_freepages search base */
> + unsigned long migrate_pfn; /* isolate_migratepages search base */
> +
> + /* Account for isolated anon and file pages */
> + unsigned long nr_anon;
> + unsigned long nr_file;
> +
> + struct zone *zone;
> +};
> +
> +static int release_freepages(struct list_head *freelist)
> +{
> + struct page *page, *next;
> + int count = 0;
> +
> + list_for_each_entry_safe(page, next, freelist, lru) {
> + list_del(&page->lru);
> + __free_page(page);
> + count++;
> + }
> +
> + return count;
> +}
> +
> +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> +static int isolate_freepages_block(struct zone *zone,
> + unsigned long blockpfn,
> + struct list_head *freelist)
> +{
> + unsigned long zone_end_pfn, end_pfn;
> + int total_isolated = 0;
> +
> + /* Get the last PFN we should scan for free pages at */
> + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> + end_pfn = blockpfn + pageblock_nr_pages;
> + if (end_pfn > zone_end_pfn)
> + end_pfn = zone_end_pfn;
> +
> + /* Isolate free pages. This assumes the block is valid */
> + for (; blockpfn < end_pfn; blockpfn++) {
> + struct page *page;
> + int isolated, i;
> +
> + if (!pfn_valid_within(blockpfn))
> + continue;
Doesn't this mean failure of compaction?
If there are memory holes, compaction of this area can never be achieved.
If so, we should stop compaction of this area.
> +
> + page = pfn_to_page(blockpfn);
> + if (!PageBuddy(page))
> + continue;
> +
> + /* Found a free page, break it into order-0 pages */
> + isolated = split_free_page(page);
> + total_isolated += isolated;
> + for (i = 0; i < isolated; i++) {
> + list_add(&page->lru, freelist);
> + page++;
> + }
> +
> + /* If a page was split, advance to the end of it */
> + if (isolated)
> + blockpfn += isolated - 1;
> + }
> +
> + return total_isolated;
> +}
Hmm, I wonder... how about setting the migrate type of the page block to ISOLATED
at the end of this function?
Then no new page allocation occurs in this pageblock, and you don't have to
keep the list of free pages; free_page() does the isolation.
I'm sorry if I missed something.
> +
> +/* Returns 1 if the page is within a block suitable for migration to */
> +static int suitable_migration_target(struct page *page)
> +{
> + /* If the page is a large free page, then allow migration */
> + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> + return 1;
> +
> + /* If the block is MIGRATE_MOVABLE, allow migration */
> + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> + return 1;
> +
> + /* Otherwise skip the block */
> + return 0;
> +}
> +
> +/*
> + * Based on information in the current compact_control, find blocks
> + * suitable for isolating free pages from
> + */
> +static void isolate_freepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + struct page *page;
> + unsigned long high_pfn, low_pfn, pfn;
> + unsigned long flags;
> + int nr_freepages = cc->nr_freepages;
> + struct list_head *freelist = &cc->freepages;
> +
> + pfn = cc->free_pfn;
> + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> + high_pfn = low_pfn;
> +
> + /*
> + * Isolate free pages until enough are available to migrate the
> + * pages on cc->migratepages. We stop searching if the migrate
> + * and free page scanners meet or enough free pages are isolated.
> + */
> + spin_lock_irqsave(&zone->lock, flags);
> + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> + pfn -= pageblock_nr_pages) {
> + int isolated;
> +
> + if (!pfn_valid(pfn))
> + continue;
> +
> + /* Check for overlapping nodes/zones */
> + page = pfn_to_page(pfn);
> + if (page_zone(page) != zone)
> + continue;
> +
> + /* Check the block is suitable for migration */
> + if (!suitable_migration_target(page))
> + continue;
> +
> + /* Found a block suitable for isolating free pages from */
> + isolated = isolate_freepages_block(zone, pfn, freelist);
> + nr_freepages += isolated;
> +
> + /*
> + * Record the highest PFN we isolated pages from. When next
> + * looking for free pages, the search will restart here as
> + * page migration may have returned some pages to the allocator
> + */
> + if (isolated)
> + high_pfn = max(high_pfn, pfn);
> + }
> + spin_unlock_irqrestore(&zone->lock, flags);
> +
> + cc->free_pfn = high_pfn;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +/* Update the number of anon and file isolated pages in the zone) */
> +void update_zone_isolated(struct zone *zone, struct compact_control *cc)
> +{
> + struct page *page;
> + unsigned int count[NR_LRU_LISTS] = { 0, };
> +
> + list_for_each_entry(page, &cc->migratepages, lru) {
> + int lru = page_lru_base_type(page);
> + count[lru]++;
> + }
> +
> + cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> + cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> + __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> + __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> +}
> +
> +/*
> + * Isolate all pages that can be migrated from the block pointed to by
> + * the migrate scanner within compact_control.
> + */
> +static unsigned long isolate_migratepages(struct zone *zone,
> + struct compact_control *cc)
> +{
> + unsigned long low_pfn, end_pfn;
> + struct list_head *migratelist;
> +
> + low_pfn = cc->migrate_pfn;
> + migratelist = &cc->migratepages;
> +
> + /* Do not scan outside zone boundaries */
> + if (low_pfn < zone->zone_start_pfn)
> + low_pfn = zone->zone_start_pfn;
> +
> + /* Setup to scan one block but not past where we are migrating to */
> + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> +
> + /* Do not cross the free scanner or scan within a memory hole */
> + if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> + cc->migrate_pfn = end_pfn;
> + return 0;
> + }
> +
> + migrate_prep();
> +
> + /* Time to isolate some pages for migration */
> + spin_lock_irq(&zone->lru_lock);
> + for (; low_pfn < end_pfn; low_pfn++) {
> + struct page *page;
> + if (!pfn_valid_within(low_pfn))
> + continue;
break ?
> +
> + /* Get the page and skip if free */
> + page = pfn_to_page(low_pfn);
> + if (PageBuddy(page)) {
> + low_pfn += (1 << page_order(page)) - 1;
> + continue;
> + }
> +
> + if (!PageLRU(page) || PageUnevictable(page))
> +
break ? (compaction will not succeed if !PG_lru). I'm not sure
best-effort which can't reach goal is good or not. But conservative
approach will be better as the 1st version..
Thanks,
-Kame
continue;
> +
> + /* Try isolate the page */
> + if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> + del_page_from_lru_list(zone, page, page_lru(page));
> + list_add(&page->lru, migratelist);
> + mem_cgroup_del_lru(page);
> + cc->nr_migratepages++;
> + }
> +
> + /* Avoid isolating too much */
> + if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> + break;
> + }
> +
> + update_zone_isolated(zone, cc);
> +
> + spin_unlock_irq(&zone->lru_lock);
> + cc->migrate_pfn = low_pfn;
> +
> + return cc->nr_migratepages;
> +}
> +
> +/*
> + * This is a migrate-callback that "allocates" freepages by taking pages
> + * from the isolated freelists in the block we are migrating to.
> + */
> +static struct page *compaction_alloc(struct page *migratepage,
> + unsigned long data,
> + int **result)
> +{
> + struct compact_control *cc = (struct compact_control *)data;
> + struct page *freepage;
> +
> + VM_BUG_ON(cc == NULL);
> +
> + /* Isolate free pages if necessary */
> + if (list_empty(&cc->freepages)) {
> + isolate_freepages(cc->zone, cc);
> +
> + if (list_empty(&cc->freepages))
> + return NULL;
> + }
> +
> + freepage = list_entry(cc->freepages.next, struct page, lru);
> + list_del(&freepage->lru);
> + cc->nr_freepages--;
> +
> + return freepage;
> +}
> +
> +/*
> + * We cannot control nr_migratepages and nr_freepages fully when migration is
> + * running as migrate_pages() has no knowledge of compact_control. When
> + * migration is complete, we count the number of pages on the lists by hand.
> + */
> +static void update_nr_listpages(struct compact_control *cc)
> +{
> + int nr_migratepages = 0;
> + int nr_freepages = 0;
> + struct page *page;
> + list_for_each_entry(page, &cc->migratepages, lru)
> + nr_migratepages++;
> + list_for_each_entry(page, &cc->freepages, lru)
> + nr_freepages++;
> +
> + cc->nr_migratepages = nr_migratepages;
> + cc->nr_freepages = nr_freepages;
> +}
> +
> +static inline int compact_finished(struct zone *zone,
> + struct compact_control *cc)
> +{
> + /* Compaction run completes if the migrate and free scanner meet */
> + if (cc->free_pfn <= cc->migrate_pfn)
> + return COMPACT_COMPLETE;
> +
> + return COMPACT_INCOMPLETE;
> +}
> +
> +static int compact_zone(struct zone *zone, struct compact_control *cc)
> +{
> + int ret = COMPACT_INCOMPLETE;
> +
> + /* Setup to move all movable pages to the end of the zone */
> + cc->migrate_pfn = zone->zone_start_pfn;
> + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> + cc->free_pfn &= ~(pageblock_nr_pages-1);
> +
> + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> + unsigned long nr_migrate, nr_remaining;
> + if (!isolate_migratepages(zone, cc))
> + continue;
> +
> + nr_migrate = cc->nr_migratepages;
> + migrate_pages(&cc->migratepages, compaction_alloc,
> + (unsigned long)cc, 0);
> + update_nr_listpages(cc);
> + nr_remaining = cc->nr_migratepages;
> +
> + count_vm_event(COMPACTBLOCKS);
> + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> + if (nr_remaining)
> + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> +
> + /* Release LRU pages not migrated */
> + if (!list_empty(&cc->migratepages)) {
> + putback_lru_pages(&cc->migratepages);
> + cc->nr_migratepages = 0;
> + }
> +
> + mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> + mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> + }
> +
> + /* Release free pages and check accounting */
> + cc->nr_freepages -= release_freepages(&cc->freepages);
> + VM_BUG_ON(cc->nr_freepages != 0);
> +
> + return ret;
> +}
> +
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8deb9d0..6d57154 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> set_page_refcounted(page + i);
> }
>
> +/* Similar to split_page except the page is already free */
> +int split_free_page(struct page *page)
> +{
> + unsigned int order;
> + unsigned long watermark;
> + struct zone *zone;
> +
> + BUG_ON(!PageBuddy(page));
> +
> + zone = page_zone(page);
> + order = page_order(page);
> +
> + /* Obey watermarks or the system could deadlock */
> + watermark = low_wmark_pages(zone) + (1 << order);
> + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> + return 0;
> +
> + /* Remove page from free list */
> + list_del(&page->lru);
> + zone->free_area[order].nr_free--;
> + rmv_page_order(page);
> + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> +
> + /* Split into individual pages */
> + set_page_refcounted(page);
> + split_page(page, order);
> +
> + /* Set the migratetype on the assumption it's for migration */
> + if (order >= pageblock_order - 1) {
> + struct page *endpage = page + (1 << order) - 1;
> + for (; page < endpage; page += pageblock_nr_pages)
> + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> + }
> +
> + return 1 << order;
> +}
> +
> /*
> * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> * we cheat by calling it from here, in the order > 0 path. Saves a branch
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c26986c..47de19b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -803,11 +803,6 @@ keep:
> return nr_reclaimed;
> }
>
> -/* LRU Isolation modes. */
> -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> -
> /*
> * Attempt to remove the specified page from its LRU. Only take this page
> * if it is of the appropriate PageActive status. Pages which are being
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index fa5975c..0a14d22 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -889,6 +889,11 @@ static const char * const vmstat_text[] = {
> "allocstall",
>
> "pgrotated",
> +
> + "compact_blocks_moved",
> + "compact_pages_moved",
> + "compact_pagemigrate_failed",
> +
> #ifdef CONFIG_HUGETLB_PAGE
> "htlb_buddy_alloc_success",
> "htlb_buddy_alloc_fail",
> --
> 1.6.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 08/12] Memory compaction core
2010-02-19 0:37 ` KAMEZAWA Hiroyuki
@ 2010-02-19 14:15 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:15 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:37:58AM +0900, KAMEZAWA Hiroyuki wrote:
> > This patch is the core of a mechanism which compacts memory in a zone by
> > relocating movable pages towards the end of the zone.
> >
> > A single compaction run involves a migration scanner and a free scanner.
> > Both scanners operate on pageblock-sized areas in the zone. The migration
> > scanner starts at the bottom of the zone and searches for all movable pages
> > within each area, isolating them onto a private list called migratelist.
> > The free scanner starts at the top of the zone and searches for suitable
> > areas and consumes the free pages within making them available for the
> > migration scanner. The pages isolated for migration are then migrated to
> > the newly isolated free pages.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > include/linux/compaction.h | 8 +
> > include/linux/mm.h | 1 +
> > include/linux/swap.h | 6 +
> > include/linux/vmstat.h | 1 +
> > mm/Makefile | 1 +
> > mm/compaction.c | 347 ++++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 37 +++++
> > mm/vmscan.c | 5 -
> > mm/vmstat.c | 5 +
> > 9 files changed, 406 insertions(+), 5 deletions(-)
> > create mode 100644 include/linux/compaction.h
> > create mode 100644 mm/compaction.c
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > new file mode 100644
> > index 0000000..6201371
> > --- /dev/null
> > +++ b/include/linux/compaction.h
> > @@ -0,0 +1,8 @@
> > +#ifndef _LINUX_COMPACTION_H
> > +#define _LINUX_COMPACTION_H
> > +
> > +/* Return values for compact_zone() */
> > +#define COMPACT_INCOMPLETE 0
> > +#define COMPACT_COMPLETE 1
> > +
> > +#endif /* _LINUX_COMPACTION_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 60c467b..c2a2ede 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -332,6 +332,7 @@ void put_page(struct page *page);
> > void put_pages_list(struct list_head *pages);
> >
> > void split_page(struct page *page, unsigned int order);
> > +int split_free_page(struct page *page);
> >
> > /*
> > * Compound pages have a destructor function. Provide a
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index a2602a8..12566ed 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -151,6 +151,7 @@ enum {
> > };
> >
> > #define SWAP_CLUSTER_MAX 32
> > +#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
> >
> > #define SWAP_MAP_MAX 0x3e /* Max duplication count, in first swap_map */
> > #define SWAP_MAP_BAD 0x3f /* Note pageblock is bad, in first swap_map */
> > @@ -238,6 +239,11 @@ static inline void lru_cache_add_active_file(struct page *page)
> > __lru_cache_add(page, LRU_ACTIVE_FILE);
> > }
> >
> > +/* LRU Isolation modes. */
> > +#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > +#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > +#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > +
> > /* linux/mm/vmscan.c */
> > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > gfp_t gfp_mask, nodemask_t *mask);
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index ee03bba..d7f7236 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -43,6 +43,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> > KSWAPD_SKIP_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > + COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > #endif
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 7a68d2a..ccb1f72 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -33,6 +33,7 @@ obj-$(CONFIG_FAILSLAB) += failslab.o
> > obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
> > obj-$(CONFIG_FS_XIP) += filemap_xip.o
> > obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_COMPACTION) += compaction.o
> > obj-$(CONFIG_SMP) += percpu.o
> > obj-$(CONFIG_QUICKLIST) += quicklist.o
> > obj-$(CONFIG_CGROUP_MEM_RES_CTLR) += memcontrol.o page_cgroup.o
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > new file mode 100644
> > index 0000000..14ba0ac
> > --- /dev/null
> > +++ b/mm/compaction.c
> > @@ -0,0 +1,347 @@
> > +/*
> > + * linux/mm/compaction.c
> > + *
> > + * Memory compaction for the reduction of external fragmentation. Note that
> > + * this heavily depends upon page migration to do all the real heavy
> > + * lifting
> > + *
> > + * Copyright IBM Corp. 2007-2010 Mel Gorman <mel@csn.ul.ie>
> > + */
> > +#include <linux/swap.h>
> > +#include <linux/migrate.h>
> > +#include <linux/compaction.h>
> > +#include <linux/mm_inline.h>
> > +#include "internal.h"
> > +
> > +/*
> > + * compact_control is used to track pages being migrated and the free pages
> > + * they are being migrated to during memory compaction. The free_pfn starts
> > + * at the end of a zone and migrate_pfn begins at the start. Movable pages
> > + * are moved to the end of a zone during a compaction run and the run
> > + * completes when free_pfn <= migrate_pfn
> > + */
> > +struct compact_control {
> > + struct list_head freepages; /* List of free pages to migrate to */
> > + struct list_head migratepages; /* List of pages being migrated */
> > + unsigned long nr_freepages; /* Number of isolated free pages */
> > + unsigned long nr_migratepages; /* Number of pages to migrate */
> > + unsigned long free_pfn; /* isolate_freepages search base */
> > + unsigned long migrate_pfn; /* isolate_migratepages search base */
> > +
> > + /* Account for isolated anon and file pages */
> > + unsigned long nr_anon;
> > + unsigned long nr_file;
> > +
> > + struct zone *zone;
> > +};
> > +
> > +static int release_freepages(struct list_head *freelist)
> > +{
> > + struct page *page, *next;
> > + int count = 0;
> > +
> > + list_for_each_entry_safe(page, next, freelist, lru) {
> > + list_del(&page->lru);
> > + __free_page(page);
> > + count++;
> > + }
> > +
> > + return count;
> > +}
> > +
> > +/* Isolate free pages onto a private freelist. Must hold zone->lock */
> > +static int isolate_freepages_block(struct zone *zone,
> > + unsigned long blockpfn,
> > + struct list_head *freelist)
> > +{
> > + unsigned long zone_end_pfn, end_pfn;
> > + int total_isolated = 0;
> > +
> > + /* Get the last PFN we should scan for free pages at */
> > + zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
> > + end_pfn = blockpfn + pageblock_nr_pages;
> > + if (end_pfn > zone_end_pfn)
> > + end_pfn = zone_end_pfn;
> > +
> > + /* Isolate free pages. This assumes the block is valid */
> > + for (; blockpfn < end_pfn; blockpfn++) {
> > + struct page *page;
> > + int isolated, i;
> > +
> > + if (!pfn_valid_within(blockpfn))
> > + continue;
>
> Doen't this mean failure of compaction ?
Nope. We're isolating free pages as a migration target here, not a
source. If it finds a hole, move on to the next page. If necessary, the
next block will be moved to later.
> If there are memory holes, compaction of this area will never be achieved.
> If so, we should stop compaction of this area.
>
Only half of the block might be in a hole so it's still worth checking
for free pages.
>
> > +
> > + page = pfn_to_page(blockpfn);
> > + if (!PageBuddy(page))
> > + continue;
> > +
> > + /* Found a free page, break it into order-0 pages */
> > + isolated = split_free_page(page);
> > + total_isolated += isolated;
> > + for (i = 0; i < isolated; i++) {
> > + list_add(&page->lru, freelist);
> > + page++;
> > + }
> > +
> > + /* If a page was split, advance to the end of it */
> > + if (isolated)
> > + blockpfn += isolated - 1;
> > + }
> > +
> > + return total_isolated;
> > +}
>
> Hmm, I wonder ...how about setting migrate type of page block to be ISOLATED
> at the end of this function ?
> Then, no new page allocation occurs in this pageblock. And you don't have to
> keep the list of free pages. free_page() does isolation.
>
Even if the block is marked ISOLATED, I would still have to move the
already-free pages between buddy lists. Targetting later for allocation
would then be slightly trickier as I'd have to scan the buddy lists. I
don't think it would be better overall as there would still be a lot of
messing with lists.
> I'm sorry I miss something.
>
>
>
> > +
> > +/* Returns 1 if the page is within a block suitable for migration to */
> > +static int suitable_migration_target(struct page *page)
> > +{
> > + /* If the page is a large free page, then allow migration */
> > + if (PageBuddy(page) && page_order(page) >= pageblock_order)
> > + return 1;
> > +
> > + /* If the block is MIGRATE_MOVABLE, allow migration */
> > + if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE)
> > + return 1;
> > +
> > + /* Otherwise skip the block */
> > + return 0;
> > +}
> > +
> > +/*
> > + * Based on information in the current compact_control, find blocks
> > + * suitable for isolating free pages from
> > + */
> > +static void isolate_freepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + struct page *page;
> > + unsigned long high_pfn, low_pfn, pfn;
> > + unsigned long flags;
> > + int nr_freepages = cc->nr_freepages;
> > + struct list_head *freelist = &cc->freepages;
> > +
> > + pfn = cc->free_pfn;
> > + low_pfn = cc->migrate_pfn + pageblock_nr_pages;
> > + high_pfn = low_pfn;
> > +
> > + /*
> > + * Isolate free pages until enough are available to migrate the
> > + * pages on cc->migratepages. We stop searching if the migrate
> > + * and free page scanners meet or enough free pages are isolated.
> > + */
> > + spin_lock_irqsave(&zone->lock, flags);
> > + for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
> > + pfn -= pageblock_nr_pages) {
> > + int isolated;
> > +
> > + if (!pfn_valid(pfn))
> > + continue;
> > +
> > + /* Check for overlapping nodes/zones */
> > + page = pfn_to_page(pfn);
> > + if (page_zone(page) != zone)
> > + continue;
> > +
> > + /* Check the block is suitable for migration */
> > + if (!suitable_migration_target(page))
> > + continue;
> > +
> > + /* Found a block suitable for isolating free pages from */
> > + isolated = isolate_freepages_block(zone, pfn, freelist);
> > + nr_freepages += isolated;
> > +
> > + /*
> > + * Record the highest PFN we isolated pages from. When next
> > + * looking for free pages, the search will restart here as
> > + * page migration may have returned some pages to the allocator
> > + */
> > + if (isolated)
> > + high_pfn = max(high_pfn, pfn);
> > + }
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > + cc->free_pfn = high_pfn;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +/* Update the number of anon and file isolated pages in the zone) */
> > +void update_zone_isolated(struct zone *zone, struct compact_control *cc)
> > +{
> > + struct page *page;
> > + unsigned int count[NR_LRU_LISTS] = { 0, };
> > +
> > + list_for_each_entry(page, &cc->migratepages, lru) {
> > + int lru = page_lru_base_type(page);
> > + count[lru]++;
> > + }
> > +
> > + cc->nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
> > + cc->nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
> > + __mod_zone_page_state(zone, NR_ISOLATED_ANON, cc->nr_anon);
> > + __mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
> > +}
> > +
> > +/*
> > + * Isolate all pages that can be migrated from the block pointed to by
> > + * the migrate scanner within compact_control.
> > + */
> > +static unsigned long isolate_migratepages(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + unsigned long low_pfn, end_pfn;
> > + struct list_head *migratelist;
> > +
> > + low_pfn = cc->migrate_pfn;
> > + migratelist = &cc->migratepages;
> > +
> > + /* Do not scan outside zone boundaries */
> > + if (low_pfn < zone->zone_start_pfn)
> > + low_pfn = zone->zone_start_pfn;
> > +
> > + /* Setup to scan one block but not past where we are migrating to */
> > + end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
> > +
> > + /* Do not cross the free scanner or scan within a memory hole */
> > + if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
> > + cc->migrate_pfn = end_pfn;
> > + return 0;
> > + }
> > +
> > + migrate_prep();
> > +
> > + /* Time to isolate some pages for migration */
> > + spin_lock_irq(&zone->lru_lock);
> > + for (; low_pfn < end_pfn; low_pfn++) {
> > + struct page *page;
> > + if (!pfn_valid_within(low_pfn))
> > + continue;
> break ?
>
What if the hole was at the start of the page block? In the huge page
context, there is a strong chance that half a pageblock is worthless but
maybe it's useful in other contexts.
> > +
> > + /* Get the page and skip if free */
> > + page = pfn_to_page(low_pfn);
> > + if (PageBuddy(page)) {
> > + low_pfn += (1 << page_order(page)) - 1;
> > + continue;
> > + }
> > +
> > + if (!PageLRU(page) || PageUnevictable(page))
> > +
> break ? (compaction will not succeed if !PG_lru). I'm not sure
> best-effort which can't reach goal is good or not. But conservative
> approach will be better as the 1st version..
>
Well, lets say we are migrating from a MIGRATE_RECLAIMABLE block. That
block will not be freed for a huge page allocation, but it helps
fragmentation avoidance.
> continue;
> > +
> > + /* Try isolate the page */
> > + if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
> > + del_page_from_lru_list(zone, page, page_lru(page));
> > + list_add(&page->lru, migratelist);
> > + mem_cgroup_del_lru(page);
> > + cc->nr_migratepages++;
> > + }
> > +
> > + /* Avoid isolating too much */
> > + if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
> > + break;
> > + }
> > +
> > + update_zone_isolated(zone, cc);
> > +
> > + spin_unlock_irq(&zone->lru_lock);
> > + cc->migrate_pfn = low_pfn;
> > +
> > + return cc->nr_migratepages;
> > +}
> > +
> > +/*
> > + * This is a migrate-callback that "allocates" freepages by taking pages
> > + * from the isolated freelists in the block we are migrating to.
> > + */
> > +static struct page *compaction_alloc(struct page *migratepage,
> > + unsigned long data,
> > + int **result)
> > +{
> > + struct compact_control *cc = (struct compact_control *)data;
> > + struct page *freepage;
> > +
> > + VM_BUG_ON(cc == NULL);
> > +
> > + /* Isolate free pages if necessary */
> > + if (list_empty(&cc->freepages)) {
> > + isolate_freepages(cc->zone, cc);
> > +
> > + if (list_empty(&cc->freepages))
> > + return NULL;
> > + }
> > +
> > + freepage = list_entry(cc->freepages.next, struct page, lru);
> > + list_del(&freepage->lru);
> > + cc->nr_freepages--;
> > +
> > + return freepage;
> > +}
> > +
> > +/*
> > + * We cannot control nr_migratepages and nr_freepages fully when migration is
> > + * running as migrate_pages() has no knowledge of compact_control. When
> > + * migration is complete, we count the number of pages on the lists by hand.
> > + */
> > +static void update_nr_listpages(struct compact_control *cc)
> > +{
> > + int nr_migratepages = 0;
> > + int nr_freepages = 0;
> > + struct page *page;
> > + list_for_each_entry(page, &cc->migratepages, lru)
> > + nr_migratepages++;
> > + list_for_each_entry(page, &cc->freepages, lru)
> > + nr_freepages++;
> > +
> > + cc->nr_migratepages = nr_migratepages;
> > + cc->nr_freepages = nr_freepages;
> > +}
> > +
> > +static inline int compact_finished(struct zone *zone,
> > + struct compact_control *cc)
> > +{
> > + /* Compaction run completes if the migrate and free scanner meet */
> > + if (cc->free_pfn <= cc->migrate_pfn)
> > + return COMPACT_COMPLETE;
> > +
> > + return COMPACT_INCOMPLETE;
> > +}
> > +
> > +static int compact_zone(struct zone *zone, struct compact_control *cc)
> > +{
> > + int ret = COMPACT_INCOMPLETE;
> > +
> > + /* Setup to move all movable pages to the end of the zone */
> > + cc->migrate_pfn = zone->zone_start_pfn;
> > + cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
> > + cc->free_pfn &= ~(pageblock_nr_pages-1);
> > +
> > + for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
> > + unsigned long nr_migrate, nr_remaining;
> > + if (!isolate_migratepages(zone, cc))
> > + continue;
> > +
> > + nr_migrate = cc->nr_migratepages;
> > + migrate_pages(&cc->migratepages, compaction_alloc,
> > + (unsigned long)cc, 0);
> > + update_nr_listpages(cc);
> > + nr_remaining = cc->nr_migratepages;
> > +
> > + count_vm_event(COMPACTBLOCKS);
> > + count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
> > + if (nr_remaining)
> > + count_vm_events(COMPACTPAGEFAILED, nr_remaining);
> > +
> > + /* Release LRU pages not migrated */
> > + if (!list_empty(&cc->migratepages)) {
> > + putback_lru_pages(&cc->migratepages);
> > + cc->nr_migratepages = 0;
> > + }
> > +
> > + mod_zone_page_state(zone, NR_ISOLATED_ANON, -cc->nr_anon);
> > + mod_zone_page_state(zone, NR_ISOLATED_FILE, -cc->nr_file);
> > + }
> > +
> > + /* Release free pages and check accounting */
> > + cc->nr_freepages -= release_freepages(&cc->freepages);
> > + VM_BUG_ON(cc->nr_freepages != 0);
> > +
> > + return ret;
> > +}
> > +
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 8deb9d0..6d57154 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1168,6 +1168,43 @@ void split_page(struct page *page, unsigned int order)
> > set_page_refcounted(page + i);
> > }
> >
> > +/* Similar to split_page except the page is already free */
> > +int split_free_page(struct page *page)
> > +{
> > + unsigned int order;
> > + unsigned long watermark;
> > + struct zone *zone;
> > +
> > + BUG_ON(!PageBuddy(page));
> > +
> > + zone = page_zone(page);
> > + order = page_order(page);
> > +
> > + /* Obey watermarks or the system could deadlock */
> > + watermark = low_wmark_pages(zone) + (1 << order);
> > + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > + return 0;
> > +
> > + /* Remove page from free list */
> > + list_del(&page->lru);
> > + zone->free_area[order].nr_free--;
> > + rmv_page_order(page);
> > + __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
> > +
> > + /* Split into individual pages */
> > + set_page_refcounted(page);
> > + split_page(page, order);
> > +
> > + /* Set the migratetype on the assumption it's for migration */
> > + if (order >= pageblock_order - 1) {
> > + struct page *endpage = page + (1 << order) - 1;
> > + for (; page < endpage; page += pageblock_nr_pages)
> > + set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> > + }
> > +
> > + return 1 << order;
> > +}
> > +
> > /*
> > * Really, prep_compound_page() should be called from __rmqueue_bulk(). But
> > * we cheat by calling it from here, in the order > 0 path. Saves a branch
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c26986c..47de19b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -803,11 +803,6 @@ keep:
> > return nr_reclaimed;
> > }
> >
> > -/* LRU Isolation modes. */
> > -#define ISOLATE_INACTIVE 0 /* Isolate inactive pages. */
> > -#define ISOLATE_ACTIVE 1 /* Isolate active pages. */
> > -#define ISOLATE_BOTH 2 /* Isolate both active and inactive pages. */
> > -
> > /*
> > * Attempt to remove the specified page from its LRU. Only take this page
> > * if it is of the appropriate PageActive status. Pages which are being
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index fa5975c..0a14d22 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -889,6 +889,11 @@ static const char * const vmstat_text[] = {
> > "allocstall",
> >
> > "pgrotated",
> > +
> > + "compact_blocks_moved",
> > + "compact_pages_moved",
> > + "compact_pagemigrate_failed",
> > +
> > #ifdef CONFIG_HUGETLB_PAGE
> > "htlb_buddy_alloc_success",
> > "htlb_buddy_alloc_fail",
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 09/12] Add /proc trigger for memory compaction
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (7 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 08/12] Memory compaction core Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:43 ` KAMEZAWA Hiroyuki
2010-02-19 2:26 ` Minchan Kim
2010-02-18 18:02 ` [PATCH 10/12] Add /sys trigger for per-node " Mel Gorman
` (2 subsequent siblings)
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
value is written to the file, all zones are compacted. The expected user
of such a trigger is a job scheduler that prepares the system before the
target application runs.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
Documentation/sysctl/vm.txt | 11 ++++++++
include/linux/compaction.h | 5 +++
kernel/sysctl.c | 11 ++++++++
mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 87 insertions(+), 0 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index fc5790d..92b5b00 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -19,6 +19,7 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:
- block_dump
+- compact_memory
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
@@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
==============================================================
+compact_memory
+
+Available only when CONFIG_COMPACTION is set. When an arbitrary value
+is written to the file, all zones are compacted such that free memory
+is available in contiguous blocks where possible. This can be important
+for example in the allocation of huge pages although processes will also
+directly compact memory as required.
+
+==============================================================
+
dirty_background_bytes
Contains the amount of dirty memory at which the pdflush background writeback
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6201371..facaa3d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -5,4 +5,9 @@
#define COMPACT_INCOMPLETE 0
#define COMPACT_COMPLETE 1
+#ifdef CONFIG_COMPACTION
+extern int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos);
+#endif /* CONFIG_COMPACTION */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8a68b24..a02c816 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -50,6 +50,7 @@
#include <linux/ftrace.h>
#include <linux/slow-work.h>
#include <linux/perf_event.h>
+#include <linux/compaction.h>
#include <asm/uaccess.h>
#include <asm/processor.h>
@@ -80,6 +81,7 @@ extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
extern int sysctl_drop_caches;
+extern int sysctl_compact_memory;
extern int percpu_pagelist_fraction;
extern int compat_log;
extern int latencytop_enabled;
@@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = drop_caches_sysctl_handler,
},
+#ifdef CONFIG_COMPACTION
+ {
+ .procname = "compact_memory",
+ .data = &sysctl_compact_memory,
+ .maxlen = sizeof(int),
+ .mode = 0200,
+ .proc_handler = sysctl_compaction_handler,
+ },
+#endif /* CONFIG_COMPACTION */
{
.procname = "min_free_kbytes",
.data = &min_free_kbytes,
diff --git a/mm/compaction.c b/mm/compaction.c
index 14ba0ac..22f223f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -11,6 +11,7 @@
#include <linux/migrate.h>
#include <linux/compaction.h>
#include <linux/mm_inline.h>
+#include <linux/sysctl.h>
#include "internal.h"
/*
@@ -345,3 +346,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+/* Compact all zones within a node */
+static int compact_node(int nid)
+{
+ int zoneid;
+ pg_data_t *pgdat;
+ struct zone *zone;
+
+ if (nid < 0 || nid > nr_node_ids || !node_online(nid))
+ return -EINVAL;
+ pgdat = NODE_DATA(nid);
+
+ /* Flush pending updates to the LRU lists */
+ lru_add_drain_all();
+
+ for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+ struct compact_control cc;
+
+ zone = &pgdat->node_zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ cc.nr_freepages = 0;
+ cc.nr_migratepages = 0;
+ cc.zone = zone;
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ compact_zone(zone, &cc);
+
+ VM_BUG_ON(!list_empty(&cc.freepages));
+ VM_BUG_ON(!list_empty(&cc.migratepages));
+ }
+
+ return 0;
+}
+
+/* Compact all nodes in the system */
+static int compact_nodes(void)
+{
+ int nid;
+
+ for_each_online_node(nid)
+ compact_node(nid);
+
+ return COMPACT_COMPLETE;
+}
+
+/* The written value is actually unused, all memory is compacted */
+int sysctl_compact_memory;
+
+/* This is the entry point for compacting all nodes via /proc/sys/vm */
+int sysctl_compaction_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ if (write)
+ return compact_nodes();
+
+ return 0;
+}
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 09/12] Add /proc trigger for memory compaction
2010-02-18 18:02 ` [PATCH 09/12] Add /proc trigger for memory compaction Mel Gorman
@ 2010-02-19 0:43 ` KAMEZAWA Hiroyuki
2010-02-19 14:16 ` Mel Gorman
2010-02-19 2:26 ` Minchan Kim
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:43 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:39 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Nitpick:
Hmm.. Is this necessary if we have per-node trigger in sysfs ?
Thanks,
-Kame
> ---
> Documentation/sysctl/vm.txt | 11 ++++++++
> include/linux/compaction.h | 5 +++
> kernel/sysctl.c | 11 ++++++++
> mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 87 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index fc5790d..92b5b00 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -19,6 +19,7 @@ files can be found in mm/swap.c.
> Currently, these files are in /proc/sys/vm:
>
> - block_dump
> +- compact_memory
> - dirty_background_bytes
> - dirty_background_ratio
> - dirty_bytes
> @@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
>
> ==============================================================
>
> +compact_memory
> +
> +Available only when CONFIG_COMPACTION is set. When an arbitrary value
> +is written to the file, all zones are compacted such that free memory
> +is available in contiguous blocks where possible. This can be important
> +for example in the allocation of huge pages although processes will also
> +directly compact memory as required.
> +
> +==============================================================
> +
> dirty_background_bytes
>
> Contains the amount of dirty memory at which the pdflush background writeback
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 6201371..facaa3d 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -5,4 +5,9 @@
> #define COMPACT_INCOMPLETE 0
> #define COMPACT_COMPLETE 1
>
> +#ifdef CONFIG_COMPACTION
> +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos);
> +#endif /* CONFIG_COMPACTION */
> +
> #endif /* _LINUX_COMPACTION_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 8a68b24..a02c816 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -50,6 +50,7 @@
> #include <linux/ftrace.h>
> #include <linux/slow-work.h>
> #include <linux/perf_event.h>
> +#include <linux/compaction.h>
>
> #include <asm/uaccess.h>
> #include <asm/processor.h>
> @@ -80,6 +81,7 @@ extern int pid_max;
> extern int min_free_kbytes;
> extern int pid_max_min, pid_max_max;
> extern int sysctl_drop_caches;
> +extern int sysctl_compact_memory;
> extern int percpu_pagelist_fraction;
> extern int compat_log;
> extern int latencytop_enabled;
> @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> .mode = 0644,
> .proc_handler = drop_caches_sysctl_handler,
> },
> +#ifdef CONFIG_COMPACTION
> + {
> + .procname = "compact_memory",
> + .data = &sysctl_compact_memory,
> + .maxlen = sizeof(int),
> + .mode = 0200,
> + .proc_handler = sysctl_compaction_handler,
> + },
> +#endif /* CONFIG_COMPACTION */
> {
> .procname = "min_free_kbytes",
> .data = &min_free_kbytes,
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 14ba0ac..22f223f 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -11,6 +11,7 @@
> #include <linux/migrate.h>
> #include <linux/compaction.h>
> #include <linux/mm_inline.h>
> +#include <linux/sysctl.h>
> #include "internal.h"
>
> /*
> @@ -345,3 +346,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> return ret;
> }
>
> +/* Compact all zones within a node */
> +static int compact_node(int nid)
> +{
> + int zoneid;
> + pg_data_t *pgdat;
> + struct zone *zone;
> +
> + if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> + return -EINVAL;
> + pgdat = NODE_DATA(nid);
> +
> + /* Flush pending updates to the LRU lists */
> + lru_add_drain_all();
> +
> + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> + struct compact_control cc;
> +
> + zone = &pgdat->node_zones[zoneid];
> + if (!populated_zone(zone))
> + continue;
> +
> + cc.nr_freepages = 0;
> + cc.nr_migratepages = 0;
> + cc.zone = zone;
> + INIT_LIST_HEAD(&cc.freepages);
> + INIT_LIST_HEAD(&cc.migratepages);
> +
> + compact_zone(zone, &cc);
> +
> + VM_BUG_ON(!list_empty(&cc.freepages));
> + VM_BUG_ON(!list_empty(&cc.migratepages));
> + }
> +
> + return 0;
> +}
> +
> +/* Compact all nodes in the system */
> +static int compact_nodes(void)
> +{
> + int nid;
> +
> + for_each_online_node(nid)
> + compact_node(nid);
> +
> + return COMPACT_COMPLETE;
> +}
> +
> +/* The written value is actually unused, all memory is compacted */
> +int sysctl_compact_memory;
> +
> +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> +int sysctl_compaction_handler(struct ctl_table *table, int write,
> + void __user *buffer, size_t *length, loff_t *ppos)
> +{
> + if (write)
> + return compact_nodes();
> +
> + return 0;
> +}
> --
> 1.6.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 09/12] Add /proc trigger for memory compaction
2010-02-19 0:43 ` KAMEZAWA Hiroyuki
@ 2010-02-19 14:16 ` Mel Gorman
2010-02-20 3:53 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:16 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:43:26AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 18:02:39 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > value is written to the file, all zones are compacted. The expected user
> > of such a trigger is a job scheduler that prepares the system before the
> > target application runs.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
>
> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
>
> Nitpick:
> Hmm.. Is this necessary if we have per-node trigger in sysfs ?
>
What does !NUMA do?
> Thanks,
> -Kame
>
>
> > ---
> > Documentation/sysctl/vm.txt | 11 ++++++++
> > include/linux/compaction.h | 5 +++
> > kernel/sysctl.c | 11 ++++++++
> > mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 87 insertions(+), 0 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index fc5790d..92b5b00 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -19,6 +19,7 @@ files can be found in mm/swap.c.
> > Currently, these files are in /proc/sys/vm:
> >
> > - block_dump
> > +- compact_memory
> > - dirty_background_bytes
> > - dirty_background_ratio
> > - dirty_bytes
> > @@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
> >
> > ==============================================================
> >
> > +compact_memory
> > +
> > +Available only when CONFIG_COMPACTION is set. When an arbitrary value
> > +is written to the file, all zones are compacted such that free memory
> > +is available in contiguous blocks where possible. This can be important
> > +for example in the allocation of huge pages although processes will also
> > +directly compact memory as required.
> > +
> > +==============================================================
> > +
> > dirty_background_bytes
> >
> > Contains the amount of dirty memory at which the pdflush background writeback
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 6201371..facaa3d 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -5,4 +5,9 @@
> > #define COMPACT_INCOMPLETE 0
> > #define COMPACT_COMPLETE 1
> >
> > +#ifdef CONFIG_COMPACTION
> > +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *length, loff_t *ppos);
> > +#endif /* CONFIG_COMPACTION */
> > +
> > #endif /* _LINUX_COMPACTION_H */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 8a68b24..a02c816 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -50,6 +50,7 @@
> > #include <linux/ftrace.h>
> > #include <linux/slow-work.h>
> > #include <linux/perf_event.h>
> > +#include <linux/compaction.h>
> >
> > #include <asm/uaccess.h>
> > #include <asm/processor.h>
> > @@ -80,6 +81,7 @@ extern int pid_max;
> > extern int min_free_kbytes;
> > extern int pid_max_min, pid_max_max;
> > extern int sysctl_drop_caches;
> > +extern int sysctl_compact_memory;
> > extern int percpu_pagelist_fraction;
> > extern int compat_log;
> > extern int latencytop_enabled;
> > @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> > .mode = 0644,
> > .proc_handler = drop_caches_sysctl_handler,
> > },
> > +#ifdef CONFIG_COMPACTION
> > + {
> > + .procname = "compact_memory",
> > + .data = &sysctl_compact_memory,
> > + .maxlen = sizeof(int),
> > + .mode = 0200,
> > + .proc_handler = sysctl_compaction_handler,
> > + },
> > +#endif /* CONFIG_COMPACTION */
> > {
> > .procname = "min_free_kbytes",
> > .data = &min_free_kbytes,
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 14ba0ac..22f223f 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -11,6 +11,7 @@
> > #include <linux/migrate.h>
> > #include <linux/compaction.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/sysctl.h>
> > #include "internal.h"
> >
> > /*
> > @@ -345,3 +346,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > return ret;
> > }
> >
> > +/* Compact all zones within a node */
> > +static int compact_node(int nid)
> > +{
> > + int zoneid;
> > + pg_data_t *pgdat;
> > + struct zone *zone;
> > +
> > + if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> > + return -EINVAL;
> > + pgdat = NODE_DATA(nid);
> > +
> > + /* Flush pending updates to the LRU lists */
> > + lru_add_drain_all();
> > +
> > + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> > + struct compact_control cc;
> > +
> > + zone = &pgdat->node_zones[zoneid];
> > + if (!populated_zone(zone))
> > + continue;
> > +
> > + cc.nr_freepages = 0;
> > + cc.nr_migratepages = 0;
> > + cc.zone = zone;
> > + INIT_LIST_HEAD(&cc.freepages);
> > + INIT_LIST_HEAD(&cc.migratepages);
> > +
> > + compact_zone(zone, &cc);
> > +
> > + VM_BUG_ON(!list_empty(&cc.freepages));
> > + VM_BUG_ON(!list_empty(&cc.migratepages));
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +/* Compact all nodes in the system */
> > +static int compact_nodes(void)
> > +{
> > + int nid;
> > +
> > + for_each_online_node(nid)
> > + compact_node(nid);
> > +
> > + return COMPACT_COMPLETE;
> > +}
> > +
> > +/* The written value is actually unused, all memory is compacted */
> > +int sysctl_compact_memory;
> > +
> > +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> > +int sysctl_compaction_handler(struct ctl_table *table, int write,
> > + void __user *buffer, size_t *length, loff_t *ppos)
> > +{
> > + if (write)
> > + return compact_nodes();
> > +
> > + return 0;
> > +}
> > --
> > 1.6.5
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org. For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 09/12] Add /proc trigger for memory compaction
2010-02-19 14:16 ` Mel Gorman
@ 2010-02-20 3:53 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-20 3:53 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, 19 Feb 2010 14:16:41 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> On Fri, Feb 19, 2010 at 09:43:26AM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 18 Feb 2010 18:02:39 +0000
> > Mel Gorman <mel@csn.ul.ie> wrote:
> >
> > > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > > value is written to the file, all zones are compacted. The expected user
> > > of such a trigger is a job scheduler that prepares the system before the
> > > target application runs.
> > >
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > > Acked-by: Rik van Riel <riel@redhat.com>
> >
> > Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> >
> > Nitpick:
> > Hmm.. Is this necessary if we have per-node trigger in sysfs ?
> >
>
> What does !NUMA do?
>
.... I missed that. please ignore my comment.
(And I missed that we already have famous trigger as drop_caches in sysctl..)
Thanks,
-Kame
> > Thanks,
> > -Kame
> >
> >
> > > ---
> > > Documentation/sysctl/vm.txt | 11 ++++++++
> > > include/linux/compaction.h | 5 +++
> > > kernel/sysctl.c | 11 ++++++++
> > > mm/compaction.c | 60 +++++++++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 87 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > > index fc5790d..92b5b00 100644
> > > --- a/Documentation/sysctl/vm.txt
> > > +++ b/Documentation/sysctl/vm.txt
> > > @@ -19,6 +19,7 @@ files can be found in mm/swap.c.
> > > Currently, these files are in /proc/sys/vm:
> > >
> > > - block_dump
> > > +- compact_memory
> > > - dirty_background_bytes
> > > - dirty_background_ratio
> > > - dirty_bytes
> > > @@ -64,6 +65,16 @@ information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.
> > >
> > > ==============================================================
> > >
> > > +compact_memory
> > > +
> > > +Available only when CONFIG_COMPACTION is set. When an arbitrary value
> > > +is written to the file, all zones are compacted such that free memory
> > > +is available in contiguous blocks where possible. This can be important
> > > +for example in the allocation of huge pages although processes will also
> > > +directly compact memory as required.
> > > +
> > > +==============================================================
> > > +
> > > dirty_background_bytes
> > >
> > > Contains the amount of dirty memory at which the pdflush background writeback
> > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > > index 6201371..facaa3d 100644
> > > --- a/include/linux/compaction.h
> > > +++ b/include/linux/compaction.h
> > > @@ -5,4 +5,9 @@
> > > #define COMPACT_INCOMPLETE 0
> > > #define COMPACT_COMPLETE 1
> > >
> > > +#ifdef CONFIG_COMPACTION
> > > +extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > > + void __user *buffer, size_t *length, loff_t *ppos);
> > > +#endif /* CONFIG_COMPACTION */
> > > +
> > > #endif /* _LINUX_COMPACTION_H */
> > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > > index 8a68b24..a02c816 100644
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -50,6 +50,7 @@
> > > #include <linux/ftrace.h>
> > > #include <linux/slow-work.h>
> > > #include <linux/perf_event.h>
> > > +#include <linux/compaction.h>
> > >
> > > #include <asm/uaccess.h>
> > > #include <asm/processor.h>
> > > @@ -80,6 +81,7 @@ extern int pid_max;
> > > extern int min_free_kbytes;
> > > extern int pid_max_min, pid_max_max;
> > > extern int sysctl_drop_caches;
> > > +extern int sysctl_compact_memory;
> > > extern int percpu_pagelist_fraction;
> > > extern int compat_log;
> > > extern int latencytop_enabled;
> > > @@ -1109,6 +1111,15 @@ static struct ctl_table vm_table[] = {
> > > .mode = 0644,
> > > .proc_handler = drop_caches_sysctl_handler,
> > > },
> > > +#ifdef CONFIG_COMPACTION
> > > + {
> > > + .procname = "compact_memory",
> > > + .data = &sysctl_compact_memory,
> > > + .maxlen = sizeof(int),
> > > + .mode = 0200,
> > > + .proc_handler = sysctl_compaction_handler,
> > > + },
> > > +#endif /* CONFIG_COMPACTION */
> > > {
> > > .procname = "min_free_kbytes",
> > > .data = &min_free_kbytes,
> > > diff --git a/mm/compaction.c b/mm/compaction.c
> > > index 14ba0ac..22f223f 100644
> > > --- a/mm/compaction.c
> > > +++ b/mm/compaction.c
> > > @@ -11,6 +11,7 @@
> > > #include <linux/migrate.h>
> > > #include <linux/compaction.h>
> > > #include <linux/mm_inline.h>
> > > +#include <linux/sysctl.h>
> > > #include "internal.h"
> > >
> > > /*
> > > @@ -345,3 +346,62 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > > return ret;
> > > }
> > >
> > > +/* Compact all zones within a node */
> > > +static int compact_node(int nid)
> > > +{
> > > + int zoneid;
> > > + pg_data_t *pgdat;
> > > + struct zone *zone;
> > > +
> > > + if (nid < 0 || nid > nr_node_ids || !node_online(nid))
> > > + return -EINVAL;
> > > + pgdat = NODE_DATA(nid);
> > > +
> > > + /* Flush pending updates to the LRU lists */
> > > + lru_add_drain_all();
> > > +
> > > + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
> > > + struct compact_control cc;
> > > +
> > > + zone = &pgdat->node_zones[zoneid];
> > > + if (!populated_zone(zone))
> > > + continue;
> > > +
> > > + cc.nr_freepages = 0;
> > > + cc.nr_migratepages = 0;
> > > + cc.zone = zone;
> > > + INIT_LIST_HEAD(&cc.freepages);
> > > + INIT_LIST_HEAD(&cc.migratepages);
> > > +
> > > + compact_zone(zone, &cc);
> > > +
> > > + VM_BUG_ON(!list_empty(&cc.freepages));
> > > + VM_BUG_ON(!list_empty(&cc.migratepages));
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +/* Compact all nodes in the system */
> > > +static int compact_nodes(void)
> > > +{
> > > + int nid;
> > > +
> > > + for_each_online_node(nid)
> > > + compact_node(nid);
> > > +
> > > + return COMPACT_COMPLETE;
> > > +}
> > > +
> > > +/* The written value is actually unused, all memory is compacted */
> > > +int sysctl_compact_memory;
> > > +
> > > +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> > > +int sysctl_compaction_handler(struct ctl_table *table, int write,
> > > + void __user *buffer, size_t *length, loff_t *ppos)
> > > +{
> > > + if (write)
> > > + return compact_nodes();
> > > +
> > > + return 0;
> > > +}
> > > --
> > > 1.6.5
> > >
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org. For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > >
> >
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 09/12] Add /proc trigger for memory compaction
2010-02-18 18:02 ` [PATCH 09/12] Add /proc trigger for memory compaction Mel Gorman
2010-02-19 0:43 ` KAMEZAWA Hiroyuki
@ 2010-02-19 2:26 ` Minchan Kim
1 sibling, 0 replies; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 2:26 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> value is written to the file, all zones are compacted. The expected user
> of such a trigger is a job scheduler that prepares the system before the
> target application runs.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (8 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 09/12] Add /proc trigger for memory compaction Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 14:53 ` Greg KH
2010-02-18 18:02 ` [PATCH 11/12] Direct compact when a high-order allocation fails Mel Gorman
2010-02-18 18:02 ` [PATCH 12/12] Do not compact within a preferred zone after a compaction failure Mel Gorman
11 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
This patch adds a per-node sysfs file called compact. When the file is
written to, each zone in that node is compacted. The intention is that this
would be used by something like a job scheduler in a batch system before
a job starts so that the job can allocate the maximum number of
hugepages without significant start-up cost.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
drivers/base/node.c | 3 +++
include/linux/compaction.h | 16 ++++++++++++++++
mm/compaction.c | 23 +++++++++++++++++++++++
3 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7012279..2333c9d 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -15,6 +15,7 @@
#include <linux/cpu.h>
#include <linux/device.h>
#include <linux/swap.h>
+#include <linux/compaction.h>
static struct sysdev_class node_class = {
.name = "node",
@@ -239,6 +240,8 @@ int register_node(struct node *node, int num, struct node *parent)
scan_unevictable_register_node(node);
hugetlb_register_node(node);
+
+ compaction_register_node(node);
}
return error;
}
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index facaa3d..6a2eefd 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -10,4 +10,20 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
#endif /* CONFIG_COMPACTION */
+#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int compaction_register_node(struct node *node);
+extern void compaction_unregister_node(struct node *node);
+
+#else
+
+static inline int compaction_register_node(struct node *node)
+{
+ return 0;
+}
+
+static inline void compaction_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_COMPACTION && CONFIG_SYSFS && CONFIG_NUMA */
+
#endif /* _LINUX_COMPACTION_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index 22f223f..02579c2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -12,6 +12,7 @@
#include <linux/compaction.h>
#include <linux/mm_inline.h>
#include <linux/sysctl.h>
+#include <linux/sysfs.h>
#include "internal.h"
/*
@@ -405,3 +406,25 @@ int sysctl_compaction_handler(struct ctl_table *table, int write,
return 0;
}
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+ssize_t sysfs_compact_node(struct sys_device *dev,
+ struct sysdev_attribute *attr,
+ const char *buf, size_t count)
+{
+ compact_node(dev->id);
+
+ return count;
+}
+static SYSDEV_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node);
+
+int compaction_register_node(struct node *node)
+{
+ return sysdev_create_file(&node->sysdev, &attr_compact);
+}
+
+void compaction_unregister_node(struct node *node)
+{
+ return sysdev_remove_file(&node->sysdev, &attr_compact);
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-18 18:02 ` [PATCH 10/12] Add /sys trigger for per-node " Mel Gorman
@ 2010-02-19 14:53 ` Greg KH
2010-02-19 15:28 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: Greg KH @ 2010-02-19 14:53 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> This patch adds a per-node sysfs file called compact. When the file is
> written to, each zone in that node is compacted. The intention that this
> would be used by something like a job scheduler in a batch system before
> a job starts so that the job can allocate the maximum number of
> hugepages without significant start-up cost.
As you are adding sysfs files, can you please also add documentation for
the file in Documentation/ABI/ ?
thanks,
greg k-h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-19 14:53 ` Greg KH
@ 2010-02-19 15:28 ` Mel Gorman
2010-02-19 15:31 ` Greg KH
0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 15:28 UTC (permalink / raw)
To: Greg KH
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 06:53:59AM -0800, Greg KH wrote:
> On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> > This patch adds a per-node sysfs file called compact. When the file is
> > written to, each zone in that node is compacted. The intention that this
> > would be used by something like a job scheduler in a batch system before
> > a job starts so that the job can allocate the maximum number of
> > hugepages without significant start-up cost.
>
> As you are adding sysfs files, can you please also add documentation for
> the file in Documentation/ABI/ ?
>
I looked at this before and hit a wall and then forgot about it. I couldn't
find *where* I should document it at the time. There isn't a sysfs-devices-node
file to add to and much (all?) of what is in that branch appears undocumented.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org">email@kvack.org</a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-19 15:28 ` Mel Gorman
@ 2010-02-19 15:31 ` Greg KH
2010-02-19 15:51 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: Greg KH @ 2010-02-19 15:31 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 03:28:30PM +0000, Mel Gorman wrote:
> On Fri, Feb 19, 2010 at 06:53:59AM -0800, Greg KH wrote:
> > On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> > > This patch adds a per-node sysfs file called compact. When the file is
> > > written to, each zone in that node is compacted. The intention that this
> > > would be used by something like a job scheduler in a batch system before
> > > a job starts so that the job can allocate the maximum number of
> > > hugepages without significant start-up cost.
> >
> > As you are adding sysfs files, can you please also add documentation for
> > the file in Documentation/ABI/ ?
> >
>
> I looked at this before and hit a wall and then forgot about it. I couldn't
> find *where* I should document it at the time. There isn't a sysfs-devices-node
> file to add to and much (all?) of what is in that branch appears undocumented.
Well, you can always just document what you add, or you can document the
existing stuff as well. It's your choice :)
thanks,
greg k-h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-19 15:31 ` Greg KH
@ 2010-02-19 15:51 ` Mel Gorman
2010-02-19 15:52 ` Mel Gorman
0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 15:51 UTC (permalink / raw)
To: Greg KH
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 07:31:42AM -0800, Greg KH wrote:
> On Fri, Feb 19, 2010 at 03:28:30PM +0000, Mel Gorman wrote:
> > On Fri, Feb 19, 2010 at 06:53:59AM -0800, Greg KH wrote:
> > > On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> > > > This patch adds a per-node sysfs file called compact. When the file is
> > > > written to, each zone in that node is compacted. The intention that this
> > > > would be used by something like a job scheduler in a batch system before
> > > > a job starts so that the job can allocate the maximum number of
> > > > hugepages without significant start-up cost.
> > >
> > > As you are adding sysfs files, can you please also add documentation for
> > > the file in Documentation/ABI/ ?
> > >
> >
> > I looked at this before and hit a wall and then forgot about it. I couldn't
> > find *where* I should document it at the time. There isn't a sysfs-devices-node
> > file to add to and much (all?) of what is in that branch appears undocumented.
>
> Well, you can always just document what you add, or you can document the
> existing stuff as well. It's your choice :)
>
Fair point!
I've taken note to document what's in there over time. For the moment,
is this a reasonable start? I'll split it into two patches but the end
result will be the same.
diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
new file mode 100644
index 0000000..1ee348b
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-node
@@ -0,0 +1,15 @@
+What: /sys/devices/system/node/nodeX
+Date: October 2002
+Contact: Linux Memory Management list <linux-mm@kvack.org>
+Description:
+ When CONFIG_NUMA is enabled, this is a directory containing
+ information on node X such as what CPUs are local to the
+ node.
+
+What: /sys/devices/system/node/nodeX/compact
+Date: February 2010
+Contact: Mel Gorman <mel@csn.ul.ie>
+Description:
+ When this file is written to, all memory within that node
+ will be compacted. When it completes, memory will be freed
+ in blocks as contiguous as possible.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-19 15:51 ` Mel Gorman
@ 2010-02-19 15:52 ` Mel Gorman
2010-02-19 16:02 ` Greg KH
0 siblings, 1 reply; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 15:52 UTC (permalink / raw)
To: Greg KH
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 03:51:17PM +0000, Mel Gorman wrote:
> On Fri, Feb 19, 2010 at 07:31:42AM -0800, Greg KH wrote:
> > On Fri, Feb 19, 2010 at 03:28:30PM +0000, Mel Gorman wrote:
> > > On Fri, Feb 19, 2010 at 06:53:59AM -0800, Greg KH wrote:
> > > > On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> > > > > This patch adds a per-node sysfs file called compact. When the file is
> > > > > written to, each zone in that node is compacted. The intention that this
> > > > > would be used by something like a job scheduler in a batch system before
> > > > > a job starts so that the job can allocate the maximum number of
> > > > > hugepages without significant start-up cost.
> > > >
> > > > As you are adding sysfs files, can you please also add documentation for
> > > > the file in Documentation/ABI/ ?
> > > >
> > >
> > > I looked at this before and hit a wall and then forgot about it. I couldn't
> > > find *where* I should document it at the time. There isn't a sysfs-devices-node
> > > file to add to and much (all?) of what is in that branch appears undocumented.
> >
> > Well, you can always just document what you add, or you can document the
> > existing stuff as well. It's your choice :)
> >
>
> Fair point!
>
> I've taken note to document what's in there over time. For the moment,
> is this a reasonable start? I'll split it into two patches but the end
> result will be the same.
>
Bah, as I hit send, I recognised my folly. The first entry should be in
stable/ and the second should be in testing/.
> diff --git a/Documentation/ABI/testing/sysfs-devices-node b/Documentation/ABI/testing/sysfs-devices-node
> new file mode 100644
> index 0000000..1ee348b
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-devices-node
> @@ -0,0 +1,15 @@
> +What: /sys/devices/system/node/nodeX
> +Date: October 2002
> +Contact: Linux Memory Management list <linux-mm@kvack.org>
> +Description:
> + When CONFIG_NUMA is enabled, this is a directory containing
> + information on node X such as what CPUs are local to the
> + node.
> +
> +What: /sys/devices/system/node/nodeX/compact
> +Date: February 2010
> +Contact: Mel Gorman <mel@csn.ul.ie>
> +Description:
> + When this file is written to, all memory within that node
> + will be compacted. When it completes, memory will be free
> + in as contiguous blocks as possible.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
Mel Gorman
Part-time PhD Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 10/12] Add /sys trigger for per-node memory compaction
2010-02-19 15:52 ` Mel Gorman
@ 2010-02-19 16:02 ` Greg KH
0 siblings, 0 replies; 51+ messages in thread
From: Greg KH @ 2010-02-19 16:02 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 03:52:13PM +0000, Mel Gorman wrote:
> On Fri, Feb 19, 2010 at 03:51:17PM +0000, Mel Gorman wrote:
> > On Fri, Feb 19, 2010 at 07:31:42AM -0800, Greg KH wrote:
> > > On Fri, Feb 19, 2010 at 03:28:30PM +0000, Mel Gorman wrote:
> > > > On Fri, Feb 19, 2010 at 06:53:59AM -0800, Greg KH wrote:
> > > > > On Thu, Feb 18, 2010 at 06:02:40PM +0000, Mel Gorman wrote:
> > > > > > This patch adds a per-node sysfs file called compact. When the file is
> > > > > > written to, each zone in that node is compacted. The intention that this
> > > > > > would be used by something like a job scheduler in a batch system before
> > > > > > a job starts so that the job can allocate the maximum number of
> > > > > > hugepages without significant start-up cost.
> > > > >
> > > > > As you are adding sysfs files, can you please also add documentation for
> > > > > the file in Documentation/ABI/ ?
> > > > >
> > > >
> > > > I looked at this before and hit a wall and then forgot about it. I couldn't
> > > > find *where* I should document it at the time. There isn't a sysfs-devices-node
> > > > file to add to and much (all?) of what is in that branch appears undocumented.
> > >
> > > Well, you can always just document what you add, or you can document the
> > > existing stuff as well. It's your choice :)
> > >
> >
> > Fair point!
> >
> > I've taken note to document what's in there over time. For the moment,
> > is this a reasonable start? I'll split it into two patches but the end
> > result will be the same.
> >
>
> Bah, as I hit send, I recognised my folly. The first entry should be in
> stable/ and the second should be in testing/.
>
Heh, no problem, looks good to me, thanks for doing this.
greg k-h
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 11/12] Direct compact when a high-order allocation fails
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (9 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 10/12] Add /sys trigger for per-node " Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
2010-02-19 0:51 ` KAMEZAWA Hiroyuki
2010-02-19 2:41 ` Minchan Kim
2010-02-18 18:02 ` [PATCH 12/12] Do not compact within a preferred zone after a compaction failure Mel Gorman
11 siblings, 2 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
Ordinarily when a high-order allocation fails, direct reclaim is entered to
free pages to satisfy the allocation. With this patch, it is determined if
an allocation failed due to external fragmentation instead of low memory
and if so, the calling process will compact until a suitable page is
freed. Compaction by moving pages in memory is considerably cheaper than
paging out to disk and works where there are locked pages or no swap. If
compaction fails to free a page of a suitable size, then reclaim will
still occur.
Direct compaction returns as soon as possible. As each block is compacted,
it is checked if a suitable page has been freed and if so, it returns.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/linux/compaction.h | 16 +++++-
include/linux/vmstat.h | 1 +
mm/compaction.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 26 ++++++++++
mm/vmstat.c | 15 +++++-
5 files changed, 172 insertions(+), 4 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 6a2eefd..1cf95e2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,13 +1,25 @@
#ifndef _LINUX_COMPACTION_H
#define _LINUX_COMPACTION_H
-/* Return values for compact_zone() */
+/* Return values for compact_zone() and try_to_compact_pages() */
#define COMPACT_INCOMPLETE 0
-#define COMPACT_COMPLETE 1
+#define COMPACT_PARTIAL 1
+#define COMPACT_COMPLETE 2
#ifdef CONFIG_COMPACTION
extern int sysctl_compaction_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos);
+
+extern int fragmentation_index(struct zone *zone, unsigned int order);
+extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *mask);
+#else
+static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return COMPACT_INCOMPLETE;
+}
+
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index d7f7236..0ea7a38 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
+ COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
diff --git a/mm/compaction.c b/mm/compaction.c
index 02579c2..c7c73bb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -34,6 +34,8 @@ struct compact_control {
unsigned long nr_anon;
unsigned long nr_file;
+ unsigned int order; /* order a direct compactor needs */
+ int migratetype; /* MOVABLE, RECLAIMABLE etc */
struct zone *zone;
};
@@ -298,10 +300,31 @@ static void update_nr_listpages(struct compact_control *cc)
static inline int compact_finished(struct zone *zone,
struct compact_control *cc)
{
+ unsigned int order;
+ unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
+
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn)
return COMPACT_COMPLETE;
+ /* Compaction run is not finished if the watermark is not met */
+ if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
+ return COMPACT_INCOMPLETE;
+
+ if (cc->order == -1)
+ return COMPACT_INCOMPLETE;
+
+ /* Direct compactor: Is a suitable page free? */
+ for (order = cc->order; order < MAX_ORDER; order++) {
+ /* Job done if page is free of the right migratetype */
+ if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
+ return COMPACT_PARTIAL;
+
+ /* Job done if allocation would set block type */
+ if (order >= pageblock_order && zone->free_area[order].nr_free)
+ return COMPACT_PARTIAL;
+ }
+
return COMPACT_INCOMPLETE;
}
@@ -347,6 +370,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
return ret;
}
+static inline unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask)
+{
+ struct compact_control cc = {
+ .nr_freepages = 0,
+ .nr_migratepages = 0,
+ .order = order,
+ .migratetype = allocflags_to_migratetype(gfp_mask),
+ .zone = zone,
+ };
+ INIT_LIST_HEAD(&cc.freepages);
+ INIT_LIST_HEAD(&cc.migratepages);
+
+ return compact_zone(zone, &cc);
+}
+
+/**
+ * try_to_compact_pages - Direct compact to satisfy a high-order allocation
+ * @zonelist: The zonelist used for the current allocation
+ * @order: The order of the current allocation
+ * @gfp_mask: The GFP mask of the current allocation
+ * @nodemask: The allowed nodes to allocate from
+ *
+ * This is the main entry point for direct page compaction.
+ */
+unsigned long try_to_compact_pages(struct zonelist *zonelist,
+ int order, gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ enum zone_type high_zoneidx = gfp_zone(gfp_mask);
+ int may_enter_fs = gfp_mask & __GFP_FS;
+ int may_perform_io = gfp_mask & __GFP_IO;
+ unsigned long watermark;
+ struct zoneref *z;
+ struct zone *zone;
+ int rc = COMPACT_INCOMPLETE;
+
+ /* Check whether it is worth even starting compaction */
+ if (order == 0 || !may_enter_fs || !may_perform_io)
+ return rc;
+
+ /*
+ * We will not stall if the necessary conditions are not met for
+ * migration but direct reclaim seems to account stalls similarly
+ */
+ count_vm_event(COMPACTSTALL);
+
+ /* Compact each zone in the list */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
+ nodemask) {
+ int fragindex;
+ int status;
+
+ /*
+ * Watermarks for order-0 must be met for compaction. Note
+ * the 2UL. This is because during migration, copies of
+ * pages need to be allocated and for a short time, the
+ * footprint is higher
+ */
+ watermark = low_wmark_pages(zone) + (2UL << order);
+ if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+ continue;
+
+ /*
+ * fragmentation index determines if allocation failures are
+ * due to low memory or external fragmentation
+ *
+ * index of -1 implies allocations might succeed depending
+ * on watermarks
+ * index < 500 implies alloc failure is due to lack of memory
+ *
+ * XXX: The choice of 500 is arbitrary. Reinvestigate
+ * appropriately to determine a sensible default.
+ * and what it means when watermarks are also taken
+ * into account. Consider making it a sysctl
+ */
+ fragindex = fragmentation_index(zone, order);
+ if (fragindex >= 0 && fragindex <= 500)
+ continue;
+
+ if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
+ rc = COMPACT_PARTIAL;
+ break;
+ }
+
+ status = compact_zone_order(zone, order, gfp_mask);
+ rc = max(status, rc);
+
+ if (zone_watermark_ok(zone, order, watermark, 0, 0))
+ break;
+ }
+
+ return rc;
+}
+
+
/* Compact all zones within a node */
static int compact_node(int nid)
{
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6d57154..1910b8b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -49,6 +49,7 @@
#include <linux/debugobjects.h>
#include <linux/kmemleak.h>
#include <linux/memory.h>
+#include <linux/compaction.h>
#include <trace/events/kmem.h>
#include <asm/tlbflush.h>
@@ -1728,6 +1729,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
+ /* Try memory compaction for high-order allocations before reclaim */
+ if (order) {
+ *did_some_progress = try_to_compact_pages(zonelist,
+ order, gfp_mask, nodemask);
+ if (*did_some_progress != COMPACT_INCOMPLETE) {
+ page = get_page_from_freelist(gfp_mask, nodemask,
+ order, zonelist, high_zoneidx,
+ alloc_flags, preferred_zone,
+ migratetype);
+ if (page) {
+ __count_vm_event(COMPACTSUCCESS);
+ return page;
+ }
+
+ /*
+ * It's bad if compaction run occurs and fails.
+ * The most likely reason is that pages exist,
+ * but not enough to satisfy watermarks.
+ */
+ count_vm_event(COMPACTFAIL);
+
+ cond_resched();
+ }
+ }
+
/* We now go into synchronous reclaim */
cpuset_memory_pressure_bump();
p->flags |= PF_MEMALLOC;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0a14d22..189a379 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -558,7 +558,7 @@ static int unusable_show(struct seq_file *m, void *arg)
* The value can be used to determine if page reclaim or compaction
* should be used
*/
-int fragmentation_index(unsigned int order, struct contig_page_info *info)
+int __fragmentation_index(unsigned int order, struct contig_page_info *info)
{
unsigned long requested = 1UL << order;
@@ -578,6 +578,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
}
+/* Same as __fragmentation_index but allocs contig_page_info on stack */
+int fragmentation_index(struct zone *zone, unsigned int order)
+{
+ struct contig_page_info info;
+
+ fill_contig_page_info(zone, order, &info);
+ return __fragmentation_index(order, &info);
+}
static void extfrag_show_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
@@ -593,7 +601,7 @@ static void extfrag_show_print(struct seq_file *m,
zone->name);
for (order = 0; order < MAX_ORDER; ++order) {
fill_contig_page_info(zone, order, &info);
- index = fragmentation_index(order, &info);
+ index = __fragmentation_index(order, &info);
seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
}
@@ -893,6 +901,9 @@ static const char * const vmstat_text[] = {
"compact_blocks_moved",
"compact_pages_moved",
"compact_pagemigrate_failed",
+ "compact_stall",
+ "compact_fail",
+ "compact_success",
#ifdef CONFIG_HUGETLB_PAGE
"htlb_buddy_alloc_success",
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 11/12] Direct compact when a high-order allocation fails
2010-02-18 18:02 ` [PATCH 11/12] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-02-19 0:51 ` KAMEZAWA Hiroyuki
2010-02-19 14:19 ` Mel Gorman
2010-02-19 2:41 ` Minchan Kim
1 sibling, 1 reply; 51+ messages in thread
From: KAMEZAWA Hiroyuki @ 2010-02-19 0:51 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Thu, 18 Feb 2010 18:02:41 +0000
Mel Gorman <mel@csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation. With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/compaction.h | 16 +++++-
> include/linux/vmstat.h | 1 +
> mm/compaction.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 26 ++++++++++
> mm/vmstat.c | 15 +++++-
> 5 files changed, 172 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 6a2eefd..1cf95e2 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,13 +1,25 @@
> #ifndef _LINUX_COMPACTION_H
> #define _LINUX_COMPACTION_H
>
> -/* Return values for compact_zone() */
> +/* Return values for compact_zone() and try_to_compact_pages() */
> #define COMPACT_INCOMPLETE 0
> -#define COMPACT_COMPLETE 1
> +#define COMPACT_PARTIAL 1
> +#define COMPACT_COMPLETE 2
>
> #ifdef CONFIG_COMPACTION
> extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *length, loff_t *ppos);
> +
> +extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> + int order, gfp_t gfp_mask, nodemask_t *mask);
> +#else
> +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> + return COMPACT_INCOMPLETE;
> +}
> +
> #endif /* CONFIG_COMPACTION */
>
> #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d7f7236..0ea7a38 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> + COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 02579c2..c7c73bb 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -34,6 +34,8 @@ struct compact_control {
> unsigned long nr_anon;
> unsigned long nr_file;
>
> + unsigned int order; /* order a direct compactor needs */
> + int migratetype; /* MOVABLE, RECLAIMABLE etc */
> struct zone *zone;
> };
>
> @@ -298,10 +300,31 @@ static void update_nr_listpages(struct compact_control *cc)
> static inline int compact_finished(struct zone *zone,
> struct compact_control *cc)
> {
> + unsigned int order;
> + unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +
> /* Compaction run completes if the migrate and free scanner meet */
> if (cc->free_pfn <= cc->migrate_pfn)
> return COMPACT_COMPLETE;
>
> + /* Compaction run is not finished if the watermark is not met */
> + if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> + return COMPACT_INCOMPLETE;
> +
> + if (cc->order == -1)
> + return COMPACT_INCOMPLETE;
> +
> + /* Direct compactor: Is a suitable page free? */
> + for (order = cc->order; order < MAX_ORDER; order++) {
> + /* Job done if page is free of the right migratetype */
> + if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> + return COMPACT_PARTIAL;
> +
> + /* Job done if allocation would set block type */
> + if (order >= pageblock_order && zone->free_area[order].nr_free)
> + return COMPACT_PARTIAL;
> + }
> +
> return COMPACT_INCOMPLETE;
> }
>
> @@ -347,6 +370,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> return ret;
> }
>
> +static inline unsigned long compact_zone_order(struct zone *zone,
> + int order, gfp_t gfp_mask)
> +{
> + struct compact_control cc = {
> + .nr_freepages = 0,
> + .nr_migratepages = 0,
> + .order = order,
> + .migratetype = allocflags_to_migratetype(gfp_mask),
> + .zone = zone,
> + };
> + INIT_LIST_HEAD(&cc.freepages);
> + INIT_LIST_HEAD(&cc.migratepages);
> +
> + return compact_zone(zone, &cc);
> +}
> +
> +/**
> + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> + * @zonelist: The zonelist used for the current allocation
> + * @order: The order of the current allocation
> + * @gfp_mask: The GFP mask of the current allocation
> + * @nodemask: The allowed nodes to allocate from
> + *
> + * This is the main entry point for direct page compaction.
> + */
> +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> + enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> + int may_enter_fs = gfp_mask & __GFP_FS;
> + int may_perform_io = gfp_mask & __GFP_IO;
> + unsigned long watermark;
> + struct zoneref *z;
> + struct zone *zone;
> + int rc = COMPACT_INCOMPLETE;
> +
> + /* Check whether it is worth even starting compaction */
> + if (order == 0 || !may_enter_fs || !may_perform_io)
> + return rc;
> +
> + /*
> + * We will not stall if the necessary conditions are not met for
> + * migration but direct reclaim seems to account stalls similarly
> + */
> + count_vm_event(COMPACTSTALL);
> +
> + /* Compact each zone in the list */
> + for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> + nodemask) {
> + int fragindex;
> + int status;
> +
> + /*
> + * Watermarks for order-0 must be met for compaction. Note
> + * the 2UL. This is because during migration, copies of
> + * pages need to be allocated and for a short time, the
> + * footprint is higher
> + */
> + watermark = low_wmark_pages(zone) + (2UL << order);
> + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> + continue;
> +
> + /*
> + * fragmentation index determines if allocation failures are
> + * due to low memory or external fragmentation
> + *
> + * index of -1 implies allocations might succeed depending
> + * on watermarks
> + * index < 500 implies alloc failure is due to lack of memory
> + *
> + * XXX: The choice of 500 is arbitrary. Reinvestigate
> + * appropriately to determine a sensible default.
> + * and what it means when watermarks are also taken
> + * into account. Consider making it a sysctl
> + */
> + fragindex = fragmentation_index(zone, order);
> + if (fragindex >= 0 && fragindex <= 500)
> + continue;
> +
> + if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> + rc = COMPACT_PARTIAL;
> + break;
> + }
> +
> + status = compact_zone_order(zone, order, gfp_mask);
> + rc = max(status, rc);
> +
> + if (zone_watermark_ok(zone, order, watermark, 0, 0))
> + break;
> + }
> +
> + return rc;
> +}
> +
> +
> /* Compact all zones within a node */
> static int compact_node(int nid)
> {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6d57154..1910b8b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -49,6 +49,7 @@
> #include <linux/debugobjects.h>
> #include <linux/kmemleak.h>
> #include <linux/memory.h>
> +#include <linux/compaction.h>
> #include <trace/events/kmem.h>
>
> #include <asm/tlbflush.h>
> @@ -1728,6 +1729,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>
> cond_resched();
>
Isn't kswapd woken up before we reach here? Is it intentional?
Thanks,
-Kame
> + /* Try memory compaction for high-order allocations before reclaim */
> + if (order) {
> + *did_some_progress = try_to_compact_pages(zonelist,
> + order, gfp_mask, nodemask);
> + if (*did_some_progress != COMPACT_INCOMPLETE) {
> + page = get_page_from_freelist(gfp_mask, nodemask,
> + order, zonelist, high_zoneidx,
> + alloc_flags, preferred_zone,
> + migratetype);
> + if (page) {
> + __count_vm_event(COMPACTSUCCESS);
> + return page;
> + }
> +
> + /*
> + * It's bad if compaction run occurs and fails.
> + * The most likely reason is that pages exist,
> + * but not enough to satisfy watermarks.
> + */
> + count_vm_event(COMPACTFAIL);
> +
> + cond_resched();
> + }
> + }
> +
> /* We now go into synchronous reclaim */
> cpuset_memory_pressure_bump();
> p->flags |= PF_MEMALLOC;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0a14d22..189a379 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -558,7 +558,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> * The value can be used to determine if page reclaim or compaction
> * should be used
> */
> -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> {
> unsigned long requested = 1UL << order;
>
> @@ -578,6 +578,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> }
>
> +/* Same as __fragmentation index but allocs contig_page_info on stack */
> +int fragmentation_index(struct zone *zone, unsigned int order)
> +{
> + struct contig_page_info info;
> +
> + fill_contig_page_info(zone, order, &info);
> + return __fragmentation_index(order, &info);
> +}
>
> static void extfrag_show_print(struct seq_file *m,
> pg_data_t *pgdat, struct zone *zone)
> @@ -593,7 +601,7 @@ static void extfrag_show_print(struct seq_file *m,
> zone->name);
> for (order = 0; order < MAX_ORDER; ++order) {
> fill_contig_page_info(zone, order, &info);
> - index = fragmentation_index(order, &info);
> + index = __fragmentation_index(order, &info);
> seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> }
>
> @@ -893,6 +901,9 @@ static const char * const vmstat_text[] = {
> "compact_blocks_moved",
> "compact_pages_moved",
> "compact_pagemigrate_failed",
> + "compact_stall",
> + "compact_fail",
> + "compact_success",
>
> #ifdef CONFIG_HUGETLB_PAGE
> "htlb_buddy_alloc_success",
> --
> 1.6.5
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 11/12] Direct compact when a high-order allocation fails
2010-02-19 0:51 ` KAMEZAWA Hiroyuki
@ 2010-02-19 14:19 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:19 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 09:51:32AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 18 Feb 2010 18:02:41 +0000
> Mel Gorman <mel@csn.ul.ie> wrote:
>
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation. With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> >
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > include/linux/compaction.h | 16 +++++-
> > include/linux/vmstat.h | 1 +
> > mm/compaction.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 26 ++++++++++
> > mm/vmstat.c | 15 +++++-
> > 5 files changed, 172 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 6a2eefd..1cf95e2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -1,13 +1,25 @@
> > #ifndef _LINUX_COMPACTION_H
> > #define _LINUX_COMPACTION_H
> >
> > -/* Return values for compact_zone() */
> > +/* Return values for compact_zone() and try_to_compact_pages() */
> > #define COMPACT_INCOMPLETE 0
> > -#define COMPACT_COMPLETE 1
> > +#define COMPACT_PARTIAL 1
> > +#define COMPACT_COMPLETE 2
> >
> > #ifdef CONFIG_COMPACTION
> > extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > void __user *buffer, size_t *length, loff_t *ppos);
> > +
> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > + int order, gfp_t gfp_mask, nodemask_t *mask);
> > +#else
> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > + return COMPACT_INCOMPLETE;
> > +}
> > +
> > #endif /* CONFIG_COMPACTION */
> >
> > #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index d7f7236..0ea7a38 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > KSWAPD_SKIP_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > + COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > #endif
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 02579c2..c7c73bb 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -34,6 +34,8 @@ struct compact_control {
> > unsigned long nr_anon;
> > unsigned long nr_file;
> >
> > + unsigned int order; /* order a direct compactor needs */
> > + int migratetype; /* MOVABLE, RECLAIMABLE etc */
> > struct zone *zone;
> > };
> >
> > @@ -298,10 +300,31 @@ static void update_nr_listpages(struct compact_control *cc)
> > static inline int compact_finished(struct zone *zone,
> > struct compact_control *cc)
> > {
> > + unsigned int order;
> > + unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > +
> > /* Compaction run completes if the migrate and free scanner meet */
> > if (cc->free_pfn <= cc->migrate_pfn)
> > return COMPACT_COMPLETE;
> >
> > + /* Compaction run is not finished if the watermark is not met */
> > + if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > + return COMPACT_INCOMPLETE;
> > +
> > + if (cc->order == -1)
> > + return COMPACT_INCOMPLETE;
> > +
> > + /* Direct compactor: Is a suitable page free? */
> > + for (order = cc->order; order < MAX_ORDER; order++) {
> > + /* Job done if page is free of the right migratetype */
> > + if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> > + return COMPACT_PARTIAL;
> > +
> > + /* Job done if allocation would set block type */
> > + if (order >= pageblock_order && zone->free_area[order].nr_free)
> > + return COMPACT_PARTIAL;
> > + }
> > +
> > return COMPACT_INCOMPLETE;
> > }
> >
> > @@ -347,6 +370,101 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > return ret;
> > }
> >
> > +static inline unsigned long compact_zone_order(struct zone *zone,
> > + int order, gfp_t gfp_mask)
> > +{
> > + struct compact_control cc = {
> > + .nr_freepages = 0,
> > + .nr_migratepages = 0,
> > + .order = order,
> > + .migratetype = allocflags_to_migratetype(gfp_mask),
> > + .zone = zone,
> > + };
> > + INIT_LIST_HEAD(&cc.freepages);
> > + INIT_LIST_HEAD(&cc.migratepages);
> > +
> > + return compact_zone(zone, &cc);
> > +}
> > +
> > +/**
> > + * try_to_compact_pages - Direct compact to satisfy a high-order allocation
> > + * @zonelist: The zonelist used for the current allocation
> > + * @order: The order of the current allocation
> > + * @gfp_mask: The GFP mask of the current allocation
> > + * @nodemask: The allowed nodes to allocate from
> > + *
> > + * This is the main entry point for direct page compaction.
> > + */
> > +unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > + enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > + int may_enter_fs = gfp_mask & __GFP_FS;
> > + int may_perform_io = gfp_mask & __GFP_IO;
> > + unsigned long watermark;
> > + struct zoneref *z;
> > + struct zone *zone;
> > + int rc = COMPACT_INCOMPLETE;
> > +
> > + /* Check whether it is worth even starting compaction */
> > + if (order == 0 || !may_enter_fs || !may_perform_io)
> > + return rc;
> > +
> > + /*
> > + * We will not stall if the necessary conditions are not met for
> > + * migration but direct reclaim seems to account stalls similarly
> > + */
> > + count_vm_event(COMPACTSTALL);
> > +
> > + /* Compact each zone in the list */
> > + for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
> > + nodemask) {
> > + int fragindex;
> > + int status;
> > +
> > + /*
> > + * Watermarks for order-0 must be met for compaction. Note
> > + * the 2UL. This is because during migration, copies of
> > + * pages need to be allocated and for a short time, the
> > + * footprint is higher
> > + */
> > + watermark = low_wmark_pages(zone) + (2UL << order);
> > + if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
> > + continue;
> > +
> > + /*
> > + * fragmentation index determines if allocation failures are
> > + * due to low memory or external fragmentation
> > + *
> > + * index of -1 implies allocations might succeed depending
> > + * on watermarks
> > + * index < 500 implies alloc failure is due to lack of memory
> > + *
> > + * XXX: The choice of 500 is arbitrary. Reinvestigate
> > + * appropriately to determine a sensible default.
> > + * and what it means when watermarks are also taken
> > + * into account. Consider making it a sysctl
> > + */
> > + fragindex = fragmentation_index(zone, order);
> > + if (fragindex >= 0 && fragindex <= 500)
> > + continue;
> > +
> > + if (fragindex == -1 && zone_watermark_ok(zone, order, watermark, 0, 0)) {
> > + rc = COMPACT_PARTIAL;
> > + break;
> > + }
> > +
> > + status = compact_zone_order(zone, order, gfp_mask);
> > + rc = max(status, rc);
> > +
> > + if (zone_watermark_ok(zone, order, watermark, 0, 0))
> > + break;
> > + }
> > +
> > + return rc;
> > +}
> > +
> > +
> > /* Compact all zones within a node */
> > static int compact_node(int nid)
> > {
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 6d57154..1910b8b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -49,6 +49,7 @@
> > #include <linux/debugobjects.h>
> > #include <linux/kmemleak.h>
> > #include <linux/memory.h>
> > +#include <linux/compaction.h>
> > #include <trace/events/kmem.h>
> >
> > #include <asm/tlbflush.h>
> > @@ -1728,6 +1729,31 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >
> > cond_resched();
> >
>
> Isn't kswapd waken up before we reach here ? Is it intentional ?
>
For the moment, yes.
The watermarks were not met so kswapd should start doing work. Even if
compaction is partially successful, there are no guarantees that the
watermarks are still met after a page is allocated, so kswapd should do
some work.
Potentially in the far future, kswapd could be avoided and compaction
used far earlier - potentially even in the !wait paths but it's not
there yet.
> > + /* Try memory compaction for high-order allocations before reclaim */
> > + if (order) {
> > + *did_some_progress = try_to_compact_pages(zonelist,
> > + order, gfp_mask, nodemask);
> > + if (*did_some_progress != COMPACT_INCOMPLETE) {
> > + page = get_page_from_freelist(gfp_mask, nodemask,
> > + order, zonelist, high_zoneidx,
> > + alloc_flags, preferred_zone,
> > + migratetype);
> > + if (page) {
> > + __count_vm_event(COMPACTSUCCESS);
> > + return page;
> > + }
> > +
> > + /*
> > + * It's bad if compaction run occurs and fails.
> > + * The most likely reason is that pages exist,
> > + * but not enough to satisfy watermarks.
> > + */
> > + count_vm_event(COMPACTFAIL);
> > +
> > + cond_resched();
> > + }
> > + }
> > +
> > /* We now go into synchronous reclaim */
> > cpuset_memory_pressure_bump();
> > p->flags |= PF_MEMALLOC;
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index 0a14d22..189a379 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -558,7 +558,7 @@ static int unusable_show(struct seq_file *m, void *arg)
> > * The value can be used to determine if page reclaim or compaction
> > * should be used
> > */
> > -int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > +int __fragmentation_index(unsigned int order, struct contig_page_info *info)
> > {
> > unsigned long requested = 1UL << order;
> >
> > @@ -578,6 +578,14 @@ int fragmentation_index(unsigned int order, struct contig_page_info *info)
> > return 1000 - ( (1000+(info->free_pages * 1000 / requested)) / info->free_blocks_total);
> > }
> >
> > +/* Same as __fragmentation index but allocs contig_page_info on stack */
> > +int fragmentation_index(struct zone *zone, unsigned int order)
> > +{
> > + struct contig_page_info info;
> > +
> > + fill_contig_page_info(zone, order, &info);
> > + return __fragmentation_index(order, &info);
> > +}
> >
> > static void extfrag_show_print(struct seq_file *m,
> > pg_data_t *pgdat, struct zone *zone)
> > @@ -593,7 +601,7 @@ static void extfrag_show_print(struct seq_file *m,
> > zone->name);
> > for (order = 0; order < MAX_ORDER; ++order) {
> > fill_contig_page_info(zone, order, &info);
> > - index = fragmentation_index(order, &info);
> > + index = __fragmentation_index(order, &info);
> > seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
> > }
> >
> > @@ -893,6 +901,9 @@ static const char * const vmstat_text[] = {
> > "compact_blocks_moved",
> > "compact_pages_moved",
> > "compact_pagemigrate_failed",
> > + "compact_stall",
> > + "compact_fail",
> > + "compact_success",
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> > "htlb_buddy_alloc_success",
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 11/12] Direct compact when a high-order allocation fails
2010-02-18 18:02 ` [PATCH 11/12] Direct compact when a high-order allocation fails Mel Gorman
2010-02-19 0:51 ` KAMEZAWA Hiroyuki
@ 2010-02-19 2:41 ` Minchan Kim
2010-02-19 14:25 ` Mel Gorman
1 sibling, 1 reply; 51+ messages in thread
From: Minchan Kim @ 2010-02-19 2:41 UTC (permalink / raw)
To: Mel Gorman
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> Ordinarily when a high-order allocation fails, direct reclaim is entered to
> free pages to satisfy the allocation. With this patch, it is determined if
> an allocation failed due to external fragmentation instead of low memory
> and if so, the calling process will compact until a suitable page is
> freed. Compaction by moving pages in memory is considerably cheaper than
> paging out to disk and works where there are locked pages or no swap. If
> compaction fails to free a page of a suitable size, then reclaim will
> still occur.
>
> Direct compaction returns as soon as possible. As each block is compacted,
> it is checked if a suitable page has been freed and if so, it returns.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> Acked-by: Rik van Riel <riel@redhat.com>
> ---
> include/linux/compaction.h | 16 +++++-
> include/linux/vmstat.h | 1 +
> mm/compaction.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
> mm/page_alloc.c | 26 ++++++++++
> mm/vmstat.c | 15 +++++-
> 5 files changed, 172 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 6a2eefd..1cf95e2 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,13 +1,25 @@
> #ifndef _LINUX_COMPACTION_H
> #define _LINUX_COMPACTION_H
>
> -/* Return values for compact_zone() */
> +/* Return values for compact_zone() and try_to_compact_pages() */
> #define COMPACT_INCOMPLETE 0
> -#define COMPACT_COMPLETE 1
> +#define COMPACT_PARTIAL 1
> +#define COMPACT_COMPLETE 2
>
> #ifdef CONFIG_COMPACTION
> extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> void __user *buffer, size_t *length, loff_t *ppos);
> +
> +extern int fragmentation_index(struct zone *zone, unsigned int order);
> +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> + int order, gfp_t gfp_mask, nodemask_t *mask);
> +#else
> +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> + return COMPACT_INCOMPLETE;
> +}
> +
> #endif /* CONFIG_COMPACTION */
>
> #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index d7f7236..0ea7a38 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> KSWAPD_SKIP_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> + COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> #endif
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 02579c2..c7c73bb 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -34,6 +34,8 @@ struct compact_control {
> unsigned long nr_anon;
> unsigned long nr_file;
>
> + unsigned int order; /* order a direct compactor needs */
> + int migratetype; /* MOVABLE, RECLAIMABLE etc */
> struct zone *zone;
> };
>
> @@ -298,10 +300,31 @@ static void update_nr_listpages(struct compact_control *cc)
> static inline int compact_finished(struct zone *zone,
> struct compact_control *cc)
> {
> + unsigned int order;
> + unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> +
> /* Compaction run completes if the migrate and free scanner meet */
> if (cc->free_pfn <= cc->migrate_pfn)
> return COMPACT_COMPLETE;
>
> + /* Compaction run is not finished if the watermark is not met */
> + if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> + return COMPACT_INCOMPLETE;
> +
> + if (cc->order == -1)
> + return COMPACT_INCOMPLETE;
Where do we set cc->order = -1?
Sorry but I can't find it.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 11/12] Direct compact when a high-order allocation fails
2010-02-19 2:41 ` Minchan Kim
@ 2010-02-19 14:25 ` Mel Gorman
0 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-19 14:25 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrea Arcangeli, Christoph Lameter, Adam Litke, Avi Kivity,
David Rientjes, KOSAKI Motohiro, Rik van Riel, linux-kernel,
linux-mm
On Fri, Feb 19, 2010 at 11:41:56AM +0900, Minchan Kim wrote:
> On Fri, Feb 19, 2010 at 3:02 AM, Mel Gorman <mel@csn.ul.ie> wrote:
> > Ordinarily when a high-order allocation fails, direct reclaim is entered to
> > free pages to satisfy the allocation. With this patch, it is determined if
> > an allocation failed due to external fragmentation instead of low memory
> > and if so, the calling process will compact until a suitable page is
> > freed. Compaction by moving pages in memory is considerably cheaper than
> > paging out to disk and works where there are locked pages or no swap. If
> > compaction fails to free a page of a suitable size, then reclaim will
> > still occur.
> >
> > Direct compaction returns as soon as possible. As each block is compacted,
> > it is checked if a suitable page has been freed and if so, it returns.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > Acked-by: Rik van Riel <riel@redhat.com>
> > ---
> > include/linux/compaction.h | 16 +++++-
> > include/linux/vmstat.h | 1 +
> > mm/compaction.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
> > mm/page_alloc.c | 26 ++++++++++
> > mm/vmstat.c | 15 +++++-
> > 5 files changed, 172 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 6a2eefd..1cf95e2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -1,13 +1,25 @@
> > #ifndef _LINUX_COMPACTION_H
> > #define _LINUX_COMPACTION_H
> >
> > -/* Return values for compact_zone() */
> > +/* Return values for compact_zone() and try_to_compact_pages() */
> > #define COMPACT_INCOMPLETE 0
> > -#define COMPACT_COMPLETE 1
> > +#define COMPACT_PARTIAL 1
> > +#define COMPACT_COMPLETE 2
> >
> > #ifdef CONFIG_COMPACTION
> > extern int sysctl_compaction_handler(struct ctl_table *table, int write,
> > void __user *buffer, size_t *length, loff_t *ppos);
> > +
> > +extern int fragmentation_index(struct zone *zone, unsigned int order);
> > +extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > + int order, gfp_t gfp_mask, nodemask_t *mask);
> > +#else
> > +static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
> > + int order, gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > + return COMPACT_INCOMPLETE;
> > +}
> > +
> > #endif /* CONFIG_COMPACTION */
> >
> > #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index d7f7236..0ea7a38 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -44,6 +44,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > KSWAPD_SKIP_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
> > + COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > #endif
> > diff --git a/mm/compaction.c b/mm/compaction.c
> > index 02579c2..c7c73bb 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -34,6 +34,8 @@ struct compact_control {
> > unsigned long nr_anon;
> > unsigned long nr_file;
> >
> > + unsigned int order; /* order a direct compactor needs */
> > + int migratetype; /* MOVABLE, RECLAIMABLE etc */
> > struct zone *zone;
> > };
> >
> > @@ -298,10 +300,31 @@ static void update_nr_listpages(struct compact_control *cc)
> > static inline int compact_finished(struct zone *zone,
> > struct compact_control *cc)
> > {
> > + unsigned int order;
> > + unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
> > +
> > /* Compaction run completes if the migrate and free scanner meet */
> > if (cc->free_pfn <= cc->migrate_pfn)
> > return COMPACT_COMPLETE;
> >
> > + /* Compaction run is not finished if the watermark is not met */
> > + if (!zone_watermark_ok(zone, cc->order, watermark, 0, 0))
> > + return COMPACT_INCOMPLETE;
> > +
> > + if (cc->order == -1)
> > + return COMPACT_INCOMPLETE;
>
> Where do we set cc->order = -1?
> Sorry but I can't find it.
>
Good spot, it should have been set in compact_node() to force a full
compaction.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 12/12] Do not compact within a preferred zone after a compaction failure
2010-02-18 18:02 [PATCH 0/12] Memory Compaction v3 Mel Gorman
` (10 preceding siblings ...)
2010-02-18 18:02 ` [PATCH 11/12] Direct compact when a high-order allocation fails Mel Gorman
@ 2010-02-18 18:02 ` Mel Gorman
11 siblings, 0 replies; 51+ messages in thread
From: Mel Gorman @ 2010-02-18 18:02 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: Christoph Lameter, Adam Litke, Avi Kivity, David Rientjes,
KOSAKI Motohiro, Rik van Riel, Mel Gorman, linux-kernel, linux-mm
The fragmentation index may indicate that a failure is due to external
fragmentation, yet a compaction run can complete and an allocation can
still fail. There are two obvious reasons as to why:
o Page migration cannot move all pages so fragmentation remains
o A suitable page may exist but watermarks are not met
In the event of compaction and allocation failure, this patch prevents
compaction happening for a short interval. It's only recorded on the
preferred zone but that should be enough coverage. This could have been
implemented similar to the zonelist_cache but the increased size of the
zonelist did not appear to be justified.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
---
include/linux/compaction.h | 35 +++++++++++++++++++++++++++++++++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 5 ++++-
3 files changed, 46 insertions(+), 1 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 1cf95e2..8b1471b 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -13,6 +13,32 @@ extern int sysctl_compaction_handler(struct ctl_table *table, int write,
extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask);
+
+/* defer_compaction - Do not compact within a zone until a given time */
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+ /*
+ * This function is called when compaction fails to result in a page
+ * allocation success. This is somewhat unsatisfactory as the failure
+ * to compact has nothing to do with time and everything to do with
+ * the requested order, the number of free pages and watermarks. How
+ * to wait on that is more unclear, but the answer would apply to
+ * other areas where the VM waits based on time.
+ */
+ zone->compact_resume = resume;
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+ /* init once if necessary */
+ if (unlikely(!zone->compact_resume)) {
+ zone->compact_resume = jiffies;
+ return 0;
+ }
+
+ return time_before(jiffies, zone->compact_resume);
+}
+
#else
static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *nodemask)
@@ -20,6 +46,15 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
return COMPACT_INCOMPLETE;
}
+static inline void defer_compaction(struct zone *zone, unsigned long resume)
+{
+}
+
+static inline int compaction_deferred(struct zone *zone)
+{
+ return 1;
+}
+
#endif /* CONFIG_COMPACTION */
#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..31fb38b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -328,6 +328,13 @@ struct zone {
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
+#ifdef CONFIG_COMPACTION
+ /*
+ * If a compaction fails, do not try compaction again until
+ * jiffies is after the value of compact_resume
+ */
+ unsigned long compact_resume;
+#endif
ZONE_PADDING(_pad1_)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1910b8b..7021c68 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1730,7 +1730,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
cond_resched();
/* Try memory compaction for high-order allocations before reclaim */
- if (order) {
+ if (order && !compaction_deferred(preferred_zone)) {
*did_some_progress = try_to_compact_pages(zonelist,
order, gfp_mask, nodemask);
if (*did_some_progress != COMPACT_INCOMPLETE) {
@@ -1750,6 +1750,9 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
*/
count_vm_event(COMPACTFAIL);
+ /* On failure, avoid compaction for a short time. */
+ defer_compaction(preferred_zone, jiffies + HZ/50);
+
cond_resched();
}
}
--
1.6.5
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply related [flat|nested] 51+ messages in thread