linux-mm.kvack.org archive mirror
* [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2
@ 2012-08-08 19:08 Mel Gorman
  2012-08-08 19:08 ` [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

Changelog since V1
o Dropped kswapd related patch, basically a no-op and regresses if fixed (minchan)
o Expanded changelogs a little

Allocation success rates have been far lower since 3.4 due to commit
[fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. This
commit was introduced for good reasons and it was known in advance that
the success rates would suffer but it was justified on the grounds that
the high allocation success rates were achieved by aggressive reclaim.
Success rates are expected to suffer even more in 3.6 due to commit
[7db8889a: mm: have order > 0 compaction start off where it left] which
testing has shown to severely reduce allocation success rates under load -
to 0% in one case.  There is a proposed change to that patch in this series
and it would be ideal if Jim Schutt could retest the workload that led to
commit [7db8889a: mm: have order > 0 compaction start off where it left].

This series aims to improve the allocation success rates without regressing
the benefits of commit fe2c2a10. The series is based on 3.5 and includes
the commit 7db8889a to illustrate what impact it has to success rates.

Patch 1 updates a stale comment, seeing as I was in the general area.

Patch 2 updates reclaim/compaction to scale the number of pages reclaimed
	by the number of recent compaction failures.

Patch 3 captures suitable high-order pages freed by compaction to reduce
	races with parallel allocation requests.

Patch 4 is an upstream commit that has compaction restart free page scanning
	from its previous position instead of always starting from the end of
	the zone.

Patch 5 adjusts patch 4 to restore allocation success rates.

STRESS-HIGHALLOC
		 3.5.0-vanilla	  patches:1-2	    patches:1-3       patches:1-5
Pass 1          36.00 ( 0.00%)    61.00 (25.00%)    62.00 (26.00%)    56.00 (20.00%)
Pass 2          46.00 ( 0.00%)    61.00 (15.00%)    60.00 (14.00%)    56.00 (10.00%)
while Rested    84.00 ( 0.00%)    85.00 ( 1.00%)    86.00 ( 2.00%)    85.00 ( 1.00%)

From
http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html
I know that the allocation success rate in 3.3.6 was 78% in comparison
to 36% in 3.5. With the full series applied, the success rates are up to
around 60% with some variability in the results. This is not as high
a success rate, but it does not reclaim excessively, which is a key point.

Previous tests on V1 of this series showed that patch 4 on its own adversely
affected high-order allocation success rates.

MMTests Statistics: vmstat
(columns: 3.5.0-vanilla, patches:1-2, patches:1-3, patches:1-5)
Page Ins                                     3037580     3167260     3121588     2939576
Page Outs                                    8026888     8028472     8026444     8033852
Swap Ins                                           0           0           0           0
Swap Outs                                          0           0           0           0

Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
there were 71881 pages swapped out.

Direct pages scanned                           97106       59600      118792       84142
Kswapd pages scanned                         1231288     1419472     1406569     1406642
Kswapd pages reclaimed                       1231221     1419248     1390694     1357820
Direct pages reclaimed                         97100       59486      107873       82067
Kswapd efficiency                                99%         99%         98%         96%
Kswapd velocity                             1001.153    1129.622    1082.592    1077.474
Direct efficiency                                99%         99%         90%         97%
Direct velocity                               78.956      47.430      91.431      64.452

kswapd velocity stays at around 1000 pages/second, which is reasonable. In
kernel 3.3.6, it was 8140 pages/second.

 include/linux/compaction.h |    4 +-
 include/linux/mm.h         |    1 +
 include/linux/mmzone.h     |    4 ++
 mm/compaction.c            |  142 +++++++++++++++++++++++++++++++++++++-------
 mm/internal.h              |    7 +++
 mm/page_alloc.c            |   68 ++++++++++++++++-----
 mm/vmscan.c                |   10 ++++
 7 files changed, 197 insertions(+), 39 deletions(-)

-- 
1.7.9.2


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages
  2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
@ 2012-08-08 19:08 ` Mel Gorman
  2012-08-08 19:08 ` [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

The comment about order applied when the check was
order > PAGE_ALLOC_COSTLY_ORDER, which has not been the case since
commit [c5a73c3d: thp: use compaction for all allocation orders]. Fix
the comment while I'm in the general area.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 mm/compaction.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b39ede1..95ca967 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -759,11 +759,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	struct zone *zone;
 	int rc = COMPACT_SKIPPED;
 
-	/*
-	 * Check whether it is worth even starting compaction. The order check is
-	 * made because an assumption is made that the page allocator can satisfy
-	 * the "cheaper" orders without taking special steps
-	 */
+	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
 		return rc;
 
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures
  2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
  2012-08-08 19:08 ` [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages Mel Gorman
@ 2012-08-08 19:08 ` Mel Gorman
  2012-08-08 19:08 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

If allocation fails after compaction then compaction may be deferred for
a number of allocation attempts. If there are subsequent failures,
compact_defer_shift is increased to defer for longer periods. This patch
uses that information to scale the number of pages reclaimed with
compact_defer_shift until allocations succeed again.
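
For illustration, the scaling is a simple left shift of the reclaim target.
A minimal sketch with hypothetical numbers, not kernel code; it assumes
compact_defer_shift saturates at COMPACT_MAX_DEFER_SHIFT (6 here):

	/*
	 * Sketch: the reclaim target as a function of the deferral state.
	 * For an order-9 (hugepage) request this grows from 1024 pages
	 * with no recent failures to 65536 pages at maximum deferral.
	 */
	static unsigned long scaled_pages_for_compaction(int order,
						int compact_defer_shift)
	{
		return (2UL << order) << compact_defer_shift;
	}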

Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
---
 mm/vmscan.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 66e4310..0cb2593 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1708,6 +1708,7 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 {
 	unsigned long pages_for_compaction;
 	unsigned long inactive_lru_pages;
+	struct zone *zone;
 
 	/* If not in reclaim/compaction mode, stop */
 	if (!in_reclaim_compaction(sc))
@@ -1741,6 +1742,15 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
+
+	/*
+	 * If compaction is deferred for this order then scale the number of
+	 * pages reclaimed based on the number of consecutive allocation
+	 * failures
+	 */
+	zone = lruvec_zone(lruvec);
+	if (sc->order >= zone->compact_order_failed)
+		pages_for_compaction <<= zone->compact_defer_shift;
 	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
  2012-08-08 19:08 ` [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages Mel Gorman
  2012-08-08 19:08 ` [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures Mel Gorman
@ 2012-08-08 19:08 ` Mel Gorman
  2012-08-09  1:33   ` Minchan Kim
  2012-08-08 19:08 ` [PATCH 4/5] mm: have order > 0 compaction start off where it left Mel Gorman
  2012-08-08 19:08 ` [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
  4 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.
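
At the caller level the contract is simple. A condensed sketch of how
__alloc_pages_direct_compact() uses the new parameter, per the
mm/page_alloc.c hunks below:

	struct page *page = NULL;

	/* Compaction may capture a page of the requested order in *page */
	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
					nodemask, sync_migration, &page);

	/*
	 * If a page was captured, prep and use it directly instead of
	 * retrying the free lists and racing with parallel allocators
	 */
	if (page)
		prep_new_page(page, order, gfp_mask);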

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |    4 +--
 include/linux/mm.h         |    1 +
 mm/compaction.c            |   71 +++++++++++++++++++++++++++++++++++++-------
 mm/internal.h              |    1 +
 mm/page_alloc.c            |   63 +++++++++++++++++++++++++++++----------
 5 files changed, 111 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			bool sync);
+			bool sync, struct page **page);
 extern int compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	return COMPACT_CONTINUE;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..0812e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..63af8d2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,41 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static void compact_capture_page(struct compact_control *cc)
+{
+	unsigned long flags;
+	int mtype;
+
+	if (!cc->page || *cc->page)
+		return;
+
+	/* Speculatively examine the free lists without zone lock */
+	for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+		int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct page *page;
+			struct free_area *area;
+			area = &(cc->zone->free_area[order]);
+			if (list_empty(&area->free_list[mtype]))
+				continue;
+
+			/* Take the lock and attempt capture of the page */
+			spin_lock_irqsave(&cc->zone->lock, flags);
+			if (!list_empty(&area->free_list[mtype])) {
+				page = list_entry(area->free_list[mtype].next,
+							struct page, lru);
+				if (capture_free_page(page, cc->order, mtype)) {
+					spin_unlock_irqrestore(&cc->zone->lock,
+									flags);
+					*cc->page = page;
+					return;
+				}
+			}
+			spin_unlock_irqrestore(&cc->zone->lock, flags);
+		}
+	}
+}
+
 /*
  * Isolate free pages onto a private freelist. Caller must hold zone->lock.
  * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
-	unsigned int order;
 	unsigned long watermark;
 
 	if (fatal_signal_pending(current))
@@ -586,14 +620,22 @@ static int compact_finished(struct zone *zone,
 		return COMPACT_CONTINUE;
 
 	/* Direct compactor: Is a suitable page free? */
-	for (order = cc->order; order < MAX_ORDER; order++) {
-		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
-			return COMPACT_PARTIAL;
-
-		/* Job done if allocation would set block type */
-		if (order >= pageblock_order && zone->free_area[order].nr_free)
+	if (cc->page) {
+		/* Was a suitable page captured? */
+		if (*cc->page)
 			return COMPACT_PARTIAL;
+	} else {
+		unsigned int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct free_area *area = &zone->free_area[order];
+			/* Job done if page is free of the right migratetype */
+			if (!list_empty(&area->free_list[cc->migratetype]))
+				return COMPACT_PARTIAL;
+
+			/* Job done if allocation would set block type */
+			if (order >= pageblock_order && area->nr_free)
+				return COMPACT_PARTIAL;
+		}
 	}
 
 	return COMPACT_CONTINUE;
@@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 				goto out;
 			}
 		}
+
+		/* Capture a page now if it is a suitable size */
+		if (cc->migratetype == MIGRATE_MOVABLE)
+			compact_capture_page(cc);
 	}
 
 out:
@@ -720,7 +766,7 @@ out:
 
 static unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
-				 bool sync)
+				 bool sync, struct page **page)
 {
 	struct compact_control cc = {
 		.nr_freepages = 0,
@@ -729,6 +775,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
+		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +797,7 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 								nodemask) {
 		int status;
 
-		status = compact_zone_order(zone, order, gfp_mask, sync);
+		status = compact_zone_order(zone, order, gfp_mask, sync, page);
 		rc = max(status, rc);
 
 		/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
 	struct compact_control cc = {
 		.order = order,
 		.sync = false,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(pgdat, &cc);
@@ -835,6 +883,7 @@ static int compact_node(int nid)
 	struct compact_control cc = {
 		.order = -1,
 		.sync = true,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
+	struct page **page;		/* Page captured of requested size */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * is required at the given order and is being isolated now to prevent
+ * races with parallel allocators
  */
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
 	rmv_page_order(page);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 
-	/* Split into individual pages */
-	set_page_refcounted(page);
-	split_page(page, order);
+	if (alloc_order != order)
+		expand(zone, page, alloc_order, order,
+			&zone->free_area[order], migratetype);
 
+	/* Set the pageblock if the captured page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
 		}
 	}
 
-	return 1 << order;
+	return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	int nr_pages;
+
+	BUG_ON(!PageBuddy(page));
+	order = page_order(page);
+
+	nr_pages = capture_free_page(page, order, 0);
+	if (!nr_pages)
+		return 0;
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+	return nr_pages;
 }
 
 /*
@@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	if (!order)
 		return NULL;
@@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
-						nodemask, sync_migration);
+					nodemask, sync_migration, &page);
 	current->flags &= ~PF_MEMALLOC;
-	if (*did_some_progress != COMPACT_SKIPPED) {
 
+	/* If compaction captured a page, prep and use it */
+	if (page) {
+		prep_new_page(page, order, gfp_mask);
+		goto got_page;
+	}
+
+	if (*did_some_progress != COMPACT_SKIPPED) {
 		/* Page migration frees to the PCP lists but we want merging */
 		drain_pages(get_cpu());
 		put_cpu();
@@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags, preferred_zone,
 				migratetype);
 		if (page) {
+got_page:
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 4/5] mm: have order > 0 compaction start off where it left
  2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
                   ` (2 preceding siblings ...)
  2012-08-08 19:08 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
@ 2012-08-08 19:08 ` Mel Gorman
  2012-08-08 19:08 ` [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
  4 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

From: Rik van Riel <riel@redhat.com>

This commit is already upstream as [7db8889a: mm: have order > 0 compaction
start off where it left]. It's included in this series to provide context
to the next patch as the series is based on 3.5.

Order > 0 compaction stops when enough free pages of the correct page
order have been coalesced.  When doing subsequent higher order
allocations, it is possible for compaction to be invoked many times.

However, the compaction code always starts out looking for things to
compact at the start of the zone, and for free pages to compact things to
at the end of the zone.

This can cause quadratic behaviour, with isolate_freepages starting at the
end of the zone each time, even though previous invocations of the
compaction code already filled up all free memory on that end of the zone.

This can cause isolate_freepages to take enormous amounts of CPU with
certain workloads on larger memory systems.

The obvious solution is to have isolate_freepages remember where it left
off last time, and continue at that point the next time it gets invoked
for an order > 0 compaction.  This could cause compaction to fail if
cc->free_pfn and cc->migrate_pfn are close together initially; in that
case we restart from the end of the zone and try once more.

Forced full (order == -1) compactions are left alone.
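
For reference, the resume point is always rounded down to a pageblock
boundary. A minimal userspace sketch of the start_free_pfn() arithmetic
introduced below, using hypothetical zone values:

	#include <stdio.h>

	int main(void)
	{
		/* Hypothetical zone: starts at pfn 0x1000, spans 0x7feab
		 * pages, with 512 pages per pageblock (2MB pageblocks
		 * and 4K pages) */
		unsigned long zone_start_pfn = 0x1000;
		unsigned long spanned_pages = 0x7feab;
		unsigned long pageblock_nr_pages = 512;
		unsigned long free_pfn;

		free_pfn = zone_start_pfn + spanned_pages;
		free_pfn &= ~(pageblock_nr_pages - 1);	/* round down */
		printf("free scanner starts at pfn 0x%lx\n", free_pfn);
		return 0;
	}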

[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: s/laste/last/, use 80 cols]
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Jim Schutt <jaschut@sandia.gov>
Tested-by: Jim Schutt <jaschut@sandia.gov>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/mmzone.h |    4 +++
 mm/compaction.c        |   63 ++++++++++++++++++++++++++++++++++++++++++++----
 mm/internal.h          |    6 +++++
 mm/page_alloc.c        |    5 ++++
 4 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 68c569f..6340f38 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -369,6 +369,10 @@ struct zone {
 	 */
 	spinlock_t		lock;
 	int                     all_unreclaimable; /* All pages pinned */
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+	/* pfn where the last incremental compaction isolated free pages */
+	unsigned long		compact_cached_free_pfn;
+#endif
 #ifdef CONFIG_MEMORY_HOTPLUG
 	/* see spanned/present_pages for more description */
 	seqlock_t		span_seqlock;
diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..be310f1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -457,6 +457,17 @@ static void isolate_freepages(struct zone *zone,
 					pfn -= pageblock_nr_pages) {
 		unsigned long isolated;
 
+		/*
+		 * Skip ahead if another thread is compacting in the area
+		 * simultaneously. If we wrapped around, we can only skip
+		 * ahead if zone->compact_cached_free_pfn also wrapped to
+		 * above our starting point.
+		 */
+		if (cc->order > 0 && (!cc->wrapped ||
+				      zone->compact_cached_free_pfn >
+				      cc->start_free_pfn))
+			pfn = min(pfn, zone->compact_cached_free_pfn);
+
 		if (!pfn_valid(pfn))
 			continue;
 
@@ -497,8 +508,11 @@ static void isolate_freepages(struct zone *zone,
 		 * looking for free pages, the search will restart here as
 		 * page migration may have returned some pages to the allocator
 		 */
-		if (isolated)
+		if (isolated) {
 			high_pfn = max(high_pfn, pfn);
+			if (cc->order > 0)
+				zone->compact_cached_free_pfn = high_pfn;
+		}
 	}
 
 	/* split_free_page does not map the pages */
@@ -593,6 +607,20 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
+/*
+ * Returns the start pfn of the last page block in a zone.  This is the starting
+ * point for full compaction of a zone.  Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+	unsigned long free_pfn;
+	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	free_pfn &= ~(pageblock_nr_pages-1);
+	return free_pfn;
+}
+
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
@@ -601,8 +629,26 @@ static int compact_finished(struct zone *zone,
 	if (fatal_signal_pending(current))
 		return COMPACT_PARTIAL;
 
-	/* Compaction run completes if the migrate and free scanner meet */
-	if (cc->free_pfn <= cc->migrate_pfn)
+	/*
+	 * A full (order == -1) compaction run starts at the beginning and
+	 * end of a zone; it completes when the migrate and free scanner meet.
+	 * A partial (order > 0) compaction can start with the free scanner
+	 * at a random point in the zone, and may have to restart.
+	 */
+	if (cc->free_pfn <= cc->migrate_pfn) {
+		if (cc->order > 0 && !cc->wrapped) {
+			/* We started partway through; restart at the end. */
+			unsigned long free_pfn = start_free_pfn(zone);
+			zone->compact_cached_free_pfn = free_pfn;
+			cc->free_pfn = free_pfn;
+			cc->wrapped = 1;
+			return COMPACT_CONTINUE;
+		}
+		return COMPACT_COMPLETE;
+	}
+
+	/* We wrapped around and ended up where we started. */
+	if (cc->wrapped && cc->free_pfn <= cc->start_free_pfn)
 		return COMPACT_COMPLETE;
 
 	/*
@@ -708,8 +754,15 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	/* Setup to move all movable pages to the end of the zone */
 	cc->migrate_pfn = zone->zone_start_pfn;
-	cc->free_pfn = cc->migrate_pfn + zone->spanned_pages;
-	cc->free_pfn &= ~(pageblock_nr_pages-1);
+
+	if (cc->order > 0) {
+		/* Incremental compaction. Start where the last one stopped. */
+		cc->free_pfn = zone->compact_cached_free_pfn;
+		cc->start_free_pfn = cc->free_pfn;
+	} else {
+		/* Order == -1 starts at the end of the zone. */
+		cc->free_pfn = start_free_pfn(zone);
+	}
 
 	migrate_prep_local();
 
diff --git a/mm/internal.h b/mm/internal.h
index 9156714..064f6ef 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -118,8 +118,14 @@ struct compact_control {
 	unsigned long nr_freepages;	/* Number of isolated free pages */
 	unsigned long nr_migratepages;	/* Number of pages to migrate */
 	unsigned long free_pfn;		/* isolate_freepages search base */
+	unsigned long start_free_pfn;	/* where we started the search */
 	unsigned long migrate_pfn;	/* isolate_migratepages search base */
 	bool sync;			/* Synchronous migration */
+	bool wrapped;			/* Order > 0 compactions are
+					   incremental, once free_pfn
+					   and migrate_pfn meet, we restart
+					   from the top of the zone;
+					   remember we wrapped around. */
 
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index adc3aa8..781d6e4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4425,6 +4425,11 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 
 		zone->spanned_pages = size;
 		zone->present_pages = realsize;
+#if defined CONFIG_COMPACTION || defined CONFIG_CMA
+		zone->compact_cached_free_pfn = zone->zone_start_pfn +
+						zone->spanned_pages;
+		zone->compact_cached_free_pfn &= ~(pageblock_nr_pages-1);
+#endif
 #ifdef CONFIG_NUMA
 		zone->node = nid;
 		zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages
  2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
                   ` (3 preceding siblings ...)
  2012-08-08 19:08 ` [PATCH 4/5] mm: have order > 0 compaction start off where it left Mel Gorman
@ 2012-08-08 19:08 ` Mel Gorman
  2012-08-09  0:12   ` Minchan Kim
  2012-08-09  8:47   ` Minchan Kim
  4 siblings, 2 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-08 19:08 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

commit [7db8889a: mm: have order > 0 compaction start off where it left]
introduced a caching mechanism to reduce the amount of work the free page
scanner does in compaction. However, it has a problem. Consider two
processes simultaneously scanning free pages:

				    			C
Process A		M     S     			F
		|---------------------------------------|
Process B		M 	FS

C is zone->compact_cached_free_pfn
S is cc->start_free_pfn
M is cc->migrate_pfn
F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn accordingly.

Simultaneously, Process B finishes isolating in a block and updates
compact_cached_free_pfn again to the location of its free scanner.

Process A moves to "end_of_zone - one_pageblock" and runs this check

                if (cc->order > 0 && (!cc->wrapped ||
                                      zone->compact_cached_free_pfn >
                                      cc->start_free_pfn))
                        pfn = min(pfn, zone->compact_cached_free_pfn);

compact_cached_free_pfn is above where it started so the free scanner skips
almost the entire space it should have scanned. When there are multiple
processes compacting, it can end in a situation where the entire zone is
not being scanned at all.  Further, it is possible for two processes to
ping-pong updates to compact_cached_free_pfn, leaving it effectively random.

Overall, the end result wrecks allocation success rates.

There is not an obvious way around this problem without introducing new
locking and state so this patch takes a different approach.

First, it gets rid of the skip logic because it is not clear that it matters
if two free scanners happen to be in the same block; with racing updates it
is too easy to skip over blocks that should be scanned.

Second, it updates compact_cached_free_pfn in a more limited set of
circumstances.

If a scanner has wrapped, it updates compact_cached_free_pfn to the end
	of the zone. When a wrapped scanner isolates a page, it updates
	compact_cached_free_pfn to point to the highest pageblock it
	can isolate pages from.

If a scanner has not wrapped when it has finished isolating pages, it
	checks if compact_cached_free_pfn is pointing to the end of the
	zone. If so, the value is updated to point to the highest
	pageblock that pages were isolated from. This value will not
	be updated again until a free page scanner wraps and resets
	compact_cached_free_pfn.

This is not optimal and it can still race, but compact_cached_free_pfn
will point to, or very near, a pageblock with free pages.
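
Condensed, the update policy implemented by the diff below looks like this
(a sketch of the two sites in isolate_freepages(), not a separate
implementation):

	if (isolated) {
		high_pfn = max(high_pfn, pfn);
		/* A wrapped scanner tracks the highest block isolated from */
		if (cc->order > 0 && cc->wrapped)
			zone->compact_cached_free_pfn = high_pfn;
	}

	...

	/* An unwrapped scanner sets it only while it holds the reset value */
	if (cc->order > 0 && !cc->wrapped &&
	    zone->compact_cached_free_pfn == start_free_pfn(zone))
		zone->compact_cached_free_pfn = high_pfn;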

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c |   54 ++++++++++++++++++++++++++++--------------------------
 1 file changed, 28 insertions(+), 26 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index be310f1..df50f73 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -419,6 +419,20 @@ static bool suitable_migration_target(struct page *page)
 }
 
 /*
+ * Returns the start pfn of the last page block in a zone.  This is the starting
+ * point for full compaction of a zone.  Compaction searches for free pages from
+ * the end of each zone, while isolate_freepages_block scans forward inside each
+ * page block.
+ */
+static unsigned long start_free_pfn(struct zone *zone)
+{
+	unsigned long free_pfn;
+	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
+	free_pfn &= ~(pageblock_nr_pages-1);
+	return free_pfn;
+}
+
+/*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
@@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
 					pfn -= pageblock_nr_pages) {
 		unsigned long isolated;
 
-		/*
-		 * Skip ahead if another thread is compacting in the area
-		 * simultaneously. If we wrapped around, we can only skip
-		 * ahead if zone->compact_cached_free_pfn also wrapped to
-		 * above our starting point.
-		 */
-		if (cc->order > 0 && (!cc->wrapped ||
-				      zone->compact_cached_free_pfn >
-				      cc->start_free_pfn))
-			pfn = min(pfn, zone->compact_cached_free_pfn);
-
 		if (!pfn_valid(pfn))
 			continue;
 
@@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
 		 */
 		if (isolated) {
 			high_pfn = max(high_pfn, pfn);
-			if (cc->order > 0)
+
+			/*
+			 * If the free scanner has wrapped, update
+			 * compact_cached_free_pfn to point to the highest
+			 * pageblock with free pages. This reduces excessive
+			 * scanning of full pageblocks near the end of the
+			 * zone
+			 */
+			if (cc->order > 0 && cc->wrapped)
 				zone->compact_cached_free_pfn = high_pfn;
 		}
 	}
@@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,
 
 	cc->free_pfn = high_pfn;
 	cc->nr_freepages = nr_freepages;
+
+	/* If compact_cached_free_pfn is reset then set it now */
+	if (cc->order > 0 && !cc->wrapped &&
+			zone->compact_cached_free_pfn == start_free_pfn(zone))
+		zone->compact_cached_free_pfn = high_pfn;
 }
 
 /*
@@ -607,20 +623,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return ISOLATE_SUCCESS;
 }
 
-/*
- * Returns the start pfn of the last page block in a zone.  This is the starting
- * point for full compaction of a zone.  Compaction searches for free pages from
- * the end of each zone, while isolate_freepages_block scans forward inside each
- * page block.
- */
-static unsigned long start_free_pfn(struct zone *zone)
-{
-	unsigned long free_pfn;
-	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
-	free_pfn &= ~(pageblock_nr_pages-1);
-	return free_pfn;
-}
-
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
-- 
1.7.9.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages
  2012-08-08 19:08 ` [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
@ 2012-08-09  0:12   ` Minchan Kim
  2012-08-09  8:23     ` Mel Gorman
  2012-08-09  8:47   ` Minchan Kim
  1 sibling, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2012-08-09  0:12 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

Hi Mel,

On Wed, Aug 08, 2012 at 08:08:44PM +0100, Mel Gorman wrote:
> commit [7db8889a: mm: have order > 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount of work the free page
> scanner does in compaction. However, it has a problem. Consider two
> processes simultaneously scanning free pages:
> 
> 				    			C
> Process A		M     S     			F
> 		|---------------------------------------|
> Process B		M 	FS
> 
> C is zone->compact_cached_free_pfn
> S is cc->start_free_pfn
> M is cc->migrate_pfn
> F is cc->free_pfn
> 
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn accordingly.
> 
> Simultaneously, Process B finishes isolating in a block and updates
> compact_cached_free_pfn again to the location of its free scanner.
> 
> Process A moves to "end_of_zone - one_pageblock" and runs this check
> 
>                 if (cc->order > 0 && (!cc->wrapped ||
>                                       zone->compact_cached_free_pfn >
>                                       cc->start_free_pfn))
>                         pfn = min(pfn, zone->compact_cached_free_pfn);
> 
> compact_cached_free_pfn is above where it started so the free scanner skips
> almost the entire space it should have scanned. When there are multiple
> processes compacting, it can end in a situation where the entire zone is
> not being scanned at all.  Further, it is possible for two processes to
> ping-pong updates to compact_cached_free_pfn, leaving it effectively random.
> 
> Overall, the end result wrecks allocation success rates.
> 
> There is not an obvious way around this problem without introducing new
> locking and state so this patch takes a different approach.
> 
> First, it gets rid of the skip logic because it is not clear that it matters
> if two free scanners happen to be in the same block; with racing updates it
> is too easy to skip over blocks that should be scanned.
> 
> Second, it updates compact_cached_free_pfn in a more limited set of
> circumstances.
> 
> If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> 	of the zone. When a wrapped scanner isolates a page, it updates
> 	compact_cached_free_pfn to point to the highest pageblock it
> 	can isolate pages from.

Okay until here.

> 
> If a scanner has not wrapped when it has finished isolating pages, it
> 	checks if compact_cached_free_pfn is pointing to the end of the
> 	zone. If so, the value is updated to point to the highest
> 	pageblock that pages were isolated from. This value will not
> 	be updated again until a free page scanner wraps and resets
> 	compact_cached_free_pfn.

I tried to understand your intention in this part but unfortunately failed.
With this part, couldn't the problem you mentioned happen again?

 				    			C
 Process A		M     S     			F
 		|---------------------------------------|
 Process B		M 	FS
 
 C is zone->compact_cached_free_pfn
 S is cc->start_free_pfn
 M is cc->migrate_pfn
 F is cc->free_pfn

In this diagram, Process A has just reached its migrate scanner, wrapped
around and updated compact_cached_free_pfn to the end of the zone accordingly.

Simultaneously, Process B finishes isolating in a block, peeks at the
compact_cached_free_pfn position, sees that it is at the end of the zone,
and so updates compact_cached_free_pfn to the highest pageblock that pages
were isolated from.

Process A updates compact_cached_free_pfn to the highest pageblock which
was set by Process B because Process A has wrapped. It ends up as a big
jump without any scanning in Process A.

No?

> 
> This is not optimal and it can still race, but compact_cached_free_pfn
> will point to, or very near, a pageblock with free pages.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  mm/compaction.c |   54 ++++++++++++++++++++++++++++--------------------------
>  1 file changed, 28 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index be310f1..df50f73 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -419,6 +419,20 @@ static bool suitable_migration_target(struct page *page)
>  }
>  
>  /*
> + * Returns the start pfn of the last page block in a zone.  This is the starting
> + * point for full compaction of a zone.  Compaction searches for free pages from
> + * the end of each zone, while isolate_freepages_block scans forward inside each
> + * page block.
> + */
> +static unsigned long start_free_pfn(struct zone *zone)
> +{
> +	unsigned long free_pfn;
> +	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> +	free_pfn &= ~(pageblock_nr_pages-1);
> +	return free_pfn;
> +}
> +
> +/*
>   * Based on information in the current compact_control, find blocks
>   * suitable for isolating free pages from and then isolate them.
>   */
> @@ -457,17 +471,6 @@ static void isolate_freepages(struct zone *zone,
>  					pfn -= pageblock_nr_pages) {
>  		unsigned long isolated;
>  
> -		/*
> -		 * Skip ahead if another thread is compacting in the area
> -		 * simultaneously. If we wrapped around, we can only skip
> -		 * ahead if zone->compact_cached_free_pfn also wrapped to
> -		 * above our starting point.
> -		 */
> -		if (cc->order > 0 && (!cc->wrapped ||
> -				      zone->compact_cached_free_pfn >
> -				      cc->start_free_pfn))
> -			pfn = min(pfn, zone->compact_cached_free_pfn);
> -
>  		if (!pfn_valid(pfn))
>  			continue;
>  
> @@ -510,7 +513,15 @@ static void isolate_freepages(struct zone *zone,
>  		 */
>  		if (isolated) {
>  			high_pfn = max(high_pfn, pfn);
> -			if (cc->order > 0)
> +
> +			/*
> +			 * If the free scanner has wrapped, update
> +			 * compact_cached_free_pfn to point to the highest
> +			 * pageblock with free pages. This reduces excessive
> +			 * scanning of full pageblocks near the end of the
> +			 * zone
> +			 */
> +			if (cc->order > 0 && cc->wrapped)
>  				zone->compact_cached_free_pfn = high_pfn;
>  		}
>  	}
> @@ -520,6 +531,11 @@ static void isolate_freepages(struct zone *zone,
>  
>  	cc->free_pfn = high_pfn;
>  	cc->nr_freepages = nr_freepages;
> +
> +	/* If compact_cached_free_pfn is reset then set it now */
> +	if (cc->order > 0 && !cc->wrapped &&
> +			zone->compact_cached_free_pfn == start_free_pfn(zone))
> +		zone->compact_cached_free_pfn = high_pfn;
>  }
>  
>  /*
> @@ -607,20 +623,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	return ISOLATE_SUCCESS;
>  }
>  
> -/*
> - * Returns the start pfn of the last page block in a zone.  This is the starting
> - * point for full compaction of a zone.  Compaction searches for free pages from
> - * the end of each zone, while isolate_freepages_block scans forward inside each
> - * page block.
> - */
> -static unsigned long start_free_pfn(struct zone *zone)
> -{
> -	unsigned long free_pfn;
> -	free_pfn = zone->zone_start_pfn + zone->spanned_pages;
> -	free_pfn &= ~(pageblock_nr_pages-1);
> -	return free_pfn;
> -}
> -
>  static int compact_finished(struct zone *zone,
>  			    struct compact_control *cc)
>  {
> -- 
> 1.7.9.2
> 

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-08 19:08 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
@ 2012-08-09  1:33   ` Minchan Kim
  2012-08-09  8:11     ` Mel Gorman
  0 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2012-08-09  1:33 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

Hi Mel,

Just one question below.

On Wed, Aug 08, 2012 at 08:08:42PM +0100, Mel Gorman wrote:
> While compaction is migrating pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
> ---
>  include/linux/compaction.h |    4 +--
>  include/linux/mm.h         |    1 +
>  mm/compaction.c            |   71 +++++++++++++++++++++++++++++++++++++-------
>  mm/internal.h              |    1 +
>  mm/page_alloc.c            |   63 +++++++++++++++++++++++++++++----------
>  5 files changed, 111 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 51a90b7..5673459 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
>  extern int fragmentation_index(struct zone *zone, unsigned int order);
>  extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *mask,
> -			bool sync);
> +			bool sync, struct page **page);
>  extern int compact_pgdat(pg_data_t *pgdat, int order);
>  extern unsigned long compaction_suitable(struct zone *zone, int order);
>  
> @@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
>  #else
>  static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *nodemask,
> -			bool sync)
> +			bool sync, struct page **page)
>  {
>  	return COMPACT_CONTINUE;
>  }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b36d08c..0812e86 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);
>  
>  void split_page(struct page *page, unsigned int order);
>  int split_free_page(struct page *page);
> +int capture_free_page(struct page *page, int alloc_order, int migratetype);
>  
>  /*
>   * Compound pages have a destructor function.  Provide a
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 95ca967..63af8d2 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -50,6 +50,41 @@ static inline bool migrate_async_suitable(int migratetype)
>  	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
>  }
>  
> +static void compact_capture_page(struct compact_control *cc)
> +{
> +	unsigned long flags;
> +	int mtype;
> +
> +	if (!cc->page || *cc->page)
> +		return;
> +
> +	/* Speculatively examine the free lists without zone lock */
> +	for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
> +		int order;
> +		for (order = cc->order; order < MAX_ORDER; order++) {
> +			struct page *page;
> +			struct free_area *area;
> +			area = &(cc->zone->free_area[order]);
> +			if (list_empty(&area->free_list[mtype]))
> +				continue;
> +
> +			/* Take the lock and attempt capture of the page */
> +			spin_lock_irqsave(&cc->zone->lock, flags);
> +			if (!list_empty(&area->free_list[mtype])) {
> +				page = list_entry(area->free_list[mtype].next,
> +							struct page, lru);
> +				if (capture_free_page(page, cc->order, mtype)) {
> +					spin_unlock_irqrestore(&cc->zone->lock,
> +									flags);
> +					*cc->page = page;
> +					return;
> +				}
> +			}
> +			spin_unlock_irqrestore(&cc->zone->lock, flags);
> +		}
> +	}
> +}
> +
>  /*
>   * Isolate free pages onto a private freelist. Caller must hold zone->lock.
>   * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
> @@ -561,7 +596,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  static int compact_finished(struct zone *zone,
>  			    struct compact_control *cc)
>  {
> -	unsigned int order;
>  	unsigned long watermark;
>  
>  	if (fatal_signal_pending(current))
> @@ -586,14 +620,22 @@ static int compact_finished(struct zone *zone,
>  		return COMPACT_CONTINUE;
>  
>  	/* Direct compactor: Is a suitable page free? */
> -	for (order = cc->order; order < MAX_ORDER; order++) {
> -		/* Job done if page is free of the right migratetype */
> -		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
> -			return COMPACT_PARTIAL;
> -
> -		/* Job done if allocation would set block type */
> -		if (order >= pageblock_order && zone->free_area[order].nr_free)
> +	if (cc->page) {
> +		/* Was a suitable page captured? */
> +		if (*cc->page)
>  			return COMPACT_PARTIAL;
> +	} else {
> +		unsigned int order;
> +		for (order = cc->order; order < MAX_ORDER; order++) {
> +			struct free_area *area = &zone->free_area[order];
> +			/* Job done if page is free of the right migratetype */
> +			if (!list_empty(&area->free_list[cc->migratetype]))
> +				return COMPACT_PARTIAL;
> +
> +			/* Job done if allocation would set block type */
> +			if (order >= pageblock_order && area->nr_free)
> +				return COMPACT_PARTIAL;
> +		}
>  	}
>  
>  	return COMPACT_CONTINUE;
> @@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>  				goto out;
>  			}
>  		}
> +
> +		/* Capture a page now if it is a suitable size */

Why do we capture only when we migrate the MIGRATE_MOVABLE type?
If you have a reason, it should have been added as a comment.

> +		if (cc->migratetype == MIGRATE_MOVABLE)
> +			compact_capture_page(cc);
>  	}
>  
>  out:
> @@ -720,7 +766,7 @@ out:
>  
>  static unsigned long compact_zone_order(struct zone *zone,
>  				 int order, gfp_t gfp_mask,
> -				 bool sync)
> +				 bool sync, struct page **page)
>  {
>  	struct compact_control cc = {
>  		.nr_freepages = 0,
> @@ -729,6 +775,7 @@ static unsigned long compact_zone_order(struct zone *zone,
>  		.migratetype = allocflags_to_migratetype(gfp_mask),
>  		.zone = zone,
>  		.sync = sync,
> +		.page = page,
>  	};
>  	INIT_LIST_HEAD(&cc.freepages);
>  	INIT_LIST_HEAD(&cc.migratepages);
> @@ -750,7 +797,7 @@ int sysctl_extfrag_threshold = 500;
>   */
>  unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  			int order, gfp_t gfp_mask, nodemask_t *nodemask,
> -			bool sync)
> +			bool sync, struct page **page)
>  {
>  	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
>  	int may_enter_fs = gfp_mask & __GFP_FS;
> @@ -770,7 +817,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
>  								nodemask) {
>  		int status;
>  
> -		status = compact_zone_order(zone, order, gfp_mask, sync);
> +		status = compact_zone_order(zone, order, gfp_mask, sync, page);
>  		rc = max(status, rc);
>  
>  		/* If a normal allocation would succeed, stop compacting */
> @@ -825,6 +872,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
>  	struct compact_control cc = {
>  		.order = order,
>  		.sync = false,
> +		.page = NULL,
>  	};
>  
>  	return __compact_pgdat(pgdat, &cc);
> @@ -835,6 +883,7 @@ static int compact_node(int nid)
>  	struct compact_control cc = {
>  		.order = -1,
>  		.sync = true,
> +		.page = NULL,
>  	};
>  
>  	return __compact_pgdat(NODE_DATA(nid), &cc);
> diff --git a/mm/internal.h b/mm/internal.h
> index 2ba87fb..9156714 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -124,6 +124,7 @@ struct compact_control {
>  	int order;			/* order a direct compactor needs */
>  	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
>  	struct zone *zone;
> +	struct page **page;		/* Page captured of requested size */
>  };
>  
>  unsigned long
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4a4f921..adc3aa8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
>  }
>  
>  /*
> - * Similar to split_page except the page is already free. As this is only
> - * being used for migration, the migratetype of the block also changes.
> - * As this is called with interrupts disabled, the caller is responsible
> - * for calling arch_alloc_page() and kernel_map_page() after interrupts
> - * are enabled.
> - *
> - * Note: this is probably too low level an operation for use in drivers.
> - * Please consult with lkml before using this in your driver.
> + * Similar to the split_page family of functions except that the page
> + * is required at the given order and is being isolated now to prevent
> + * races with parallel allocators
>   */
> -int split_free_page(struct page *page)
> +int capture_free_page(struct page *page, int alloc_order, int migratetype)
>  {
>  	unsigned int order;
>  	unsigned long watermark;
> @@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
>  	rmv_page_order(page);
>  	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
>  
> -	/* Split into individual pages */
> -	set_page_refcounted(page);
> -	split_page(page, order);
> +	if (alloc_order != order)
> +		expand(zone, page, alloc_order, order,
> +			&zone->free_area[order], migratetype);
>  
> +	/* Set the pageblock if the captured page is at least a pageblock */
>  	if (order >= pageblock_order - 1) {
>  		struct page *endpage = page + (1 << order) - 1;
>  		for (; page < endpage; page += pageblock_nr_pages) {
> @@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
>  		}
>  	}
>  
> -	return 1 << order;
> +	return 1UL << order;
> +}
> +
> +/*
> + * Similar to split_page except the page is already free. As this is only
> + * being used for migration, the migratetype of the block also changes.
> + * As this is called with interrupts disabled, the caller is responsible
> + * for calling arch_alloc_page() and kernel_map_page() after interrupts
> + * are enabled.
> + *
> + * Note: this is probably too low level an operation for use in drivers.
> + * Please consult with lkml before using this in your driver.
> + */
> +int split_free_page(struct page *page)
> +{
> +	unsigned int order;
> +	int nr_pages;
> +
> +	BUG_ON(!PageBuddy(page));
> +	order = page_order(page);
> +
> +	nr_pages = capture_free_page(page, order, 0);
> +	if (!nr_pages)
> +		return 0;
> +
> +	/* Split into individual pages */
> +	set_page_refcounted(page);
> +	split_page(page, order);
> +	return nr_pages;
>  }
>  
>  /*
> @@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	bool *deferred_compaction,
>  	unsigned long *did_some_progress)
>  {
> -	struct page *page;
> +	struct page *page = NULL;
>  
>  	if (!order)
>  		return NULL;
> @@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  
>  	current->flags |= PF_MEMALLOC;
>  	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
> -						nodemask, sync_migration);
> +					nodemask, sync_migration, &page);
>  	current->flags &= ~PF_MEMALLOC;
> -	if (*did_some_progress != COMPACT_SKIPPED) {
>  
> +	/* If compaction captured a page, prep and use it */
> +	if (page) {
> +		prep_new_page(page, order, gfp_mask);
> +		goto got_page;
> +	}
> +
> +	if (*did_some_progress != COMPACT_SKIPPED) {
>  		/* Page migration frees to the PCP lists but we want merging */
>  		drain_pages(get_cpu());
>  		put_cpu();
> @@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  				alloc_flags, preferred_zone,
>  				migratetype);
>  		if (page) {
> +got_page:
>  			preferred_zone->compact_considered = 0;
>  			preferred_zone->compact_defer_shift = 0;
>  			if (order >= preferred_zone->compact_order_failed)
> -- 
> 1.7.9.2
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-09  1:33   ` Minchan Kim
@ 2012-08-09  8:11     ` Mel Gorman
  2012-08-09  8:41       ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2012-08-09  8:11 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Thu, Aug 09, 2012 at 10:33:58AM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> Just one question below.
> 

Sure! Your questions usually get me thinking about the right part of the
series, this series in particular :)

> > <SNIP>
> > @@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >  				goto out;
> >  			}
> >  		}
> > +
> > +		/* Capture a page now if it is a suitable size */
> 
> Why do we capture only when we migrate the MIGRATE_MOVABLE type?
> If you have a reason, it should have been added as a comment.
> 

Good question, and there is an answer. However, while thinking about this
some more I also spotted a problem where !MIGRATE_MOVABLE allocations are
forced to do a full compaction. The simple solution would be to only set
cc->page for MIGRATE_MOVABLE, but there is a better approach that I've
implemented in the patch below. It includes a comment that should answer
your question. Does this make sense to you?

diff --git a/mm/compaction.c b/mm/compaction.c
index 63af8d2..384164e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -53,13 +53,31 @@ static inline bool migrate_async_suitable(int migratetype)
 static void compact_capture_page(struct compact_control *cc)
 {
 	unsigned long flags;
-	int mtype;
+	int mtype, mtype_low, mtype_high;
 
 	if (!cc->page || *cc->page)
 		return;
 
+	/*
+	 * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+	 * regardless of the migratetype of the freelist is is captured from.
+	 * This is fine because the order for a high-order MIGRATE_MOVABLE
+	 * allocation is typically at least a pageblock size and overall
+	 * fragmentation is not impaired. Other allocation types must
+	 * capture pages from their own migratelist because otherwise they
+	 * could pollute other pageblocks like MIGRATE_MOVABLE with
+	 * difficult to move pages and making fragmentation worse overall.
+	 */
+	if (cc->migratetype == MIGRATE_MOVABLE) {
+		mtype_low = 0;
+		mtype_high = MIGRATE_PCPTYPES;
+	} else {
+		mtype_low = cc->migratetype;
+		mtype_high = cc->migratetype + 1;
+	}
+
 	/* Speculatively examine the free lists without zone lock */
-	for (mtype = 0; mtype < MIGRATE_PCPTYPES; mtype++) {
+	for (mtype = mtype_low; mtype < mtype_high; mtype++) {
 		int order;
 		for (order = cc->order; order < MAX_ORDER; order++) {
 			struct page *page;
@@ -752,8 +770,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		}
 
 		/* Capture a page now if it is a suitable size */
-		if (cc->migratetype == MIGRATE_MOVABLE)
-			compact_capture_page(cc);
+		compact_capture_page(cc);
 	}
 
 out:


* Re: [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages
  2012-08-09  0:12   ` Minchan Kim
@ 2012-08-09  8:23     ` Mel Gorman
  2012-08-09  8:46       ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2012-08-09  8:23 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Thu, Aug 09, 2012 at 09:12:12AM +0900, Minchan Kim wrote:
> > <SNIP>
> > 
> > Second, it updates compact_cached_free_pfn in a more limited set of
> > circumstances.
> > 
> > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > 	of the zone. When a wrapped scanner isolates a page, it updates
> > 	compact_cached_free_pfn to point to the highest pageblock it
> > 	can isolate pages from.
> 
> Okay until here.
> 

Great.

> > 
> > If a scanner has not wrapped when it has finished isolating pages it
> > 	checks if compact_cached_free_pfn is pointing to the end of the
> > 	zone. If so, the value is updated to point to the highest
> > 	pageblock that pages were isolated from. This value will not
> > 	be updated again until a free page scanner wraps and resets
> > 	compact_cached_free_pfn.
> 
> I tried to understand your intention for this part but unfortunately failed.
> Could the problem you mentioned happen again because of this part?
> 

Potentially yes, I did say it still races in the changelog.

>  				    			C
>  Process A		M     S     			F
>  		|---------------------------------------|
>  Process B		M 	FS
>  
>  C is zone->compact_cached_free_pfn
>  S is cc->start_free_pfn
>  M is cc->migrate_pfn
>  F is cc->free_pfn
> 
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn to the end of the zone accordingly.
> 

Yes. Now that it has wrapped, it updates compact_cached_free_pfn on
every loop of isolate_freepages here.

                if (isolated) {
                        high_pfn = max(high_pfn, pfn);

                        /*
                         * If the free scanner has wrapped, update
                         * compact_cached_free_pfn to point to the highest
                         * pageblock with free pages. This reduces excessive
                         * scanning of full pageblocks near the end of the
                         * zone
                         */
                        if (cc->order > 0 && cc->wrapped)
                                zone->compact_cached_free_pfn = high_pfn;
                }



> Simultaneously, Process B finishes isolating in a block, peeks at the
> compact_cached_free_pfn position, sees that it is at the end of the zone
> and so updates compact_cached_free_pfn to the highest pageblock that pages
> were isolated from.
> 

Yes, they race at this point. One of two things happens here, and I agree
that this is racy:

1. Process A does another iteration of its loop and sets it back
2. Process A does not do another iteration of the loop, so the cached_pfn
   is further along than it should be. The next compacting process will
   wrap early and reset cached_pfn again but continue to scan the zone.

Either option is relatively harmless because in both cases the zone gets
scanned. In patch 4 it was possible that large portions of the zone were
frequently missed.
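
For completeness, the non-wrapped case is dealt with after the isolation
loop. This is only a sketch of the intent rather than a verbatim quote of
patch 5 (start_free_pfn() is the helper the series uses for the zone-end
starting position):

                /*
                 * Sketch: a scanner that has not wrapped only updates the
                 * cached pfn if it still points at the end of the zone,
                 * i.e. nothing has set it since the last reset
                 */
                if (cc->order > 0 && !cc->wrapped &&
                    zone->compact_cached_free_pfn == start_free_pfn(zone))
                        zone->compact_cached_free_pfn = high_pfn;

So a non-wrapped scanner writes the cached pfn at most once after a reset
and the value is left alone until a scanner wraps again.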

> Process A updates compact_cached_free_pfn to the highest pageblock which
> was set by process B because process A has wrapped. It ends up making a big
> jump without any scanning in process A.
> 

It recovers quickly and is nowhere near as severe as what patch 4
suffers from.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-09  8:11     ` Mel Gorman
@ 2012-08-09  8:41       ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-08-09  8:41 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Thu, Aug 09, 2012 at 09:11:20AM +0100, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 10:33:58AM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > Just one question below.
> > 
> 
> Sure! Your questions usually get me thinking about the right part of the
> series, this series in particular :)
> 
> > > <SNIP>
> > > @@ -708,6 +750,10 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> > >  				goto out;
> > >  			}
> > >  		}
> > > +
> > > +		/* Capture a page now if it is a suitable size */
> > 
> > Why do we capture only when we migrate the MIGRATE_MOVABLE type?
> > If you have a reason, it should have been added as a comment.
> > 
> 
> Good question, and there is an answer. However, while thinking about this
> some more I also spotted a problem where !MIGRATE_MOVABLE allocations are
> forced to do a full compaction. The simple solution would be to only set
> cc->page for MIGRATE_MOVABLE, but there is a better approach that I've
> implemented in the patch below. It includes a comment that should answer
> your question. Does this make sense to you?

It does make sense.
I will add my Reviewed-by to your next spin, which includes the patch below.

Thanks, Mel.

> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 63af8d2..384164e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -53,13 +53,31 @@ static inline bool migrate_async_suitable(int migratetype)
>  static void compact_capture_page(struct compact_control *cc)
>  {
>  	unsigned long flags;
> -	int mtype;
> +	int mtype, mtype_low, mtype_high;
>  
>  	if (!cc->page || *cc->page)
>  		return;
>  
> +	/*
> +	 * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
> +	 * regardless of the migratetype of the freelist is is captured from.
                                                         ^  ^
                                                         typo?
-- 
Kind regards,
Minchan Kim


* Re: [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages
  2012-08-09  8:23     ` Mel Gorman
@ 2012-08-09  8:46       ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-08-09  8:46 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Thu, Aug 09, 2012 at 09:23:28AM +0100, Mel Gorman wrote:
> On Thu, Aug 09, 2012 at 09:12:12AM +0900, Minchan Kim wrote:
> > > <SNIP>
> > > 
> > > Second, it updates compact_cached_free_pfn in a more limited set of
> > > circumstances.
> > > 
> > > If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> > > 	of the zone. When a wrapped scanner isolates a page, it updates
> > > 	compact_cached_free_pfn to point to the highest pageblock it
> > > 	can isolate pages from.
> > 
> > Okay until here.
> > 
> 
> Great.
> 
> > > 
> > > If a scanner has not wrapped when it has finished isolating pages it
> > > 	checks if compact_cached_free_pfn is pointing to the end of the
> > > 	zone. If so, the value is updated to point to the highest
> > > 	pageblock that pages were isolated from. This value will not
> > > 	be updated again until a free page scanner wraps and resets
> > > 	compact_cached_free_pfn.
> > 
> > I tried to understand your intention for this part but unfortunately failed.
> > Could the problem you mentioned happen again because of this part?
> > 
> 
> Potentially yes, I did say it still races in the changelog.
> 
> >  				    			C
> >  Process A		M     S     			F
> >  		|---------------------------------------|
> >  Process B		M 	FS
> >  
> >  C is zone->compact_cached_free_pfn
> >  S is cc->start_free_pfn
> >  M is cc->migrate_pfn
> >  F is cc->free_pfn
> > 
> > In this diagram, Process A has just reached its migrate scanner, wrapped
> > around and updated compact_cached_free_pfn to the end of the zone accordingly.
> > 
> 
> Yes. Now that it has wrapped, it updates compact_cached_free_pfn on
> every loop of isolate_freepages here.
> 
>                 if (isolated) {
>                         high_pfn = max(high_pfn, pfn);
> 
>                         /*
>                          * If the free scanner has wrapped, update
>                          * compact_cached_free_pfn to point to the highest
>                          * pageblock with free pages. This reduces excessive
>                          * scanning of full pageblocks near the end of the
>                          * zone
>                          */
>                         if (cc->order > 0 && cc->wrapped)
>                                 zone->compact_cached_free_pfn = high_pfn;
>                 }
> 
> 
> 
> > Simultaneously, Process B finishes isolating in a block, peeks at the
> > compact_cached_free_pfn position, sees that it is at the end of the zone
> > and so updates compact_cached_free_pfn to the highest pageblock that pages
> > were isolated from.
> > 
> 
> Yes, they race at this point. One of two things happens here, and I agree
> that this is racy:
> 
> 1. Process A does another iteration of its loop and sets it back
> 2. Process A does not do another iteration of the loop, so the cached_pfn
>    is further along than it should be. The next compacting process will
>    wrap early and reset cached_pfn again but continue to scan the zone.
> 
> Either option is relatively harmless because in both cases the zone gets
> scanned. In patch 4 it was possible that large portions of the zone were
> frequently missed.
> 
> > Process A updates compact_cached_free_pfn to the highest pageblock which
> > was set by process B because process A has wrapped. It ends up making a big
> > jump without any scanning in process A.
> > 
> 
> It recovers quickly and is nowhere near as severe as what patch 4
> suffers from.

Agreed.
Thanks, Mel.

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages
  2012-08-08 19:08 ` [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
  2012-08-09  0:12   ` Minchan Kim
@ 2012-08-09  8:47   ` Minchan Kim
  1 sibling, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-08-09  8:47 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Wed, Aug 08, 2012 at 08:08:44PM +0100, Mel Gorman wrote:
> commit [7db8889a: mm: have order > 0 compaction start off where it left]
> introduced a caching mechanism to reduce the amount of work the free page
> scanner does in compaction. However, it has a problem. Consider two processes
> simultaneously scanning free pages
> 
> 				    			C
> Process A		M     S     			F
> 		|---------------------------------------|
> Process B		M 	FS
> 
> C is zone->compact_cached_free_pfn
> S is cc->start_free_pfn
> M is cc->migrate_pfn
> F is cc->free_pfn
> 
> In this diagram, Process A has just reached its migrate scanner, wrapped
> around and updated compact_cached_free_pfn accordingly.
> 
> Simultaneously, Process B finishes isolating in a block and updates
> compact_cached_free_pfn again to the location of its free scanner.
> 
> Process A moves to "end_of_zone - one_pageblock" and runs this check
> 
>                 if (cc->order > 0 && (!cc->wrapped ||
>                                       zone->compact_cached_free_pfn >
>                                       cc->start_free_pfn))
>                         pfn = min(pfn, zone->compact_cached_free_pfn);
> 
> compact_cached_free_pfn is above where it started, so the free scanner skips
> almost the entire space it should have scanned. When there are multiple
> processes compacting, it can end in a situation where the entire zone is
> not being scanned at all.  Further, it is possible for two processes to
> ping-pong updates to compact_cached_free_pfn, which is just random.
> 
> Overall, the end result wrecks allocation success rates.
> 
> There is not an obvious way around this problem without introducing new
> locking and state, so this patch takes a different approach.
> 
> First, it gets rid of the skip logic because it's not clear that it matters
> if two free scanners happen to be in the same block, but with racing updates
> it's too easy for the scanner to skip over blocks it should not.
> 
> Second, it updates compact_cached_free_pfn in a more limited set of
> circumstances.
> 
> If a scanner has wrapped, it updates compact_cached_free_pfn to the end
> 	of the zone. When a wrapped scanner isolates a page, it updates
> 	compact_cached_free_pfn to point to the highest pageblock it
> 	can isolate pages from.
> 
> If a scanner has not wrapped when it has finished isolating pages it
> 	checks if compact_cached_free_pfn is pointing to the end of the
> 	zone. If so, the value is updated to point to the highest
> 	pageblock that pages were isolated from. This value will not
> 	be updated again until a free page scanner wraps and resets
> 	compact_cached_free_pfn.
> 
> This is not optimal and it can still race but the compact_cached_free_pfn
> will be pointing to or very near a pageblock with free pages.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim


* [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-09 13:49 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3 Mel Gorman
@ 2012-08-09 13:49 ` Mel Gorman
  2012-08-09 23:35   ` Minchan Kim
  0 siblings, 1 reply; 16+ messages in thread
From: Mel Gorman @ 2012-08-09 13:49 UTC (permalink / raw)
  To: Linux-MM; +Cc: Rik van Riel, Minchan Kim, Jim Schutt, LKML, Mel Gorman

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.
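
For reference, the watermark logic being reused is in the unchanged context
of the function and so does not appear in the diff. It is approximately the
following, based on the 3.5 split_free_page() rather than anything new in
this patch:

        /* Obey watermarks as if the page was being allocated */
        watermark = low_wmark_pages(zone) + (1 << order);
        if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
                return 0;

        /* Remove page from free list */
        list_del(&page->lru);
        zone->free_area[order].nr_free--;

capture_free_page() keeps this check so a page is only captured while the
zone would still meet its low watermark afterwards.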

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h |    4 +-
 include/linux/mm.h         |    1 +
 mm/compaction.c            |   88 ++++++++++++++++++++++++++++++++++++++------
 mm/internal.h              |    1 +
 mm/page_alloc.c            |   63 +++++++++++++++++++++++--------
 5 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 51a90b7..5673459 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			bool sync);
+			bool sync, struct page **page);
 extern int compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	return COMPACT_CONTINUE;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..0812e86 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -454,6 +454,7 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 95ca967..384164e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,59 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static void compact_capture_page(struct compact_control *cc)
+{
+	unsigned long flags;
+	int mtype, mtype_low, mtype_high;
+
+	if (!cc->page || *cc->page)
+		return;
+
+	/*
+	 * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+	 * regardless of the migratetype of the freelist is is captured from.
+	 * This is fine because the order for a high-order MIGRATE_MOVABLE
+	 * allocation is typically at least a pageblock size and overall
+	 * fragmentation is not impaired. Other allocation types must
+	 * capture pages from their own migratelist because otherwise they
+	 * could pollute other pageblocks like MIGRATE_MOVABLE with
+	 * difficult to move pages and making fragmentation worse overall.
+	 */
+	if (cc->migratetype == MIGRATE_MOVABLE) {
+		mtype_low = 0;
+		mtype_high = MIGRATE_PCPTYPES;
+	} else {
+		mtype_low = cc->migratetype;
+		mtype_high = cc->migratetype + 1;
+	}
+
+	/* Speculatively examine the free lists without zone lock */
+	for (mtype = mtype_low; mtype < mtype_high; mtype++) {
+		int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct page *page;
+			struct free_area *area;
+			area = &(cc->zone->free_area[order]);
+			if (list_empty(&area->free_list[mtype]))
+				continue;
+
+			/* Take the lock and attempt capture of the page */
+			spin_lock_irqsave(&cc->zone->lock, flags);
+			if (!list_empty(&area->free_list[mtype])) {
+				page = list_entry(area->free_list[mtype].next,
+							struct page, lru);
+				if (capture_free_page(page, cc->order, mtype)) {
+					spin_unlock_irqrestore(&cc->zone->lock,
+									flags);
+					*cc->page = page;
+					return;
+				}
+			}
+			spin_unlock_irqrestore(&cc->zone->lock, flags);
+		}
+	}
+}
+
 /*
  * Isolate free pages onto a private freelist. Caller must hold zone->lock.
  * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -561,7 +614,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
-	unsigned int order;
 	unsigned long watermark;
 
 	if (fatal_signal_pending(current))
@@ -586,14 +638,22 @@ static int compact_finished(struct zone *zone,
 		return COMPACT_CONTINUE;
 
 	/* Direct compactor: Is a suitable page free? */
-	for (order = cc->order; order < MAX_ORDER; order++) {
-		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
-			return COMPACT_PARTIAL;
-
-		/* Job done if allocation would set block type */
-		if (order >= pageblock_order && zone->free_area[order].nr_free)
+	if (cc->page) {
+		/* Was a suitable page captured? */
+		if (*cc->page)
 			return COMPACT_PARTIAL;
+	} else {
+		unsigned int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct free_area *area = &zone->free_area[cc->order];
+			/* Job done if page is free of the right migratetype */
+			if (!list_empty(&area->free_list[cc->migratetype]))
+				return COMPACT_PARTIAL;
+
+			/* Job done if allocation would set block type */
+			if (cc->order >= pageblock_order && area->nr_free)
+				return COMPACT_PARTIAL;
+		}
 	}
 
 	return COMPACT_CONTINUE;
@@ -708,6 +768,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 				goto out;
 			}
 		}
+
+		/* Capture a page now if it is a suitable size */
+		compact_capture_page(cc);
 	}
 
 out:
@@ -720,7 +783,7 @@ out:
 
 static unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
-				 bool sync)
+				 bool sync, struct page **page)
 {
 	struct compact_control cc = {
 		.nr_freepages = 0,
@@ -729,6 +792,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
+		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -750,7 +814,7 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -770,7 +834,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 								nodemask) {
 		int status;
 
-		status = compact_zone_order(zone, order, gfp_mask, sync);
+		status = compact_zone_order(zone, order, gfp_mask, sync, page);
 		rc = max(status, rc);
 
 		/* If a normal allocation would succeed, stop compacting */
@@ -825,6 +889,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
 	struct compact_control cc = {
 		.order = order,
 		.sync = false,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(pgdat, &cc);
@@ -835,6 +900,7 @@ static int compact_node(int nid)
 	struct compact_control cc = {
 		.order = -1,
 		.sync = true,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..9156714 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -124,6 +124,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
+	struct page **page;		/* Page captured of requested size */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4a4f921..adc3aa8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1374,16 +1374,11 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * required at the given order and being isolated now to prevent races
+ * with parallel allocators
  */
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1405,10 +1400,11 @@ int split_free_page(struct page *page)
 	rmv_page_order(page);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 
-	/* Split into individual pages */
-	set_page_refcounted(page);
-	split_page(page, order);
+	if (alloc_order != order)
+		expand(zone, page, alloc_order, order,
+			&zone->free_area[order], migratetype);
 
+	/* Set the pageblock if the captured page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
@@ -1419,7 +1415,35 @@ int split_free_page(struct page *page)
 		}
 	}
 
-	return 1 << order;
+	return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	int nr_pages;
+
+	BUG_ON(!PageBuddy(page));
+	order = page_order(page);
+
+	nr_pages = capture_free_page(page, order, 0);
+	if (!nr_pages)
+		return 0;
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+	return nr_pages;
 }
 
 /*
@@ -2065,7 +2089,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	if (!order)
 		return NULL;
@@ -2077,10 +2101,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
-						nodemask, sync_migration);
+					nodemask, sync_migration, &page);
 	current->flags &= ~PF_MEMALLOC;
-	if (*did_some_progress != COMPACT_SKIPPED) {
 
+	/* If compaction captured a page, prep and use it */
+	if (page) {
+		prep_new_page(page, order, gfp_mask);
+		goto got_page;
+	}
+
+	if (*did_some_progress != COMPACT_SKIPPED) {
 		/* Page migration frees to the PCP lists but we want merging */
 		drain_pages(get_cpu());
 		put_cpu();
@@ -2090,6 +2120,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags, preferred_zone,
 				migratetype);
 		if (page) {
+got_page:
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
-- 
1.7.9.2


* Re: [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-09 13:49 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
@ 2012-08-09 23:35   ` Minchan Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2012-08-09 23:35 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux-MM, Rik van Riel, Jim Schutt, LKML

On Thu, Aug 09, 2012 at 02:49:23PM +0100, Mel Gorman wrote:
> While compaction is migrating pages to free up large contiguous blocks for
> allocation it races with other allocation requests that may steal these
> blocks or break them up. This patch alters direct compaction to capture a
> suitable free page as soon as it becomes available to reduce this race. It
> uses similar logic to split_free_page() to ensure that watermarks are
> still obeyed.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

-- 
Kind regards,
Minchan Kim


* [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available
  2012-08-14 16:41 [PATCH 0/5] Improve hugepage allocation success rates under load V4 Mel Gorman
@ 2012-08-14 16:41 ` Mel Gorman
  0 siblings, 0 replies; 16+ messages in thread
From: Mel Gorman @ 2012-08-14 16:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Rik van Riel, Minchan Kim, Jim Schutt, Linux-MM, LKML, Mel Gorman

While compaction is migrating pages to free up large contiguous blocks for
allocation it races with other allocation requests that may steal these
blocks or break them up. This patch alters direct compaction to capture a
suitable free page as soon as it becomes available to reduce this race. It
uses similar logic to split_free_page() to ensure that watermarks are
still obeyed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
---
 include/linux/compaction.h |    4 +-
 include/linux/mm.h         |    1 +
 mm/compaction.c            |   88 ++++++++++++++++++++++++++++++++++++++------
 mm/internal.h              |    1 +
 mm/page_alloc.c            |   63 +++++++++++++++++++++++--------
 5 files changed, 128 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 133ddcf..fd20c15 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -22,7 +22,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			bool sync);
+			bool sync, struct page **page);
 extern int compact_pgdat(pg_data_t *pgdat, int order);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
 
@@ -64,7 +64,7 @@ static inline bool compaction_deferred(struct zone *zone, int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	return COMPACT_CONTINUE;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0514fe9..5ddb11b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -442,6 +442,7 @@ void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
 int split_free_page(struct page *page);
+int capture_free_page(struct page *page, int alloc_order, int migratetype);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index ea588eb..a806a9c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -50,6 +50,59 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+static void compact_capture_page(struct compact_control *cc)
+{
+	unsigned long flags;
+	int mtype, mtype_low, mtype_high;
+
+	if (!cc->page || *cc->page)
+		return;
+
+	/*
+	 * For MIGRATE_MOVABLE allocations we capture a suitable page ASAP
+	 * regardless of the migratetype of the freelist is is captured from.
+	 * This is fine because the order for a high-order MIGRATE_MOVABLE
+	 * allocation is typically at least a pageblock size and overall
+	 * fragmentation is not impaired. Other allocation types must
+	 * capture pages from their own migratelist because otherwise they
+	 * could pollute other pageblocks like MIGRATE_MOVABLE with
+	 * difficult to move pages and making fragmentation worse overall.
+	 */
+	if (cc->migratetype == MIGRATE_MOVABLE) {
+		mtype_low = 0;
+		mtype_high = MIGRATE_PCPTYPES;
+	} else {
+		mtype_low = cc->migratetype;
+		mtype_high = cc->migratetype + 1;
+	}
+
+	/* Speculatively examine the free lists without zone lock */
+	for (mtype = mtype_low; mtype < mtype_high; mtype++) {
+		int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct page *page;
+			struct free_area *area;
+			area = &(cc->zone->free_area[order]);
+			if (list_empty(&area->free_list[mtype]))
+				continue;
+
+			/* Take the lock and attempt capture of the page */
+			spin_lock_irqsave(&cc->zone->lock, flags);
+			if (!list_empty(&area->free_list[mtype])) {
+				page = list_entry(area->free_list[mtype].next,
+							struct page, lru);
+				if (capture_free_page(page, cc->order, mtype)) {
+					spin_unlock_irqrestore(&cc->zone->lock,
+									flags);
+					*cc->page = page;
+					return;
+				}
+			}
+			spin_unlock_irqrestore(&cc->zone->lock, flags);
+		}
+	}
+}
+
 /*
  * Isolate free pages onto a private freelist. Caller must hold zone->lock.
  * If @strict is true, will abort returning 0 on any invalid PFNs or non-free
@@ -589,7 +642,6 @@ static unsigned long start_free_pfn(struct zone *zone)
 static int compact_finished(struct zone *zone,
 			    struct compact_control *cc)
 {
-	unsigned int order;
 	unsigned long watermark;
 
 	if (fatal_signal_pending(current))
@@ -632,14 +684,22 @@ static int compact_finished(struct zone *zone,
 		return COMPACT_CONTINUE;
 
 	/* Direct compactor: Is a suitable page free? */
-	for (order = cc->order; order < MAX_ORDER; order++) {
-		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&zone->free_area[order].free_list[cc->migratetype]))
-			return COMPACT_PARTIAL;
-
-		/* Job done if allocation would set block type */
-		if (order >= pageblock_order && zone->free_area[order].nr_free)
+	if (cc->page) {
+		/* Was a suitable page captured? */
+		if (*cc->page)
 			return COMPACT_PARTIAL;
+	} else {
+		unsigned int order;
+		for (order = cc->order; order < MAX_ORDER; order++) {
+			struct free_area *area = &zone->free_area[cc->order];
+			/* Job done if page is free of the right migratetype */
+			if (!list_empty(&area->free_list[cc->migratetype]))
+				return COMPACT_PARTIAL;
+
+			/* Job done if allocation would set block type */
+			if (cc->order >= pageblock_order && area->nr_free)
+				return COMPACT_PARTIAL;
+		}
 	}
 
 	return COMPACT_CONTINUE;
@@ -761,6 +821,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 				goto out;
 			}
 		}
+
+		/* Capture a page now if it is a suitable size */
+		compact_capture_page(cc);
 	}
 
 out:
@@ -773,7 +836,7 @@ out:
 
 static unsigned long compact_zone_order(struct zone *zone,
 				 int order, gfp_t gfp_mask,
-				 bool sync)
+				 bool sync, struct page **page)
 {
 	struct compact_control cc = {
 		.nr_freepages = 0,
@@ -782,6 +845,7 @@ static unsigned long compact_zone_order(struct zone *zone,
 		.migratetype = allocflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.sync = sync,
+		.page = page,
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -803,7 +867,7 @@ int sysctl_extfrag_threshold = 500;
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			bool sync)
+			bool sync, struct page **page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -823,7 +887,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 								nodemask) {
 		int status;
 
-		status = compact_zone_order(zone, order, gfp_mask, sync);
+		status = compact_zone_order(zone, order, gfp_mask, sync, page);
 		rc = max(status, rc);
 
 		/* If a normal allocation would succeed, stop compacting */
@@ -878,6 +942,7 @@ int compact_pgdat(pg_data_t *pgdat, int order)
 	struct compact_control cc = {
 		.order = order,
 		.sync = false,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(pgdat, &cc);
@@ -888,6 +953,7 @@ static int compact_node(int nid)
 	struct compact_control cc = {
 		.order = -1,
 		.sync = true,
+		.page = NULL,
 	};
 
 	return __compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 3314f79..b03f05e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -130,6 +130,7 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
+	struct page **page;		/* Page captured of requested size */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cefac39..d1759f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1379,16 +1379,11 @@ void split_page(struct page *page, unsigned int order)
 }
 
 /*
- * Similar to split_page except the page is already free. As this is only
- * being used for migration, the migratetype of the block also changes.
- * As this is called with interrupts disabled, the caller is responsible
- * for calling arch_alloc_page() and kernel_map_page() after interrupts
- * are enabled.
- *
- * Note: this is probably too low level an operation for use in drivers.
- * Please consult with lkml before using this in your driver.
+ * Similar to the split_page family of functions except that the page
+ * required at the given order and being isolated now to prevent races
+ * with parallel allocators
  */
-int split_free_page(struct page *page)
+int capture_free_page(struct page *page, int alloc_order, int migratetype)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1410,10 +1405,11 @@ int split_free_page(struct page *page)
 	rmv_page_order(page);
 	__mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
 
-	/* Split into individual pages */
-	set_page_refcounted(page);
-	split_page(page, order);
+	if (alloc_order != order)
+		expand(zone, page, alloc_order, order,
+			&zone->free_area[order], migratetype);
 
+	/* Set the pageblock if the captured page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
 		struct page *endpage = page + (1 << order) - 1;
 		for (; page < endpage; page += pageblock_nr_pages) {
@@ -1424,7 +1420,35 @@ int split_free_page(struct page *page)
 		}
 	}
 
-	return 1 << order;
+	return 1UL << order;
+}
+
+/*
+ * Similar to split_page except the page is already free. As this is only
+ * being used for migration, the migratetype of the block also changes.
+ * As this is called with interrupts disabled, the caller is responsible
+ * for calling arch_alloc_page() and kernel_map_page() after interrupts
+ * are enabled.
+ *
+ * Note: this is probably too low level an operation for use in drivers.
+ * Please consult with lkml before using this in your driver.
+ */
+int split_free_page(struct page *page)
+{
+	unsigned int order;
+	int nr_pages;
+
+	BUG_ON(!PageBuddy(page));
+	order = page_order(page);
+
+	nr_pages = capture_free_page(page, order, 0);
+	if (!nr_pages)
+		return 0;
+
+	/* Split into individual pages */
+	set_page_refcounted(page);
+	split_page(page, order);
+	return nr_pages;
 }
 
 /*
@@ -2093,7 +2117,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
-	struct page *page;
+	struct page *page = NULL;
 
 	if (!order)
 		return NULL;
@@ -2105,10 +2129,16 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
-						nodemask, sync_migration);
+					nodemask, sync_migration, &page);
 	current->flags &= ~PF_MEMALLOC;
-	if (*did_some_progress != COMPACT_SKIPPED) {
 
+	/* If compaction captured a page, prep and use it */
+	if (page) {
+		prep_new_page(page, order, gfp_mask);
+		goto got_page;
+	}
+
+	if (*did_some_progress != COMPACT_SKIPPED) {
 		/* Page migration frees to the PCP lists but we want merging */
 		drain_pages(get_cpu());
 		put_cpu();
@@ -2118,6 +2148,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
 				preferred_zone, migratetype);
 		if (page) {
+got_page:
 			preferred_zone->compact_considered = 0;
 			preferred_zone->compact_defer_shift = 0;
 			if (order >= preferred_zone->compact_order_failed)
-- 
1.7.9.2


Thread overview: 16+ messages
2012-08-08 19:08 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V2 Mel Gorman
2012-08-08 19:08 ` [PATCH 1/5] mm: compaction: Update comment in try_to_compact_pages Mel Gorman
2012-08-08 19:08 ` [PATCH 2/5] mm: vmscan: Scale number of pages reclaimed by reclaim/compaction based on failures Mel Gorman
2012-08-08 19:08 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
2012-08-09  1:33   ` Minchan Kim
2012-08-09  8:11     ` Mel Gorman
2012-08-09  8:41       ` Minchan Kim
2012-08-08 19:08 ` [PATCH 4/5] mm: have order > 0 compaction start off where it left Mel Gorman
2012-08-08 19:08 ` [PATCH 5/5] mm: have order > 0 compaction start near a pageblock with free pages Mel Gorman
2012-08-09  0:12   ` Minchan Kim
2012-08-09  8:23     ` Mel Gorman
2012-08-09  8:46       ` Minchan Kim
2012-08-09  8:47   ` Minchan Kim
  -- strict thread matches above, loose matches on Subject: below --
2012-08-09 13:49 [RFC PATCH 0/5] Improve hugepage allocation success rates under load V3 Mel Gorman
2012-08-09 13:49 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
2012-08-09 23:35   ` Minchan Kim
2012-08-14 16:41 [PATCH 0/5] Improve hugepage allocation success rates under load V4 Mel Gorman
2012-08-14 16:41 ` [PATCH 3/5] mm: compaction: Capture a suitable high-order page immediately when it is made available Mel Gorman
