[PATCH V4 00/15] compaction: balancing overhead and success rates

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V4 00/15] compaction: balancing overhead and success rates
@ 2014-07-16 13:48 Vlastimil Babka
  2014-07-16 13:48 ` [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
                   ` (14 more replies)
  0 siblings, 15 replies; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Christoph Lameter, Joonsoo Kim,
	Mel Gorman, Michal Nazarewicz, Minchan Kim, Naoya Horiguchi,
	Rik van Riel, Zhang Yanfei

Based on next-20140715.

After a while, here's a V4 of series tweaking memory compaction.

Additional evaluation was made with stress-highalloc configured to use
__GFP_NO_KSWAPD, which makes it look like a THP page fault, so it only does
async compaction and is more likely to abort. This led to different (mostly
better) results in patches 10, 11 and 15, which is according to expectation.

Major changes in V4:

- Patch 2 changed deferred compaction signalling to a new return value instead
  of boolean pointer (suggested by Joonsoo Kim)

- Patch 3 is a new small change to make compact_stall reporting more accurate.
  I did it separately instead of within patch 2, as that one can also change
  reported compact_stall for other reasons, so bisectability etc...

- Patch 5 (previously 4) is bigger than last time as according to the
  suggestions it has gone all way to make isolate_migratepages family of
  functions to be the same as isolate_freepages family. So there is now a
  separate isolate_migratepages_block() function, and redundant parameters
  were removed across both families of functions.

- Patch 6 is a new patch triggered by Naoya Horiguchi's suggestion. It further
  unifies the scanner families and removes a per-page page_zone check from
  the migration scanner.

- Patch 7 (previously 5) was, after some discussions with Minchan Kim, changed
  to affect only khugepaged. For that reason, the contention type is passed
  back all the way to __alloc_pages_slowpath() where the decisions to continue
  or abort are made. I also changed the enum to a simple int, as the enum
  definition would otherwise had to be included in more source files.
  Also there are now hopefully no remaining holes where need_sched() or fatal
  signal pending would not lead to immediate abort through all the layers of
  direct compaction.

- Patch 15 remains a RFC, as there are still some not fully clear consequences,
  and I need to measure whether not calling update_pageblock_skip() in the
  skip_on_failure mode is a good decision. It probably isn't, as not marking
  the pageblock as skipped and not updating cached pfn means pageblocks will
  be checked repeatedly. On the other hand, marking pageblock as unsuitable
  for compaction, even though it was not fully scanned to due skip_on_failure,
  means that a lower-order compaction could succeed, but won't try the
  pageblock. Sigh.
 
David Rientjes (2):
  mm: rename allocflags_to_migratetype for clarity
  mm, compaction: pass gfp mask to compact_control

Vlastimil Babka (13):
  mm, THP: don't hold mmap_sem in khugepaged when allocating THP
  mm, compaction: defer each zone individually instead of preferred zone
  mm, compaction: do not count compact_stall if all zones skipped
    compaction
  mm, compaction: do not recheck suitable_migration_target under lock
  mm, compaction: move pageblock checks up from
    isolate_migratepages_range()
  mm, compaction: reduce zone checking frequency in the migration
    scanner
  mm, compaction: khugepaged should not give up due to need_resched()
  mm, compaction: periodically drop lock and restore IRQs in scanners
  mm, compaction: skip rechecks when lock was already held
  mm, compaction: remember position within pageblock in free pages
    scanner
  mm, compaction: skip buddy pages by their order in the migrate scanner
  mm, compaction: try to capture the just-created high-order freepage
  mm, compaction: do not migrate pages when that cannot satisfy page
    fault allocation

 include/linux/compaction.h |  28 +-
 include/linux/gfp.h        |   2 +-
 mm/compaction.c            | 781 ++++++++++++++++++++++++++++++++-------------
 mm/huge_memory.c           |  20 +-
 mm/internal.h              |  28 +-
 mm/page_alloc.c            | 189 ++++++++---
 6 files changed, 752 insertions(+), 296 deletions(-)

-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:18   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone Vlastimil Babka
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

When allocating huge page for collapsing, khugepaged currently holds mmap_sem
for reading on the mm where collapsing occurs. Afterwards the read lock is
dropped before write lock is taken on the same mmap_sem.

Holding mmap_sem during whole huge page allocation is therefore useless, the
vma needs to be rechecked after taking the write lock anyway. Furthemore, huge
page allocation might involve a rather long sync compaction, and thus block
any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
or mprotect oterations.

This patch simply releases the read lock before allocating a huge page. It
also deletes an outdated comment that assumed vma must be stable, as it was
using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
("mm: thp: khugepaged: add policy for finding target node").

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/huge_memory.c | 20 +++++++-------------
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 02559ef..107da28 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2295,23 +2295,17 @@ static struct page
 		       int node)
 {
 	VM_BUG_ON_PAGE(*hpage, *hpage);
+
 	/*
-	 * Allocate the page while the vma is still valid and under
-	 * the mmap_sem read mode so there is no memory allocation
-	 * later when we take the mmap_sem in write mode. This is more
-	 * friendly behavior (OTOH it may actually hide bugs) to
-	 * filesystems in userland with daemons allocating memory in
-	 * the userland I/O paths.  Allocating memory with the
-	 * mmap_sem in read mode is good idea also to allow greater
-	 * scalability.
+	 * Before allocating the hugepage, release the mmap_sem read lock.
+	 * The allocation can take potentially a long time if it involves
+	 * sync compaction, and we do not need to hold the mmap_sem during
+	 * that. We will recheck the vma after taking it again in write mode.
 	 */
+	up_read(&mm->mmap_sem);
+
 	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
 		khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
-	/*
-	 * After allocating the hugepage, release the mmap_sem read lock in
-	 * preparation for taking it in write mode.
-	 */
-	up_read(&mm->mmap_sem);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
  2014-07-16 13:48 ` [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:20   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction Vlastimil Babka
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Minchan Kim, Zhang Yanfei

When direct sync compaction is often unsuccessful, it may become deferred for
some time to avoid further useless attempts, both sync and async. Successful
high-order allocations un-defer compaction, while further unsuccessful
compaction attempts prolong the copmaction deferred period.

Currently the checking and setting deferred status is performed only on the
preferred zone of the allocation that invoked direct compaction. But compaction
itself is attempted on all eligible zones in the zonelist, so the behavior is
suboptimal and may lead both to scenarios where 1) compaction is attempted
uselessly, or 2) where it's not attempted despite good chances of succeeding,
as shown on the examples below:

1) A direct compaction with Normal preferred zone failed and set deferred
   compaction for the Normal zone. Another unrelated direct compaction with
   DMA32 as preferred zone will attempt to compact DMA32 zone even though
   the first compaction attempt also included DMA32 zone.

   In another scenario, compaction with Normal preferred zone failed to compact
   Normal zone, but succeeded in the DMA32 zone, so it will not defer
   compaction. In the next attempt, it will try Normal zone which will fail
   again, instead of skipping Normal zone and trying DMA32 directly.

2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
   looking good. A direct compaction with preferred Normal zone will skip
   compaction of all zones including DMA32 because Normal was still deferred.
   The allocation might have succeeded in DMA32, but won't.

This patch makes compaction deferring work on individual zone basis instead of
preferred zone. For each zone, it checks compaction_deferred() to decide if the
zone should be skipped. If watermarks fail after compacting the zone,
defer_compaction() is called. The zone where watermarks passed can still be
deferred when the allocation attempt is unsuccessful. When allocation is
successful, compaction_defer_reset() is called for the zone containing the
allocated page. This approach should approximate calling defer_compaction()
only on zones where compaction was attempted and did not yield allocated page.
There might be corner cases but that is inevitable as long as the decision
to stop compacting dues not guarantee that a page will be allocated.

During testing on a two-node machine with a single very small Normal zone on
node 1, this patch has improved success rates in stress-highalloc mmtests
benchmark. The success here were previously made worse by commit 3a025760fc
("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
no longer resetting often enough the deferred compaction for the Normal zone,
and DMA32 zones on both nodes were thus not considered for compaction.
On different machine, success rates were improved with __GFP_NO_KSWAPD
allocations.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/compaction.h | 16 ++++++++++------
 mm/compaction.c            | 30 ++++++++++++++++++++++++------
 mm/page_alloc.c            | 39 +++++++++++++++++++++++----------------
 3 files changed, 57 insertions(+), 28 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 01e3132..b2e4c92 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -2,14 +2,16 @@
 #define _LINUX_COMPACTION_H
 
 /* Return values for compact_zone() and try_to_compact_pages() */
+/* compaction didn't start as it was deferred due to past failures */
+#define COMPACT_DEFERRED	0
 /* compaction didn't start as it was not possible or direct reclaim was more suitable */
-#define COMPACT_SKIPPED		0
+#define COMPACT_SKIPPED		1
 /* compaction should continue to another pageblock */
-#define COMPACT_CONTINUE	1
+#define COMPACT_CONTINUE	2
 /* direct compaction partially compacted a zone and there are suitable pages */
-#define COMPACT_PARTIAL		2
+#define COMPACT_PARTIAL		3
 /* The full zone was compacted */
-#define COMPACT_COMPLETE	3
+#define COMPACT_COMPLETE	4
 
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
@@ -22,7 +24,8 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			enum migrate_mode mode, bool *contended);
+			enum migrate_mode mode, bool *contended,
+			struct zone **candidate_zone);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -91,7 +94,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended)
+			enum migrate_mode mode, bool *contended,
+			struct zone **candidate_zone)
 {
 	return COMPACT_CONTINUE;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 5175019..f3ae2ec 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1122,28 +1122,27 @@ int sysctl_extfrag_threshold = 500;
  * @nodemask: The allowed nodes to allocate from
  * @mode: The migration mode for async, sync light, or sync migration
  * @contended: Return value that is true if compaction was aborted due to lock contention
- * @page: Optionally capture a free page of the requested order during compaction
+ * @candidate_zone: Return the zone where we think allocation should succeed
  *
  * This is the main entry point for direct page compaction.
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended)
+			enum migrate_mode mode, bool *contended,
+			struct zone **candidate_zone)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
-	int rc = COMPACT_SKIPPED;
+	int rc = COMPACT_DEFERRED;
 	int alloc_flags = 0;
 
 	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
 		return rc;
 
-	count_compact_event(COMPACTSTALL);
-
 #ifdef CONFIG_CMA
 	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -1153,14 +1152,33 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 								nodemask) {
 		int status;
 
+		if (compaction_deferred(zone, order))
+			continue;
+
 		status = compact_zone_order(zone, order, gfp_mask, mode,
 						contended);
 		rc = max(status, rc);
 
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
-				      alloc_flags))
+				      alloc_flags)) {
+			*candidate_zone = zone;
+			/*
+			 * We think the allocation will succeed in this zone,
+			 * but it is not certain, hence the false. The caller
+			 * will repeat this with true if allocation indeed
+			 * succeeds in this zone.
+			 */
+			compaction_defer_reset(zone, order, false);
 			break;
+		} else if (mode != MIGRATE_ASYNC) {
+			/*
+			 * We think that allocation won't succeed in this zone
+			 * so we defer compaction there. If it ends up
+			 * succeeding after all, it will be reset.
+			 */
+			defer_compaction(zone, order);
+		}
 	}
 
 	return rc;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f1e462e..963dacd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2250,21 +2250,24 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	bool *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
-	if (!order)
-		return NULL;
+	struct zone *last_compact_zone = NULL;
 
-	if (compaction_deferred(preferred_zone, order)) {
-		*deferred_compaction = true;
+	if (!order)
 		return NULL;
-	}
 
 	current->flags |= PF_MEMALLOC;
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, mode,
-						contended_compaction);
+						contended_compaction,
+						&last_compact_zone);
 	current->flags &= ~PF_MEMALLOC;
 
-	if (*did_some_progress != COMPACT_SKIPPED) {
+	if (*did_some_progress > COMPACT_DEFERRED)
+		count_vm_event(COMPACTSTALL);
+	else
+		*deferred_compaction = true;
+
+	if (*did_some_progress > COMPACT_SKIPPED) {
 		struct page *page;
 
 		/* Page migration frees to the PCP lists but we want merging */
@@ -2275,27 +2278,31 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 				order, zonelist, high_zoneidx,
 				alloc_flags & ~ALLOC_NO_WATERMARKS,
 				preferred_zone, classzone_idx, migratetype);
+
 		if (page) {
-			preferred_zone->compact_blockskip_flush = false;
-			compaction_defer_reset(preferred_zone, order, true);
+			struct zone *zone = page_zone(page);
+
+			zone->compact_blockskip_flush = false;
+			compaction_defer_reset(zone, order, true);
 			count_vm_event(COMPACTSUCCESS);
 			return page;
 		}
 
 		/*
+		 * last_compact_zone is where try_to_compact_pages thought
+		 * allocation should succeed, so it did not defer compaction.
+		 * But now we know that it didn't succeed, so we do the defer.
+		 */
+		if (last_compact_zone && mode != MIGRATE_ASYNC)
+			defer_compaction(last_compact_zone, order);
+
+		/*
 		 * It's bad if compaction run occurs and fails.
 		 * The most likely reason is that pages exist,
 		 * but not enough to satisfy watermarks.
 		 */
 		count_vm_event(COMPACTFAIL);
 
-		/*
-		 * As async compaction considers a subset of pageblocks, only
-		 * defer if the failure was a sync compaction failure.
-		 */
-		if (mode != MIGRATE_ASYNC)
-			defer_compaction(preferred_zone, order);
-
 		cond_resched();
 	}
 
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
  2014-07-16 13:48 ` [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
  2014-07-16 13:48 ` [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:22   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

The compact_stall vmstat counter counts the number of allocations stalled by
direct compaction. It does not count when all attempted zones had deferred
compaction, but it does count when all zones skipped compaction. The skipping
is decided based on very early check of compaction_suitable(), based on
watermarks and memory fragmentation. Therefore it makes sense not to count
skipped compactions as stalls. Moreover, compact_success or compact_fail is
also already not being counted when compaction was skipped, so this patch
changes the compact_stall counting to match the other two.

Additionally, restructure __alloc_pages_direct_compact() code for better
readability.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c | 75 +++++++++++++++++++++++++++++++--------------------------
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 963dacd..69a14b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2251,6 +2251,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	unsigned long *did_some_progress)
 {
 	struct zone *last_compact_zone = NULL;
+	struct page *page;
 
 	if (!order)
 		return NULL;
@@ -2262,49 +2263,55 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 						&last_compact_zone);
 	current->flags &= ~PF_MEMALLOC;
 
-	if (*did_some_progress > COMPACT_DEFERRED)
-		count_vm_event(COMPACTSTALL);
-	else
+	switch (*did_some_progress) {
+	case COMPACT_DEFERRED:
 		*deferred_compaction = true;
+		/* fall-through */
+	case COMPACT_SKIPPED:
+		return NULL;
+	default:
+		break;
+	}
 
-	if (*did_some_progress > COMPACT_SKIPPED) {
-		struct page *page;
+	/*
+	 * At least in one zone compaction wasn't deferred or skipped, so let's
+	 * count a compaction stall
+	 */
+	count_vm_event(COMPACTSTALL);
 
-		/* Page migration frees to the PCP lists but we want merging */
-		drain_pages(get_cpu());
-		put_cpu();
+	/* Page migration frees to the PCP lists but we want merging */
+	drain_pages(get_cpu());
+	put_cpu();
 
-		page = get_page_from_freelist(gfp_mask, nodemask,
-				order, zonelist, high_zoneidx,
-				alloc_flags & ~ALLOC_NO_WATERMARKS,
-				preferred_zone, classzone_idx, migratetype);
+	page = get_page_from_freelist(gfp_mask, nodemask,
+			order, zonelist, high_zoneidx,
+			alloc_flags & ~ALLOC_NO_WATERMARKS,
+			preferred_zone, classzone_idx, migratetype);
 
-		if (page) {
-			struct zone *zone = page_zone(page);
+	if (page) {
+		struct zone *zone = page_zone(page);
 
-			zone->compact_blockskip_flush = false;
-			compaction_defer_reset(zone, order, true);
-			count_vm_event(COMPACTSUCCESS);
-			return page;
-		}
+		zone->compact_blockskip_flush = false;
+		compaction_defer_reset(zone, order, true);
+		count_vm_event(COMPACTSUCCESS);
+		return page;
+	}
 
-		/*
-		 * last_compact_zone is where try_to_compact_pages thought
-		 * allocation should succeed, so it did not defer compaction.
-		 * But now we know that it didn't succeed, so we do the defer.
-		 */
-		if (last_compact_zone && mode != MIGRATE_ASYNC)
-			defer_compaction(last_compact_zone, order);
+	/*
+	 * last_compact_zone is where try_to_compact_pages thought allocation
+	 * should succeed, so it did not defer compaction. But here we know
+	 * that it didn't succeed, so we do the defer.
+	 */
+	if (last_compact_zone && mode != MIGRATE_ASYNC)
+		defer_compaction(last_compact_zone, order);
 
-		/*
-		 * It's bad if compaction run occurs and fails.
-		 * The most likely reason is that pages exist,
-		 * but not enough to satisfy watermarks.
-		 */
-		count_vm_event(COMPACTFAIL);
+	/*
+	 * It's bad if compaction run occurs and fails. The most likely reason
+	 * is that pages exist, but not enough to satisfy watermarks.
+	 */
+	count_vm_event(COMPACTFAIL);
 
-		cond_resched();
-	}
+	cond_resched();
 
 	return NULL;
 }
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (2 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:23   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range() Vlastimil Babka
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Minchan Kim, Zhang Yanfei

isolate_freepages_block() rechecks if the pageblock is suitable to be a target
for migration after it has taken the zone->lock. However, the check has been
optimized to occur only once per pageblock, and compact_checklock_irqsave()
might be dropping and reacquiring lock, which means somebody else might have
changed the pageblock's migratetype meanwhile.

Furthermore, nothing prevents the migratetype to change right after
isolate_freepages_block() has finished isolating. Given how imperfect this is,
it's simpler to just rely on the check done in isolate_freepages() without
lock, and not pretend that the recheck under lock guarantees anything. It is
just a heuristic after all.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f3ae2ec..0a871e5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -276,7 +276,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
-	bool checked_pageblock = false;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -307,18 +306,6 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		if (!locked)
 			break;
 
-		/* Recheck this is a suitable migration target under lock */
-		if (!strict && !checked_pageblock) {
-			/*
-			 * We need to check suitability of pageblock only once
-			 * and this isolate_freepages_block() is called with
-			 * pageblock range, so just check once is sufficient.
-			 */
-			checked_pageblock = true;
-			if (!suitable_migration_target(page))
-				break;
-		}
-
 		/* Recheck this is a buddy page under lock */
 		if (!PageBuddy(page))
 			goto isolate_fail;
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range()
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (3 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:28   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner Vlastimil Babka
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

isolate_migratepages_range() is the main function of the compaction scanner,
called either on a single pageblock by isolate_migratepages() during regular
compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
It currently perfoms two pageblock-wide compaction suitability checks, and
because of the CMA callpath, it tracks if it crossed a pageblock boundary in
order to repeat those checks.

However, closer inspection shows that those checks are always true for CMA:
- isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
- migrate_async_suitable() check is skipped because CMA uses sync compaction

We can therefore move the compaction-specific checks to isolate_migratepages()
and simplify isolate_migratepages_range(). Furthermore, we can mimic the
freepage scanner family of functions, which has isolate_freepages_block()
function called both by compaction from isolate_freepages() and by CMA from
isolate_freepages_range(), where each use-case adds own specific glue code.
This allows further code simplification.

Therefore, we rename isolate_migratepages_range() to isolate_freepages_block()
and limit its functionality to a single pageblock (or its subset). For CMA,
a new different isolate_migratepages_range() is created as a CMA-specific
wrapper for the _block() function. The checks specific to compaction are moved
to isolate_migratepages(). As part of the unification of these two families of
functions, we remove the redundant zone parameter where applicable, since zone
pointer is already passed in cc->zone.

Furthermore, going back to compact_zone() and compact_finished() when pageblock
is found unsuitable (now by isolate_migratepages()) is wasteful - the checks
are meant to skip pageblocks quickly. The patch therefore also introduces a
simple loop into isolate_migratepages() so that it does not return immediately
on failed pageblock checks, but keeps going until isolate_migratepages_range()
gets called once. Similarily to isolate_freepages(), the function periodically
checks if it needs to reschedule or abort async compaction.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 234 +++++++++++++++++++++++++++++++++-----------------------
 mm/internal.h   |   4 +-
 mm/page_alloc.c |   3 +-
 3 files changed, 140 insertions(+), 101 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 0a871e5..28e48ea 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -132,7 +132,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
  */
 static void update_pageblock_skip(struct compact_control *cc,
 			struct page *page, unsigned long nr_isolated,
-			bool set_unsuitable, bool migrate_scanner)
+			bool migrate_scanner)
 {
 	struct zone *zone = cc->zone;
 	unsigned long pfn;
@@ -146,12 +146,7 @@ static void update_pageblock_skip(struct compact_control *cc,
 	if (nr_isolated)
 		return;
 
-	/*
-	 * Only skip pageblocks when all forms of compaction will be known to
-	 * fail in the near future.
-	 */
-	if (set_unsuitable)
-		set_pageblock_skip(page);
+	set_pageblock_skip(page);
 
 	pfn = page_to_pfn(page);
 
@@ -180,7 +175,7 @@ static inline bool isolation_suitable(struct compact_control *cc,
 
 static void update_pageblock_skip(struct compact_control *cc,
 			struct page *page, unsigned long nr_isolated,
-			bool set_unsuitable, bool migrate_scanner)
+			bool migrate_scanner)
 {
 }
 #endif /* CONFIG_COMPACTION */
@@ -345,8 +340,7 @@ isolate_fail:
 
 	/* Update the pageblock-skip if the whole pageblock was scanned */
 	if (blockpfn == end_pfn)
-		update_pageblock_skip(cc, valid_page, total_isolated, true,
-				      false);
+		update_pageblock_skip(cc, valid_page, total_isolated, false);
 
 	count_compact_events(COMPACTFREE_SCANNED, nr_scanned);
 	if (total_isolated)
@@ -451,40 +445,34 @@ static bool too_many_isolated(struct zone *zone)
 }
 
 /**
- * isolate_migratepages_range() - isolate all migrate-able pages in range.
- * @zone:	Zone pages are in.
+ * isolate_migratepages_block() - isolate all migrate-able pages within
+ *				  a single pageblock
  * @cc:		Compaction control structure.
- * @low_pfn:	The first PFN of the range.
- * @end_pfn:	The one-past-the-last PFN of the range.
- * @unevictable: true if it allows to isolate unevictable pages
+ * @low_pfn:	The first PFN to isolate
+ * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
+ * @isolate_mode: Isolation mode to be used.
  *
  * Isolate all pages that can be migrated from the range specified by
- * [low_pfn, end_pfn).  Returns zero if there is a fatal signal
- * pending), otherwise PFN of the first page that was not scanned
- * (which may be both less, equal to or more then end_pfn).
- *
- * Assumes that cc->migratepages is empty and cc->nr_migratepages is
- * zero.
+ * [low_pfn, end_pfn). The range is expected to be within same pageblock.
+ * Returns zero if there is a fatal signal pending, otherwise PFN of the
+ * first page that was not scanned (which may be both less, equal to or more
+ * than end_pfn).
  *
- * Apart from cc->migratepages and cc->nr_migratetypes this function
- * does not modify any cc's fields, in particular it does not modify
- * (or read for that matter) cc->migrate_pfn.
+ * The pages are isolated on cc->migratepages list (not required to be empty),
+ * and cc->nr_migratepages is updated accordingly. The cc->migrate_pfn field
+ * is neither read nor updated.
  */
-unsigned long
-isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
-		unsigned long low_pfn, unsigned long end_pfn, bool unevictable)
+static unsigned long
+isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
+			unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
-	unsigned long last_pageblock_nr = 0, pageblock_nr;
+	struct zone *zone = cc->zone;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
 	struct lruvec *lruvec;
 	unsigned long flags;
 	bool locked = false;
 	struct page *page = NULL, *valid_page = NULL;
-	bool set_unsuitable = true;
-	const isolate_mode_t mode = (cc->mode == MIGRATE_ASYNC ?
-					ISOLATE_ASYNC_MIGRATE : 0) |
-				    (unevictable ? ISOLATE_UNEVICTABLE : 0);
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -515,19 +503,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 			}
 		}
 
-		/*
-		 * migrate_pfn does not necessarily start aligned to a
-		 * pageblock. Ensure that pfn_valid is called when moving
-		 * into a new MAX_ORDER_NR_PAGES range in case of large
-		 * memory holes within the zone
-		 */
-		if ((low_pfn & (MAX_ORDER_NR_PAGES - 1)) == 0) {
-			if (!pfn_valid(low_pfn)) {
-				low_pfn += MAX_ORDER_NR_PAGES - 1;
-				continue;
-			}
-		}
-
 		if (!pfn_valid_within(low_pfn))
 			continue;
 		nr_scanned++;
@@ -545,28 +520,6 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		if (!valid_page)
 			valid_page = page;
 
-		/* If isolation recently failed, do not retry */
-		pageblock_nr = low_pfn >> pageblock_order;
-		if (last_pageblock_nr != pageblock_nr) {
-			int mt;
-
-			last_pageblock_nr = pageblock_nr;
-			if (!isolation_suitable(cc, page))
-				goto next_pageblock;
-
-			/*
-			 * For async migration, also only scan in MOVABLE
-			 * blocks. Async migration is optimistic to see if
-			 * the minimum amount of work satisfies the allocation
-			 */
-			mt = get_pageblock_migratetype(page);
-			if (cc->mode == MIGRATE_ASYNC &&
-			    !migrate_async_suitable(mt)) {
-				set_unsuitable = false;
-				goto next_pageblock;
-			}
-		}
-
 		/*
 		 * Skip if free. page_order cannot be used without zone->lock
 		 * as nothing prevents parallel allocations or buddy merging.
@@ -601,8 +554,11 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		 */
 		if (PageTransHuge(page)) {
 			if (!locked)
-				goto next_pageblock;
-			low_pfn += (1 << compound_order(page)) - 1;
+				low_pfn = ALIGN(low_pfn + 1,
+						pageblock_nr_pages) - 1;
+			else
+				low_pfn += (1 << compound_order(page)) - 1;
+
 			continue;
 		}
 
@@ -632,7 +588,7 @@ isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 
 		/* Try isolate the page */
-		if (__isolate_lru_page(page, mode) != 0)
+		if (__isolate_lru_page(page, isolate_mode) != 0)
 			continue;
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -651,11 +607,6 @@ isolate_success:
 			++low_pfn;
 			break;
 		}
-
-		continue;
-
-next_pageblock:
-		low_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages) - 1;
 	}
 
 	acct_isolated(zone, locked, cc);
@@ -668,8 +619,7 @@ next_pageblock:
 	 * if the whole pageblock was scanned without isolating any page.
 	 */
 	if (low_pfn == end_pfn)
-		update_pageblock_skip(cc, valid_page, nr_isolated,
-				      set_unsuitable, true);
+		update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
 
@@ -680,15 +630,61 @@ next_pageblock:
 	return low_pfn;
 }
 
+/**
+ * isolate_migratepages_range() - isolate migrate-able pages in a PFN range
+ * @start_pfn: The first PFN to start isolating.
+ * @end_pfn:   The one-past-last PFN.
+ *
+ * Returns zero if isolation fails fatally due to e.g. pending signal.
+ * Otherwise, function returns one-past-the-last PFN of isolated page
+ * (which may be greater than end_pfn if end fell in a middle of a THP page).
+ */
+unsigned long
+isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
+							unsigned long end_pfn)
+{
+	unsigned long pfn, block_end_pfn;
+
+	/* Scan block by block. First and last block may be incomplete */
+	pfn = start_pfn;
+	block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+
+	for (; pfn < end_pfn; pfn = block_end_pfn,
+				block_end_pfn += pageblock_nr_pages) {
+
+		block_end_pfn = min(block_end_pfn, end_pfn);
+
+		/* Skip whole pageblock in case of a memory hole */
+		if (!pfn_valid(pfn))
+			continue;
+
+		pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
+							ISOLATE_UNEVICTABLE);
+
+		/*
+		 * In case of fatal failure, release everything that might
+		 * have been isolated in the previous iteration, and signal
+		 * the failure back to caller.
+                 */
+		if (!pfn) {
+			putback_movable_pages(&cc->migratepages);
+			cc->nr_migratepages = 0;
+			break;
+		}
+	}
+
+	return pfn;
+}
+
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 #ifdef CONFIG_COMPACTION
 /*
  * Based on information in the current compact_control, find blocks
  * suitable for isolating free pages from and then isolate them.
  */
-static void isolate_freepages(struct zone *zone,
-				struct compact_control *cc)
+static void isolate_freepages(struct compact_control *cc)
 {
+	struct zone *zone = cc->zone;
 	struct page *page;
 	unsigned long block_start_pfn;	/* start of current pageblock */
 	unsigned long block_end_pfn;	/* end of current pageblock */
@@ -806,7 +802,7 @@ static struct page *compaction_alloc(struct page *migratepage,
 	 */
 	if (list_empty(&cc->freepages)) {
 		if (!cc->contended)
-			isolate_freepages(cc->zone, cc);
+			isolate_freepages(cc);
 
 		if (list_empty(&cc->freepages))
 			return NULL;
@@ -840,34 +836,81 @@ typedef enum {
 } isolate_migrate_t;
 
 /*
- * Isolate all pages that can be migrated from the block pointed to by
- * the migrate scanner within compact_control.
+ * Isolate all pages that can be migrated from the first suitable block,
+ * starting at the block pointed to by the migrate scanner pfn within
+ * compact_control.
  */
 static isolate_migrate_t isolate_migratepages(struct zone *zone,
 					struct compact_control *cc)
 {
 	unsigned long low_pfn, end_pfn;
+	struct page *page;
+	const isolate_mode_t isolate_mode =
+		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
 
-	/* Do not scan outside zone boundaries */
-	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
+	/*
+	 * Start at where we last stopped, or beginning of the zone as
+	 * initialized by compact_zone()
+	 */
+	low_pfn = cc->migrate_pfn;
 
 	/* Only scan within a pageblock boundary */
 	end_pfn = ALIGN(low_pfn + 1, pageblock_nr_pages);
 
-	/* Do not cross the free scanner or scan within a memory hole */
-	if (end_pfn > cc->free_pfn || !pfn_valid(low_pfn)) {
-		cc->migrate_pfn = end_pfn;
-		return ISOLATE_NONE;
-	}
+	/*
+	 * Iterate over whole pageblocks until we find the first suitable.
+	 * Do not cross the free scanner.
+	 */
+	for (; end_pfn <= cc->free_pfn;
+			low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+
+		/*
+		 * This can potentially iterate a massively long zone with
+		 * many pageblocks unsuitable, so periodically check if we
+		 * need to schedule, or even abort async compaction.
+		 */
+		if (!(low_pfn % (SWAP_CLUSTER_MAX * pageblock_nr_pages))
+						&& compact_should_abort(cc))
+			break;
+
+		/* Skip whole pageblock in case of a memory hole */
+		if (!pfn_valid(low_pfn))
+			continue;
+
+		page = pfn_to_page(low_pfn);
+
+		/* If isolation recently failed, do not retry */
+		if (!isolation_suitable(cc, page))
+			continue;
+
+		/*
+		 * For async compaction, also only scan in MOVABLE blocks.
+		 * Async compaction is optimistic to see if the minimum amount
+		 * of work satisfies the allocation.
+		 */
+		if (cc->mode == MIGRATE_ASYNC &&
+		    !migrate_async_suitable(get_pageblock_migratetype(page)))
+			continue;
+
+		/* Perform the isolation */
+		low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
+								isolate_mode);
+
+		if (!low_pfn || cc->contended)
+			return ISOLATE_ABORT;
 
-	/* Perform the isolation */
-	low_pfn = isolate_migratepages_range(zone, cc, low_pfn, end_pfn, false);
-	if (!low_pfn || cc->contended)
-		return ISOLATE_ABORT;
+		/*
+		 * Either we isolated something and proceed with migration. Or
+		 * we failed and compact_zone should decide if we should
+		 * continue or not.
+		 */
+		break;
+	}
 
+	/* Record where migration scanner will be restarted */
 	cc->migrate_pfn = low_pfn;
 
-	return ISOLATE_SUCCESS;
+	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
 static int compact_finished(struct zone *zone,
@@ -1040,9 +1083,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 			;
 		}
 
-		if (!cc->nr_migratepages)
-			continue;
-
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				compaction_free, (unsigned long)cc, cc->mode,
 				MR_COMPACTION);
diff --git a/mm/internal.h b/mm/internal.h
index a1b651b..5a0738f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -154,8 +154,8 @@ unsigned long
 isolate_freepages_range(struct compact_control *cc,
 			unsigned long start_pfn, unsigned long end_pfn);
 unsigned long
-isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
-	unsigned long low_pfn, unsigned long end_pfn, bool unevictable);
+isolate_migratepages_range(struct compact_control *cc,
+			   unsigned long low_pfn, unsigned long end_pfn);
 
 #endif
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 69a14b7..f16aa37 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6282,8 +6282,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		if (list_empty(&cc->migratepages)) {
 			cc->nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc->zone, cc,
-							 pfn, end, true);
+			pfn = isolate_migratepages_range(cc, pfn, end);
 			if (!pfn) {
 				ret = -EINTR;
 				break;
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (4 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range() Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:29   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched() Vlastimil Babka
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Christoph Lameter, Rik van Riel,
	Naoya Horiguchi, Zhang Yanfei

The unification of the migrate and free scanner families of function has
highlighted a difference in how the scanners ensure they only isolate pages
of the intended zone. This is important for taking zone lock or lru lock of
the correct zone. Due to nodes overlapping, it is however possible to
encounter a different zone within the range of the zone being compacted.

The free scanner, since its inception by commit 748446bb6b ("mm: compaction:
memory compaction core"), has been checking the zone of the first valid page
in a pageblock, and skipping the whole pageblock if the zone does not match.

This checking was completely missing from the migration scanner at first, and
later added by commit dc9086004b ("mm: compaction: check for overlapping
nodes during isolation for migration") in a reaction to a bug report.
But the zone comparison in migration scanner is done once per a single scanned
page, which is more defensive and thus more costly than a check per pageblock.

This patch unifies the checking done in both scanners to once per pageblock,
through a new pageblock_within_zone() function, which also includes pfn_valid()
checks. It is more defensive than the current free scanner checks, as it checks
both the first and last page of the pageblock, but less defensive by the
migration scanner per-page checks. It assumes that node overlapping may result
(on some architecture) in a boundary between two nodes falling into the middle
of a pageblock, but that there cannot be a node0 node1 node0 interleaving
within a single pageblock.

The result is more code being shared and a bit less per-page CPU cost in the
migration scanner.

Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 91 ++++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 57 insertions(+), 34 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 28e48ea..9cff804 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -67,6 +67,49 @@ static inline bool migrate_async_suitable(int migratetype)
 	return is_migrate_cma(migratetype) || migratetype == MIGRATE_MOVABLE;
 }
 
+/*
+ * Check that the whole (or subset of) a pageblock given by the interval of
+ * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
+ * with the migration of free compaction scanner. The scanners then need to
+ * use only pfn_valid_within() check for arches that allow holes within
+ * pageblocks.
+ *
+ * Return struct page pointer of start_pfn, or NULL if checks were not passed.
+ *
+ * It's possible on some configurations to have a setup like node0 node1 node0
+ * i.e. it's possible that all pages within a zones range of pages do not
+ * belong to a single zone. We assume that a border between node0 and node1
+ * can occur within a single pageblock, but not a node0 node1 node0
+ * interleaving within a single pageblock. It is therefore sufficient to check
+ * the first and last page of a pageblock and avoid checking each individual
+ * page in a pageblock.
+ */
+static struct page * pageblock_within_zone(unsigned long start_pfn,
+				unsigned long end_pfn, struct zone *zone)
+{
+	struct page *start_page;
+	struct page *end_page;
+
+	/* end_pfn is one past the range we are checking */
+	end_pfn--;
+
+	if (!pfn_valid(start_pfn) || !pfn_valid(end_pfn))
+		return NULL;
+
+	start_page = pfn_to_page(start_pfn);
+
+	if (page_zone(start_page) != zone)
+		return NULL;
+
+	end_page = pfn_to_page(end_pfn);
+
+	/* This gives a shorter code than deriving page_zone(end_page) */
+	if (page_zone_id(start_page) != page_zone_id(end_page))
+		return NULL;
+
+	return start_page;
+}
+
 #ifdef CONFIG_COMPACTION
 /* Returns true if the pageblock should be scanned for pages to isolate. */
 static inline bool isolation_suitable(struct compact_control *cc,
@@ -368,17 +411,17 @@ isolate_freepages_range(struct compact_control *cc,
 	unsigned long isolated, pfn, block_end_pfn;
 	LIST_HEAD(freelist);
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn += isolated) {
-		if (!pfn_valid(pfn) || cc->zone != page_zone(pfn_to_page(pfn)))
-			break;
+	pfn = start_pfn;
+	block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
+
+	for (; pfn < end_pfn; pfn += isolated,
+				block_end_pfn += pageblock_nr_pages) {
 
-		/*
-		 * On subsequent iterations ALIGN() is actually not needed,
-		 * but we keep it that we not to complicate the code.
-		 */
-		block_end_pfn = ALIGN(pfn + 1, pageblock_nr_pages);
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
+		if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
+			break;
+
 		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
 						   &freelist, true);
 
@@ -507,15 +550,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			continue;
 		nr_scanned++;
 
-		/*
-		 * Get the page and ensure the page is within the same zone.
-		 * See the comment in isolate_freepages about overlapping
-		 * nodes. It is deliberate that the new zone lock is not taken
-		 * as memory compaction should not move pages between nodes.
-		 */
 		page = pfn_to_page(low_pfn);
-		if (page_zone(page) != zone)
-			continue;
 
 		if (!valid_page)
 			valid_page = page;
@@ -654,8 +689,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		/* Skip whole pageblock in case of a memory hole */
-		if (!pfn_valid(pfn))
+		if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
 			continue;
 
 		pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
@@ -727,18 +761,9 @@ static void isolate_freepages(struct compact_control *cc)
 						&& compact_should_abort(cc))
 			break;
 
-		if (!pfn_valid(block_start_pfn))
-			continue;
-
-		/*
-		 * Check for overlapping nodes/zones. It's possible on some
-		 * configurations to have a setup like
-		 * node0 node1 node0
-		 * i.e. it's possible that all pages within a zones range of
-		 * pages do not belong to a single zone.
-		 */
-		page = pfn_to_page(block_start_pfn);
-		if (page_zone(page) != zone)
+		page = pageblock_within_zone(block_start_pfn, block_end_pfn,
+									zone);
+		if (!page)
 			continue;
 
 		/* Check the block is suitable for migration */
@@ -873,12 +898,10 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 						&& compact_should_abort(cc))
 			break;
 
-		/* Skip whole pageblock in case of a memory hole */
-		if (!pfn_valid(low_pfn))
+		page = pageblock_within_zone(low_pfn, end_pfn, zone);
+		if (!page)
 			continue;
 
-		page = pfn_to_page(low_pfn);
-
 		/* If isolation recently failed, do not retry */
 		if (!isolation_suitable(cc, page))
 			continue;
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched()
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (5 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:31   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

Async compaction aborts when it detects zone lock contention or need_resched()
is true. David Rientjes has reported that in practice, most direct async
compactions for THP allocation abort due to need_resched(). This means that a
second direct compaction is never attempted, which might be OK for a page
fault, but khugepaged is intended to attempt a sync compaction in such case and
in these cases it won't.

This patch replaces "bool contended" in compact_control with an int that
distinguieshes between aborting due to need_resched() and aborting due to lock
contention. This allows propagating the abort through all compaction functions
as before, but passing the abort reason up to __alloc_pages_slowpath() which
decides when to continue with direct reclaim and another compaction attempt.

Another problem is that try_to_compact_pages() did not act upon the reported
contention (both need_resched() or lock contention) immediately and would
proceed with another zone from the zonelist. When need_resched() is true, that
means initializing another zone compaction, only to check again need_resched()
in isolate_migratepages() and aborting. For zone lock contention, the
unintended consequence is that the lock contended status reported back to the
allocator is detrmined from the last zone where compaction was attempted, which
is rather arbitrary.

This patch fixes the problem in the following way:
- async compaction of a zone aborting due to need_resched() or fatal signal
  pending means that further zones should not be tried. We report
  COMPACT_CONTENDED_SCHED to the allocator.
- aborting zone compaction due to lock contention means we can still try
  another zone, since it has different set of locks. We report back
  COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
  it was aborted due to lock contention.

As a result of these fixes, khugepaged will proceed with second sync compaction
as intended, when the preceding async compaction aborted due to need_resched().
Page fault compactions aborting due to need_resched() will spare some cycles
previously wasted by initializing another zone compaction only to abort again.
Lock contention will be reported only when compaction in all zones aborted due
to lock contention, and therefore it's not a good idea to try again after
reclaim.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/compaction.h | 12 +++++--
 mm/compaction.c            | 89 +++++++++++++++++++++++++++++++++++++++-------
 mm/internal.h              |  4 +--
 mm/page_alloc.c            | 43 ++++++++++++++++------
 4 files changed, 121 insertions(+), 27 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b2e4c92..60bdf8d 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -13,6 +13,14 @@
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	4
 
+/* Used to signal whether compaction detected need_sched() or lock contention */
+/* No contention detected */
+#define COMPACT_CONTENDED_NONE	0
+/* Either need_sched() was true or fatal signal pending */
+#define COMPACT_CONTENDED_SCHED	1
+/* Zone lock or lru_lock was contended in async compaction */
+#define COMPACT_CONTENDED_LOCK	2
+
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
@@ -24,7 +32,7 @@ extern int sysctl_extfrag_handler(struct ctl_table *table, int write,
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
-			enum migrate_mode mode, bool *contended,
+			enum migrate_mode mode, int *contended,
 			struct zone **candidate_zone);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
@@ -94,7 +102,7 @@ static inline bool compaction_restarting(struct zone *zone, int order)
 #else
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended,
+			enum migrate_mode mode, int *contended,
 			struct zone **candidate_zone)
 {
 	return COMPACT_CONTINUE;
diff --git a/mm/compaction.c b/mm/compaction.c
index 9cff804..22648cc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -223,9 +223,21 @@ static void update_pageblock_skip(struct compact_control *cc,
 }
 #endif /* CONFIG_COMPACTION */
 
-static inline bool should_release_lock(spinlock_t *lock)
+static int should_release_lock(spinlock_t *lock)
 {
-	return need_resched() || spin_is_contended(lock);
+	/*
+	 * Sched contention has higher priority here as we may potentially
+	 * have to abort whole compaction ASAP. Returning with lock contention
+	 * means we will try another zone, and further decisions are
+	 * influenced only when all zones are lock contended. That means
+	 * potentially missing a lock contention is less critical.
+	 */
+	if (need_resched())
+		return COMPACT_CONTENDED_SCHED;
+	else if (spin_is_contended(lock))
+		return COMPACT_CONTENDED_LOCK;
+	else
+		return COMPACT_CONTENDED_NONE;
 }
 
 /*
@@ -240,7 +252,9 @@ static inline bool should_release_lock(spinlock_t *lock)
 static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 				      bool locked, struct compact_control *cc)
 {
-	if (should_release_lock(lock)) {
+	int contended = should_release_lock(lock);
+
+	if (contended) {
 		if (locked) {
 			spin_unlock_irqrestore(lock, *flags);
 			locked = false;
@@ -248,7 +262,7 @@ static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
 
 		/* async aborts if taking too long or contended */
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = true;
+			cc->contended = contended;
 			return false;
 		}
 
@@ -274,7 +288,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
 	/* async compaction aborts if contended */
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = true;
+			cc->contended = COMPACT_CONTENDED_SCHED;
 			return true;
 		}
 
@@ -1139,7 +1153,7 @@ out:
 }
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, bool *contended)
+		gfp_t gfp_mask, enum migrate_mode mode, int *contended)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1154,11 +1168,11 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 	INIT_LIST_HEAD(&cc.migratepages);
 
 	ret = compact_zone(zone, &cc);
+	*contended = cc.contended;
 
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*contended = cc.contended;
 	return ret;
 }
 
@@ -1171,14 +1185,15 @@ int sysctl_extfrag_threshold = 500;
  * @gfp_mask: The GFP mask of the current allocation
  * @nodemask: The allowed nodes to allocate from
  * @mode: The migration mode for async, sync light, or sync migration
- * @contended: Return value that is true if compaction was aborted due to lock contention
+ * @contended: Return value that determines if compaction was aborted due to 
+ *	       need_resched() or lock contention
  * @candidate_zone: Return the zone where we think allocation should succeed
  *
  * This is the main entry point for direct page compaction.
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
-			enum migrate_mode mode, bool *contended,
+			enum migrate_mode mode, int *contended,
 			struct zone **candidate_zone)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
@@ -1188,6 +1203,9 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	struct zone *zone;
 	int rc = COMPACT_DEFERRED;
 	int alloc_flags = 0;
+	bool all_zones_lock_contended = true; /* init true for &= operation */
+
+	*contended = COMPACT_CONTENDED_NONE;
 
 	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
@@ -1201,13 +1219,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 	for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
 								nodemask) {
 		int status;
+		int zone_contended;
 
 		if (compaction_deferred(zone, order))
 			continue;
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
-						contended);
+							&zone_contended);
 		rc = max(status, rc);
+		/*
+		 * It takes at least one zone that wasn't lock contended
+		 * to turn all_zones_lock_contended to false.
+		 */
+		all_zones_lock_contended &=
+			(zone_contended == COMPACT_CONTENDED_LOCK);
 
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
@@ -1220,8 +1245,20 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			 * succeeds in this zone.
 			 */
 			compaction_defer_reset(zone, order, false);
-			break;
-		} else if (mode != MIGRATE_ASYNC) {
+			/*
+			 * It is possible that async compaction aborted due to
+			 * need_resched() and the watermarks were ok thanks to
+			 * somebody else freeing memory. The allocation can
+			 * however still fail so we better signal the
+			 * need_resched() contention anyway.
+			 */
+			if (zone_contended == COMPACT_CONTENDED_SCHED)
+				*contended = COMPACT_CONTENDED_SCHED;
+
+			goto break_loop;
+		}
+
+		if (mode != MIGRATE_ASYNC) {
 			/*
 			 * We think that allocation won't succeed in this zone
 			 * so we defer compaction there. If it ends up
@@ -1229,8 +1266,36 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			 */
 			defer_compaction(zone, order);
 		}
+
+		/*
+		 * We might have stopped compacting due to need_resched() in
+		 * async compaction, or due to a fatal signal detected. In that
+		 * case do not try further zones and signal need_resched()
+		 * contention.
+		 */
+		if ((zone_contended == COMPACT_CONTENDED_SCHED)
+					|| fatal_signal_pending(current)) {
+			*contended = COMPACT_CONTENDED_SCHED;
+			goto break_loop;
+		}
+
+		continue;
+break_loop:
+		/*
+		 * We might not have tried all the zones, so  be conservative
+		 * and assume they are not all lock contended.
+		 */
+		all_zones_lock_contended = false;
+		break;
 	}
 
+	/*
+	 * If at least one zone wasn't deferred or skipped, we report if all
+	 * zones that were tried were contended.
+	 */
+	if (rc > COMPACT_SKIPPED && all_zones_lock_contended)
+		*contended = COMPACT_CONTENDED_LOCK;
+
 	return rc;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 5a0738f..4c1d604 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -144,8 +144,8 @@ struct compact_control {
 	int order;			/* order a direct compactor needs */
 	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
 	struct zone *zone;
-	bool contended;			/* True if a lock was contended, or
-					 * need_resched() true during async
+	int contended;			/* Signal need_sched() or lock
+					 * contention detected during
 					 * compaction
 					 */
 };
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f16aa37..fb2988d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2247,7 +2247,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
 	nodemask_t *nodemask, int alloc_flags, struct zone *preferred_zone,
 	int classzone_idx, int migratetype, enum migrate_mode mode,
-	bool *contended_compaction, bool *deferred_compaction,
+	int *contended_compaction, bool *deferred_compaction,
 	unsigned long *did_some_progress)
 {
 	struct zone *last_compact_zone = NULL;
@@ -2520,7 +2520,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	unsigned long did_some_progress;
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
-	bool contended_compaction = false;
+	int contended_compaction = COMPACT_CONTENDED_NONE;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2633,15 +2633,36 @@ rebalance:
 	if (!(gfp_mask & __GFP_NO_KSWAPD) || (current->flags & PF_KTHREAD))
 		migration_mode = MIGRATE_SYNC_LIGHT;
 
-	/*
-	 * If compaction is deferred for high-order allocations, it is because
-	 * sync compaction recently failed. In this is the case and the caller
-	 * requested a movable allocation that does not heavily disrupt the
-	 * system then fail the allocation instead of entering direct reclaim.
-	 */
-	if ((deferred_compaction || contended_compaction) &&
-						(gfp_mask & __GFP_NO_KSWAPD))
-		goto nopage;
+	/* Checks for THP-specific high-order allocations */
+	if (gfp_mask & __GFP_NO_KSWAPD) {
+		/*
+		 * If compaction is deferred for high-order allocations, it is
+		 * because sync compaction recently failed. If this is the case
+		 * and the caller requested a THP allocation, we do not want
+		 * to heavily disrupt the system, so we fail the allocation
+		 * instead of entering direct reclaim.
+		 */
+		if (deferred_compaction)
+			goto nopage;
+
+		/*
+		 * In all zones where compaction was attempted (and not
+		 * deferred or skipped), lock contention has been detected.
+		 * For THP allocation we do not want to disrupt the others
+		 * so we fallback to base pages instead.
+		 */
+		if (contended_compaction == COMPACT_CONTENDED_LOCK)
+			goto nopage;
+
+		/*
+		 * If compaction was aborted due to need_resched(), we do not
+		 * want to further increase allocation latency, unless it is
+		 * khugepaged trying to collapse.
+		 */
+		if (contended_compaction == COMPACT_CONTENDED_SCHED
+			&& !(current->flags & PF_KTHREAD))
+			goto nopage;
+	}
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (6 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched() Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:32   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel, Joonsoo Kim,
	Minchan Kim, Zhang Yanfei

Compaction scanners regularly check for lock contention and need_resched()
through the compact_checklock_irqsave() function. However, if there is no
contention, the lock can be held and IRQ disabled for potentially long time.

This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
time IRQs are disabled while isolating pages for migration") for the migration
scanner. However, the refactoring done by commit 2a1402aa04 ("mm: compaction:
acquire the zone->lru_lock as late as possible") has changed the conditions so
that the lock is dropped only when there's contention on the lock or
need_resched() is true. Also, need_resched() is checked only when the lock is
already held. The comment "give a chance to irqs before checking need_resched"
is therefore misleading, as IRQs remain disabled when the check is done.

This patch restores the behavior intended by commit b2eef8c0d0 and also tries
to better balance and make more deterministic the time spent by checking for
contention vs the time the scanners might run between the checks. It also
avoids situations where checking has not been done often enough before. The
result should be avoiding both too frequent and too infrequent contention
checking, and especially the potentially long-running scans with IRQs disabled
and no checking of need_resched() or for fatal signal pending, which can happen
when many consecutive pages or pageblocks fail the preliminary tests and do not
reach the later call site to compact_checklock_irqsave(), as explained below.

Before the patch:

In the migration scanner, compact_checklock_irqsave() was called each loop, if
reached. If not reached, some lower-frequency checking could still be done if
the lock was already held, but this would not result in aborting contended
async compaction until reaching compact_checklock_irqsave() or end of
pageblock. In the free scanner, it was similar but completely without the
periodical checking, so lock can be potentially held until reaching the end of
pageblock.

After the patch, in both scanners:

The periodical check is done as the first thing in the loop on each
SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
function, which always unlocks the lock (if locked) and aborts async compaction
if scheduling is needed. It also aborts any type of compaction when a fatal
signal is pending.

The compact_checklock_irqsave() function is replaced with a slightly different
compact_trylock_irqsave(). The biggest difference is that the function is not
called at all if the lock is already held. The periodical need_resched()
checking is left solely to compact_unlock_should_abort(). The lock contention
avoidance for async compaction is achieved by the periodical unlock by
compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
and aborting when trylock fails. Sync compaction does not use trylock.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 121 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 73 insertions(+), 48 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 22648cc..b213e0ba 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -223,61 +223,72 @@ static void update_pageblock_skip(struct compact_control *cc,
 }
 #endif /* CONFIG_COMPACTION */
 
-static int should_release_lock(spinlock_t *lock)
+/*
+ * Compaction requires the taking of some coarse locks that are potentially
+ * very heavily contended. For async compaction, back out if the lock cannot
+ * be taken immediately. For sync compaction, spin on the lock if needed.
+ *
+ * Returns true if the lock is held
+ * Returns false if the lock is not held and compaction should abort
+ */
+static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
+						struct compact_control *cc)
 {
-	/*
-	 * Sched contention has higher priority here as we may potentially
-	 * have to abort whole compaction ASAP. Returning with lock contention
-	 * means we will try another zone, and further decisions are
-	 * influenced only when all zones are lock contended. That means
-	 * potentially missing a lock contention is less critical.
-	 */
-	if (need_resched())
-		return COMPACT_CONTENDED_SCHED;
-	else if (spin_is_contended(lock))
-		return COMPACT_CONTENDED_LOCK;
-	else
-		return COMPACT_CONTENDED_NONE;
+	if (cc->mode == MIGRATE_ASYNC) {
+		if (!spin_trylock_irqsave(lock, *flags)) {
+			cc->contended = COMPACT_CONTENDED_LOCK;
+			return false;
+		}
+	} else {
+		spin_lock_irqsave(lock, *flags);
+	}
+
+	return true;
 }
 
 /*
  * Compaction requires the taking of some coarse locks that are potentially
- * very heavily contended. Check if the process needs to be scheduled or
- * if the lock is contended. For async compaction, back out in the event
- * if contention is severe. For sync compaction, schedule.
+ * very heavily contended. The lock should be periodically unlocked to avoid
+ * having disabled IRQs for a long time, even when there is nobody waiting on
+ * the lock. It might also be that allowing the IRQs will result in
+ * need_resched() becoming true. If scheduling is needed, async compaction
+ * aborts. Sync compaction schedules.
+ * Either compaction type will also abort if a fatal signal is pending.
+ * In either case if the lock was locked, it is dropped and not regained.
  *
- * Returns true if the lock is held.
- * Returns false if the lock is released and compaction should abort
+ * Returns true if compaction should abort due to fatal signal pending, or
+ *		async compaction due to need_resched()
+ * Returns false when compaction can continue (sync compaction might have
+ *		scheduled)
  */
-static bool compact_checklock_irqsave(spinlock_t *lock, unsigned long *flags,
-				      bool locked, struct compact_control *cc)
+static bool compact_unlock_should_abort(spinlock_t *lock,
+		unsigned long flags, bool *locked, struct compact_control *cc)
 {
-	int contended = should_release_lock(lock);
+	if (*locked) {
+		spin_unlock_irqrestore(lock, flags);
+		*locked = false;
+	}
 
-	if (contended) {
-		if (locked) {
-			spin_unlock_irqrestore(lock, *flags);
-			locked = false;
-		}
+	if (fatal_signal_pending(current)) {
+		cc->contended = COMPACT_CONTENDED_SCHED;
+		return true;
+	}
 
-		/* async aborts if taking too long or contended */
+	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = contended;
-			return false;
+			cc->contended = COMPACT_CONTENDED_SCHED;
+			return true;
 		}
-
 		cond_resched();
 	}
 
-	if (!locked)
-		spin_lock_irqsave(lock, *flags);
-	return true;
+	return false;
 }
 
 /*
  * Aside from avoiding lock contention, compaction also periodically checks
  * need_resched() and either schedules in sync compaction or aborts async
- * compaction. This is similar to what compact_checklock_irqsave() does, but
+ * compaction. This is similar to what compact_unlock_should_abort() does, but
  * is used where no lock is concerned.
  *
  * Returns false when no scheduling was needed, or sync compaction scheduled.
@@ -336,6 +347,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		int isolated, i;
 		struct page *page = cursor;
 
+		/*
+		 * Periodically drop the lock (if held) regardless of its
+		 * contention, to give chance to IRQs. Abort async compaction
+		 * if contended.
+		 */
+		if (!(blockpfn % SWAP_CLUSTER_MAX)
+		    && compact_unlock_should_abort(&cc->zone->lock, flags,
+								&locked, cc))
+			break;
+
 		nr_scanned++;
 		if (!pfn_valid_within(blockpfn))
 			goto isolate_fail;
@@ -353,8 +374,9 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		 * spin on the lock and we acquire the lock as late as
 		 * possible.
 		 */
-		locked = compact_checklock_irqsave(&cc->zone->lock, &flags,
-								locked, cc);
+		if (!locked)
+			locked = compact_trylock_irqsave(&cc->zone->lock,
+								&flags, cc);
 		if (!locked)
 			break;
 
@@ -552,13 +574,15 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 	/* Time to isolate some pages for migration */
 	for (; low_pfn < end_pfn; low_pfn++) {
-		/* give a chance to irqs before checking need_resched() */
-		if (locked && !(low_pfn % SWAP_CLUSTER_MAX)) {
-			if (should_release_lock(&zone->lru_lock)) {
-				spin_unlock_irqrestore(&zone->lru_lock, flags);
-				locked = false;
-			}
-		}
+		/*
+		 * Periodically drop the lock (if held) regardless of its
+		 * contention, to give chance to IRQs. Abort async compaction
+		 * if contended.
+		 */
+		if (!(low_pfn % SWAP_CLUSTER_MAX)
+		    && compact_unlock_should_abort(&zone->lru_lock, flags,
+								&locked, cc))
+			break;
 
 		if (!pfn_valid_within(low_pfn))
 			continue;
@@ -620,10 +644,11 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		    page_count(page) > page_mapcount(page))
 			continue;
 
-		/* Check if it is ok to still hold the lock */
-		locked = compact_checklock_irqsave(&zone->lru_lock, &flags,
-								locked, cc);
-		if (!locked || fatal_signal_pending(current))
+		/* If the lock is not held, try to take it */
+		if (!locked)
+			locked = compact_trylock_irqsave(&zone->lru_lock,
+								&flags, cc);
+		if (!locked)
 			break;
 
 		/* Recheck PageLRU and PageTransHuge under lock */
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (7 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:34   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Michal Nazarewicz,
	Naoya Horiguchi, Christoph Lameter, Rik van Riel, Joonsoo Kim,
	Minchan Kim, Zhang Yanfei

Compaction scanners try to lock zone locks as late as possible by checking
many page or pageblock properties opportunistically without lock and skipping
them if not unsuitable. For pages that pass the initial checks, some properties
have to be checked again safely under lock. However, if the lock was already
held from a previous iteration in the initial checks, the rechecks are
unnecessary.

This patch therefore skips the rechecks when the lock was already held. This is
now possible to do, since we don't (potentially) drop and reacquire the lock
between the initial checks and the safe rechecks anymore.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 53 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b213e0ba..2b42397 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -367,22 +367,30 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 			goto isolate_fail;
 
 		/*
-		 * The zone lock must be held to isolate freepages.
-		 * Unfortunately this is a very coarse lock and can be
-		 * heavily contended if there are parallel allocations
-		 * or parallel compactions. For async compaction do not
-		 * spin on the lock and we acquire the lock as late as
-		 * possible.
+		 * If we already hold the lock, we can skip some rechecking.
+		 * Note that if we hold the lock now, checked_pageblock was
+		 * already set in some previous iteration (or strict is true),
+		 * so it is correct to skip the suitable migration target
+		 * recheck as well.
 		 */
-		if (!locked)
+		if (!locked) {
+			/*
+			 * The zone lock must be held to isolate freepages.
+			 * Unfortunately this is a very coarse lock and can be
+			 * heavily contended if there are parallel allocations
+			 * or parallel compactions. For async compaction do not
+			 * spin on the lock and we acquire the lock as late as
+			 * possible.
+			 */
 			locked = compact_trylock_irqsave(&cc->zone->lock,
 								&flags, cc);
-		if (!locked)
-			break;
+			if (!locked)
+				break;
 
-		/* Recheck this is a buddy page under lock */
-		if (!PageBuddy(page))
-			goto isolate_fail;
+			/* Recheck this is a buddy page under lock */
+			if (!PageBuddy(page))
+				goto isolate_fail;
+		}
 
 		/* Found a free page, break it into order-0 pages */
 		isolated = split_free_page(page);
@@ -644,19 +652,20 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		    page_count(page) > page_mapcount(page))
 			continue;
 
-		/* If the lock is not held, try to take it */
-		if (!locked)
+		/* If we already hold the lock, we can skip some rechecking */
+		if (!locked) {
 			locked = compact_trylock_irqsave(&zone->lru_lock,
 								&flags, cc);
-		if (!locked)
-			break;
+			if (!locked)
+				break;
 
-		/* Recheck PageLRU and PageTransHuge under lock */
-		if (!PageLRU(page))
-			continue;
-		if (PageTransHuge(page)) {
-			low_pfn += (1 << compound_order(page)) - 1;
-			continue;
+			/* Recheck PageLRU and PageTransHuge under lock */
+			if (!PageLRU(page))
+				continue;
+			if (PageTransHuge(page)) {
+				low_pfn += (1 << compound_order(page)) - 1;
+				continue;
+			}
 		}
 
 		lruvec = mem_cgroup_page_lruvec(page, zone);
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (8 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:35   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Zhang Yanfei, Minchan Kim

Unlike the migration scanner, the free scanner remembers the beginning of the
last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
uselessly when called several times during single compaction. This might have
been useful when pages were returned to the buddy allocator after a failed
migration, but this is no longer the case.

This patch changes the meaning of cc->free_pfn so that if it points to a
middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
end. isolate_freepages_block() will record the pfn of the last page it looked
at, which is then used to update cc->free_pfn.

In the mmtests stress-highalloc benchmark, this has resulted in lowering the
ratio between pages scanned by both scanners, from 2.5 free pages per migrate
page, to 2.25 free pages per migrate page, without affecting success rates.

With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio (2.1
instead of 1.8), but page migration successes increased by 10%, so this could
mean that more useful work can be done until need_resched() aborts this kind
of compaction.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
---
 mm/compaction.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 2b42397..d256c14 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -330,7 +330,7 @@ static bool suitable_migration_target(struct page *page)
  * (even though it may still end up isolating some pages).
  */
 static unsigned long isolate_freepages_block(struct compact_control *cc,
-				unsigned long blockpfn,
+				unsigned long *start_pfn,
 				unsigned long end_pfn,
 				struct list_head *freelist,
 				bool strict)
@@ -339,6 +339,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	struct page *cursor, *valid_page = NULL;
 	unsigned long flags;
 	bool locked = false;
+	unsigned long blockpfn = *start_pfn;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -412,6 +413,9 @@ isolate_fail:
 			break;
 	}
 
+	/* Record how far we have got within the block */
+	*start_pfn = blockpfn;
+
 	trace_mm_compaction_isolate_freepages(nr_scanned, total_isolated);
 
 	/*
@@ -460,14 +464,13 @@ isolate_freepages_range(struct compact_control *cc,
 
 	for (; pfn < end_pfn; pfn += isolated,
 				block_end_pfn += pageblock_nr_pages) {
+		/* Protect pfn from changing by isolate_freepages_block */
+		unsigned long isolate_start_pfn = pfn;
 
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
-		if (!pageblock_within_zone(pfn, block_end_pfn, cc->zone))
-			break;
-
-		isolated = isolate_freepages_block(cc, pfn, block_end_pfn,
-						   &freelist, true);
+		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
+						block_end_pfn, &freelist, true);
 
 		/*
 		 * In strict mode, isolate_freepages_block() returns 0 if
@@ -769,6 +772,7 @@ static void isolate_freepages(struct compact_control *cc)
 	struct zone *zone = cc->zone;
 	struct page *page;
 	unsigned long block_start_pfn;	/* start of current pageblock */
+	unsigned long isolate_start_pfn; /* exact pfn we start at */
 	unsigned long block_end_pfn;	/* end of current pageblock */
 	unsigned long low_pfn;	     /* lowest pfn scanner is able to scan */
 	int nr_freepages = cc->nr_freepages;
@@ -777,14 +781,15 @@ static void isolate_freepages(struct compact_control *cc)
 	/*
 	 * Initialise the free scanner. The starting point is where we last
 	 * successfully isolated from, zone-cached value, or the end of the
-	 * zone when isolating for the first time. We need this aligned to
-	 * the pageblock boundary, because we do
+	 * zone when isolating for the first time. For looping we also need
+	 * this pfn aligned down to the pageblock boundary, because we do
 	 * block_start_pfn -= pageblock_nr_pages in the for loop.
 	 * For ending point, take care when isolating in last pageblock of a
 	 * a zone which ends in the middle of a pageblock.
 	 * The low boundary is the end of the pageblock the migration scanner
 	 * is using.
 	 */
+	isolate_start_pfn = cc->free_pfn;
 	block_start_pfn = cc->free_pfn & ~(pageblock_nr_pages-1);
 	block_end_pfn = min(block_start_pfn + pageblock_nr_pages,
 						zone_end_pfn(zone));
@@ -797,7 +802,8 @@ static void isolate_freepages(struct compact_control *cc)
 	 */
 	for (; block_start_pfn >= low_pfn && cc->nr_migratepages > nr_freepages;
 				block_end_pfn = block_start_pfn,
-				block_start_pfn -= pageblock_nr_pages) {
+				block_start_pfn -= pageblock_nr_pages,
+				isolate_start_pfn = block_start_pfn) {
 		unsigned long isolated;
 
 		/*
@@ -822,13 +828,25 @@ static void isolate_freepages(struct compact_control *cc)
 		if (!isolation_suitable(cc, page))
 			continue;
 
-		/* Found a block suitable for isolating free pages from */
-		cc->free_pfn = block_start_pfn;
-		isolated = isolate_freepages_block(cc, block_start_pfn,
+		/* Found a block suitable for isolating free pages from. */
+		isolated = isolate_freepages_block(cc, &isolate_start_pfn,
 					block_end_pfn, freelist, false);
 		nr_freepages += isolated;
 
 		/*
+		 * Remember where the free scanner should restart next time,
+		 * which is where isolate_freepages_block() left off.
+		 * But if it scanned the whole pageblock, isolate_start_pfn
+		 * now points at block_end_pfn, which is the start of the next
+		 * pageblock.
+		 * In that case we will however want to restart at the start
+		 * of the previous pageblock.
+		 */
+		cc->free_pfn = (isolate_start_pfn < block_end_pfn) ?
+				isolate_start_pfn :
+				block_start_pfn - pageblock_nr_pages;
+
+		/*
 		 * Set a flag that we successfully isolated in this pageblock.
 		 * In the next loop iteration, zone->compact_cached_free_pfn
 		 * will not be updated and thus it will effectively contain the
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (9 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:36   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Minchan Kim, Zhang Yanfei

The migration scanner skips PageBuddy pages, but does not consider their order
as checking page_order() is generally unsafe without holding the zone->lock,
and acquiring the lock just for the check wouldn't be a good tradeoff.

Still, this could avoid some iterations over the rest of the buddy page, and
if we are careful, the race window between PageBuddy() check and page_order()
is small, and the worst thing that can happen is that we skip too much and miss
some isolation candidates. This is not that bad, as compaction can already fail
for many other reasons like parallel allocations, and those have much larger
race window.

This patch therefore makes the migration scanner obtain the buddy page order
and use it to skip the whole buddy page, if the order appears to be in the
valid range.

It's important that the page_order() is read only once, so that the value used
in the checks and in the pfn calculation is the same. But in theory the
compiler can replace the local variable by multiple inlines of page_order().
Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
prevent this.

Testing with stress-highalloc from mmtests shows a 15% reduction in number of
pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD
allocations, along with success rates better by few percent.
This change is also a prerequisite for a later patch which is detecting when
a cc->order block of pages contains non-buddy pages that cannot be isolated,
and the scanner should thus skip to the next block immediately.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 36 +++++++++++++++++++++++++++++++-----
 mm/internal.h   | 16 +++++++++++++++-
 2 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index d256c14..3cfc3a9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -313,8 +313,15 @@ static inline bool compact_should_abort(struct compact_control *cc)
 static bool suitable_migration_target(struct page *page)
 {
 	/* If the page is a large free page, then disallow migration */
-	if (PageBuddy(page) && page_order(page) >= pageblock_order)
-		return false;
+	if (PageBuddy(page)) {
+		/*
+		 * We are checking page_order without zone->lock taken. But
+		 * the only small danger is that we skip a potentially suitable
+		 * pageblock, so it's not worth to check order for valid range.
+		 */
+		if (page_order_unsafe(page) >= pageblock_order)
+			return false;
+	}
 
 	/* If the block is MIGRATE_MOVABLE or MIGRATE_CMA, allow migration */
 	if (migrate_async_suitable(get_pageblock_migratetype(page)))
@@ -605,11 +612,23 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			valid_page = page;
 
 		/*
-		 * Skip if free. page_order cannot be used without zone->lock
-		 * as nothing prevents parallel allocations or buddy merging.
+		 * Skip if free. We read page order here without zone lock
+		 * which is generally unsafe, but the race window is small and
+		 * the worst thing that can happen is that we skip some
+		 * potential isolation targets.
 		 */
-		if (PageBuddy(page))
+		if (PageBuddy(page)) {
+			unsigned long freepage_order = page_order_unsafe(page);
+
+			/*
+			 * Without lock, we cannot be sure that what we got is
+			 * a valid page order. Consider only values in the
+			 * valid order range to prevent low_pfn overflow.
+			 */
+			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+				low_pfn += (1UL << freepage_order) - 1;
 			continue;
+		}
 
 		/*
 		 * Check may be lockless but that's ok as we recheck later.
@@ -695,6 +714,13 @@ isolate_success:
 		}
 	}
 
+	/*
+	 * The PageBuddy() check could have potentially brought us outside
+	 * the range to be scanned.
+	 */
+	if (unlikely(low_pfn > end_pfn))
+		low_pfn = end_pfn;
+
 	acct_isolated(zone, locked, cc);
 
 	if (locked)
diff --git a/mm/internal.h b/mm/internal.h
index 4c1d604..86ae964 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -164,7 +164,8 @@ isolate_migratepages_range(struct compact_control *cc,
  * general, page_zone(page)->lock must be held by the caller to prevent the
  * page from being allocated in parallel and returning garbage as the order.
  * If a caller does not hold page_zone(page)->lock, it must guarantee that the
- * page cannot be allocated or merged in parallel.
+ * page cannot be allocated or merged in parallel. Alternatively, it must
+ * handle invalid values gracefully, and use page_order_unsafe() below.
  */
 static inline unsigned long page_order(struct page *page)
 {
@@ -172,6 +173,19 @@ static inline unsigned long page_order(struct page *page)
 	return page_private(page);
 }
 
+/*
+ * Like page_order(), but for callers who cannot afford to hold the zone lock.
+ * PageBuddy() should be checked first by the caller to minimize race window,
+ * and invalid values must be handled gracefully.
+ *
+ * ACCESS_ONCE is used so that if the caller assigns the result into a local
+ * variable and e.g. tests it for valid range before using, the compiler cannot
+ * decide to remove the variable and inline the page_private(page) multiple
+ * times, potentially observing different values in the tests and the actual
+ * use of the result.
+ */
+#define page_order_unsafe(page)		ACCESS_ONCE(page_private(page))
+
 static inline bool is_cow_mapping(vm_flags_t flags)
 {
 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (10 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:37   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Christoph Lameter, Rik van Riel, Minchan Kim,
	Naoya Horiguchi, Zhang Yanfei

From: David Rientjes <rientjes@google.com>

The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
ALLOC_CPUSET) that have separate semantics.

The function allocflags_to_migratetype() actually takes gfp flags, not alloc
flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 include/linux/gfp.h | 2 +-
 mm/compaction.c     | 4 ++--
 mm/page_alloc.c     | 6 +++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5e7219d..41b30fd 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -156,7 +156,7 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 
 /* Convert GFP flags to their corresponding migrate type */
-static inline int allocflags_to_migratetype(gfp_t gfp_flags)
+static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 {
 	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 3cfc3a9..523c7bd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1238,7 +1238,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
 		.order = order,
-		.migratetype = allocflags_to_migratetype(gfp_mask),
+		.migratetype = gfpflags_to_migratetype(gfp_mask),
 		.zone = zone,
 		.mode = mode,
 	};
@@ -1290,7 +1290,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		return rc;
 
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 	/* Compact each zone in the list */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb2988d..6f1c6e6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2496,7 +2496,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 	return alloc_flags;
@@ -2760,7 +2760,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	struct zone *preferred_zone;
 	struct zoneref *preferred_zoneref;
 	struct page *page = NULL;
-	int migratetype = allocflags_to_migratetype(gfp_mask);
+	int migratetype = gfpflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
@@ -2794,7 +2794,7 @@ retry_cpuset:
 	classzone_idx = zonelist_zone_idx(preferred_zoneref);
 
 #ifdef CONFIG_CMA
-	if (allocflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
 #endif
 retry:
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (11 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:38   ` Mel Gorman
  2014-07-16 13:48 ` [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
  2014-07-16 13:48 ` [RFC PATCH V4 15/15] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation Vlastimil Babka
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Mel Gorman, Joonsoo Kim,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Minchan Kim, Zhang Yanfei

From: David Rientjes <rientjes@google.com>

struct compact_control currently converts the gfp mask to a migratetype, but we
need the entire gfp mask in a follow-up patch.

Pass the entire gfp mask as part of struct compact_control.

Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
---
 mm/compaction.c | 12 +++++++-----
 mm/internal.h   |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 523c7bd..279c0b0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1028,8 +1028,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
-static int compact_finished(struct zone *zone,
-			    struct compact_control *cc)
+static int compact_finished(struct zone *zone, struct compact_control *cc,
+			    const int migratetype)
 {
 	unsigned int order;
 	unsigned long watermark;
@@ -1075,7 +1075,7 @@ static int compact_finished(struct zone *zone,
 		struct free_area *area = &zone->free_area[order];
 
 		/* Job done if page is free of the right migratetype */
-		if (!list_empty(&area->free_list[cc->migratetype]))
+		if (!list_empty(&area->free_list[migratetype]))
 			return COMPACT_PARTIAL;
 
 		/* Job done if allocation would set block type */
@@ -1141,6 +1141,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 	int ret;
 	unsigned long start_pfn = zone->zone_start_pfn;
 	unsigned long end_pfn = zone_end_pfn(zone);
+	const int migratetype = gfpflags_to_migratetype(cc->gfp_mask);
 	const bool sync = cc->mode != MIGRATE_ASYNC;
 
 	ret = compaction_suitable(zone, cc->order);
@@ -1183,7 +1184,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 
 	migrate_prep_local();
 
-	while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
+	while ((ret = compact_finished(zone, cc, migratetype)) ==
+						COMPACT_CONTINUE) {
 		int err;
 
 		switch (isolate_migratepages(zone, cc)) {
@@ -1238,7 +1240,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 		.nr_freepages = 0,
 		.nr_migratepages = 0,
 		.order = order,
-		.migratetype = gfpflags_to_migratetype(gfp_mask),
+		.gfp_mask = gfp_mask,
 		.zone = zone,
 		.mode = mode,
 	};
diff --git a/mm/internal.h b/mm/internal.h
index 86ae964..8293040 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -142,7 +142,7 @@ struct compact_control {
 	bool finished_update_migrate;
 
 	int order;			/* order a direct compactor needs */
-	int migratetype;		/* MOVABLE, RECLAIMABLE etc */
+	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	struct zone *zone;
 	int contended;			/* Signal need_sched() or lock
 					 * contention detected during
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (12 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  2014-07-25 12:56   ` Mel Gorman
  2014-07-16 13:48 ` [RFC PATCH V4 15/15] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation Vlastimil Babka
  14 siblings, 1 reply; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

Compaction uses watermark checking to determine if it succeeded in creating
a high-order free page. My testing has shown that this is quite racy and it
can happen that watermark checking in compaction succeeds, and moments later
the watermark checking in page allocation fails, even though the number of
free pages has increased meanwhile.

It should be more reliable if direct compaction captured the high-order free
page as soon as it detects it, and pass it back to allocation. This would
also reduce the window for somebody else to allocate the free page.

Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
a suitable high-order page immediately when it is made available"), but later
reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
high-order page") due to a bug.

This patch differs from the previous attempt in two aspects:

1) The previous patch scanned free lists to capture the page. In this patch,
   only the cc->order aligned block that the migration scanner just finished
   is considered, but only if pages were actually isolated for migration in
   that block. Tracking cc->order aligned blocks also has benefits for the
   following patch that skips blocks where non-migratable pages were found.

2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
   closely followed so that page capture mimics normal page allocation as much
   as possible. This includes operations such as prep_new_page() and
   page->pfmemalloc setting (that was missing in the previous attempt), zone
   statistics are updated etc. Due to subtleties with IRQ disabling and
   enabling this cannot be simply factored out from the normal allocation
   functions without affecting the fastpath.

This patch has tripled compaction success rates (as recorded in vmstat) in
stress-highalloc mmtests benchmark, although allocation success rates increased
only by a few percent. Closer inspection shows that due to the racy watermark
checking and lack of lru_add_drain(), the allocations that resulted in direct
compactions were often failing, but later allocations succeeeded in the fast
path. So the benefit of the patch to allocation success rates may be limited,
but it improves the fairness in the sense that whoever spent the time
compacting has a higher change of benefitting from it, and also can stop
compacting sooner, as page availability is detected immediately. With better
success detection, the contribution of compaction to high-order allocation
success success rates is also no longer understated by the vmstats.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 include/linux/compaction.h |   8 ++-
 mm/compaction.c            | 118 ++++++++++++++++++++++++++++++++++++++++-----
 mm/internal.h              |   4 +-
 mm/page_alloc.c            |  81 ++++++++++++++++++++++++++-----
 4 files changed, 185 insertions(+), 26 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 60bdf8d..b83c142 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -12,6 +12,8 @@
 #define COMPACT_PARTIAL		3
 /* The full zone was compacted */
 #define COMPACT_COMPLETE	4
+/* Captured a high-order free page in direct compaction */
+#define COMPACT_CAPTURED	5
 
 /* Used to signal whether compaction detected need_sched() or lock contention */
 /* No contention detected */
@@ -33,7 +35,8 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *mask,
 			enum migrate_mode mode, int *contended,
-			struct zone **candidate_zone);
+			struct zone **candidate_zone,
+			struct page **captured_page);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -103,7 +106,8 @@ static inline bool compaction_restarting(struct zone *zone, int order)
 static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
 			enum migrate_mode mode, int *contended,
-			struct zone **candidate_zone)
+			struct zone **candidate_zone,
+			struct page **captured_page);
 {
 	return COMPACT_CONTINUE;
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 279c0b0..4fe091c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -548,6 +548,7 @@ static bool too_many_isolated(struct zone *zone)
  * @low_pfn:	The first PFN to isolate
  * @end_pfn:	The one-past-the-last PFN to isolate, within same pageblock
  * @isolate_mode: Isolation mode to be used.
+ * @capture:    True if page capturing is allowed
  *
  * Isolate all pages that can be migrated from the range specified by
  * [low_pfn, end_pfn). The range is expected to be within same pageblock.
@@ -561,7 +562,8 @@ static bool too_many_isolated(struct zone *zone)
  */
 static unsigned long
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
-			unsigned long end_pfn, isolate_mode_t isolate_mode)
+			unsigned long end_pfn, isolate_mode_t isolate_mode,
+			bool capture)
 {
 	struct zone *zone = cc->zone;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
@@ -570,6 +572,14 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	unsigned long flags;
 	bool locked = false;
 	struct page *page = NULL, *valid_page = NULL;
+	unsigned long capture_pfn = 0;   /* current candidate for capturing */
+	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+
+	if (cc->order > 0 && cc->order <= pageblock_order && capture) {
+		/* This may be outside the zone, but we check that later */
+		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
+		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+	}
 
 	/*
 	 * Ensure that there are not too many pages isolated from the LRU
@@ -591,7 +601,27 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		return 0;
 
 	/* Time to isolate some pages for migration */
-	for (; low_pfn < end_pfn; low_pfn++) {
+	for (; low_pfn <= end_pfn; low_pfn++) {
+		if (low_pfn == next_capture_pfn) {
+			/*
+			 * We have a capture candidate if we isolated something
+			 * during the last cc->order aligned block of pages.
+			 */
+			if (nr_isolated &&
+					capture_pfn >= zone->zone_start_pfn) {
+				cc->capture_page = pfn_to_page(capture_pfn);
+				break;
+			}
+
+			/* Prepare for a new capture candidate */
+			capture_pfn = next_capture_pfn;
+			next_capture_pfn += (1UL << cc->order);
+		}
+
+		/* We check that here, in case low_pfn == next_capture_pfn */
+		if (low_pfn == end_pfn)
+			break;
+
 		/*
 		 * Periodically drop the lock (if held) regardless of its
 		 * contention, to give chance to IRQs. Abort async compaction
@@ -625,8 +655,12 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			 * a valid page order. Consider only values in the
 			 * valid order range to prevent low_pfn overflow.
 			 */
-			if (freepage_order > 0 && freepage_order < MAX_ORDER)
+			if (freepage_order > 0 && freepage_order < MAX_ORDER) {
 				low_pfn += (1UL << freepage_order) - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = ALIGN(low_pfn + 1,
+							(1UL << cc->order));
+			}
 			continue;
 		}
 
@@ -662,6 +696,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			else
 				low_pfn += (1 << compound_order(page)) - 1;
 
+			if (next_capture_pfn)
+				next_capture_pfn =
+					ALIGN(low_pfn + 1, (1UL << cc->order));
 			continue;
 		}
 
@@ -686,6 +723,9 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 				continue;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
+				if (next_capture_pfn)
+					next_capture_pfn = ALIGN(low_pfn + 1,
+							(1UL << cc->order));
 				continue;
 			}
 		}
@@ -770,7 +810,7 @@ isolate_migratepages_range(struct compact_control *cc, unsigned long start_pfn,
 			continue;
 
 		pfn = isolate_migratepages_block(cc, pfn, block_end_pfn,
-							ISOLATE_UNEVICTABLE);
+						ISOLATE_UNEVICTABLE, false);
 
 		/*
 		 * In case of fatal failure, release everything that might
@@ -958,7 +998,7 @@ typedef enum {
  * compact_control.
  */
 static isolate_migrate_t isolate_migratepages(struct zone *zone,
-					struct compact_control *cc)
+			struct compact_control *cc, const int migratetype)
 {
 	unsigned long low_pfn, end_pfn;
 	struct page *page;
@@ -980,6 +1020,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	 */
 	for (; end_pfn <= cc->free_pfn;
 			low_pfn = end_pfn, end_pfn += pageblock_nr_pages) {
+		int pageblock_mt;
 
 		/*
 		 * This can potentially iterate a massively long zone with
@@ -1003,13 +1044,14 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		 * Async compaction is optimistic to see if the minimum amount
 		 * of work satisfies the allocation.
 		 */
+		pageblock_mt = get_pageblock_migratetype(page);
 		if (cc->mode == MIGRATE_ASYNC &&
-		    !migrate_async_suitable(get_pageblock_migratetype(page)))
+					!migrate_async_suitable(pageblock_mt))
 			continue;
 
 		/* Perform the isolation */
 		low_pfn = isolate_migratepages_block(cc, low_pfn, end_pfn,
-								isolate_mode);
+							isolate_mode, true);
 
 		if (!low_pfn || cc->contended)
 			return ISOLATE_ABORT;
@@ -1028,6 +1070,44 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	return cc->nr_migratepages ? ISOLATE_SUCCESS : ISOLATE_NONE;
 }
 
+/*
+ * When called, cc->capture_page is just a candidate. This function will either
+ * successfully capture the page, or reset it to NULL.
+ */
+static bool compact_capture_page(struct compact_control *cc)
+{
+	struct page *page = cc->capture_page;
+	int cpu;
+
+	/* Unsafe check if it's worth to try acquiring the zone->lock at all */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	/*
+	 * There's a good chance that we have just put free pages on this CPU's
+	 * lru cache and pcplists after the page migrations. Drain them to
+	 * allow merging.
+	 */
+	cpu = get_cpu();
+	lru_add_drain_cpu(cpu);
+	drain_local_pages(NULL);
+	put_cpu();
+
+	/* Did the draining help? */
+	if (PageBuddy(page) && page_order_unsafe(page) >= cc->order)
+		goto try_capture;
+
+	goto fail;
+
+try_capture:
+	if (capture_free_page(page, cc->order))
+		return true;
+
+fail:
+	cc->capture_page = NULL;
+	return false;
+}
+
 static int compact_finished(struct zone *zone, struct compact_control *cc,
 			    const int migratetype)
 {
@@ -1056,6 +1136,10 @@ static int compact_finished(struct zone *zone, struct compact_control *cc,
 		return COMPACT_COMPLETE;
 	}
 
+	/* Did we just finish a pageblock that was capture candidate? */
+	if (cc->capture_page && compact_capture_page(cc))
+		return COMPACT_CAPTURED;
+
 	/*
 	 * order == -1 is expected when compacting via
 	 * /proc/sys/vm/compact_memory
@@ -1188,7 +1272,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 						COMPACT_CONTINUE) {
 		int err;
 
-		switch (isolate_migratepages(zone, cc)) {
+		switch (isolate_migratepages(zone, cc, migratetype)) {
 		case ISOLATE_ABORT:
 			ret = COMPACT_PARTIAL;
 			putback_movable_pages(&cc->migratepages);
@@ -1233,7 +1317,8 @@ out:
 }
 
 static unsigned long compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, int *contended)
+		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+						struct page **captured_page)
 {
 	unsigned long ret;
 	struct compact_control cc = {
@@ -1250,6 +1335,9 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
 	ret = compact_zone(zone, &cc);
 	*contended = cc.contended;
 
+	if (ret == COMPACT_CAPTURED)
+		*captured_page = cc.capture_page;
+
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
@@ -1268,13 +1356,15 @@ int sysctl_extfrag_threshold = 500;
  * @contended: Return value that determines if compaction was aborted due to 
  *	       need_resched() or lock contention
  * @candidate_zone: Return the zone where we think allocation should succeed
+ * @captured_page: If successful, return the page captured during compaction
  *
  * This is the main entry point for direct page compaction.
  */
 unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			int order, gfp_t gfp_mask, nodemask_t *nodemask,
 			enum migrate_mode mode, int *contended,
-			struct zone **candidate_zone)
+			struct zone **candidate_zone,
+			struct page **captured_page)
 {
 	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
 	int may_enter_fs = gfp_mask & __GFP_FS;
@@ -1305,7 +1395,7 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 			continue;
 
 		status = compact_zone_order(zone, order, gfp_mask, mode,
-							&zone_contended);
+					&zone_contended, captured_page);
 		rc = max(status, rc);
 		/*
 		 * It takes at least one zone that wasn't lock contended
@@ -1314,6 +1404,12 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
 		all_zones_lock_contended &=
 			(zone_contended == COMPACT_CONTENDED_LOCK);
 
+		/* If we captured a page, stop compacting */
+		if (*captured_page) {
+			*candidate_zone = zone;
+			break;
+		}
+
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0,
 				      alloc_flags)) {
diff --git a/mm/internal.h b/mm/internal.h
index 8293040..9e659fcd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,7 @@ extern pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
  */
 extern void __free_pages_bootmem(struct page *page, unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned long order);
+extern bool capture_free_page(struct page *page, unsigned int order);
 #ifdef CONFIG_MEMORY_FAILURE
 extern bool is_free_buddy_page(struct page *page);
 #endif
@@ -148,6 +149,7 @@ struct compact_control {
 					 * contention detected during
 					 * compaction
 					 */
+	struct page *capture_page;	/* Free page captured by compaction */
 };
 
 unsigned long
@@ -155,7 +157,7 @@ isolate_freepages_range(struct compact_control *cc,
 			unsigned long start_pfn, unsigned long end_pfn);
 unsigned long
 isolate_migratepages_range(struct compact_control *cc,
-			   unsigned long low_pfn, unsigned long end_pfn);
+			unsigned long low_pfn, unsigned long end_pfn);
 
 #endif
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6f1c6e6..6f2fbfc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1483,9 +1483,11 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 {
 	unsigned long watermark;
 	struct zone *zone;
+	struct free_area *area;
 	int mt;
+	unsigned int freepage_order = page_order(page);
 
-	BUG_ON(!PageBuddy(page));
+	VM_BUG_ON_PAGE((!PageBuddy(page) || freepage_order < order), page);
 
 	zone = page_zone(page);
 	mt = get_pageblock_migratetype(page);
@@ -1500,9 +1502,12 @@ static int __isolate_free_page(struct page *page, unsigned int order)
 	}
 
 	/* Remove page from free list */
+	area = &zone->free_area[freepage_order];
 	list_del(&page->lru);
-	zone->free_area[order].nr_free--;
+	area->nr_free--;
 	rmv_page_order(page);
+	if (freepage_order != order)
+		expand(zone, page, order, freepage_order, area, mt);
 
 	/* Set the pageblock if the isolated page is at least a pageblock */
 	if (order >= pageblock_order - 1) {
@@ -1545,6 +1550,29 @@ int split_free_page(struct page *page)
 	return nr_pages;
 }
 
+bool capture_free_page(struct page *page, unsigned int order)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long flags;
+
+	spin_lock_irqsave(&zone->lock, flags);
+
+	if (!PageBuddy(page) || page_order(page) < order
+			|| !__isolate_free_page(page, order)) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return false;
+	}
+
+	spin_unlock(&zone->lock);
+
+	/* Mimic what buffered_rmqueue() does */
+	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	__count_zone_vm_events(PGALLOC, zone, 1 << order);
+	local_irq_restore(flags);
+
+	return true;
+}
+
 /*
  * Really, prep_compound_page() should be called from __rmqueue_bulk().  But
  * we cheat by calling it from here, in the order > 0 path.  Saves a branch
@@ -2251,7 +2279,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	unsigned long *did_some_progress)
 {
 	struct zone *last_compact_zone = NULL;
-	struct page *page;
+	struct page *page = NULL;
 
 	if (!order)
 		return NULL;
@@ -2260,7 +2288,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	*did_some_progress = try_to_compact_pages(zonelist, order, gfp_mask,
 						nodemask, mode,
 						contended_compaction,
-						&last_compact_zone);
+						&last_compact_zone, &page);
 	current->flags &= ~PF_MEMALLOC;
 
 	switch (*did_some_progress) {
@@ -2279,14 +2307,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTSTALL);
 
-	/* Page migration frees to the PCP lists but we want merging */
-	drain_pages(get_cpu());
-	put_cpu();
+	/* Did we capture a page? */
+	if (page) {
+		struct zone *zone;
+		unsigned long flags;
+		/*
+		 * Mimic what buffered_rmqueue() does and capture_new_page()
+		 * has not yet done.
+		 */
+		zone = page_zone(page);
+
+		local_irq_save(flags);
+		zone_statistics(preferred_zone, zone, gfp_mask);
+		local_irq_restore(flags);
 
-	page = get_page_from_freelist(gfp_mask, nodemask,
-			order, zonelist, high_zoneidx,
-			alloc_flags & ~ALLOC_NO_WATERMARKS,
-			preferred_zone, classzone_idx, migratetype);
+		VM_BUG_ON_PAGE(bad_range(zone, page), page);
+		if (!prep_new_page(page, order, gfp_mask))
+			/* This is normally done in get_page_from_freelist() */
+			page->pfmemalloc = !!(alloc_flags &
+					ALLOC_NO_WATERMARKS);
+		else
+			page = NULL;
+	}
+
+	/* No capture but let's try allocating anyway */
+	if (!page) {
+		/*
+		 * Page migration frees to the PCP lists but we want
+		 * merging
+		 */
+		drain_pages(get_cpu());
+		put_cpu();
+
+		page = get_page_from_freelist(gfp_mask, nodemask, order,
+				zonelist, high_zoneidx,
+				alloc_flags & ~ALLOC_NO_WATERMARKS,
+				preferred_zone, classzone_idx, migratetype);
+	}
 
 	if (page) {
 		struct zone *zone = page_zone(page);
@@ -6303,7 +6360,7 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 
 		if (list_empty(&cc->migratepages)) {
 			cc->nr_migratepages = 0;
-			pfn = isolate_migratepages_range(cc, pfn, end);
+			pfn = isolate_migratepages_range(cc, pfn, end, false);
 			if (!pfn) {
 				ret = -EINTR;
 				break;
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [RFC PATCH V4 15/15] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation
  2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
                   ` (13 preceding siblings ...)
  2014-07-16 13:48 ` [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
@ 2014-07-16 13:48 ` Vlastimil Babka
  14 siblings, 0 replies; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-16 13:48 UTC (permalink / raw)
  To: linux-mm, Andrew Morton, David Rientjes
  Cc: linux-kernel, Vlastimil Babka, Minchan Kim, Mel Gorman,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

In direct compaction for a page fault, we want to allocate the high-order page
as soon as possible, so migrating from a cc->order aligned block of pages that
contains also unmigratable pages just adds to page fault latency.

This patch therefore makes the migration scanner skip to the next cc->order
aligned block of pages as soon as it cannot isolate a non-free page. Everything
isolated up to that point is put back.

In this mode, the nr_isolated limit to COMPACT_CLUSTER_MAX is not observed,
allowing the scanner to scan the whole block at once, instead of migrating
COMPACT_CLUSTER_MAX pages and then finding an unmigratable page in the next
call. This might however have some implications on direct reclaimers through
too_many_isolated().

In tests with stress-highalloc benchmark where __GFP_NO_KSWAPD was not used
and therefore the patch did not affect the benchmark itself, but only the
kernel compilations occuring in parallel, this patch has increased allocation
success rates of the benchmark by a few percent, and pages scanned by
compaction by almost 20%. The compaction successes in vmstat increased by 16%
so this is probably due to more successes translating to less deferring.
The compaction successes (and attempts) did increase for other processes than
the benchmark, which would be explained by those processes faulting THP pages
with __GFP_NO_KSWAPD. However, THP faults in vmstat did not increase, so
there either some problem in that area, or the direct compactions were
triggered by different allocations.

In tests where __GFP_NO_KSWAPD was used by the benchmark, the allocation
success rates improved from 15% to 20% in the first phase, and from 19% to 33%
in the second phase. Again, the amount of work done increased due to less
deferring.

[rientjes@google.com: skip_on_failure based on THP page faults]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: David Rientjes <rientjes@google.com>
---
 mm/compaction.c | 50 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 4fe091c..6271bf7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -574,11 +574,20 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 	struct page *page = NULL, *valid_page = NULL;
 	unsigned long capture_pfn = 0;   /* current candidate for capturing */
 	unsigned long next_capture_pfn = 0; /* next candidate for capturing */
+	bool skip_on_failure = false; /* skip block when isolation fails */
 
 	if (cc->order > 0 && cc->order <= pageblock_order && capture) {
 		/* This may be outside the zone, but we check that later */
 		capture_pfn = low_pfn & ~((1UL << cc->order) - 1);
 		next_capture_pfn = ALIGN(low_pfn + 1, (1UL << cc->order));
+		/*
+		 * It is too expensive for compaction to migrate pages from a
+		 * cc->order block of pages on page faults, unless the entire
+		 * block can become free. But hugepaged should try anyway for
+		 * THP so that general defragmentation happens.
+		 */
+		skip_on_failure = (cc->gfp_mask & __GFP_NO_KSWAPD)
+				&& !(current->flags & PF_KTHREAD);
 	}
 
 	/*
@@ -633,7 +642,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			break;
 
 		if (!pfn_valid_within(low_pfn))
-			continue;
+			goto isolation_failed;
 		nr_scanned++;
 
 		page = pfn_to_page(low_pfn);
@@ -676,7 +685,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 					goto isolate_success;
 				}
 			}
-			continue;
+			goto isolation_failed;
 		}
 
 		/*
@@ -699,7 +708,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			if (next_capture_pfn)
 				next_capture_pfn =
 					ALIGN(low_pfn + 1, (1UL << cc->order));
-			continue;
+			goto isolation_failed;
 		}
 
 		/*
@@ -709,7 +718,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		if (!page_mapping(page) &&
 		    page_count(page) > page_mapcount(page))
-			continue;
+			goto isolation_failed;
 
 		/* If we already hold the lock, we can skip some rechecking */
 		if (!locked) {
@@ -720,13 +729,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 			/* Recheck PageLRU and PageTransHuge under lock */
 			if (!PageLRU(page))
-				continue;
+				goto isolation_failed;
 			if (PageTransHuge(page)) {
 				low_pfn += (1 << compound_order(page)) - 1;
 				if (next_capture_pfn)
 					next_capture_pfn = ALIGN(low_pfn + 1,
 							(1UL << cc->order));
-				continue;
+				goto isolation_failed;
 			}
 		}
 
@@ -734,7 +743,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, isolate_mode) != 0)
-			continue;
+			goto isolation_failed;
 
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
@@ -747,11 +756,32 @@ isolate_success:
 		cc->nr_migratepages++;
 		nr_isolated++;
 
-		/* Avoid isolating too much */
-		if (cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
+		/*
+		 * Avoid isolating too much, except if we try to capture a
+		 * free page and want to find out at once if it can be done
+		 * or we should skip to the next block.
+		 */
+		if (!skip_on_failure &&
+				cc->nr_migratepages == COMPACT_CLUSTER_MAX) {
 			++low_pfn;
 			break;
 		}
+
+		continue;
+
+isolation_failed:
+		if (skip_on_failure) {
+			if (nr_isolated) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+				putback_movable_pages(migratelist);
+				nr_isolated = 0;
+			}
+			low_pfn = next_capture_pfn - 1;
+		}
 	}
 
 	/*
@@ -770,7 +800,7 @@ isolate_success:
 	 * Update the pageblock-skip information and cached scanner pfn,
 	 * if the whole pageblock was scanned without isolating any page.
 	 */
-	if (low_pfn == end_pfn)
+	if (low_pfn == end_pfn && !skip_on_failure)
 		update_pageblock_skip(cc, valid_page, nr_isolated, true);
 
 	trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);
-- 
1.8.4.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP
  2014-07-16 13:48 ` [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
@ 2014-07-25 12:18   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:09PM +0200, Vlastimil Babka wrote:
> When allocating huge page for collapsing, khugepaged currently holds mmap_sem
> for reading on the mm where collapsing occurs. Afterwards the read lock is
> dropped before write lock is taken on the same mmap_sem.
> 
> Holding mmap_sem during whole huge page allocation is therefore useless, the
> vma needs to be rechecked after taking the write lock anyway. Furthemore, huge
> page allocation might involve a rather long sync compaction, and thus block
> any mmap_sem writers and i.e. affect workloads that perform frequent m(un)map
> or mprotect oterations.
> 
> This patch simply releases the read lock before allocating a huge page. It
> also deletes an outdated comment that assumed vma must be stable, as it was
> using alloc_hugepage_vma(). This is no longer true since commit 9f1b868a13
> ("mm: thp: khugepaged: add policy for finding target node").
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone
  2014-07-16 13:48 ` [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone Vlastimil Babka
@ 2014-07-25 12:20   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:10PM +0200, Vlastimil Babka wrote:
> When direct sync compaction is often unsuccessful, it may become deferred for
> some time to avoid further useless attempts, both sync and async. Successful
> high-order allocations un-defer compaction, while further unsuccessful
> compaction attempts prolong the copmaction deferred period.
> 
> Currently the checking and setting deferred status is performed only on the
> preferred zone of the allocation that invoked direct compaction. But compaction
> itself is attempted on all eligible zones in the zonelist, so the behavior is
> suboptimal and may lead both to scenarios where 1) compaction is attempted
> uselessly, or 2) where it's not attempted despite good chances of succeeding,
> as shown on the examples below:
> 
> 1) A direct compaction with Normal preferred zone failed and set deferred
>    compaction for the Normal zone. Another unrelated direct compaction with
>    DMA32 as preferred zone will attempt to compact DMA32 zone even though
>    the first compaction attempt also included DMA32 zone.
> 
>    In another scenario, compaction with Normal preferred zone failed to compact
>    Normal zone, but succeeded in the DMA32 zone, so it will not defer
>    compaction. In the next attempt, it will try Normal zone which will fail
>    again, instead of skipping Normal zone and trying DMA32 directly.
> 
> 2) Kswapd will balance DMA32 zone and reset defer status based on watermarks
>    looking good. A direct compaction with preferred Normal zone will skip
>    compaction of all zones including DMA32 because Normal was still deferred.
>    The allocation might have succeeded in DMA32, but won't.
> 
> This patch makes compaction deferring work on individual zone basis instead of
> preferred zone. For each zone, it checks compaction_deferred() to decide if the
> zone should be skipped. If watermarks fail after compacting the zone,
> defer_compaction() is called. The zone where watermarks passed can still be
> deferred when the allocation attempt is unsuccessful. When allocation is
> successful, compaction_defer_reset() is called for the zone containing the
> allocated page. This approach should approximate calling defer_compaction()
> only on zones where compaction was attempted and did not yield allocated page.
> There might be corner cases but that is inevitable as long as the decision
> to stop compacting dues not guarantee that a page will be allocated.
> 
> During testing on a two-node machine with a single very small Normal zone on
> node 1, this patch has improved success rates in stress-highalloc mmtests
> benchmark. The success here were previously made worse by commit 3a025760fc
> ("mm: page_alloc: spill to remote nodes before waking kswapd") as kswapd was
> no longer resetting often enough the deferred compaction for the Normal zone,
> and DMA32 zones on both nodes were thus not considered for compaction.
> On different machine, success rates were improved with __GFP_NO_KSWAPD
> allocations.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction
  2014-07-16 13:48 ` [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction Vlastimil Babka
@ 2014-07-25 12:22   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:22 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:11PM +0200, Vlastimil Babka wrote:
> The compact_stall vmstat counter counts the number of allocations stalled by
> direct compaction. It does not count when all attempted zones had deferred
> compaction, but it does count when all zones skipped compaction. The skipping
> is decided based on very early check of compaction_suitable(), based on
> watermarks and memory fragmentation. Therefore it makes sense not to count
> skipped compactions as stalls. Moreover, compact_success or compact_fail is
> also already not being counted when compaction was skipped, so this patch
> changes the compact_stall counting to match the other two.
> 
> Additionally, restructure __alloc_pages_direct_compact() code for better
> readability.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock
  2014-07-16 13:48 ` [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
@ 2014-07-25 12:23   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:12PM +0200, Vlastimil Babka wrote:
> isolate_freepages_block() rechecks if the pageblock is suitable to be a target
> for migration after it has taken the zone->lock. However, the check has been
> optimized to occur only once per pageblock, and compact_checklock_irqsave()
> might be dropping and reacquiring lock, which means somebody else might have
> changed the pageblock's migratetype meanwhile.
> 
> Furthermore, nothing prevents the migratetype to change right after
> isolate_freepages_block() has finished isolating. Given how imperfect this is,
> it's simpler to just rely on the check done in isolate_freepages() without
> lock, and not pretend that the recheck under lock guarantees anything. It is
> just a heuristic after all.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range()
  2014-07-16 13:48 ` [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range() Vlastimil Babka
@ 2014-07-25 12:28   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:28 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:13PM +0200, Vlastimil Babka wrote:
> isolate_migratepages_range() is the main function of the compaction scanner,
> called either on a single pageblock by isolate_migratepages() during regular
> compaction, or on an arbitrary range by CMA's __alloc_contig_migrate_range().
> It currently perfoms two pageblock-wide compaction suitability checks, and
> because of the CMA callpath, it tracks if it crossed a pageblock boundary in
> order to repeat those checks.
> 
> However, closer inspection shows that those checks are always true for CMA:
> - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
> - migrate_async_suitable() check is skipped because CMA uses sync compaction
> 
> We can therefore move the compaction-specific checks to isolate_migratepages()
> and simplify isolate_migratepages_range(). Furthermore, we can mimic the
> freepage scanner family of functions, which has isolate_freepages_block()
> function called both by compaction from isolate_freepages() and by CMA from
> isolate_freepages_range(), where each use-case adds own specific glue code.
> This allows further code simplification.
> 
> Therefore, we rename isolate_migratepages_range() to isolate_freepages_block()
> and limit its functionality to a single pageblock (or its subset). For CMA,
> a new different isolate_migratepages_range() is created as a CMA-specific
> wrapper for the _block() function. The checks specific to compaction are moved
> to isolate_migratepages(). As part of the unification of these two families of
> functions, we remove the redundant zone parameter where applicable, since zone
> pointer is already passed in cc->zone.
> 
> Furthermore, going back to compact_zone() and compact_finished() when pageblock
> is found unsuitable (now by isolate_migratepages()) is wasteful - the checks
> are meant to skip pageblocks quickly. The patch therefore also introduces a
> simple loop into isolate_migratepages() so that it does not return immediately
> on failed pageblock checks, but keeps going until isolate_migratepages_range()
> gets called once. Similarily to isolate_freepages(), the function periodically
> checks if it needs to reschedule or abort async compaction.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner
  2014-07-16 13:48 ` [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner Vlastimil Babka
@ 2014-07-25 12:29   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Christoph Lameter,
	Rik van Riel, Naoya Horiguchi, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:14PM +0200, Vlastimil Babka wrote:
> The unification of the migrate and free scanner families of function has
> highlighted a difference in how the scanners ensure they only isolate pages
> of the intended zone. This is important for taking zone lock or lru lock of
> the correct zone. Due to nodes overlapping, it is however possible to
> encounter a different zone within the range of the zone being compacted.
> 
> The free scanner, since its inception by commit 748446bb6b ("mm: compaction:
> memory compaction core"), has been checking the zone of the first valid page
> in a pageblock, and skipping the whole pageblock if the zone does not match.
> 
> This checking was completely missing from the migration scanner at first, and
> later added by commit dc9086004b ("mm: compaction: check for overlapping
> nodes during isolation for migration") in a reaction to a bug report.
> But the zone comparison in migration scanner is done once per a single scanned
> page, which is more defensive and thus more costly than a check per pageblock.
> 
> This patch unifies the checking done in both scanners to once per pageblock,
> through a new pageblock_within_zone() function, which also includes pfn_valid()
> checks. It is more defensive than the current free scanner checks, as it checks
> both the first and last page of the pageblock, but less defensive by the
> migration scanner per-page checks. It assumes that node overlapping may result
> (on some architecture) in a boundary between two nodes falling into the middle
> of a pageblock, but that there cannot be a node0 node1 node0 interleaving
> within a single pageblock.
> 
> The result is more code being shared and a bit less per-page CPU cost in the
> migration scanner.
> 
> Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched()
  2014-07-16 13:48 ` [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched() Vlastimil Babka
@ 2014-07-25 12:31   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:15PM +0200, Vlastimil Babka wrote:
> Async compaction aborts when it detects zone lock contention or need_resched()
> is true. David Rientjes has reported that in practice, most direct async
> compactions for THP allocation abort due to need_resched(). This means that a
> second direct compaction is never attempted, which might be OK for a page
> fault, but khugepaged is intended to attempt a sync compaction in such case and
> in these cases it won't.
> 
> This patch replaces "bool contended" in compact_control with an int that
> distinguieshes between aborting due to need_resched() and aborting due to lock
> contention. This allows propagating the abort through all compaction functions
> as before, but passing the abort reason up to __alloc_pages_slowpath() which
> decides when to continue with direct reclaim and another compaction attempt.
> 
> Another problem is that try_to_compact_pages() did not act upon the reported
> contention (both need_resched() or lock contention) immediately and would
> proceed with another zone from the zonelist. When need_resched() is true, that
> means initializing another zone compaction, only to check again need_resched()
> in isolate_migratepages() and aborting. For zone lock contention, the
> unintended consequence is that the lock contended status reported back to the
> allocator is detrmined from the last zone where compaction was attempted, which
> is rather arbitrary.
> 
> This patch fixes the problem in the following way:
> - async compaction of a zone aborting due to need_resched() or fatal signal
>   pending means that further zones should not be tried. We report
>   COMPACT_CONTENDED_SCHED to the allocator.
> - aborting zone compaction due to lock contention means we can still try
>   another zone, since it has different set of locks. We report back
>   COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
>   it was aborted due to lock contention.
> 
> As a result of these fixes, khugepaged will proceed with second sync compaction
> as intended, when the preceding async compaction aborted due to need_resched().
> Page fault compactions aborting due to need_resched() will spare some cycles
> previously wasted by initializing another zone compaction only to abort again.
> Lock contention will be reported only when compaction in all zones aborted due
> to lock contention, and therefore it's not a good idea to try again after
> reclaim.
> 
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners
  2014-07-16 13:48 ` [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
@ 2014-07-25 12:32   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:32 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Joonsoo Kim, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:16PM +0200, Vlastimil Babka wrote:
> Compaction scanners regularly check for lock contention and need_resched()
> through the compact_checklock_irqsave() function. However, if there is no
> contention, the lock can be held and IRQ disabled for potentially long time.
> 
> This has been addressed by commit b2eef8c0d0 ("mm: compaction: minimise the
> time IRQs are disabled while isolating pages for migration") for the migration
> scanner. However, the refactoring done by commit 2a1402aa04 ("mm: compaction:
> acquire the zone->lru_lock as late as possible") has changed the conditions so
> that the lock is dropped only when there's contention on the lock or
> need_resched() is true. Also, need_resched() is checked only when the lock is
> already held. The comment "give a chance to irqs before checking need_resched"
> is therefore misleading, as IRQs remain disabled when the check is done.
> 
> This patch restores the behavior intended by commit b2eef8c0d0 and also tries
> to better balance and make more deterministic the time spent by checking for
> contention vs the time the scanners might run between the checks. It also
> avoids situations where checking has not been done often enough before. The
> result should be avoiding both too frequent and too infrequent contention
> checking, and especially the potentially long-running scans with IRQs disabled
> and no checking of need_resched() or for fatal signal pending, which can happen
> when many consecutive pages or pageblocks fail the preliminary tests and do not
> reach the later call site to compact_checklock_irqsave(), as explained below.
> 
> Before the patch:
> 
> In the migration scanner, compact_checklock_irqsave() was called each loop, if
> reached. If not reached, some lower-frequency checking could still be done if
> the lock was already held, but this would not result in aborting contended
> async compaction until reaching compact_checklock_irqsave() or end of
> pageblock. In the free scanner, it was similar but completely without the
> periodical checking, so lock can be potentially held until reaching the end of
> pageblock.
> 
> After the patch, in both scanners:
> 
> The periodical check is done as the first thing in the loop on each
> SWAP_CLUSTER_MAX aligned pfn, using the new compact_unlock_should_abort()
> function, which always unlocks the lock (if locked) and aborts async compaction
> if scheduling is needed. It also aborts any type of compaction when a fatal
> signal is pending.
> 
> The compact_checklock_irqsave() function is replaced with a slightly different
> compact_trylock_irqsave(). The biggest difference is that the function is not
> called at all if the lock is already held. The periodical need_resched()
> checking is left solely to compact_unlock_should_abort(). The lock contention
> avoidance for async compaction is achieved by the periodical unlock by
> compact_unlock_should_abort() and by using trylock in compact_trylock_irqsave()
> and aborting when trylock fails. Sync compaction does not use trylock.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held
  2014-07-16 13:48 ` [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
@ 2014-07-25 12:34   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Michal Nazarewicz, Naoya Horiguchi, Christoph Lameter,
	Rik van Riel, Joonsoo Kim, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:17PM +0200, Vlastimil Babka wrote:
> Compaction scanners try to lock zone locks as late as possible by checking
> many page or pageblock properties opportunistically without lock and skipping
> them if not unsuitable. For pages that pass the initial checks, some properties
> have to be checked again safely under lock. However, if the lock was already
> held from a previous iteration in the initial checks, the rechecks are
> unnecessary.
> 
> This patch therefore skips the rechecks when the lock was already held. This is
> now possible to do, since we don't (potentially) drop and reacquire the lock
> between the initial checks and the safe rechecks anymore.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Acked-by: David Rientjes <rientjes@google.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner
  2014-07-16 13:48 ` [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
@ 2014-07-25 12:35   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei, Minchan Kim

On Wed, Jul 16, 2014 at 03:48:18PM +0200, Vlastimil Babka wrote:
> Unlike the migration scanner, the free scanner remembers the beginning of the
> last scanned pageblock in cc->free_pfn. It might be therefore rescanning pages
> uselessly when called several times during single compaction. This might have
> been useful when pages were returned to the buddy allocator after a failed
> migration, but this is no longer the case.
> 
> This patch changes the meaning of cc->free_pfn so that if it points to a
> middle of a pageblock, that pageblock is scanned only from cc->free_pfn to the
> end. isolate_freepages_block() will record the pfn of the last page it looked
> at, which is then used to update cc->free_pfn.
> 
> In the mmtests stress-highalloc benchmark, this has resulted in lowering the
> ratio between pages scanned by both scanners, from 2.5 free pages per migrate
> page, to 2.25 free pages per migrate page, without affecting success rates.
> 
> With __GFP_NO_KSWAPD allocations, this appears to result in a worse ratio (2.1
> instead of 1.8), but page migration successes increased by 10%, so this could
> mean that more useful work can be done until need_resched() aborts this kind
> of compaction.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Minchan Kim <minchan@kernel.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner
  2014-07-16 13:48 ` [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
@ 2014-07-25 12:36   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:36 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:19PM +0200, Vlastimil Babka wrote:
> The migration scanner skips PageBuddy pages, but does not consider their order
> as checking page_order() is generally unsafe without holding the zone->lock,
> and acquiring the lock just for the check wouldn't be a good tradeoff.
> 
> Still, this could avoid some iterations over the rest of the buddy page, and
> if we are careful, the race window between PageBuddy() check and page_order()
> is small, and the worst thing that can happen is that we skip too much and miss
> some isolation candidates. This is not that bad, as compaction can already fail
> for many other reasons like parallel allocations, and those have much larger
> race window.
> 
> This patch therefore makes the migration scanner obtain the buddy page order
> and use it to skip the whole buddy page, if the order appears to be in the
> valid range.
> 
> It's important that the page_order() is read only once, so that the value used
> in the checks and in the pfn calculation is the same. But in theory the
> compiler can replace the local variable by multiple inlines of page_order().
> Therefore, the patch introduces page_order_unsafe() that uses ACCESS_ONCE to
> prevent this.
> 
> Testing with stress-highalloc from mmtests shows a 15% reduction in number of
> pages scanned by migration scanner. The reduction is >60% with __GFP_NO_KSWAPD
> allocations, along with success rates better by few percent.
> This change is also a prerequisite for a later patch which is detecting when
> a cc->order block of pages contains non-buddy pages that cannot be isolated,
> and the scanner should thus skip to the next block immediately.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Acked-by: Minchan Kim <minchan@kernel.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity
  2014-07-16 13:48 ` [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
@ 2014-07-25 12:37   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Christoph Lameter, Rik van Riel,
	Minchan Kim, Naoya Horiguchi, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:20PM +0200, Vlastimil Babka wrote:
> From: David Rientjes <rientjes@google.com>
> 
> The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
> ALLOC_CPUSET) that have separate semantics.
> 
> The function allocflags_to_migratetype() actually takes gfp flags, not alloc
> flags, and returns a migratetype.  Rename it to gfpflags_to_migratetype().
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control
  2014-07-16 13:48 ` [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
@ 2014-07-25 12:38   ` Mel Gorman
  0 siblings, 0 replies; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Minchan Kim, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:21PM +0200, Vlastimil Babka wrote:
> From: David Rientjes <rientjes@google.com>
> 
> struct compact_control currently converts the gfp mask to a migratetype, but we
> need the entire gfp mask in a follow-up patch.
> 
> Pass the entire gfp mask as part of struct compact_control.
> 
> Signed-off-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
> Acked-by: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage
  2014-07-16 13:48 ` [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
@ 2014-07-25 12:56   ` Mel Gorman
  2014-07-25 15:49     ` Vlastimil Babka
  0 siblings, 1 reply; 31+ messages in thread
From: Mel Gorman @ 2014-07-25 12:56 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On Wed, Jul 16, 2014 at 03:48:22PM +0200, Vlastimil Babka wrote:
> Compaction uses watermark checking to determine if it succeeded in creating
> a high-order free page. My testing has shown that this is quite racy and it
> can happen that watermark checking in compaction succeeds, and moments later
> the watermark checking in page allocation fails, even though the number of
> free pages has increased meanwhile.
> 
> It should be more reliable if direct compaction captured the high-order free
> page as soon as it detects it, and pass it back to allocation. This would
> also reduce the window for somebody else to allocate the free page.
> 
> Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
> a suitable high-order page immediately when it is made available"), but later
> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
> high-order page") due to a bug.
> 
> This patch differs from the previous attempt in two aspects:
> 
> 1) The previous patch scanned free lists to capture the page. In this patch,
>    only the cc->order aligned block that the migration scanner just finished
>    is considered, but only if pages were actually isolated for migration in
>    that block. Tracking cc->order aligned blocks also has benefits for the
>    following patch that skips blocks where non-migratable pages were found.
> 
> 2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
>    closely followed so that page capture mimics normal page allocation as much
>    as possible. This includes operations such as prep_new_page() and
>    page->pfmemalloc setting (that was missing in the previous attempt), zone
>    statistics are updated etc. Due to subtleties with IRQ disabling and
>    enabling this cannot be simply factored out from the normal allocation
>    functions without affecting the fastpath.
> 
> This patch has tripled compaction success rates (as recorded in vmstat) in
> stress-highalloc mmtests benchmark, although allocation success rates increased
> only by a few percent. Closer inspection shows that due to the racy watermark
> checking and lack of lru_add_drain(), the allocations that resulted in direct
> compactions were often failing, but later allocations succeeeded in the fast
> path. So the benefit of the patch to allocation success rates may be limited,
> but it improves the fairness in the sense that whoever spent the time
> compacting has a higher change of benefitting from it, and also can stop
> compacting sooner, as page availability is detected immediately. With better
> success detection, the contribution of compaction to high-order allocation
> success success rates is also no longer understated by the vmstats.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Nazarewicz <mina86@mina86.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: David Rientjes <rientjes@google.com>
> <SNIP>
> @@ -2279,14 +2307,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	count_vm_event(COMPACTSTALL);
>  
> -	/* Page migration frees to the PCP lists but we want merging */
> -	drain_pages(get_cpu());
> -	put_cpu();
> +	/* Did we capture a page? */
> +	if (page) {
> +		struct zone *zone;
> +		unsigned long flags;
> +		/*
> +		 * Mimic what buffered_rmqueue() does and capture_new_page()
> +		 * has not yet done.
> +		 */
> +		zone = page_zone(page);
> +
> +		local_irq_save(flags);
> +		zone_statistics(preferred_zone, zone, gfp_mask);
> +		local_irq_restore(flags);
>  
> -	page = get_page_from_freelist(gfp_mask, nodemask,
> -			order, zonelist, high_zoneidx,
> -			alloc_flags & ~ALLOC_NO_WATERMARKS,
> -			preferred_zone, classzone_idx, migratetype);
> +		VM_BUG_ON_PAGE(bad_range(zone, page), page);
> +		if (!prep_new_page(page, order, gfp_mask))
> +			/* This is normally done in get_page_from_freelist() */
> +			page->pfmemalloc = !!(alloc_flags &
> +					ALLOC_NO_WATERMARKS);
> +		else
> +			page = NULL;
> +	}
> +
> +	/* No capture but let's try allocating anyway */
> +	if (!page) {
> +		/*
> +		 * Page migration frees to the PCP lists but we want
> +		 * merging
> +		 */
> +		drain_pages(get_cpu());
> +		put_cpu();
> +

Would the attempted capture not drained already? Not a big deal so

Acked-by: Mel Gorman <mgorman@suse.de>


-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage
  2014-07-25 12:56   ` Mel Gorman
@ 2014-07-25 15:49     ` Vlastimil Babka
  0 siblings, 0 replies; 31+ messages in thread
From: Vlastimil Babka @ 2014-07-25 15:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, Andrew Morton, David Rientjes, linux-kernel,
	Minchan Kim, Joonsoo Kim, Michal Nazarewicz, Naoya Horiguchi,
	Christoph Lameter, Rik van Riel, Zhang Yanfei

On 07/25/2014 02:56 PM, Mel Gorman wrote:
> On Wed, Jul 16, 2014 at 03:48:22PM +0200, Vlastimil Babka wrote:
>> Compaction uses watermark checking to determine if it succeeded in creating
>> a high-order free page. My testing has shown that this is quite racy and it
>> can happen that watermark checking in compaction succeeds, and moments later
>> the watermark checking in page allocation fails, even though the number of
>> free pages has increased meanwhile.
>>
>> It should be more reliable if direct compaction captured the high-order free
>> page as soon as it detects it, and pass it back to allocation. This would
>> also reduce the window for somebody else to allocate the free page.
>>
>> Capture has been implemented before by 1fb3f8ca0e92 ("mm: compaction: capture
>> a suitable high-order page immediately when it is made available"), but later
>> reverted by 8fb74b9f ("mm: compaction: partially revert capture of suitable
>> high-order page") due to a bug.
>>
>> This patch differs from the previous attempt in two aspects:
>>
>> 1) The previous patch scanned free lists to capture the page. In this patch,
>>     only the cc->order aligned block that the migration scanner just finished
>>     is considered, but only if pages were actually isolated for migration in
>>     that block. Tracking cc->order aligned blocks also has benefits for the
>>     following patch that skips blocks where non-migratable pages were found.
>>
>> 2) The operations done in buffered_rmqueue() and get_page_from_freelist() are
>>     closely followed so that page capture mimics normal page allocation as much
>>     as possible. This includes operations such as prep_new_page() and
>>     page->pfmemalloc setting (that was missing in the previous attempt), zone
>>     statistics are updated etc. Due to subtleties with IRQ disabling and
>>     enabling this cannot be simply factored out from the normal allocation
>>     functions without affecting the fastpath.
>>
>> This patch has tripled compaction success rates (as recorded in vmstat) in
>> stress-highalloc mmtests benchmark, although allocation success rates increased
>> only by a few percent. Closer inspection shows that due to the racy watermark
>> checking and lack of lru_add_drain(), the allocations that resulted in direct
>> compactions were often failing, but later allocations succeeeded in the fast
>> path. So the benefit of the patch to allocation success rates may be limited,
>> but it improves the fairness in the sense that whoever spent the time
>> compacting has a higher change of benefitting from it, and also can stop
>> compacting sooner, as page availability is detected immediately. With better
>> success detection, the contribution of compaction to high-order allocation
>> success success rates is also no longer understated by the vmstats.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Minchan Kim <minchan@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> Cc: Michal Nazarewicz <mina86@mina86.com>
>> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>> Cc: Christoph Lameter <cl@linux.com>
>> Cc: Rik van Riel <riel@redhat.com>
>> Cc: David Rientjes <rientjes@google.com>
>> <SNIP>
>> @@ -2279,14 +2307,43 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>>   	 */
>>   	count_vm_event(COMPACTSTALL);
>>
>> -	/* Page migration frees to the PCP lists but we want merging */
>> -	drain_pages(get_cpu());
>> -	put_cpu();
>> +	/* Did we capture a page? */
>> +	if (page) {
>> +		struct zone *zone;
>> +		unsigned long flags;
>> +		/*
>> +		 * Mimic what buffered_rmqueue() does and capture_new_page()
>> +		 * has not yet done.
>> +		 */
>> +		zone = page_zone(page);
>> +
>> +		local_irq_save(flags);
>> +		zone_statistics(preferred_zone, zone, gfp_mask);
>> +		local_irq_restore(flags);
>>
>> -	page = get_page_from_freelist(gfp_mask, nodemask,
>> -			order, zonelist, high_zoneidx,
>> -			alloc_flags & ~ALLOC_NO_WATERMARKS,
>> -			preferred_zone, classzone_idx, migratetype);
>> +		VM_BUG_ON_PAGE(bad_range(zone, page), page);
>> +		if (!prep_new_page(page, order, gfp_mask))
>> +			/* This is normally done in get_page_from_freelist() */
>> +			page->pfmemalloc = !!(alloc_flags &
>> +					ALLOC_NO_WATERMARKS);
>> +		else
>> +			page = NULL;
>> +	}
>> +
>> +	/* No capture but let's try allocating anyway */
>> +	if (!page) {
>> +		/*
>> +		 * Page migration frees to the PCP lists but we want
>> +		 * merging
>> +		 */
>> +		drain_pages(get_cpu());
>> +		put_cpu();
>> +
>
> Would the attempted capture not drained already? Not a big deal so

Well, the compaction might have returned without capture, just because 
e.g. the watermarks became OK in the meanwhile. Although that should 
mean that drain is not needed. So it might be redundant, but really not 
hurt.

I've also realized that this pcp draining (also in capture) could be 
done for a single zone and it's useless to flush all zones. Memory 
isolation could perhaps also benefit from that. But I'll add that to a 
future series.

> Acked-by: Mel Gorman <mgorman@suse.de>

Thanks for all the acks. I personally believe 1-14 are ready. Patch 15 
could be deferred, the series can do without it just fine. Patch 15 does 
work in most cases and the RFC was only to make sure it's an improvement 
without subtle regressions. But it seems to affect fragmentation and 
needs more testing and perhaps tuning. I also hoped for some external 
testing, e.g. by DavidR who originally sent his version of the patch. So 
Patch 15 could be fine in -mm but not in mainline yet.

Vlastimil

>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2014-07-25 15:50 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-07-16 13:48 [PATCH V4 00/15] compaction: balancing overhead and success rates Vlastimil Babka
2014-07-16 13:48 ` [PATCH V4 01/15] mm, THP: don't hold mmap_sem in khugepaged when allocating THP Vlastimil Babka
2014-07-25 12:18   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 02/15] mm, compaction: defer each zone individually instead of preferred zone Vlastimil Babka
2014-07-25 12:20   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 03/15] mm, compaction: do not count compact_stall if all zones skipped compaction Vlastimil Babka
2014-07-25 12:22   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 04/15] mm, compaction: do not recheck suitable_migration_target under lock Vlastimil Babka
2014-07-25 12:23   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 05/15] mm, compaction: move pageblock checks up from isolate_migratepages_range() Vlastimil Babka
2014-07-25 12:28   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 06/15] mm, compaction: reduce zone checking frequency in the migration scanner Vlastimil Babka
2014-07-25 12:29   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 07/15] mm, compaction: khugepaged should not give up due to need_resched() Vlastimil Babka
2014-07-25 12:31   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 08/15] mm, compaction: periodically drop lock and restore IRQs in scanners Vlastimil Babka
2014-07-25 12:32   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 09/15] mm, compaction: skip rechecks when lock was already held Vlastimil Babka
2014-07-25 12:34   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 10/15] mm, compaction: remember position within pageblock in free pages scanner Vlastimil Babka
2014-07-25 12:35   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 11/15] mm, compaction: skip buddy pages by their order in the migrate scanner Vlastimil Babka
2014-07-25 12:36   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 12/15] mm: rename allocflags_to_migratetype for clarity Vlastimil Babka
2014-07-25 12:37   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 13/15] mm, compaction: pass gfp mask to compact_control Vlastimil Babka
2014-07-25 12:38   ` Mel Gorman
2014-07-16 13:48 ` [PATCH V4 14/15] mm, compaction: try to capture the just-created high-order freepage Vlastimil Babka
2014-07-25 12:56   ` Mel Gorman
2014-07-25 15:49     ` Vlastimil Babka
2014-07-16 13:48 ` [RFC PATCH V4 15/15] mm, compaction: do not migrate pages when that cannot satisfy page fault allocation Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).