* [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2
@ 2013-07-16 13:41 Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
` (9 more replies)
0 siblings, 10 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Hello everyone,
I got a bug report showing a problem with NUMA affinity under CPU
node bindings when THP is enabled and /proc/sys/vm/zone_reclaim_mode
is > 0.
When THP is disabled, zone_reclaim_mode set to 1 (or higher) tends to
allocate memory in the local node with quite good accuracy in the
presence of CPU node bindings (and weak or no memory bindings). With
THP enabled, however, memory tends to be spread to other nodes
erroneously.
I also found zone_reclaim_mode is quite unreliable in the presence of
multiple threads allocating memory at the same time from different
CPUs in the same node, even when THP is disabled and there's plenty of
clean cache to trivially reclaim.
The major problem with THP enabled is that zone_reclaim doesn't even
try to use compaction. The rest of the series then makes the whole
compaction process more reliable than it is now.
After setting zone_reclaim_mode to 1 and booting with
numa_zonelist_order=n, with this patchset applied I get this NUMA placement:
PID COMMAND CPUMASK TOTAL [ N0 N1 ]
7088 breakthp 0 2.1M [ 2.1M 0 ]
7089 breakthp 1 2.1M [ 2.1M 0 ]
7090 breakthp 2 2.1M [ 2.1M 0 ]
7091 breakthp 3 2.1M [ 2.1M 0 ]
7092 breakthp 6 2.1M [ 0 2.1M ]
7093 breakthp 7 2.1M [ 0 2.1M ]
7094 breakthp 8 2.1M [ 0 2.1M ]
7095 breakthp 9 2.1M [ 0 2.1M ]
7097 breakthp 0 2.1M [ 2.1M 0 ]
7098 breakthp 1 2.1M [ 2.1M 0 ]
7099 breakthp 2 2.1M [ 2.1M 0 ]
7100 breakthp 3 2.1M [ 2.1M 0 ]
7101 breakthp 6 2.1M [ 0 2.1M ]
7102 breakthp 7 2.1M [ 0 2.1M ]
7103 breakthp 8 2.1M [ 0 2.1M ]
7104 breakthp 9 2.1M [ 0 2.1M ]
PID COMMAND CPUMASK TOTAL [ N0 N1 ]
7106 usemem 0 1.00G [ 1.00G 0 ]
7107 usemem 1 1.00G [ 1.00G 0 ]
7108 usemem 2 1.00G [ 1.00G 0 ]
7109 usemem 3 1.00G [ 1.00G 0 ]
7110 usemem 6 1.00G [ 0 1.00G ]
7111 usemem 7 1.00G [ 0 1.00G ]
7112 usemem 8 1.00G [ 0 1.00G ]
7113 usemem 9 1.00G [ 0 1.00G ]
With current upstream, without the patchset, and still with
zone_reclaim_mode = 1 and booting with numa_zonelist_order=n:
PID COMMAND CPUMASK TOTAL [ N0 N1 ]
2950 breakthp 0 2.1M [ 2.1M 0 ]
2951 breakthp 1 2.1M [ 2.1M 0 ]
2952 breakthp 2 2.1M [ 2.1M 0 ]
2953 breakthp 3 2.1M [ 2.1M 0 ]
2954 breakthp 6 2.1M [ 0 2.1M ]
2955 breakthp 7 2.1M [ 0 2.1M ]
2956 breakthp 8 2.1M [ 0 2.1M ]
2957 breakthp 9 2.1M [ 0 2.1M ]
2966 breakthp 0 2.1M [ 2.0M 96K ]
2967 breakthp 1 2.1M [ 2.0M 96K ]
2968 breakthp 2 1.9M [ 1.9M 96K ]
2969 breakthp 3 2.1M [ 2.0M 96K ]
2970 breakthp 6 2.1M [ 228K 1.8M ]
2971 breakthp 7 2.1M [ 72K 2.0M ]
2972 breakthp 8 2.1M [ 60K 2.0M ]
2973 breakthp 9 2.1M [ 204K 1.9M ]
PID COMMAND CPUMASK TOTAL [ N0 N1 ]
3088 usemem 0 1.00G [ 856.2M 168.0M ]
3089 usemem 1 1.00G [ 860.2M 164.0M ]
3090 usemem 2 1.00G [ 860.2M 164.0M ]
3091 usemem 3 1.00G [ 858.2M 166.0M ]
3092 usemem 6 1.00G [ 248.0M 776.2M ]
3093 usemem 7 1.00G [ 248.0M 776.2M ]
3094 usemem 8 1.00G [ 250.0M 774.2M ]
3095 usemem 9 1.00G [ 246.0M 778.2M ]
Allocation speed seems a bit faster with the patchset applied, likely
because the increased NUMA locality, even during a simple
initialization, more than offsets the compaction costs.
The testcase always uses CPU bindings (half of the processes in one
node, half in the other node). It first fragments all memory
(breakthp) by breaking lots of hugepages with mremap (a rough
reconstruction of the idea is sketched after the layout below), and
then another process (usemem) allocates lots of memory, in turn
exercising the reliability of compaction with zone_reclaim_mode > 0.
Very few hugepages are available when usemem starts, but compaction
has a trivial time generating as many hugepages as needed without any
risk of failure.
The memory layout when usemem starts is like this:
4k page anon
4k page free
another 512-2 4k pages free
4k page anon
4k page free
another 512-2 4k pages free
[..]
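The breakthp source isn't included in this post, so the following is
only a rough reconstruction of the idea; the names, sizes and the
exact mremap trick are my assumptions, not the actual test code, and
it assumes THP is enabled (always or madvise). It faults in THP-backed
anonymous memory, then mremap()s one 4k page out of every 2M region
(which splits the hugepage while the moved page keeps its physical
frame) and munmap()s the rest, so every pageblock ends up with one
anon 4k page and 511 free 4k pages:

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)
#define PAGE_SIZE_4K	4096UL
#define NR_HPAGES	64UL

int main(void)
{
	unsigned long i;
	char *area, *hole;

	/* over-allocate so the region can be aligned to a 2M boundary */
	area = mmap(NULL, (NR_HPAGES + 1) * HPAGE_SIZE,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* compact destination for the 4k pages moved out of each THP */
	hole = mmap(NULL, NR_HPAGES * PAGE_SIZE_4K,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (area == MAP_FAILED || hole == MAP_FAILED)
		return 1;

	area = (char *)(((unsigned long)area + HPAGE_SIZE - 1) &
			~(HPAGE_SIZE - 1));
	madvise(area, NR_HPAGES * HPAGE_SIZE, MADV_HUGEPAGE);
	memset(area, 0xff, NR_HPAGES * HPAGE_SIZE);	/* fault in THPs */

	for (i = 0; i < NR_HPAGES; i++) {
		/*
		 * Moving a single 4k page out of the 2M mapping forces
		 * the kernel to split the hugepage; the moved page keeps
		 * its physical frame inside the former 2M block.
		 */
		if (mremap(area + i * HPAGE_SIZE, PAGE_SIZE_4K,
			   PAGE_SIZE_4K, MREMAP_MAYMOVE | MREMAP_FIXED,
			   hole + i * PAGE_SIZE_4K) == MAP_FAILED)
			return 1;
		/* free the remaining 511 4k pages of the former hugepage */
		munmap(area + i * HPAGE_SIZE + PAGE_SIZE_4K,
		       HPAGE_SIZE - PAGE_SIZE_4K);
	}

	pause();	/* keep the fragmented layout alive for usemem */
	return 0;
}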
If automatic NUMA balancing is enabled, this isn't as critical an
issue as without it (the placement will be fixed later at runtime by
THP NUMA migration faults), but it still looks worth optimizing the
initial placement, to avoid those migrations and to help short lived
computations (where automatic NUMA balancing can't help), especially
if the process has already been pinned to the CPUs of a node like in
the bug report I got.
The main behavior changes are the removal of compact_blockskip_flush
(with __reset_isolation_suitable now executed immediately when a
compaction pass completes) and the slightly increased amount of
hugepages required to meet the low/min watermarks. The rest of the
changes mostly apply to zone_reclaim_mode > 0 and don't affect the
default value of 0 (some large systems may boot with zone_reclaim_mode
set to 1 by default though, if the node distance is very high).
The heuristic that decides the default of numa_zonelist_order=z
should also be dropped, or at least improved (not addressed by this
patchset). The =z default makes no sense on my hardware for example,
and the coming roundrobin allocator from Johannes will defeat any
benefit of =z. Only a default of =n will make sense with the
roundrobin allocator.
The roundrobin allocator entirely depends on the lowmem_reserve logic
for its safety with regard to lowmem zones.
Andrea Arcangeli (10):
mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
mm: zone_reclaim: compaction: scan all memory with
/proc/sys/vm/compact_memory
mm: zone_reclaim: compaction: don't depend on kswapd to invoke
reset_isolation_suitable
mm: zone_reclaim: compaction: reset before initializing the scan
cursors
mm: compaction: don't require high order pages below min wmark
mm: zone_reclaim: compaction: increase the high order pages in the
watermarks
mm: zone_reclaim: compaction: export compact_zone_order()
mm: zone_reclaim: only run zone_reclaim in the fast path
mm: zone_reclaim: after a successful zone_reclaim check the min
watermark
mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
include/linux/compaction.h | 11 +++--
include/linux/mmzone.h | 9 ----
include/linux/swap.h | 8 ++-
mm/compaction.c | 40 ++++++++-------
mm/page_alloc.c | 48 ++++++++++++++++--
mm/vmscan.c | 121 ++++++++++++++++++++++++++++++++++-----------
6 files changed, 170 insertions(+), 67 deletions(-)
* [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 23:45 ` Wanpeng Li
2013-07-16 23:45 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
` (8 subsequent siblings)
9 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
ZONE_RECLAIM_LOCKED breaks zone_reclaim_mode=1: if more than one
thread allocates memory at the same time, the second caller finds the
flag already set, returns ZONE_RECLAIM_NOSCAN, and the allocation
falls back prematurely to remote NUMA nodes even when there's plenty
of clean cache to reclaim in the local node.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/mmzone.h | 6 ------
mm/vmscan.c | 4 ----
2 files changed, 10 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index af4a3b7..9534a9a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -496,7 +496,6 @@ struct zone {
} ____cacheline_internodealigned_in_smp;
typedef enum {
- ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
ZONE_CONGESTED, /* zone has many dirty pages backed by
* a congested BDI
@@ -540,11 +539,6 @@ static inline int zone_is_reclaim_writeback(const struct zone *zone)
return test_bit(ZONE_WRITEBACK, &zone->flags);
}
-static inline int zone_is_reclaim_locked(const struct zone *zone)
-{
- return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
-}
-
static inline int zone_is_oom_locked(const struct zone *zone)
{
return test_bit(ZONE_OOM_LOCKED, &zone->flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..042fdcd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3595,11 +3595,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return ZONE_RECLAIM_NOSCAN;
- if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
- return ZONE_RECLAIM_NOSCAN;
-
ret = __zone_reclaim(zone, gfp_mask, order);
- zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
if (!ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
* [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 23:29 ` Wanpeng Li
2013-07-16 23:29 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
` (7 subsequent siblings)
9 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
mm/compaction.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..cac9594 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1136,12 +1136,14 @@ void compact_pgdat(pg_data_t *pgdat, int order)
static void compact_node(int nid)
{
+ pg_data_t *pgdat = NODE_DATA(nid);
struct compact_control cc = {
.order = -1,
.sync = true,
};
- __compact_pgdat(NODE_DATA(nid), &cc);
+ reset_isolation_suitable(pgdat);
+ __compact_pgdat(pgdat, &cc);
}
/* Compact all nodes in the system */
* [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 23:32 ` Wanpeng Li
2013-07-16 23:32 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
` (6 subsequent siblings)
9 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
If kswapd never needs to run (only __GFP_NO_KSWAPD allocations and
plenty of free memory), compaction is otherwise crippled: it stops
running for a while after the free/isolation cursors meet. After that,
allocations can fail for a full cycle of compaction_deferred, until
compaction_restarting finally resets it again.
Stopping compaction for a full cycle after the cursors meet, even if
it never failed and it's not going to fail, doesn't make sense.
We already throttle compaction CPU utilization using
defer_compaction. We shouldn't prevent compaction from running after
each pass completes when the cursors meet, unless it failed.
This makes direct compaction functional again. The throttling of
direct compaction is still controlled by the defer_compaction logic.
kswapd still won't risk resetting compaction, and it will wait for
direct compaction to do so. Not sure if this is ideal, but it at least
decreases the risk of kswapd doing too much work. kswapd will only run
one pass of compaction until some allocation invokes compaction again.
This decreased reliability of compaction was introduced by commit
62997027ca5b3d4618198ed8b1aba40b61b1137b.
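(For scale, and going by my reading of the existing defer logic rather
than anything changed here: compaction_deferred() skips up to
1 << compact_defer_shift consecutive attempts and defer_compaction()
doubles that shift on every failure up to the "do not skip compaction
more than 64 times" cap, so a full cycle can mean up to 64 consecutive
high order allocation attempts failing before compaction_restarting()
finally triggers.)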
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
include/linux/compaction.h | 5 -----
include/linux/mmzone.h | 3 ---
mm/compaction.c | 15 ++++++---------
mm/page_alloc.c | 1 -
mm/vmscan.c | 8 --------
5 files changed, 6 insertions(+), 26 deletions(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..fc3f266 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -24,7 +24,6 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
bool sync, bool *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
-extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
/* Do not skip compaction more than 64 times */
@@ -84,10 +83,6 @@ static inline void compact_pgdat(pg_data_t *pgdat, int order)
{
}
-static inline void reset_isolation_suitable(pg_data_t *pgdat)
-{
-}
-
static inline unsigned long compaction_suitable(struct zone *zone, int order)
{
return COMPACT_SKIPPED;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9534a9a..e738871 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,9 +354,6 @@ struct zone {
spinlock_t lock;
int all_unreclaimable; /* All pages pinned */
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
- /* Set to true when the PG_migrate_skip bits should be cleared */
- bool compact_blockskip_flush;
-
/* pfns where compaction scanners should start */
unsigned long compact_cached_free_pfn;
unsigned long compact_cached_migrate_pfn;
diff --git a/mm/compaction.c b/mm/compaction.c
index cac9594..525baaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -91,7 +91,6 @@ static void __reset_isolation_suitable(struct zone *zone)
zone->compact_cached_migrate_pfn = start_pfn;
zone->compact_cached_free_pfn = end_pfn;
- zone->compact_blockskip_flush = false;
/* Walk the zone and mark every pageblock as suitable for isolation */
for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
@@ -110,7 +109,7 @@ static void __reset_isolation_suitable(struct zone *zone)
}
}
-void reset_isolation_suitable(pg_data_t *pgdat)
+static void reset_isolation_suitable(pg_data_t *pgdat)
{
int zoneid;
@@ -120,8 +119,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
continue;
/* Only flush if a full compaction finished recently */
- if (zone->compact_blockskip_flush)
- __reset_isolation_suitable(zone);
+ __reset_isolation_suitable(zone);
}
}
@@ -828,13 +826,12 @@ static int compact_finished(struct zone *zone,
/* Compaction run completes if the migrate and free scanner meet */
if (cc->free_pfn <= cc->migrate_pfn) {
/*
- * Mark that the PG_migrate_skip information should be cleared
- * by kswapd when it goes to sleep. kswapd does not set the
- * flag itself as the decision to be clear should be directly
- * based on an allocation request.
+ * Clear the PG_migrate_skip information. kswapd does
+ * not clear it as the decision to be clear should be
+ * directly based on an allocation request.
*/
if (!current_is_kswapd())
- zone->compact_blockskip_flush = true;
+ __reset_isolation_suitable(zone);
return COMPACT_COMPLETE;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..db8fb66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2190,7 +2190,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
alloc_flags & ~ALLOC_NO_WATERMARKS,
preferred_zone, migratetype);
if (page) {
- preferred_zone->compact_blockskip_flush = false;
preferred_zone->compact_considered = 0;
preferred_zone->compact_defer_shift = 0;
if (order >= preferred_zone->compact_order_failed)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 042fdcd..85a0071 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3091,14 +3091,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
*/
set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
- /*
- * Compaction records what page blocks it recently failed to
- * isolate pages from and skips them in the future scanning.
- * When kswapd is going to sleep, it is reasonable to assume
- * that pages and compaction may succeed so reset the cache.
- */
- reset_isolation_suitable(pgdat);
-
if (!kthread_should_stop())
schedule();
* [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (2 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 23:31 ` Wanpeng Li
2013-07-16 23:31 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 05/10] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
` (5 subsequent siblings)
9 siblings, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Correct the location where we reset the scan cursors, otherwise the
first iteration of compaction (after restarting it) will only do a
partial scan.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rafael Aquini <aquini@redhat.com>
---
mm/compaction.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 525baaa..afaf692 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
}
/*
+ * Clear pageblock skip if there were failures recently and
+ * compaction is about to be retried after being
+ * deferred. kswapd does not do this reset and it will wait
+ * direct compaction to do so either when the cursor meets
+ * after one compaction pass is complete or if compaction is
+ * restarted after being deferred for a while.
+ */
+ if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
+ __reset_isolation_suitable(zone);
+
+ /*
* Setup to move all movable pages to the end of the zone. Used cached
* information on where the scanners should start but check that it
* is initialised by ensuring the values are within zone boundaries.
@@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
zone->compact_cached_migrate_pfn = cc->migrate_pfn;
}
- /*
- * Clear pageblock skip if there were failures recently and compaction
- * is about to be retried after being deferred. kswapd does not do
- * this reset as it'll reset the cached information when going to sleep.
- */
- if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
- __reset_isolation_suitable(zone);
-
migrate_prep_local();
while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
* [PATCH 05/10] mm: compaction: don't require high order pages below min wmark
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (3 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-17 8:13 ` Hush Bensen
2013-07-16 13:41 ` [PATCH 06/10] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
` (4 subsequent siblings)
9 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
The min wmark should be satisfied with just 1 hugepage, and the other
wmarks should be adjusted accordingly. We need the low wmark check to
succeed if there's a significant amount of order 0 pages, but we don't
need plenty of high order pages, because the PF_MEMALLOC paths don't
require those. Creating a ton of high order pages that cannot be
allocated by the high order allocation paths (no PF_MEMALLOC) is quite
wasteful, because they can be split into lower order pages before
anybody has a chance to allocate them.
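As a worked example with made up numbers (not taken from any real
machine): say WMARK_MIN is 1024 pages and WMARK_LOW is 1280 pages, and
we check an order-9 allocation. Against the min watermark the
per-order loop now starts from min = 1024 - 1024 = 0, so a single free
hugepage is enough to pass. Against the low watermark it starts from
1280 - 1024 = 256, so the higher orders only have to cover (a
progressively halved fraction of) that 256 page margin instead of the
full 1280, which is the modest surplus compaction is asked to
maintain.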
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/page_alloc.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index db8fb66..d94503d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1643,6 +1643,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
if (free_pages - free_cma <= min + lowmem_reserve)
return false;
+ if (!order)
+ return true;
+
+ /*
+ * Don't require any high order page under the min
+ * wmark. Invoking compaction to create lots of high order
+ * pages below the min wmark is wasteful because those
+ * hugepages cannot be allocated without PF_MEMALLOC and the
+ * PF_MEMALLOC paths must not depend on high order allocations
+ * to succeed.
+ */
+ min = mark - z->watermark[WMARK_MIN];
+ WARN_ON(min < 0);
+ if (alloc_flags & ALLOC_HIGH)
+ min -= min / 2;
+ if (alloc_flags & ALLOC_HARDER)
+ min -= min / 4;
for (o = 0; o < order; o++) {
/* At the next order, this order's pages become unavailable */
free_pages -= z->free_area[o].nr_free << o;
* [PATCH 06/10] mm: zone_reclaim: compaction: increase the high order pages in the watermarks
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (4 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 05/10] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 07/10] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
` (3 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Prevent the scaling down from reducing the watermarks too much.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d94503d..f5ea1147 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1665,7 +1665,8 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
free_pages -= z->free_area[o].nr_free << o;
/* Require fewer higher order pages to be free */
- min >>= 1;
+ if (o < (pageblock_order >> 2))
+ min >>= 1;
if (free_pages <= min)
return false;
* [PATCH 07/10] mm: zone_reclaim: compaction: export compact_zone_order()
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (5 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 06/10] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 08/10] mm: zone_reclaim: only run zone_reclaim in the fast path Andrea Arcangeli
` (2 subsequent siblings)
9 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Needed by zone_reclaim_mode compaction-awareness.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
include/linux/compaction.h | 10 ++++++++++
mm/compaction.c | 2 +-
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index fc3f266..e953acb 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -23,6 +23,9 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
int order, gfp_t gfp_mask, nodemask_t *mask,
bool sync, bool *contended);
+extern unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask,
+ bool sync, bool *contended);
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern unsigned long compaction_suitable(struct zone *zone, int order);
@@ -79,6 +82,13 @@ static inline unsigned long try_to_compact_pages(struct zonelist *zonelist,
return COMPACT_CONTINUE;
}
+static inline unsigned long compact_zone_order(struct zone *zone,
+ int order, gfp_t gfp_mask,
+ bool sync, bool *contended)
+{
+ return COMPACT_CONTINUE;
+}
+
static inline void compact_pgdat(pg_data_t *pgdat, int order)
{
}
diff --git a/mm/compaction.c b/mm/compaction.c
index afaf692..a1154c8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1008,7 +1008,7 @@ out:
return ret;
}
-static unsigned long compact_zone_order(struct zone *zone,
+unsigned long compact_zone_order(struct zone *zone,
int order, gfp_t gfp_mask,
bool sync, bool *contended)
{
* [PATCH 08/10] mm: zone_reclaim: only run zone_reclaim in the fast path
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (6 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 07/10] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 09/10] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
9 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
Only run zone_reclaim from the fast path (ALLOC_WMARK_LOW): don't
repeat direct reclaim when we enter the regular reclaim slowpath and
check against the min watermark.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5ea1147..0519181 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1941,7 +1941,8 @@ zonelist_scan:
}
if (zone_reclaim_mode == 0 ||
- !zone_allows_reclaim(preferred_zone, zone))
+ !zone_allows_reclaim(preferred_zone, zone) ||
+ !(alloc_flags & ALLOC_WMARK_LOW))
goto this_zone_full;
/*
* [PATCH 09/10] mm: zone_reclaim: after a successful zone_reclaim check the min watermark
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (7 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 08/10] mm: zone_reclaim: only run zone_reclaim in the fast path Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
9 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
If we're in the fast path and zone_reclaim() succeeded, it means we
freed enough memory and we can allow the allocation down to the min
watermark, leaving some margin against concurrent allocations from
other CPUs or interrupts.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
mm/page_alloc.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0519181..3690c2e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1961,8 +1961,26 @@ zonelist_scan:
case ZONE_RECLAIM_FULL:
/* scanned but unreclaimable */
continue;
+ case ZONE_RECLAIM_SUCCESS:
+ /*
+ * If we successfully reclaimed
+ * enough, allow allocations up to the
+ * min watermark (instead of stopping
+ * at "mark"). This provides some more
+ * margin against parallel
+ * allocations. Using the min
+ * watermark doesn't alter when we
+ * wakeup kswapd. It also doesn't
+ * alter the synchronous direct
+ * reclaim behavior of zone_reclaim()
+ * that will still be invoked at the
+ * next pass if we're still below the
+ * low watermark (even if kswapd isn't
+ * woken).
+ */
+ mark = min_wmark_pages(zone);
+ /* Fall through */
default:
- /* did we reclaim enough */
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
goto try_this_zone;
* [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
` (8 preceding siblings ...)
2013-07-16 13:41 ` [PATCH 09/10] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
@ 2013-07-16 13:41 ` Andrea Arcangeli
2013-07-17 8:20 ` Hush Bensen
9 siblings, 1 reply; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-16 13:41 UTC (permalink / raw)
To: linux-mm
Cc: Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
This adds compaction to zone_reclaim so that enabling THP won't
decrease NUMA locality with /proc/sys/vm/zone_reclaim_mode > 0.
It is important to boot with numa_zonelist_order=n (n means nodes) to
get more accurate NUMA locality if there are multiple zones per node.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
include/linux/swap.h | 8 +++-
mm/page_alloc.c | 4 +-
mm/vmscan.c | 111 ++++++++++++++++++++++++++++++++++++++++++---------
3 files changed, 102 insertions(+), 21 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index d95cde5..d076a54 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -289,10 +289,14 @@ extern unsigned long vm_total_pages;
extern int zone_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
+extern int zone_reclaim(struct zone *, struct zone *, gfp_t, unsigned int,
+ unsigned long, int, int);
#else
#define zone_reclaim_mode 0
-static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
+static inline int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+ gfp_t mask, unsigned int order,
+ unsigned long mark, int classzone_idx,
+ int alloc_flags)
{
return 0;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3690c2e..4101906 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1953,7 +1953,9 @@ zonelist_scan:
!zlc_zone_worth_trying(zonelist, z, allowednodes))
continue;
- ret = zone_reclaim(zone, gfp_mask, order);
+ ret = zone_reclaim(preferred_zone, zone, gfp_mask,
+ order,
+ mark, classzone_idx, alloc_flags);
switch (ret) {
case ZONE_RECLAIM_NOSCAN:
/* did not scan */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 85a0071..80ee2b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3488,6 +3488,24 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
unsigned long nr_slab_pages0, nr_slab_pages1;
cond_resched();
+
+ /*
+ * Zone reclaim reclaims unmapped file backed pages and
+ * slab pages if we are over the defined limits.
+ *
+ * A small portion of unmapped file backed pages is needed for
+ * file I/O otherwise pages read by file I/O will be immediately
+ * thrown out if the zone is overallocated. So we do not reclaim
+ * if less than a specified percentage of the zone is used by
+ * unmapped file backed pages.
+ */
+ if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+ zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
+ return ZONE_RECLAIM_FULL;
+
+ if (zone->all_unreclaimable)
+ return ZONE_RECLAIM_FULL;
+
/*
* We need to be able to allocate from the reserves for RECLAIM_SWAP
* and we also need to be able to write out pages for RECLAIM_WRITE
@@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return sc.nr_reclaimed >= nr_pages;
}
-int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
+static int zone_reclaim_compact(struct zone *preferred_zone,
+ struct zone *zone, gfp_t gfp_mask,
+ unsigned int order,
+ bool sync_compaction,
+ bool *need_compaction)
{
- int node_id;
- int ret;
+ bool contended;
- /*
- * Zone reclaim reclaims unmapped file backed pages and
- * slab pages if we are over the defined limits.
- *
- * A small portion of unmapped file backed pages is needed for
- * file I/O otherwise pages read by file I/O will be immediately
- * thrown out if the zone is overallocated. So we do not reclaim
- * if less than a specified percentage of the zone is used by
- * unmapped file backed pages.
- */
- if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
- zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
- return ZONE_RECLAIM_FULL;
+ if (compaction_deferred(preferred_zone, order) ||
+ !order ||
+ (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
+ need_compaction = false;
+ return COMPACT_SKIPPED;
+ }
- if (zone->all_unreclaimable)
- return ZONE_RECLAIM_FULL;
+ *need_compaction = true;
+ return compact_zone_order(zone, order,
+ gfp_mask,
+ sync_compaction,
+ &contended);
+}
+
+int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
+ gfp_t gfp_mask, unsigned int order,
+ unsigned long mark, int classzone_idx, int alloc_flags)
+{
+ int node_id;
+ int ret, c_ret;
+ bool sync_compaction = false, need_compaction = false;
/*
* Do not scan if the allocation should not be delayed.
@@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
if (node_state(node_id, N_CPU) && node_id != numa_node_id())
return ZONE_RECLAIM_NOSCAN;
+repeat_compaction:
+ /*
+ * If this allocation may be satisfied by memory compaction,
+ * run compaction before reclaim.
+ */
+ c_ret = zone_reclaim_compact(preferred_zone,
+ zone, gfp_mask, order,
+ sync_compaction,
+ &need_compaction);
+ if (need_compaction &&
+ c_ret != COMPACT_SKIPPED &&
+ zone_watermark_ok(zone, order, mark,
+ classzone_idx,
+ alloc_flags)) {
+#ifdef CONFIG_COMPACTION
+ zone->compact_considered = 0;
+ zone->compact_defer_shift = 0;
+#endif
+ return ZONE_RECLAIM_SUCCESS;
+ }
+
+ /*
+ * reclaim if compaction failed because not enough memory was
+ * available or if compaction didn't run (order 0) or didn't
+ * succeed.
+ */
ret = __zone_reclaim(zone, gfp_mask, order);
+ if (ret == ZONE_RECLAIM_SUCCESS) {
+ if (zone_watermark_ok(zone, order, mark,
+ classzone_idx,
+ alloc_flags))
+ return ZONE_RECLAIM_SUCCESS;
+
+ /*
+ * If compaction run but it was skipped and reclaim was
+ * successful keep going.
+ */
+ if (need_compaction && c_ret == COMPACT_SKIPPED) {
+ /*
+ * If it's ok to wait for I/O we can as well run sync
+ * compaction
+ */
+ sync_compaction = !!(zone_reclaim_mode &
+ (RECLAIM_WRITE|RECLAIM_SWAP));
+ cond_resched();
+ goto repeat_compaction;
+ }
+ }
+ if (need_compaction)
+ defer_compaction(preferred_zone, order);
if (!ret)
count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
* Re: [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory
2013-07-16 13:41 ` [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2013-07-16 23:29 ` Wanpeng Li
2013-07-16 23:29 ` Wanpeng Li
1 sibling, 0 replies; 23+ messages in thread
From: Wanpeng Li @ 2013-07-16 23:29 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
On Tue, Jul 16, 2013 at 03:41:46PM +0200, Andrea Arcangeli wrote:
>Reset the stats so /proc/sys/vm/compact_memory will scan all memory.
>
>Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Acked-by: Rafael Aquini <aquini@redhat.com>
>Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>---
> mm/compaction.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
>diff --git a/mm/compaction.c b/mm/compaction.c
>index 05ccb4c..cac9594 100644
>--- a/mm/compaction.c
>+++ b/mm/compaction.c
>@@ -1136,12 +1136,14 @@ void compact_pgdat(pg_data_t *pgdat, int order)
>
> static void compact_node(int nid)
> {
>+ pg_data_t *pgdat = NODE_DATA(nid);
> struct compact_control cc = {
> .order = -1,
> .sync = true,
> };
>
>- __compact_pgdat(NODE_DATA(nid), &cc);
>+ reset_isolation_suitable(pgdat);
>+ __compact_pgdat(pgdat, &cc);
> }
>
> /* Compact all nodes in the system */
>
* Re: [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors
2013-07-16 13:41 ` [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
@ 2013-07-16 23:31 ` Wanpeng Li
2013-07-16 23:31 ` Wanpeng Li
1 sibling, 0 replies; 23+ messages in thread
From: Wanpeng Li @ 2013-07-16 23:31 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
On Tue, Jul 16, 2013 at 03:41:48PM +0200, Andrea Arcangeli wrote:
>Correct the location where we reset the scan cursors, otherwise the
>first iteration of compaction (after restarting it) will only do a
>partial scan.
>
>Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Acked-by: Mel Gorman <mgorman@suse.de>
>Acked-by: Rafael Aquini <aquini@redhat.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>---
> mm/compaction.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)
>
>diff --git a/mm/compaction.c b/mm/compaction.c
>index 525baaa..afaf692 100644
>--- a/mm/compaction.c
>+++ b/mm/compaction.c
>@@ -934,6 +934,17 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> }
>
> /*
>+ * Clear pageblock skip if there were failures recently and
>+ * compaction is about to be retried after being
>+ * deferred. kswapd does not do this reset and it will wait
>+ * direct compaction to do so either when the cursor meets
>+ * after one compaction pass is complete or if compaction is
>+ * restarted after being deferred for a while.
>+ */
>+ if ((compaction_restarting(zone, cc->order)) && !current_is_kswapd())
>+ __reset_isolation_suitable(zone);
>+
>+ /*
> * Setup to move all movable pages to the end of the zone. Used cached
> * information on where the scanners should start but check that it
> * is initialised by ensuring the values are within zone boundaries.
>@@ -949,14 +960,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> zone->compact_cached_migrate_pfn = cc->migrate_pfn;
> }
>
>- /*
>- * Clear pageblock skip if there were failures recently and compaction
>- * is about to be retried after being deferred. kswapd does not do
>- * this reset as it'll reset the cached information when going to sleep.
>- */
>- if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
>- __reset_isolation_suitable(zone);
>-
> migrate_prep_local();
>
> while ((ret = compact_finished(zone, cc)) == COMPACT_CONTINUE) {
>
* Re: [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable
2013-07-16 13:41 ` [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
@ 2013-07-16 23:32 ` Wanpeng Li
2013-07-16 23:32 ` Wanpeng Li
1 sibling, 0 replies; 23+ messages in thread
From: Wanpeng Li @ 2013-07-16 23:32 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
On Tue, Jul 16, 2013 at 03:41:47PM +0200, Andrea Arcangeli wrote:
>If kswapd never need to run (only __GFP_NO_KSWAPD allocations and
>plenty of free memory) compaction is otherwise crippled down and stops
>running for a while after the free/isolation cursor meets. After that
>allocation can fail for a full cycle of compaction_deferred, until
>compaction_restarting finally reset it again.
>
>Stopping compaction for a full cycle after the cursor meets, even if
>it never failed and it's not going to fail, doesn't make sense.
>
>We already throttle compaction CPU utilization using
>defer_compaction. We shouldn't prevent compaction to run after each
>pass completes when the cursor meets, unless it failed.
>
>This makes direct compaction functional again. The throttling of
>direct compaction is still controlled by the defer_compaction
>logic.
>
>kswapd still won't risk to reset compaction, and it will wait direct
>compaction to do so. Not sure if this is ideal but it at least
>decreases the risk of kswapd doing too much work. kswapd will only run
>one pass of compaction until some allocation invokes compaction again.
>
>This decreased reliability of compaction was introduced in commit
>62997027ca5b3d4618198ed8b1aba40b61b1137b .
>
>Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Acked-by: Rafael Aquini <aquini@redhat.com>
>Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>---
> include/linux/compaction.h | 5 -----
> include/linux/mmzone.h | 3 ---
> mm/compaction.c | 15 ++++++---------
> mm/page_alloc.c | 1 -
> mm/vmscan.c | 8 --------
> 5 files changed, 6 insertions(+), 26 deletions(-)
>
>diff --git a/include/linux/compaction.h b/include/linux/compaction.h
>index 091d72e..fc3f266 100644
>--- a/include/linux/compaction.h
>+++ b/include/linux/compaction.h
>@@ -24,7 +24,6 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
> int order, gfp_t gfp_mask, nodemask_t *mask,
> bool sync, bool *contended);
> extern void compact_pgdat(pg_data_t *pgdat, int order);
>-extern void reset_isolation_suitable(pg_data_t *pgdat);
> extern unsigned long compaction_suitable(struct zone *zone, int order);
>
> /* Do not skip compaction more than 64 times */
>@@ -84,10 +83,6 @@ static inline void compact_pgdat(pg_data_t *pgdat, int order)
> {
> }
>
>-static inline void reset_isolation_suitable(pg_data_t *pgdat)
>-{
>-}
>-
> static inline unsigned long compaction_suitable(struct zone *zone, int order)
> {
> return COMPACT_SKIPPED;
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index 9534a9a..e738871 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -354,9 +354,6 @@ struct zone {
> spinlock_t lock;
> int all_unreclaimable; /* All pages pinned */
> #if defined CONFIG_COMPACTION || defined CONFIG_CMA
>- /* Set to true when the PG_migrate_skip bits should be cleared */
>- bool compact_blockskip_flush;
>-
> /* pfns where compaction scanners should start */
> unsigned long compact_cached_free_pfn;
> unsigned long compact_cached_migrate_pfn;
>diff --git a/mm/compaction.c b/mm/compaction.c
>index cac9594..525baaa 100644
>--- a/mm/compaction.c
>+++ b/mm/compaction.c
>@@ -91,7 +91,6 @@ static void __reset_isolation_suitable(struct zone *zone)
>
> zone->compact_cached_migrate_pfn = start_pfn;
> zone->compact_cached_free_pfn = end_pfn;
>- zone->compact_blockskip_flush = false;
>
> /* Walk the zone and mark every pageblock as suitable for isolation */
> for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
>@@ -110,7 +109,7 @@ static void __reset_isolation_suitable(struct zone *zone)
> }
> }
>
>-void reset_isolation_suitable(pg_data_t *pgdat)
>+static void reset_isolation_suitable(pg_data_t *pgdat)
> {
> int zoneid;
>
>@@ -120,8 +119,7 @@ void reset_isolation_suitable(pg_data_t *pgdat)
> continue;
>
> /* Only flush if a full compaction finished recently */
>- if (zone->compact_blockskip_flush)
>- __reset_isolation_suitable(zone);
>+ __reset_isolation_suitable(zone);
> }
> }
>
>@@ -828,13 +826,12 @@ static int compact_finished(struct zone *zone,
> /* Compaction run completes if the migrate and free scanner meet */
> if (cc->free_pfn <= cc->migrate_pfn) {
> /*
>- * Mark that the PG_migrate_skip information should be cleared
>- * by kswapd when it goes to sleep. kswapd does not set the
>- * flag itself as the decision to be clear should be directly
>- * based on an allocation request.
>+ * Clear the PG_migrate_skip information. kswapd does
>+ * not clear it as the decision to be clear should be
>+ * directly based on an allocation request.
> */
> if (!current_is_kswapd())
>- zone->compact_blockskip_flush = true;
>+ __reset_isolation_suitable(zone);
>
> return COMPACT_COMPLETE;
> }
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index b100255..db8fb66 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -2190,7 +2190,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
> alloc_flags & ~ALLOC_NO_WATERMARKS,
> preferred_zone, migratetype);
> if (page) {
>- preferred_zone->compact_blockskip_flush = false;
> preferred_zone->compact_considered = 0;
> preferred_zone->compact_defer_shift = 0;
> if (order >= preferred_zone->compact_order_failed)
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 042fdcd..85a0071 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -3091,14 +3091,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
> */
> set_pgdat_percpu_threshold(pgdat, calculate_normal_threshold);
>
>- /*
>- * Compaction records what page blocks it recently failed to
>- * isolate pages from and skips them in the future scanning.
>- * When kswapd is going to sleep, it is reasonable to assume
>- * that pages and compaction may succeed so reset the cache.
>- */
>- reset_isolation_suitable(pgdat);
>-
> if (!kthread_should_stop())
> schedule();
>
>
* Re: [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED
2013-07-16 13:41 ` [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
2013-07-16 23:45 ` Wanpeng Li
@ 2013-07-16 23:45 ` Wanpeng Li
1 sibling, 0 replies; 23+ messages in thread
From: Wanpeng Li @ 2013-07-16 23:45 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini, Hush Bensen
On Tue, Jul 16, 2013 at 03:41:45PM +0200, Andrea Arcangeli wrote:
>Zone reclaim locked breaks zone_reclaim_mode=1. If more than one
>thread allocates memory at the same time, it forces a premature
>allocation into remote NUMA nodes even when there's plenty of clean
>cache to reclaim in the local nodes.
>
>Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>Reviewed-by: Rik van Riel <riel@redhat.com>
>Acked-by: Rafael Aquini <aquini@redhat.com>
>Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
>---
> include/linux/mmzone.h | 6 ------
> mm/vmscan.c | 4 ----
> 2 files changed, 10 deletions(-)
>
>diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>index af4a3b7..9534a9a 100644
>--- a/include/linux/mmzone.h
>+++ b/include/linux/mmzone.h
>@@ -496,7 +496,6 @@ struct zone {
> } ____cacheline_internodealigned_in_smp;
>
> typedef enum {
>- ZONE_RECLAIM_LOCKED, /* prevents concurrent reclaim */
> ZONE_OOM_LOCKED, /* zone is in OOM killer zonelist */
> ZONE_CONGESTED, /* zone has many dirty pages backed by
> * a congested BDI
>@@ -540,11 +539,6 @@ static inline int zone_is_reclaim_writeback(const struct zone *zone)
> return test_bit(ZONE_WRITEBACK, &zone->flags);
> }
>
>-static inline int zone_is_reclaim_locked(const struct zone *zone)
>-{
>- return test_bit(ZONE_RECLAIM_LOCKED, &zone->flags);
>-}
>-
> static inline int zone_is_oom_locked(const struct zone *zone)
> {
> return test_bit(ZONE_OOM_LOCKED, &zone->flags);
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index 2cff0d4..042fdcd 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -3595,11 +3595,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> return ZONE_RECLAIM_NOSCAN;
>
>- if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
>- return ZONE_RECLAIM_NOSCAN;
>-
> ret = __zone_reclaim(zone, gfp_mask, order);
>- zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
>
> if (!ret)
> count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
>
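For illustration, a minimal sketch of why the removed flag hurt concurrent
allocators; the userspace atomics and names are stand-ins, not the kernel
code. When one thread holds the per-zone "lock", every other caller gets
ZONE_RECLAIM_NOSCAN back and the allocator falls straight through to the
next, possibly remote, zone:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { ZONE_RECLAIM_NOSCAN = -2, ZONE_RECLAIM_SUCCESS = 1 };

static atomic_flag zone_reclaim_locked = ATOMIC_FLAG_INIT;

/* Behaviour before this patch: only one caller may reclaim at a time. */
static int zone_reclaim_old(void)
{
	if (atomic_flag_test_and_set(&zone_reclaim_locked))
		return ZONE_RECLAIM_NOSCAN;	/* concurrent caller gives up */
	/* ... reclaim clean page cache from the local zone here ... */
	atomic_flag_clear(&zone_reclaim_locked);
	return ZONE_RECLAIM_SUCCESS;
}

int main(void)
{
	/* Simulate a second thread arriving while the first one reclaims. */
	atomic_flag_test_and_set(&zone_reclaim_locked);
	if (zone_reclaim_old() == ZONE_RECLAIM_NOSCAN)
		printf("second allocator skips local reclaim and spills to a remote node\n");
	return 0;
}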
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 05/10] mm: compaction: don't require high order pages below min wmark
2013-07-16 13:41 ` [PATCH 05/10] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
@ 2013-07-17 8:13 ` Hush Bensen
2013-07-17 17:15 ` Andrea Arcangeli
0 siblings, 1 reply; 23+ messages in thread
From: Hush Bensen @ 2013-07-17 8:13 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini
On 07/16/2013 09:41 AM, Andrea Arcangeli wrote:
> The min wmark should be satisfied with just 1 hugepage. And the other
> wmarks should be adjusted accordingly. We need the low wmark check to
> succeed if there's a significant amount of 0 order pages, but we don't
> need plenty of high order pages because the PF_MEMALLOC paths don't
> require those. Creating a ton of high order pages that cannot be
> allocated by the high order allocation paths (no PF_MEMALLOC) is quite
> wasteful because they can be split into lower order pages before
> anybody has a chance to allocate them.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> mm/page_alloc.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index db8fb66..d94503d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1643,6 +1643,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
>
> if (free_pages - free_cma <= min + lowmem_reserve)
> return false;
> + if (!order)
> + return true;
> +
> + /*
> + * Don't require any high order page under the min
> + * wmark. Invoking compaction to create lots of high order
> + * pages below the min wmark is wasteful because those
> + * hugepages cannot be allocated without PF_MEMALLOC and the
> + * PF_MEMALLOC paths must not depend on high order allocations
> + * to succeed.
> + */
> + min = mark - z->watermark[WMARK_MIN];
> + WARN_ON(min < 0);
> + if (alloc_flags & ALLOC_HIGH)
> + min -= min / 2;
> + if (alloc_flags & ALLOC_HARDER)
> + min -= min / 4;
__zone_watermark_ok already applies these adjustments to mark, why do them again?
> for (o = 0; o < order; o++) {
> /* At the next order, this order's pages become unavailable */
> free_pages -= z->free_area[o].nr_free << o;
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
2013-07-16 13:41 ` [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
@ 2013-07-17 8:20 ` Hush Bensen
2013-07-17 17:20 ` Andrea Arcangeli
0 siblings, 1 reply; 23+ messages in thread
From: Hush Bensen @ 2013-07-17 8:20 UTC (permalink / raw)
To: Andrea Arcangeli
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini
On 07/16/2013 09:41 AM, Andrea Arcangeli wrote:
> This adds compaction to zone_reclaim so that enabling THP won't decrease
> NUMA locality when /proc/sys/vm/zone_reclaim_mode > 0.
>
> It is important to boot with numa_zonelist_order=n (n means nodes) to
> get more accurate NUMA locality if there are multiple zones per node.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
> include/linux/swap.h | 8 +++-
> mm/page_alloc.c | 4 +-
> mm/vmscan.c | 111 ++++++++++++++++++++++++++++++++++++++++++---------
> 3 files changed, 102 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index d95cde5..d076a54 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -289,10 +289,14 @@ extern unsigned long vm_total_pages;
> extern int zone_reclaim_mode;
> extern int sysctl_min_unmapped_ratio;
> extern int sysctl_min_slab_ratio;
> -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
> +extern int zone_reclaim(struct zone *, struct zone *, gfp_t, unsigned int,
> + unsigned long, int, int);
> #else
> #define zone_reclaim_mode 0
> -static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
> +static inline int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
> + gfp_t mask, unsigned int order,
> + unsigned long mark, int classzone_idx,
> + int alloc_flags)
> {
> return 0;
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3690c2e..4101906 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1953,7 +1953,9 @@ zonelist_scan:
> !zlc_zone_worth_trying(zonelist, z, allowednodes))
> continue;
>
> - ret = zone_reclaim(zone, gfp_mask, order);
> + ret = zone_reclaim(preferred_zone, zone, gfp_mask,
> + order,
> + mark, classzone_idx, alloc_flags);
> switch (ret) {
> case ZONE_RECLAIM_NOSCAN:
> /* did not scan */
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 85a0071..80ee2b2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3488,6 +3488,24 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> unsigned long nr_slab_pages0, nr_slab_pages1;
>
> cond_resched();
> +
> + /*
> + * Zone reclaim reclaims unmapped file backed pages and
> + * slab pages if we are over the defined limits.
> + *
> + * A small portion of unmapped file backed pages is needed for
> + * file I/O otherwise pages read by file I/O will be immediately
> + * thrown out if the zone is overallocated. So we do not reclaim
> + * if less than a specified percentage of the zone is used by
> + * unmapped file backed pages.
> + */
> + if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> + zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> + return ZONE_RECLAIM_FULL;
> +
> + if (zone->all_unreclaimable)
> + return ZONE_RECLAIM_FULL;
> +
> /*
> * We need to be able to allocate from the reserves for RECLAIM_SWAP
> * and we also need to be able to write out pages for RECLAIM_WRITE
> @@ -3549,27 +3567,35 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> return sc.nr_reclaimed >= nr_pages;
> }
>
> -int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> +static int zone_reclaim_compact(struct zone *preferred_zone,
> + struct zone *zone, gfp_t gfp_mask,
> + unsigned int order,
> + bool sync_compaction,
> + bool *need_compaction)
> {
> - int node_id;
> - int ret;
> + bool contended;
>
> - /*
> - * Zone reclaim reclaims unmapped file backed pages and
> - * slab pages if we are over the defined limits.
> - *
> - * A small portion of unmapped file backed pages is needed for
> - * file I/O otherwise pages read by file I/O will be immediately
> - * thrown out if the zone is overallocated. So we do not reclaim
> - * if less than a specified percentage of the zone is used by
> - * unmapped file backed pages.
> - */
> - if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
> - zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
> - return ZONE_RECLAIM_FULL;
> + if (compaction_deferred(preferred_zone, order) ||
> + !order ||
> + (gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO)) {
> + need_compaction = false;
> + return COMPACT_SKIPPED;
> + }
>
> - if (zone->all_unreclaimable)
> - return ZONE_RECLAIM_FULL;
> + *need_compaction = true;
> + return compact_zone_order(zone, order,
> + gfp_mask,
> + sync_compaction,
> + &contended);
> +}
> +
> +int zone_reclaim(struct zone *preferred_zone, struct zone *zone,
> + gfp_t gfp_mask, unsigned int order,
> + unsigned long mark, int classzone_idx, int alloc_flags)
> +{
> + int node_id;
> + int ret, c_ret;
> + bool sync_compaction = false, need_compaction = false;
>
> /*
> * Do not scan if the allocation should not be delayed.
> @@ -3587,7 +3613,56 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> if (node_state(node_id, N_CPU) && node_id != numa_node_id())
> return ZONE_RECLAIM_NOSCAN;
>
> +repeat_compaction:
> + /*
> + * If this allocation may be satisfied by memory compaction,
> + * run compaction before reclaim.
> + */
> + c_ret = zone_reclaim_compact(preferred_zone,
> + zone, gfp_mask, order,
> + sync_compaction,
> + &need_compaction);
> + if (need_compaction &&
> + c_ret != COMPACT_SKIPPED &&
> + zone_watermark_ok(zone, order, mark,
> + classzone_idx,
> + alloc_flags)) {
> +#ifdef CONFIG_COMPACTION
> + zone->compact_considered = 0;
> + zone->compact_defer_shift = 0;
> +#endif
> + return ZONE_RECLAIM_SUCCESS;
> + }
> +
> + /*
> + * reclaim if compaction failed because not enough memory was
> + * available or if compaction didn't run (order 0) or didn't
> + * succeed.
> + */
> ret = __zone_reclaim(zone, gfp_mask, order);
> + if (ret == ZONE_RECLAIM_SUCCESS) {
> + if (zone_watermark_ok(zone, order, mark,
> + classzone_idx,
> + alloc_flags))
> + return ZONE_RECLAIM_SUCCESS;
> +
> + /*
> + * If compaction run but it was skipped and reclaim was
> + * successful keep going.
> + */
> + if (need_compaction && c_ret == COMPACT_SKIPPED) {
> + /*
> + * If it's ok to wait for I/O we can as well run sync
> + * compaction
> + */
> + sync_compaction = !!(zone_reclaim_mode &
> + (RECLAIM_WRITE|RECLAIM_SWAP));
> + cond_resched();
> + goto repeat_compaction;
> + }
> + }
> + if (need_compaction)
> + defer_compaction(preferred_zone, order);
>
> if (!ret)
> count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
This work should be done in the slow path; does it mean the fast path is
not fast any more?
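For readers trying to follow the new control flow, here is a rough,
self-contained sketch of the ordering the patch introduces (invented helper
names and a simplified loop guard; the real code uses compact_zone_order(),
__zone_reclaim() and zone_watermark_ok()): compaction runs first, reclaim
second, and compaction is retried synchronously only when reclaim succeeded
but compaction had been skipped:

#include <stdbool.h>
#include <stdio.h>

enum { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_COMPLETE };
enum { RECLAIM_FAIL, RECLAIM_OK };

/* Stubs standing in for compact_zone_order(), __zone_reclaim() and
 * zone_watermark_ok(); the return values are hypothetical. */
static int  run_compaction(bool sync) { (void)sync; return COMPACT_SKIPPED; }
static int  run_reclaim(void)         { return RECLAIM_OK; }
static bool watermark_ok(void)        { return false; }

/* Simplified shape of the new zone_reclaim() flow. */
static bool zone_reclaim_sketch(void)
{
	bool sync = false;

	for (;;) {
		int c = run_compaction(sync);

		if (c != COMPACT_SKIPPED && watermark_ok())
			return true;		/* compaction alone was enough */

		if (run_reclaim() == RECLAIM_OK) {
			if (watermark_ok())
				return true;	/* reclaim alone was enough */
			if (c == COMPACT_SKIPPED && !sync) {
				sync = true;	/* retry compaction, now sync */
				continue;
			}
		}
		return false;			/* defer_compaction() in the patch */
	}
}

int main(void)
{
	printf("zone_reclaim sketch succeeded: %d\n", zone_reclaim_sketch());
	return 0;
}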
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 05/10] mm: compaction: don't require high order pages below min wmark
2013-07-17 8:13 ` Hush Bensen
@ 2013-07-17 17:15 ` Andrea Arcangeli
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-17 17:15 UTC (permalink / raw)
To: Hush Bensen
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini
On Wed, Jul 17, 2013 at 04:13:04AM -0400, Hush Bensen wrote:
> On 07/16/2013 09:41 AM, Andrea Arcangeli wrote:
> > The min wmark should be satisfied with just 1 hugepage. And the other
> > wmarks should be adjusted accordingly. We need the low wmark check to
> > succeed if there's a significant amount of 0 order pages, but we don't
> > need plenty of high order pages because the PF_MEMALLOC paths don't
> > require those. Creating a ton of high order pages that cannot be
> > allocated by the high order allocation paths (no PF_MEMALLOC) is quite
> > wasteful because they can be split into lower order pages before
> > anybody has a chance to allocate them.
> >
> > Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> > ---
> > mm/page_alloc.c | 17 +++++++++++++++++
> > 1 file changed, 17 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index db8fb66..d94503d 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1643,6 +1643,23 @@ static bool __zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> >
> > if (free_pages - free_cma <= min + lowmem_reserve)
> > return false;
> > + if (!order)
> > + return true;
> > +
> > + /*
> > + * Don't require any high order page under the min
> > + * wmark. Invoking compaction to create lots of high order
> > + * pages below the min wmark is wasteful because those
> > + * hugepages cannot be allocated without PF_MEMALLOC and the
> > + * PF_MEMALLOC paths must not depend on high order allocations
> > + * to succeed.
> > + */
> > + min = mark - z->watermark[WMARK_MIN];
> > + WARN_ON(min < 0);
> > + if (alloc_flags & ALLOC_HIGH)
> > + min -= min / 2;
> > + if (alloc_flags & ALLOC_HARDER)
> > + min -= min / 4;
>
> __zone_watermark_ok already applies these adjustments to mark, why do them again?
min changed, so I repeat the operation on the new min.
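A small userspace sketch of what is being repeated, with made-up numbers and
a simplified scale_min() helper standing in for the ALLOC_HIGH/ALLOC_HARDER
adjustments (not the kernel code): the adjustments are applied once to the
full mark for the order-0 check, and once more to the smaller "mark minus
the min wmark" value that the high-order loop then uses.

#include <stdbool.h>
#include <stdio.h>

#define WMARK_MIN_PAGES 512UL	/* hypothetical min watermark */

static unsigned long scale_min(unsigned long min, bool alloc_high,
			       bool alloc_harder)
{
	if (alloc_high)
		min -= min / 2;
	if (alloc_harder)
		min -= min / 4;
	return min;
}

int main(void)
{
	unsigned long mark = 1024;	/* hypothetical low watermark */
	unsigned long min_order0 = scale_min(mark, true, false);
	unsigned long min_highorder = scale_min(mark - WMARK_MIN_PAGES,
						true, false);

	printf("order-0 threshold: %lu, high-order threshold: %lu\n",
	       min_order0, min_highorder);
	return 0;
}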
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode
2013-07-17 8:20 ` Hush Bensen
@ 2013-07-17 17:20 ` Andrea Arcangeli
0 siblings, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2013-07-17 17:20 UTC (permalink / raw)
To: Hush Bensen
Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins, Richard Davies,
Shaohua Li, Rafael Aquini
On Wed, Jul 17, 2013 at 04:20:27AM -0400, Hush Bensen wrote:
> This work should be done in the slow path; does it mean the fast path is
> not fast any more?
The changes are in zone_reclaim(); I don't think zone_reclaim() should be
considered a fast path, since it is only invoked to reclaim memory. The
fast path is when the free pages are above the low wmark and we don't
need to call zone_reclaim() at all.
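A tiny sketch of that split, with stub functions in place of
zone_watermark_ok() and zone_reclaim() (illustration only, hypothetical
return values): the slow path is only entered once the watermark check has
already failed, so the common case never pays for reclaim or compaction.

#include <stdbool.h>
#include <stdio.h>

static bool zone_watermark_ok_stub(void) { return true; }	/* plenty of free pages */
static int  zone_reclaim_stub(void)      { return 1; }

static bool try_zone(void)
{
	if (zone_watermark_ok_stub())
		return true;			/* fast path: allocate directly */
	return zone_reclaim_stub() > 0;		/* slow path: reclaim/compact first */
}

int main(void)
{
	printf("allocated without entering zone_reclaim: %d\n", try_zone());
	return 0;
}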
Thanks,
Andrea
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads: [~2013-07-17 17:20 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-07-16 13:41 [PATCH 00/10] adding compaction to zone_reclaim_mode > 0 #2 Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 01/10] mm: zone_reclaim: remove ZONE_RECLAIM_LOCKED Andrea Arcangeli
2013-07-16 23:45 ` Wanpeng Li
2013-07-16 23:45 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 02/10] mm: zone_reclaim: compaction: scan all memory with /proc/sys/vm/compact_memory Andrea Arcangeli
2013-07-16 23:29 ` Wanpeng Li
2013-07-16 23:29 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 03/10] mm: zone_reclaim: compaction: don't depend on kswapd to invoke reset_isolation_suitable Andrea Arcangeli
2013-07-16 23:32 ` Wanpeng Li
2013-07-16 23:32 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 04/10] mm: zone_reclaim: compaction: reset before initializing the scan cursors Andrea Arcangeli
2013-07-16 23:31 ` Wanpeng Li
2013-07-16 23:31 ` Wanpeng Li
2013-07-16 13:41 ` [PATCH 05/10] mm: compaction: don't require high order pages below min wmark Andrea Arcangeli
2013-07-17 8:13 ` Hush Bensen
2013-07-17 17:15 ` Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 06/10] mm: zone_reclaim: compaction: increase the high order pages in the watermarks Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 07/10] mm: zone_reclaim: compaction: export compact_zone_order() Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 08/10] mm: zone_reclaim: only run zone_reclaim in the fast path Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 09/10] mm: zone_reclaim: after a successful zone_reclaim check the min watermark Andrea Arcangeli
2013-07-16 13:41 ` [PATCH 10/10] mm: zone_reclaim: compaction: add compaction to zone_reclaim_mode Andrea Arcangeli
2013-07-17 8:20 ` Hush Bensen
2013-07-17 17:20 ` Andrea Arcangeli
This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).