diff for duplicates of <56E02A33.2040106@suse.cz>

diff --git a/a/1.txt b/N1/1.txt
index 70c1cb8..898ebfc 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -44,3 +44,639 @@ review, but functionality-wise the first patch leaves things somewhat
 weird without the third patch.
 
 ----8<----
+>From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001
+From: Vlastimil Babka <vbabka@suse.cz>
+Date: Wed, 9 Mar 2016 12:45:24 +0100
+Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up
+ kcompactd
+
+Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim
+and compaction to attempt making memory allocation of given order
+available.  The details differ from direct reclaim e.g.  in having high
+watermark as a goal.  The code involved in kswapd's reclaim/compaction
+decisions has evolved to be quite complex.  Testing reveals that it
+doesn't actually work in at least one scenario, and closer inspection
+suggests that it could be greatly simplified without compromising on the
+goal (make high-order page available) or efficiency (don't reclaim too
+much).  The simplification relies on doing all compaction in kcompactd,
+which is simply woken up when high watermarks are reached by kswapd's
+reclaim.
+
+The scenario where kswapd compaction doesn't work was found with mmtests
+test stress-highalloc configured to attempt order-9 allocations without
+direct reclaim, just waking up kswapd.  There was no compaction attempt
+from kswapd during the whole test.  Some added instrumentation shows what
+happens:
+
+- balance_pgdat() sets end_zone to Normal, as it's not balanced
+- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it
+   cannot reclaim anything, so sc.nr_reclaimed is 0
+- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it
+   merely checks if high watermarks were reached for base pages. This is true,
+   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as
+   compaction_suitable() returned COMPACT_SKIPPED
+- even though the pgdat_needs_compaction flag wasn't set to false, no
+   compaction happens due to the condition sc.nr_reclaimed > nr_attempted
+   being false (as 0 < 99)
+- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0
+   pgdat_balanced() is false as only the small zone DMA appears balanced
+   (curiously in that check, watermark appears OK and compaction_suitable()
+   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)
+
+Now, even if it was decided that reclaim shouldn't be attempted on the DMA
+zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
+nr_attempted=0) is also false.  The condition really should use >= as the
+comment suggests.  Then there is a mismatch in the check for setting
+pgdat_needs_compaction to false using low watermark, while the rest uses
+high watermark, and who knows what other subtlety.  Hopefully this
+demonstrates that this is unsustainable.
+
+Luckily we can simplify this a lot.  The reclaim/compaction decisions make
+sense for direct reclaim scenario, but in kswapd, our primary goal is to
+reach high watermark in order-0 pages.  Afterwards we can attempt
+compaction just once.  Unlike direct reclaim, we don't reclaim extra pages
+(over the high watermark), the current code already disallows it for good
+reasons.
+
+After this patch, we simply wake up kcompactd to process the pgdat, after
+we have either succeeded or failed to reach the high watermarks in kswapd,
+which goes to sleep.  We pass kswapd's order and classzone_idx, so
+kcompactd can apply the same criteria to determine which zones are worth
+compacting.  Note that we use the classzone_idx from wakeup_kswapd(), not
+balanced_classzone_idx which can include higher zones that kswapd tried to
+balance too, but didn't consider them in pgdat_balanced().
+
+Since kswapd now cannot create high-order pages itself, we need to adjust
+how it determines the zones to be balanced.  The key element here is
+adding a "highorder" parameter to zone_balanced, which, when set to false,
+makes it consider only order-0 watermark instead of the desired higher
+order (this was done previously by kswapd_shrink_zone(), but not
+elsewhere).  This false is passed for example in pgdat_balanced().
+Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
+kcompactd are woken up for a high-order allocation failure.
+
+The last thing is to decide what to do with pageblock_skip bitmap handling.
+Compaction maintains a pageblock_skip bitmap to record pageblocks where
+isolation recently failed.  This bitmap can be reset in three ways:
+
+1) direct compaction is restarting after going through the full deferred cycle
+
+2) kswapd goes to sleep, and some other direct compaction has previously
+    finished scanning the whole zone and set zone->compact_blockskip_flush.
+    Note that a successful direct compaction clears this flag.
+
+3) compaction was invoked manually via trigger in /proc
+
+The case 2) is somewhat fuzzy to begin with, but after introducing
+kcompactd we should update it.  The check for direct compaction in 1), and
+to set the flush flag in 2) use current_is_kswapd(), which doesn't work
+for kcompactd.  Thus, this patch adds bool direct_compaction to
+compact_control to use in 2).  For the case 1) we remove the check
+completely - unlike the former kswapd compaction, kcompactd does use the
+deferred compaction functionality, so flushing tied to restarting from
+deferred compaction makes sense here.
+
+Note that when kswapd goes to sleep, kcompactd is woken up, so it will see
+the flushed pageblock_skip bits.  This is different from when the former
+kswapd compaction observed the bits and I believe it makes more sense.
+Kcompactd can afford to be more thorough than a direct compaction trying
+to limit allocation latency, or kswapd whose primary goal is to reclaim.
+
+For testing, I used stress-highalloc configured to do order-9 allocations
+with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on
+kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
+phases 1 and 2 work as usual):
+
+stress-highalloc
+                        4.5-rc1+before          4.5-rc1+after
+                             -nodirect              -nodirect
+Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
+Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
+Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
+Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
+Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
+Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
+Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
+Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
+Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
+
+User                          3166.67        3181.09
+System                        1153.37        1158.25
+Elapsed                       1768.53        1799.37
+
+                            4.5-rc1+before   4.5-rc1+after
+                                 -nodirect    -nodirect
+Direct pages scanned                32938        32797
+Kswapd pages scanned              2183166      2202613
+Kswapd pages reclaimed            2152359      2143524
+Direct pages reclaimed              32735        32545
+Percentage direct scans                1%           1%
+THP fault alloc                       579          612
+THP collapse alloc                    304          316
+THP splits                              0            0
+THP fault fallback                    793          778
+THP collapse fail                      11           16
+Compaction stalls                    1013         1007
+Compaction success                     92           67
+Compaction failures                   920          939
+Page migrate success               238457       721374
+Page migrate failure                23021        23469
+Compaction pages isolated          504695      1479924
+Compaction migrate scanned         661390      8812554
+Compaction free scanned          13476658     84327916
+Compaction cost                       262          838
+
+After this patch we see improvements in allocation success rate
+(especially for phase 3) along with increased compaction activity.  The
+compaction stalls (direct compaction) in the interfering kernel builds
+(probably THP's) also decreased somewhat thanks to kcompactd activity, yet
+THP alloc successes improved a bit.
+
+Note that elapsed and user time aren't so useful for this benchmark,
+because of the background interference being unpredictable.  It's just to
+quickly spot some major unexpected differences.  System time is somewhat
+more useful and that didn't increase.
+
+Also (after adjusting mmtests' ftrace monitor):
+
+Time kswapd awake               2547781     2269241
+Time kcompactd awake                  0      119253
+Time direct compacting           939937      557649
+Time kswapd compacting                0           0
+Time kcompactd compacting             0      119099
+
+The decrease of overall time spent compacting appears to not match the
+increased compaction stats.  I suspect the tasks get rescheduled and since
+the ftrace monitor doesn't see that, the reported time is wall time, not
+CPU time.  But arguably direct compactors care about overall latency
+anyway, whether busy compacting or waiting for CPU doesn't matter.  And
+that latency seems to have almost halved.
+
+It's also interesting how much time kswapd spent awake just going through
+all the priorities and failing to even try compacting, over and over.
+
+We can also configure stress-highalloc to perform both direct
+reclaim/compaction and wakeup kswapd/kcompactd, by using
+GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
+
+stress-highalloc
+                        4.5-rc1+before         4.5-rc1+after
+                               -direct               -direct
+Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
+Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
+Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
+Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
+Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
+Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
+Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
+Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
+Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
+
+User                          3344.73       3246.04
+System                        1194.24       1172.29
+Elapsed                       1838.04       1836.76
+
+                            4.5-rc1+before  4.5-rc1+after
+                                   -direct     -direct
+Direct pages scanned               125146      120966
+Kswapd pages scanned              2119757     2135012
+Kswapd pages reclaimed            2073183     2108388
+Direct pages reclaimed             124909      120577
+Percentage direct scans                5%          5%
+THP fault alloc                       599         652
+THP collapse alloc                    323         354
+THP splits                              0           0
+THP fault fallback                    806         793
+THP collapse fail                      17          16
+Compaction stalls                    2457        2025
+Compaction success                    906         518
+Compaction failures                  1551        1507
+Page migrate success              2031423     2360608
+Page migrate failure                32845       40852
+Compaction pages isolated         4129761     4802025
+Compaction migrate scanned       11996712    21750613
+Compaction free scanned         214970969   344372001
+Compaction cost                      2271        2694
+
+In this scenario, this patch doesn't change the overall success rate as
+direct compaction already tries all it can.  There's however significant
+reduction in direct compaction stalls (that is, the number of allocations
+that went into direct compaction).  The number of successes (i.e.  direct
+compaction stalls that ended up with successful allocation) is reduced by
+the same number.  This means the offload to kcompactd is working as
+expected, and direct compaction is reduced either due to detecting
+contention, or compaction deferred by kcompactd.  In the previous version
+of this patchset there was some apparent reduction of success rate, but
+the changes in this version (such as using sync compaction only), new
+baseline kernel, and/or averaging results from 5 executions (my bet), made
+this go away.
+
+Ftrace-based stats seem to roughly agree:
+
+Time kswapd awake               2532984     2326824
+Time kcompactd awake                  0      257916
+Time direct compacting           864839      735130
+Time kswapd compacting                0           0
+Time kcompactd compacting             0      257585
+
+Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
+Cc: Andrea Arcangeli <aarcange@redhat.com>
+Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
+Cc: Rik van Riel <riel@redhat.com>
+Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
+Cc: Mel Gorman <mgorman@techsingularity.net>
+Cc: David Rientjes <rientjes@google.com>
+Cc: Michal Hocko <mhocko@suse.com>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+---
+ mm/compaction.c |  10 ++--
+ mm/internal.h   |   1 +
+ mm/vmscan.c     | 147 ++++++++++++++++++--------------------------------------
+ 3 files changed, 54 insertions(+), 104 deletions(-)
+
+diff --git a/mm/compaction.c b/mm/compaction.c
+index 5b2bfbaa821a..ccf97b02b85f 100644
+--- a/mm/compaction.c
++++ b/mm/compaction.c
+@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,
+ 
+ 		/*
+ 		 * Mark that the PG_migrate_skip information should be cleared
+-		 * by kswapd when it goes to sleep. kswapd does not set the
++		 * by kswapd when it goes to sleep. kcompactd does not set the
+ 		 * flag itself as the decision to be clear should be directly
+ 		 * based on an allocation request.
+ 		 */
+-		if (!current_is_kswapd())
++		if (cc->direct_compaction)
+ 			zone->compact_blockskip_flush = true;
+ 
+ 		return COMPACT_COMPLETE;
+@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
+ 
+ 	/*
+ 	 * Clear pageblock skip if there were failures recently and compaction
+-	 * is about to be retried after being deferred. kswapd does not do
+-	 * this reset as it'll reset the cached information when going to sleep.
++	 * is about to be retried after being deferred.
+ 	 */
+-	if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
++	if (compaction_restarting(zone, cc->order))
+ 		__reset_isolation_suitable(zone);
+ 
+ 	/*
+@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,
+ 		.mode = mode,
+ 		.alloc_flags = alloc_flags,
+ 		.classzone_idx = classzone_idx,
++		.direct_compaction = true,
+ 	};
+ 	INIT_LIST_HEAD(&cc.freepages);
+ 	INIT_LIST_HEAD(&cc.migratepages);
+diff --git a/mm/internal.h b/mm/internal.h
+index 17ae0b52534b..013a786fa37f 100644
+--- a/mm/internal.h
++++ b/mm/internal.h
+@@ -181,6 +181,7 @@ struct compact_control {
+ 	unsigned long last_migrated_pfn;/* Not yet flushed page being freed */
+ 	enum migrate_mode mode;		/* Async or sync migration mode */
+ 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
++	bool direct_compaction;		/* False from kcompactd or /proc/... */
+ 	int order;			/* order a direct compactor needs */
+ 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
+ 	const int alloc_flags;		/* alloc flags of a direct compactor */
+diff --git a/mm/vmscan.c b/mm/vmscan.c
+index c67df4831565..23bc7e643ad8 100644
+--- a/mm/vmscan.c
++++ b/mm/vmscan.c
+@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
+ 	} while (memcg);
+ }
+ 
+-static bool zone_balanced(struct zone *zone, int order,
+-			  unsigned long balance_gap, int classzone_idx)
++static bool zone_balanced(struct zone *zone, int order, bool highorder,
++			unsigned long balance_gap, int classzone_idx)
+ {
+-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
+-				    balance_gap, classzone_idx))
+-		return false;
++	unsigned long mark = high_wmark_pages(zone) + balance_gap;
+ 
+-	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
+-				order, 0, classzone_idx) == COMPACT_SKIPPED)
+-		return false;
++	/*
++	 * When checking from pgdat_balanced(), kswapd should stop and sleep
++	 * when it reaches the high order-0 watermark and let kcompactd take
++	 * over. Other callers such as wakeup_kswapd() want to determine the
++	 * true high-order watermark.
++	 */
++	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
++		mark += (1UL << order);
++		order = 0;
++	}
+ 
+-	return true;
++	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
+ }
+ 
+ /*
+@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
+ 			continue;
+ 		}
+ 
+-		if (zone_balanced(zone, order, 0, i))
++		if (zone_balanced(zone, order, false, 0, i))
+ 			balanced_pages += zone->managed_pages;
+ 		else if (!order)
+ 			return false;
+@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
+  */
+ static bool kswapd_shrink_zone(struct zone *zone,
+ 			       int classzone_idx,
+-			       struct scan_control *sc,
+-			       unsigned long *nr_attempted)
++			       struct scan_control *sc)
+ {
+-	int testorder = sc->order;
+ 	unsigned long balance_gap;
+ 	bool lowmem_pressure;
+ 
+@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
+ 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
+ 
+ 	/*
+-	 * Kswapd reclaims only single pages with compaction enabled. Trying
+-	 * too hard to reclaim until contiguous free pages have become
+-	 * available can hurt performance by evicting too much useful data
+-	 * from memory. Do not reclaim more than needed for compaction.
+-	 */
+-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+-			compaction_suitable(zone, sc->order, 0, classzone_idx)
+-							!= COMPACT_SKIPPED)
+-		testorder = 0;
+-
+-	/*
+ 	 * We put equal pressure on every zone, unless one zone has way too
+ 	 * many pages free already. The "too many pages" is defined as the
+ 	 * high wmark plus a "gap" where the gap is either the low
+@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
+ 	 * reclaim is necessary
+ 	 */
+ 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
+-	if (!lowmem_pressure && zone_balanced(zone, testorder,
++	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
+ 						balance_gap, classzone_idx))
+ 		return true;
+ 
+ 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
+ 
+-	/* Account for the number of pages attempted to reclaim */
+-	*nr_attempted += sc->nr_to_reclaim;
+-
+ 	clear_bit(ZONE_WRITEBACK, &zone->flags);
+ 
+ 	/*
+@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
+ 	 * waits.
+ 	 */
+ 	if (zone_reclaimable(zone) &&
+-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
++	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
+ 		clear_bit(ZONE_CONGESTED, &zone->flags);
+ 		clear_bit(ZONE_DIRTY, &zone->flags);
+ 	}
+@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
+  * For kswapd, balance_pgdat() will work across all this node's zones until
+  * they are all at high_wmark_pages(zone).
+  *
+- * Returns the final order kswapd was reclaiming at
++ * Returns the highest zone idx kswapd was reclaiming at
+  *
+  * There is special handling here for zones which are full of pinned pages.
+  * This can happen if the pages are all mlocked, or if they are all used by
+@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
+  * interoperates with the page allocator fallback scheme to ensure that aging
+  * of pages is balanced across the zones.
+  */
+-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+-							int *classzone_idx)
++static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
+ {
+ 	int i;
+ 	int end_zone = 0;	/* Inclusive.  0 = ZONE_DMA */
+@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ 	count_vm_event(PAGEOUTRUN);
+ 
+ 	do {
+-		unsigned long nr_attempted = 0;
+ 		bool raise_priority = true;
+-		bool pgdat_needs_compaction = (order > 0);
+ 
+ 		sc.nr_reclaimed = 0;
+ 
+@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ 				break;
+ 			}
+ 
+-			if (!zone_balanced(zone, order, 0, 0)) {
++			if (!zone_balanced(zone, order, false, 0, 0)) {
+ 				end_zone = i;
+ 				break;
+ 			} else {
+@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ 		if (i < 0)
+ 			goto out;
+ 
+-		for (i = 0; i <= end_zone; i++) {
+-			struct zone *zone = pgdat->node_zones + i;
+-
+-			if (!populated_zone(zone))
+-				continue;
+-
+-			/*
+-			 * If any zone is currently balanced then kswapd will
+-			 * not call compaction as it is expected that the
+-			 * necessary pages are already available.
+-			 */
+-			if (pgdat_needs_compaction &&
+-					zone_watermark_ok(zone, order,
+-						low_wmark_pages(zone),
+-						*classzone_idx, 0))
+-				pgdat_needs_compaction = false;
+-		}
+-
+ 		/*
+ 		 * If we're getting trouble reclaiming, start doing writepage
+ 		 * even in laptop mode.
+@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ 			 * that that high watermark would be met at 100%
+ 			 * efficiency.
+ 			 */
+-			if (kswapd_shrink_zone(zone, end_zone,
+-					       &sc, &nr_attempted))
++			if (kswapd_shrink_zone(zone, end_zone, &sc))
+ 				raise_priority = false;
+ 		}
+ 
+@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
+ 				pfmemalloc_watermark_ok(pgdat))
+ 			wake_up_all(&pgdat->pfmemalloc_wait);
+ 
+-		/*
+-		 * Fragmentation may mean that the system cannot be rebalanced
+-		 * for high-order allocations in all zones. If twice the
+-		 * allocation size has been reclaimed and the zones are still
+-		 * not balanced then recheck the watermarks at order-0 to
+-		 * prevent kswapd reclaiming excessively. Assume that a
+-		 * process requested a high-order can direct reclaim/compact.
+-		 */
+-		if (order && sc.nr_reclaimed >= 2UL << order)
+-			order = sc.order = 0;
+-
+ 		/* Check if kswapd should be suspending */
+ 		if (try_to_freeze() || kthread_should_stop())
+ 			break;
+ 
+ 		/*
+-		 * Compact if necessary and kswapd is reclaiming at least the
+-		 * high watermark number of pages as requsted
+-		 */
+-		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
+-			compact_pgdat(pgdat, order);
+-
+-		/*
+ 		 * Raise priority if scanning rate is too low or there was no
+ 		 * progress in reclaiming pages
+ 		 */
+ 		if (raise_priority || !sc.nr_reclaimed)
+ 			sc.priority--;
+ 	} while (sc.priority >= 1 &&
+-		 !pgdat_balanced(pgdat, order, *classzone_idx));
++			!pgdat_balanced(pgdat, order, classzone_idx));
+ 
+ out:
+ 	/*
+-	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
+-	 * makes a decision on the order we were last reclaiming at. However,
+-	 * if another caller entered the allocator slow path while kswapd
+-	 * was awake, order will remain at the higher level
++	 * Return the highest zone idx we were reclaiming at so
++	 * prepare_kswapd_sleep() makes the same decisions as here.
+ 	 */
+-	*classzone_idx = end_zone;
+-	return order;
++	return end_zone;
+ }
+ 
+-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
++static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
++				int classzone_idx, int balanced_classzone_idx)
+ {
+ 	long remaining = 0;
+ 	DEFINE_WAIT(wait);
+@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+ 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ 
+ 	/* Try to sleep for a short interval */
+-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
++	if (prepare_kswapd_sleep(pgdat, order, remaining,
++						balanced_classzone_idx)) {
+ 		remaining = schedule_timeout(HZ/10);
+ 		finish_wait(&pgdat->kswapd_wait, &wait);
+ 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+ 	 * After a short sleep, check if it was a premature sleep. If not, then
+ 	 * go fully to sleep until explicitly woken up.
+ 	 */
+-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
++	if (prepare_kswapd_sleep(pgdat, order, remaining,
++						balanced_classzone_idx)) {
+ 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
+ 
+ 		/*
+@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+ 		 */
+ 		reset_isolation_suitable(pgdat);
+ 
++		/*
++		 * We have freed the memory, now we should compact it to make
++		 * allocation of the requested order possible.
++		 */
++		wakeup_kcompactd(pgdat, order, classzone_idx);
++
+ 		if (!kthread_should_stop())
+ 			schedule();
+ 
+@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+ static int kswapd(void *p)
+ {
+ 	unsigned long order, new_order;
+-	unsigned balanced_order;
+ 	int classzone_idx, new_classzone_idx;
+ 	int balanced_classzone_idx;
+ 	pg_data_t *pgdat = (pg_data_t*)p;
+@@ -3440,23 +3394,19 @@ static int kswapd(void *p)
+ 	set_freezable();
+ 
+ 	order = new_order = 0;
+-	balanced_order = 0;
+ 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
+ 	balanced_classzone_idx = classzone_idx;
+ 	for ( ; ; ) {
+ 		bool ret;
+ 
+ 		/*
+-		 * If the last balance_pgdat was unsuccessful it's unlikely a
+-		 * new request of a similar or harder type will succeed soon
+-		 * so consider going to sleep on the basis we reclaimed at
++		 * While we were reclaiming, there might have been another
++		 * wakeup, so check the values.
+ 		 */
+-		if (balanced_order == new_order) {
+-			new_order = pgdat->kswapd_max_order;
+-			new_classzone_idx = pgdat->classzone_idx;
+-			pgdat->kswapd_max_order =  0;
+-			pgdat->classzone_idx = pgdat->nr_zones - 1;
+-		}
++		new_order = pgdat->kswapd_max_order;
++		new_classzone_idx = pgdat->classzone_idx;
++		pgdat->kswapd_max_order =  0;
++		pgdat->classzone_idx = pgdat->nr_zones - 1;
+ 
+ 		if (order < new_order || classzone_idx > new_classzone_idx) {
+ 			/*
+@@ -3466,7 +3416,7 @@ static int kswapd(void *p)
+ 			order = new_order;
+ 			classzone_idx = new_classzone_idx;
+ 		} else {
+-			kswapd_try_to_sleep(pgdat, balanced_order,
++			kswapd_try_to_sleep(pgdat, order, classzone_idx,
+ 						balanced_classzone_idx);
+ 			order = pgdat->kswapd_max_order;
+ 			classzone_idx = pgdat->classzone_idx;
+@@ -3486,9 +3436,8 @@ static int kswapd(void *p)
+ 		 */
+ 		if (!ret) {
+ 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
+-			balanced_classzone_idx = classzone_idx;
+-			balanced_order = balance_pgdat(pgdat, order,
+-						&balanced_classzone_idx);
++			balanced_classzone_idx = balance_pgdat(pgdat, order,
++								classzone_idx);
+ 		}
+ 	}
+ 
+@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
+ 	}
+ 	if (!waitqueue_active(&pgdat->kswapd_wait))
+ 		return;
+-	if (zone_balanced(zone, order, 0, 0))
++	if (zone_balanced(zone, order, true, 0, 0))
+ 		return;
+ 
+ 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
+-- 
+2.7.2
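
Illustration (not part of the patch): a minimal userspace sketch of two
decisions discussed in the commit message above. It is a model under stated
assumptions, not kernel code. struct wmark_check, effective_check() and
old_kswapd_would_compact() are hypothetical names introduced here; the
compaction_enabled parameter stands in for IS_ENABLED(CONFIG_COMPACTION).
The first function mirrors only the order/mark adjustment in the patched
zone_balanced(); the second mirrors the removed compaction gate in
balance_pgdat(), which stayed false in the reported scenario.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical container for what the patched zone_balanced() would pass
 * on to zone_watermark_ok_safe(): the order to test and the mark in pages. */
struct wmark_check {
	unsigned int order;
	unsigned long mark;
};

/*
 * Models the order/mark adjustment from the patched zone_balanced().
 * With highorder == false (and compaction built in), the check falls back
 * to order-0 but raises the mark by the size of the requested allocation.
 */
static struct wmark_check effective_check(unsigned long high_wmark,
					  unsigned long balance_gap,
					  unsigned int order,
					  bool highorder,
					  bool compaction_enabled)
{
	struct wmark_check c = {
		.order = order,
		.mark = high_wmark + balance_gap,
	};

	/* Guard the shift below; not present in the kernel code. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	if (compaction_enabled && !highorder) {
		c.mark += 1UL << order;
		c.order = 0;
	}
	return c;
}

/*
 * Models the gate that the patch removes from balance_pgdat():
 *   if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
 */
static bool old_kswapd_would_compact(bool pgdat_needs_compaction,
				     unsigned long nr_reclaimed,
				     unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}
```

For an order-9 request against a made-up high watermark of 1000 pages,
effective_check(1000, 0, 9, false, true) yields an order-0 check against
1512 pages (1000 + 2^9), while highorder == true keeps order 9 and mark 1000.
With the values from the commit message, old_kswapd_would_compact(true, 0, 99)
is false, and so is old_kswapd_would_compact(true, 0, 0), which is why kswapd
never compacted in that test.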
diff --git a/a/content_digest b/N1/content_digest
index f5e2991..9e5aab0 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -71,6 +71,642 @@
  "review, but functionality-wise the first patch leaves things somewhat\n"
  "weird without the third patch.\n"
  "\n"
- ----8<----
+ "----8<----\n"
+ ">From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001\n"
+ "From: Vlastimil Babka <vbabka@suse.cz>\n"
+ "Date: Wed, 9 Mar 2016 12:45:24 +0100\n"
+ "Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up\n"
+ " kcompactd\n"
+ "\n"
+ "Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim\n"
+ "and compaction to attempt making memory allocation of given order\n"
+ "available.  The details differ from direct reclaim e.g.  in having high\n"
+ "watermark as a goal.  The code involved in kswapd's reclaim/compaction\n"
+ "decisions has evolved to be quite complex.  Testing reveals that it\n"
+ "doesn't actually work in at least one scenario, and closer inspection\n"
+ "suggests that it could be greatly simplified without compromising on the\n"
+ "goal (make high-order page available) or efficiency (don't reclaim too\n"
+ "much).  The simplification relieas of doing all compaction in kcompactd,\n"
+ "which is simply woken up when high watermarks are reached by kswapd's\n"
+ "reclaim.\n"
+ "\n"
+ "The scenario where kswapd compaction doesn't work was found with mmtests\n"
+ "test stress-highalloc configured to attempt order-9 allocations without\n"
+ "direct reclaim, just waking up kswapd.  There was no compaction attempt\n"
+ "from kswapd during the whole test.  Some added instrumentation shows what\n"
+ "happens:\n"
+ "\n"
+ "- balance_pgdat() sets end_zone to Normal, as it's not balanced\n"
+ "- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it\n"
+ "   cannot reclaim anything, so sc.nr_reclaimed is 0\n"
+ "- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it\n"
+ "   merely checks if high watermarks were reached for base pages. This is true,\n"
+ "   so no reclaim is attempted. For DMA, testorder=0 wasn't used, as\n"
+ "   compaction_suitable() returned COMPACT_SKIPPED\n"
+ "- even though the pgdat_needs_compaction flag wasn't set to false, no\n"
+ "   compaction happens due to the condition sc.nr_reclaimed > nr_attempted\n"
+ "   being false (as 0 < 99)\n"
+ "- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0\n"
+ "   pgdat_balanced() is false as only the small zone DMA appears balanced\n"
+ "   (curiously in that check, watermark appears OK and compaction_suitable()\n"
+ "   returns COMPACT_PARTIAL, because a lower classzone_idx is used there)\n"
+ "\n"
+ "Now, even if it was decided that reclaim shouldn't be attempted on the DMA\n"
+ "zone, the scenario would be the same, as (sc.nr_reclaimed=0 >\n"
+ "nr_attempted=0) is also false.  The condition really should use >= as the\n"
+ "comment suggests.  Then there is a mismatch in the check for setting\n"
+ "pgdat_needs_compaction to false using low watermark, while the rest uses\n"
+ "high watermark, and who knows what other subtlety.  Hopefully this\n"
+ "demonstrates that this is unsustainable.\n"
+ "\n"
+ "Luckily we can simplify this a lot.  The reclaim/compaction decisions make\n"
+ "sense for direct reclaim scenario, but in kswapd, our primary goal is to\n"
+ "reach high watermark in order-0 pages.  Afterwards we can attempt\n"
+ "compaction just once.  Unlike direct reclaim, we don't reclaim extra pages\n"
+ "(over the high watermark), the current code already disallows it for good\n"
+ "reasons.\n"
+ "\n"
+ "After this patch, we simply wake up kcompactd to process the pgdat, after\n"
+ "we have either succeeded or failed to reach the high watermarks in kswapd,\n"
+ "which goes to sleep.  We pass kswapd's order and classzone_idx, so\n"
+ "kcompactd can apply the same criteria to determine which zones are worth\n"
+ "compacting.  Note that we use the classzone_idx from wakeup_kswapd(), not\n"
+ "balanced_classzone_idx which can include higher zones that kswapd tried to\n"
+ "balance too, but didn't consider them in pgdat_balanced().\n"
+ "\n"
+ "Since kswapd now cannot create high-order pages itself, we need to adjust\n"
+ "how it determines the zones to be balanced.  The key element here is\n"
+ "adding a \"highorder\" parameter to zone_balanced, which, when set to false,\n"
+ "makes it consider only order-0 watermark instead of the desired higher\n"
+ "order (this was done previously by kswapd_shrink_zone(), but not\n"
+ "elsewhere).  This false is passed for example in pgdat_balanced().\n"
+ "Importantly, wakeup_kswapd() uses true to make sure kswapd and thus\n"
+ "kcompactd are woken up for a high-order allocation failure.\n"
+ "\n"
+ "The last thing is to decide what to do with pageblock_skip bitmap handling.\n"
+ "Compaction maintains a pageblock_skip bitmap to record pageblocks where\n"
+ "isolation recently failed.  This bitmap can be reset in three ways:\n"
+ "\n"
+ "1) direct compaction is restarting after going through the full deferred cycle\n"
+ "\n"
+ "2) kswapd goes to sleep, and some other direct compaction has previously\n"
+ "    finished scanning the whole zone and set zone->compact_blockskip_flush.\n"
+ "    Note that a successful direct compaction clears this flag.\n"
+ "\n"
+ "3) compaction was invoked manually via trigger in /proc\n"
+ "\n"
+ "The case 2) is somewhat fuzzy to begin with, but after introducing\n"
+ "kcompactd we should update it.  The check for direct compaction in 1), and\n"
+ "to set the flush flag in 2) use current_is_kswapd(), which doesn't work\n"
+ "for kcompactd.  Thus, this patch adds bool direct_compaction to\n"
+ "compact_control to use in 2).  For the case 1) we remove the check\n"
+ "completely - unlike the former kswapd compaction, kcompactd does use the\n"
+ "deferred compaction functionality, so flushing tied to restarting from\n"
+ "deferred compaction makes sense here.\n"
+ "\n"
+ "Note that when kswapd goes to sleep, kcompactd is woken up, so it will see\n"
+ "the flushed pageblock_skip bits.  This is different from when the former\n"
+ "kswapd compaction observed the bits and I believe it makes more sense.\n"
+ "Kcompactd can afford to be more thorough than a direct compaction trying\n"
+ "to limit allocation latency, or kswapd whose primary goal is to reclaim.\n"
+ "\n"
+ "For testing, I used stress-highalloc configured to do order-9 allocations\n"
+ "with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on\n"
+ "kswapd/kcompactd reclaim/compaction (the interfering kernel builds in\n"
+ "phases 1 and 2 work as usual):\n"
+ "\n"
+ "stress-highalloc\n"
+ "                        4.5-rc1+before          4.5-rc1+after\n"
+ "                             -nodirect              -nodirect\n"
+ "Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)\n"
+ "Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)\n"
+ "Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)\n"
+ "Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)\n"
+ "Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)\n"
+ "Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)\n"
+ "Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)\n"
+ "Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)\n"
+ "Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)\n"
+ "\n"
+ "User                          3166.67        3181.09\n"
+ "System                        1153.37        1158.25\n"
+ "Elapsed                       1768.53        1799.37\n"
+ "\n"
+ "                            4.5-rc1+before   4.5-rc1+after\n"
+ "                                 -nodirect    -nodirect\n"
+ "Direct pages scanned                32938        32797\n"
+ "Kswapd pages scanned              2183166      2202613\n"
+ "Kswapd pages reclaimed            2152359      2143524\n"
+ "Direct pages reclaimed              32735        32545\n"
+ "Percentage direct scans                1%           1%\n"
+ "THP fault alloc                       579          612\n"
+ "THP collapse alloc                    304          316\n"
+ "THP splits                              0            0\n"
+ "THP fault fallback                    793          778\n"
+ "THP collapse fail                      11           16\n"
+ "Compaction stalls                    1013         1007\n"
+ "Compaction success                     92           67\n"
+ "Compaction failures                   920          939\n"
+ "Page migrate success               238457       721374\n"
+ "Page migrate failure                23021        23469\n"
+ "Compaction pages isolated          504695      1479924\n"
+ "Compaction migrate scanned         661390      8812554\n"
+ "Compaction free scanned          13476658     84327916\n"
+ "Compaction cost                       262          838\n"
+ "\n"
+ "After this patch we see improvements in allocation success rate\n"
+ "(especially for phase 3) along with increased compaction activity.  The\n"
+ "compaction stalls (direct compaction) in the interfering kernel builds\n"
+ "(probably THP's) also decreased somewhat thanks to kcompactd activity, yet\n"
+ "THP alloc successes improved a bit.\n"
+ "\n"
+ "Note that elapsed and user time aren't so useful for this benchmark,\n"
+ "because of the background interference being unpredictable.  It's just to\n"
+ "quickly spot some major unexpected differences.  System time is somewhat\n"
+ "more useful and that didn't increase.\n"
+ "\n"
+ "Also (after adjusting mmtests' ftrace monitor):\n"
+ "\n"
+ "Time kswapd awake               2547781     2269241\n"
+ "Time kcompactd awake                  0      119253\n"
+ "Time direct compacting           939937      557649\n"
+ "Time kswapd compacting                0           0\n"
+ "Time kcompactd compacting             0      119099\n"
+ "\n"
+ "The decrease of overall time spent compacting appears to not match the\n"
+ "increased compaction stats.  I suspect the tasks get rescheduled and since\n"
+ "the ftrace monitor doesn't see that, the reported time is wall time, not\n"
+ "CPU time.  But arguably direct compactors care about overall latency\n"
+ "anyway, whether busy compacting or waiting for CPU doesn't matter.  And\n"
+ "that latency seems to have almost halved.\n"
+ "\n"
+ "It's also interesting how much time kswapd spent awake just going through\n"
+ "all the priorities and failing to even try compacting, over and over.\n"
+ "\n"
+ "We can also configure stress-highalloc to perform both direct\n"
+ "reclaim/compaction and wake up kswapd/kcompactd, by using\n"
+ "GFP_KERNEL|__GFP_HIGH|__GFP_COMP:\n"
+ "\n"
+ "stress-highalloc\n"
+ "                        4.5-rc1+before         4.5-rc1+after\n"
+ "                               -direct               -direct\n"
+ "Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)\n"
+ "Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)\n"
+ "Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)\n"
+ "Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)\n"
+ "Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)\n"
+ "Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)\n"
+ "Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)\n"
+ "Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)\n"
+ "Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)\n"
+ "\n"
+ "User                          3344.73       3246.04\n"
+ "System                        1194.24       1172.29\n"
+ "Elapsed                       1838.04       1836.76\n"
+ "\n"
+ "                            4.5-rc1+before  4.5-rc1+after\n"
+ "                                   -direct     -direct\n"
+ "Direct pages scanned               125146      120966\n"
+ "Kswapd pages scanned              2119757     2135012\n"
+ "Kswapd pages reclaimed            2073183     2108388\n"
+ "Direct pages reclaimed             124909      120577\n"
+ "Percentage direct scans                5%          5%\n"
+ "THP fault alloc                       599         652\n"
+ "THP collapse alloc                    323         354\n"
+ "THP splits                              0           0\n"
+ "THP fault fallback                    806         793\n"
+ "THP collapse fail                      17          16\n"
+ "Compaction stalls                    2457        2025\n"
+ "Compaction success                    906         518\n"
+ "Compaction failures                  1551        1507\n"
+ "Page migrate success              2031423     2360608\n"
+ "Page migrate failure                32845       40852\n"
+ "Compaction pages isolated         4129761     4802025\n"
+ "Compaction migrate scanned       11996712    21750613\n"
+ "Compaction free scanned         214970969   344372001\n"
+ "Compaction cost                      2271        2694\n"
+ "\n"
+ "In this scenario, this patch doesn't change the overall success rate as\n"
+ "direct compaction already tries all it can.  There's however significant\n"
+ "reduction in direct compaction stalls (that is, the number of allocations\n"
+ "that went into direct compaction).  The number of successes (i.e.  direct\n"
+ "compaction stalls that ended up with successful allocation) is reduced by\n"
+ "the same number.  This means the offload to kcompactd is working as\n"
+ "expected, and direct compaction is reduced either due to detecting\n"
+ "contention, or compaction deferred by kcompactd.  In the previous version\n"
+ "of this patchset there was some apparent reduction of success rate, but\n"
+ "the changes in this version (such as using sync compaction only), new\n"
+ "baseline kernel, and/or averaging results from 5 executions (my bet), made\n"
+ "this go away.\n"
+ "\n"
+ "Ftrace-based stats seem to roughly agree:\n"
+ "\n"
+ "Time kswapd awake               2532984     2326824\n"
+ "Time kcompactd awake                  0      257916\n"
+ "Time direct compacting           864839      735130\n"
+ "Time kswapd compacting                0           0\n"
+ "Time kcompactd compacting             0      257585\n"
+ "\n"
+ "Signed-off-by: Vlastimil Babka <vbabka@suse.cz>\n"
+ "Cc: Andrea Arcangeli <aarcange@redhat.com>\n"
+ "Cc: \"Kirill A. Shutemov\" <kirill.shutemov@linux.intel.com>\n"
+ "Cc: Rik van Riel <riel@redhat.com>\n"
+ "Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>\n"
+ "Cc: Mel Gorman <mgorman@techsingularity.net>\n"
+ "Cc: David Rientjes <rientjes@google.com>\n"
+ "Cc: Michal Hocko <mhocko@suse.com>\n"
+ "Cc: Johannes Weiner <hannes@cmpxchg.org>\n"
+ "Signed-off-by: Andrew Morton <akpm@linux-foundation.org>\n"
+ "---\n"
+ " mm/compaction.c |  10 ++--\n"
+ " mm/internal.h   |   1 +\n"
+ " mm/vmscan.c     | 147 ++++++++++++++++++--------------------------------------\n"
+ " 3 files changed, 54 insertions(+), 104 deletions(-)\n"
+ "\n"
+ "diff --git a/mm/compaction.c b/mm/compaction.c\n"
+ "index 5b2bfbaa821a..ccf97b02b85f 100644\n"
+ "--- a/mm/compaction.c\n"
+ "+++ b/mm/compaction.c\n"
+ "@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,\n"
+ " \n"
+ " \t\t/*\n"
+ " \t\t * Mark that the PG_migrate_skip information should be cleared\n"
+ "-\t\t * by kswapd when it goes to sleep. kswapd does not set the\n"
+ "+\t\t * by kswapd when it goes to sleep. kcompactd does not set the\n"
+ " \t\t * flag itself as the decision to be clear should be directly\n"
+ " \t\t * based on an allocation request.\n"
+ " \t\t */\n"
+ "-\t\tif (!current_is_kswapd())\n"
+ "+\t\tif (cc->direct_compaction)\n"
+ " \t\t\tzone->compact_blockskip_flush = true;\n"
+ " \n"
+ " \t\treturn COMPACT_COMPLETE;\n"
+ "@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)\n"
+ " \n"
+ " \t/*\n"
+ " \t * Clear pageblock skip if there were failures recently and compaction\n"
+ "-\t * is about to be retried after being deferred. kswapd does not do\n"
+ "-\t * this reset as it'll reset the cached information when going to sleep.\n"
+ "+\t * is about to be retried after being deferred.\n"
+ " \t */\n"
+ "-\tif (compaction_restarting(zone, cc->order) && !current_is_kswapd())\n"
+ "+\tif (compaction_restarting(zone, cc->order))\n"
+ " \t\t__reset_isolation_suitable(zone);\n"
+ " \n"
+ " \t/*\n"
+ "@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,\n"
+ " \t\t.mode = mode,\n"
+ " \t\t.alloc_flags = alloc_flags,\n"
+ " \t\t.classzone_idx = classzone_idx,\n"
+ "+\t\t.direct_compaction = true,\n"
+ " \t};\n"
+ " \tINIT_LIST_HEAD(&cc.freepages);\n"
+ " \tINIT_LIST_HEAD(&cc.migratepages);\n"
+ "diff --git a/mm/internal.h b/mm/internal.h\n"
+ "index 17ae0b52534b..013a786fa37f 100644\n"
+ "--- a/mm/internal.h\n"
+ "+++ b/mm/internal.h\n"
+ "@@ -181,6 +181,7 @@ struct compact_control {\n"
+ " \tunsigned long last_migrated_pfn;/* Not yet flushed page being freed */\n"
+ " \tenum migrate_mode mode;\t\t/* Async or sync migration mode */\n"
+ " \tbool ignore_skip_hint;\t\t/* Scan blocks even if marked skip */\n"
+ "+\tbool direct_compaction;\t\t/* False from kcompactd or /proc/... */\n"
+ " \tint order;\t\t\t/* order a direct compactor needs */\n"
+ " \tconst gfp_t gfp_mask;\t\t/* gfp mask of a direct compactor */\n"
+ " \tconst int alloc_flags;\t\t/* alloc flags of a direct compactor */\n"
+ "diff --git a/mm/vmscan.c b/mm/vmscan.c\n"
+ "index c67df4831565..23bc7e643ad8 100644\n"
+ "--- a/mm/vmscan.c\n"
+ "+++ b/mm/vmscan.c\n"
+ "@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)\n"
+ " \t} while (memcg);\n"
+ " }\n"
+ " \n"
+ "-static bool zone_balanced(struct zone *zone, int order,\n"
+ "-\t\t\t  unsigned long balance_gap, int classzone_idx)\n"
+ "+static bool zone_balanced(struct zone *zone, int order, bool highorder,\n"
+ "+\t\t\tunsigned long balance_gap, int classzone_idx)\n"
+ " {\n"
+ "-\tif (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +\n"
+ "-\t\t\t\t    balance_gap, classzone_idx))\n"
+ "-\t\treturn false;\n"
+ "+\tunsigned long mark = high_wmark_pages(zone) + balance_gap;\n"
+ " \n"
+ "-\tif (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,\n"
+ "-\t\t\t\torder, 0, classzone_idx) == COMPACT_SKIPPED)\n"
+ "-\t\treturn false;\n"
+ "+\t/*\n"
+ "+\t * When checking from pgdat_balanced(), kswapd should stop and sleep\n"
+ "+\t * when it reaches the high order-0 watermark and let kcompactd take\n"
+ "+\t * over. Other callers such as wakeup_kswapd() want to determine the\n"
+ "+\t * true high-order watermark.\n"
+ "+\t */\n"
+ "+\tif (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {\n"
+ "+\t\tmark += (1UL << order);\n"
+ "+\t\torder = 0;\n"
+ "+\t}\n"
+ " \n"
+ "-\treturn true;\n"
+ "+\treturn zone_watermark_ok_safe(zone, order, mark, classzone_idx);\n"
+ " }\n"
+ " \n"
+ " /*\n"
+ "@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " \t\t\tcontinue;\n"
+ " \t\t}\n"
+ " \n"
+ "-\t\tif (zone_balanced(zone, order, 0, i))\n"
+ "+\t\tif (zone_balanced(zone, order, false, 0, i))\n"
+ " \t\t\tbalanced_pages += zone->managed_pages;\n"
+ " \t\telse if (!order)\n"
+ " \t\t\treturn false;\n"
+ "@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,\n"
+ "  */\n"
+ " static bool kswapd_shrink_zone(struct zone *zone,\n"
+ " \t\t\t       int classzone_idx,\n"
+ "-\t\t\t       struct scan_control *sc,\n"
+ "-\t\t\t       unsigned long *nr_attempted)\n"
+ "+\t\t\t       struct scan_control *sc)\n"
+ " {\n"
+ "-\tint testorder = sc->order;\n"
+ " \tunsigned long balance_gap;\n"
+ " \tbool lowmem_pressure;\n"
+ " \n"
+ "@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone,\n"
+ " \tsc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));\n"
+ " \n"
+ " \t/*\n"
+ "-\t * Kswapd reclaims only single pages with compaction enabled. Trying\n"
+ "-\t * too hard to reclaim until contiguous free pages have become\n"
+ "-\t * available can hurt performance by evicting too much useful data\n"
+ "-\t * from memory. Do not reclaim more than needed for compaction.\n"
+ "-\t */\n"
+ "-\tif (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&\n"
+ "-\t\t\tcompaction_suitable(zone, sc->order, 0, classzone_idx)\n"
+ "-\t\t\t\t\t\t\t!= COMPACT_SKIPPED)\n"
+ "-\t\ttestorder = 0;\n"
+ "-\n"
+ "-\t/*\n"
+ " \t * We put equal pressure on every zone, unless one zone has way too\n"
+ " \t * many pages free already. The \"too many pages\" is defined as the\n"
+ " \t * high wmark plus a \"gap\" where the gap is either the low\n"
+ "@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone,\n"
+ " \t * reclaim is necessary\n"
+ " \t */\n"
+ " \tlowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));\n"
+ "-\tif (!lowmem_pressure && zone_balanced(zone, testorder,\n"
+ "+\tif (!lowmem_pressure && zone_balanced(zone, sc->order, false,\n"
+ " \t\t\t\t\t\tbalance_gap, classzone_idx))\n"
+ " \t\treturn true;\n"
+ " \n"
+ " \tshrink_zone(zone, sc, zone_idx(zone) == classzone_idx);\n"
+ " \n"
+ "-\t/* Account for the number of pages attempted to reclaim */\n"
+ "-\t*nr_attempted += sc->nr_to_reclaim;\n"
+ "-\n"
+ " \tclear_bit(ZONE_WRITEBACK, &zone->flags);\n"
+ " \n"
+ " \t/*\n"
+ "@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n"
+ " \t * waits.\n"
+ " \t */\n"
+ " \tif (zone_reclaimable(zone) &&\n"
+ "-\t    zone_balanced(zone, testorder, 0, classzone_idx)) {\n"
+ "+\t    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {\n"
+ " \t\tclear_bit(ZONE_CONGESTED, &zone->flags);\n"
+ " \t\tclear_bit(ZONE_DIRTY, &zone->flags);\n"
+ " \t}\n"
+ "@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n"
+ "  * For kswapd, balance_pgdat() will work across all this node's zones until\n"
+ "  * they are all at high_wmark_pages(zone).\n"
+ "  *\n"
+ "- * Returns the final order kswapd was reclaiming at\n"
+ "+ * Returns the highest zone idx kswapd was reclaiming at\n"
+ "  *\n"
+ "  * There is special handling here for zones which are full of pinned pages.\n"
+ "  * This can happen if the pages are all mlocked, or if they are all used by\n"
+ "@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n"
+ "  * interoperates with the page allocator fallback scheme to ensure that aging\n"
+ "  * of pages is balanced across the zones.\n"
+ "  */\n"
+ "-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ "-\t\t\t\t\t\t\tint *classzone_idx)\n"
+ "+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " {\n"
+ " \tint i;\n"
+ " \tint end_zone = 0;\t/* Inclusive.  0 = ZONE_DMA */\n"
+ "@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ " \tcount_vm_event(PAGEOUTRUN);\n"
+ " \n"
+ " \tdo {\n"
+ "-\t\tunsigned long nr_attempted = 0;\n"
+ " \t\tbool raise_priority = true;\n"
+ "-\t\tbool pgdat_needs_compaction = (order > 0);\n"
+ " \n"
+ " \t\tsc.nr_reclaimed = 0;\n"
+ " \n"
+ "@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ " \t\t\t\tbreak;\n"
+ " \t\t\t}\n"
+ " \n"
+ "-\t\t\tif (!zone_balanced(zone, order, 0, 0)) {\n"
+ "+\t\t\tif (!zone_balanced(zone, order, false, 0, 0)) {\n"
+ " \t\t\t\tend_zone = i;\n"
+ " \t\t\t\tbreak;\n"
+ " \t\t\t} else {\n"
+ "@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ " \t\tif (i < 0)\n"
+ " \t\t\tgoto out;\n"
+ " \n"
+ "-\t\tfor (i = 0; i <= end_zone; i++) {\n"
+ "-\t\t\tstruct zone *zone = pgdat->node_zones + i;\n"
+ "-\n"
+ "-\t\t\tif (!populated_zone(zone))\n"
+ "-\t\t\t\tcontinue;\n"
+ "-\n"
+ "-\t\t\t/*\n"
+ "-\t\t\t * If any zone is currently balanced then kswapd will\n"
+ "-\t\t\t * not call compaction as it is expected that the\n"
+ "-\t\t\t * necessary pages are already available.\n"
+ "-\t\t\t */\n"
+ "-\t\t\tif (pgdat_needs_compaction &&\n"
+ "-\t\t\t\t\tzone_watermark_ok(zone, order,\n"
+ "-\t\t\t\t\t\tlow_wmark_pages(zone),\n"
+ "-\t\t\t\t\t\t*classzone_idx, 0))\n"
+ "-\t\t\t\tpgdat_needs_compaction = false;\n"
+ "-\t\t}\n"
+ "-\n"
+ " \t\t/*\n"
+ " \t\t * If we're getting trouble reclaiming, start doing writepage\n"
+ " \t\t * even in laptop mode.\n"
+ "@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ " \t\t\t * that that high watermark would be met at 100%\n"
+ " \t\t\t * efficiency.\n"
+ " \t\t\t */\n"
+ "-\t\t\tif (kswapd_shrink_zone(zone, end_zone,\n"
+ "-\t\t\t\t\t       &sc, &nr_attempted))\n"
+ "+\t\t\tif (kswapd_shrink_zone(zone, end_zone, &sc))\n"
+ " \t\t\t\traise_priority = false;\n"
+ " \t\t}\n"
+ " \n"
+ "@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n"
+ " \t\t\t\tpfmemalloc_watermark_ok(pgdat))\n"
+ " \t\t\twake_up_all(&pgdat->pfmemalloc_wait);\n"
+ " \n"
+ "-\t\t/*\n"
+ "-\t\t * Fragmentation may mean that the system cannot be rebalanced\n"
+ "-\t\t * for high-order allocations in all zones. If twice the\n"
+ "-\t\t * allocation size has been reclaimed and the zones are still\n"
+ "-\t\t * not balanced then recheck the watermarks at order-0 to\n"
+ "-\t\t * prevent kswapd reclaiming excessively. Assume that a\n"
+ "-\t\t * process requested a high-order can direct reclaim/compact.\n"
+ "-\t\t */\n"
+ "-\t\tif (order && sc.nr_reclaimed >= 2UL << order)\n"
+ "-\t\t\torder = sc.order = 0;\n"
+ "-\n"
+ " \t\t/* Check if kswapd should be suspending */\n"
+ " \t\tif (try_to_freeze() || kthread_should_stop())\n"
+ " \t\t\tbreak;\n"
+ " \n"
+ " \t\t/*\n"
+ "-\t\t * Compact if necessary and kswapd is reclaiming at least the\n"
+ "-\t\t * high watermark number of pages as requsted\n"
+ "-\t\t */\n"
+ "-\t\tif (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)\n"
+ "-\t\t\tcompact_pgdat(pgdat, order);\n"
+ "-\n"
+ "-\t\t/*\n"
+ " \t\t * Raise priority if scanning rate is too low or there was no\n"
+ " \t\t * progress in reclaiming pages\n"
+ " \t\t */\n"
+ " \t\tif (raise_priority || !sc.nr_reclaimed)\n"
+ " \t\t\tsc.priority--;\n"
+ " \t} while (sc.priority >= 1 &&\n"
+ "-\t\t !pgdat_balanced(pgdat, order, *classzone_idx));\n"
+ "+\t\t\t!pgdat_balanced(pgdat, order, classzone_idx));\n"
+ " \n"
+ " out:\n"
+ " \t/*\n"
+ "-\t * Return the order we were reclaiming at so prepare_kswapd_sleep()\n"
+ "-\t * makes a decision on the order we were last reclaiming at. However,\n"
+ "-\t * if another caller entered the allocator slow path while kswapd\n"
+ "-\t * was awake, order will remain at the higher level\n"
+ "+\t * Return the highest zone idx we were reclaiming at so\n"
+ "+\t * prepare_kswapd_sleep() makes the same decisions as here.\n"
+ " \t */\n"
+ "-\t*classzone_idx = end_zone;\n"
+ "-\treturn order;\n"
+ "+\treturn end_zone;\n"
+ " }\n"
+ " \n"
+ "-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ "+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,\n"
+ "+\t\t\t\tint classzone_idx, int balanced_classzone_idx)\n"
+ " {\n"
+ " \tlong remaining = 0;\n"
+ " \tDEFINE_WAIT(wait);\n"
+ "@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " \tprepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);\n"
+ " \n"
+ " \t/* Try to sleep for a short interval */\n"
+ "-\tif (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {\n"
+ "+\tif (prepare_kswapd_sleep(pgdat, order, remaining,\n"
+ "+\t\t\t\t\t\tbalanced_classzone_idx)) {\n"
+ " \t\tremaining = schedule_timeout(HZ/10);\n"
+ " \t\tfinish_wait(&pgdat->kswapd_wait, &wait);\n"
+ " \t\tprepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);\n"
+ "@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " \t * After a short sleep, check if it was a premature sleep. If not, then\n"
+ " \t * go fully to sleep until explicitly woken up.\n"
+ " \t */\n"
+ "-\tif (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {\n"
+ "+\tif (prepare_kswapd_sleep(pgdat, order, remaining,\n"
+ "+\t\t\t\t\t\tbalanced_classzone_idx)) {\n"
+ " \t\ttrace_mm_vmscan_kswapd_sleep(pgdat->node_id);\n"
+ " \n"
+ " \t\t/*\n"
+ "@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " \t\t */\n"
+ " \t\treset_isolation_suitable(pgdat);\n"
+ " \n"
+ "+\t\t/*\n"
+ "+\t\t * We have freed the memory, now we should compact it to make\n"
+ "+\t\t * allocation of the requested order possible.\n"
+ "+\t\t */\n"
+ "+\t\twakeup_kcompactd(pgdat, order, classzone_idx);\n"
+ "+\n"
+ " \t\tif (!kthread_should_stop())\n"
+ " \t\t\tschedule();\n"
+ " \n"
+ "@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n"
+ " static int kswapd(void *p)\n"
+ " {\n"
+ " \tunsigned long order, new_order;\n"
+ "-\tunsigned balanced_order;\n"
+ " \tint classzone_idx, new_classzone_idx;\n"
+ " \tint balanced_classzone_idx;\n"
+ " \tpg_data_t *pgdat = (pg_data_t*)p;\n"
+ "@@ -3440,23 +3394,19 @@ static int kswapd(void *p)\n"
+ " \tset_freezable();\n"
+ " \n"
+ " \torder = new_order = 0;\n"
+ "-\tbalanced_order = 0;\n"
+ " \tclasszone_idx = new_classzone_idx = pgdat->nr_zones - 1;\n"
+ " \tbalanced_classzone_idx = classzone_idx;\n"
+ " \tfor ( ; ; ) {\n"
+ " \t\tbool ret;\n"
+ " \n"
+ " \t\t/*\n"
+ "-\t\t * If the last balance_pgdat was unsuccessful it's unlikely a\n"
+ "-\t\t * new request of a similar or harder type will succeed soon\n"
+ "-\t\t * so consider going to sleep on the basis we reclaimed at\n"
+ "+\t\t * While we were reclaiming, there might have been another\n"
+ "+\t\t * wakeup, so check the values.\n"
+ " \t\t */\n"
+ "-\t\tif (balanced_order == new_order) {\n"
+ "-\t\t\tnew_order = pgdat->kswapd_max_order;\n"
+ "-\t\t\tnew_classzone_idx = pgdat->classzone_idx;\n"
+ "-\t\t\tpgdat->kswapd_max_order =  0;\n"
+ "-\t\t\tpgdat->classzone_idx = pgdat->nr_zones - 1;\n"
+ "-\t\t}\n"
+ "+\t\tnew_order = pgdat->kswapd_max_order;\n"
+ "+\t\tnew_classzone_idx = pgdat->classzone_idx;\n"
+ "+\t\tpgdat->kswapd_max_order =  0;\n"
+ "+\t\tpgdat->classzone_idx = pgdat->nr_zones - 1;\n"
+ " \n"
+ " \t\tif (order < new_order || classzone_idx > new_classzone_idx) {\n"
+ " \t\t\t/*\n"
+ "@@ -3466,7 +3416,7 @@ static int kswapd(void *p)\n"
+ " \t\t\torder = new_order;\n"
+ " \t\t\tclasszone_idx = new_classzone_idx;\n"
+ " \t\t} else {\n"
+ "-\t\t\tkswapd_try_to_sleep(pgdat, balanced_order,\n"
+ "+\t\t\tkswapd_try_to_sleep(pgdat, order, classzone_idx,\n"
+ " \t\t\t\t\t\tbalanced_classzone_idx);\n"
+ " \t\t\torder = pgdat->kswapd_max_order;\n"
+ " \t\t\tclasszone_idx = pgdat->classzone_idx;\n"
+ "@@ -3486,9 +3436,8 @@ static int kswapd(void *p)\n"
+ " \t\t */\n"
+ " \t\tif (!ret) {\n"
+ " \t\t\ttrace_mm_vmscan_kswapd_wake(pgdat->node_id, order);\n"
+ "-\t\t\tbalanced_classzone_idx = classzone_idx;\n"
+ "-\t\t\tbalanced_order = balance_pgdat(pgdat, order,\n"
+ "-\t\t\t\t\t\t&balanced_classzone_idx);\n"
+ "+\t\t\tbalanced_classzone_idx = balance_pgdat(pgdat, order,\n"
+ "+\t\t\t\t\t\t\t\tclasszone_idx);\n"
+ " \t\t}\n"
+ " \t}\n"
+ " \n"
+ "@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)\n"
+ " \t}\n"
+ " \tif (!waitqueue_active(&pgdat->kswapd_wait))\n"
+ " \t\treturn;\n"
+ "-\tif (zone_balanced(zone, order, 0, 0))\n"
+ "+\tif (zone_balanced(zone, order, true, 0, 0))\n"
+ " \t\treturn;\n"
+ " \n"
+ " \ttrace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);\n"
+ "-- \n"
+ 2.7.2
 
-90f3c45b1c9035297052bcac0c8e8a9e18de4901973a75e842402b5df65114a1
+b713a7219c3276cf1bec511bd97cb41bf58c605627327a083d3e5b25f81a3351
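
The effect of the new "highorder" parameter of zone_balanced() in the patch
above can be modelled in userspace. The following is only a sketch under
stated assumptions, not kernel code: struct toy_zone, toy_watermark_ok(),
toy_zone_balanced() and TOY_MAX_ORDER are hypothetical names, the watermark
check is a simplified stand-in for zone_watermark_ok_safe() (no lowmem
reserve, no classzone_idx), and CONFIG_COMPACTION is assumed to be enabled.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_ORDER 11	/* assumption: orders 0..10 */

/* Hypothetical stand-in for struct zone; only what the sketch needs. */
struct toy_zone {
	unsigned long free_pages;			/* free base pages */
	unsigned long high_wmark;			/* high watermark, in pages */
	unsigned long free_at_order[TOY_MAX_ORDER];	/* free blocks per order */
};

/*
 * Simplified watermark check: there must be more free pages than the mark,
 * and for order > 0 a free block of at least that order must exist.
 */
static bool toy_watermark_ok(const struct toy_zone *z, unsigned int order,
			     unsigned long mark)
{
	unsigned int o;

	assert(order < TOY_MAX_ORDER);

	if (z->free_pages <= mark)
		return false;
	if (order == 0)
		return true;

	for (o = order; o < TOY_MAX_ORDER; o++)
		if (z->free_at_order[o])
			return true;

	return false;
}

/*
 * Mirrors the shape of the patched zone_balanced(): with highorder == false
 * the check degrades to an order-0 check against a mark raised by
 * (1 << order) pages; with highorder == true the real order is tested.
 */
static bool toy_zone_balanced(const struct toy_zone *z, unsigned int order,
			      bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	if (!highorder) {
		mark += 1UL << order;
		order = 0;
	}

	return toy_watermark_ok(z, order, mark);
}
```

For a toy zone with 5000 free pages, a high watermark of 1000 and no free
order-9 block, the order-9 check with highorder == false succeeds (kswapd
may stop reclaiming and hand over to kcompactd), while the same check with
highorder == true fails (a wakeup for the high-order request is still
considered necessary).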
