diff for duplicates of <56E02A33.2040106@suse.cz> diff --git a/a/1.txt b/N1/1.txt index 70c1cb8..898ebfc 100644 --- a/a/1.txt +++ b/N1/1.txt @@ -44,3 +44,639 @@ review, but functionality-wise the first patch leaves things somewhat weird without the third patch. ----8<---- +>From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001 +From: Vlastimil Babka <vbabka@suse.cz> +Date: Wed, 9 Mar 2016 12:45:24 +0100 +Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up + kcompactd + +Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim +and compaction to attempt making memory allocation of given order +available. The details differ from direct reclaim e.g. in having high +watermark as a goal. The code involved in kswapd's reclaim/compaction +decisions has evolved to be quite complex. Testing reveals that it +doesn't actually work in at least one scenario, and closer inspection +suggests that it could be greatly simplified without compromising on the +goal (make high-order page available) or efficiency (don't reclaim too +much). The simplification relies on doing all compaction in kcompactd, +which is simply woken up when high watermarks are reached by kswapd's +reclaim. + +The scenario where kswapd compaction doesn't work was found with mmtests +test stress-highalloc configured to attempt order-9 allocations without +direct reclaim, just waking up kswapd. There was no compaction attempt +from kswapd during the whole test. Some added instrumentation shows what +happens: + +- balance_pgdat() sets end_zone to Normal, as it's not balanced +- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it + cannot reclaim anything, so sc.nr_reclaimed is 0 +- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it + merely checks if high watermarks were reached for base pages. This is true, + so no reclaim is attempted.
For DMA, testorder=0 wasn't used, as + compaction_suitable() returned COMPACT_SKIPPED +- even though the pgdat_needs_compaction flag wasn't set to false, no + compaction happens due to the condition sc.nr_reclaimed > nr_attempted + being false (as 0 < 99) +- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0 + pgdat_balanced() is false as only the small zone DMA appears balanced + (curiously in that check, watermark appears OK and compaction_suitable() + returns COMPACT_PARTIAL, because a lower classzone_idx is used there) + +Now, even if it was decided that reclaim shouldn't be attempted on the DMA +zone, the scenario would be the same, as (sc.nr_reclaimed=0 > +nr_attempted=0) is also false. The condition really should use >= as the +comment suggests. Then there is a mismatch in the check for setting +pgdat_needs_compaction to false using low watermark, while the rest uses +high watermark, and who knows what other subtlety. Hopefully this +demonstrates that this is unsustainable. + +Luckily we can simplify this a lot. The reclaim/compaction decisions make +sense for direct reclaim scenario, but in kswapd, our primary goal is to +reach high watermark in order-0 pages. Afterwards we can attempt +compaction just once. Unlike direct reclaim, we don't reclaim extra pages +(over the high watermark), the current code already disallows it for good +reasons. + +After this patch, we simply wake up kcompactd to process the pgdat, after +we have either succeeded or failed to reach the high watermarks in kswapd, +which goes to sleep. We pass kswapd's order and classzone_idx, so +kcompactd can apply the same criteria to determine which zones are worth +compacting. Note that we use the classzone_idx from wakeup_kswapd(), not +balanced_classzone_idx which can include higher zones that kswapd tried to +balance too, but didn't consider them in pgdat_balanced(). 
+ +Since kswapd now cannot create high-order pages itself, we need to adjust +how it determines the zones to be balanced. The key element here is +adding a "highorder" parameter to zone_balanced, which, when set to false, +makes it consider only order-0 watermark instead of the desired higher +order (this was done previously by kswapd_shrink_zone(), but not +elsewhere). This false is passed for example in pgdat_balanced(). +Importantly, wakeup_kswapd() uses true to make sure kswapd and thus +kcompactd are woken up for a high-order allocation failure. + +The last thing is to decide what to do with pageblock_skip bitmap handling. +Compaction maintains a pageblock_skip bitmap to record pageblocks where +isolation recently failed. This bitmap can be reset by three ways: + +1) direct compaction is restarting after going through the full deferred cycle + +2) kswapd goes to sleep, and some other direct compaction has previously + finished scanning the whole zone and set zone->compact_blockskip_flush. + Note that a successful direct compaction clears this flag. + +3) compaction was invoked manually via trigger in /proc + +The case 2) is somewhat fuzzy to begin with, but after introducing +kcompactd we should update it. The check for direct compaction in 1), and +to set the flush flag in 2) use current_is_kswapd(), which doesn't work +for kcompactd. Thus, this patch adds bool direct_compaction to +compact_control to use in 2). For the case 1) we remove the check +completely - unlike the former kswapd compaction, kcompactd does use the +deferred compaction functionality, so flushing tied to restarting from +deferred compaction makes sense here. + +Note that when kswapd goes to sleep, kcompactd is woken up, so it will see +the flushed pageblock_skip bits. This is different from when the former +kswapd compaction observed the bits and I believe it makes more sense. 
+Kcompactd can afford to be more thorough than a direct compaction trying +to limit allocation latency, or kswapd whose primary goal is to reclaim. + +For testing, I used stress-highalloc configured to do order-9 allocations +with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on +kswapd/kcompactd reclaim/compaction (the interfering kernel builds in +phases 1 and 2 work as usual): + +stress-highalloc + 4.5-rc1+before 4.5-rc1+after + -nodirect -nodirect +Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%) +Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%) +Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%) +Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%) +Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%) +Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%) +Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%) +Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%) +Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%) + +User 3166.67 3181.09 +System 1153.37 1158.25 +Elapsed 1768.53 1799.37 + + 4.5-rc1+before 4.5-rc1+after + -nodirect -nodirect +Direct pages scanned 32938 32797 +Kswapd pages scanned 2183166 2202613 +Kswapd pages reclaimed 2152359 2143524 +Direct pages reclaimed 32735 32545 +Percentage direct scans 1% 1% +THP fault alloc 579 612 +THP collapse alloc 304 316 +THP splits 0 0 +THP fault fallback 793 778 +THP collapse fail 11 16 +Compaction stalls 1013 1007 +Compaction success 92 67 +Compaction failures 920 939 +Page migrate success 238457 721374 +Page migrate failure 23021 23469 +Compaction pages isolated 504695 1479924 +Compaction migrate scanned 661390 8812554 +Compaction free scanned 13476658 84327916 +Compaction cost 262 838 + +After this patch we see improvements in allocation success rate +(especially for phase 3) along with increased compaction activity. The +compaction stalls (direct compaction) in the interfering kernel builds +(probably THP's) also decreased somewhat thanks to kcompactd activity, yet +THP alloc successes improved a bit. 
+ +Note that elapsed and user time isn't so useful for this benchmark, +because of the background interference being unpredictable. It's just to +quickly spot some major unexpected differences. System time is somewhat +more useful and that didn't increase. + +Also (after adjusting mmtests' ftrace monitor): + +Time kswapd awake 2547781 2269241 +Time kcompactd awake 0 119253 +Time direct compacting 939937 557649 +Time kswapd compacting 0 0 +Time kcompactd compacting 0 119099 + +The decrease of overall time spent compacting appears to not match the +increased compaction stats. I suspect the tasks get rescheduled and since +the ftrace monitor doesn't see that, the reported time is wall time, not +CPU time. But arguably direct compactors care about overall latency +anyway, whether busy compacting or waiting for CPU doesn't matter. And +that latency seems to have almost halved. + +It's also interesting how much time kswapd spent awake just going through +all the priorities and failing to even try compacting, over and over.
+ +We can also configure stress-highalloc to perform both direct +reclaim/compaction and wakeup kswapd/kcompactd, by using +GFP_KERNEL|__GFP_HIGH|__GFP_COMP: + +stress-highalloc + 4.5-rc1+before 4.5-rc1+after + -direct -direct +Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%) +Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%) +Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%) +Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%) +Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%) +Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%) +Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%) +Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%) +Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%) + +User 3344.73 3246.04 +System 1194.24 1172.29 +Elapsed 1838.04 1836.76 + + 4.5-rc1+before 4.5-rc1+after + -direct -direct +Direct pages scanned 125146 120966 +Kswapd pages scanned 2119757 2135012 +Kswapd pages reclaimed 2073183 2108388 +Direct pages reclaimed 124909 120577 +Percentage direct scans 5% 5% +THP fault alloc 599 652 +THP collapse alloc 323 354 +THP splits 0 0 +THP fault fallback 806 793 +THP collapse fail 17 16 +Compaction stalls 2457 2025 +Compaction success 906 518 +Compaction failures 1551 1507 +Page migrate success 2031423 2360608 +Page migrate failure 32845 40852 +Compaction pages isolated 4129761 4802025 +Compaction migrate scanned 11996712 21750613 +Compaction free scanned 214970969 344372001 +Compaction cost 2271 2694 + +In this scenario, this patch doesn't change the overall success rate as +direct compaction already tries all it can. There's however significant +reduction in direct compaction stalls (that is, the number of allocations +that went into direct compaction). The number of successes (i.e. direct +compaction stalls that ended up with successful allocation) is reduced by +the same number. This means the offload to kcompactd is working as +expected, and direct compaction is reduced either due to detecting +contention, or compaction deferred by kcompactd. 
In the previous version +of this patchset there was some apparent reduction of success rate, but +the changes in this version (such as using sync compaction only), new +baseline kernel, and/or averaging results from 5 executions (my bet), made +this go away. + +Ftrace-based stats seem to roughly agree: + +Time kswapd awake 2532984 2326824 +Time kcompactd awake 0 257916 +Time direct compacting 864839 735130 +Time kswapd compacting 0 0 +Time kcompactd compacting 0 257585 + +Signed-off-by: Vlastimil Babka <vbabka@suse.cz> +Cc: Andrea Arcangeli <aarcange@redhat.com> +Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> +Cc: Rik van Riel <riel@redhat.com> +Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> +Cc: Mel Gorman <mgorman@techsingularity.net> +Cc: David Rientjes <rientjes@google.com> +Cc: Michal Hocko <mhocko@suse.com> +Cc: Johannes Weiner <hannes@cmpxchg.org> +Signed-off-by: Andrew Morton <akpm@linux-foundation.org> +--- + mm/compaction.c | 10 ++-- + mm/internal.h | 1 + + mm/vmscan.c | 147 ++++++++++++++++++-------------------------------------- + 3 files changed, 54 insertions(+), 104 deletions(-) + +diff --git a/mm/compaction.c b/mm/compaction.c +index 5b2bfbaa821a..ccf97b02b85f 100644 +--- a/mm/compaction.c ++++ b/mm/compaction.c +@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc, + + /* + * Mark that the PG_migrate_skip information should be cleared +- * by kswapd when it goes to sleep. kswapd does not set the ++ * by kswapd when it goes to sleep. kcompactd does not set the + * flag itself as the decision to be clear should be directly + * based on an allocation request. 
+ */ +- if (!current_is_kswapd()) ++ if (cc->direct_compaction) + zone->compact_blockskip_flush = true; + + return COMPACT_COMPLETE; +@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc) + + /* + * Clear pageblock skip if there were failures recently and compaction +- * is about to be retried after being deferred. kswapd does not do +- * this reset as it'll reset the cached information when going to sleep. ++ * is about to be retried after being deferred. + */ +- if (compaction_restarting(zone, cc->order) && !current_is_kswapd()) ++ if (compaction_restarting(zone, cc->order)) + __reset_isolation_suitable(zone); + + /* +@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order, + .mode = mode, + .alloc_flags = alloc_flags, + .classzone_idx = classzone_idx, ++ .direct_compaction = true, + }; + INIT_LIST_HEAD(&cc.freepages); + INIT_LIST_HEAD(&cc.migratepages); +diff --git a/mm/internal.h b/mm/internal.h +index 17ae0b52534b..013a786fa37f 100644 +--- a/mm/internal.h ++++ b/mm/internal.h +@@ -181,6 +181,7 @@ struct compact_control { + unsigned long last_migrated_pfn;/* Not yet flushed page being freed */ + enum migrate_mode mode; /* Async or sync migration mode */ + bool ignore_skip_hint; /* Scan blocks even if marked skip */ ++ bool direct_compaction; /* False from kcompactd or /proc/... 
*/ + int order; /* order a direct compactor needs */ + const gfp_t gfp_mask; /* gfp mask of a direct compactor */ + const int alloc_flags; /* alloc flags of a direct compactor */ +diff --git a/mm/vmscan.c b/mm/vmscan.c +index c67df4831565..23bc7e643ad8 100644 +--- a/mm/vmscan.c ++++ b/mm/vmscan.c +@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc) + } while (memcg); + } + +-static bool zone_balanced(struct zone *zone, int order, +- unsigned long balance_gap, int classzone_idx) ++static bool zone_balanced(struct zone *zone, int order, bool highorder, ++ unsigned long balance_gap, int classzone_idx) + { +- if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) + +- balance_gap, classzone_idx)) +- return false; ++ unsigned long mark = high_wmark_pages(zone) + balance_gap; + +- if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone, +- order, 0, classzone_idx) == COMPACT_SKIPPED) +- return false; ++ /* ++ * When checking from pgdat_balanced(), kswapd should stop and sleep ++ * when it reaches the high order-0 watermark and let kcompactd take ++ * over. Other callers such as wakeup_kswapd() want to determine the ++ * true high-order watermark. 
++ */ ++ if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) { ++ mark += (1UL << order); ++ order = 0; ++ } + +- return true; ++ return zone_watermark_ok_safe(zone, order, mark, classzone_idx); + } + + /* +@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx) + continue; + } + +- if (zone_balanced(zone, order, 0, i)) ++ if (zone_balanced(zone, order, false, 0, i)) + balanced_pages += zone->managed_pages; + else if (!order) + return false; +@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining, + */ + static bool kswapd_shrink_zone(struct zone *zone, + int classzone_idx, +- struct scan_control *sc, +- unsigned long *nr_attempted) ++ struct scan_control *sc) + { +- int testorder = sc->order; + unsigned long balance_gap; + bool lowmem_pressure; + +@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone, + sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone)); + + /* +- * Kswapd reclaims only single pages with compaction enabled. Trying +- * too hard to reclaim until contiguous free pages have become +- * available can hurt performance by evicting too much useful data +- * from memory. Do not reclaim more than needed for compaction. +- */ +- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order && +- compaction_suitable(zone, sc->order, 0, classzone_idx) +- != COMPACT_SKIPPED) +- testorder = 0; +- +- /* + * We put equal pressure on every zone, unless one zone has way too + * many pages free already. 
The "too many pages" is defined as the + * high wmark plus a "gap" where the gap is either the low +@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone, + * reclaim is necessary + */ + lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone)); +- if (!lowmem_pressure && zone_balanced(zone, testorder, ++ if (!lowmem_pressure && zone_balanced(zone, sc->order, false, + balance_gap, classzone_idx)) + return true; + + shrink_zone(zone, sc, zone_idx(zone) == classzone_idx); + +- /* Account for the number of pages attempted to reclaim */ +- *nr_attempted += sc->nr_to_reclaim; +- + clear_bit(ZONE_WRITEBACK, &zone->flags); + + /* +@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone, + * waits. + */ + if (zone_reclaimable(zone) && +- zone_balanced(zone, testorder, 0, classzone_idx)) { ++ zone_balanced(zone, sc->order, false, 0, classzone_idx)) { + clear_bit(ZONE_CONGESTED, &zone->flags); + clear_bit(ZONE_DIRTY, &zone->flags); + } +@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone, + * For kswapd, balance_pgdat() will work across all this node's zones until + * they are all at high_wmark_pages(zone). + * +- * Returns the final order kswapd was reclaiming at ++ * Returns the highest zone idx kswapd was reclaiming at + * + * There is special handling here for zones which are full of pinned pages. + * This can happen if the pages are all mlocked, or if they are all used by +@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone, + * interoperates with the page allocator fallback scheme to ensure that aging + * of pages is balanced across the zones. + */ +-static unsigned long balance_pgdat(pg_data_t *pgdat, int order, +- int *classzone_idx) ++static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) + { + int i; + int end_zone = 0; /* Inclusive. 
0 = ZONE_DMA */ +@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, + count_vm_event(PAGEOUTRUN); + + do { +- unsigned long nr_attempted = 0; + bool raise_priority = true; +- bool pgdat_needs_compaction = (order > 0); + + sc.nr_reclaimed = 0; + +@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, + break; + } + +- if (!zone_balanced(zone, order, 0, 0)) { ++ if (!zone_balanced(zone, order, false, 0, 0)) { + end_zone = i; + break; + } else { +@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, + if (i < 0) + goto out; + +- for (i = 0; i <= end_zone; i++) { +- struct zone *zone = pgdat->node_zones + i; +- +- if (!populated_zone(zone)) +- continue; +- +- /* +- * If any zone is currently balanced then kswapd will +- * not call compaction as it is expected that the +- * necessary pages are already available. +- */ +- if (pgdat_needs_compaction && +- zone_watermark_ok(zone, order, +- low_wmark_pages(zone), +- *classzone_idx, 0)) +- pgdat_needs_compaction = false; +- } +- + /* + * If we're getting trouble reclaiming, start doing writepage + * even in laptop mode. +@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, + * that that high watermark would be met at 100% + * efficiency. + */ +- if (kswapd_shrink_zone(zone, end_zone, +- &sc, &nr_attempted)) ++ if (kswapd_shrink_zone(zone, end_zone, &sc)) + raise_priority = false; + } + +@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order, + pfmemalloc_watermark_ok(pgdat)) + wake_up_all(&pgdat->pfmemalloc_wait); + +- /* +- * Fragmentation may mean that the system cannot be rebalanced +- * for high-order allocations in all zones. If twice the +- * allocation size has been reclaimed and the zones are still +- * not balanced then recheck the watermarks at order-0 to +- * prevent kswapd reclaiming excessively. 
Assume that a +- * process requested a high-order can direct reclaim/compact. +- */ +- if (order && sc.nr_reclaimed >= 2UL << order) +- order = sc.order = 0; +- + /* Check if kswapd should be suspending */ + if (try_to_freeze() || kthread_should_stop()) + break; + + /* +- * Compact if necessary and kswapd is reclaiming at least the +- * high watermark number of pages as requsted +- */ +- if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted) +- compact_pgdat(pgdat, order); +- +- /* + * Raise priority if scanning rate is too low or there was no + * progress in reclaiming pages + */ + if (raise_priority || !sc.nr_reclaimed) + sc.priority--; + } while (sc.priority >= 1 && +- !pgdat_balanced(pgdat, order, *classzone_idx)); ++ !pgdat_balanced(pgdat, order, classzone_idx)); + + out: + /* +- * Return the order we were reclaiming at so prepare_kswapd_sleep() +- * makes a decision on the order we were last reclaiming at. However, +- * if another caller entered the allocator slow path while kswapd +- * was awake, order will remain at the higher level ++ * Return the highest zone idx we were reclaiming at so ++ * prepare_kswapd_sleep() makes the same decisions as here. 
+ */ +- *classzone_idx = end_zone; +- return order; ++ return end_zone; + } + +-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) ++static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, ++ int classzone_idx, int balanced_classzone_idx) + { + long remaining = 0; + DEFINE_WAIT(wait); +@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) + prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); + + /* Try to sleep for a short interval */ +- if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { ++ if (prepare_kswapd_sleep(pgdat, order, remaining, ++ balanced_classzone_idx)) { + remaining = schedule_timeout(HZ/10); + finish_wait(&pgdat->kswapd_wait, &wait); + prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE); +@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) + * After a short sleep, check if it was a premature sleep. If not, then + * go fully to sleep until explicitly woken up. + */ +- if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) { ++ if (prepare_kswapd_sleep(pgdat, order, remaining, ++ balanced_classzone_idx)) { + trace_mm_vmscan_kswapd_sleep(pgdat->node_id); + + /* +@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) + */ + reset_isolation_suitable(pgdat); + ++ /* ++ * We have freed the memory, now we should compact it to make ++ * allocation of the requested order possible. 
++ */ ++ wakeup_kcompactd(pgdat, order, classzone_idx); ++ + if (!kthread_should_stop()) + schedule(); + +@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx) + static int kswapd(void *p) + { + unsigned long order, new_order; +- unsigned balanced_order; + int classzone_idx, new_classzone_idx; + int balanced_classzone_idx; + pg_data_t *pgdat = (pg_data_t*)p; +@@ -3440,23 +3394,19 @@ static int kswapd(void *p) + set_freezable(); + + order = new_order = 0; +- balanced_order = 0; + classzone_idx = new_classzone_idx = pgdat->nr_zones - 1; + balanced_classzone_idx = classzone_idx; + for ( ; ; ) { + bool ret; + + /* +- * If the last balance_pgdat was unsuccessful it's unlikely a +- * new request of a similar or harder type will succeed soon +- * so consider going to sleep on the basis we reclaimed at ++ * While we were reclaiming, there might have been another ++ * wakeup, so check the values. + */ +- if (balanced_order == new_order) { +- new_order = pgdat->kswapd_max_order; +- new_classzone_idx = pgdat->classzone_idx; +- pgdat->kswapd_max_order = 0; +- pgdat->classzone_idx = pgdat->nr_zones - 1; +- } ++ new_order = pgdat->kswapd_max_order; ++ new_classzone_idx = pgdat->classzone_idx; ++ pgdat->kswapd_max_order = 0; ++ pgdat->classzone_idx = pgdat->nr_zones - 1; + + if (order < new_order || classzone_idx > new_classzone_idx) { + /* +@@ -3466,7 +3416,7 @@ static int kswapd(void *p) + order = new_order; + classzone_idx = new_classzone_idx; + } else { +- kswapd_try_to_sleep(pgdat, balanced_order, ++ kswapd_try_to_sleep(pgdat, order, classzone_idx, + balanced_classzone_idx); + order = pgdat->kswapd_max_order; + classzone_idx = pgdat->classzone_idx; +@@ -3486,9 +3436,8 @@ static int kswapd(void *p) + */ + if (!ret) { + trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); +- balanced_classzone_idx = classzone_idx; +- balanced_order = balance_pgdat(pgdat, order, +- &balanced_classzone_idx); ++ balanced_classzone_idx = 
balance_pgdat(pgdat, order, ++ classzone_idx); + } + } + +@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx) + } + if (!waitqueue_active(&pgdat->kswapd_wait)) + return; +- if (zone_balanced(zone, order, 0, 0)) ++ if (zone_balanced(zone, order, true, 0, 0)) + return; + + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); +-- +2.7.2 diff --git a/a/content_digest b/N1/content_digest index f5e2991..9e5aab0 100644 --- a/a/content_digest +++ b/N1/content_digest @@ -71,6 +71,642 @@ "review, but functionality-wise the first patch leaves things somewhat\n" "weird without the third patch.\n" "\n" - ----8<---- + "----8<----\n" + ">From c829909527ecd33eb869c96bcd287bade2b32100 Mon Sep 17 00:00:00 2001\n" + "From: Vlastimil Babka <vbabka@suse.cz>\n" + "Date: Wed, 9 Mar 2016 12:45:24 +0100\n" + "Subject: [PATCH 3/3] mm, kswapd: replace kswapd compaction with waking up\n" + " kcompactd\n" + "\n" + "Similarly to direct reclaim/compaction, kswapd attempts to combine reclaim\n" + "and compaction to attempt making memory allocation of given order\n" + "available. The details differ from direct reclaim e.g. in having high\n" + "watermark as a goal. The code involved in kswapd's reclaim/compaction\n" + "decisions has evolved to be quite complex. Testing reveals that it\n" + "doesn't actually work in at least one scenario, and closer inspection\n" + "suggests that it could be greatly simplified without compromising on the\n" + "goal (make high-order page available) or efficiency (don't reclaim too\n" + "much). The simplification relies on doing all compaction in kcompactd,\n" + "which is simply woken up when high watermarks are reached by kswapd's\n" + "reclaim.\n" + "\n" + "The scenario where kswapd compaction doesn't work was found with mmtests\n" + "test stress-highalloc configured to attempt order-9 allocations without\n" + "direct reclaim, just waking up kswapd.
There was no compaction attempt\n" + "from kswapd during the whole test. Some added instrumentation shows what\n" + "happens:\n" + "\n" + "- balance_pgdat() sets end_zone to Normal, as it's not balanced\n" + "- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but it\n" + " cannot reclaim anything, so sc.nr_reclaimed is 0\n" + "- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so it\n" + " merely checks if high watermarks were reached for base pages. This is true,\n" + " so no reclaim is attempted. For DMA, testorder=0 wasn't used, as\n" + " compaction_suitable() returned COMPACT_SKIPPED\n" + "- even though the pgdat_needs_compaction flag wasn't set to false, no\n" + " compaction happens due to the condition sc.nr_reclaimed > nr_attempted\n" + " being false (as 0 < 99)\n" + "- priority-- due to nr_reclaimed being 0, repeat until priority reaches 0\n" + " pgdat_balanced() is false as only the small zone DMA appears balanced\n" + " (curiously in that check, watermark appears OK and compaction_suitable()\n" + " returns COMPACT_PARTIAL, because a lower classzone_idx is used there)\n" + "\n" + "Now, even if it was decided that reclaim shouldn't be attempted on the DMA\n" + "zone, the scenario would be the same, as (sc.nr_reclaimed=0 >\n" + "nr_attempted=0) is also false. The condition really should use >= as the\n" + "comment suggests. Then there is a mismatch in the check for setting\n" + "pgdat_needs_compaction to false using low watermark, while the rest uses\n" + "high watermark, and who knows what other subtlety. Hopefully this\n" + "demonstrates that this is unsustainable.\n" + "\n" + "Luckily we can simplify this a lot. The reclaim/compaction decisions make\n" + "sense for direct reclaim scenario, but in kswapd, our primary goal is to\n" + "reach high watermark in order-0 pages. Afterwards we can attempt\n" + "compaction just once. 
Unlike direct reclaim, we don't reclaim extra pages\n" + "(over the high watermark), the current code already disallows it for good\n" + "reasons.\n" + "\n" + "After this patch, we simply wake up kcompactd to process the pgdat, after\n" + "we have either succeeded or failed to reach the high watermarks in kswapd,\n" + "which goes to sleep. We pass kswapd's order and classzone_idx, so\n" + "kcompactd can apply the same criteria to determine which zones are worth\n" + "compacting. Note that we use the classzone_idx from wakeup_kswapd(), not\n" + "balanced_classzone_idx which can include higher zones that kswapd tried to\n" + "balance too, but didn't consider them in pgdat_balanced().\n" + "\n" + "Since kswapd now cannot create high-order pages itself, we need to adjust\n" + "how it determines the zones to be balanced. The key element here is\n" + "adding a \"highorder\" parameter to zone_balanced, which, when set to false,\n" + "makes it consider only order-0 watermark instead of the desired higher\n" + "order (this was done previously by kswapd_shrink_zone(), but not\n" + "elsewhere). This false is passed for example in pgdat_balanced().\n" + "Importantly, wakeup_kswapd() uses true to make sure kswapd and thus\n" + "kcompactd are woken up for a high-order allocation failure.\n" + "\n" + "The last thing is to decide what to do with pageblock_skip bitmap handling.\n" + "Compaction maintains a pageblock_skip bitmap to record pageblocks where\n" + "isolation recently failed. 
This bitmap can be reset by three ways:\n" + "\n" + "1) direct compaction is restarting after going through the full deferred cycle\n" + "\n" + "2) kswapd goes to sleep, and some other direct compaction has previously\n" + " finished scanning the whole zone and set zone->compact_blockskip_flush.\n" + " Note that a successful direct compaction clears this flag.\n" + "\n" + "3) compaction was invoked manually via trigger in /proc\n" + "\n" + "The case 2) is somewhat fuzzy to begin with, but after introducing\n" + "kcompactd we should update it. The check for direct compaction in 1), and\n" + "to set the flush flag in 2) use current_is_kswapd(), which doesn't work\n" + "for kcompactd. Thus, this patch adds bool direct_compaction to\n" + "compact_control to use in 2). For the case 1) we remove the check\n" + "completely - unlike the former kswapd compaction, kcompactd does use the\n" + "deferred compaction functionality, so flushing tied to restarting from\n" + "deferred compaction makes sense here.\n" + "\n" + "Note that when kswapd goes to sleep, kcompactd is woken up, so it will see\n" + "the flushed pageblock_skip bits. 
This is different from when the former\n" + "kswapd compaction observed the bits and I believe it makes more sense.\n" + "Kcompactd can afford to be more thorough than a direct compaction trying\n" + "to limit allocation latency, or kswapd whose primary goal is to reclaim.\n" + "\n" + "For testing, I used stress-highalloc configured to do order-9 allocations\n" + "with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just on\n" + "kswapd/kcompactd reclaim/compaction (the interfering kernel builds in\n" + "phases 1 and 2 work as usual):\n" + "\n" + "stress-highalloc\n" + " 4.5-rc1+before 4.5-rc1+after\n" + " -nodirect -nodirect\n" + "Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)\n" + "Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)\n" + "Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)\n" + "Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)\n" + "Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)\n" + "Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)\n" + "Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)\n" + "Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)\n" + "Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)\n" + "\n" + "User 3166.67 3181.09\n" + "System 1153.37 1158.25\n" + "Elapsed 1768.53 1799.37\n" + "\n" + " 4.5-rc1+before 4.5-rc1+after\n" + " -nodirect -nodirect\n" + "Direct pages scanned 32938 32797\n" + "Kswapd pages scanned 2183166 2202613\n" + "Kswapd pages reclaimed 2152359 2143524\n" + "Direct pages reclaimed 32735 32545\n" + "Percentage direct scans 1% 1%\n" + "THP fault alloc 579 612\n" + "THP collapse alloc 304 316\n" + "THP splits 0 0\n" + "THP fault fallback 793 778\n" + "THP collapse fail 11 16\n" + "Compaction stalls 1013 1007\n" + "Compaction success 92 67\n" + "Compaction failures 920 939\n" + "Page migrate success 238457 721374\n" + "Page migrate failure 23021 23469\n" + "Compaction pages isolated 504695 1479924\n" + "Compaction migrate scanned 661390 8812554\n" + "Compaction free scanned 13476658 84327916\n" + "Compaction cost 262 838\n" + "\n" + "After this patch we see improvements 
in allocation success rate\n" + "(especially for phase 3) along with increased compaction activity. The\n" + "compaction stalls (direct compaction) in the interfering kernel builds\n" + "(probably THP's) also decreased somewhat thanks to kcompactd activity, yet\n" + "THP alloc successes improved a bit.\n" + "\n" + "Note that elapsed and user time isn't so useful for this benchmark,\n" + "because of the background interference being unpredictable. It's just to\n" + "quickly spot some major unexpected differences. System time is somewhat\n" + "more useful and that didn't increase.\n" + "\n" + "Also (after adjusting mmtests' ftrace monitor):\n" + "\n" + "Time kswapd awake 2547781 2269241\n" + "Time kcompactd awake 0 119253\n" + "Time direct compacting 939937 557649\n" + "Time kswapd compacting 0 0\n" + "Time kcompactd compacting 0 119099\n" + "\n" + "The decrease of overall time spent compacting appears to not match the\n" + "increased compaction stats. I suspect the tasks get rescheduled and since\n" + "the ftrace monitor doesn't see that, the reported time is wall time, not\n" + "CPU time. But arguably direct compactors care about overall latency\n" + "anyway, whether busy compacting or waiting for CPU doesn't matter. 
And\n" + "that latency seems to have almost halved.\n" + "\n" + "It's also interesting how much time kswapd spent awake just going through\n" + "all the priorities and failing to even try compacting, over and over.\n" + "\n" + "We can also configure stress-highalloc to perform both direct\n" + "reclaim/compaction and wakeup kswapd/kcompactd, by using\n" + "GFP_KERNEL|__GFP_HIGH|__GFP_COMP:\n" + "\n" + "stress-highalloc\n" + " 4.5-rc1+before 4.5-rc1+after\n" + " -direct -direct\n" + "Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)\n" + "Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)\n" + "Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)\n" + "Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)\n" + "Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)\n" + "Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)\n" + "Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)\n" + "Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)\n" + "Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)\n" + "\n" + "User 3344.73 3246.04\n" + "System 1194.24 1172.29\n" + "Elapsed 1838.04 1836.76\n" + "\n" + " 4.5-rc1+before 4.5-rc1+after\n" + " -direct -direct\n" + "Direct pages scanned 125146 120966\n" + "Kswapd pages scanned 2119757 2135012\n" + "Kswapd pages reclaimed 2073183 2108388\n" + "Direct pages reclaimed 124909 120577\n" + "Percentage direct scans 5% 5%\n" + "THP fault alloc 599 652\n" + "THP collapse alloc 323 354\n" + "THP splits 0 0\n" + "THP fault fallback 806 793\n" + "THP collapse fail 17 16\n" + "Compaction stalls 2457 2025\n" + "Compaction success 906 518\n" + "Compaction failures 1551 1507\n" + "Page migrate success 2031423 2360608\n" + "Page migrate failure 32845 40852\n" + "Compaction pages isolated 4129761 4802025\n" + "Compaction migrate scanned 11996712 21750613\n" + "Compaction free scanned 214970969 344372001\n" + "Compaction cost 2271 2694\n" + "\n" + "In this scenario, this patch doesn't change the overall success rate as\n" + "direct compaction already tries all it can. 
There's however significant\n" + "reduction in direct compaction stalls (that is, the number of allocations\n" + "that went into direct compaction). The number of successes (i.e. direct\n" + "compaction stalls that ended up with successful allocation) is reduced by\n" + "the same number. This means the offload to kcompactd is working as\n" + "expected, and direct compaction is reduced either due to detecting\n" + "contention, or compaction deferred by kcompactd. In the previous version\n" + "of this patchset there was some apparent reduction of success rate, but\n" + "the changes in this version (such as using sync compaction only), new\n" + "baseline kernel, and/or averaging results from 5 executions (my bet), made\n" + "this go away.\n" + "\n" + "Ftrace-based stats seem to roughly agree:\n" + "\n" + "Time kswapd awake 2532984 2326824\n" + "Time kcompactd awake 0 257916\n" + "Time direct compacting 864839 735130\n" + "Time kswapd compacting 0 0\n" + "Time kcompactd compacting 0 257585\n" + "\n" + "Signed-off-by: Vlastimil Babka <vbabka@suse.cz>\n" + "Cc: Andrea Arcangeli <aarcange@redhat.com>\n" + "Cc: \"Kirill A. 
Shutemov\" <kirill.shutemov@linux.intel.com>\n" + "Cc: Rik van Riel <riel@redhat.com>\n" + "Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>\n" + "Cc: Mel Gorman <mgorman@techsingularity.net>\n" + "Cc: David Rientjes <rientjes@google.com>\n" + "Cc: Michal Hocko <mhocko@suse.com>\n" + "Cc: Johannes Weiner <hannes@cmpxchg.org>\n" + "Signed-off-by: Andrew Morton <akpm@linux-foundation.org>\n" + "---\n" + " mm/compaction.c | 10 ++--\n" + " mm/internal.h | 1 +\n" + " mm/vmscan.c | 147 ++++++++++++++++++--------------------------------------\n" + " 3 files changed, 54 insertions(+), 104 deletions(-)\n" + "\n" + "diff --git a/mm/compaction.c b/mm/compaction.c\n" + "index 5b2bfbaa821a..ccf97b02b85f 100644\n" + "--- a/mm/compaction.c\n" + "+++ b/mm/compaction.c\n" + "@@ -1191,11 +1191,11 @@ static int __compact_finished(struct zone *zone, struct compact_control *cc,\n" + " \n" + " \t\t/*\n" + " \t\t * Mark that the PG_migrate_skip information should be cleared\n" + "-\t\t * by kswapd when it goes to sleep. kswapd does not set the\n" + "+\t\t * by kswapd when it goes to sleep. kcompactd does not set the\n" + " \t\t * flag itself as the decision to be clear should be directly\n" + " \t\t * based on an allocation request.\n" + " \t\t */\n" + "-\t\tif (!current_is_kswapd())\n" + "+\t\tif (cc->direct_compaction)\n" + " \t\t\tzone->compact_blockskip_flush = true;\n" + " \n" + " \t\treturn COMPACT_COMPLETE;\n" + "@@ -1338,10 +1338,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)\n" + " \n" + " \t/*\n" + " \t * Clear pageblock skip if there were failures recently and compaction\n" + "-\t * is about to be retried after being deferred. 
kswapd does not do\n" + "-\t * this reset as it'll reset the cached information when going to sleep.\n" + "+\t * is about to be retried after being deferred.\n" + " \t */\n" + "-\tif (compaction_restarting(zone, cc->order) && !current_is_kswapd())\n" + "+\tif (compaction_restarting(zone, cc->order))\n" + " \t\t__reset_isolation_suitable(zone);\n" + " \n" + " \t/*\n" + "@@ -1477,6 +1476,7 @@ static unsigned long compact_zone_order(struct zone *zone, int order,\n" + " \t\t.mode = mode,\n" + " \t\t.alloc_flags = alloc_flags,\n" + " \t\t.classzone_idx = classzone_idx,\n" + "+\t\t.direct_compaction = true,\n" + " \t};\n" + " \tINIT_LIST_HEAD(&cc.freepages);\n" + " \tINIT_LIST_HEAD(&cc.migratepages);\n" + "diff --git a/mm/internal.h b/mm/internal.h\n" + "index 17ae0b52534b..013a786fa37f 100644\n" + "--- a/mm/internal.h\n" + "+++ b/mm/internal.h\n" + "@@ -181,6 +181,7 @@ struct compact_control {\n" + " \tunsigned long last_migrated_pfn;/* Not yet flushed page being freed */\n" + " \tenum migrate_mode mode;\t\t/* Async or sync migration mode */\n" + " \tbool ignore_skip_hint;\t\t/* Scan blocks even if marked skip */\n" + "+\tbool direct_compaction;\t\t/* False from kcompactd or /proc/... 
*/\n" + " \tint order;\t\t\t/* order a direct compactor needs */\n" + " \tconst gfp_t gfp_mask;\t\t/* gfp mask of a direct compactor */\n" + " \tconst int alloc_flags;\t\t/* alloc flags of a direct compactor */\n" + "diff --git a/mm/vmscan.c b/mm/vmscan.c\n" + "index c67df4831565..23bc7e643ad8 100644\n" + "--- a/mm/vmscan.c\n" + "+++ b/mm/vmscan.c\n" + "@@ -2951,18 +2951,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)\n" + " \t} while (memcg);\n" + " }\n" + " \n" + "-static bool zone_balanced(struct zone *zone, int order,\n" + "-\t\t\t unsigned long balance_gap, int classzone_idx)\n" + "+static bool zone_balanced(struct zone *zone, int order, bool highorder,\n" + "+\t\t\tunsigned long balance_gap, int classzone_idx)\n" + " {\n" + "-\tif (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +\n" + "-\t\t\t\t balance_gap, classzone_idx))\n" + "-\t\treturn false;\n" + "+\tunsigned long mark = high_wmark_pages(zone) + balance_gap;\n" + " \n" + "-\tif (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,\n" + "-\t\t\t\torder, 0, classzone_idx) == COMPACT_SKIPPED)\n" + "-\t\treturn false;\n" + "+\t/*\n" + "+\t * When checking from pgdat_balanced(), kswapd should stop and sleep\n" + "+\t * when it reaches the high order-0 watermark and let kcompactd take\n" + "+\t * over. 
Other callers such as wakeup_kswapd() want to determine the\n" + "+\t * true high-order watermark.\n" + "+\t */\n" + "+\tif (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {\n" + "+\t\tmark += (1UL << order);\n" + "+\t\torder = 0;\n" + "+\t}\n" + " \n" + "-\treturn true;\n" + "+\treturn zone_watermark_ok_safe(zone, order, mark, classzone_idx);\n" + " }\n" + " \n" + " /*\n" + "@@ -3012,7 +3017,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)\n" + " \t\t\tcontinue;\n" + " \t\t}\n" + " \n" + "-\t\tif (zone_balanced(zone, order, 0, i))\n" + "+\t\tif (zone_balanced(zone, order, false, 0, i))\n" + " \t\t\tbalanced_pages += zone->managed_pages;\n" + " \t\telse if (!order)\n" + " \t\t\treturn false;\n" + "@@ -3066,10 +3071,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,\n" + " */\n" + " static bool kswapd_shrink_zone(struct zone *zone,\n" + " \t\t\t int classzone_idx,\n" + "-\t\t\t struct scan_control *sc,\n" + "-\t\t\t unsigned long *nr_attempted)\n" + "+\t\t\t struct scan_control *sc)\n" + " {\n" + "-\tint testorder = sc->order;\n" + " \tunsigned long balance_gap;\n" + " \tbool lowmem_pressure;\n" + " \n" + "@@ -3077,17 +3080,6 @@ static bool kswapd_shrink_zone(struct zone *zone,\n" + " \tsc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));\n" + " \n" + " \t/*\n" + "-\t * Kswapd reclaims only single pages with compaction enabled. Trying\n" + "-\t * too hard to reclaim until contiguous free pages have become\n" + "-\t * available can hurt performance by evicting too much useful data\n" + "-\t * from memory. Do not reclaim more than needed for compaction.\n" + "-\t */\n" + "-\tif (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&\n" + "-\t\t\tcompaction_suitable(zone, sc->order, 0, classzone_idx)\n" + "-\t\t\t\t\t\t\t!= COMPACT_SKIPPED)\n" + "-\t\ttestorder = 0;\n" + "-\n" + "-\t/*\n" + " \t * We put equal pressure on every zone, unless one zone has way too\n" + " \t * many pages free already. 
The \"too many pages\" is defined as the\n" + " \t * high wmark plus a \"gap\" where the gap is either the low\n" + "@@ -3101,15 +3093,12 @@ static bool kswapd_shrink_zone(struct zone *zone,\n" + " \t * reclaim is necessary\n" + " \t */\n" + " \tlowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));\n" + "-\tif (!lowmem_pressure && zone_balanced(zone, testorder,\n" + "+\tif (!lowmem_pressure && zone_balanced(zone, sc->order, false,\n" + " \t\t\t\t\t\tbalance_gap, classzone_idx))\n" + " \t\treturn true;\n" + " \n" + " \tshrink_zone(zone, sc, zone_idx(zone) == classzone_idx);\n" + " \n" + "-\t/* Account for the number of pages attempted to reclaim */\n" + "-\t*nr_attempted += sc->nr_to_reclaim;\n" + "-\n" + " \tclear_bit(ZONE_WRITEBACK, &zone->flags);\n" + " \n" + " \t/*\n" + "@@ -3119,7 +3108,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n" + " \t * waits.\n" + " \t */\n" + " \tif (zone_reclaimable(zone) &&\n" + "-\t zone_balanced(zone, testorder, 0, classzone_idx)) {\n" + "+\t zone_balanced(zone, sc->order, false, 0, classzone_idx)) {\n" + " \t\tclear_bit(ZONE_CONGESTED, &zone->flags);\n" + " \t\tclear_bit(ZONE_DIRTY, &zone->flags);\n" + " \t}\n" + "@@ -3131,7 +3120,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n" + " * For kswapd, balance_pgdat() will work across all this node's zones until\n" + " * they are all at high_wmark_pages(zone).\n" + " *\n" + "- * Returns the final order kswapd was reclaiming at\n" + "+ * Returns the highest zone idx kswapd was reclaiming at\n" + " *\n" + " * There is special handling here for zones which are full of pinned pages.\n" + " * This can happen if the pages are all mlocked, or if they are all used by\n" + "@@ -3148,8 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,\n" + " * interoperates with the page allocator fallback scheme to ensure that aging\n" + " * of pages is balanced across the zones.\n" + " */\n" + "-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + 
"-\t\t\t\t\t\t\tint *classzone_idx)\n" + "+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)\n" + " {\n" + " \tint i;\n" + " \tint end_zone = 0;\t/* Inclusive. 0 = ZONE_DMA */\n" + "@@ -3166,9 +3154,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + " \tcount_vm_event(PAGEOUTRUN);\n" + " \n" + " \tdo {\n" + "-\t\tunsigned long nr_attempted = 0;\n" + " \t\tbool raise_priority = true;\n" + "-\t\tbool pgdat_needs_compaction = (order > 0);\n" + " \n" + " \t\tsc.nr_reclaimed = 0;\n" + " \n" + "@@ -3203,7 +3189,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + " \t\t\t\tbreak;\n" + " \t\t\t}\n" + " \n" + "-\t\t\tif (!zone_balanced(zone, order, 0, 0)) {\n" + "+\t\t\tif (!zone_balanced(zone, order, false, 0, 0)) {\n" + " \t\t\t\tend_zone = i;\n" + " \t\t\t\tbreak;\n" + " \t\t\t} else {\n" + "@@ -3219,24 +3205,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + " \t\tif (i < 0)\n" + " \t\t\tgoto out;\n" + " \n" + "-\t\tfor (i = 0; i <= end_zone; i++) {\n" + "-\t\t\tstruct zone *zone = pgdat->node_zones + i;\n" + "-\n" + "-\t\t\tif (!populated_zone(zone))\n" + "-\t\t\t\tcontinue;\n" + "-\n" + "-\t\t\t/*\n" + "-\t\t\t * If any zone is currently balanced then kswapd will\n" + "-\t\t\t * not call compaction as it is expected that the\n" + "-\t\t\t * necessary pages are already available.\n" + "-\t\t\t */\n" + "-\t\t\tif (pgdat_needs_compaction &&\n" + "-\t\t\t\t\tzone_watermark_ok(zone, order,\n" + "-\t\t\t\t\t\tlow_wmark_pages(zone),\n" + "-\t\t\t\t\t\t*classzone_idx, 0))\n" + "-\t\t\t\tpgdat_needs_compaction = false;\n" + "-\t\t}\n" + "-\n" + " \t\t/*\n" + " \t\t * If we're getting trouble reclaiming, start doing writepage\n" + " \t\t * even in laptop mode.\n" + "@@ -3280,8 +3248,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + " \t\t\t * that that high watermark would be met at 100%\n" + " \t\t\t * efficiency.\n" + " \t\t\t */\n" + "-\t\t\tif 
(kswapd_shrink_zone(zone, end_zone,\n" + "-\t\t\t\t\t &sc, &nr_attempted))\n" + "+\t\t\tif (kswapd_shrink_zone(zone, end_zone, &sc))\n" + " \t\t\t\traise_priority = false;\n" + " \t\t}\n" + " \n" + "@@ -3294,49 +3261,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,\n" + " \t\t\t\tpfmemalloc_watermark_ok(pgdat))\n" + " \t\t\twake_up_all(&pgdat->pfmemalloc_wait);\n" + " \n" + "-\t\t/*\n" + "-\t\t * Fragmentation may mean that the system cannot be rebalanced\n" + "-\t\t * for high-order allocations in all zones. If twice the\n" + "-\t\t * allocation size has been reclaimed and the zones are still\n" + "-\t\t * not balanced then recheck the watermarks at order-0 to\n" + "-\t\t * prevent kswapd reclaiming excessively. Assume that a\n" + "-\t\t * process requested a high-order can direct reclaim/compact.\n" + "-\t\t */\n" + "-\t\tif (order && sc.nr_reclaimed >= 2UL << order)\n" + "-\t\t\torder = sc.order = 0;\n" + "-\n" + " \t\t/* Check if kswapd should be suspending */\n" + " \t\tif (try_to_freeze() || kthread_should_stop())\n" + " \t\t\tbreak;\n" + " \n" + " \t\t/*\n" + "-\t\t * Compact if necessary and kswapd is reclaiming at least the\n" + "-\t\t * high watermark number of pages as requsted\n" + "-\t\t */\n" + "-\t\tif (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)\n" + "-\t\t\tcompact_pgdat(pgdat, order);\n" + "-\n" + "-\t\t/*\n" + " \t\t * Raise priority if scanning rate is too low or there was no\n" + " \t\t * progress in reclaiming pages\n" + " \t\t */\n" + " \t\tif (raise_priority || !sc.nr_reclaimed)\n" + " \t\t\tsc.priority--;\n" + " \t} while (sc.priority >= 1 &&\n" + "-\t\t !pgdat_balanced(pgdat, order, *classzone_idx));\n" + "+\t\t\t!pgdat_balanced(pgdat, order, classzone_idx));\n" + " \n" + " out:\n" + " \t/*\n" + "-\t * Return the order we were reclaiming at so prepare_kswapd_sleep()\n" + "-\t * makes a decision on the order we were last reclaiming at. 
However,\n" + "-\t * if another caller entered the allocator slow path while kswapd\n" + "-\t * was awake, order will remain at the higher level\n" + "+\t * Return the highest zone idx we were reclaiming at so\n" + "+\t * prepare_kswapd_sleep() makes the same decisions as here.\n" + " \t */\n" + "-\t*classzone_idx = end_zone;\n" + "-\treturn order;\n" + "+\treturn end_zone;\n" + " }\n" + " \n" + "-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n" + "+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,\n" + "+\t\t\t\tint classzone_idx, int balanced_classzone_idx)\n" + " {\n" + " \tlong remaining = 0;\n" + " \tDEFINE_WAIT(wait);\n" + "@@ -3347,7 +3294,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n" + " \tprepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);\n" + " \n" + " \t/* Try to sleep for a short interval */\n" + "-\tif (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {\n" + "+\tif (prepare_kswapd_sleep(pgdat, order, remaining,\n" + "+\t\t\t\t\t\tbalanced_classzone_idx)) {\n" + " \t\tremaining = schedule_timeout(HZ/10);\n" + " \t\tfinish_wait(&pgdat->kswapd_wait, &wait);\n" + " \t\tprepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);\n" + "@@ -3357,7 +3305,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n" + " \t * After a short sleep, check if it was a premature sleep. 
If not, then\n" + " \t * go fully to sleep until explicitly woken up.\n" + " \t */\n" + "-\tif (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {\n" + "+\tif (prepare_kswapd_sleep(pgdat, order, remaining,\n" + "+\t\t\t\t\t\tbalanced_classzone_idx)) {\n" + " \t\ttrace_mm_vmscan_kswapd_sleep(pgdat->node_id);\n" + " \n" + " \t\t/*\n" + "@@ -3378,6 +3327,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n" + " \t\t */\n" + " \t\treset_isolation_suitable(pgdat);\n" + " \n" + "+\t\t/*\n" + "+\t\t * We have freed the memory, now we should compact it to make\n" + "+\t\t * allocation of the requested order possible.\n" + "+\t\t */\n" + "+\t\twakeup_kcompactd(pgdat, order, classzone_idx);\n" + "+\n" + " \t\tif (!kthread_should_stop())\n" + " \t\t\tschedule();\n" + " \n" + "@@ -3407,7 +3362,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)\n" + " static int kswapd(void *p)\n" + " {\n" + " \tunsigned long order, new_order;\n" + "-\tunsigned balanced_order;\n" + " \tint classzone_idx, new_classzone_idx;\n" + " \tint balanced_classzone_idx;\n" + " \tpg_data_t *pgdat = (pg_data_t*)p;\n" + "@@ -3440,23 +3394,19 @@ static int kswapd(void *p)\n" + " \tset_freezable();\n" + " \n" + " \torder = new_order = 0;\n" + "-\tbalanced_order = 0;\n" + " \tclasszone_idx = new_classzone_idx = pgdat->nr_zones - 1;\n" + " \tbalanced_classzone_idx = classzone_idx;\n" + " \tfor ( ; ; ) {\n" + " \t\tbool ret;\n" + " \n" + " \t\t/*\n" + "-\t\t * If the last balance_pgdat was unsuccessful it's unlikely a\n" + "-\t\t * new request of a similar or harder type will succeed soon\n" + "-\t\t * so consider going to sleep on the basis we reclaimed at\n" + "+\t\t * While we were reclaiming, there might have been another\n" + "+\t\t * wakeup, so check the values.\n" + " \t\t */\n" + "-\t\tif (balanced_order == new_order) {\n" + "-\t\t\tnew_order = pgdat->kswapd_max_order;\n" + "-\t\t\tnew_classzone_idx = 
pgdat->classzone_idx;\n" + "-\t\t\tpgdat->kswapd_max_order = 0;\n" + "-\t\t\tpgdat->classzone_idx = pgdat->nr_zones - 1;\n" + "-\t\t}\n" + "+\t\tnew_order = pgdat->kswapd_max_order;\n" + "+\t\tnew_classzone_idx = pgdat->classzone_idx;\n" + "+\t\tpgdat->kswapd_max_order = 0;\n" + "+\t\tpgdat->classzone_idx = pgdat->nr_zones - 1;\n" + " \n" + " \t\tif (order < new_order || classzone_idx > new_classzone_idx) {\n" + " \t\t\t/*\n" + "@@ -3466,7 +3416,7 @@ static int kswapd(void *p)\n" + " \t\t\torder = new_order;\n" + " \t\t\tclasszone_idx = new_classzone_idx;\n" + " \t\t} else {\n" + "-\t\t\tkswapd_try_to_sleep(pgdat, balanced_order,\n" + "+\t\t\tkswapd_try_to_sleep(pgdat, order, classzone_idx,\n" + " \t\t\t\t\t\tbalanced_classzone_idx);\n" + " \t\t\torder = pgdat->kswapd_max_order;\n" + " \t\t\tclasszone_idx = pgdat->classzone_idx;\n" + "@@ -3486,9 +3436,8 @@ static int kswapd(void *p)\n" + " \t\t */\n" + " \t\tif (!ret) {\n" + " \t\t\ttrace_mm_vmscan_kswapd_wake(pgdat->node_id, order);\n" + "-\t\t\tbalanced_classzone_idx = classzone_idx;\n" + "-\t\t\tbalanced_order = balance_pgdat(pgdat, order,\n" + "-\t\t\t\t\t\t&balanced_classzone_idx);\n" + "+\t\t\tbalanced_classzone_idx = balance_pgdat(pgdat, order,\n" + "+\t\t\t\t\t\t\t\tclasszone_idx);\n" + " \t\t}\n" + " \t}\n" + " \n" + "@@ -3518,7 +3467,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)\n" + " \t}\n" + " \tif (!waitqueue_active(&pgdat->kswapd_wait))\n" + " \t\treturn;\n" + "-\tif (zone_balanced(zone, order, 0, 0))\n" + "+\tif (zone_balanced(zone, order, true, 0, 0))\n" + " \t\treturn;\n" + " \n" + " \ttrace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);\n" + "-- \n" + 2.7.2 -90f3c45b1c9035297052bcac0c8e8a9e18de4901973a75e842402b5df65114a1 +b713a7219c3276cf1bec511bd97cb41bf58c605627327a083d3e5b25f81a3351