* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-05  8:07 ` Hillf Danton
  2016-07-05 10:55   ` Mel Gorman

From: Hillf Danton
To: Mel Gorman
Cc: Michal Hocko, linux-kernel, linux-mm, Andrew Morton

>
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> The retry logic is only critical for allocations that can use any zones.
> Hence this patch will not retry reclaim or compaction for such allocations.
> This should not be a problem for reclaim as zone-constrained allocations
> are immune from OOM kill. For retries, a very rough approximation is made
> whether to retry or not. While it is possible this will make the wrong
> decision on occasion, it will not infinite loop as the number of reclaim
> attempts is capped by MAX_RECLAIM_RETRIES.
>
> The highmem calculations only care about the global count of file pages
> in highmem. Hence, a global counter is used instead of per-zone stats.
> With this, the per-zone double accounting disappears.
>
> Suggested by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/mm_inline.h | 20 +++++++++++--
>  include/linux/mmzone.h    |  4 ---
>  include/linux/swap.h      |  1 -
>  mm/compaction.c           | 22 ++++++++++++++-
>  mm/migrate.c              |  2 --
>  mm/page-writeback.c       | 13 ++++-----
>  mm/page_alloc.c           | 71 ++++++++++++++++++++++++++++++++---------------
>  mm/vmscan.c               | 16 ----------
>  mm/vmstat.c               |  3 --
>  9 files changed, 92 insertions(+), 60 deletions(-)
>
[...]
> @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  {
>          struct zone *zone;
>          struct zoneref *z;
> +        pg_data_t *current_pgdat = NULL;
>
>          /*
>           * Make sure we converge to OOM if we cannot make any progress
> @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>                  return false;
>
>          /*
> +         * Blindly retry allocation requests that cannot use all zones. We do
> +         * not have a reliable and fast means of calculating reclaimable, dirty
> +         * and writeback pages in eligible zones.
> +         */
> +        if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
> +                goto out;
> +
> +        /*
>           * Keep reclaiming pages while there is a chance this will lead somewhere.
>           * If none of the target zones can satisfy our allocation request even
>           * if all reclaimable pages are considered then we are screwed and have
> @@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>                                          ac->nodemask) {
>                  unsigned long available;
>                  unsigned long reclaimable;
> +                unsigned long write_pending = 0;
> +                int zid;
> +
> +                if (current_pgdat == zone->zone_pgdat)
> +                        continue;
>
> -                available = reclaimable = zone_reclaimable_pages(zone);
> +                current_pgdat = zone->zone_pgdat;
> +                available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
>                  available -= DIV_ROUND_UP(no_progress_loops * available,
>                                            MAX_RECLAIM_RETRIES);
> -                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +                write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
> +                        node_page_state(current_pgdat, NR_FILE_DIRTY);
>
> -                /*
> -                 * Would the allocation succeed if we reclaimed the whole
> -                 * available?
> -                 */
> -                if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> -                                ac_classzone_idx(ac), alloc_flags, available)) {
> -                        /*
> -                         * If we didn't make any progress and have a lot of
> -                         * dirty + writeback pages then we should wait for
> -                         * an IO to complete to slow down the reclaim and
> -                         * prevent from pre mature OOM
> -                         */
> -                        if (!did_some_progress) {
> -                                unsigned long write_pending;
> +                /* Account for all free pages on eligible zones */
> +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> +                        struct zone *acct_zone = &current_pgdat->node_zones[zid];
>
> -                                write_pending = zone_page_state_snapshot(zone,
> -                                                        NR_ZONE_WRITE_PENDING);
> +                        available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
> +                }
>
> -                                if (2 * write_pending > reclaimable) {
> -                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
> -                                        return true;
> -                                }
> +                /*
> +                 * If we didn't make any progress and have a lot of
> +                 * dirty + writeback pages then we should wait for an IO to
> +                 * complete to slow down the reclaim and prevent from premature
> +                 * OOM.
> +                 */
> +                if (!did_some_progress) {
> +                        if (2 * write_pending > reclaimable) {
> +                                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +                                return true;
>                          }
> +                }
>
> +                /*
> +                 * Would the allocation succeed if we reclaimed the whole
> +                 * available? This is approximate because there is no
> +                 * accurate count of reclaimable pages per zone.
> +                 */
> +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> +                        struct zone *check_zone = &current_pgdat->node_zones[zid];
> +                        unsigned long estimate;
> +
> +                        estimate = min(check_zone->managed_pages, available);
> +                        if (__zone_watermark_ok(check_zone, order,
> +                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
> +                                alloc_flags, available)) {
> +                        }

Stray indent?

> +out:
>          /*
>           * Memory allocation/reclaim might be called from a WQ
>           * context and the current implementation of the WQ
[...]
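One detail in the hunk above deserves a plain-language restatement: when
reclaim makes no progress and most of what could be reclaimed is still
waiting on IO, the allocator sleeps briefly instead of declaring OOM.
Distilled into a standalone predicate (the helper name is invented here
purely for illustration; the patch open-codes the test):

        /*
         * Invented helper, for illustration only: should_reclaim_retry()
         * open-codes this test. With no reclaim progress, "more than half
         * of the node's reclaimable pages are dirty or under writeback"
         * means waiting for IO is more useful than another LRU scan.
         */
        static bool mostly_write_pending(unsigned long write_pending,
                                         unsigned long reclaimable)
        {
                return 2 * write_pending > reclaimable;
        }

For example, with 1000 reclaimable pages on a node, 501 or more
dirty/writeback pages triggers congestion_wait(BLK_RW_ASYNC, HZ/10)
followed by another retry.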
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-05 10:55   ` Mel Gorman

From: Mel Gorman
To: Hillf Danton
Cc: Michal Hocko, linux-kernel, linux-mm, Andrew Morton

On Tue, Jul 05, 2016 at 04:07:23PM +0800, Hillf Danton wrote:
> > +                /*
> > +                 * Would the allocation succeed if we reclaimed the whole
> > +                 * available? This is approximate because there is no
> > +                 * accurate count of reclaimable pages per zone.
> > +                 */
> > +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> > +                        struct zone *check_zone = &current_pgdat->node_zones[zid];
> > +                        unsigned long estimate;
> > +
> > +                        estimate = min(check_zone->managed_pages, available);
> > +                        if (__zone_watermark_ok(check_zone, order,
> > +                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
> > +                                alloc_flags, available)) {
> > +                        }
>
> Stray indent?
>

Last minute rebase-related damage. I'll fix it.

--
Mel Gorman
SUSE Labs
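For readers following along, the damaged hunk presumably wants to conclude
the zone walk along these lines, returning true once any eligible zone
could pass its watermark, and actually using the clamped estimate that the
posted version computes but never passes. This is a sketch inferred from
the surrounding logic, not the actual follow-up fix:

        /*
         * Sketch only: inferred intent of the hunk flagged above. The
         * node-wide 'available' figure is clamped to each zone's managed
         * pages so a single zone is never credited with more memory than
         * it could possibly hold.
         */
        for (zid = 0; zid <= zone_idx(zone); zid++) {
                struct zone *check_zone = &current_pgdat->node_zones[zid];
                unsigned long estimate;

                estimate = min(check_zone->managed_pages, available);
                if (__zone_watermark_ok(check_zone, order,
                                min_wmark_pages(check_zone),
                                ac_classzone_idx(ac), alloc_flags,
                                estimate))
                        return true;    /* one more retry is justified */
        }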
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
@ 2016-07-01 20:01 Mel Gorman
  2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman

From: Mel Gorman
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

(Sorry for the resend, I accidentally sent the branch that still had the
Signed-off-by's from mmotm applied, which is incorrect.)

Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node and is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages in
   that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages
   due to when they were allocated can result in a difference in aging.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always be
   different to a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other. In the ideal case this mitigates
   the page allocator using pages that were allocated very recently, but
   it's sensitive to timing. When kswapd is reclaiming from the lower
   zones then it's great, but during the rebalancing of the highest zone
   the page allocator and kswapd interfere with each other. It's worse if
   the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large
highmem zones in common configurations and it was necessary to quickly
find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern
as machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and large amounts of memory are rare.
Machines that do use highmem should have lower highmem:lowmem ratios than
we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the
NUMA machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                       4.7.0-rc4             4.7.0-rc4
                                  mmotm-20160623            nodelru-v8
Min      total-odr0-1           490.00 (  0.00%)      463.00 (  5.51%)
Min      total-odr0-2           349.00 (  0.00%)      325.00 (  6.88%)
Min      total-odr0-4           288.00 (  0.00%)      272.00 (  5.56%)
Min      total-odr0-8           250.00 (  0.00%)      235.00 (  6.00%)
Min      total-odr0-16          234.00 (  0.00%)      222.00 (  5.13%)
Min      total-odr0-32          223.00 (  0.00%)      205.00 (  8.07%)
Min      total-odr0-64          217.00 (  0.00%)      202.00 (  6.91%)
Min      total-odr0-128         214.00 (  0.00%)      207.00 (  3.27%)
Min      total-odr0-256         242.00 (  0.00%)      242.00 (  0.00%)
Min      total-odr0-512         272.00 (  0.00%)      265.00 (  2.57%)
Min      total-odr0-1024        290.00 (  0.00%)      283.00 (  2.41%)
Min      total-odr0-2048        302.00 (  0.00%)      296.00 (  1.99%)
Min      total-odr0-4096        311.00 (  0.00%)      306.00 (  1.61%)
Min      total-odr0-8192        314.00 (  0.00%)      309.00 (  1.59%)
Min      total-odr0-16384       315.00 (  0.00%)      309.00 (  1.90%)
Min      total-odr1-1           741.00 (  0.00%)      716.00 (  3.37%)
Min      total-odr1-2           565.00 (  0.00%)      524.00 (  7.26%)
Min      total-odr1-4           457.00 (  0.00%)      427.00 (  6.56%)
Min      total-odr1-8           408.00 (  0.00%)      371.00 (  9.07%)
Min      total-odr1-16          383.00 (  0.00%)      344.00 ( 10.18%)
Min      total-odr1-32          378.00 (  0.00%)      334.00 ( 11.64%)
Min      total-odr1-64          383.00 (  0.00%)      334.00 ( 12.79%)
Min      total-odr1-128         376.00 (  0.00%)      342.00 (  9.04%)
Min      total-odr1-256         381.00 (  0.00%)      343.00 (  9.97%)
Min      total-odr1-512         388.00 (  0.00%)      349.00 ( 10.05%)
Min      total-odr1-1024        386.00 (  0.00%)      356.00 (  7.77%)
Min      total-odr1-2048        389.00 (  0.00%)      362.00 (  6.94%)
Min      total-odr1-4096        389.00 (  0.00%)      362.00 (  6.94%)
Min      total-odr1-8192        389.00 (  0.00%)      362.00 (  6.94%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times:

               4.7.0-rc4   4.7.0-rc4
          mmotm-20160623  nodelru-v8
User              191.39      191.61
System           2651.24     2504.48
Elapsed          2904.40     2757.01

The vmstats also showed that the fair zone allocation policy was
definitely removed as can be seen here:

                             4.7.0-rc3       4.7.0-rc3
                        mmotm-20160623      nodelru-v8
DMA32 allocs               28794771816               0
Normal allocs              48432582848     77227356392
Movable allocs                       0               0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain
resident while new pages get reclaimed. The fair zone allocation policy
mitigates this problem so pages age fairly. While the benchmark has
problems, it is important that tiobench performance remains constant as it
implies that page aging problems that the fair zone allocation policy
fixes are not re-introduced.

                                    4.7.0-rc4             4.7.0-rc4
                               mmotm-20160623            nodelru-v8
Min      PotentialReadSpeed      89.65 (  0.00%)       90.34 (  0.77%)
Min      SeqRead-MB/sec-1        82.68 (  0.00%)       83.13 (  0.54%)
Min      SeqRead-MB/sec-2        72.76 (  0.00%)       72.15 ( -0.84%)
Min      SeqRead-MB/sec-4        75.13 (  0.00%)       74.23 ( -1.20%)
Min      SeqRead-MB/sec-8        64.91 (  0.00%)       65.25 (  0.52%)
Min      SeqRead-MB/sec-16       62.24 (  0.00%)       62.76 (  0.84%)
Min      RandRead-MB/sec-1        0.88 (  0.00%)        0.95 (  7.95%)
Min      RandRead-MB/sec-2        0.95 (  0.00%)        0.94 ( -1.05%)
Min      RandRead-MB/sec-4        1.43 (  0.00%)        1.46 (  2.10%)
Min      RandRead-MB/sec-8        1.61 (  0.00%)        1.58 ( -1.86%)
Min      RandRead-MB/sec-16       1.80 (  0.00%)        1.93 (  7.22%)
Min      SeqWrite-MB/sec-1       76.41 (  0.00%)       78.84 (  3.18%)
Min      SeqWrite-MB/sec-2       74.11 (  0.00%)       73.35 ( -1.03%)
Min      SeqWrite-MB/sec-4       80.05 (  0.00%)       78.69 ( -1.70%)
Min      SeqWrite-MB/sec-8       72.88 (  0.00%)       71.38 ( -2.06%)
Min      SeqWrite-MB/sec-16      75.91 (  0.00%)       75.81 ( -0.13%)
Min      RandWrite-MB/sec-1       1.18 (  0.00%)        1.12 ( -5.08%)
Min      RandWrite-MB/sec-2       1.02 (  0.00%)        1.02 (  0.00%)
Min      RandWrite-MB/sec-4       1.05 (  0.00%)        0.99 ( -5.71%)
Min      RandWrite-MB/sec-8       0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16      0.92 (  0.00%)        0.89 ( -3.26%)

This shows that the series has little or no impact on tiobench which is
desirable. It indicates that the fair zone allocation policy was removed
in a manner that didn't reintroduce one class of page aging bug. There
were only minor differences in overall reclaim activity:

                             4.7.0-rc4   4.7.0-rc4
                        mmotm-20160623  nodelru-v8
Minor Faults                    645838      644036
Major Faults                       573         593
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls                   24           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44154171
Normal allocs                 78053072    79865782
Movable allocs                       0           0
Direct pages scanned             10969       54504
Kswapd pages scanned          93375144    93250583
Kswapd pages reclaimed        93372243    93247714
Direct pages reclaimed           10969       54504
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13711.950
Direct efficiency                 100%        100%
Direct velocity                  1.614       8.014
Percentage direct scans             0%          0%
Zone normal velocity          8641.875   13719.964
Zone dma32 velocity           5100.754       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate              37          54

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 8 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe.

pgbench Transactions
                      4.7.0-rc4             4.7.0-rc4
                 mmotm-20160623            nodelru-v8
Hmean    1     188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5     330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12    370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21    368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30    382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32    428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts
of data from disk.

                            4.7.0-rc3             4.7.0-rc3
                       mmotm-20160615         nodelru-v7r17
Amean    Elapsd-1     181.57 (  0.00%)      179.63 (  1.07%)
Amean    Elapsd-3     188.29 (  0.00%)      183.68 (  2.45%)
Amean    Elapsd-5     188.02 (  0.00%)      181.73 (  3.35%)
Amean    Elapsd-7     186.07 (  0.00%)      184.11 (  1.05%)
Amean    Elapsd-12    188.16 (  0.00%)      183.51 (  2.47%)
Amean    Elapsd-16    189.03 (  0.00%)      181.27 (  4.10%)

               4.7.0-rc3      4.7.0-rc3
          mmotm-20160615  nodelru-v7r17
User             1439.23        1433.37
System           8332.31        8216.01
Elapsed          3619.80        3532.69

There is a slight gain in performance, some of which is from the reduced
system CPU usage. There are minor differences in reclaim activity but
nothing significant:

                             4.7.0-rc3      4.7.0-rc3
                        mmotm-20160615  nodelru-v7r17
Minor Faults                    362486         358215
Major Faults                      1143           1113
Swap Ins                            26              0
Swap Outs                         2920            482
DMA allocs                           0              0
DMA32 allocs                  31568814       28598887
Normal allocs                 46539922       49514444
Movable allocs                       0              0
Allocation stalls                    0              0
Direct pages scanned                 0              0
Kswapd pages scanned          40886878       40849710
Kswapd pages reclaimed        40869923       40835207
Direct pages reclaimed               0              0
Kswapd efficiency                  99%            99%
Kswapd velocity              11295.342      11563.344
Direct efficiency                 100%           100%
Direct velocity                  0.000          0.000
Slabs scanned                   131673         126099
Direct inode steals                 57             60
Kswapd inode steals                762             18

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in the number of inodes reclaimed but the workload has few active inodes
so it is likely a timing artifact. It's interesting to note that node-lru
did not swap in any pages but given the low swap activity, it's unlikely
to be significant.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap       16.6283 (  0.00%)     16.1394 (  2.94%)
1st-qrtle   mmap       54.7570 (  0.00%)     55.2975 ( -0.99%)
2nd-qrtle   mmap       57.3163 (  0.00%)     57.5230 ( -0.36%)
3rd-qrtle   mmap       58.9976 (  0.00%)     58.0537 (  1.60%)
Max-90%     mmap       59.7433 (  0.00%)     58.3910 (  2.26%)
Max-93%     mmap       60.1298 (  0.00%)     58.4801 (  2.74%)
Max-95%     mmap       73.4112 (  0.00%)     58.5537 ( 20.24%)
Max-99%     mmap       92.8542 (  0.00%)     58.9673 ( 36.49%)
Max         mmap     1440.6569 (  0.00%)    137.6875 ( 90.44%)
Mean        mmap       59.3493 (  0.00%)     55.5153 (  6.46%)
Best99%Mean mmap       57.2121 (  0.00%)     55.4194 (  3.13%)
Best95%Mean mmap       55.9113 (  0.00%)     55.2813 (  1.13%)
Best90%Mean mmap       55.6199 (  0.00%)     55.1044 (  0.93%)
Best50%Mean mmap       53.2183 (  0.00%)     52.8330 (  0.72%)
Best10%Mean mmap       45.9842 (  0.00%)     42.3740 (  7.85%)
Best5%Mean  mmap       43.2256 (  0.00%)     38.8660 ( 10.09%)
Best1%Mean  mmap       32.9388 (  0.00%)     27.7577 ( 15.73%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting:

                             4.7.0-rc4    4.7.0-rc4
                        mmotm-20160623   nodelru-v8
Swap Ins                           163          239
Swap Outs                            0            0
Allocation stalls                 2603            0
DMA allocs                           0            0
DMA32 allocs                 618719206   1303037965
Normal allocs                891235743    229914091
Movable allocs                       0            0
Direct pages scanned            216787         3173
Kswapd pages scanned          50719775     41732250
Kswapd pages reclaimed        41541765     41731168
Direct pages reclaimed          209159         3173
Kswapd efficiency                  81%          99%
Kswapd velocity              16859.554    14231.043
Direct efficiency                  96%         100%
Direct velocity                 72.061        1.082
Percentage direct scans             0%           0%
Zone normal velocity          8431.777    14232.125
Zone dma32 velocity           8499.838        0.000
Zone dma velocity                0.000        0.000
Page writes by reclaim     6215049.000        0.000
Page writes file               6215049            0
Page writes anon                     0            0
Page reclaim immediate           70673          143
Sector Reads                  81940800     81489388
Sector Writes                100158984     99161860
Page rescued immediate               0            0
Slabs scanned                  1366954        21196

While this is not guaranteed in all cases, this particular test showed a
large reduction in direct reclaim activity. It's also worth noting that
no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
those areas.

1. Reclaim/compaction is going to be affected because the amount of
   reclaim is no longer targeted at a specific zone. Compaction works on
   a per-zone basis so there is no guarantee that reclaiming a few THPs'
   worth of pages will have a positive impact on compaction success
   rates.

2. The slab/LRU reclaim ratio is affected because the frequency the
   shrinkers are called is now different. This may or may not be a
   problem but if it is, it'll be because shrinkers are not called enough
   and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now
   different, not necessarily in any way that matters, but it's still
   worth bearing in mind.

Mel Gorman (31):
  mm, vmstat: add infrastructure for per-node vmstats
  mm, vmscan: move lru_lock to the node
  mm, vmscan: move LRU lists to node
  mm, vmscan: begin reclaiming pages on a per-node basis
  mm, vmscan: have kswapd only scan based on the highest requested zone
  mm, vmscan: make kswapd reclaim in terms of nodes
  mm, vmscan: remove balance gap
  mm, vmscan: simplify the logic deciding whether kswapd sleeps
  mm, vmscan: by default have direct reclaim only shrink once per node
  mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  mm, vmscan: make shrink_node decisions more node-centric
  mm, memcg: move memcg limit enforcement from zones to nodes
  mm, workingset: make working set detection node-aware
  mm, page_alloc: consider dirtyable memory in terms of nodes
  mm: move page mapped accounting to the node
  mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  mm: move most file-based accounting to the node
  mm: move vmscan writes and file write accounting to the node
  mm, vmscan: only wakeup kswapd once per node for the requested classzone
  mm, page_alloc: Wake kswapd based on the highest eligible zone
  mm: convert zone_reclaim to node_reclaim
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to shrink_node
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to compaction_ready
  mm, vmscan: add classzone information to tracepoints
  mm, page_alloc: remove fair zone allocation policy
  mm: page_alloc: cache the last node whose dirty limit is reached
  mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
  mm: vmstat: account per-zone stalls and pages skipped during reclaim
  mm, vmstat: print node-based stats in zoneinfo file
  mm, vmstat: Remove zone and node double accounting by approximating retries

 Documentation/cgroup-v1/memcg_test.txt        |   4 +-
 Documentation/cgroup-v1/memory.txt            |   4 +-
 arch/s390/appldata/appldata_mem.c             |   2 +-
 arch/tile/mm/pgtable.c                        |  18 +-
 drivers/base/node.c                           |  77 ++-
 drivers/staging/android/lowmemorykiller.c     |  12 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |   6 +-
 fs/fs-writeback.c                             |   4 +-
 fs/fuse/file.c                                |   8 +-
 fs/nfs/internal.h                             |   2 +-
 fs/nfs/write.c                                |   2 +-
 fs/proc/meminfo.c                             |  20 +-
 include/linux/backing-dev.h                   |   2 +-
 include/linux/memcontrol.h                    |  61 +-
 include/linux/mm.h                            |   5 +
 include/linux/mm_inline.h                     |  35 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/mmzone.h                        | 155 +++--
 include/linux/swap.h                          |  24 +-
 include/linux/topology.h                      |   2 +-
 include/linux/vm_event_item.h                 |  14 +-
 include/linux/vmstat.h                        | 111 +++-
 include/linux/writeback.h                     |   2 +-
 include/trace/events/vmscan.h                 |  63 +-
 include/trace/events/writeback.h              |  10 +-
 kernel/power/snapshot.c                       |  10 +-
 kernel/sysctl.c                               |   4 +-
 mm/backing-dev.c                              |  15 +-
 mm/compaction.c                               |  50 +-
 mm/filemap.c                                  |  16 +-
 mm/huge_memory.c                              |  12 +-
 mm/internal.h                                 |  11 +-
 mm/khugepaged.c                               |  14 +-
 mm/memcontrol.c                               | 215 +++----
 mm/memory-failure.c                           |   4 +-
 mm/memory_hotplug.c                           |   7 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |  35 +-
 mm/mlock.c                                    |  12 +-
 mm/page-writeback.c                           | 123 ++--
 mm/page_alloc.c                               | 371 +++++------
 mm/page_idle.c                                |   4 +-
 mm/rmap.c                                     |  26 +-
 mm/shmem.c                                    |  14 +-
 mm/swap.c                                     |  64 +-
 mm/swap_state.c                               |   4 +-
 mm/util.c                                     |   4 +-
 mm/vmscan.c                                   | 879 +++++++++++++------------
 mm/vmstat.c                                   | 398 +++++++++---
 mm/workingset.c                               |  54 +-
 50 files changed, 1674 insertions(+), 1319 deletions(-)

--
2.6.4
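Two early items in the series ("move lru_lock to the node" and the
page_pgdat helper mentioned in the v6 changelog) reduce to small helpers.
Roughly the shape they take in the series (a sketch for orientation, not a
verbatim excerpt; exact placement in mm.h/mmzone.h may differ):

        /* Look up the node a page belongs to; cheaper than going via the zone. */
        static inline struct pglist_data *page_pgdat(struct page *page)
        {
                return NODE_DATA(page_to_nid(page));
        }

        /*
         * With LRU lists per node, every zone of a node shares its pgdat's
         * lru_lock; zone-based callers funnel through this helper.
         */
        static inline spinlock_t *zone_lru_lock(struct zone *zone)
        {
                return &zone->zone_pgdat->lru_lock;
        }

The second helper is what makes the conversion incremental: call sites
that still think in zones take the right (node-level) lock without being
rewritten all at once.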
* [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
  2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
@ 2016-07-01 20:01 ` Mel Gorman
  2016-07-06  0:02   ` Minchan Kim
  2016-07-06 18:12   ` Dave Hansen

From: Mel Gorman
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

The retry logic is only critical for allocations that can use any zones.
Hence this patch will not retry reclaim or compaction for such allocations.
This should not be a problem for reclaim as zone-constrained allocations
are immune from OOM kill. For retries, a very rough approximation is made
whether to retry or not. While it is possible this will make the wrong
decision on occasion, it will not infinite loop as the number of reclaim
attempts is capped by MAX_RECLAIM_RETRIES.

The highmem calculations only care about the global count of file pages
in highmem. Hence, a global counter is used instead of per-zone stats.
With this, the per-zone double accounting disappears.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h | 20 +++++++++++--
 include/linux/mmzone.h    |  4 ---
 include/linux/swap.h      |  1 -
 mm/compaction.c           | 22 ++++++++++++++-
 mm/migrate.c              |  2 --
 mm/page-writeback.c       | 13 ++++-----
 mm/page_alloc.c           | 71 ++++++++++++++++++++++++++++++++---------------
 mm/vmscan.c               | 16 ----------
 mm/vmstat.c               |  3 --
 9 files changed, 92 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..c68680aac044 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,22 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>

+#ifdef CONFIG_HIGHMEM
+extern unsigned long highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+        if (is_highmem_idx(zid) && is_file_lru(lru))
+                highmem_file_pages += nr_pages;
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
         struct pglist_data *pgdat = lruvec_pgdat(lruvec);

         __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-        __mod_zone_page_state(&pgdat->node_zones[zid],
-                                NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-                                nr_pages);
+        acct_highmem_file_pages(zid, lru, nr_pages);
 }

 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index facee6b83440..9268528c20c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
         /* First 128 byte cacheline (assuming 64 bit words) */
         NR_FREE_PAGES,
-        NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-        NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-        NR_ZONE_LRU_FILE,
-        NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages */
         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
         NR_SLAB_RECLAIMABLE,
         NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
                                                 struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                         gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..dfe7dafe8e8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *last_pgdat = NULL;
+
+#ifdef CONFIG_HIGHMEM
+        /* Do not retry compaction for zone-constrained allocations */
+        if (!is_highmem_idx(ac->high_zoneidx))
+                return false;
+#endif

         /*
          * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                 unsigned long available;
                 enum compact_result compact_result;

+                if (last_pgdat == zone->zone_pgdat)
+                        continue;
+
+                /*
+                 * This over-estimates the number of pages available for
+                 * reclaim/compaction but walking the LRU would take too
+                 * long. The consequences are that compaction may retry
+                 * longer than it should for a zone-constrained allocation
+                 * request.
+                 */
+                last_pgdat = zone->zone_pgdat;
+                available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
                 /*
                  * Do not consider all the reclaimable memory because we do not
                  * want to trash just for a single high order allocation which
                  * is even not guaranteed to appear even if __compaction_suitable
                  * is happy about the watermark check.
                  */
-                available = zone_reclaimable_pages(zone) / order;
                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                available = min(zone->managed_pages, available);
                 compact_result = __compaction_suitable(zone, order, alloc_flags,
                                 ac_classzone_idx(ac), available);
                 if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
                 }
                 if (dirty && mapping_cap_account_dirty(mapping)) {
                         __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
                         __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
                 }
         }
         local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..8db1db234915 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)

         return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+unsigned long highmem_file_pages;
+#endif

 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
         int node;
         unsigned long x = 0;
         int i;
+        unsigned long dirtyable = highmem_file_pages;

         for_each_node_state(node, N_HIGH_MEMORY) {
                 for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                         struct zone *z;
-                        unsigned long dirtyable;

                         if (!is_highmem_idx(i))
                                 continue;

                         z = &NODE_DATA(node)->node_zones[i];
-                        dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-                                zone_page_state(z, NR_ZONE_LRU_FILE);
+                        dirtyable += zone_page_state(z, NR_FREE_PAGES);

                         /* watch for underflows */
                         dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 __inc_node_page_state(page, NR_FILE_DIRTY);
-                __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 __inc_node_page_state(page, NR_DIRTIED);
                 __inc_wb_stat(wb, WB_RECLAIMABLE);
                 __inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
         if (mapping_cap_account_dirty(mapping)) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 dec_node_page_state(page, NR_FILE_DIRTY);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 dec_wb_stat(wb, WB_RECLAIMABLE);
                 task_io_account_cancelled_write(PAGE_SIZE);
         }
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
                 if (TestClearPageDirty(page)) {
                         mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                         dec_node_page_state(page, NR_FILE_DIRTY);
-                        dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                         dec_wb_stat(wb, WB_RECLAIMABLE);
                         ret = 1;
                 }
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
         if (ret) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 dec_node_page_state(page, NR_WRITEBACK);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 inc_node_page_state(page, NR_WRITTEN);
         }
         unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
         if (!ret) {
                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 inc_node_page_state(page, NR_WRITEBACK);
-                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
         }
         unlock_page_memcg(page);
         return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d3eb15c35bb1..9581185cb31a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *current_pgdat = NULL;

         /*
          * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                 return false;

         /*
+         * Blindly retry allocation requests that cannot use all zones. We do
+         * not have a reliable and fast means of calculating reclaimable, dirty
+         * and writeback pages in eligible zones.
+         */
+        if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
+                goto out;
+
+        /*
          * Keep reclaiming pages while there is a chance this will lead somewhere.
          * If none of the target zones can satisfy our allocation request even
          * if all reclaimable pages are considered then we are screwed and have
@@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                                         ac->nodemask) {
                 unsigned long available;
                 unsigned long reclaimable;
+                unsigned long write_pending = 0;
+                int zid;
+
+                if (current_pgdat == zone->zone_pgdat)
+                        continue;

-                available = reclaimable = zone_reclaimable_pages(zone);
+                current_pgdat = zone->zone_pgdat;
+                available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
                 available -= DIV_ROUND_UP(no_progress_loops * available,
                                           MAX_RECLAIM_RETRIES);
-                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
+                        node_page_state(current_pgdat, NR_FILE_DIRTY);

-                /*
-                 * Would the allocation succeed if we reclaimed the whole
-                 * available?
-                 */
-                if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-                                ac_classzone_idx(ac), alloc_flags, available)) {
-                        /*
-                         * If we didn't make any progress and have a lot of
-                         * dirty + writeback pages then we should wait for
-                         * an IO to complete to slow down the reclaim and
-                         * prevent from pre mature OOM
-                         */
-                        if (!did_some_progress) {
-                                unsigned long write_pending;
+                /* Account for all free pages on eligible zones */
+                for (zid = 0; zid <= zone_idx(zone); zid++) {
+                        struct zone *acct_zone = &current_pgdat->node_zones[zid];

-                                write_pending = zone_page_state_snapshot(zone,
-                                                        NR_ZONE_WRITE_PENDING);
+                        available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+                }

-                                if (2 * write_pending > reclaimable) {
-                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
-                                        return true;
-                                }
+                /*
+                 * If we didn't make any progress and have a lot of
+                 * dirty + writeback pages then we should wait for an IO to
+                 * complete to slow down the reclaim and prevent from premature
+                 * OOM.
+                 */
+                if (!did_some_progress) {
+                        if (2 * write_pending > reclaimable) {
+                                congestion_wait(BLK_RW_ASYNC, HZ/10);
+                                return true;
                         }
+                }

+                /*
+                 * Would the allocation succeed if we reclaimed the whole
+                 * available? This is approximate because there is no
+                 * accurate count of reclaimable pages per zone.
+                 */
+                for (zid = 0; zid <= zone_idx(zone); zid++) {
+                        struct zone *check_zone = &current_pgdat->node_zones[zid];
+                        unsigned long estimate;
+
+                        estimate = min(check_zone->managed_pages, available);
+                        if (__zone_watermark_ok(check_zone, order,
+                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
+                                alloc_flags, available)) {
+                        }
+out:
         /*
          * Memory allocation/reclaim might be called from a WQ
          * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151c30dd27e2..c538a8cab43b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif

-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-        unsigned long nr;
-
-        nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-        if (get_nr_swap_pages() > 0)
-                nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-        return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
         unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce09be63e8c7..524c082072be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -908,9 +908,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
         /* enum zone_stat_item countes */
         "nr_free_pages",
-        "nr_zone_anon_lru",
-        "nr_zone_file_lru",
-        "nr_zone_write_pending",
         "nr_mlock",
         "nr_slab_reclaimable",
         "nr_slab_unreclaimable",
--
2.6.4
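The changelog's claim that the approximation "will not infinite loop"
follows directly from how available is scaled down by no_progress_loops
before the watermark test. A small standalone sketch of just that
arithmetic (the starting figure of one million pages is arbitrary;
MAX_RECLAIM_RETRIES is 16 in mm/internal.h):

        #include <stdio.h>

        #define MAX_RECLAIM_RETRIES 16
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        int main(void)
        {
                unsigned long reclaimable = 1000000;    /* arbitrary example */
                int no_progress_loops;

                for (no_progress_loops = 1;
                     no_progress_loops <= MAX_RECLAIM_RETRIES;
                     no_progress_loops++) {
                        unsigned long available = reclaimable;

                        /* same scaling as should_reclaim_retry() */
                        available -= DIV_ROUND_UP(no_progress_loops * available,
                                                  MAX_RECLAIM_RETRIES);
                        printf("loop %2d: assume %lu pages available\n",
                               no_progress_loops, available);
                }
                return 0;
        }

By the sixteenth fruitless loop the assumed reclaimable memory reaches
zero, so the watermark check must fail and should_reclaim_retry() gives up
no matter how optimistic the node-wide estimate was.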
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-06  0:02   ` Minchan Kim
  2016-07-06  8:58     ` Mel Gorman

From: Minchan Kim
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote:
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> The retry logic is only critical for allocations that can use any zones.

Sorry, I cannot follow this assertion.
Could you explain?

> Hence this patch will not retry reclaim or compaction for such allocations.

What is such allocations?

> This should not be a problem for reclaim as zone-constrained allocations
> are immune from OOM kill. For retries, a very rough approximation is made

zone-constrained allocations are immune from OOM kill?
Please explain it, too.

Sorry for the many questions but I cannot review code without clear
understanding of assumption/background which I couldn't notice.

[...]
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-06  8:58     ` Mel Gorman
  2016-07-06  9:33       ` Mel Gorman
  2016-07-07  6:47       ` Minchan Kim

From: Mel Gorman
To: Minchan Kim
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote:
> On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote:
> > The number of LRU pages, dirty pages and writeback pages must be accounted
> > for on both zones and nodes because of the reclaim retry logic, compaction
> > retry logic and highmem calculations all depending on per-zone stats.
> >
> > The retry logic is only critical for allocations that can use any zones.
>
> Sorry, I cannot follow this assertion.
> Could you explain?
>

The patch has been reworked since and I tried clarifying the changelog.
Does this help?

--- 8<----
mm, vmstat: remove zone and node double accounting by approximating retries

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
exception is costly high-order allocations or allocations that cannot
fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then a check in __alloc_pages_slowpath will always retry.

Hence this patch will always retry reclaim for zone-constrained
allocations in should_reclaim_retry. As there is no guarantee enough
memory can ever be freed to satisfy compaction, this patch avoids retrying
compaction for zone-constrained allocations.

In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation. While it
is possible this will make the wrong decision on occasion, it will not
infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem, a global counter is used instead of per-zone stats as it is
sufficient. In combination, this allows the per-zone LRU and dirty state
counters to be removed.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..c68680aac044 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,22 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>

+#ifdef CONFIG_HIGHMEM
+extern unsigned long highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+        if (is_highmem_idx(zid) && is_file_lru(lru))
+                highmem_file_pages += nr_pages;
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
         struct pglist_data *pgdat = lruvec_pgdat(lruvec);

         __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-        __mod_zone_page_state(&pgdat->node_zones[zid],
-                                NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-                                nr_pages);
+        acct_highmem_file_pages(zid, lru, nr_pages);
 }

 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd33e6f1bed0..a3b7f45aac56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
         /* First 128 byte cacheline (assuming 64 bit words) */
         NR_FREE_PAGES,
-        NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-        NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-        NR_ZONE_LRU_FILE,
-        NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages */
         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
         NR_SLAB_RECLAIMABLE,
         NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
                                                 struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                         gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..dfe7dafe8e8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *last_pgdat = NULL;
+
+#ifdef CONFIG_HIGHMEM
+        /* Do not retry compaction for zone-constrained allocations */
+        if (!is_highmem_idx(ac->high_zoneidx))
+                return false;
+#endif

         /*
          * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                 unsigned long available;
                 enum compact_result compact_result;

+                if (last_pgdat == zone->zone_pgdat)
+                        continue;
+
+                /*
+                 * This over-estimates the number of pages available for
+                 * reclaim/compaction but walking the LRU would take too
+                 * long. The consequences are that compaction may retry
+                 * longer than it should for a zone-constrained allocation
+                 * request.
+                 */
+                last_pgdat = zone->zone_pgdat;
+                available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
                 /*
                  * Do not consider all the reclaimable memory because we do not
                  * want to trash just for a single high order allocation which
                  * is even not guaranteed to appear even if __compaction_suitable
                  * is happy about the watermark check.
                  */
-                available = zone_reclaimable_pages(zone) / order;
                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                available = min(zone->managed_pages, available);
                 compact_result = __compaction_suitable(zone, order, alloc_flags,
                                 ac_classzone_idx(ac), available);
                 if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
                 }
                 if (dirty && mapping_cap_account_dirty(mapping)) {
                         __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
                         __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
                 }
         }
         local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..8db1db234915 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)

         return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+unsigned long highmem_file_pages;
+#endif

 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
         int node;
         unsigned long x = 0;
         int i;
+        unsigned long dirtyable = highmem_file_pages;

         for_each_node_state(node, N_HIGH_MEMORY) {
                 for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                         struct zone *z;
-                        unsigned long dirtyable;

                         if (!is_highmem_idx(i))
                                 continue;

                         z = &NODE_DATA(node)->node_zones[i];
-                        dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-                                zone_page_state(z, NR_ZONE_LRU_FILE);
+                        dirtyable += zone_page_state(z, NR_FREE_PAGES);

                         /* watch for underflows */
                         dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 __inc_node_page_state(page, NR_FILE_DIRTY);
-                __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 __inc_node_page_state(page, NR_DIRTIED);
                 __inc_wb_stat(wb, WB_RECLAIMABLE);
                 __inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
         if (mapping_cap_account_dirty(mapping)) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 dec_node_page_state(page, NR_FILE_DIRTY);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 dec_wb_stat(wb, WB_RECLAIMABLE);
                 task_io_account_cancelled_write(PAGE_SIZE);
         }
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
                 if (TestClearPageDirty(page)) {
                         mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                         dec_node_page_state(page, NR_FILE_DIRTY);
-                        dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                         dec_wb_stat(wb, WB_RECLAIMABLE);
                         ret = 1;
                 }
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
         if (ret) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 dec_node_page_state(page, NR_WRITEBACK);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 inc_node_page_state(page, NR_WRITTEN);
         }
         unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
         if (!ret) {
                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 inc_node_page_state(page, NR_WRITEBACK);
-                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
         }
         unlock_page_memcg(page);
         return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 030114f55b0e..ded48e580abc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@
should_reclaim_retry(gfp_t gfp_mask, unsigned order, { struct zone *zone; struct zoneref *z; + pg_data_t *current_pgdat = NULL; /* * Make sure we converge to OOM if we cannot make any progress @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, return false; /* + * Blindly retry allocation requests that cannot use all zones. We do + * not have a reliable and fast means of calculating reclaimable, dirty + * and writeback pages in eligible zones. + */ + if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) + goto out; + + /* * Keep reclaiming pages while there is a chance this will lead somewhere. * If none of the target zones can satisfy our allocation request even * if all reclaimable pages are considered then we are screwed and have @@ -3463,18 +3472,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, ac->nodemask) { unsigned long available; unsigned long reclaimable; + int zid; - available = reclaimable = zone_reclaimable_pages(zone); + if (current_pgdat == zone->zone_pgdat) + continue; + + current_pgdat = zone->zone_pgdat; + available = reclaimable = pgdat_reclaimable_pages(current_pgdat); available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES); - available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + + /* Account for all free pages on eligible zones */ + for (zid = 0; zid <= zone_idx(zone); zid++) { + struct zone *acct_zone = &current_pgdat->node_zones[zid]; + + available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES); + } /* * Would the allocation succeed if we reclaimed the whole - * available? + * available? This is approximate because there is no + * accurate count of reclaimable pages per zone. */ - if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), - ac_classzone_idx(ac), alloc_flags, available)) { + for (zid = 0; zid <= zone_idx(zone); zid++) { + struct zone *check_zone = &current_pgdat->node_zones[zid]; + unsigned long estimate; + + estimate = min(check_zone->managed_pages, available); + if (!__zone_watermark_ok(check_zone, order, + min_wmark_pages(check_zone), ac_classzone_idx(ac), + alloc_flags, estimate)) + continue; + /* * If we didn't make any progress and have a lot of * dirty + writeback pages then we should wait for @@ -3484,15 +3513,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, if (!did_some_progress) { unsigned long write_pending; - write_pending = zone_page_state_snapshot(zone, - NR_ZONE_WRITE_PENDING); + write_pending = + node_page_state(current_pgdat, NR_WRITEBACK) + + node_page_state(current_pgdat, NR_FILE_DIRTY); if (2 * write_pending > reclaimable) { congestion_wait(BLK_RW_ASYNC, HZ/10); return true; } } - +out: /* * Memory allocation/reclaim might be called from a WQ * context and the current implementation of the WQ diff --git a/mm/vmscan.c b/mm/vmscan.c index 9eed2d3e05f3..a8ebd1871f16 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc) } #endif -/* - * This misses isolated pages which are not accounted for to save counters. - * As the data only determines if reclaim or compaction continues, it is - * not expected that isolated pages will be a dominating factor.
- */ -unsigned long zone_reclaimable_pages(struct zone *zone) -{ - unsigned long nr; - - nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE); - if (get_nr_swap_pages() > 0) - nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON); - - return nr; -} - unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat) { unsigned long nr; @@ -3167,7 +3151,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * zone was balanced even under extreme pressure when the * overall node may be congested. */ - for (i = sc.reclaim_idx; i >= 0; i--) { + for (i = sc.reclaim_idx; i >= 0 && !buffer_heads_over_limit; i--) { zone = pgdat->node_zones + i; if (!populated_zone(zone)) continue; diff --git a/mm/vmstat.c b/mm/vmstat.c index 60372f31fee3..7415775faf08 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order) const char * const vmstat_text[] = { /* enum zone_stat_item countes */ "nr_free_pages", - "nr_zone_anon_lru", - "nr_zone_file_lru", - "nr_zone_write_pending", "nr_mlock", "nr_slab_reclaimable", "nr_slab_unreclaimable", -- Mel Gorman SUSE Labs
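An aside on the MAX_RECLAIM_RETRIES cap the changelog relies on: the DIV_ROUND_UP() scaling is what guarantees convergence. The following standalone userspace sketch (illustrative only, not kernel code; the fixed page count is a made-up example) shows the available estimate decaying to zero by the time no_progress_loops reaches the cap, so even an over-generous per-node approximation cannot loop forever:

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES 16	/* same value as the cap in mm/page_alloc.c */
	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	int main(void)
	{
		int loops;

		for (loops = 0; loops <= MAX_RECLAIM_RETRIES; loops++) {
			unsigned long available = 1000000; /* hypothetical reclaimable pages */

			/* same backoff arithmetic as should_reclaim_retry() */
			available -= DIV_ROUND_UP(loops * available,
						  MAX_RECLAIM_RETRIES);
			printf("no_progress_loops=%2d available=%lu\n",
			       loops, available);
		}
		return 0;
	}

Once loops equals MAX_RECLAIM_RETRIES the whole reclaimable estimate has been subtracted; in the kernel only the free-page snapshot then remains, so the watermark checks converge on failure and the OOM path is reached.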
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 8:58 ` Mel Gorman @ 2016-07-06 9:33 ` Mel Gorman 2016-07-07 6:47 ` Minchan Kim 1 sibling, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-06 9:33 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 09:58:50AM +0100, Mel Gorman wrote: > On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote: > > On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote: > > > The number of LRU pages, dirty pages and writeback pages must be accounted > > > for on both zones and nodes because of the reclaim retry logic, compaction > > > retry logic and highmem calculations all depending on per-zone stats. > > > > > > The retry logic is only critical for allocations that can use any zones. > > > > Sorry, I cannot follow this assertion. > > Could you explain? > > > > The patch has been reworked since and I tried clarifying the changelog. > Does this help? > It occurred to me at breakfast that this should be more consistent with the OOM killer on both 32-bit and 64-bit so: diff --git a/mm/compaction.c b/mm/compaction.c index dfe7dafe8e8b..640532831b94 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1448,11 +1448,9 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, struct zoneref *z; pg_data_t *last_pgdat = NULL; -#ifdef CONFIG_HIGHMEM /* Do not retry compaction for zone-constrained allocations */ - if (!is_highmem_idx(ac->high_zoneidx)) + if (ac->high_zoneidx < ZONE_NORMAL) return false; -#endif /* * Make sure at least one zone would pass __compaction_suitable if we continue diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ded48e580abc..194a8162528b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3455,11 +3455,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, return false; /* - * Blindly retry allocation requests that cannot use all zones. We do - * not have a reliable and fast means of calculating reclaimable, dirty - * and writeback pages in eligible zones. + * Blindly retry lowmem allocation requests that are often ignored by + * the OOM killer as we do not have a reliable and fast means of + * calculating reclaimable, dirty and writeback pages in eligible zones. */ - if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) + if (ac->high_zoneidx < ZONE_NORMAL) goto out; /*
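For cross-reference, the lowmem checks in __alloc_pages_may_oom that this aligns with read roughly as follows in a 4.7-era tree (quoted approximately, from the commit 03668b3ceb0c lineage cited in the changelog; the surrounding context may differ slightly):

	/* The OOM killer will not help higher order allocs */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		goto out;
	/* The OOM killer does not needlessly kill tasks for lowmem */
	if (ac->high_zoneidx < ZONE_NORMAL)
		goto out;

Using the same ac->high_zoneidx < ZONE_NORMAL predicate in compaction_zonelist_suitable and should_reclaim_retry keeps all three decision points consistent on both 32-bit and 64-bit, whereas the earlier IS_ENABLED(CONFIG_HIGHMEM) form only distinguished anything on highmem kernels.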
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 8:58 ` Mel Gorman 2016-07-06 9:33 ` Mel Gorman @ 2016-07-07 6:47 ` Minchan Kim 1 sibling, 0 replies; 10+ messages in thread From: Minchan Kim @ 2016-07-07 6:47 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 09:58:50AM +0100, Mel Gorman wrote: > On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote: > > On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote: > > > The number of LRU pages, dirty pages and writeback pages must be accounted > > > for on both zones and nodes because of the reclaim retry logic, compaction > > > retry logic and highmem calculations all depending on per-zone stats. > > > > > > The retry logic is only critical for allocations that can use any zones. > > > > Sorry, I cannot follow this assertion. > > Could you explain? > > > > The patch has been reworked since and I tried clarifying the changelog. > Does this help? Thanks. It is surely better than the old one but it is still not clear to me. > > --- 8<---- > mm, vmstat: remove zone and node double accounting by approximating retries > > The number of LRU pages, dirty pages and writeback pages must be accounted > for on both zones and nodes because of the reclaim retry logic, compaction > retry logic and highmem calculations all depending on per-zone stats. > > Many lowmem allocations are immune from OOM kill due to a check in > __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit > 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception > is costly high-order allocations or allocations that cannot fail. If > __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations > then a check in __alloc_pages_slowpath will always retry. If I read the code rightly, __alloc_pages_slowpath will never retry in that case because __alloc_pages_may_oom will return a did_some_progress value of 0, so it would go to warn_alloc_failed unless direct compaction is successful. > > Hence this patch will always retry reclaim for zone-constrained allocations > in should_reclaim_retry. > > As there is no guarantee enough memory can ever be freed to satisfy > compaction, this patch avoids retrying compaction for zone-constrained > allocations. > > In combination, that means that the per-node stats can be used when deciding > whether to continue reclaim using a rough approximation. While it is > possible this will make the wrong decision on occasion, it will not infinite > loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. > > The final step is calculating the number of dirtyable highmem pages. As > that calculation only cares about the global count of file pages in > highmem, this patch uses a global counter instead of per-zone stats, > which is sufficient. > > In combination, this allows the per-zone LRU and dirty state counters to > be removed.
> > Suggested by: Michal Hocko <mhocko@kernel.org> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h > index 9aadcc781857..c68680aac044 100644 > --- a/include/linux/mm_inline.h > +++ b/include/linux/mm_inline.h > @@ -4,6 +4,22 @@ > #include <linux/huge_mm.h> > #include <linux/swap.h> > > +#ifdef CONFIG_HIGHMEM > +extern unsigned long highmem_file_pages; > + > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > + if (is_highmem_idx(zid) && is_file_lru(lru)) > + highmem_file_pages += nr_pages; > +} > +#else > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > +} > +#endif > + > /** > * page_is_file_cache - should the page be on a file LRU or anon LRU? > * @page: the page to test > @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); > - __mod_zone_page_state(&pgdat->node_zones[zid], > - NR_ZONE_LRU_BASE + !!is_file_lru(lru), > - nr_pages); > + acct_highmem_file_pages(zid, lru, nr_pages); > } > > static __always_inline void update_lru_size(struct lruvec *lruvec, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index bd33e6f1bed0..a3b7f45aac56 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -110,10 +110,6 @@ struct zone_padding { > enum zone_stat_item { > /* First 128 byte cacheline (assuming 64 bit words) */ > NR_FREE_PAGES, > - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ > - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, > - NR_ZONE_LRU_FILE, > - NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ > NR_MLOCK, /* mlock()ed pages found and moved off LRU */ > NR_SLAB_RECLAIMABLE, > NR_SLAB_UNRECLAIMABLE, > diff --git a/include/linux/swap.h b/include/linux/swap.h > index b17cc4830fa6..cc753c639e3d 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, > struct vm_area_struct *vma); > > /* linux/mm/vmscan.c */ > -extern unsigned long zone_reclaimable_pages(struct zone *zone); > extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, nodemask_t *mask); > diff --git a/mm/compaction.c b/mm/compaction.c > index a0bd85712516..dfe7dafe8e8b 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, > { > struct zone *zone; > struct zoneref *z; > + pg_data_t *last_pgdat = NULL; > + > +#ifdef CONFIG_HIGHMEM > + /* Do not retry compaction for zone-constrained allocations */ > + if (!is_highmem_idx(ac->high_zoneidx)) > + return false; > +#endif > > /* > * Make sure at least one zone would pass __compaction_suitable if we continue > @@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, > unsigned long available; > enum compact_result compact_result; > > + if (last_pgdat == zone->zone_pgdat) > + continue; > + > + /* > + * This over-estimates the number of pages available for > + * reclaim/compaction but walking the LRU would take too > + * long. 
The consequences are that compaction may retry > + * longer than it should for a zone-constrained allocation > + * request. > + */ > + last_pgdat = zone->zone_pgdat; > + available = pgdat_reclaimable_pages(zone->zone_pgdat) / order; > + > /* > * Do not consider all the reclaimable memory because we do not > * want to trash just for a single high order allocation which > * is even not guaranteed to appear even if __compaction_suitable > * is happy about the watermark check. > */ > - available = zone_reclaimable_pages(zone) / order; > available += zone_page_state_snapshot(zone, NR_FREE_PAGES); > + available = min(zone->managed_pages, available); > compact_result = __compaction_suitable(zone, order, alloc_flags, > ac_classzone_idx(ac), available); > if (compact_result != COMPACT_SKIPPED && > diff --git a/mm/migrate.c b/mm/migrate.c > index c77997dc6ed7..ed2f85e61de1 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping, > } > if (dirty && mapping_cap_account_dirty(mapping)) { > __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY); > - __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING); > __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY); > - __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING); > } > } > local_irq_enable(); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 3c02aa603f5a..8db1db234915 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat) > > return nr_pages; > } > +#ifdef CONFIG_HIGHMEM > +unsigned long highmem_file_pages; > +#endif > > static unsigned long highmem_dirtyable_memory(unsigned long total) > { > @@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) > int node; > unsigned long x = 0; > int i; > + unsigned long dirtyable = highmem_file_pages; > > for_each_node_state(node, N_HIGH_MEMORY) { > for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) { > struct zone *z; > - unsigned long dirtyable; > > if (!is_highmem_idx(i)) > continue; > > z = &NODE_DATA(node)->node_zones[i]; > - dirtyable = zone_page_state(z, NR_FREE_PAGES) + > - zone_page_state(z, NR_ZONE_LRU_FILE); > + dirtyable += zone_page_state(z, NR_FREE_PAGES); > > /* watch for underflows */ > dirtyable -= min(dirtyable, high_wmark_pages(z)); > @@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) > > mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY); > __inc_node_page_state(page, NR_FILE_DIRTY); > - __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > __inc_node_page_state(page, NR_DIRTIED); > __inc_wb_stat(wb, WB_RECLAIMABLE); > __inc_wb_stat(wb, WB_DIRTIED); > @@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, > if (mapping_cap_account_dirty(mapping)) { > mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); > dec_node_page_state(page, NR_FILE_DIRTY); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > dec_wb_stat(wb, WB_RECLAIMABLE); > task_io_account_cancelled_write(PAGE_SIZE); > } > @@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page) > if (TestClearPageDirty(page)) { > mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); > dec_node_page_state(page, NR_FILE_DIRTY); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > dec_wb_stat(wb, WB_RECLAIMABLE); > ret = 1; > } > @@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page) > if (ret) { > 
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); > dec_node_page_state(page, NR_WRITEBACK); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > inc_node_page_state(page, NR_WRITTEN); > } > unlock_page_memcg(page); > @@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write) > if (!ret) { > mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); > inc_node_page_state(page, NR_WRITEBACK); > - inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > } > unlock_page_memcg(page); > return ret; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 030114f55b0e..ded48e580abc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > { > struct zone *zone; > struct zoneref *z; > + pg_data_t *current_pgdat = NULL; > > /* > * Make sure we converge to OOM if we cannot make any progress > @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > return false; > > /* > + * Blindly retry allocation requests that cannot use all zones. We do > + * not have a reliable and fast means of calculating reclaimable, dirty > + * and writeback pages in eligible zones. > + */ > + if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) > + goto out; > + > + /* > * Keep reclaiming pages while there is a chance this will lead somewhere. > * If none of the target zones can satisfy our allocation request even > * if all reclaimable pages are considered then we are screwed and have > @@ -3463,18 +3472,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > ac->nodemask) { > unsigned long available; > unsigned long reclaimable; > + int zid; > > - available = reclaimable = zone_reclaimable_pages(zone); > + if (current_pgdat == zone->zone_pgdat) > + continue; > + > + current_pgdat = zone->zone_pgdat; > + available = reclaimable = pgdat_reclaimable_pages(current_pgdat); > available -= DIV_ROUND_UP(no_progress_loops * available, > MAX_RECLAIM_RETRIES); > - available += zone_page_state_snapshot(zone, NR_FREE_PAGES); > + > + /* Account for all free pages on eligible zones */ > + for (zid = 0; zid <= zone_idx(zone); zid++) { > + struct zone *acct_zone = &current_pgdat->node_zones[zid]; > + > + available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES); > + } > > /* > * Would the allocation succeed if we reclaimed the whole > - * available? > + * available? This is approximate because there is no > + * accurate count of reclaimable pages per zone.
> */ > - if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), > - ac_classzone_idx(ac), alloc_flags, available)) { > + for (zid = 0; zid <= zone_idx(zone); zid++) { > + struct zone *check_zone = &current_pgdat->node_zones[zid]; > + unsigned long estimate; > + > + estimate = min(check_zone->managed_pages, available); > + if (!__zone_watermark_ok(check_zone, order, > + min_wmark_pages(check_zone), ac_classzone_idx(ac), > + alloc_flags, estimate)) > + continue; > + > /* > * If we didn't make any progress and have a lot of > * dirty + writeback pages then we should wait for > @@ -3484,15 +3513,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > if (!did_some_progress) { > unsigned long write_pending; > > - write_pending = zone_page_state_snapshot(zone, > - NR_ZONE_WRITE_PENDING); > + write_pending = > + node_page_state(current_pgdat, NR_WRITEBACK) + > + node_page_state(current_pgdat, NR_FILE_DIRTY); > > if (2 * write_pending > reclaimable) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > return true; > } > } > - > +out: > /* > * Memory allocation/reclaim might be called from a WQ > * context and the current implementation of the WQ > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9eed2d3e05f3..a8ebd1871f16 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc) > } > #endif > > -/* > - * This misses isolated pages which are not accounted for to save counters. > - * As the data only determines if reclaim or compaction continues, it is > - * not expected that isolated pages will be a dominating factor. > - */ > -unsigned long zone_reclaimable_pages(struct zone *zone) > -{ > - unsigned long nr; > - > - nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE); > - if (get_nr_swap_pages() > 0) > - nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON); > - > - return nr; > -} > - > unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat) > { > unsigned long nr; > @@ -3167,7 +3151,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > * zone was balanced even under extreme pressure when the > * overall node may be congested. > */ > - for (i = sc.reclaim_idx; i >= 0; i--) { > + for (i = sc.reclaim_idx; i >= 0 && !buffer_heads_over_limit; i--) { > zone = pgdat->node_zones + i; > if (!populated_zone(zone)) > continue; > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 60372f31fee3..7415775faf08 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order) > const char * const vmstat_text[] = { > /* enum zone_stat_item countes */ > "nr_free_pages", > - "nr_zone_anon_lru", > - "nr_zone_file_lru", > - "nr_zone_write_pending", > "nr_mlock", > "nr_slab_reclaimable", > "nr_slab_unreclaimable", > > -- > Mel Gorman > SUSE Labs
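To make the disputed control flow concrete, here is a toy model in plain C of the behaviour Minchan describes (a simplification with illustrative names, not the actual 4.7 code): when the OOM path is skipped for a lowmem request, did_some_progress stays 0 and the slowpath falls through to the failure path instead of retrying.

	#include <stdbool.h>
	#include <stdio.h>

	enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM };

	/* Stand-in for __alloc_pages_may_oom: no page is produced here and
	 * progress is only reported when the OOM killer actually runs. */
	static bool may_oom(enum zone_type high_zoneidx,
			    unsigned long *did_some_progress)
	{
		*did_some_progress = 0;
		if (high_zoneidx < ZONE_NORMAL)	/* lowmem: OOM killer skipped */
			return false;
		*did_some_progress = 1;		/* pretend out_of_memory() ran */
		return false;
	}

	int main(void)
	{
		enum zone_type reqs[] = { ZONE_DMA, ZONE_HIGHMEM };
		unsigned long progress;
		int i;

		for (i = 0; i < 2; i++) {
			may_oom(reqs[i], &progress);
			printf("high_zoneidx=%d -> %s\n", reqs[i],
			       progress ? "retry slowpath" : "warn_alloc_failed");
		}
		return 0;
	}

Whether this or the changelog's reading matches the real tree is the open question in the thread; the model only shows why a did_some_progress of 0 forecloses the retry.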
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman 2016-07-06 0:02 ` Minchan Kim @ 2016-07-06 18:12 ` Dave Hansen 2016-07-07 11:26 ` Mel Gorman 1 sibling, 1 reply; 10+ messages in thread From: Dave Hansen @ 2016-07-06 18:12 UTC (permalink / raw) To: Mel Gorman, Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On 07/01/2016 01:01 PM, Mel Gorman wrote: > +#ifdef CONFIG_HIGHMEM > +extern unsigned long highmem_file_pages; > + > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > + if (is_highmem_idx(zid) && is_file_lru(lru)) > + highmem_file_pages += nr_pages; > +} > +#else Shouldn't highmem_file_pages technically be an atomic_t (or atomic64_t)? We could have highmem on two nodes which take two different LRU locks.
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 18:12 ` Dave Hansen @ 2016-07-07 11:26 ` Mel Gorman 0 siblings, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-07 11:26 UTC (permalink / raw) To: Dave Hansen Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 11:12:52AM -0700, Dave Hansen wrote: > On 07/01/2016 01:01 PM, Mel Gorman wrote: > > +#ifdef CONFIG_HIGHMEM > > +extern unsigned long highmem_file_pages; > > + > > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > > + int nr_pages) > > +{ > > + if (is_highmem_idx(zid) && is_file_lru(lru)) > > + highmem_file_pages += nr_pages; > > +} > > +#else > > Shouldn't highmem_file_pages technically be an atomic_t (or atomic64_t)? > We could have highmem on two nodes which take two different LRU locks. It would require a NUMA machine with highmem or very weird configurations but sure, atomic is safer. -- Mel Gorman SUSE Labs
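A minimal sketch of the atomic variant being agreed on here (illustrative only; whether atomic_t, atomic_long_t or a percpu counter is the right type, and the matching read side in highmem_dirtyable_memory, are left open at this point in the thread):

	#ifdef CONFIG_HIGHMEM
	extern atomic_t highmem_file_pages;

	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
						   int nr_pages)
	{
		/* atomic update: safe under different per-node LRU locks */
		if (is_highmem_idx(zid) && is_file_lru(lru))
			atomic_add(nr_pages, &highmem_file_pages);
	}
	#else
	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
						   int nr_pages)
	{
	}
	#endif

with readers using atomic_read(&highmem_file_pages) rather than touching the counter directly, so concurrent updates from two nodes cannot be lost.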
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 @ 2016-07-01 15:37 Mel Gorman 2016-07-01 15:37 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman 0 siblings, 1 reply; 10+ messages in thread From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman Previous releases double accounted LRU stats on the zone and the node because it was required by should_reclaim_retry. The last patch in the series removes the double accounting. It's not integrated with the series as reviewers may not like the solution. If not, it can be safely dropped without a major impact to the results. Changelog since v7 o Rebase onto current mmots o Avoid double accounting of stats in node and zone o Kswapd will avoid more reclaim if an eligible zone is available o Remove some duplications of sc->reclaim_idx and classzone_idx o Print per-node stats in zoneinfo Changelog since v6 o Correct reclaim_idx when direct reclaiming for memcg o Also account LRU pages per zone for compaction/reclaim o Add page_pgdat helper with more efficient lookup o Init pgdat LRU lock only once o Slight optimisation to wake_all_kswapds o Always wake kcompactd when kswapd is going to sleep o Rebase to mmotm as of June 15th, 2016 Changelog since v5 o Rebase and adjust to changes Changelog since v4 o Rebase on top of v3 of page allocator optimisation series Changelog since v3 o Rebase on top of the page allocator optimisation series o Remove RFC tag This is the latest version of a series that moves LRUs from the zones to the node that is based upon 4.7-rc4 with Andrew's tree applied. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are: 1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths. 2. Currently, reclaim on node 0 behaves slightly differently to node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in the distribution of file/anon pages due to when they were allocated can result in a difference in aging. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to those on a single-zone node. 3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other. This mitigates the page allocator using pages that were allocated very recently in the ideal case but it's sensitive to timing. When kswapd is allocating from lower zones then it's great but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance. 4. slab shrinkers are node-based which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim.
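The common thread in these four points is structural: the series moves the LRU lists, and the lru_lock that protects them, from struct zone to the node's struct pglist_data. In outline (a sketch of the end state, not quoted from the patches):

	/* Before: page aging state is duplicated per zone */
	struct zone {
		/* ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;
		/* ... */
	};

	/* After: one set of LRUs, and one lock, per NUMA node */
	typedef struct pglist_data {
		/* ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;
		/* ... */
	} pg_data_t;

With aging decided per node, the zone that backed an allocation no longer influences how long a page stays resident, which is what makes the fair zone allocation policy removable.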
The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have relatively lower highmem:lowmem ratios than the ones we worried about in the past. Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes. The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly. pagealloc --------- This is a microbenchmark that shows the benefit of removing the fair zone allocation policy. It was tested up to order-4 but only orders 0 and 1 are shown as the other orders were comparable. 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min total-odr0-1 490.00 ( 0.00%) 463.00 ( 5.51%) Min total-odr0-2 349.00 ( 0.00%) 325.00 ( 6.88%) Min total-odr0-4 288.00 ( 0.00%) 272.00 ( 5.56%) Min total-odr0-8 250.00 ( 0.00%) 235.00 ( 6.00%) Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%) Min total-odr0-32 223.00 ( 0.00%) 205.00 ( 8.07%) Min total-odr0-64 217.00 ( 0.00%) 202.00 ( 6.91%) Min total-odr0-128 214.00 ( 0.00%) 207.00 ( 3.27%) Min total-odr0-256 242.00 ( 0.00%) 242.00 ( 0.00%) Min total-odr0-512 272.00 ( 0.00%) 265.00 ( 2.57%) Min total-odr0-1024 290.00 ( 0.00%) 283.00 ( 2.41%) Min total-odr0-2048 302.00 ( 0.00%) 296.00 ( 1.99%) Min total-odr0-4096 311.00 ( 0.00%) 306.00 ( 1.61%) Min total-odr0-8192 314.00 ( 0.00%) 309.00 ( 1.59%) Min total-odr0-16384 315.00 ( 0.00%) 309.00 ( 1.90%) Min total-odr1-1 741.00 ( 0.00%) 716.00 ( 3.37%) Min total-odr1-2 565.00 ( 0.00%) 524.00 ( 7.26%) Min total-odr1-4 457.00 ( 0.00%) 427.00 ( 6.56%) Min total-odr1-8 408.00 ( 0.00%) 371.00 ( 9.07%) Min total-odr1-16 383.00 ( 0.00%) 344.00 ( 10.18%) Min total-odr1-32 378.00 ( 0.00%) 334.00 ( 11.64%) Min total-odr1-64 383.00 ( 0.00%) 334.00 ( 12.79%) Min total-odr1-128 376.00 ( 0.00%) 342.00 ( 9.04%) Min total-odr1-256 381.00 ( 0.00%) 343.00 ( 9.97%) Min total-odr1-512 388.00 ( 0.00%) 349.00 ( 10.05%) Min total-odr1-1024 386.00 ( 0.00%) 356.00 ( 7.77%) Min total-odr1-2048 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-4096 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-8192 389.00 ( 0.00%) 362.00 ( 6.94%) This shows a steady improvement throughout. The primary benefit is from reduced system CPU usage which is obvious from the overall times: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 User 191.39 191.61 System 2651.24 2504.48 Elapsed 2904.40 2757.01 The vmstats also showed that the fair zone allocation policy was definitely removed as can be seen here: 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v8 DMA32 allocs 28794771816 0 Normal allocs 48432582848 77227356392 Movable allocs 0 0 tiobench on ext4 ---------------- tiobench is a benchmark that artificially benefits if old pages remain resident while new pages get reclaimed. The fair zone allocation policy mitigates this problem so pages age fairly. While the benchmark has problems, it is important that tiobench performance remains constant as it implies that page aging problems that the fair zone allocation policy fixes are not re-introduced.
4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min PotentialReadSpeed 89.65 ( 0.00%) 90.34 ( 0.77%) Min SeqRead-MB/sec-1 82.68 ( 0.00%) 83.13 ( 0.54%) Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.15 ( -0.84%) Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.23 ( -1.20%) Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.25 ( 0.52%) Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.76 ( 0.84%) Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.95 ( 7.95%) Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.94 ( -1.05%) Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.46 ( 2.10%) Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.58 ( -1.86%) Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.93 ( 7.22%) Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 78.84 ( 3.18%) Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.35 ( -1.03%) Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 78.69 ( -1.70%) Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 71.38 ( -2.06%) Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 75.81 ( -0.13%) Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.12 ( -5.08%) Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.02 ( 0.00%) Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.99 ( -5.71%) Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%) Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.89 ( -3.26%) This shows that the series has little or no impact on tiobench which is desirable. It indicates that the fair zone allocation policy was removed in a manner that didn't reintroduce one class of page aging bug. There were only minor differences in overall reclaim activity: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Minor Faults 645838 644036 Major Faults 573 593 Swap Ins 0 0 Swap Outs 0 0 Allocation stalls 24 0 DMA allocs 0 0 DMA32 allocs 46041453 44154171 Normal allocs 78053072 79865782 Movable allocs 0 0 Direct pages scanned 10969 54504 Kswapd pages scanned 93375144 93250583 Kswapd pages reclaimed 93372243 93247714 Direct pages reclaimed 10969 54504 Kswapd efficiency 99% 99% Kswapd velocity 13741.015 13711.950 Direct efficiency 100% 100% Direct velocity 1.614 8.014 Percentage direct scans 0% 0% Zone normal velocity 8641.875 13719.964 Zone dma32 velocity 5100.754 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 0.000 0.000 Page writes file 0 0 Page writes anon 0 0 Page reclaim immediate 37 54 kswapd activity was roughly comparable. There were differences in direct reclaim activity but negligible in the context of the overall workload (velocity of 8 pages per second with the patches applied, 1.6 pages per second in the baseline kernel). pgbench read-only large configuration on ext4 --------------------------------------------- pgbench is a database benchmark that can be sensitive to page reclaim decisions. This also checks if removing the fair zone allocation policy is safe. pgbench Transactions 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%) Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%) Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%) Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%) Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%) Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%) Negligible differences again. As with tiobench, overall reclaim activity was comparable. bonnie++ on ext4 ---------------- No interesting performance difference, negligible differences on reclaim stats. paralleldd on ext4 ------------------ This workload uses varying numbers of dd instances to read large amounts of data from disk.
4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 Amean Elapsd-1 181.57 ( 0.00%) 179.63 ( 1.07%) Amean Elapsd-3 188.29 ( 0.00%) 183.68 ( 2.45%) Amean Elapsd-5 188.02 ( 0.00%) 181.73 ( 3.35%) Amean Elapsd-7 186.07 ( 0.00%) 184.11 ( 1.05%) Amean Elapsd-12 188.16 ( 0.00%) 183.51 ( 2.47%) Amean Elapsd-16 189.03 ( 0.00%) 181.27 ( 4.10%) 4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 User 1439.23 1433.37 System 8332.31 8216.01 Elapsed 3619.80 3532.69 There is a slight gain in performance, some of which is from the reduced system CPU usage. There are minor differences in reclaim activity but nothing significant: 4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 Minor Faults 362486 358215 Major Faults 1143 1113 Swap Ins 26 0 Swap Outs 2920 482 DMA allocs 0 0 DMA32 allocs 31568814 28598887 Normal allocs 46539922 49514444 Movable allocs 0 0 Allocation stalls 0 0 Direct pages scanned 0 0 Kswapd pages scanned 40886878 40849710 Kswapd pages reclaimed 40869923 40835207 Direct pages reclaimed 0 0 Kswapd efficiency 99% 99% Kswapd velocity 11295.342 11563.344 Direct efficiency 100% 100% Direct velocity 0.000 0.000 Slabs scanned 131673 126099 Direct inode steals 57 60 Kswapd inode steals 762 18 It basically shows that kswapd was active at roughly the same rate in both kernels. There was also comparable slab scanning activity and direct reclaim was avoided in both cases. There appears to be a large difference in numbers of inodes reclaimed but the workload has few active inodes and is likely a timing artifact. It's interesting to note that the node-lru did not swap in any pages but given the low swap activity, it's unlikely to be significant. stutter ------- stutter simulates a simple workload. One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. The primary metric is checking for mmap latency. stutter 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min mmap 16.6283 ( 0.00%) 16.1394 ( 2.94%) 1st-qrtle mmap 54.7570 ( 0.00%) 55.2975 ( -0.99%) 2nd-qrtle mmap 57.3163 ( 0.00%) 57.5230 ( -0.36%) 3rd-qrtle mmap 58.9976 ( 0.00%) 58.0537 ( 1.60%) Max-90% mmap 59.7433 ( 0.00%) 58.3910 ( 2.26%) Max-93% mmap 60.1298 ( 0.00%) 58.4801 ( 2.74%) Max-95% mmap 73.4112 ( 0.00%) 58.5537 ( 20.24%) Max-99% mmap 92.8542 ( 0.00%) 58.9673 ( 36.49%) Max mmap 1440.6569 ( 0.00%) 137.6875 ( 90.44%) Mean mmap 59.3493 ( 0.00%) 55.5153 ( 6.46%) Best99%Mean mmap 57.2121 ( 0.00%) 55.4194 ( 3.13%) Best95%Mean mmap 55.9113 ( 0.00%) 55.2813 ( 1.13%) Best90%Mean mmap 55.6199 ( 0.00%) 55.1044 ( 0.93%) Best50%Mean mmap 53.2183 ( 0.00%) 52.8330 ( 0.72%) Best10%Mean mmap 45.9842 ( 0.00%) 42.3740 ( 7.85%) Best5%Mean mmap 43.2256 ( 0.00%) 38.8660 ( 10.09%) Best1%Mean mmap 32.9388 ( 0.00%) 27.7577 ( 15.73%) This shows a number of improvements with the worst-case outlier greatly improved.
Some of the vmstats are interesting: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Swap Ins 163 239 Swap Outs 0 0 Allocation stalls 2603 0 DMA allocs 0 0 DMA32 allocs 618719206 1303037965 Normal allocs 891235743 229914091 Movable allocs 0 0 Direct pages scanned 216787 3173 Kswapd pages scanned 50719775 41732250 Kswapd pages reclaimed 41541765 41731168 Direct pages reclaimed 209159 3173 Kswapd efficiency 81% 99% Kswapd velocity 16859.554 14231.043 Direct efficiency 96% 100% Direct velocity 72.061 1.082 Percentage direct scans 0% 0% Zone normal velocity 8431.777 14232.125 Zone dma32 velocity 8499.838 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 6215049.000 0.000 Page writes file 6215049 0 Page writes anon 0 0 Page reclaim immediate 70673 143 Sector Reads 81940800 81489388 Sector Writes 100158984 99161860 Page rescued immediate 0 0 Slabs scanned 1366954 21196 While this is not guaranteed in all cases, this particular test showed a large reduction in direct reclaim activity. It's also worth noting that no page writes were issued from reclaim context. This series is not without its hazards. There are at least three areas that I'm concerned with even though I could not reproduce any problems in those areas. 1. Reclaim/compaction is going to be affected because the amount of reclaim is no longer targeted at a specific zone. Compaction works on a per-zone basis so there is no guarantee that reclaiming a few THPs' worth of pages will have a positive impact on compaction success rates. 2. The Slab/LRU reclaim ratio is affected because the frequency at which the shrinkers are called is now different. This may or may not be a problem but if it is, it'll be because shrinkers are not called enough and some balancing is required. 3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are distributed between zones and the fair zone allocation policy used to do something very similar for anon. The distribution is now different, but not necessarily in any way that matters; it's still worth bearing in mind.
Documentation/cgroup-v1/memcg_test.txt | 4 +- Documentation/cgroup-v1/memory.txt | 4 +- arch/s390/appldata/appldata_mem.c | 2 +- arch/tile/mm/pgtable.c | 18 +- drivers/base/node.c | 77 ++- drivers/staging/android/lowmemorykiller.c | 12 +- drivers/staging/lustre/lustre/osc/osc_cache.c | 6 +- fs/fs-writeback.c | 4 +- fs/fuse/file.c | 8 +- fs/nfs/internal.h | 2 +- fs/nfs/write.c | 2 +- fs/proc/meminfo.c | 20 +- include/linux/backing-dev.h | 2 +- include/linux/memcontrol.h | 61 +- include/linux/mm.h | 5 + include/linux/mm_inline.h | 35 +- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 155 +++-- include/linux/swap.h | 24 +- include/linux/topology.h | 2 +- include/linux/vm_event_item.h | 14 +- include/linux/vmstat.h | 111 +++- include/linux/writeback.h | 2 +- include/trace/events/vmscan.h | 63 +- include/trace/events/writeback.h | 10 +- kernel/power/snapshot.c | 10 +- kernel/sysctl.c | 4 +- mm/backing-dev.c | 15 +- mm/compaction.c | 50 +- mm/filemap.c | 16 +- mm/huge_memory.c | 12 +- mm/internal.h | 11 +- mm/khugepaged.c | 14 +- mm/memcontrol.c | 215 +++---- mm/memory-failure.c | 4 +- mm/memory_hotplug.c | 7 +- mm/mempolicy.c | 2 +- mm/migrate.c | 35 +- mm/mlock.c | 12 +- mm/page-writeback.c | 123 ++-- mm/page_alloc.c | 371 +++++------ mm/page_idle.c | 4 +- mm/rmap.c | 26 +- mm/shmem.c | 14 +- mm/swap.c | 64 +- mm/swap_state.c | 4 +- mm/util.c | 4 +- mm/vmscan.c | 879 +++++++++++++------------- mm/vmstat.c | 398 +++++++++--- mm/workingset.c | 54 +- 50 files changed, 1674 insertions(+), 1319 deletions(-) -- 2.6.4
* [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman @ 2016-07-01 15:37 ` Mel Gorman 0 siblings, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman The number of LRU pages, dirty pages and writeback pages must be accounted for on both zones and nodes because of the reclaim retry logic, compaction retry logic and highmem calculations all depending on per-zone stats. The retry logic is only critical for allocations that can use any zones. Hence this patch will not retry reclaim or compaction for such allocations. This should not be a problem for reclaim as zone-constrained allocations are immune from OOM kill. For retries, a very rough approximation is made whether to retry or not. While it is possible this will make the wrong decision on occasion, it will not infinite loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. The highmem calculations only care about the global count of file pages in highmem. Hence, a global counter is used instead of per-zone stats. With this, the per-zone double accounting disappears. Suggested by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/mm_inline.h | 20 +++++++++++-- include/linux/mmzone.h | 4 --- include/linux/swap.h | 1 - mm/compaction.c | 22 ++++++++++++++- mm/migrate.c | 2 -- mm/page-writeback.c | 13 ++++----- mm/page_alloc.c | 71 ++++++++++++++++++++++++++++++++--------------- mm/vmscan.c | 16 ----------- mm/vmstat.c | 3 -- 9 files changed, 92 insertions(+), 60 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 9aadcc781857..c68680aac044 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -4,6 +4,22 @@ #include <linux/huge_mm.h> #include <linux/swap.h> +#ifdef CONFIG_HIGHMEM +extern unsigned long highmem_file_pages; + +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ + if (is_highmem_idx(zid) && is_file_lru(lru)) + highmem_file_pages += nr_pages; +} +#else +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ +} +#endif + /** * page_is_file_cache - should the page be on a file LRU or anon LRU? 
* @page: the page to test @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, struct pglist_data *pgdat = lruvec_pgdat(lruvec); __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); - __mod_zone_page_state(&pgdat->node_zones[zid], - NR_ZONE_LRU_BASE + !!is_file_lru(lru), - nr_pages); + acct_highmem_file_pages(zid, lru, nr_pages); } static __always_inline void update_lru_size(struct lruvec *lruvec, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index facee6b83440..9268528c20c0 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -110,10 +110,6 @@ struct zone_padding { enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, - NR_ZONE_LRU_FILE, - NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, diff --git a/include/linux/swap.h b/include/linux/swap.h index b17cc4830fa6..cc753c639e3d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ -extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); diff --git a/mm/compaction.c b/mm/compaction.c index a0bd85712516..dfe7dafe8e8b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, { struct zone *zone; struct zoneref *z; + pg_data_t *last_pgdat = NULL; + +#ifdef CONFIG_HIGHMEM + /* Do not retry compaction for zone-constrained allocations */ + if (!is_highmem_idx(ac->high_zoneidx)) + return false; +#endif /* * Make sure at least one zone would pass __compaction_suitable if we continue @@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, unsigned long available; enum compact_result compact_result; + if (last_pgdat == zone->zone_pgdat) + continue; + + /* + * This over-estimates the number of pages available for + * reclaim/compaction but walking the LRU would take too + * long. The consequences are that compaction may retry + * longer than it should for a zone-constrained allocation + * request. + */ + last_pgdat = zone->zone_pgdat; + available = pgdat_reclaimable_pages(zone->zone_pgdat) / order; + /* * Do not consider all the reclaimable memory because we do not * want to trash just for a single high order allocation which * is even not guaranteed to appear even if __compaction_suitable * is happy about the watermark check. 
*/ - available = zone_reclaimable_pages(zone) / order; available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + available = min(zone->managed_pages, available); compact_result = __compaction_suitable(zone, order, alloc_flags, ac_classzone_idx(ac), available); if (compact_result != COMPACT_SKIPPED && diff --git a/mm/migrate.c b/mm/migrate.c index c77997dc6ed7..ed2f85e61de1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping, } if (dirty && mapping_cap_account_dirty(mapping)) { __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY); - __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING); __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY); - __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING); } } local_irq_enable(); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3c02aa603f5a..8db1db234915 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat) return nr_pages; } +#ifdef CONFIG_HIGHMEM +unsigned long highmem_file_pages; +#endif static unsigned long highmem_dirtyable_memory(unsigned long total) { @@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) int node; unsigned long x = 0; int i; + unsigned long dirtyable = highmem_file_pages; for_each_node_state(node, N_HIGH_MEMORY) { for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) { struct zone *z; - unsigned long dirtyable; if (!is_highmem_idx(i)) continue; z = &NODE_DATA(node)->node_zones[i]; - dirtyable = zone_page_state(z, NR_FREE_PAGES) + - zone_page_state(z, NR_ZONE_LRU_FILE); + dirtyable += zone_page_state(z, NR_FREE_PAGES); /* watch for underflows */ dirtyable -= min(dirtyable, high_wmark_pages(z)); @@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY); __inc_node_page_state(page, NR_FILE_DIRTY); - __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); __inc_node_page_state(page, NR_DIRTIED); __inc_wb_stat(wb, WB_RECLAIMABLE); __inc_wb_stat(wb, WB_DIRTIED); @@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, if (mapping_cap_account_dirty(mapping)) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); dec_node_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); dec_wb_stat(wb, WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_SIZE); } @@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); dec_node_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); dec_wb_stat(wb, WB_RECLAIMABLE); ret = 1; } @@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page) if (ret) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); dec_node_page_state(page, NR_WRITEBACK); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); inc_node_page_state(page, NR_WRITTEN); } unlock_page_memcg(page); @@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write) if (!ret) { mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); inc_node_page_state(page, NR_WRITEBACK); - inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); } unlock_page_memcg(page); return ret; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d3eb15c35bb1..9581185cb31a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3445,6 +3445,7 @@ 
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
+	 * Blindly retry allocation requests that cannot use all zones. We do
+	 * not have a reliable and fast means of calculating reclaimable, dirty
+	 * and writeback pages in eligible zones.
+	 */
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
+		goto out;
+
+	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		unsigned long write_pending = 0;
+		int zid;
+
+		if (current_pgdat == zone->zone_pgdat)
+			continue;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		current_pgdat = zone->zone_pgdat;
+		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
+				node_page_state(current_pgdat, NR_FILE_DIRTY);
 
-		/*
-		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
-		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac_classzone_idx(ac), alloc_flags, available)) {
-			/*
-			 * If we didn't make any progress and have a lot of
-			 * dirty + writeback pages then we should wait for
-			 * an IO to complete to slow down the reclaim and
-			 * prevent from pre mature OOM
-			 */
-			if (!did_some_progress) {
-				unsigned long write_pending;
+		/* Account for all free pages on eligible zones */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *acct_zone = &current_pgdat->node_zones[zid];
 
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
+			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+		}
 
-				if (2 * write_pending > reclaimable) {
-					congestion_wait(BLK_RW_ASYNC, HZ/10);
-					return true;
-				}
+		/*
+		 * If we didn't make any progress and have a lot of
+		 * dirty + writeback pages then we should wait for an IO to
+		 * complete to slow down the reclaim and prevent from premature
+		 * OOM.
+		 */
+		if (!did_some_progress) {
+			if (2 * write_pending > reclaimable) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				return true;
 			}
+		}
 
+		/*
+		 * Would the allocation succeed if we reclaimed the whole
+		 * available? This is approximate because there is no
+		 * accurate count of reclaimable pages per zone.
+		 */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *check_zone = &current_pgdat->node_zones[zid];
+			unsigned long estimate;
+
+			estimate = min(check_zone->managed_pages, available);
+			if (__zone_watermark_ok(check_zone, order,
+					min_wmark_pages(check_zone), ac_classzone_idx(ac),
+					alloc_flags, estimate))
+				goto out;
+		}
+	}
+
+	return false;
+
+out:
 	/*
 	 * Memory allocation/reclaim might be called from a WQ
 	 * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151c30dd27e2..c538a8cab43b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	unsigned long nr;
-
-	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-	return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce09be63e8c7..524c082072be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -908,9 +908,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_zone_anon_lru",
-	"nr_zone_file_lru",
-	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4
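Two details are worth spelling out for readers who only have the hunks above.

First, the mm_inline.h hunk calls acct_highmem_file_pages() and mm/page-writeback.c gains a bare highmem_file_pages counter, but the helper's definition falls in a part of the patch not quoted here. A minimal sketch of what such a helper could look like, assuming it does nothing more than feed the global counter consumed by highmem_dirtyable_memory() — a reconstruction for illustration, not the patch's verbatim code:

	#ifdef CONFIG_HIGHMEM
	extern unsigned long highmem_file_pages;

	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
							int nr_pages)
	{
		/* Only file pages on highmem zones feed the dirty limits */
		if (is_highmem_idx(zid) && is_file_lru(lru))
			highmem_file_pages += nr_pages;
	}
	#else
	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
							int nr_pages)
	{
	}
	#endif

Second, the retry discount in should_reclaim_retry() is what bounds the approximation: each no-progress pass subtracts no_progress_loops/MAX_RECLAIM_RETRIES of the reclaimable estimate, so once no_progress_loops reaches MAX_RECLAIM_RETRIES (16 in mm/page_alloc.c at the time of this series) the estimate collapses to the free pages alone and the decision degenerates to a plain watermark check — the loop cannot run forever. A stand-alone user-space sketch of the arithmetic:

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES 16
	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	int main(void)
	{
		/* stand-in for pgdat_reclaimable_pages() */
		unsigned long reclaimable = 100000;
		int loops;

		for (loops = 1; loops <= MAX_RECLAIM_RETRIES; loops++) {
			unsigned long available = reclaimable;

			/* the same discount should_reclaim_retry() applies */
			available -= DIV_ROUND_UP(loops * available,
						  MAX_RECLAIM_RETRIES);
			printf("loop %2d: assume %lu of %lu reclaimable\n",
			       loops, available, reclaimable);
		}
		return 0;	/* by loop 16 the estimate is 0 */
	}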
end of thread, other threads:[~2016-07-07 11:26 UTC | newest]

Thread overview: 10+ messages
     [not found] <00f601d1d691$d790ad40$86b207c0$@alibaba-inc.com>
2016-07-05  8:07 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Hillf Danton
2016-07-05 10:55   ` Mel Gorman
2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman
2016-07-06  0:02   ` Minchan Kim
2016-07-06  8:58     ` Mel Gorman
2016-07-06  9:33       ` Mel Gorman
2016-07-07  6:47         ` Minchan Kim
2016-07-06 18:12   ` Dave Hansen
2016-07-07 11:26     ` Mel Gorman

-- strict thread matches above, loose matches on Subject: below --
2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 15:37 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman