* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-05  8:07 ` Hillf Danton
  2016-07-05 10:55   ` Mel Gorman

From: Hillf Danton
To: Mel Gorman
Cc: Michal Hocko, linux-kernel, linux-mm, Andrew Morton

>
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> The retry logic is only critical for allocations that can use any zones.
> Hence this patch will not retry reclaim or compaction for such allocations.
> This should not be a problem for reclaim as zone-constrained allocations
> are immune from OOM kill. For retries, a very rough approximation is made
> whether to retry or not. While it is possible this will make the wrong
> decision on occasion, it will not infinite loop as the number of reclaim
> attempts is capped by MAX_RECLAIM_RETRIES.
>
> The highmem calculations only care about the global count of file pages
> in highmem. Hence, a global counter is used instead of per-zone stats.
> With this, the per-zone double accounting disappears.
>
> Suggested by: Michal Hocko <mhocko@kernel.org>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---

Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

>  include/linux/mm_inline.h | 20 +++++++++++--
>  include/linux/mmzone.h    |  4 ---
>  include/linux/swap.h      |  1 -
>  mm/compaction.c           | 22 ++++++++++++++-
>  mm/migrate.c              |  2 --
>  mm/page-writeback.c       | 13 ++++-----
>  mm/page_alloc.c           | 71 ++++++++++++++++++++++++++++++++---------------
>  mm/vmscan.c               | 16 ----------
>  mm/vmstat.c               |  3 --
>  9 files changed, 92 insertions(+), 60 deletions(-)
>
[...]
> @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  {
>          struct zone *zone;
>          struct zoneref *z;
> +        pg_data_t *current_pgdat = NULL;
>
>          /*
>           * Make sure we converge to OOM if we cannot make any progress
> @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>                  return false;
>
>          /*
> +         * Blindly retry allocation requests that cannot use all zones. We do
> +         * not have a reliable and fast means of calculating reclaimable, dirty
> +         * and writeback pages in eligible zones.
> +         */
> +        if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
> +                goto out;
> +
> +        /*
>           * Keep reclaiming pages while there is a chance this will lead somewhere.
>           * If none of the target zones can satisfy our allocation request even
>           * if all reclaimable pages are considered then we are screwed and have
> @@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>                                          ac->nodemask) {
>                  unsigned long available;
>                  unsigned long reclaimable;
> +                unsigned long write_pending = 0;
> +                int zid;
> +
> +                if (current_pgdat == zone->zone_pgdat)
> +                        continue;
>
> -                available = reclaimable = zone_reclaimable_pages(zone);
> +                current_pgdat = zone->zone_pgdat;
> +                available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
>                  available -= DIV_ROUND_UP(no_progress_loops * available,
>                                            MAX_RECLAIM_RETRIES);
> -                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> +                write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
> +                        node_page_state(current_pgdat, NR_FILE_DIRTY);
>
> -                /*
> -                 * Would the allocation succeed if we reclaimed the whole
> -                 * available?
> -                 */
> -                if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> -                                ac_classzone_idx(ac), alloc_flags, available)) {
> -                        /*
> -                         * If we didn't make any progress and have a lot of
> -                         * dirty + writeback pages then we should wait for
> -                         * an IO to complete to slow down the reclaim and
> -                         * prevent from pre mature OOM
> -                         */
> -                        if (!did_some_progress) {
> -                                unsigned long write_pending;
> +                /* Account for all free pages on eligible zones */
> +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> +                        struct zone *acct_zone = &current_pgdat->node_zones[zid];
>
> -                                write_pending = zone_page_state_snapshot(zone,
> -                                                        NR_ZONE_WRITE_PENDING);
> +                        available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
> +                }
>
> -                                if (2 * write_pending > reclaimable) {
> -                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
> -                                        return true;
> -                                }
> +                /*
> +                 * If we didn't make any progress and have a lot of
> +                 * dirty + writeback pages then we should wait for an IO to
> +                 * complete to slow down the reclaim and prevent from premature
> +                 * OOM.
> +                 */
> +                if (!did_some_progress) {
> +                        if (2 * write_pending > reclaimable) {
> +                                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +                                return true;
>                          }
> +                }
>
> +                /*
> +                 * Would the allocation succeed if we reclaimed the whole
> +                 * available? This is approximate because there is no
> +                 * accurate count of reclaimable pages per zone.
> +                 */
> +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> +                        struct zone *check_zone = &current_pgdat->node_zones[zid];
> +                        unsigned long estimate;
> +
> +                        estimate = min(check_zone->managed_pages, available);
> +                        if (__zone_watermark_ok(check_zone, order,
> +                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
> +                                alloc_flags, available)) {
> +                        }

Stray indent?

> +out:
>          /*
>           * Memory allocation/reclaim might be called from a WQ
>           * context and the current implementation of the WQ
[...]
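One detail in the hunk above deserves a plain-language restatement: when
reclaim makes no progress and most of what could be reclaimed is still
waiting on IO, the allocator sleeps briefly instead of declaring OOM.
Distilled into a standalone predicate (the helper name is invented here
purely for illustration; the patch open-codes the test):

        /*
         * Invented helper, for illustration only: should_reclaim_retry()
         * open-codes this test. With no reclaim progress, "more than half
         * of the node's reclaimable pages are dirty or under writeback"
         * means waiting for IO is more useful than another LRU scan.
         */
        static bool mostly_write_pending(unsigned long write_pending,
                                         unsigned long reclaimable)
        {
                return 2 * write_pending > reclaimable;
        }

For example, with 1000 reclaimable pages on a node, 501 or more
dirty/writeback pages triggers congestion_wait(BLK_RW_ASYNC, HZ/10)
followed by another retry.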
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-05 10:55   ` Mel Gorman

From: Mel Gorman
To: Hillf Danton
Cc: Michal Hocko, linux-kernel, linux-mm, Andrew Morton

On Tue, Jul 05, 2016 at 04:07:23PM +0800, Hillf Danton wrote:
> > +                /*
> > +                 * Would the allocation succeed if we reclaimed the whole
> > +                 * available? This is approximate because there is no
> > +                 * accurate count of reclaimable pages per zone.
> > +                 */
> > +                for (zid = 0; zid <= zone_idx(zone); zid++) {
> > +                        struct zone *check_zone = &current_pgdat->node_zones[zid];
> > +                        unsigned long estimate;
> > +
> > +                        estimate = min(check_zone->managed_pages, available);
> > +                        if (__zone_watermark_ok(check_zone, order,
> > +                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
> > +                                alloc_flags, available)) {
> > +                        }
>
> Stray indent?
>

Last minute rebase-related damage. I'll fix it.

--
Mel Gorman
SUSE Labs
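For readers following along, the damaged hunk presumably wants to conclude
the zone walk along these lines, returning true once any eligible zone
could pass its watermark, and actually using the clamped estimate that the
posted version computes but never passes. This is a sketch inferred from
the surrounding logic, not the actual follow-up fix:

        /*
         * Sketch only: inferred intent of the hunk flagged above. The
         * node-wide 'available' figure is clamped to each zone's managed
         * pages so a single zone is never credited with more memory than
         * it could possibly hold.
         */
        for (zid = 0; zid <= zone_idx(zone); zid++) {
                struct zone *check_zone = &current_pgdat->node_zones[zid];
                unsigned long estimate;

                estimate = min(check_zone->managed_pages, available);
                if (__zone_watermark_ok(check_zone, order,
                                min_wmark_pages(check_zone),
                                ac_classzone_idx(ac), alloc_flags,
                                estimate))
                        return true;    /* one more retry is justified */
        }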
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8
@ 2016-07-01 20:01 Mel Gorman
  2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman

From: Mel Gorman
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

(Sorry for the resend, I accidentally sent the branch that still had the
Signed-off-by's from mmotm applied, which is incorrect.)

Previous releases double accounted LRU stats on the zone and the node
because it was required by should_reclaim_retry. The last patch in the
series removes the double accounting. It's not integrated with the series
as reviewers may not like the solution. If not, it can be safely dropped
without a major impact to the results.

Changelog since v7
o Rebase onto current mmots
o Avoid double accounting of stats in node and zone
o Kswapd will avoid more reclaim if an eligible zone is available
o Remove some duplications of sc->reclaim_idx and classzone_idx
o Print per-node stats in zoneinfo

Changelog since v6
o Correct reclaim_idx when direct reclaiming for memcg
o Also account LRU pages per zone for compaction/reclaim
o Add page_pgdat helper with more efficient lookup
o Init pgdat LRU lock only once
o Slight optimisation to wake_all_kswapds
o Always wake kcompactd when kswapd is going to sleep
o Rebase to mmotm as of June 15th, 2016

Changelog since v5
o Rebase and adjust to changes

Changelog since v4
o Rebase on top of v3 of page allocator optimisation series

Changelog since v3
o Rebase on top of the page allocator optimisation series
o Remove RFC tag

This is the latest version of a series that moves LRUs from the zones to
the node and is based upon 4.7-rc4 with Andrew's tree applied. While this
is a current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details. Some
of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from. This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages in
   that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in the distribution of file/anon pages
   due to when they were allocated can result in a difference in aging.
   While the fair zone allocation policy mitigates some of the problems
   here, the page reclaim results on a multi-zone node will always be
   different to a single-zone node.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other. In the ideal case this mitigates
   the page allocator using pages that were allocated very recently, but
   it's sensitive to timing. When kswapd is reclaiming from the lower
   zones then it's great, but during the rebalancing of the highest zone
   the page allocator and kswapd interfere with each other. It's worse if
   the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the
   exact relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have large
highmem zones in common configurations and it was necessary to quickly
find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern
as machines with lots of memory will (or should) use 64-bit kernels.
Combinations of 32-bit hardware and large amounts of memory are rare.
Machines that do use highmem should have lower highmem:lowmem ratios than
we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the
NUMA machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested up to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                       4.7.0-rc4             4.7.0-rc4
                                  mmotm-20160623            nodelru-v8
Min      total-odr0-1           490.00 (  0.00%)      463.00 (  5.51%)
Min      total-odr0-2           349.00 (  0.00%)      325.00 (  6.88%)
Min      total-odr0-4           288.00 (  0.00%)      272.00 (  5.56%)
Min      total-odr0-8           250.00 (  0.00%)      235.00 (  6.00%)
Min      total-odr0-16          234.00 (  0.00%)      222.00 (  5.13%)
Min      total-odr0-32          223.00 (  0.00%)      205.00 (  8.07%)
Min      total-odr0-64          217.00 (  0.00%)      202.00 (  6.91%)
Min      total-odr0-128         214.00 (  0.00%)      207.00 (  3.27%)
Min      total-odr0-256         242.00 (  0.00%)      242.00 (  0.00%)
Min      total-odr0-512         272.00 (  0.00%)      265.00 (  2.57%)
Min      total-odr0-1024        290.00 (  0.00%)      283.00 (  2.41%)
Min      total-odr0-2048        302.00 (  0.00%)      296.00 (  1.99%)
Min      total-odr0-4096        311.00 (  0.00%)      306.00 (  1.61%)
Min      total-odr0-8192        314.00 (  0.00%)      309.00 (  1.59%)
Min      total-odr0-16384       315.00 (  0.00%)      309.00 (  1.90%)
Min      total-odr1-1           741.00 (  0.00%)      716.00 (  3.37%)
Min      total-odr1-2           565.00 (  0.00%)      524.00 (  7.26%)
Min      total-odr1-4           457.00 (  0.00%)      427.00 (  6.56%)
Min      total-odr1-8           408.00 (  0.00%)      371.00 (  9.07%)
Min      total-odr1-16          383.00 (  0.00%)      344.00 ( 10.18%)
Min      total-odr1-32          378.00 (  0.00%)      334.00 ( 11.64%)
Min      total-odr1-64          383.00 (  0.00%)      334.00 ( 12.79%)
Min      total-odr1-128         376.00 (  0.00%)      342.00 (  9.04%)
Min      total-odr1-256         381.00 (  0.00%)      343.00 (  9.97%)
Min      total-odr1-512         388.00 (  0.00%)      349.00 ( 10.05%)
Min      total-odr1-1024        386.00 (  0.00%)      356.00 (  7.77%)
Min      total-odr1-2048        389.00 (  0.00%)      362.00 (  6.94%)
Min      total-odr1-4096        389.00 (  0.00%)      362.00 (  6.94%)
Min      total-odr1-8192        389.00 (  0.00%)      362.00 (  6.94%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times:

               4.7.0-rc4   4.7.0-rc4
          mmotm-20160623  nodelru-v8
User              191.39      191.61
System           2651.24     2504.48
Elapsed          2904.40     2757.01

The vmstats also showed that the fair zone allocation policy was
definitely removed as can be seen here:

                             4.7.0-rc3       4.7.0-rc3
                        mmotm-20160623      nodelru-v8
DMA32 allocs               28794771816               0
Normal allocs              48432582848     77227356392
Movable allocs                       0               0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain
resident while new pages get reclaimed. The fair zone allocation policy
mitigates this problem so pages age fairly. While the benchmark has
problems, it is important that tiobench performance remains constant as it
implies that page aging problems that the fair zone allocation policy
fixes are not re-introduced.

                                    4.7.0-rc4             4.7.0-rc4
                               mmotm-20160623            nodelru-v8
Min      PotentialReadSpeed      89.65 (  0.00%)       90.34 (  0.77%)
Min      SeqRead-MB/sec-1        82.68 (  0.00%)       83.13 (  0.54%)
Min      SeqRead-MB/sec-2        72.76 (  0.00%)       72.15 ( -0.84%)
Min      SeqRead-MB/sec-4        75.13 (  0.00%)       74.23 ( -1.20%)
Min      SeqRead-MB/sec-8        64.91 (  0.00%)       65.25 (  0.52%)
Min      SeqRead-MB/sec-16       62.24 (  0.00%)       62.76 (  0.84%)
Min      RandRead-MB/sec-1        0.88 (  0.00%)        0.95 (  7.95%)
Min      RandRead-MB/sec-2        0.95 (  0.00%)        0.94 ( -1.05%)
Min      RandRead-MB/sec-4        1.43 (  0.00%)        1.46 (  2.10%)
Min      RandRead-MB/sec-8        1.61 (  0.00%)        1.58 ( -1.86%)
Min      RandRead-MB/sec-16       1.80 (  0.00%)        1.93 (  7.22%)
Min      SeqWrite-MB/sec-1       76.41 (  0.00%)       78.84 (  3.18%)
Min      SeqWrite-MB/sec-2       74.11 (  0.00%)       73.35 ( -1.03%)
Min      SeqWrite-MB/sec-4       80.05 (  0.00%)       78.69 ( -1.70%)
Min      SeqWrite-MB/sec-8       72.88 (  0.00%)       71.38 ( -2.06%)
Min      SeqWrite-MB/sec-16      75.91 (  0.00%)       75.81 ( -0.13%)
Min      RandWrite-MB/sec-1       1.18 (  0.00%)        1.12 ( -5.08%)
Min      RandWrite-MB/sec-2       1.02 (  0.00%)        1.02 (  0.00%)
Min      RandWrite-MB/sec-4       1.05 (  0.00%)        0.99 ( -5.71%)
Min      RandWrite-MB/sec-8       0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16      0.92 (  0.00%)        0.89 ( -3.26%)

This shows that the series has little or no impact on tiobench which is
desirable. It indicates that the fair zone allocation policy was removed
in a manner that didn't reintroduce one class of page aging bug. There
were only minor differences in overall reclaim activity:

                             4.7.0-rc4   4.7.0-rc4
                        mmotm-20160623  nodelru-v8
Minor Faults                    645838      644036
Major Faults                       573         593
Swap Ins                             0           0
Swap Outs                            0           0
Allocation stalls                   24           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44154171
Normal allocs                 78053072    79865782
Movable allocs                       0           0
Direct pages scanned             10969       54504
Kswapd pages scanned          93375144    93250583
Kswapd pages reclaimed        93372243    93247714
Direct pages reclaimed           10969       54504
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13711.950
Direct efficiency                 100%        100%
Direct velocity                  1.614       8.014
Percentage direct scans             0%          0%
Zone normal velocity          8641.875   13719.964
Zone dma32 velocity           5100.754       0.000
Zone dma velocity                0.000       0.000
Page writes by reclaim           0.000       0.000
Page writes file                     0           0
Page writes anon                     0           0
Page reclaim immediate              37          54

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 8 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe.

pgbench Transactions
                      4.7.0-rc4             4.7.0-rc4
                 mmotm-20160623            nodelru-v8
Hmean    1     188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5     330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12    370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21    368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30    382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32    428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts
of data from disk.

                            4.7.0-rc3             4.7.0-rc3
                       mmotm-20160615         nodelru-v7r17
Amean    Elapsd-1     181.57 (  0.00%)      179.63 (  1.07%)
Amean    Elapsd-3     188.29 (  0.00%)      183.68 (  2.45%)
Amean    Elapsd-5     188.02 (  0.00%)      181.73 (  3.35%)
Amean    Elapsd-7     186.07 (  0.00%)      184.11 (  1.05%)
Amean    Elapsd-12    188.16 (  0.00%)      183.51 (  2.47%)
Amean    Elapsd-16    189.03 (  0.00%)      181.27 (  4.10%)

               4.7.0-rc3      4.7.0-rc3
          mmotm-20160615  nodelru-v7r17
User             1439.23        1433.37
System           8332.31        8216.01
Elapsed          3619.80        3532.69

There is a slight gain in performance, some of which is from the reduced
system CPU usage. There are minor differences in reclaim activity but
nothing significant:

                             4.7.0-rc3      4.7.0-rc3
                        mmotm-20160615  nodelru-v7r17
Minor Faults                    362486         358215
Major Faults                      1143           1113
Swap Ins                            26              0
Swap Outs                         2920            482
DMA allocs                           0              0
DMA32 allocs                  31568814       28598887
Normal allocs                 46539922       49514444
Movable allocs                       0              0
Allocation stalls                    0              0
Direct pages scanned                 0              0
Kswapd pages scanned          40886878       40849710
Kswapd pages reclaimed        40869923       40835207
Direct pages reclaimed               0              0
Kswapd efficiency                  99%            99%
Kswapd velocity              11295.342      11563.344
Direct efficiency                 100%           100%
Direct velocity                  0.000          0.000
Slabs scanned                   131673         126099
Direct inode steals                 57             60
Kswapd inode steals                762             18

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in the number of inodes reclaimed but the workload has few active inodes
so it is likely a timing artifact. It's interesting to note that node-lru
did not swap in any pages but given the low swap activity, it's unlikely
to be significant.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap       16.6283 (  0.00%)     16.1394 (  2.94%)
1st-qrtle   mmap       54.7570 (  0.00%)     55.2975 ( -0.99%)
2nd-qrtle   mmap       57.3163 (  0.00%)     57.5230 ( -0.36%)
3rd-qrtle   mmap       58.9976 (  0.00%)     58.0537 (  1.60%)
Max-90%     mmap       59.7433 (  0.00%)     58.3910 (  2.26%)
Max-93%     mmap       60.1298 (  0.00%)     58.4801 (  2.74%)
Max-95%     mmap       73.4112 (  0.00%)     58.5537 ( 20.24%)
Max-99%     mmap       92.8542 (  0.00%)     58.9673 ( 36.49%)
Max         mmap     1440.6569 (  0.00%)    137.6875 ( 90.44%)
Mean        mmap       59.3493 (  0.00%)     55.5153 (  6.46%)
Best99%Mean mmap       57.2121 (  0.00%)     55.4194 (  3.13%)
Best95%Mean mmap       55.9113 (  0.00%)     55.2813 (  1.13%)
Best90%Mean mmap       55.6199 (  0.00%)     55.1044 (  0.93%)
Best50%Mean mmap       53.2183 (  0.00%)     52.8330 (  0.72%)
Best10%Mean mmap       45.9842 (  0.00%)     42.3740 (  7.85%)
Best5%Mean  mmap       43.2256 (  0.00%)     38.8660 ( 10.09%)
Best1%Mean  mmap       32.9388 (  0.00%)     27.7577 ( 15.73%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting:

                             4.7.0-rc4    4.7.0-rc4
                        mmotm-20160623   nodelru-v8
Swap Ins                           163          239
Swap Outs                            0            0
Allocation stalls                 2603            0
DMA allocs                           0            0
DMA32 allocs                 618719206   1303037965
Normal allocs                891235743    229914091
Movable allocs                       0            0
Direct pages scanned            216787         3173
Kswapd pages scanned          50719775     41732250
Kswapd pages reclaimed        41541765     41731168
Direct pages reclaimed          209159         3173
Kswapd efficiency                  81%          99%
Kswapd velocity              16859.554    14231.043
Direct efficiency                  96%         100%
Direct velocity                 72.061        1.082
Percentage direct scans             0%           0%
Zone normal velocity          8431.777    14232.125
Zone dma32 velocity           8499.838        0.000
Zone dma velocity                0.000        0.000
Page writes by reclaim     6215049.000        0.000
Page writes file               6215049            0
Page writes anon                     0            0
Page reclaim immediate           70673          143
Sector Reads                  81940800     81489388
Sector Writes                100158984     99161860
Page rescued immediate               0            0
Slabs scanned                  1366954        21196

While this is not guaranteed in all cases, this particular test showed a
large reduction in direct reclaim activity. It's also worth noting that
no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
those areas.

1. Reclaim/compaction is going to be affected because the amount of
   reclaim is no longer targeted at a specific zone. Compaction works on
   a per-zone basis so there is no guarantee that reclaiming a few THPs'
   worth of pages will have a positive impact on compaction success
   rates.

2. The slab/LRU reclaim ratio is affected because the frequency the
   shrinkers are called is now different. This may or may not be a
   problem but if it is, it'll be because shrinkers are not called enough
   and some balancing is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied
   are distributed between zones and the fair zone allocation policy used
   to do something very similar for anon. The distribution is now
   different, not necessarily in any way that matters, but it's still
   worth bearing in mind.

Mel Gorman (31):
  mm, vmstat: add infrastructure for per-node vmstats
  mm, vmscan: move lru_lock to the node
  mm, vmscan: move LRU lists to node
  mm, vmscan: begin reclaiming pages on a per-node basis
  mm, vmscan: have kswapd only scan based on the highest requested zone
  mm, vmscan: make kswapd reclaim in terms of nodes
  mm, vmscan: remove balance gap
  mm, vmscan: simplify the logic deciding whether kswapd sleeps
  mm, vmscan: by default have direct reclaim only shrink once per node
  mm, vmscan: remove duplicate logic clearing node congestion and dirty state
  mm: vmscan: do not reclaim from kswapd if there is any eligible zone
  mm, vmscan: make shrink_node decisions more node-centric
  mm, memcg: move memcg limit enforcement from zones to nodes
  mm, workingset: make working set detection node-aware
  mm, page_alloc: consider dirtyable memory in terms of nodes
  mm: move page mapped accounting to the node
  mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
  mm: move most file-based accounting to the node
  mm: move vmscan writes and file write accounting to the node
  mm, vmscan: only wakeup kswapd once per node for the requested classzone
  mm, page_alloc: Wake kswapd based on the highest eligible zone
  mm: convert zone_reclaim to node_reclaim
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to shrink_node
  mm, vmscan: Avoid passing in classzone_idx unnecessarily to compaction_ready
  mm, vmscan: add classzone information to tracepoints
  mm, page_alloc: remove fair zone allocation policy
  mm: page_alloc: cache the last node whose dirty limit is reached
  mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
  mm: vmstat: account per-zone stalls and pages skipped during reclaim
  mm, vmstat: print node-based stats in zoneinfo file
  mm, vmstat: Remove zone and node double accounting by approximating retries

 Documentation/cgroup-v1/memcg_test.txt        |   4 +-
 Documentation/cgroup-v1/memory.txt            |   4 +-
 arch/s390/appldata/appldata_mem.c             |   2 +-
 arch/tile/mm/pgtable.c                        |  18 +-
 drivers/base/node.c                           |  77 ++-
 drivers/staging/android/lowmemorykiller.c     |  12 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c |   6 +-
 fs/fs-writeback.c                             |   4 +-
 fs/fuse/file.c                                |   8 +-
 fs/nfs/internal.h                             |   2 +-
 fs/nfs/write.c                                |   2 +-
 fs/proc/meminfo.c                             |  20 +-
 include/linux/backing-dev.h                   |   2 +-
 include/linux/memcontrol.h                    |  61 +-
 include/linux/mm.h                            |   5 +
 include/linux/mm_inline.h                     |  35 +-
 include/linux/mm_types.h                      |   2 +-
 include/linux/mmzone.h                        | 155 +++--
 include/linux/swap.h                          |  24 +-
 include/linux/topology.h                      |   2 +-
 include/linux/vm_event_item.h                 |  14 +-
 include/linux/vmstat.h                        | 111 +++-
 include/linux/writeback.h                     |   2 +-
 include/trace/events/vmscan.h                 |  63 +-
 include/trace/events/writeback.h              |  10 +-
 kernel/power/snapshot.c                       |  10 +-
 kernel/sysctl.c                               |   4 +-
 mm/backing-dev.c                              |  15 +-
 mm/compaction.c                               |  50 +-
 mm/filemap.c                                  |  16 +-
 mm/huge_memory.c                              |  12 +-
 mm/internal.h                                 |  11 +-
 mm/khugepaged.c                               |  14 +-
 mm/memcontrol.c                               | 215 +++----
 mm/memory-failure.c                           |   4 +-
 mm/memory_hotplug.c                           |   7 +-
 mm/mempolicy.c                                |   2 +-
 mm/migrate.c                                  |  35 +-
 mm/mlock.c                                    |  12 +-
 mm/page-writeback.c                           | 123 ++--
 mm/page_alloc.c                               | 371 +++++------
 mm/page_idle.c                                |   4 +-
 mm/rmap.c                                     |  26 +-
 mm/shmem.c                                    |  14 +-
 mm/swap.c                                     |  64 +-
 mm/swap_state.c                               |   4 +-
 mm/util.c                                     |   4 +-
 mm/vmscan.c                                   | 879 +++++++++++++------------
 mm/vmstat.c                                   | 398 +++++++++---
 mm/workingset.c                               |  54 +-
 50 files changed, 1674 insertions(+), 1319 deletions(-)

--
2.6.4
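Two early items in the series ("move lru_lock to the node" and the
page_pgdat helper mentioned in the v6 changelog) reduce to small helpers.
Roughly the shape they take in the series (a sketch for orientation, not a
verbatim excerpt; exact placement in mm.h/mmzone.h may differ):

        /* Look up the node a page belongs to; cheaper than going via the zone. */
        static inline struct pglist_data *page_pgdat(struct page *page)
        {
                return NODE_DATA(page_to_nid(page));
        }

        /*
         * With LRU lists per node, every zone of a node shares its pgdat's
         * lru_lock; zone-based callers funnel through this helper.
         */
        static inline spinlock_t *zone_lru_lock(struct zone *zone)
        {
                return &zone->zone_pgdat->lru_lock;
        }

The second helper is what makes the conversion incremental: call sites
that still think in zones take the right (node-level) lock without being
rewritten all at once.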
* [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
  2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
@ 2016-07-01 20:01 ` Mel Gorman
  2016-07-06  0:02   ` Minchan Kim
  2016-07-06 18:12   ` Dave Hansen

From: Mel Gorman
To: Andrew Morton, Linux-MM
Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

The retry logic is only critical for allocations that can use any zones.
Hence this patch will not retry reclaim or compaction for such allocations.
This should not be a problem for reclaim as zone-constrained allocations
are immune from OOM kill. For retries, a very rough approximation is made
whether to retry or not. While it is possible this will make the wrong
decision on occasion, it will not infinite loop as the number of reclaim
attempts is capped by MAX_RECLAIM_RETRIES.

The highmem calculations only care about the global count of file pages
in highmem. Hence, a global counter is used instead of per-zone stats.
With this, the per-zone double accounting disappears.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mm_inline.h | 20 +++++++++++--
 include/linux/mmzone.h    |  4 ---
 include/linux/swap.h      |  1 -
 mm/compaction.c           | 22 ++++++++++++++-
 mm/migrate.c              |  2 --
 mm/page-writeback.c       | 13 ++++-----
 mm/page_alloc.c           | 71 ++++++++++++++++++++++++++++++++---------------
 mm/vmscan.c               | 16 ----------
 mm/vmstat.c               |  3 --
 9 files changed, 92 insertions(+), 60 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..c68680aac044 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,22 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>

+#ifdef CONFIG_HIGHMEM
+extern unsigned long highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+        if (is_highmem_idx(zid) && is_file_lru(lru))
+                highmem_file_pages += nr_pages;
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
         struct pglist_data *pgdat = lruvec_pgdat(lruvec);

         __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-        __mod_zone_page_state(&pgdat->node_zones[zid],
-                                NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-                                nr_pages);
+        acct_highmem_file_pages(zid, lru, nr_pages);
 }

 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index facee6b83440..9268528c20c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
         /* First 128 byte cacheline (assuming 64 bit words) */
         NR_FREE_PAGES,
-        NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-        NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-        NR_ZONE_LRU_FILE,
-        NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages */
         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
         NR_SLAB_RECLAIMABLE,
         NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
                                                 struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                         gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..dfe7dafe8e8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *last_pgdat = NULL;
+
+#ifdef CONFIG_HIGHMEM
+        /* Do not retry compaction for zone-constrained allocations */
+        if (!is_highmem_idx(ac->high_zoneidx))
+                return false;
+#endif

         /*
          * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                 unsigned long available;
                 enum compact_result compact_result;

+                if (last_pgdat == zone->zone_pgdat)
+                        continue;
+
+                /*
+                 * This over-estimates the number of pages available for
+                 * reclaim/compaction but walking the LRU would take too
+                 * long. The consequences are that compaction may retry
+                 * longer than it should for a zone-constrained allocation
+                 * request.
+                 */
+                last_pgdat = zone->zone_pgdat;
+                available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
                 /*
                  * Do not consider all the reclaimable memory because we do not
                  * want to trash just for a single high order allocation which
                  * is even not guaranteed to appear even if __compaction_suitable
                  * is happy about the watermark check.
                  */
-                available = zone_reclaimable_pages(zone) / order;
                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                available = min(zone->managed_pages, available);
                 compact_result = __compaction_suitable(zone, order, alloc_flags,
                                 ac_classzone_idx(ac), available);
                 if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
                 }
                 if (dirty && mapping_cap_account_dirty(mapping)) {
                         __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
                         __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
                 }
         }
         local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..8db1db234915 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)

         return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+unsigned long highmem_file_pages;
+#endif

 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
         int node;
         unsigned long x = 0;
         int i;
+        unsigned long dirtyable = highmem_file_pages;

         for_each_node_state(node, N_HIGH_MEMORY) {
                 for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                         struct zone *z;
-                        unsigned long dirtyable;

                         if (!is_highmem_idx(i))
                                 continue;

                         z = &NODE_DATA(node)->node_zones[i];
-                        dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-                                zone_page_state(z, NR_ZONE_LRU_FILE);
+                        dirtyable += zone_page_state(z, NR_FREE_PAGES);

                         /* watch for underflows */
                         dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 __inc_node_page_state(page, NR_FILE_DIRTY);
-                __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 __inc_node_page_state(page, NR_DIRTIED);
                 __inc_wb_stat(wb, WB_RECLAIMABLE);
                 __inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
         if (mapping_cap_account_dirty(mapping)) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 dec_node_page_state(page, NR_FILE_DIRTY);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 dec_wb_stat(wb, WB_RECLAIMABLE);
                 task_io_account_cancelled_write(PAGE_SIZE);
         }
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
                 if (TestClearPageDirty(page)) {
                         mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                         dec_node_page_state(page, NR_FILE_DIRTY);
-                        dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                         dec_wb_stat(wb, WB_RECLAIMABLE);
                         ret = 1;
                 }
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
         if (ret) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 dec_node_page_state(page, NR_WRITEBACK);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 inc_node_page_state(page, NR_WRITTEN);
         }
         unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
         if (!ret) {
                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 inc_node_page_state(page, NR_WRITEBACK);
-                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
         }
         unlock_page_memcg(page);
         return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d3eb15c35bb1..9581185cb31a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *current_pgdat = NULL;

         /*
          * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                 return false;

         /*
+         * Blindly retry allocation requests that cannot use all zones. We do
+         * not have a reliable and fast means of calculating reclaimable, dirty
+         * and writeback pages in eligible zones.
+         */
+        if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
+                goto out;
+
+        /*
          * Keep reclaiming pages while there is a chance this will lead somewhere.
          * If none of the target zones can satisfy our allocation request even
          * if all reclaimable pages are considered then we are screwed and have
@@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
                                         ac->nodemask) {
                 unsigned long available;
                 unsigned long reclaimable;
+                unsigned long write_pending = 0;
+                int zid;
+
+                if (current_pgdat == zone->zone_pgdat)
+                        continue;

-                available = reclaimable = zone_reclaimable_pages(zone);
+                current_pgdat = zone->zone_pgdat;
+                available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
                 available -= DIV_ROUND_UP(no_progress_loops * available,
                                           MAX_RECLAIM_RETRIES);
-                available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
+                        node_page_state(current_pgdat, NR_FILE_DIRTY);

-                /*
-                 * Would the allocation succeed if we reclaimed the whole
-                 * available?
-                 */
-                if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-                                ac_classzone_idx(ac), alloc_flags, available)) {
-                        /*
-                         * If we didn't make any progress and have a lot of
-                         * dirty + writeback pages then we should wait for
-                         * an IO to complete to slow down the reclaim and
-                         * prevent from pre mature OOM
-                         */
-                        if (!did_some_progress) {
-                                unsigned long write_pending;
+                /* Account for all free pages on eligible zones */
+                for (zid = 0; zid <= zone_idx(zone); zid++) {
+                        struct zone *acct_zone = &current_pgdat->node_zones[zid];

-                                write_pending = zone_page_state_snapshot(zone,
-                                                        NR_ZONE_WRITE_PENDING);
+                        available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+                }

-                                if (2 * write_pending > reclaimable) {
-                                        congestion_wait(BLK_RW_ASYNC, HZ/10);
-                                        return true;
-                                }
+                /*
+                 * If we didn't make any progress and have a lot of
+                 * dirty + writeback pages then we should wait for an IO to
+                 * complete to slow down the reclaim and prevent from premature
+                 * OOM.
+                 */
+                if (!did_some_progress) {
+                        if (2 * write_pending > reclaimable) {
+                                congestion_wait(BLK_RW_ASYNC, HZ/10);
+                                return true;
                         }
+                }

+                /*
+                 * Would the allocation succeed if we reclaimed the whole
+                 * available? This is approximate because there is no
+                 * accurate count of reclaimable pages per zone.
+                 */
+                for (zid = 0; zid <= zone_idx(zone); zid++) {
+                        struct zone *check_zone = &current_pgdat->node_zones[zid];
+                        unsigned long estimate;
+
+                        estimate = min(check_zone->managed_pages, available);
+                        if (__zone_watermark_ok(check_zone, order,
+                                min_wmark_pages(check_zone), ac_classzone_idx(ac),
+                                alloc_flags, available)) {
+                        }
+out:
         /*
          * Memory allocation/reclaim might be called from a WQ
          * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151c30dd27e2..c538a8cab43b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif

-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-        unsigned long nr;
-
-        nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-        if (get_nr_swap_pages() > 0)
-                nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-        return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
         unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce09be63e8c7..524c082072be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -908,9 +908,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
         /* enum zone_stat_item countes */
         "nr_free_pages",
-        "nr_zone_anon_lru",
-        "nr_zone_file_lru",
-        "nr_zone_write_pending",
         "nr_mlock",
         "nr_slab_reclaimable",
         "nr_slab_unreclaimable",
--
2.6.4
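The changelog's claim that the approximation "will not infinite loop"
follows directly from how available is scaled down by no_progress_loops
before the watermark test. A small standalone sketch of just that
arithmetic (the starting figure of one million pages is arbitrary;
MAX_RECLAIM_RETRIES is 16 in mm/internal.h):

        #include <stdio.h>

        #define MAX_RECLAIM_RETRIES 16
        #define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

        int main(void)
        {
                unsigned long reclaimable = 1000000;    /* arbitrary example */
                int no_progress_loops;

                for (no_progress_loops = 1;
                     no_progress_loops <= MAX_RECLAIM_RETRIES;
                     no_progress_loops++) {
                        unsigned long available = reclaimable;

                        /* same scaling as should_reclaim_retry() */
                        available -= DIV_ROUND_UP(no_progress_loops * available,
                                                  MAX_RECLAIM_RETRIES);
                        printf("loop %2d: assume %lu pages available\n",
                               no_progress_loops, available);
                }
                return 0;
        }

By the sixteenth fruitless loop the assumed reclaimable memory reaches
zero, so the watermark check must fail and should_reclaim_retry() gives up
no matter how optimistic the node-wide estimate was.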
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-06  0:02   ` Minchan Kim
  2016-07-06  8:58     ` Mel Gorman

From: Minchan Kim
To: Mel Gorman
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote:
> The number of LRU pages, dirty pages and writeback pages must be accounted
> for on both zones and nodes because of the reclaim retry logic, compaction
> retry logic and highmem calculations all depending on per-zone stats.
>
> The retry logic is only critical for allocations that can use any zones.

Sorry, I cannot follow this assertion.
Could you explain?

> Hence this patch will not retry reclaim or compaction for such allocations.

What is such allocations?

> This should not be a problem for reclaim as zone-constrained allocations
> are immune from OOM kill. For retries, a very rough approximation is made

zone-constrained allocations are immune from OOM kill?
Please explain it, too.

Sorry for the many questions but I cannot review code without clear
understanding of assumption/background which I couldn't notice.

[...]
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries
@ 2016-07-06  8:58     ` Mel Gorman
  2016-07-06  9:33       ` Mel Gorman
  2016-07-07  6:47       ` Minchan Kim

From: Mel Gorman
To: Minchan Kim
Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML

On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote:
> On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote:
> > The number of LRU pages, dirty pages and writeback pages must be accounted
> > for on both zones and nodes because of the reclaim retry logic, compaction
> > retry logic and highmem calculations all depending on per-zone stats.
> >
> > The retry logic is only critical for allocations that can use any zones.
>
> Sorry, I cannot follow this assertion.
> Could you explain?
>

The patch has been reworked since and I tried clarifying the changelog.
Does this help?

--- 8<----
mm, vmstat: remove zone and node double accounting by approximating retries

The number of LRU pages, dirty pages and writeback pages must be accounted
for on both zones and nodes because of the reclaim retry logic, compaction
retry logic and highmem calculations all depending on per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The
exception is costly high-order allocations or allocations that cannot
fail. If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then a check in __alloc_pages_slowpath will always retry.

Hence this patch will always retry reclaim for zone-constrained
allocations in should_reclaim_retry. As there is no guarantee enough
memory can ever be freed to satisfy compaction, this patch avoids retrying
compaction for zone-constrained allocations.

In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation. While it
is possible this will make the wrong decision on occasion, it will not
infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages. As
those calculations only care about the global count of file pages in
highmem, a global counter is used instead of per-zone stats as it is
sufficient. In combination, this allows the per-zone LRU and dirty state
counters to be removed.

Suggested by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 9aadcc781857..c68680aac044 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -4,6 +4,22 @@
 #include <linux/huge_mm.h>
 #include <linux/swap.h>

+#ifdef CONFIG_HIGHMEM
+extern unsigned long highmem_file_pages;
+
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+        if (is_highmem_idx(zid) && is_file_lru(lru))
+                highmem_file_pages += nr_pages;
+}
+#else
+static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
+                                        int nr_pages)
+{
+}
+#endif
+
 /**
  * page_is_file_cache - should the page be on a file LRU or anon LRU?
  * @page: the page to test
@@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec,
         struct pglist_data *pgdat = lruvec_pgdat(lruvec);

         __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages);
-        __mod_zone_page_state(&pgdat->node_zones[zid],
-                                NR_ZONE_LRU_BASE + !!is_file_lru(lru),
-                                nr_pages);
+        acct_highmem_file_pages(zid, lru, nr_pages);
 }

 static __always_inline void update_lru_size(struct lruvec *lruvec,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd33e6f1bed0..a3b7f45aac56 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -110,10 +110,6 @@ struct zone_padding {
 enum zone_stat_item {
         /* First 128 byte cacheline (assuming 64 bit words) */
         NR_FREE_PAGES,
-        NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */
-        NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE,
-        NR_ZONE_LRU_FILE,
-        NR_ZONE_WRITE_PENDING,  /* Count of dirty, writeback and unstable pages */
         NR_MLOCK,               /* mlock()ed pages found and moved off LRU */
         NR_SLAB_RECLAIMABLE,
         NR_SLAB_UNRECLAIMABLE,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b17cc4830fa6..cc753c639e3d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
                                                 struct vm_area_struct *vma);

 /* linux/mm/vmscan.c */
-extern unsigned long zone_reclaimable_pages(struct zone *zone);
 extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat);
 extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                         gfp_t gfp_mask, nodemask_t *mask);
diff --git a/mm/compaction.c b/mm/compaction.c
index a0bd85712516..dfe7dafe8e8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 {
         struct zone *zone;
         struct zoneref *z;
+        pg_data_t *last_pgdat = NULL;
+
+#ifdef CONFIG_HIGHMEM
+        /* Do not retry compaction for zone-constrained allocations */
+        if (!is_highmem_idx(ac->high_zoneidx))
+                return false;
+#endif

         /*
          * Make sure at least one zone would pass __compaction_suitable if we continue
@@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                 unsigned long available;
                 enum compact_result compact_result;

+                if (last_pgdat == zone->zone_pgdat)
+                        continue;
+
+                /*
+                 * This over-estimates the number of pages available for
+                 * reclaim/compaction but walking the LRU would take too
+                 * long. The consequences are that compaction may retry
+                 * longer than it should for a zone-constrained allocation
+                 * request.
+                 */
+                last_pgdat = zone->zone_pgdat;
+                available = pgdat_reclaimable_pages(zone->zone_pgdat) / order;
+
                 /*
                  * Do not consider all the reclaimable memory because we do not
                  * want to trash just for a single high order allocation which
                  * is even not guaranteed to appear even if __compaction_suitable
                  * is happy about the watermark check.
                  */
-                available = zone_reclaimable_pages(zone) / order;
                 available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+                available = min(zone->managed_pages, available);
                 compact_result = __compaction_suitable(zone, order, alloc_flags,
                                 ac_classzone_idx(ac), available);
                 if (compact_result != COMPACT_SKIPPED &&
diff --git a/mm/migrate.c b/mm/migrate.c
index c77997dc6ed7..ed2f85e61de1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
                 }
                 if (dirty && mapping_cap_account_dirty(mapping)) {
                         __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING);
                         __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY);
-                        __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING);
                 }
         }
         local_irq_enable();
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3c02aa603f5a..8db1db234915 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat)

         return nr_pages;
 }
+#ifdef CONFIG_HIGHMEM
+unsigned long highmem_file_pages;
+#endif

 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
@@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
         int node;
         unsigned long x = 0;
         int i;
+        unsigned long dirtyable = highmem_file_pages;

         for_each_node_state(node, N_HIGH_MEMORY) {
                 for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) {
                         struct zone *z;
-                        unsigned long dirtyable;

                         if (!is_highmem_idx(i))
                                 continue;

                         z = &NODE_DATA(node)->node_zones[i];
-                        dirtyable = zone_page_state(z, NR_FREE_PAGES) +
-                                zone_page_state(z, NR_ZONE_LRU_FILE);
+                        dirtyable += zone_page_state(z, NR_FREE_PAGES);

                         /* watch for underflows */
                         dirtyable -= min(dirtyable, high_wmark_pages(z));
@@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 __inc_node_page_state(page, NR_FILE_DIRTY);
-                __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 __inc_node_page_state(page, NR_DIRTIED);
                 __inc_wb_stat(wb, WB_RECLAIMABLE);
                 __inc_wb_stat(wb, WB_DIRTIED);
@@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
         if (mapping_cap_account_dirty(mapping)) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                 dec_node_page_state(page, NR_FILE_DIRTY);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 dec_wb_stat(wb, WB_RECLAIMABLE);
                 task_io_account_cancelled_write(PAGE_SIZE);
         }
@@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page)
                 if (TestClearPageDirty(page)) {
                         mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY);
                         dec_node_page_state(page, NR_FILE_DIRTY);
-                        dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                         dec_wb_stat(wb, WB_RECLAIMABLE);
                         ret = 1;
                 }
@@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page)
         if (ret) {
                 mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 dec_node_page_state(page, NR_WRITEBACK);
-                dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
                 inc_node_page_state(page, NR_WRITTEN);
         }
         unlock_page_memcg(page);
@@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
         if (!ret) {
                 mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK);
                 inc_node_page_state(page, NR_WRITEBACK);
-                inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
         }
         unlock_page_memcg(page);
         return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 030114f55b0e..ded48e580abc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3445,6 +3445,7 @@
should_reclaim_retry(gfp_t gfp_mask, unsigned order, { struct zone *zone; struct zoneref *z; + pg_data_t *current_pgdat = NULL; /* * Make sure we converge to OOM if we cannot make any progress @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, return false; /* + * Blindly retry allocation requests that cannot use all zones. We do + * not have a reliable and fast means of calculating reclaimable, dirty + * and writeback pages in eligible zones. + */ + if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) + goto out; + + /* * Keep reclaiming pages while there is a chance this will lead somewhere. * If none of the target zones can satisfy our allocation request even * if all reclaimable pages are considered then we are screwed and have @@ -3463,18 +3472,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, ac->nodemask) { unsigned long available; unsigned long reclaimable; + int zid; - available = reclaimable = zone_reclaimable_pages(zone); + if (current_pgdat == zone->zone_pgdat) + continue; + + current_pgdat = zone->zone_pgdat; + available = reclaimable = pgdat_reclaimable_pages(current_pgdat); available -= DIV_ROUND_UP(no_progress_loops * available, MAX_RECLAIM_RETRIES); - available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + + /* Account for all free pages on eligible zones */ + for (zid = 0; zid <= zone_idx(zone); zid++) { + struct zone *acct_zone = &current_pgdat->node_zones[zid]; + + available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES); + } /* * Would the allocation succeed if we reclaimed the whole - * available? + * available? This is approximate because there is no + * accurate count of reclaimable pages per zone. */ - if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), - ac_classzone_idx(ac), alloc_flags, available)) { + for (zid = 0; zid <= zone_idx(zone); zid++) { + struct zone *check_zone = &current_pgdat->node_zones[zid]; + unsigned long estimate; + + estimate = min(check_zone->managed_pages, available); + if (!__zone_watermark_ok(check_zone, order, + min_wmark_pages(check_zone), ac_classzone_idx(ac), + alloc_flags, estimate)) + continue; + /* * If we didn't make any progress and have a lot of * dirty + writeback pages then we should wait for @@ -3484,15 +3513,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, if (!did_some_progress) { unsigned long write_pending; - write_pending = zone_page_state_snapshot(zone, - NR_ZONE_WRITE_PENDING); + write_pending = + node_page_state(current_pgdat, NR_WRITEBACK) + + node_page_state(current_pgdat, NR_FILE_DIRTY); if (2 * write_pending > reclaimable) { congestion_wait(BLK_RW_ASYNC, HZ/10); return true; } } - +out: /* * Memory allocation/reclaim might be called from a WQ * context and the current implementation of the WQ diff --git a/mm/vmscan.c b/mm/vmscan.c index 9eed2d3e05f3..a8ebd1871f16 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc) } #endif -/* - * This misses isolated pages which are not accounted for to save counters. - * As the data only determines if reclaim or compaction continues, it is - * not expected that isolated pages will be a dominating factor.
- */ -unsigned long zone_reclaimable_pages(struct zone *zone) -{ - unsigned long nr; - - nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE); - if (get_nr_swap_pages() > 0) - nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON); - - return nr; -} - unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat) { unsigned long nr; @@ -3167,7 +3151,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) * zone was balanced even under extreme pressure when the * overall node may be congested. */ - for (i = sc.reclaim_idx; i >= 0; i--) { + for (i = sc.reclaim_idx; i >= 0 && !buffer_heads_over_limit; i--) { zone = pgdat->node_zones + i; if (!populated_zone(zone)) continue; diff --git a/mm/vmstat.c b/mm/vmstat.c index 60372f31fee3..7415775faf08 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order) const char * const vmstat_text[] = { /* enum zone_stat_item countes */ "nr_free_pages", - "nr_zone_anon_lru", - "nr_zone_file_lru", - "nr_zone_write_pending", "nr_mlock", "nr_slab_reclaimable", "nr_slab_unreclaimable", -- Mel Gorman SUSE Labs
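An aside on the MAX_RECLAIM_RETRIES cap the changelog relies on: the DIV_ROUND_UP() scaling is what guarantees convergence. The following standalone userspace sketch (illustrative only, not kernel code; the fixed page count is a made-up example) shows the available estimate decaying to zero by the time no_progress_loops reaches the cap, so even an over-generous per-node approximation cannot loop forever:

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES 16	/* same value as the cap in mm/page_alloc.c */
	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	int main(void)
	{
		int loops;

		for (loops = 0; loops <= MAX_RECLAIM_RETRIES; loops++) {
			unsigned long available = 1000000; /* hypothetical reclaimable pages */

			/* same backoff arithmetic as should_reclaim_retry() */
			available -= DIV_ROUND_UP(loops * available,
						  MAX_RECLAIM_RETRIES);
			printf("no_progress_loops=%2d available=%lu\n",
			       loops, available);
		}
		return 0;
	}

Once loops equals MAX_RECLAIM_RETRIES the whole reclaimable estimate has been subtracted; in the kernel only the free-page snapshot then remains, so the watermark checks converge on failure and the OOM path is reached.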
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 8:58 ` Mel Gorman @ 2016-07-06 9:33 ` Mel Gorman 2016-07-07 6:47 ` Minchan Kim 1 sibling, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-06 9:33 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 09:58:50AM +0100, Mel Gorman wrote: > On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote: > > On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote: > > > The number of LRU pages, dirty pages and writeback pages must be accounted > > > for on both zones and nodes because of the reclaim retry logic, compaction > > > retry logic and highmem calculations all depending on per-zone stats. > > > > > > The retry logic is only critical for allocations that can use any zones. > > > > Sorry, I cannot follow this assertion. > > Could you explain? > > > > The patch has been reworked since and I tried clarifying the changelog. > Does this help? > It occurred to me at breakfast that this should be more consistent with the OOM killer on both 32-bit and 64-bit so: diff --git a/mm/compaction.c b/mm/compaction.c index dfe7dafe8e8b..640532831b94 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1448,11 +1448,9 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, struct zoneref *z; pg_data_t *last_pgdat = NULL; -#ifdef CONFIG_HIGHMEM /* Do not retry compaction for zone-constrained allocations */ - if (!is_highmem_idx(ac->high_zoneidx)) + if (ac->high_zoneidx < ZONE_NORMAL) return false; -#endif /* * Make sure at least one zone would pass __compaction_suitable if we continue diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ded48e580abc..194a8162528b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3455,11 +3455,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, return false; /* - * Blindly retry allocation requests that cannot use all zones. We do - * not have a reliable and fast means of calculating reclaimable, dirty - * and writeback pages in eligible zones. + * Blindly retry lowmem allocation requests that are often ignored by + * the OOM killer as we do not have a reliable and fast means of + * calculating reclaimable, dirty and writeback pages in eligible zones. */ - if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) + if (ac->high_zoneidx < ZONE_NORMAL) goto out; /*
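For cross-reference, the lowmem checks in __alloc_pages_may_oom that this aligns with read roughly as follows in a 4.7-era tree (quoted approximately, from the commit 03668b3ceb0c lineage cited in the changelog; the surrounding context may differ slightly):

	/* The OOM killer will not help higher order allocs */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		goto out;
	/* The OOM killer does not needlessly kill tasks for lowmem */
	if (ac->high_zoneidx < ZONE_NORMAL)
		goto out;

Using the same ac->high_zoneidx < ZONE_NORMAL predicate in compaction_zonelist_suitable and should_reclaim_retry keeps all three decision points consistent on both 32-bit and 64-bit, whereas the earlier IS_ENABLED(CONFIG_HIGHMEM) form only distinguished anything on highmem kernels.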
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 8:58 ` Mel Gorman 2016-07-06 9:33 ` Mel Gorman @ 2016-07-07 6:47 ` Minchan Kim 1 sibling, 0 replies; 10+ messages in thread From: Minchan Kim @ 2016-07-07 6:47 UTC (permalink / raw) To: Mel Gorman Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 09:58:50AM +0100, Mel Gorman wrote: > On Wed, Jul 06, 2016 at 09:02:52AM +0900, Minchan Kim wrote: > > On Fri, Jul 01, 2016 at 09:01:39PM +0100, Mel Gorman wrote: > > > The number of LRU pages, dirty pages and writeback pages must be accounted > > > for on both zones and nodes because of the reclaim retry logic, compaction > > > retry logic and highmem calculations all depending on per-zone stats. > > > > > > The retry logic is only critical for allocations that can use any zones. > > > > Sorry, I cannot follow this assertion. > > Could you explain? > > > > The patch has been reworked since and I tried clarifying the changelog. > Does this help? Thanks. It is surely better than the old one but it is still not clear to me. > > --- 8<---- > mm, vmstat: remove zone and node double accounting by approximating retries > > The number of LRU pages, dirty pages and writeback pages must be accounted > for on both zones and nodes because of the reclaim retry logic, compaction > retry logic and highmem calculations all depending on per-zone stats. > > Many lowmem allocations are immune from OOM kill due to a check in > __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit > 03668b3ceb0c ("oom: avoid oom killer for lowmem allocations"). The exception > is costly high-order allocations or allocations that cannot fail. If > __alloc_pages_may_oom avoids OOM-kill for low-order lowmem allocations > then a check in __alloc_pages_slowpath will always retry. If I read the code rightly, __alloc_pages_slowpath will never retry in that case because __alloc_pages_may_oom will return a did_some_progress value of 0, so it would go to warn_alloc_failed unless direct compaction is successful. > > Hence this patch will always retry reclaim for zone-constrained allocations > in should_reclaim_retry. > > As there is no guarantee enough memory can ever be freed to satisfy > compaction, this patch avoids retrying compaction for zone-constrained > allocations. > > In combination, that means that the per-node stats can be used when deciding > whether to continue reclaim using a rough approximation. While it is > possible this will make the wrong decision on occasion, it will not infinite > loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. > > The final step is calculating the number of dirtyable highmem pages. As > that calculation only cares about the global count of file pages in > highmem, this patch uses a global counter instead of per-zone stats, > which is sufficient. > > In combination, this allows the per-zone LRU and dirty state counters to > be removed.
> > Suggested by: Michal Hocko <mhocko@kernel.org> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net> > Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> > > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h > index 9aadcc781857..c68680aac044 100644 > --- a/include/linux/mm_inline.h > +++ b/include/linux/mm_inline.h > @@ -4,6 +4,22 @@ > #include <linux/huge_mm.h> > #include <linux/swap.h> > > +#ifdef CONFIG_HIGHMEM > +extern unsigned long highmem_file_pages; > + > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > + if (is_highmem_idx(zid) && is_file_lru(lru)) > + highmem_file_pages += nr_pages; > +} > +#else > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > +} > +#endif > + > /** > * page_is_file_cache - should the page be on a file LRU or anon LRU? > * @page: the page to test > @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); > - __mod_zone_page_state(&pgdat->node_zones[zid], > - NR_ZONE_LRU_BASE + !!is_file_lru(lru), > - nr_pages); > + acct_highmem_file_pages(zid, lru, nr_pages); > } > > static __always_inline void update_lru_size(struct lruvec *lruvec, > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index bd33e6f1bed0..a3b7f45aac56 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -110,10 +110,6 @@ struct zone_padding { > enum zone_stat_item { > /* First 128 byte cacheline (assuming 64 bit words) */ > NR_FREE_PAGES, > - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ > - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, > - NR_ZONE_LRU_FILE, > - NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ > NR_MLOCK, /* mlock()ed pages found and moved off LRU */ > NR_SLAB_RECLAIMABLE, > NR_SLAB_UNRECLAIMABLE, > diff --git a/include/linux/swap.h b/include/linux/swap.h > index b17cc4830fa6..cc753c639e3d 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, > struct vm_area_struct *vma); > > /* linux/mm/vmscan.c */ > -extern unsigned long zone_reclaimable_pages(struct zone *zone); > extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); > extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > gfp_t gfp_mask, nodemask_t *mask); > diff --git a/mm/compaction.c b/mm/compaction.c > index a0bd85712516..dfe7dafe8e8b 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, > { > struct zone *zone; > struct zoneref *z; > + pg_data_t *last_pgdat = NULL; > + > +#ifdef CONFIG_HIGHMEM > + /* Do not retry compaction for zone-constrained allocations */ > + if (!is_highmem_idx(ac->high_zoneidx)) > + return false; > +#endif > > /* > * Make sure at least one zone would pass __compaction_suitable if we continue > @@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, > unsigned long available; > enum compact_result compact_result; > > + if (last_pgdat == zone->zone_pgdat) > + continue; > + > + /* > + * This over-estimates the number of pages available for > + * reclaim/compaction but walking the LRU would take too > + * long. 
The consequences are that compaction may retry > + * longer than it should for a zone-constrained allocation > + * request. > + */ > + last_pgdat = zone->zone_pgdat; > + available = pgdat_reclaimable_pages(zone->zone_pgdat) / order; > + > /* > * Do not consider all the reclaimable memory because we do not > * want to trash just for a single high order allocation which > * is even not guaranteed to appear even if __compaction_suitable > * is happy about the watermark check. > */ > - available = zone_reclaimable_pages(zone) / order; > available += zone_page_state_snapshot(zone, NR_FREE_PAGES); > + available = min(zone->managed_pages, available); > compact_result = __compaction_suitable(zone, order, alloc_flags, > ac_classzone_idx(ac), available); > if (compact_result != COMPACT_SKIPPED && > diff --git a/mm/migrate.c b/mm/migrate.c > index c77997dc6ed7..ed2f85e61de1 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping, > } > if (dirty && mapping_cap_account_dirty(mapping)) { > __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY); > - __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING); > __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY); > - __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING); > } > } > local_irq_enable(); > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 3c02aa603f5a..8db1db234915 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat) > > return nr_pages; > } > +#ifdef CONFIG_HIGHMEM > +unsigned long highmem_file_pages; > +#endif > > static unsigned long highmem_dirtyable_memory(unsigned long total) > { > @@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) > int node; > unsigned long x = 0; > int i; > + unsigned long dirtyable = highmem_file_pages; > > for_each_node_state(node, N_HIGH_MEMORY) { > for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) { > struct zone *z; > - unsigned long dirtyable; > > if (!is_highmem_idx(i)) > continue; > > z = &NODE_DATA(node)->node_zones[i]; > - dirtyable = zone_page_state(z, NR_FREE_PAGES) + > - zone_page_state(z, NR_ZONE_LRU_FILE); > + dirtyable += zone_page_state(z, NR_FREE_PAGES); > > /* watch for underflows */ > dirtyable -= min(dirtyable, high_wmark_pages(z)); > @@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) > > mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY); > __inc_node_page_state(page, NR_FILE_DIRTY); > - __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > __inc_node_page_state(page, NR_DIRTIED); > __inc_wb_stat(wb, WB_RECLAIMABLE); > __inc_wb_stat(wb, WB_DIRTIED); > @@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, > if (mapping_cap_account_dirty(mapping)) { > mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); > dec_node_page_state(page, NR_FILE_DIRTY); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > dec_wb_stat(wb, WB_RECLAIMABLE); > task_io_account_cancelled_write(PAGE_SIZE); > } > @@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page) > if (TestClearPageDirty(page)) { > mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); > dec_node_page_state(page, NR_FILE_DIRTY); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > dec_wb_stat(wb, WB_RECLAIMABLE); > ret = 1; > } > @@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page) > if (ret) { > 
mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); > dec_node_page_state(page, NR_WRITEBACK); > - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); > inc_node_page_state(page, NR_WRITTEN); > } > unlock_page_memcg(page); > @@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write) > if (!ret) { > mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); > inc_node_page_state(page, NR_WRITEBACK); > - inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); > } > unlock_page_memcg(page); > return ret; > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 030114f55b0e..ded48e580abc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > { > struct zone *zone; > struct zoneref *z; > + pg_data_t *current_pgdat = NULL; > > /* > * Make sure we converge to OOM if we cannot make any progress > @@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > return false; > > /* > + * Blindly retry allocation requests that cannot use all zones. We do > + * not have a reliable and fast means of calculating reclaimable, dirty > + * and writeback pages in eligible zones. > + */ > + if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask))) > + goto out; > + > + /* > * Keep reclaiming pages while there is a chance this will lead somewhere. > * If none of the target zones can satisfy our allocation request even > * if all reclaimable pages are considered then we are screwed and have > @@ -3463,18 +3472,38 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > ac->nodemask) { > unsigned long available; > unsigned long reclaimable; > + int zid; > > - available = reclaimable = zone_reclaimable_pages(zone); > + if (current_pgdat == zone->zone_pgdat) > + continue; > + > + current_pgdat = zone->zone_pgdat; > + available = reclaimable = pgdat_reclaimable_pages(current_pgdat); > available -= DIV_ROUND_UP(no_progress_loops * available, > MAX_RECLAIM_RETRIES); > - available += zone_page_state_snapshot(zone, NR_FREE_PAGES); > + > + /* Account for all free pages on eligible zones */ > + for (zid = 0; zid <= zone_idx(zone); zid++) { > + struct zone *acct_zone = &current_pgdat->node_zones[zid]; > + > + available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES); > + } > > /* > * Would the allocation succeed if we reclaimed the whole > - * available? > + * available? This is approximate because there is no > + * accurate count of reclaimable pages per zone.
> */ > - if (__zone_watermark_ok(zone, order, min_wmark_pages(zone), > - ac_classzone_idx(ac), alloc_flags, available)) { > + for (zid = 0; zid <= zone_idx(zone); zid++) { > + struct zone *check_zone = &current_pgdat->node_zones[zid]; > + unsigned long estimate; > + > + estimate = min(check_zone->managed_pages, available); > + if (!__zone_watermark_ok(check_zone, order, > + min_wmark_pages(check_zone), ac_classzone_idx(ac), > + alloc_flags, estimate)) > + continue; > + > /* > * If we didn't make any progress and have a lot of > * dirty + writeback pages then we should wait for > @@ -3484,15 +3513,16 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order, > if (!did_some_progress) { > unsigned long write_pending; > > - write_pending = zone_page_state_snapshot(zone, > - NR_ZONE_WRITE_PENDING); > + write_pending = > + node_page_state(current_pgdat, NR_WRITEBACK) + > + node_page_state(current_pgdat, NR_FILE_DIRTY); > > if (2 * write_pending > reclaimable) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > return true; > } > } > - > +out: > /* > * Memory allocation/reclaim might be called from a WQ > * context and the current implementation of the WQ > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9eed2d3e05f3..a8ebd1871f16 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc) > } > #endif > > -/* > - * This misses isolated pages which are not accounted for to save counters. > - * As the data only determines if reclaim or compaction continues, it is > - * not expected that isolated pages will be a dominating factor. > - */ > -unsigned long zone_reclaimable_pages(struct zone *zone) > -{ > - unsigned long nr; > - > - nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE); > - if (get_nr_swap_pages() > 0) > - nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON); > - > - return nr; > -} > - > unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat) > { > unsigned long nr; > @@ -3167,7 +3151,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx) > * zone was balanced even under extreme pressure when the > * overall node may be congested. > */ > - for (i = sc.reclaim_idx; i >= 0; i--) { > + for (i = sc.reclaim_idx; i >= 0 && !buffer_heads_over_limit; i--) { > zone = pgdat->node_zones + i; > if (!populated_zone(zone)) > continue; > diff --git a/mm/vmstat.c b/mm/vmstat.c > index 60372f31fee3..7415775faf08 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -921,9 +921,6 @@ int fragmentation_index(struct zone *zone, unsigned int order) > const char * const vmstat_text[] = { > /* enum zone_stat_item countes */ > "nr_free_pages", > - "nr_zone_anon_lru", > - "nr_zone_file_lru", > - "nr_zone_write_pending", > "nr_mlock", > "nr_slab_reclaimable", > "nr_slab_unreclaimable", > > -- > Mel Gorman > SUSE Labs
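To make the disputed control flow concrete, here is a toy model in plain C of the behaviour Minchan describes (a simplification with illustrative names, not the actual 4.7 code): when the OOM path is skipped for a lowmem request, did_some_progress stays 0 and the slowpath falls through to the failure path instead of retrying.

	#include <stdbool.h>
	#include <stdio.h>

	enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM };

	/* Stand-in for __alloc_pages_may_oom: no page is produced here and
	 * progress is only reported when the OOM killer actually runs. */
	static bool may_oom(enum zone_type high_zoneidx,
			    unsigned long *did_some_progress)
	{
		*did_some_progress = 0;
		if (high_zoneidx < ZONE_NORMAL)	/* lowmem: OOM killer skipped */
			return false;
		*did_some_progress = 1;		/* pretend out_of_memory() ran */
		return false;
	}

	int main(void)
	{
		enum zone_type reqs[] = { ZONE_DMA, ZONE_HIGHMEM };
		unsigned long progress;
		int i;

		for (i = 0; i < 2; i++) {
			may_oom(reqs[i], &progress);
			printf("high_zoneidx=%d -> %s\n", reqs[i],
			       progress ? "retry slowpath" : "warn_alloc_failed");
		}
		return 0;
	}

Whether this or the changelog's reading matches the real tree is the open question in the thread; the model only shows why a did_some_progress of 0 forecloses the retry.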
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman 2016-07-06 0:02 ` Minchan Kim @ 2016-07-06 18:12 ` Dave Hansen 2016-07-07 11:26 ` Mel Gorman 1 sibling, 1 reply; 10+ messages in thread From: Dave Hansen @ 2016-07-06 18:12 UTC (permalink / raw) To: Mel Gorman, Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On 07/01/2016 01:01 PM, Mel Gorman wrote: > +#ifdef CONFIG_HIGHMEM > +extern unsigned long highmem_file_pages; > + > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > + int nr_pages) > +{ > + if (is_highmem_idx(zid) && is_file_lru(lru)) > + highmem_file_pages += nr_pages; > +} > +#else Shouldn't highmem_file_pages technically be an atomic_t (or atomic64_t)? We could have highmem on two nodes which take two different LRU locks.
* Re: [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-06 18:12 ` Dave Hansen @ 2016-07-07 11:26 ` Mel Gorman 0 siblings, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-07 11:26 UTC (permalink / raw) To: Dave Hansen Cc: Andrew Morton, Linux-MM, Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML On Wed, Jul 06, 2016 at 11:12:52AM -0700, Dave Hansen wrote: > On 07/01/2016 01:01 PM, Mel Gorman wrote: > > +#ifdef CONFIG_HIGHMEM > > +extern unsigned long highmem_file_pages; > > + > > +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, > > + int nr_pages) > > +{ > > + if (is_highmem_idx(zid) && is_file_lru(lru)) > > + highmem_file_pages += nr_pages; > > +} > > +#else > > Shouldn't highmem_file_pages technically be an atomic_t (or atomic64_t)? > We could have highmem on two nodes which take two different LRU locks. It would require a NUMA machine with highmem or very weird configurations but sure, atomic is safer. -- Mel Gorman SUSE Labs
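A minimal sketch of the atomic variant being agreed on here (illustrative only; whether atomic_t, atomic_long_t or a percpu counter is the right type, and the matching read side in highmem_dirtyable_memory, are left open at this point in the thread):

	#ifdef CONFIG_HIGHMEM
	extern atomic_t highmem_file_pages;

	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
						   int nr_pages)
	{
		/* atomic update: safe under different per-node LRU locks */
		if (is_highmem_idx(zid) && is_file_lru(lru))
			atomic_add(nr_pages, &highmem_file_pages);
	}
	#else
	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
						   int nr_pages)
	{
	}
	#endif

with readers using atomic_read(&highmem_file_pages) rather than touching the counter directly, so concurrent updates from two nodes cannot be lost.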
* [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 @ 2016-07-01 15:37 Mel Gorman 2016-07-01 15:37 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman 0 siblings, 1 reply; 10+ messages in thread From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman Previous releases double accounted LRU stats on the zone and the node because it was required by should_reclaim_retry. The last patch in the series removes the double accounting. It's not integrated with the series as reviewers may not like the solution. If not, it can be safely dropped without a major impact to the results. Changelog since v7 o Rebase onto current mmots o Avoid double accounting of stats in node and zone o Kswapd will avoid more reclaim if an eligible zone is available o Remove some duplications of sc->reclaim_idx and classzone_idx o Print per-node stats in zoneinfo Changelog since v6 o Correct reclaim_idx when direct reclaiming for memcg o Also account LRU pages per zone for compaction/reclaim o Add page_pgdat helper with more efficient lookup o Init pgdat LRU lock only once o Slight optimisation to wake_all_kswapds o Always wake kcompactd when kswapd is going to sleep o Rebase to mmotm as of June 15th, 2016 Changelog since v5 o Rebase and adjust to changes Changelog since v4 o Rebase on top of v3 of page allocator optimisation series Changelog since v3 o Rebase on top of the page allocator optimisation series o Remove RFC tag This is the latest version of a series that moves LRUs from the zones to the node that is based upon 4.7-rc4 with Andrew's tree applied. While this is a current rebase, the test results were based on mmotm as of June 23rd. Conceptually, this series is simple but there are a lot of details. Some of the broad motivations for this are: 1. The residency of a page partially depends on what zone the page was allocated from. This is partially combatted by the fair zone allocation policy but that is a partial solution that introduces overhead in the page allocator paths. 2. Currently, reclaim on node 0 behaves slightly differently to node 1. For example, direct reclaim scans in zonelist order and reclaims even if the zone is over the high watermark regardless of the age of pages in that LRU. Kswapd on the other hand starts reclaim on the highest unbalanced zone. A difference in the distribution of file/anon pages due to when they were allocated can result in a difference in aging. While the fair zone allocation policy mitigates some of the problems here, the page reclaim results on a multi-zone node will always be different to those on a single-zone node. 3. kswapd and the page allocator scan zones in the opposite order to avoid interfering with each other. This mitigates the page allocator using pages that were allocated very recently in the ideal case but it's sensitive to timing. When kswapd is allocating from lower zones then it's great but during the rebalancing of the highest zone, the page allocator and kswapd interfere with each other. It's worse if the highest zone is small and difficult to balance. 4. slab shrinkers are node-based which makes it harder to identify the exact relationship between slab reclaim and LRU reclaim.
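The common thread in these four points is structural: the series moves the LRU lists, and the lru_lock that protects them, from struct zone to the node's struct pglist_data. In outline (a sketch of the end state, not quoted from the patches):

	/* Before: page aging state is duplicated per zone */
	struct zone {
		/* ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;
		/* ... */
	};

	/* After: one set of LRUs, and one lock, per NUMA node */
	typedef struct pglist_data {
		/* ... */
		spinlock_t	lru_lock;
		struct lruvec	lruvec;
		/* ... */
	} pg_data_t;

With aging decided per node, the zone that backed an allocation no longer influences how long a page stays resident, which is what makes the fair zone allocation policy removable.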
The reason we have zone-based reclaim is that we used to have large highmem zones in common configurations and it was necessary to quickly find ZONE_NORMAL pages for reclaim. Today, this is much less of a concern as machines with lots of memory will (or should) use 64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are rare. Machines that do use highmem should have relatively lower highmem:lowmem ratios than the ones we worried about in the past. Conceptually, moving to node LRUs should be easier to understand. The page allocator plays fewer tricks to game reclaim and reclaim behaves similarly on all nodes. The series has been tested on a 16 core UMA machine and a 2-socket 48 core NUMA machine. The UMA results are presented in most cases as the NUMA machine behaved similarly. pagealloc --------- This is a microbenchmark that shows the benefit of removing the fair zone allocation policy. It was tested up to order-4 but only orders 0 and 1 are shown as the other orders were comparable. 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min total-odr0-1 490.00 ( 0.00%) 463.00 ( 5.51%) Min total-odr0-2 349.00 ( 0.00%) 325.00 ( 6.88%) Min total-odr0-4 288.00 ( 0.00%) 272.00 ( 5.56%) Min total-odr0-8 250.00 ( 0.00%) 235.00 ( 6.00%) Min total-odr0-16 234.00 ( 0.00%) 222.00 ( 5.13%) Min total-odr0-32 223.00 ( 0.00%) 205.00 ( 8.07%) Min total-odr0-64 217.00 ( 0.00%) 202.00 ( 6.91%) Min total-odr0-128 214.00 ( 0.00%) 207.00 ( 3.27%) Min total-odr0-256 242.00 ( 0.00%) 242.00 ( 0.00%) Min total-odr0-512 272.00 ( 0.00%) 265.00 ( 2.57%) Min total-odr0-1024 290.00 ( 0.00%) 283.00 ( 2.41%) Min total-odr0-2048 302.00 ( 0.00%) 296.00 ( 1.99%) Min total-odr0-4096 311.00 ( 0.00%) 306.00 ( 1.61%) Min total-odr0-8192 314.00 ( 0.00%) 309.00 ( 1.59%) Min total-odr0-16384 315.00 ( 0.00%) 309.00 ( 1.90%) Min total-odr1-1 741.00 ( 0.00%) 716.00 ( 3.37%) Min total-odr1-2 565.00 ( 0.00%) 524.00 ( 7.26%) Min total-odr1-4 457.00 ( 0.00%) 427.00 ( 6.56%) Min total-odr1-8 408.00 ( 0.00%) 371.00 ( 9.07%) Min total-odr1-16 383.00 ( 0.00%) 344.00 ( 10.18%) Min total-odr1-32 378.00 ( 0.00%) 334.00 ( 11.64%) Min total-odr1-64 383.00 ( 0.00%) 334.00 ( 12.79%) Min total-odr1-128 376.00 ( 0.00%) 342.00 ( 9.04%) Min total-odr1-256 381.00 ( 0.00%) 343.00 ( 9.97%) Min total-odr1-512 388.00 ( 0.00%) 349.00 ( 10.05%) Min total-odr1-1024 386.00 ( 0.00%) 356.00 ( 7.77%) Min total-odr1-2048 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-4096 389.00 ( 0.00%) 362.00 ( 6.94%) Min total-odr1-8192 389.00 ( 0.00%) 362.00 ( 6.94%) This shows a steady improvement throughout. The primary benefit is from reduced system CPU usage which is obvious from the overall times: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 User 191.39 191.61 System 2651.24 2504.48 Elapsed 2904.40 2757.01 The vmstats also showed that the fair zone allocation policy was definitely removed as can be seen here: 4.7.0-rc3 4.7.0-rc3 mmotm-20160623 nodelru-v8 DMA32 allocs 28794771816 0 Normal allocs 48432582848 77227356392 Movable allocs 0 0 tiobench on ext4 ---------------- tiobench is a benchmark that artificially benefits if old pages remain resident while new pages get reclaimed. The fair zone allocation policy mitigates this problem so pages age fairly. While the benchmark has problems, it is important that tiobench performance remains constant as it implies that page aging problems that the fair zone allocation policy fixes are not re-introduced.
4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min PotentialReadSpeed 89.65 ( 0.00%) 90.34 ( 0.77%) Min SeqRead-MB/sec-1 82.68 ( 0.00%) 83.13 ( 0.54%) Min SeqRead-MB/sec-2 72.76 ( 0.00%) 72.15 ( -0.84%) Min SeqRead-MB/sec-4 75.13 ( 0.00%) 74.23 ( -1.20%) Min SeqRead-MB/sec-8 64.91 ( 0.00%) 65.25 ( 0.52%) Min SeqRead-MB/sec-16 62.24 ( 0.00%) 62.76 ( 0.84%) Min RandRead-MB/sec-1 0.88 ( 0.00%) 0.95 ( 7.95%) Min RandRead-MB/sec-2 0.95 ( 0.00%) 0.94 ( -1.05%) Min RandRead-MB/sec-4 1.43 ( 0.00%) 1.46 ( 2.10%) Min RandRead-MB/sec-8 1.61 ( 0.00%) 1.58 ( -1.86%) Min RandRead-MB/sec-16 1.80 ( 0.00%) 1.93 ( 7.22%) Min SeqWrite-MB/sec-1 76.41 ( 0.00%) 78.84 ( 3.18%) Min SeqWrite-MB/sec-2 74.11 ( 0.00%) 73.35 ( -1.03%) Min SeqWrite-MB/sec-4 80.05 ( 0.00%) 78.69 ( -1.70%) Min SeqWrite-MB/sec-8 72.88 ( 0.00%) 71.38 ( -2.06%) Min SeqWrite-MB/sec-16 75.91 ( 0.00%) 75.81 ( -0.13%) Min RandWrite-MB/sec-1 1.18 ( 0.00%) 1.12 ( -5.08%) Min RandWrite-MB/sec-2 1.02 ( 0.00%) 1.02 ( 0.00%) Min RandWrite-MB/sec-4 1.05 ( 0.00%) 0.99 ( -5.71%) Min RandWrite-MB/sec-8 0.89 ( 0.00%) 0.92 ( 3.37%) Min RandWrite-MB/sec-16 0.92 ( 0.00%) 0.89 ( -3.26%) This shows that the series has little or no impact on tiobench which is desirable. It indicates that the fair zone allocation policy was removed in a manner that didn't reintroduce one class of page aging bug. There were only minor differences in overall reclaim activity: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Minor Faults 645838 644036 Major Faults 573 593 Swap Ins 0 0 Swap Outs 0 0 Allocation stalls 24 0 DMA allocs 0 0 DMA32 allocs 46041453 44154171 Normal allocs 78053072 79865782 Movable allocs 0 0 Direct pages scanned 10969 54504 Kswapd pages scanned 93375144 93250583 Kswapd pages reclaimed 93372243 93247714 Direct pages reclaimed 10969 54504 Kswapd efficiency 99% 99% Kswapd velocity 13741.015 13711.950 Direct efficiency 100% 100% Direct velocity 1.614 8.014 Percentage direct scans 0% 0% Zone normal velocity 8641.875 13719.964 Zone dma32 velocity 5100.754 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 0.000 0.000 Page writes file 0 0 Page writes anon 0 0 Page reclaim immediate 37 54 kswapd activity was roughly comparable. There were differences in direct reclaim activity but negligible in the context of the overall workload (velocity of 8 pages per second with the patches applied, 1.6 pages per second in the baseline kernel). pgbench read-only large configuration on ext4 --------------------------------------------- pgbench is a database benchmark that can be sensitive to page reclaim decisions. This also checks if removing the fair zone allocation policy is safe. pgbench Transactions 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Hmean 1 188.26 ( 0.00%) 189.78 ( 0.81%) Hmean 5 330.66 ( 0.00%) 328.69 ( -0.59%) Hmean 12 370.32 ( 0.00%) 380.72 ( 2.81%) Hmean 21 368.89 ( 0.00%) 369.00 ( 0.03%) Hmean 30 382.14 ( 0.00%) 360.89 ( -5.56%) Hmean 32 428.87 ( 0.00%) 432.96 ( 0.95%) Negligible differences again. As with tiobench, overall reclaim activity was comparable. bonnie++ on ext4 ---------------- No interesting performance difference, negligible differences on reclaim stats. paralleldd on ext4 ------------------ This workload uses varying numbers of dd instances to read large amounts of data from disk.
4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 Amean Elapsd-1 181.57 ( 0.00%) 179.63 ( 1.07%) Amean Elapsd-3 188.29 ( 0.00%) 183.68 ( 2.45%) Amean Elapsd-5 188.02 ( 0.00%) 181.73 ( 3.35%) Amean Elapsd-7 186.07 ( 0.00%) 184.11 ( 1.05%) Amean Elapsd-12 188.16 ( 0.00%) 183.51 ( 2.47%) Amean Elapsd-16 189.03 ( 0.00%) 181.27 ( 4.10%) 4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 User 1439.23 1433.37 System 8332.31 8216.01 Elapsed 3619.80 3532.69 There is a slight gain in performance, some of which is from the reduced system CPU usage. There are minor differences in reclaim activity but nothing significant: 4.7.0-rc3 4.7.0-rc3 mmotm-20160615 nodelru-v7r17 Minor Faults 362486 358215 Major Faults 1143 1113 Swap Ins 26 0 Swap Outs 2920 482 DMA allocs 0 0 DMA32 allocs 31568814 28598887 Normal allocs 46539922 49514444 Movable allocs 0 0 Allocation stalls 0 0 Direct pages scanned 0 0 Kswapd pages scanned 40886878 40849710 Kswapd pages reclaimed 40869923 40835207 Direct pages reclaimed 0 0 Kswapd efficiency 99% 99% Kswapd velocity 11295.342 11563.344 Direct efficiency 100% 100% Direct velocity 0.000 0.000 Slabs scanned 131673 126099 Direct inode steals 57 60 Kswapd inode steals 762 18 It basically shows that kswapd was active at roughly the same rate in both kernels. There was also comparable slab scanning activity and direct reclaim was avoided in both cases. There appears to be a large difference in numbers of inodes reclaimed but the workload has few active inodes and is likely a timing artifact. It's interesting to note that the node-lru did not swap in any pages but given the low swap activity, it's unlikely to be significant. stutter ------- stutter simulates a simple workload. One part uses a lot of anonymous memory, a second measures mmap latency and a third copies a large file. The primary metric is checking for mmap latency. stutter 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Min mmap 16.6283 ( 0.00%) 16.1394 ( 2.94%) 1st-qrtle mmap 54.7570 ( 0.00%) 55.2975 ( -0.99%) 2nd-qrtle mmap 57.3163 ( 0.00%) 57.5230 ( -0.36%) 3rd-qrtle mmap 58.9976 ( 0.00%) 58.0537 ( 1.60%) Max-90% mmap 59.7433 ( 0.00%) 58.3910 ( 2.26%) Max-93% mmap 60.1298 ( 0.00%) 58.4801 ( 2.74%) Max-95% mmap 73.4112 ( 0.00%) 58.5537 ( 20.24%) Max-99% mmap 92.8542 ( 0.00%) 58.9673 ( 36.49%) Max mmap 1440.6569 ( 0.00%) 137.6875 ( 90.44%) Mean mmap 59.3493 ( 0.00%) 55.5153 ( 6.46%) Best99%Mean mmap 57.2121 ( 0.00%) 55.4194 ( 3.13%) Best95%Mean mmap 55.9113 ( 0.00%) 55.2813 ( 1.13%) Best90%Mean mmap 55.6199 ( 0.00%) 55.1044 ( 0.93%) Best50%Mean mmap 53.2183 ( 0.00%) 52.8330 ( 0.72%) Best10%Mean mmap 45.9842 ( 0.00%) 42.3740 ( 7.85%) Best5%Mean mmap 43.2256 ( 0.00%) 38.8660 ( 10.09%) Best1%Mean mmap 32.9388 ( 0.00%) 27.7577 ( 15.73%) This shows a number of improvements with the worst-case outlier greatly improved.
Some of the vmstats are interesting: 4.7.0-rc4 4.7.0-rc4 mmotm-20160623 nodelru-v8 Swap Ins 163 239 Swap Outs 0 0 Allocation stalls 2603 0 DMA allocs 0 0 DMA32 allocs 618719206 1303037965 Normal allocs 891235743 229914091 Movable allocs 0 0 Direct pages scanned 216787 3173 Kswapd pages scanned 50719775 41732250 Kswapd pages reclaimed 41541765 41731168 Direct pages reclaimed 209159 3173 Kswapd efficiency 81% 99% Kswapd velocity 16859.554 14231.043 Direct efficiency 96% 100% Direct velocity 72.061 1.082 Percentage direct scans 0% 0% Zone normal velocity 8431.777 14232.125 Zone dma32 velocity 8499.838 0.000 Zone dma velocity 0.000 0.000 Page writes by reclaim 6215049.000 0.000 Page writes file 6215049 0 Page writes anon 0 0 Page reclaim immediate 70673 143 Sector Reads 81940800 81489388 Sector Writes 100158984 99161860 Page rescued immediate 0 0 Slabs scanned 1366954 21196 While this is not guaranteed in all cases, this particular test showed a large reduction in direct reclaim activity. It's also worth noting that no page writes were issued from reclaim context. This series is not without its hazards. There are at least three areas that I'm concerned with even though I could not reproduce any problems in those areas. 1. Reclaim/compaction is going to be affected because the amount of reclaim is no longer targeted at a specific zone. Compaction works on a per-zone basis so there is no guarantee that reclaiming a few THPs' worth of pages will have a positive impact on compaction success rates. 2. The Slab/LRU reclaim ratio is affected because the frequency at which the shrinkers are called is now different. This may or may not be a problem but if it is, it'll be because shrinkers are not called enough and some balancing is required. 3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are distributed between zones and the fair zone allocation policy used to do something very similar for anon. The distribution is now different, but not necessarily in any way that matters; it's still worth bearing in mind.
Documentation/cgroup-v1/memcg_test.txt | 4 +- Documentation/cgroup-v1/memory.txt | 4 +- arch/s390/appldata/appldata_mem.c | 2 +- arch/tile/mm/pgtable.c | 18 +- drivers/base/node.c | 77 ++- drivers/staging/android/lowmemorykiller.c | 12 +- drivers/staging/lustre/lustre/osc/osc_cache.c | 6 +- fs/fs-writeback.c | 4 +- fs/fuse/file.c | 8 +- fs/nfs/internal.h | 2 +- fs/nfs/write.c | 2 +- fs/proc/meminfo.c | 20 +- include/linux/backing-dev.h | 2 +- include/linux/memcontrol.h | 61 +- include/linux/mm.h | 5 + include/linux/mm_inline.h | 35 +- include/linux/mm_types.h | 2 +- include/linux/mmzone.h | 155 +++-- include/linux/swap.h | 24 +- include/linux/topology.h | 2 +- include/linux/vm_event_item.h | 14 +- include/linux/vmstat.h | 111 +++- include/linux/writeback.h | 2 +- include/trace/events/vmscan.h | 63 +- include/trace/events/writeback.h | 10 +- kernel/power/snapshot.c | 10 +- kernel/sysctl.c | 4 +- mm/backing-dev.c | 15 +- mm/compaction.c | 50 +- mm/filemap.c | 16 +- mm/huge_memory.c | 12 +- mm/internal.h | 11 +- mm/khugepaged.c | 14 +- mm/memcontrol.c | 215 +++---- mm/memory-failure.c | 4 +- mm/memory_hotplug.c | 7 +- mm/mempolicy.c | 2 +- mm/migrate.c | 35 +- mm/mlock.c | 12 +- mm/page-writeback.c | 123 ++-- mm/page_alloc.c | 371 +++++------ mm/page_idle.c | 4 +- mm/rmap.c | 26 +- mm/shmem.c | 14 +- mm/swap.c | 64 +- mm/swap_state.c | 4 +- mm/util.c | 4 +- mm/vmscan.c | 879 +++++++++++++------------- mm/vmstat.c | 398 +++++++++--- mm/workingset.c | 54 +- 50 files changed, 1674 insertions(+), 1319 deletions(-) -- 2.6.4
* [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries 2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman @ 2016-07-01 15:37 ` Mel Gorman 0 siblings, 0 replies; 10+ messages in thread From: Mel Gorman @ 2016-07-01 15:37 UTC (permalink / raw) To: Andrew Morton, Linux-MM Cc: Rik van Riel, Vlastimil Babka, Johannes Weiner, LKML, Mel Gorman The number of LRU pages, dirty pages and writeback pages must be accounted for on both zones and nodes because of the reclaim retry logic, compaction retry logic and highmem calculations all depending on per-zone stats. The retry logic is only critical for allocations that can use any zones. Hence this patch will not retry reclaim or compaction for such allocations. This should not be a problem for reclaim as zone-constrained allocations are immune from OOM kill. For retries, a very rough approximation is made whether to retry or not. While it is possible this will make the wrong decision on occasion, it will not infinite loop as the number of reclaim attempts is capped by MAX_RECLAIM_RETRIES. The highmem calculations only care about the global count of file pages in highmem. Hence, a global counter is used instead of per-zone stats. With this, the per-zone double accounting disappears. Suggested by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> --- include/linux/mm_inline.h | 20 +++++++++++-- include/linux/mmzone.h | 4 --- include/linux/swap.h | 1 - mm/compaction.c | 22 ++++++++++++++- mm/migrate.c | 2 -- mm/page-writeback.c | 13 ++++----- mm/page_alloc.c | 71 ++++++++++++++++++++++++++++++++--------------- mm/vmscan.c | 16 ----------- mm/vmstat.c | 3 -- 9 files changed, 92 insertions(+), 60 deletions(-) diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h index 9aadcc781857..c68680aac044 100644 --- a/include/linux/mm_inline.h +++ b/include/linux/mm_inline.h @@ -4,6 +4,22 @@ #include <linux/huge_mm.h> #include <linux/swap.h> +#ifdef CONFIG_HIGHMEM +extern unsigned long highmem_file_pages; + +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ + if (is_highmem_idx(zid) && is_file_lru(lru)) + highmem_file_pages += nr_pages; +} +#else +static inline void acct_highmem_file_pages(int zid, enum lru_list lru, + int nr_pages) +{ +} +#endif + /** * page_is_file_cache - should the page be on a file LRU or anon LRU? 
* @page: the page to test @@ -29,9 +45,7 @@ static __always_inline void __update_lru_size(struct lruvec *lruvec, struct pglist_data *pgdat = lruvec_pgdat(lruvec); __mod_node_page_state(pgdat, NR_LRU_BASE + lru, nr_pages); - __mod_zone_page_state(&pgdat->node_zones[zid], - NR_ZONE_LRU_BASE + !!is_file_lru(lru), - nr_pages); + acct_highmem_file_pages(zid, lru, nr_pages); } static __always_inline void update_lru_size(struct lruvec *lruvec, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index facee6b83440..9268528c20c0 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -110,10 +110,6 @@ struct zone_padding { enum zone_stat_item { /* First 128 byte cacheline (assuming 64 bit words) */ NR_FREE_PAGES, - NR_ZONE_LRU_BASE, /* Used only for compaction and reclaim retry */ - NR_ZONE_LRU_ANON = NR_ZONE_LRU_BASE, - NR_ZONE_LRU_FILE, - NR_ZONE_WRITE_PENDING, /* Count of dirty, writeback and unstable pages */ NR_MLOCK, /* mlock()ed pages found and moved off LRU */ NR_SLAB_RECLAIMABLE, NR_SLAB_UNRECLAIMABLE, diff --git a/include/linux/swap.h b/include/linux/swap.h index b17cc4830fa6..cc753c639e3d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -307,7 +307,6 @@ extern void lru_cache_add_active_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ -extern unsigned long zone_reclaimable_pages(struct zone *zone); extern unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat); extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *mask); diff --git a/mm/compaction.c b/mm/compaction.c index a0bd85712516..dfe7dafe8e8b 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1446,6 +1446,13 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, { struct zone *zone; struct zoneref *z; + pg_data_t *last_pgdat = NULL; + +#ifdef CONFIG_HIGHMEM + /* Do not retry compaction for zone-constrained allocations */ + if (!is_highmem_idx(ac->high_zoneidx)) + return false; +#endif /* * Make sure at least one zone would pass __compaction_suitable if we continue @@ -1456,14 +1463,27 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order, unsigned long available; enum compact_result compact_result; + if (last_pgdat == zone->zone_pgdat) + continue; + + /* + * This over-estimates the number of pages available for + * reclaim/compaction but walking the LRU would take too + * long. The consequences are that compaction may retry + * longer than it should for a zone-constrained allocation + * request. + */ + last_pgdat = zone->zone_pgdat; + available = pgdat_reclaimable_pages(zone->zone_pgdat) / order; + /* * Do not consider all the reclaimable memory because we do not * want to trash just for a single high order allocation which * is even not guaranteed to appear even if __compaction_suitable * is happy about the watermark check. 
*/ - available = zone_reclaimable_pages(zone) / order; available += zone_page_state_snapshot(zone, NR_FREE_PAGES); + available = min(zone->managed_pages, available); compact_result = __compaction_suitable(zone, order, alloc_flags, ac_classzone_idx(ac), available); if (compact_result != COMPACT_SKIPPED && diff --git a/mm/migrate.c b/mm/migrate.c index c77997dc6ed7..ed2f85e61de1 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -513,9 +513,7 @@ int migrate_page_move_mapping(struct address_space *mapping, } if (dirty && mapping_cap_account_dirty(mapping)) { __dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY); - __dec_zone_state(oldzone, NR_ZONE_WRITE_PENDING); __inc_node_state(newzone->zone_pgdat, NR_FILE_DIRTY); - __dec_zone_state(newzone, NR_ZONE_WRITE_PENDING); } } local_irq_enable(); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 3c02aa603f5a..8db1db234915 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -299,6 +299,9 @@ static unsigned long node_dirtyable_memory(struct pglist_data *pgdat) return nr_pages; } +#ifdef CONFIG_HIGHMEM +unsigned long highmem_file_pages; +#endif static unsigned long highmem_dirtyable_memory(unsigned long total) { @@ -306,18 +309,17 @@ static unsigned long highmem_dirtyable_memory(unsigned long total) int node; unsigned long x = 0; int i; + unsigned long dirtyable = highmem_file_pages; for_each_node_state(node, N_HIGH_MEMORY) { for (i = ZONE_NORMAL + 1; i < MAX_NR_ZONES; i++) { struct zone *z; - unsigned long dirtyable; if (!is_highmem_idx(i)) continue; z = &NODE_DATA(node)->node_zones[i]; - dirtyable = zone_page_state(z, NR_FREE_PAGES) + - zone_page_state(z, NR_ZONE_LRU_FILE); + dirtyable += zone_page_state(z, NR_FREE_PAGES); /* watch for underflows */ dirtyable -= min(dirtyable, high_wmark_pages(z)); @@ -2460,7 +2462,6 @@ void account_page_dirtied(struct page *page, struct address_space *mapping) mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_DIRTY); __inc_node_page_state(page, NR_FILE_DIRTY); - __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); __inc_node_page_state(page, NR_DIRTIED); __inc_wb_stat(wb, WB_RECLAIMABLE); __inc_wb_stat(wb, WB_DIRTIED); @@ -2482,7 +2483,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping, if (mapping_cap_account_dirty(mapping)) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); dec_node_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); dec_wb_stat(wb, WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_SIZE); } @@ -2739,7 +2739,6 @@ int clear_page_dirty_for_io(struct page *page) if (TestClearPageDirty(page)) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_DIRTY); dec_node_page_state(page, NR_FILE_DIRTY); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); dec_wb_stat(wb, WB_RECLAIMABLE); ret = 1; } @@ -2786,7 +2785,6 @@ int test_clear_page_writeback(struct page *page) if (ret) { mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); dec_node_page_state(page, NR_WRITEBACK); - dec_zone_page_state(page, NR_ZONE_WRITE_PENDING); inc_node_page_state(page, NR_WRITTEN); } unlock_page_memcg(page); @@ -2841,7 +2839,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write) if (!ret) { mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_WRITEBACK); inc_node_page_state(page, NR_WRITEBACK); - inc_zone_page_state(page, NR_ZONE_WRITE_PENDING); } unlock_page_memcg(page); return ret; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d3eb15c35bb1..9581185cb31a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3445,6 +3445,7 @@ 
@@ -3445,6 +3445,7 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 {
 	struct zone *zone;
 	struct zoneref *z;
+	pg_data_t *current_pgdat = NULL;
 
 	/*
 	 * Make sure we converge to OOM if we cannot make any progress
@@ -3454,6 +3455,14 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 		return false;
 
 	/*
+	 * Blindly retry allocation requests that cannot use all zones. We do
+	 * not have a reliable and fast means of calculating reclaimable, dirty
+	 * and writeback pages in eligible zones.
+	 */
+	if (IS_ENABLED(CONFIG_HIGHMEM) && !is_highmem_idx(gfp_zone(gfp_mask)))
+		goto out;
+
+	/*
 	 * Keep reclaiming pages while there is a chance this will lead somewhere.
 	 * If none of the target zones can satisfy our allocation request even
 	 * if all reclaimable pages are considered then we are screwed and have
@@ -3463,36 +3472,54 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
 					ac->nodemask) {
 		unsigned long available;
 		unsigned long reclaimable;
+		unsigned long write_pending = 0;
+		int zid;
+
+		if (current_pgdat == zone->zone_pgdat)
+			continue;
 
-		available = reclaimable = zone_reclaimable_pages(zone);
+		current_pgdat = zone->zone_pgdat;
+		available = reclaimable = pgdat_reclaimable_pages(current_pgdat);
 		available -= DIV_ROUND_UP(no_progress_loops * available,
 					  MAX_RECLAIM_RETRIES);
-		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
+		write_pending = node_page_state(current_pgdat, NR_WRITEBACK) +
+				node_page_state(current_pgdat, NR_FILE_DIRTY);
 
-		/*
-		 * Would the allocation succeed if we reclaimed the whole
-		 * available?
-		 */
-		if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
-				ac_classzone_idx(ac), alloc_flags, available)) {
-			/*
-			 * If we didn't make any progress and have a lot of
-			 * dirty + writeback pages then we should wait for
-			 * an IO to complete to slow down the reclaim and
-			 * prevent from pre mature OOM
-			 */
-			if (!did_some_progress) {
-				unsigned long write_pending;
+		/* Account for all free pages on eligible zones */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *acct_zone = &current_pgdat->node_zones[zid];
 
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
+			available += zone_page_state_snapshot(acct_zone, NR_FREE_PAGES);
+		}
 
-				if (2 * write_pending > reclaimable) {
-					congestion_wait(BLK_RW_ASYNC, HZ/10);
-					return true;
-				}
+		/*
+		 * If we didn't make any progress and have a lot of
+		 * dirty + writeback pages then we should wait for an IO to
+		 * complete to slow down the reclaim and prevent from premature
+		 * OOM.
+		 */
+		if (!did_some_progress) {
+			if (2 * write_pending > reclaimable) {
+				congestion_wait(BLK_RW_ASYNC, HZ/10);
+				return true;
 			}
+		}
 
+		/*
+		 * Would the allocation succeed if we reclaimed the whole
+		 * available? This is approximate because there is no
+		 * accurate count of reclaimable pages per zone.
+		 */
+		for (zid = 0; zid <= zone_idx(zone); zid++) {
+			struct zone *check_zone = &current_pgdat->node_zones[zid];
+			unsigned long estimate;
+
+			estimate = min(check_zone->managed_pages, available);
+			if (__zone_watermark_ok(check_zone, order,
+					min_wmark_pages(check_zone), ac_classzone_idx(ac),
+					alloc_flags, estimate))
+				goto out;
+		}
+	}
+
+	return false;
+
+out:
 	/*
 	 * Memory allocation/reclaim might be called from a WQ
 	 * context and the current implementation of the WQ
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151c30dd27e2..c538a8cab43b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -194,22 +194,6 @@ static bool sane_reclaim(struct scan_control *sc)
 }
 #endif
 
-/*
- * This misses isolated pages which are not accounted for to save counters.
- * As the data only determines if reclaim or compaction continues, it is
- * not expected that isolated pages will be a dominating factor.
- */
-unsigned long zone_reclaimable_pages(struct zone *zone)
-{
-	unsigned long nr;
-
-	nr = zone_page_state_snapshot(zone, NR_ZONE_LRU_FILE);
-	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state_snapshot(zone, NR_ZONE_LRU_ANON);
-
-	return nr;
-}
-
 unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ce09be63e8c7..524c082072be 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -908,9 +908,6 @@ int fragmentation_index(struct zone *zone, unsigned int order)
 const char * const vmstat_text[] = {
 	/* enum zone_stat_item countes */
 	"nr_free_pages",
-	"nr_zone_anon_lru",
-	"nr_zone_file_lru",
-	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
 	"nr_slab_unreclaimable",
-- 
2.6.4
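Two details are worth spelling out for readers who only have the hunks above.

First, the mm_inline.h hunk calls acct_highmem_file_pages() and mm/page-writeback.c gains a bare highmem_file_pages counter, but the helper's definition falls in a part of the patch not quoted here. A minimal sketch of what such a helper could look like, assuming it does nothing more than feed the global counter consumed by highmem_dirtyable_memory() — a reconstruction for illustration, not the patch's verbatim code:

	#ifdef CONFIG_HIGHMEM
	extern unsigned long highmem_file_pages;

	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
							int nr_pages)
	{
		/* Only file pages on highmem zones feed the dirty limits */
		if (is_highmem_idx(zid) && is_file_lru(lru))
			highmem_file_pages += nr_pages;
	}
	#else
	static inline void acct_highmem_file_pages(int zid, enum lru_list lru,
							int nr_pages)
	{
	}
	#endif

Second, the retry discount in should_reclaim_retry() is what bounds the approximation: each no-progress pass subtracts no_progress_loops/MAX_RECLAIM_RETRIES of the reclaimable estimate, so once no_progress_loops reaches MAX_RECLAIM_RETRIES (16 in mm/page_alloc.c at the time of this series) the estimate collapses to the free pages alone and the decision degenerates to a plain watermark check — the loop cannot run forever. A stand-alone user-space sketch of the arithmetic:

	#include <stdio.h>

	#define MAX_RECLAIM_RETRIES 16
	#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

	int main(void)
	{
		/* stand-in for pgdat_reclaimable_pages() */
		unsigned long reclaimable = 100000;
		int loops;

		for (loops = 1; loops <= MAX_RECLAIM_RETRIES; loops++) {
			unsigned long available = reclaimable;

			/* the same discount should_reclaim_retry() applies */
			available -= DIV_ROUND_UP(loops * available,
						  MAX_RECLAIM_RETRIES);
			printf("loop %2d: assume %lu of %lu reclaimable\n",
			       loops, available, reclaimable);
		}
		return 0;	/* by loop 16 the estimate is 0 */
	}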
end of thread, other threads:[~2016-07-07 11:26 UTC | newest]

Thread overview: 10+ messages
     [not found] <00f601d1d691$d790ad40$86b207c0$@alibaba-inc.com>
2016-07-05  8:07 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Hillf Danton
2016-07-05 10:55   ` Mel Gorman
2016-07-01 20:01 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 20:01 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman
2016-07-06  0:02   ` Minchan Kim
2016-07-06  8:58     ` Mel Gorman
2016-07-06  9:33       ` Mel Gorman
2016-07-07  6:47         ` Minchan Kim
2016-07-06 18:12   ` Dave Hansen
2016-07-07 11:26     ` Mel Gorman

-- strict thread matches above, loose matches on Subject: below --
2016-07-01 15:37 [PATCH 00/31] Move LRU page reclaim from zones to nodes v8 Mel Gorman
2016-07-01 15:37 ` [PATCH 31/31] mm, vmstat: Remove zone and node double accounting by approximating retries Mel Gorman