* [PATCH 0/8] Reduce writeback from page reclaim context V4

From: Mel Gorman @ 2010-07-19 13:11 UTC
To: linux-kernel, linux-fsdevel, linux-mm
Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner,
    Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro,
    Andrew Morton, Andrea Arcangeli, Mel Gorman

Sorry for the long delay, I got side-tracked on other bugs. This is a
follow-on series from the series "Avoid overflowing of stack during page
reclaim". It eliminates filesystem writeback from direct reclaim and follows
up by reducing the amount of IO required from page reclaim to mitigate any
corner cases arising from that change.

Changelog since V3
  o Distinguish between file and anon related IO from page reclaim
  o Allow anon writeback from reclaim context
  o Sync old inodes first in background writeback
  o Pre-emptively clean pages when dirty pages are encountered on the LRU
  o Rebase to 2.6.35-rc5

Changelog since V2
  o Add acks and reviewed-bys
  o Do not lock multiple pages at the same time for writeback as it's unsafe
  o Drop the clean_page_list function. It alters timing with very little
    benefit. Without the contiguous writing, it doesn't do much to simplify
    the subsequent patches either
  o Throttle processes that encounter dirty pages in direct reclaim. Instead
    wake up flusher threads to clean the number of dirty pages encountered

Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series begins by preventing
writeback from direct reclaim and allowing btrfs and xfs to writeback from
kswapd context. As this is a potentially large change, the remainder of the
series aims to reduce any filesystem writeback from page reclaim and depend
more on background flush.

The first patch in the series is a roll-up of what should currently be in
mmotm. It's provided for convenience of testing.

Patches 2 and 3 note that it is important to distinguish between file and
anon page writeback from page reclaim as they use stack to different depths.
They update the trace points and post-processing script appropriately,
noting which mmotm patch they should be merged with.

Patch 4 prevents direct reclaim writing out filesystem pages while still
allowing writeback of anon pages, which are in less danger of overflowing
the stack and do not have something like background flush to clean them.
For filesystem pages, the flusher threads are asked to clean the number of
pages encountered, the caller waits on congestion and puts the pages back on
the LRU. For lumpy reclaim, the caller will wait for a time, calling the
flusher multiple times and waiting on dirty pages to be written out, before
trying to reclaim the dirty pages a second time. This increases the
responsibility of kswapd somewhat because it is now cleaning pages on behalf
of direct reclaimers but, unlike the background flushers, kswapd knows which
zone the pages need to be cleaned from.
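To make the intended direct reclaim behaviour concrete, the fragment below
is only a rough sketch of the patch 4 policy described above, in or around
shrink_page_list(); it is not the actual patch, the nr_dirty counter is
illustrative and details such as the lumpy reclaim retry are omitted:

	/*
	 * Sketch: dirty file pages are no longer written back from direct
	 * reclaim. They are counted and left on the LRU; the flusher
	 * threads are asked to clean that many pages and the direct
	 * reclaimer throttles briefly instead of calling ->writepage.
	 */
	if (PageDirty(page) && page_is_file_cache(page) &&
	    !current_is_kswapd()) {
		nr_dirty++;		/* illustrative counter */
		goto keep_locked;	/* page goes back on the LRU */
	}

	...

	/* Later, in the direct reclaim path only */
	if (nr_dirty) {
		wakeup_flusher_threads(nr_dirty);
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	}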
As this writeback is async IO, it should not cause kswapd to stall (at least
until the queue is congested) but the order that pages are reclaimed from the
LRU is altered. Dirty pages that would have been reclaimed by direct
reclaimers are getting another lap on the LRU. The dirty pages could have
been put on a dedicated list but this increased counter overhead and the
number of lists, and it is unclear if it is necessary.

Patches 5 and 6 revert the changes to XFS and btrfs that made them ignore
writeback from reclaim context, which is a relatively recent change. extX
could be modified to allow kswapd to writeback but it is a relatively deep
change. There may be some collision with items in the filesystem git trees
but it is expected to be trivial to resolve.

Patch 7 makes background flush behave more like kupdate by syncing old or
expired inodes first, as implemented by Wu Fengguang. As filesystem pages are
added onto the inactive queue and only promoted if referenced, it makes sense
to write old pages first to reduce the chances that page reclaim is
initiating IO.

Patch 8 notes that dirty pages can still be found at the end of the LRU. If a
number of them are encountered, it is reasonable to assume that a similar
number of dirty pages will be discovered in the very near future as that was
the dirtying pattern at the time. The patch pre-emptively kicks the
background flusher to clean a number of pages, creating feedback from page
reclaim to the background flusher that is based on scanning rates. In
discussion, Christoph described this patch as a "band-aid" but Rik liked the
idea and the patch does have interesting results, so it is worth a closer
look.
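As a rough illustration of the feedback loop in patch 8, the sketch below
assumes the number of dirty pages is tallied while scanning the inactive
list; the counter name and the batch size passed to the flusher threads are
assumptions for illustration, not necessarily the patch's actual choices:

	/* While scanning the inactive list */
	if (PageDirty(page))
		nr_dirty_seen++;	/* illustrative counter */

	...

	/*
	 * Before returning, ask the flusher threads to clean roughly as
	 * many pages as reclaim just stumbled over, so the next scan of
	 * this part of the LRU is more likely to find clean pages.
	 */
	if (nr_dirty_seen)
		wakeup_flusher_threads(nr_dirty_seen);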
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were

X86:    Intel P4, 2 cores
X86-64: AMD Phenom, 4 cores
PPC64:  PPC970MP

Each used a single disk and the onboard IO controller. Dirty ratio was left
at 20. Tests on an earlier series indicated that moving to 40 did not make
much difference. The filesystem used for all tests was XFS.

Four kernels are compared.

traceonly-v4r7     is the first 3 patches of this series
nodirect-v4r7      is the first 6 patches
flusholdest-v4r7   makes background flush behave like kupdated (patches 1-7)
flushforward-v4r7  pre-emptively cleans pages when encountered on the LRU
                   (patches 1-8)

The results of each test are broken up into two parts. The first part is a
report based on the ftrace post-processing script in patch 4 and reports on
direct reclaim and kswapd activity. The second part reports what percentage
of time was spent in direct reclaim and with kswapd awake. To work out the
percentage of time spent in direct reclaim, I used /usr/bin/time to get the
User + Sys CPU time. The stalled time was taken from the post-processing
script. The total time is (User + Sys + Stall) and the percentage is the
stalled time over the total time.

I am omitting the actual performance results simply because they are not
interesting, with very few significant changes.

kernbench
=========

No writeback from reclaim was initiated and there was no performance change
of significance.

IOzone
======

No writeback from reclaim was initiated and there was no performance change
of significance.

SysBench
========

The workload is read/write and, as the machine is under-provisioned for this
type of test, the figures are very unstable (variances of up to 15%) so they
are not reported. Part of the problem is that larger thread counts push the
test into swap as the memory is insufficient, which destabilises results
further. I could tune for this, but it was reclaim that was important.

X86                                  traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                  18             25                 6                196
Direct reclaim pages scanned                   1615           1662               605              22233
Direct reclaim write file async I/O              40              0                 0                  0
Direct reclaim write anon async I/O               0              0                13                  9
Direct reclaim write file sync I/O                0              0                 0                  0
Direct reclaim write anon sync I/O                0              0                 0                  0
Wake kswapd requests                         171039         401450            313156              90960
Kswapd wakeups                                  685            532               611                262
Kswapd pages scanned                       14272338       12209663          13799001            5230124
Kswapd reclaim write file async I/O          581811          23047             23795                759
Kswapd reclaim write anon async I/O          189590         124947            114948              42906
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)               0.00           0.91              0.92               1.31
Time kswapd awake (ms)                      1079.32        1039.42           1194.82            1091.06
User/Sys Time Running Test (seconds)        1312.24        1241.37           1308.16            1253.15
Percentage Time Spent Direct Reclaim          0.00%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)                8411.28        7471.15           8292.18            8170.16
Percentage Time kswapd Awake                  3.45%          0.00%             0.00%              0.00%

Dirty file pages from X86 were not much of a problem to begin with and the
patches eliminate them as expected. What is interesting is that nodirect-v4r7
made such a large difference to the number of filesystem pages that had to be
written back. Apparently, background flush must have been doing a better job
of getting them cleaned in time and the direct reclaim stalls are harmful
overall. Waking background threads for dirty pages made a very large
difference to the number of pages written back. With all patches applied,
just 759 filesystem pages were written back in comparison to 581811 in the
vanilla kernel and overall the number of pages scanned was reduced.

X86-64                               traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                 795           1662              2131               6459
Direct reclaim pages scanned                 204900         127300            291647             317035
Direct reclaim write file async I/O           53763              0                 0                  0
Direct reclaim write anon async I/O            1256            730              6114                 20
Direct reclaim write file sync I/O               10              0                 0                  0
Direct reclaim write anon sync I/O                0              0                 0                  0
Wake kswapd requests                         690850        1457411           1713379            1648469
Kswapd wakeups                                 1683           1353              1275               1171
Kswapd pages scanned                       17976327       15711169          16501926           12634291
Kswapd reclaim write file async I/O          818222          26560             42081               6311
Kswapd reclaim write anon async I/O          245442         218708            209703             205254
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)              13.50          41.19             69.56              51.32
Time kswapd awake (ms)                      2243.53        2515.34           2767.58            2607.94
User/Sys Time Running Test (seconds)         687.69         650.83            653.28             640.38
Percentage Time Spent Direct Reclaim          0.01%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)                6954.05        6472.68           6508.28            6211.11
Percentage Time kswapd Awake                  0.04%          0.00%             0.00%              0.00%

Direct reclaim of filesystem pages is eliminated as expected. Again, the
overall number of pages that need to be written back by page reclaim is
reduced. Flushing the oldest inodes first was not much of a help in terms of
how many pages needed to be written back from reclaim, but pre-emptively
waking flusher threads helped a lot. Oddly, more time was spent in direct
reclaim with the patches as a greater number of anon pages needed to be
written back. It's possible this was due to the test making more forward
progress, as indicated by the shorter running time.
PPC64                                traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                1517          34527             32365              51973
Direct reclaim pages scanned                 144496        2041199           1950282            3137493
Direct reclaim write file async I/O           28147              0                 0                  0
Direct reclaim write anon async I/O             463          25258             10894                  0
Direct reclaim write file sync I/O                7              0                 0                  0
Direct reclaim write anon sync I/O                0              1                 0                  0
Wake kswapd requests                        1126060        6578275           6281512            6649558
Kswapd wakeups                                  591            262               229                247
Kswapd pages scanned                       16522849       12277885          11076027            7614475
Kswapd reclaim write file async I/O         1302640          50301             43308               8658
Kswapd reclaim write anon async I/O          150876         146600            159229             134919
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)              32.28         481.52            535.15             342.97
Time kswapd awake (ms)                      1694.00        4789.76           4426.42            4309.49
User/Sys Time Running Test (seconds)        1294.96         1264.5           1254.92            1216.92
Percentage Time Spent Direct Reclaim          0.03%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)                8876.80        8446.49           7644.95            7519.83
Percentage Time kswapd Awake                  0.05%          0.00%             0.00%              0.00%

Direct reclaim filesystem writes are eliminated but the scan rates went way
up. It implies that direct reclaim was spinning quite a bit, finding clean
pages and allowing the test to complete 22 minutes faster. Flushing the
oldest inodes first helped, but pre-emptively waking background flushers
helped more in terms of the number of pages cleaned by page reclaim.

Stress HighAlloc
================

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated as huge
pages under load (Pass 1), on a second attempt under load (Pass 2) and when
the kernel compiles have finished and the system is quiet (At Rest). The
patches have little impact on the success rates.

X86                                  traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                 623            607               611                491
Direct reclaim pages scanned                 126515         117477            142502              91649
Direct reclaim write file async I/O             896              0                 0                  0
Direct reclaim write anon async I/O           35286          27508             35688              24819
Direct reclaim write file sync I/O              580              0                 0                  0
Direct reclaim write anon sync I/O            13932          12301             15203              11509
Wake kswapd requests                           1561           1650              1618               1152
Kswapd wakeups                                  183            209               211                 79
Kswapd pages scanned                        9391908        9144543          11418802            6959545
Kswapd reclaim write file async I/O           92730           7073              8215                807
Kswapd reclaim write anon async I/O          946499         831573           1164240             833063
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)            4653.17        4193.28           5292.97            6954.96
Time kswapd awake (ms)                      4618.67        3787.74           4856.45           55704.90
User/Sys Time Running Test (seconds)        2103.48        2161.14              2131            2160.01
Percentage Time Spent Direct Reclaim          0.33%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)                6996.43        6405.43           7584.74            8904.53
Percentage Time kswapd Awake                  0.80%          0.00%             0.00%              0.00%

Total time running the test was unfortunately increased but this was the only
instance where it occurred. Otherwise it is a similar story to elsewhere:
filesystem direct writes are eliminated and overall filesystem writes from
page reclaim are reduced to almost negligible levels (0.01% of pages scanned
by kswapd resulted in a filesystem write for the full series, in comparison
to 0.99% in the vanilla kernel).
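For reference, those two percentages follow directly from the kswapd rows in
the table above:

	traceonly-v4r7:     92730 file writes / 9391908 pages scanned ~= 0.99%
	flushforward-v4r7:    807 file writes / 6959545 pages scanned ~= 0.01%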
X86-64                               traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                1275           1300              1222               1224
Direct reclaim pages scanned                 156940         152253            148993             148726
Direct reclaim write file async I/O            2472              0                 0                  0
Direct reclaim write anon async I/O           29281          26887             28073              26283
Direct reclaim write file sync I/O             1943              0                 0                  0
Direct reclaim write anon sync I/O            11777           9258             10256               8510
Wake kswapd requests                           4865          12895              1185               1176
Kswapd wakeups                                  869            757               789                822
Kswapd pages scanned                       41664053       30419872          29602438           42603986
Kswapd reclaim write file async I/O          550544          16092             12775               4414
Kswapd reclaim write anon async I/O         2409931        1964446           1779486            1667076
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)            8908.93        7920.53           6192.17            5926.47
Time kswapd awake (ms)                      6045.11        5486.48           3945.35            3367.01
User/Sys Time Running Test (seconds)        2813.44        2818.17            2801.8            2803.61
Percentage Time Spent Direct Reclaim          0.21%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)               11217.45       10286.90           8534.22            8332.84
Percentage Time kswapd Awake                  0.03%          0.00%             0.00%              0.00%

Unlike X86, total time spent on the test was significantly reduced and, like
elsewhere, filesystem IO due to reclaim is way down.

PPC64                                traceonly-v4r7  nodirect-v4r7  flusholdest-v4r7  flushforward-v4r7
Direct reclaims                                 665            709               652                663
Direct reclaim pages scanned                 145630         125161            116556             124718
Direct reclaim write file async I/O             946              0                 0                  0
Direct reclaim write anon async I/O           26983          23160             28531              23360
Direct reclaim write file sync I/O              596              0                 0                  0
Direct reclaim write anon sync I/O            17517          13635             16114              13121
Wake kswapd requests                            271            302               299                278
Kswapd wakeups                                  181            164               158                172
Kswapd pages scanned                       68789711       68058349          54613548           64905996
Kswapd reclaim write file async I/O          159196          20569             17538               2475
Kswapd reclaim write anon async I/O         2311178        1962398           1811115            1829023
Kswapd reclaim write file sync I/O                0              0                 0                  0
Kswapd reclaim write anon sync I/O                0              0                 0                  0
Time stalled direct reclaim (ms)           13784.95       12895.39          11132.26           11785.26
Time kswapd awake (ms)                     13331.51       12603.74          10956.18           11479.22
User/Sys Time Running Test (seconds)        3567.03        2730.23           2682.86            2668.08
Percentage Time Spent Direct Reclaim          0.33%          0.00%             0.00%              0.00%
Total Elapsed Time (seconds)               15282.74       14347.67          12614.61           13386.85
Percentage Time kswapd Awake                  0.08%          0.00%             0.00%              0.00%

Similar story, the test completed faster and page reclaim IO is down.

Overall, the patches seem to help. Reclaim activity is reduced while test
times are generally improved. A big concern with V3 was that direct reclaim
not being able to write pages could lead to unexpected behaviour. This series
mitigates that risk by reducing the amount of IO initiated by page reclaim,
making it a rarer event.
Mel Gorman (7):
  MMOTM MARKER
  vmscan: tracing: Update trace event to track if page reclaim IO is for
    anon or file pages
  vmscan: tracing: Update post-processing script to distinguish between
    anon and file IO from page reclaim
  vmscan: Do not writeback filesystem pages in direct reclaim
  fs,btrfs: Allow kswapd to writeback pages
  fs,xfs: Allow kswapd to writeback pages
  vmscan: Kick flusher threads to clean pages when reclaim is encountering
    dirty pages

Wu Fengguang (1):
  writeback: sync old inodes first in background writeback

 .../trace/postprocess/trace-vmscan-postprocess.pl |   89 +++++++++-----
 Makefile                                          |    2 +-
 fs/btrfs/disk-io.c                                |   21 +----
 fs/btrfs/inode.c                                  |    6 -
 fs/fs-writeback.c                                 |   19 +++-
 fs/xfs/linux-2.6/xfs_aops.c                       |   15 ---
 include/trace/events/vmscan.h                     |    8 +-
 mm/vmscan.c                                       |  121 ++++++++++++++++++-
 8 files changed, 195 insertions(+), 86 deletions(-)
* [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman ` (6 subsequent siblings) 7 siblings, 0 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman This is a roll-up of patches currently in an unreleased mmotm tree related to stack reduction and tracing reclaim. The patches were taken from mm-commits traffic. It is based on 2.6.35-rc5 and included for the convenience of testing. No signed off required. --- .../trace/postprocess/trace-vmscan-postprocess.pl | 654 ++++++++++++++++++++ include/linux/memcontrol.h | 5 - include/linux/mmzone.h | 15 - include/trace/events/gfpflags.h | 37 ++ include/trace/events/kmem.h | 38 +-- include/trace/events/vmscan.h | 184 ++++++ mm/memcontrol.c | 31 - mm/page_alloc.c | 2 - mm/vmscan.c | 414 +++++++------ mm/vmstat.c | 2 - 10 files changed, 1089 insertions(+), 293 deletions(-) diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl new file mode 100644 index 0000000..d1ddc33 --- /dev/null +++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -0,0 +1,654 @@ +#!/usr/bin/perl +# This is a POC for reading the text representation of trace output related to +# page reclaim. It makes an attempt to extract some high-level information on +# what is going on. 
The accuracy of the parser may vary +# +# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1; +use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2; +use constant MM_VMSCAN_KSWAPD_WAKE => 3; +use constant MM_VMSCAN_KSWAPD_SLEEP => 4; +use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5; +use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6; +use constant MM_VMSCAN_LRU_ISOLATE => 7; +use constant MM_VMSCAN_WRITEPAGE_SYNC => 8; +use constant MM_VMSCAN_WRITEPAGE_ASYNC => 9; +use constant EVENT_UNKNOWN => 10; + +# Per-order events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11; +use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12; +use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13; +use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14; + +# Constants used to track state +use constant STATE_DIRECT_BEGIN => 15; +use constant STATE_DIRECT_ORDER => 16; +use constant STATE_KSWAPD_BEGIN => 17; +use constant STATE_KSWAPD_ORDER => 18; + +# High-level events extrapolated from tracepoints +use constant HIGH_DIRECT_RECLAIM_LATENCY => 19; +use constant HIGH_KSWAPD_LATENCY => 20; +use constant HIGH_KSWAPD_REWAKEUP => 21; +use constant HIGH_NR_SCANNED => 22; +use constant HIGH_NR_TAKEN => 23; +use constant HIGH_NR_RECLAIM => 24; +use constant HIGH_NR_CONTIG_DIRTY => 25; + +my %perprocesspid; +my %perprocess; +my %last_procmap; +my $opt_ignorepid; +my $opt_read_procstat; + +my $total_wakeup_kswapd; +my ($total_direct_reclaim, $total_direct_nr_scanned); +my ($total_direct_latency, $total_kswapd_latency); +my ($total_direct_writepage_sync, $total_direct_writepage_async); +my ($total_kswapd_nr_scanned, $total_kswapd_wake); +my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async); + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. 
Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, +); + +# Defaults for dynamically discovered regex's +my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)'; +my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)'; +my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; +my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; +my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; +my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)'; +my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)'; +my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; +my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)'; + +# Dyanically discovered regex +my $regex_direct_begin; +my $regex_direct_end; +my $regex_kswapd_wake; +my $regex_kswapd_sleep; +my $regex_wakeup_kswapd; +my $regex_lru_isolate; +my $regex_lru_shrink_inactive; +my $regex_lru_shrink_active; +my $regex_writepage; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + print("WARNING: Event $event format string not found\n"); + return $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + $line =~ s/, REC->.*//; + if ($line =~ /^print fmt:\s"(.*)".*/) { + $regex = $1; + $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%ld/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Can't handle the print_flags stuff but in the context of this + # script, it really doesn't matter + $regex =~ s/\(REC.*\) \? 
__print_flags.*//; + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected for event $event '$key' != '$expected'\n"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} + +$regex_direct_begin = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_begin", + $regex_direct_begin_default, + "order", "may_writepage", + "gfp_flags"); +$regex_direct_end = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_end", + $regex_direct_end_default, + "nr_reclaimed"); +$regex_kswapd_wake = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_wake", + $regex_kswapd_wake_default, + "nid", "order"); +$regex_kswapd_sleep = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_sleep", + $regex_kswapd_sleep_default, + "nid"); +$regex_wakeup_kswapd = generate_traceevent_regex( + "vmscan/mm_vmscan_wakeup_kswapd", + $regex_wakeup_kswapd_default, + "nid", "zid", "order"); +$regex_lru_isolate = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_isolate", + $regex_lru_isolate_default, + "isolate_mode", "order", + "nr_requested", "nr_scanned", "nr_taken", + "contig_taken", "contig_dirty", "contig_failed"); +$regex_lru_shrink_inactive = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_inactive", + $regex_lru_shrink_inactive_default, + "nid", "zid", + "lru", + "nr_scanned", "nr_reclaimed", "priority"); +$regex_lru_shrink_active = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_active", + $regex_lru_shrink_active_default, + "nid", "zid", + "lru", + "nr_scanned", "nr_rotated", "priority"); +$regex_writepage = generate_traceevent_regex( + "vmscan/mm_vmscan_writepage", + $regex_writepage_default, + "page", "pfn", "sync_io"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +# Convert sec.usec timestamp format +sub timestamp_to_ms($) { + my $timestamp = $_[0]; + + my ($sec, $usec) = split (/\./, $timestamp); + return ($sec * 1000) + ($usec / 1000); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $timestamp = $3; + $tracepoint = $4; + + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + if ($process eq "") { + $process = $last_procmap{$pid}; + $process_pid = "$process-$pid"; + } + $last_procmap{$pid} = $process; + + if ($opt_read_procstat) { + $statline = read_statline($pid); + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + } + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++; + 
$perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp; + + $details = $5; + if ($details !~ /$regex_direct_begin/o) { + print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n"; + print " $details\n"; + print " $regex_direct_begin\n"; + next; + } + my $order = $1; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++; + $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order; + } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") { + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++; + + # Record how long direct reclaim took this time + if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency"; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") { + $details = $5; + if ($details !~ /$regex_kswapd_wake/o) { + print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n"; + print " $details\n"; + print " $regex_kswapd_wake\n"; + next; + } + + my $order = $2; + $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order; + if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++; + } else { + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++; + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") { + + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++; + + # Record how long kswapd was awake + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency"; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0; + } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") { + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++; + + $details = $5; + if ($details !~ /$regex_wakeup_kswapd/o) { + print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n"; + print " $details\n"; + print " $regex_wakeup_kswapd\n"; + next; + } + my $order = $3; + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++; + } elsif ($tracepoint eq "mm_vmscan_lru_isolate") { + $details = $5; + if ($details !~ /$regex_lru_isolate/o) { + print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n"; + print " $details\n"; + print " $regex_lru_isolate/o\n"; + next; + } + my $nr_scanned = $4; + my $nr_contig_dirty = $7; + $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; + $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty; + } elsif ($tracepoint eq "mm_vmscan_writepage") { + $details = $5; + if ($details !~ /$regex_writepage/o) { + print "WARNING: Failed to parse mm_vmscan_writepage as expected\n"; + print " $details\n"; + print " 
$regex_writepage\n"; + next; + } + + my $sync_io = $3; + if ($sync_io) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++; + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + # Work out latencies + printf("\n") if !$opt_ignorepid; + printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid; + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] && + !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) { + next; + } + + printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid; + my $index = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] || + defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) { + + if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $total_direct_latency += $latency; + } else { + printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]); + $total_kswapd_latency += $latency; + } + $index++; + } + print "\n" if !$opt_ignorepid; + } + + # Print out process activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Time"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO", "Stalled"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + next; + } + + $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; + $total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + + my $index = 0; + my $this_reclaim_delay = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $this_reclaim_delay += $latency; + $index++; + } + + printf("%-" . $max_strlen . 
"s %8d %10d %8u %8u %8u %8.3f", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, + $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}, + $this_reclaim_delay / 1000); + + if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + if ($count != 0) { + print "direct-$order=$count "; + } + } + } + if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + if ($count != 0) { + print "wakeup-$order=$count "; + } + } + } + if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) { + print " "; + my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}; + if ($count != 0) { + print "contig-dirty=$count "; + } + } + + print "\n"; + } + + # Print out kswapd activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + next; + } + + $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; + $total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + + printf("%-" . $max_strlen . 
"s %8d %10d %8u %8i %8u", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, + $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}); + + if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + if ($count != 0) { + print "wake-$order=$count "; + } + } + } + if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]; + if ($count != 0) { + print "rewake-$order=$count "; + } + } + } + printf("\n"); + } + + # Print out summaries + $total_direct_latency /= 1000; + $total_kswapd_latency /= 1000; + print "\nSummary\n"; + print "Direct reclaims: $total_direct_reclaim\n"; + print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim write sync I/O: $total_direct_writepage_sync\n"; + print "Direct reclaim write async I/O: $total_direct_writepage_async\n"; + print "Wake kswapd requests: $total_wakeup_kswapd\n"; + printf "Time stalled direct reclaim: %-1.2f ms\n", $total_direct_latency; + print "\n"; + print "Kswapd wakeups: $total_kswapd_wake\n"; + print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd reclaim write sync I/O: $total_kswapd_writepage_sync\n"; + print "Kswapd reclaim write async I/O: $total_kswapd_writepage_async\n"; + printf "Time kswapd awake: %-1.2f ms\n", $total_kswapd_latency; +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; + $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + + for (my $order = 0; $order < 20; $order++) { + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + + } + + # Aggregate direct reclaim latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = 
$perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + + # Aggregate kswapd latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9411d32..9f1afd3 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -98,11 +98,6 @@ extern void mem_cgroup_end_migration(struct mem_cgroup *mem, /* * For memory reclaim. */ -extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem); -extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, - int priority); -extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, - int priority); int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg); int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg); unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg, diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index b4d109e..b578eee 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -348,21 +348,6 @@ struct zone { atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; /* - * prev_priority holds the scanning priority for this zone. It is - * defined as the scanning priority at which we achieved our reclaim - * target at the previous try_to_free_pages() or balance_pgdat() - * invocation. - * - * We use prev_priority as a measure of how much stress page reclaim is - * under - it drives the swappiness decision: whether to unmap mapped - * pages. - * - * Access to both this field is quite racy even on uniprocessor. But - * it is expected to average out OK. - */ - int prev_priority; - - /* * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on * this zone's LRU. Maintained by the pageout code. */ diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h new file mode 100644 index 0000000..e3615c0 --- /dev/null +++ b/include/trace/events/gfpflags.h @@ -0,0 +1,37 @@ +/* + * The order of these masks is important. Matching masks will be seen + * first and the left over flags will end up showing by themselves. + * + * For example, if we have GFP_KERNEL before GFP_USER we wil get: + * + * GFP_KERNEL|GFP_HARDWALL + * + * Thus most bits set go first. + */ +#define show_gfp_flags(flags) \ + (flags) ? 
__print_flags(flags, "|", \ + {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \ + {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \ + {(unsigned long)GFP_USER, "GFP_USER"}, \ + {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \ + {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \ + {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \ + {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \ + {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \ + {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \ + {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \ + {(unsigned long)__GFP_IO, "GFP_IO"}, \ + {(unsigned long)__GFP_COLD, "GFP_COLD"}, \ + {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \ + {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \ + {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \ + {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \ + {(unsigned long)__GFP_COMP, "GFP_COMP"}, \ + {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \ + {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \ + {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \ + {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \ + {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \ + {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \ + ) : "GFP_NOWAIT" + diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 3adca0c..a9c87ad 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -6,43 +6,7 @@ #include <linux/types.h> #include <linux/tracepoint.h> - -/* - * The order of these masks is important. Matching masks will be seen - * first and the left over flags will end up showing by themselves. - * - * For example, if we have GFP_KERNEL before GFP_USER we wil get: - * - * GFP_KERNEL|GFP_HARDWALL - * - * Thus most bits set go first. - */ -#define show_gfp_flags(flags) \ - (flags) ? __print_flags(flags, "|", \ - {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \ - {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \ - {(unsigned long)GFP_USER, "GFP_USER"}, \ - {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \ - {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \ - {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \ - {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \ - {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \ - {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \ - {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \ - {(unsigned long)__GFP_IO, "GFP_IO"}, \ - {(unsigned long)__GFP_COLD, "GFP_COLD"}, \ - {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \ - {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \ - {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \ - {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \ - {(unsigned long)__GFP_COMP, "GFP_COMP"}, \ - {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \ - {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \ - {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \ - {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \ - {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \ - {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \ - ) : "GFP_NOWAIT" +#include "gfpflags.h" DECLARE_EVENT_CLASS(kmem_alloc, diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h new file mode 100644 index 0000000..f2da66a --- /dev/null +++ b/include/trace/events/vmscan.h @@ -0,0 +1,184 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM vmscan + +#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_VMSCAN_H + +#include <linux/types.h> +#include <linux/tracepoint.h> +#include "gfpflags.h" + +TRACE_EVENT(mm_vmscan_kswapd_sleep, + + TP_PROTO(int nid), + + TP_ARGS(nid), + + TP_STRUCT__entry( + __field( int, nid ) + ), + + TP_fast_assign( + 
__entry->nid = nid; + ), + + TP_printk("nid=%d", __entry->nid) +); + +TRACE_EVENT(mm_vmscan_kswapd_wake, + + TP_PROTO(int nid, int order), + + TP_ARGS(nid, order), + + TP_STRUCT__entry( + __field( int, nid ) + __field( int, order ) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->order = order; + ), + + TP_printk("nid=%d order=%d", __entry->nid, __entry->order) +); + +TRACE_EVENT(mm_vmscan_wakeup_kswapd, + + TP_PROTO(int nid, int zid, int order), + + TP_ARGS(nid, zid, order), + + TP_STRUCT__entry( + __field( int, nid ) + __field( int, zid ) + __field( int, order ) + ), + + TP_fast_assign( + __entry->nid = nid; + __entry->zid = zid; + __entry->order = order; + ), + + TP_printk("nid=%d zid=%d order=%d", + __entry->nid, + __entry->zid, + __entry->order) +); + +TRACE_EVENT(mm_vmscan_direct_reclaim_begin, + + TP_PROTO(int order, int may_writepage, gfp_t gfp_flags), + + TP_ARGS(order, may_writepage, gfp_flags), + + TP_STRUCT__entry( + __field( int, order ) + __field( int, may_writepage ) + __field( gfp_t, gfp_flags ) + ), + + TP_fast_assign( + __entry->order = order; + __entry->may_writepage = may_writepage; + __entry->gfp_flags = gfp_flags; + ), + + TP_printk("order=%d may_writepage=%d gfp_flags=%s", + __entry->order, + __entry->may_writepage, + show_gfp_flags(__entry->gfp_flags)) +); + +TRACE_EVENT(mm_vmscan_direct_reclaim_end, + + TP_PROTO(unsigned long nr_reclaimed), + + TP_ARGS(nr_reclaimed), + + TP_STRUCT__entry( + __field( unsigned long, nr_reclaimed ) + ), + + TP_fast_assign( + __entry->nr_reclaimed = nr_reclaimed; + ), + + TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed) +); + +TRACE_EVENT(mm_vmscan_lru_isolate, + + TP_PROTO(int order, + unsigned long nr_requested, + unsigned long nr_scanned, + unsigned long nr_taken, + unsigned long nr_lumpy_taken, + unsigned long nr_lumpy_dirty, + unsigned long nr_lumpy_failed, + int isolate_mode), + + TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode), + + TP_STRUCT__entry( + __field(int, order) + __field(unsigned long, nr_requested) + __field(unsigned long, nr_scanned) + __field(unsigned long, nr_taken) + __field(unsigned long, nr_lumpy_taken) + __field(unsigned long, nr_lumpy_dirty) + __field(unsigned long, nr_lumpy_failed) + __field(int, isolate_mode) + ), + + TP_fast_assign( + __entry->order = order; + __entry->nr_requested = nr_requested; + __entry->nr_scanned = nr_scanned; + __entry->nr_taken = nr_taken; + __entry->nr_lumpy_taken = nr_lumpy_taken; + __entry->nr_lumpy_dirty = nr_lumpy_dirty; + __entry->nr_lumpy_failed = nr_lumpy_failed; + __entry->isolate_mode = isolate_mode; + ), + + TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu", + __entry->isolate_mode, + __entry->order, + __entry->nr_requested, + __entry->nr_scanned, + __entry->nr_taken, + __entry->nr_lumpy_taken, + __entry->nr_lumpy_dirty, + __entry->nr_lumpy_failed) +); + +TRACE_EVENT(mm_vmscan_writepage, + + TP_PROTO(struct page *page, + int sync_io), + + TP_ARGS(page, sync_io), + + TP_STRUCT__entry( + __field(struct page *, page) + __field(int, sync_io) + ), + + TP_fast_assign( + __entry->page = page; + __entry->sync_io = sync_io; + ), + + TP_printk("page=%p pfn=%lu sync_io=%d", + __entry->page, + page_to_pfn(__entry->page), + __entry->sync_io) +); + +#endif /* _TRACE_VMSCAN_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 20a8193..31abd1c 
100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -211,8 +211,6 @@ struct mem_cgroup { */ spinlock_t reclaim_param_lock; - int prev_priority; /* for recording reclaim priority */ - /* * While reclaiming in a hierarchy, we cache the last child we * reclaimed from. @@ -858,35 +856,6 @@ int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem) return ret; } -/* - * prev_priority control...this will be used in memory reclaim path. - */ -int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem) -{ - int prev_priority; - - spin_lock(&mem->reclaim_param_lock); - prev_priority = mem->prev_priority; - spin_unlock(&mem->reclaim_param_lock); - - return prev_priority; -} - -void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem, int priority) -{ - spin_lock(&mem->reclaim_param_lock); - if (priority < mem->prev_priority) - mem->prev_priority = priority; - spin_unlock(&mem->reclaim_param_lock); -} - -void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem, int priority) -{ - spin_lock(&mem->reclaim_param_lock); - mem->prev_priority = priority; - spin_unlock(&mem->reclaim_param_lock); -} - static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_pages) { unsigned long active; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 431214b..0b0b629 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, zone_seqlock_init(zone); zone->zone_pgdat = pgdat; - zone->prev_priority = DEF_PRIORITY; - zone_pcp_init(zone); for_each_lru(l) { INIT_LIST_HEAD(&zone->lru[l].list); diff --git a/mm/vmscan.c b/mm/vmscan.c index 9c7e57c..e6ddba9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -48,6 +48,9 @@ #include "internal.h" +#define CREATE_TRACE_POINTS +#include <trace/events/vmscan.h> + struct scan_control { /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; @@ -290,13 +293,13 @@ static int may_write_to_queue(struct backing_dev_info *bdi) * prevents it from being freed up. But we have a ref on the page and once * that page is locked, the mapping is pinned. * - * We're allowed to run sleeping lock_page() here because we know the caller has - * __GFP_FS. + * We're allowed to run sleeping lock_page_nosync() here because we know the + * caller has __GFP_FS. */ static void handle_write_error(struct address_space *mapping, struct page *page, int error) { - lock_page(page); + lock_page_nosync(page); if (page_mapping(page) == mapping) mapping_set_error(mapping, error); unlock_page(page); @@ -396,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, /* synchronous write or broken a_ops? 
*/ ClearPageReclaim(page); } + trace_mm_vmscan_writepage(page, + sync_writeback == PAGEOUT_IO_SYNC); inc_zone_page_state(page, NR_VMSCAN_WRITE); return PAGE_SUCCESS; } @@ -615,6 +620,24 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; } +static noinline_for_stack void free_page_list(struct list_head *free_pages) +{ + struct pagevec freed_pvec; + struct page *page, *tmp; + + pagevec_init(&freed_pvec, 1); + + list_for_each_entry_safe(page, tmp, free_pages, lru) { + list_del(&page->lru); + if (!pagevec_add(&freed_pvec, page)) { + __pagevec_free(&freed_pvec); + pagevec_reinit(&freed_pvec); + } + } + + pagevec_free(&freed_pvec); +} + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -623,13 +646,12 @@ static unsigned long shrink_page_list(struct list_head *page_list, enum pageout_io sync_writeback) { LIST_HEAD(ret_pages); - struct pagevec freed_pvec; + LIST_HEAD(free_pages); int pgactivate = 0; unsigned long nr_reclaimed = 0; cond_resched(); - pagevec_init(&freed_pvec, 1); while (!list_empty(page_list)) { enum page_references references; struct address_space *mapping; @@ -804,10 +826,12 @@ static unsigned long shrink_page_list(struct list_head *page_list, __clear_page_locked(page); free_it: nr_reclaimed++; - if (!pagevec_add(&freed_pvec, page)) { - __pagevec_free(&freed_pvec); - pagevec_reinit(&freed_pvec); - } + + /* + * Is there need to periodically free_page_list? It would + * appear not as the counts should be low + */ + list_add(&page->lru, &free_pages); continue; cull_mlocked: @@ -830,9 +854,10 @@ keep: list_add(&page->lru, &ret_pages); VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } + + free_page_list(&free_pages); + list_splice(&ret_pages, page_list); - if (pagevec_count(&freed_pvec)) - __pagevec_free(&freed_pvec); count_vm_events(PGACTIVATE, pgactivate); return nr_reclaimed; } @@ -914,6 +939,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, unsigned long *scanned, int order, int mode, int file) { unsigned long nr_taken = 0; + unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0; unsigned long scan; for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) { @@ -991,12 +1017,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, list_move(&cursor_page->lru, dst); mem_cgroup_del_lru(cursor_page); nr_taken++; + nr_lumpy_taken++; + if (PageDirty(cursor_page)) + nr_lumpy_dirty++; scan++; + } else { + if (mode == ISOLATE_BOTH && + page_count(cursor_page)) + nr_lumpy_failed++; } } } *scanned = scan; + + trace_mm_vmscan_lru_isolate(order, + nr_to_scan, scan, + nr_taken, + nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, + mode); return nr_taken; } @@ -1033,7 +1072,8 @@ static unsigned long clear_active_flags(struct list_head *page_list, ClearPageActive(page); nr_active++; } - count[lru]++; + if (count) + count[lru]++; } return nr_active; @@ -1110,174 +1150,177 @@ static int too_many_isolated(struct zone *zone, int file, } /* - * shrink_inactive_list() is a helper for shrink_zone(). 
It returns the number - * of reclaimed pages + * TODO: Try merging with migrations version of putback_lru_pages */ -static unsigned long shrink_inactive_list(unsigned long max_scan, - struct zone *zone, struct scan_control *sc, - int priority, int file) +static noinline_for_stack void +putback_lru_pages(struct zone *zone, struct scan_control *sc, + unsigned long nr_anon, unsigned long nr_file, + struct list_head *page_list) { - LIST_HEAD(page_list); + struct page *page; struct pagevec pvec; - unsigned long nr_scanned = 0; - unsigned long nr_reclaimed = 0; struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); - while (unlikely(too_many_isolated(zone, file, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + pagevec_init(&pvec, 1); - /* We are about to die and free our memory. Return now. */ - if (fatal_signal_pending(current)) - return SWAP_CLUSTER_MAX; + /* + * Put back any unfreeable pages. + */ + spin_lock(&zone->lru_lock); + while (!list_empty(page_list)) { + int lru; + page = lru_to_page(page_list); + VM_BUG_ON(PageLRU(page)); + list_del(&page->lru); + if (unlikely(!page_evictable(page, NULL))) { + spin_unlock_irq(&zone->lru_lock); + putback_lru_page(page); + spin_lock_irq(&zone->lru_lock); + continue; + } + SetPageLRU(page); + lru = page_lru(page); + add_page_to_lru_list(zone, page, lru); + if (is_active_lru(lru)) { + int file = is_file_lru(lru); + reclaim_stat->recent_rotated[file]++; + } + if (!pagevec_add(&pvec, page)) { + spin_unlock_irq(&zone->lru_lock); + __pagevec_release(&pvec); + spin_lock_irq(&zone->lru_lock); + } } + __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon); + __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file); + spin_unlock_irq(&zone->lru_lock); + pagevec_release(&pvec); +} - pagevec_init(&pvec, 1); +static noinline_for_stack void update_isolated_counts(struct zone *zone, + struct scan_control *sc, + unsigned long *nr_anon, + unsigned long *nr_file, + struct list_head *isolated_list) +{ + unsigned long nr_active; + unsigned int count[NR_LRU_LISTS] = { 0, }; + struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc); - lru_add_drain(); - spin_lock_irq(&zone->lru_lock); - do { - struct page *page; - unsigned long nr_taken; - unsigned long nr_scan; - unsigned long nr_freed; - unsigned long nr_active; - unsigned int count[NR_LRU_LISTS] = { 0, }; - int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE; - unsigned long nr_anon; - unsigned long nr_file; + nr_active = clear_active_flags(isolated_list, count); + __count_vm_events(PGDEACTIVATE, nr_active); - if (scanning_global_lru(sc)) { - nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX, - &page_list, &nr_scan, - sc->order, mode, - zone, 0, file); - zone->pages_scanned += nr_scan; - if (current_is_kswapd()) - __count_zone_vm_events(PGSCAN_KSWAPD, zone, - nr_scan); - else - __count_zone_vm_events(PGSCAN_DIRECT, zone, - nr_scan); - } else { - nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX, - &page_list, &nr_scan, - sc->order, mode, - zone, sc->mem_cgroup, - 0, file); - /* - * mem_cgroup_isolate_pages() keeps track of - * scanned pages on its own. 
- */ - } + __mod_zone_page_state(zone, NR_ACTIVE_FILE, + -count[LRU_ACTIVE_FILE]); + __mod_zone_page_state(zone, NR_INACTIVE_FILE, + -count[LRU_INACTIVE_FILE]); + __mod_zone_page_state(zone, NR_ACTIVE_ANON, + -count[LRU_ACTIVE_ANON]); + __mod_zone_page_state(zone, NR_INACTIVE_ANON, + -count[LRU_INACTIVE_ANON]); - if (nr_taken == 0) - goto done; + *nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON]; + *nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE]; + __mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon); + __mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file); - nr_active = clear_active_flags(&page_list, count); - __count_vm_events(PGDEACTIVATE, nr_active); + reclaim_stat->recent_scanned[0] += *nr_anon; + reclaim_stat->recent_scanned[1] += *nr_file; +} - __mod_zone_page_state(zone, NR_ACTIVE_FILE, - -count[LRU_ACTIVE_FILE]); - __mod_zone_page_state(zone, NR_INACTIVE_FILE, - -count[LRU_INACTIVE_FILE]); - __mod_zone_page_state(zone, NR_ACTIVE_ANON, - -count[LRU_ACTIVE_ANON]); - __mod_zone_page_state(zone, NR_INACTIVE_ANON, - -count[LRU_INACTIVE_ANON]); +/* + * shrink_inactive_list() is a helper for shrink_zone(). It returns the number + * of reclaimed pages + */ +static noinline_for_stack unsigned long +shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, + struct scan_control *sc, int priority, int file) +{ + LIST_HEAD(page_list); + unsigned long nr_scanned; + unsigned long nr_reclaimed = 0; + unsigned long nr_taken; + unsigned long nr_active; + unsigned long nr_anon; + unsigned long nr_file; - nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON]; - nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE]; - __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon); - __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file); + while (unlikely(too_many_isolated(zone, file, sc))) { + congestion_wait(BLK_RW_ASYNC, HZ/10); - reclaim_stat->recent_scanned[0] += nr_anon; - reclaim_stat->recent_scanned[1] += nr_file; + /* We are about to die and free our memory. Return now. */ + if (fatal_signal_pending(current)) + return SWAP_CLUSTER_MAX; + } - spin_unlock_irq(&zone->lru_lock); - nr_scanned += nr_scan; - nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); + lru_add_drain(); + spin_lock_irq(&zone->lru_lock); + if (scanning_global_lru(sc)) { + nr_taken = isolate_pages_global(nr_to_scan, + &page_list, &nr_scanned, sc->order, + sc->lumpy_reclaim_mode ? + ISOLATE_BOTH : ISOLATE_INACTIVE, + zone, 0, file); + zone->pages_scanned += nr_scanned; + if (current_is_kswapd()) + __count_zone_vm_events(PGSCAN_KSWAPD, zone, + nr_scanned); + else + __count_zone_vm_events(PGSCAN_DIRECT, zone, + nr_scanned); + } else { + nr_taken = mem_cgroup_isolate_pages(nr_to_scan, + &page_list, &nr_scanned, sc->order, + sc->lumpy_reclaim_mode ? + ISOLATE_BOTH : ISOLATE_INACTIVE, + zone, sc->mem_cgroup, + 0, file); /* - * If we are direct reclaiming for contiguous pages and we do - * not reclaim everything in the list, try again and wait - * for IO to complete. This will stall high-order allocations - * but that should be acceptable to the caller + * mem_cgroup_isolate_pages() keeps track of + * scanned pages on its own. */ - if (nr_freed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + } - /* - * The attempt at page out may have made some - * of the pages active, mark them inactive again. 
- */ - nr_active = clear_active_flags(&page_list, count); - count_vm_events(PGDEACTIVATE, nr_active); + if (nr_taken == 0) { + spin_unlock_irq(&zone->lru_lock); + return 0; + } - nr_freed += shrink_page_list(&page_list, sc, - PAGEOUT_IO_SYNC); - } + update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list); - nr_reclaimed += nr_freed; + spin_unlock_irq(&zone->lru_lock); - local_irq_disable(); - if (current_is_kswapd()) - __count_vm_events(KSWAPD_STEAL, nr_freed); - __count_zone_vm_events(PGSTEAL, zone, nr_freed); + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); + + /* + * If we are direct reclaiming for contiguous pages and we do + * not reclaim everything in the list, try again and wait + * for IO to complete. This will stall high-order allocations + * but that should be acceptable to the caller + */ + if (nr_reclaimed < nr_taken && !current_is_kswapd() && + sc->lumpy_reclaim_mode) { + congestion_wait(BLK_RW_ASYNC, HZ/10); - spin_lock(&zone->lru_lock); /* - * Put back any unfreeable pages. + * The attempt at page out may have made some + * of the pages active, mark them inactive again. */ - while (!list_empty(&page_list)) { - int lru; - page = lru_to_page(&page_list); - VM_BUG_ON(PageLRU(page)); - list_del(&page->lru); - if (unlikely(!page_evictable(page, NULL))) { - spin_unlock_irq(&zone->lru_lock); - putback_lru_page(page); - spin_lock_irq(&zone->lru_lock); - continue; - } - SetPageLRU(page); - lru = page_lru(page); - add_page_to_lru_list(zone, page, lru); - if (is_active_lru(lru)) { - int file = is_file_lru(lru); - reclaim_stat->recent_rotated[file]++; - } - if (!pagevec_add(&pvec, page)) { - spin_unlock_irq(&zone->lru_lock); - __pagevec_release(&pvec); - spin_lock_irq(&zone->lru_lock); - } - } - __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon); - __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file); + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); - } while (nr_scanned < max_scan); + nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); + } -done: - spin_unlock_irq(&zone->lru_lock); - pagevec_release(&pvec); - return nr_reclaimed; -} + local_irq_disable(); + if (current_is_kswapd()) + __count_vm_events(KSWAPD_STEAL, nr_reclaimed); + __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed); -/* - * We are about to scan this zone at a certain priority level. If that priority - * level is smaller (ie: more urgent) than the previous priority, then note - * that priority level within the zone. This is done so that when the next - * process comes in to scan this zone, it will immediately start out at this - * priority level rather than having to build up its own scanning priority. - * Here, this priority affects only the reclaim-mapped threshold. 
- */ -static inline void note_zone_scanning_priority(struct zone *zone, int priority) -{ - if (priority < zone->prev_priority) - zone->prev_priority = priority; + putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list); + return nr_reclaimed; } /* @@ -1727,13 +1770,12 @@ static void shrink_zone(int priority, struct zone *zone, static bool shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { - enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); struct zoneref *z; struct zone *zone; bool all_unreclaimable = true; - for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, - sc->nodemask) { + for_each_zone_zonelist_nodemask(zone, z, zonelist, + gfp_zone(sc->gfp_mask), sc->nodemask) { if (!populated_zone(zone)) continue; /* @@ -1743,17 +1785,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist, if (scanning_global_lru(sc)) { if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) continue; - note_zone_scanning_priority(zone, priority); - if (zone->all_unreclaimable && priority != DEF_PRIORITY) continue; /* Let kswapd poll it */ - } else { - /* - * Ignore cpuset limitation here. We just want to reduce - * # of used pages by us regardless of memory shortage. - */ - mem_cgroup_note_reclaim_priority(sc->mem_cgroup, - priority); } shrink_zone(priority, zone, sc); @@ -1788,7 +1821,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, unsigned long lru_pages = 0; struct zoneref *z; struct zone *zone; - enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); unsigned long writeback_threshold; get_mems_allowed(); @@ -1800,7 +1832,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, * mem_cgroup will not do shrink_slab. */ if (scanning_global_lru(sc)) { - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { + for_each_zone_zonelist(zone, z, zonelist, + gfp_zone(sc->gfp_mask)) { if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) continue; @@ -1859,17 +1892,6 @@ out: if (priority < 0) priority = 0; - if (scanning_global_lru(sc)) { - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) { - - if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) - continue; - - zone->prev_priority = priority; - } - } else - mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority); - delayacct_freepages_end(); put_mems_allowed(); @@ -1886,6 +1908,7 @@ out: unsigned long try_to_free_pages(struct zonelist *zonelist, int order, gfp_t gfp_mask, nodemask_t *nodemask) { + unsigned long nr_reclaimed; struct scan_control sc = { .gfp_mask = gfp_mask, .may_writepage = !laptop_mode, @@ -1898,7 +1921,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .nodemask = nodemask, }; - return do_try_to_free_pages(zonelist, &sc); + trace_mm_vmscan_direct_reclaim_begin(order, + sc.may_writepage, + gfp_mask); + + nr_reclaimed = do_try_to_free_pages(zonelist, &sc); + + trace_mm_vmscan_direct_reclaim_end(nr_reclaimed); + + return nr_reclaimed; } #ifdef CONFIG_CGROUP_MEM_RES_CTLR @@ -2026,22 +2057,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) .order = order, .mem_cgroup = NULL, }; - /* - * temp_priority is used to remember the scanning priority at which - * this zone was successfully refilled to - * free_pages == high_wmark_pages(zone). 
- */ - int temp_priority[MAX_NR_ZONES]; - loop_again: total_scanned = 0; sc.nr_reclaimed = 0; sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); - for (i = 0; i < pgdat->nr_zones; i++) - temp_priority[i] = DEF_PRIORITY; - for (priority = DEF_PRIORITY; priority >= 0; priority--) { int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */ unsigned long lru_pages = 0; @@ -2109,9 +2130,7 @@ loop_again: if (zone->all_unreclaimable && priority != DEF_PRIORITY) continue; - temp_priority[i] = priority; sc.nr_scanned = 0; - note_zone_scanning_priority(zone, priority); nid = pgdat->node_id; zid = zone_idx(zone); @@ -2184,16 +2203,6 @@ loop_again: break; } out: - /* - * Note within each zone the priority level at which this zone was - * brought into a happy state. So that the next thread which scans this - * zone will start out at that priority level. - */ - for (i = 0; i < pgdat->nr_zones; i++) { - struct zone *zone = pgdat->node_zones + i; - - zone->prev_priority = temp_priority[i]; - } if (!all_zones_ok) { cond_resched(); @@ -2297,9 +2306,10 @@ static int kswapd(void *p) * premature sleep. If not, then go fully * to sleep until explicitly woken up */ - if (!sleeping_prematurely(pgdat, order, remaining)) + if (!sleeping_prematurely(pgdat, order, remaining)) { + trace_mm_vmscan_kswapd_sleep(pgdat->node_id); schedule(); - else { + } else { if (remaining) count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY); else @@ -2319,8 +2329,10 @@ static int kswapd(void *p) * We can speed up thawing tasks if we don't call balance_pgdat * after returning from the refrigerator */ - if (!ret) + if (!ret) { + trace_mm_vmscan_kswapd_wake(pgdat->node_id, order); balance_pgdat(pgdat, order); + } } return 0; } @@ -2340,6 +2352,7 @@ void wakeup_kswapd(struct zone *zone, int order) return; if (pgdat->kswapd_max_order < order) pgdat->kswapd_max_order = order; + trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order); if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL)) return; if (!waitqueue_active(&pgdat->kswapd_wait)) @@ -2609,7 +2622,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) */ priority = ZONE_RECLAIM_PRIORITY; do { - note_zone_scanning_priority(zone, priority); shrink_zone(priority, zone, &sc); priority--; } while (priority >= 0 && sc.nr_reclaimed < nr_pages); diff --git a/mm/vmstat.c b/mm/vmstat.c index 7759941..5c0b1b6 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat, } seq_printf(m, "\n all_unreclaimable: %u" - "\n prev_priority: %i" "\n start_pfn: %lu" "\n inactive_ratio: %u", zone->all_unreclaimable, - zone->prev_priority, zone->zone_start_pfn, zone->inactive_ratio); seq_putc(m, '\n'); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman 2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 13:24 ` Rik van Riel 2010-07-19 14:15 ` Christoph Hellwig 2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman ` (5 subsequent siblings) 7 siblings, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman It is useful to distinguish between IO for anon and file pages. This patch updates vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include that information. The patches can be merged together. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/trace/events/vmscan.h | 8 ++++++-- mm/vmscan.c | 1 + 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index f2da66a..110aea2 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -158,23 +158,27 @@ TRACE_EVENT(mm_vmscan_lru_isolate, TRACE_EVENT(mm_vmscan_writepage, TP_PROTO(struct page *page, + int file, int sync_io), - TP_ARGS(page, sync_io), + TP_ARGS(page, file, sync_io), TP_STRUCT__entry( __field(struct page *, page) + __field(int, file) __field(int, sync_io) ), TP_fast_assign( __entry->page = page; + __entry->file = file; __entry->sync_io = sync_io; ), - TP_printk("page=%p pfn=%lu sync_io=%d", + TP_printk("page=%p pfn=%lu file=%d sync_io=%d", __entry->page, page_to_pfn(__entry->page), + __entry->file, __entry->sync_io) ); diff --git a/mm/vmscan.c b/mm/vmscan.c index e6ddba9..6587155 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -400,6 +400,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, ClearPageReclaim(page); } trace_mm_vmscan_writepage(page, + page_is_file_cache(page), sync_writeback == PAGEOUT_IO_SYNC); inc_zone_page_state(page, NR_VMSCAN_WRITE); return PAGE_SUCCESS; -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages 2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman @ 2010-07-19 13:24 ` Rik van Riel 2010-07-19 14:15 ` Christoph Hellwig 1 sibling, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 13:24 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > It is useful to distinguish between IO for anon and file pages. This > patch updates > vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include > that information. The patches can be merged together. > > Signed-off-by: Mel Gorman<mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages 2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman 2010-07-19 13:24 ` Rik van Riel @ 2010-07-19 14:15 ` Christoph Hellwig 2010-07-19 14:24 ` Mel Gorman 1 sibling, 1 reply; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:15 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:24PM +0100, Mel Gorman wrote: > It is useful to distinguish between IO for anon and file pages. This > patch updates > vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include > that information. The patches can be merged together. I think the trace would be nicer if you #define flags for both cases and then use __print_flags on them. That'll also make it more extensible in case we need to add more flags later. And a purely procedural question: This is supposed to get rolled into the original patch before it gets commited to a git tree, right? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages 2010-07-19 14:15 ` Christoph Hellwig @ 2010-07-19 14:24 ` Mel Gorman 2010-07-19 14:26 ` Christoph Hellwig 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-19 14:24 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:15:01AM -0400, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:24PM +0100, Mel Gorman wrote: > > It is useful to distinguish between IO for anon and file pages. This > > patch updates > > vmscan-tracing-add-trace-event-when-a-page-is-written.patch to include > > that information. The patches can be merged together. > > I think the trace would be nicer if you #define flags for both > cases and then use __print_flags on them. That'll also make it more > extensible in case we need to add more flags later. > Not a bad idea, I'll check it out. Thanks. The first flags would be; RECLAIM_WB_ANON RECLAIM_WB_FILE Does anyone have problems with the naming? > And a purely procedural question: This is supposed to get rolled into > the original patch before it gets commited to a git tree, right? > That is my expectation. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages 2010-07-19 14:24 ` Mel Gorman @ 2010-07-19 14:26 ` Christoph Hellwig 0 siblings, 0 replies; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:26 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 03:24:36PM +0100, Mel Gorman wrote: > Not a bad idea, I'll check it out. Thanks. The first flags would be; > > RECLAIM_WB_ANON > RECLAIM_WB_FILE > > Does anyone have problems with the naming? The names look fine to me. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
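To make the __print_flags suggestion concrete, here is a minimal sketch of how the trace event could look using the RECLAIM_WB_ANON and RECLAIM_WB_FILE names proposed above. This is illustration only, not code posted in this series: the bit values, the show_reclaim_flags() helper and the idea of folding sync_io into the same flags word are assumptions.

/* Assumed flag values; only the names were discussed in this thread */
#define RECLAIM_WB_ANON		0x0001u
#define RECLAIM_WB_FILE		0x0002u
#define RECLAIM_WB_SYNC		0x0004u
#define RECLAIM_WB_ASYNC	0x0008u

#define show_reclaim_flags(flags)				\
	__print_flags(flags, "|",				\
		{RECLAIM_WB_ANON,	"RECLAIM_WB_ANON"},	\
		{RECLAIM_WB_FILE,	"RECLAIM_WB_FILE"},	\
		{RECLAIM_WB_SYNC,	"RECLAIM_WB_SYNC"},	\
		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
		)

TRACE_EVENT(mm_vmscan_writepage,

	TP_PROTO(struct page *page, int reclaim_flags),

	TP_ARGS(page, reclaim_flags),

	TP_STRUCT__entry(
		__field(struct page *, page)
		__field(int, reclaim_flags)
	),

	TP_fast_assign(
		__entry->page = page;
		__entry->reclaim_flags = reclaim_flags;
	),

	TP_printk("page=%p pfn=%lu flags=%s",
		__entry->page,
		page_to_pfn(__entry->page),
		show_reclaim_flags(__entry->reclaim_flags))
);

The call site in pageout() would then collapse the file/sync information into a single argument, along the lines of:

	trace_mm_vmscan_writepage(page,
		(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) |
		(sync_writeback == PAGEOUT_IO_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC));

Adding another reason for writeback later would then only need a new flag and an extra entry in show_reclaim_flags() rather than another trace event argument.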
* [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman 2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman 2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 13:32 ` Rik van Riel 2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman ` (4 subsequent siblings) 7 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman It is useful to distinguish between IO for anon and file pages. This patch updates vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch so the post-processing script can handle the additional information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- .../trace/postprocess/trace-vmscan-postprocess.pl | 89 +++++++++++++------- 1 files changed, 57 insertions(+), 32 deletions(-) diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl index d1ddc33..7795a9b 100644 --- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl +++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -21,9 +21,12 @@ use constant MM_VMSCAN_KSWAPD_SLEEP => 4; use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5; use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6; use constant MM_VMSCAN_LRU_ISOLATE => 7; -use constant MM_VMSCAN_WRITEPAGE_SYNC => 8; -use constant MM_VMSCAN_WRITEPAGE_ASYNC => 9; -use constant EVENT_UNKNOWN => 10; +use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8; +use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9; +use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10; +use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11; +use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12; +use constant EVENT_UNKNOWN => 13; # Per-order events use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11; @@ -55,9 +58,11 @@ my $opt_read_procstat; my $total_wakeup_kswapd; my ($total_direct_reclaim, $total_direct_nr_scanned); my ($total_direct_latency, $total_kswapd_latency); -my ($total_direct_writepage_sync, $total_direct_writepage_async); +my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); +my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); my ($total_kswapd_nr_scanned, $total_kswapd_wake); -my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async); +my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); +my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); # Catch sigint and exit on request my $sigint_report = 0; @@ -101,7 +106,7 @@ my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)'; my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)'; my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) 
nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; -my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)'; +my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) file=([0-9]) sync_io=([0-9]*)'; # Dyanically discovered regex my $regex_direct_begin; @@ -209,7 +214,7 @@ $regex_lru_shrink_active = generate_traceevent_regex( $regex_writepage = generate_traceevent_regex( "vmscan/mm_vmscan_writepage", $regex_writepage_default, - "page", "pfn", "sync_io"); + "page", "pfn", "file", "sync_io"); sub read_statline($) { my $pid = $_[0]; @@ -379,11 +384,20 @@ EVENT_PROCESS: next; } - my $sync_io = $3; + my $file = $3; + my $sync_io = $4; if ($sync_io) { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++; + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++; + } } else { - $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++; + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++; + } } } else { $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; @@ -427,7 +441,7 @@ sub dump_stats { while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] || defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) { - if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid; my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); $total_direct_latency += $latency; @@ -454,8 +468,11 @@ sub dump_stats { $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; - $total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; - $total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + + $total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; my $index = 0; my $this_reclaim_delay = 0; @@ -470,8 +487,8 @@ sub dump_stats { $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, $stats{$process_pid}->{HIGH_NR_SCANNED}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, $this_reclaim_delay / 1000); if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { @@ -515,16 +532,18 @@ sub dump_stats { $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; - $total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; - $total_kswapd_writepage_async += 
$stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; printf("%-" . $max_strlen . "s %8d %10d %8u %8i %8u", $process_pid, $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, $stats{$process_pid}->{HIGH_NR_SCANNED}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}, - $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}); + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { print " "; @@ -551,18 +570,22 @@ sub dump_stats { $total_direct_latency /= 1000; $total_kswapd_latency /= 1000; print "\nSummary\n"; - print "Direct reclaims: $total_direct_reclaim\n"; - print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; - print "Direct reclaim write sync I/O: $total_direct_writepage_sync\n"; - print "Direct reclaim write async I/O: $total_direct_writepage_async\n"; - print "Wake kswapd requests: $total_wakeup_kswapd\n"; - printf "Time stalled direct reclaim: %-1.2f ms\n", $total_direct_latency; + print "Direct reclaims: $total_direct_reclaim\n"; + print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; + print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; + print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; + print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n"; + print "Wake kswapd requests: $total_wakeup_kswapd\n"; + printf "Time stalled direct reclaim: %-1.2f ms\n", $total_direct_latency; print "\n"; - print "Kswapd wakeups: $total_kswapd_wake\n"; - print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; - print "Kswapd reclaim write sync I/O: $total_kswapd_writepage_sync\n"; - print "Kswapd reclaim write async I/O: $total_kswapd_writepage_async\n"; - printf "Time kswapd awake: %-1.2f ms\n", $total_kswapd_latency; + print "Kswapd wakeups: $total_kswapd_wake\n"; + print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; + print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; + print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; + print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n"; + printf "Time kswapd awake: %-1.2f ms\n", $total_kswapd_latency; } sub aggregate_perprocesspid() { @@ -582,8 +605,10 @@ sub aggregate_perprocesspid() { $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}; - $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += 
$perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; for (my $order = 0; $order < 20; $order++) { $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim 2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman @ 2010-07-19 13:32 ` Rik van Riel 0 siblings, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 13:32 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > It is useful to distinguish between IO for anon and file pages. This patch > updates > vmscan-tracing-add-a-postprocessing-script-for-reclaim-related-ftrace-events.patch > so the post-processing script can handle the additional information. > > Signed-off-by: Mel Gorman<mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman ` (2 preceding siblings ...) 2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 14:19 ` Christoph Hellwig ` (2 more replies) 2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman ` (3 subsequent siblings) 7 siblings, 3 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman When memory is under enough pressure, a process may enter direct reclaim to free pages in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This can result in very deep call stacks, particularly if the target storage or filesystem are complex. It has already been observed on XFS that the stack overflows but the problem is not XFS-specific. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd or the page is anonymous before writing back. If the dirty pages cannot be written back, they are placed back on the LRU lists for either background writing by the BDI threads or kswapd. If in direct lumpy reclaim and dirty pages are encountered, the process will stall for the background flusher before trying to reclaim the pages again. As the call-chain for writing anonymous pages is not expected to be deep and they are not cleaned by flusher threads, anonymous pages are still written back in direct reclaim. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/vmscan.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 files changed, 109 insertions(+), 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 6587155..bc50937 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -323,6 +323,61 @@ typedef enum { PAGE_CLEAN, } pageout_t; +int write_reclaim_page(struct page *page, struct address_space *mapping, + enum pageout_io sync_writeback) +{ + int res; + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + .nr_to_write = SWAP_CLUSTER_MAX, + .range_start = 0, + .range_end = LLONG_MAX, + .nonblocking = 1, + .for_reclaim = 1, + }; + + if (!clear_page_dirty_for_io(page)) + return PAGE_CLEAN; + + SetPageReclaim(page); + res = mapping->a_ops->writepage(page, &wbc); + if (res < 0) + handle_write_error(mapping, page, res); + if (res == AOP_WRITEPAGE_ACTIVATE) { + ClearPageReclaim(page); + return PAGE_ACTIVATE; + } + + /* + * Wait on writeback if requested to. This happens when + * direct reclaiming a large contiguous area and the + * first attempt to free a range of pages fails. + */ + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) + wait_on_page_writeback(page); + + if (!PageWriteback(page)) { + /* synchronous write or broken a_ops? 
*/ + ClearPageReclaim(page); + } + trace_mm_vmscan_writepage(page, + page_is_file_cache(page), + sync_writeback == PAGEOUT_IO_SYNC); + inc_zone_page_state(page, NR_VMSCAN_WRITE); + + return PAGE_SUCCESS; +} + +/* + * For now, only kswapd can writeback filesystem pages as otherwise + * there is a stack overflow risk + */ +static inline bool reclaim_can_writeback(struct scan_control *sc, + struct page *page) +{ + return !page_is_file_cache(page) || current_is_kswapd(); +} + /* * pageout is called by shrink_page_list() for each dirty page. * Calls ->writepage(). @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, return PAGE_SUCCESS; } - return PAGE_CLEAN; + return write_reclaim_page(page, mapping, sync_writeback); } /* @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) pagevec_free(&freed_pvec); } +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ +#define MAX_SWAP_CLEAN_WAIT 50 + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, struct scan_control *sc, enum pageout_io sync_writeback) { - LIST_HEAD(ret_pages); LIST_HEAD(free_pages); - int pgactivate = 0; + LIST_HEAD(putback_pages); + LIST_HEAD(dirty_pages); + int pgactivate; + int dirty_isolated = 0; + unsigned long nr_dirty; unsigned long nr_reclaimed = 0; + pgactivate = 0; cond_resched(); +restart_dirty: + nr_dirty = 0; while (!list_empty(page_list)) { enum page_references references; struct address_space *mapping; @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, } } - if (PageDirty(page)) { + if (PageDirty(page)) { + /* + * If the caller cannot writeback pages, dirty pages + * are put on a separate list for cleaning by either + * a flusher thread or kswapd + */ + if (!reclaim_can_writeback(sc, page)) { + list_add(&page->lru, &dirty_pages); + unlock_page(page); + nr_dirty++; + goto keep_dirty; + } + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; if (!may_enter_fs) @@ -852,13 +928,39 @@ activate_locked: keep_locked: unlock_page(page); keep: - list_add(&page->lru, &ret_pages); + list_add(&page->lru, &putback_pages); +keep_dirty: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { + /* + * Wakeup a flusher thread to clean at least as many dirty + * pages as encountered by direct reclaim. Wait on congestion + * to throttle processes cleaning dirty pages + */ + wakeup_flusher_threads(nr_dirty); + congestion_wait(BLK_RW_ASYNC, HZ/10); + + /* + * As lumpy reclaim and memcg targets specific pages, wait on + * them to be cleaned and try reclaim again. + */ + if (sync_writeback == PAGEOUT_IO_SYNC || + sc->mem_cgroup != NULL) { + dirty_isolated++; + list_splice(&dirty_pages, page_list); + INIT_LIST_HEAD(&dirty_pages); + goto restart_dirty; + } + } + free_page_list(&free_pages); - list_splice(&ret_pages, page_list); + if (!list_empty(&dirty_pages)) + list_splice(&dirty_pages, page_list); + list_splice(&putback_pages, page_list); + count_vm_events(PGACTIVATE, pgactivate); return nr_reclaimed; } -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
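For background on why ->writepage from direct reclaim is such a problem, below is a sketch of the kind of guard a filesystem writepage implementation has used to refuse I/O issued from reclaim context; it is checks of this sort that the later patches in the series ("Allow kswapd to writeback pages") relax once direct reclaim no longer calls into the filesystem. This is an illustrative sketch, not code from the series: example_writepage() and example_get_block() are made-up names, the exact predicate varies by filesystem (PF_MEMALLOC, wbc->for_reclaim and so on), while wbc->for_reclaim, current_is_kswapd(), redirty_page_for_writepage() and block_write_full_page() are stock kernel interfaces of this era.

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/swap.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>

/* The filesystem's get_block routine; assumed to be defined elsewhere */
int example_get_block(struct inode *inode, sector_t iblock,
		      struct buffer_head *bh_result, int create);

static int example_writepage(struct page *page, struct writeback_control *wbc)
{
	/*
	 * Page reclaim sets wbc->for_reclaim.  If the caller is not
	 * kswapd, it is a direct reclaimer with an arbitrarily deep
	 * stack above it, so redirty the page and leave it for the
	 * flusher threads or kswapd rather than risk overflowing.
	 */
	if (wbc->for_reclaim && !current_is_kswapd()) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}

	/* Normal path: map the blocks and submit the I/O */
	return block_write_full_page(page, example_get_block, wbc);
}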
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman @ 2010-07-19 14:19 ` Christoph Hellwig 2010-07-19 14:26 ` Mel Gorman 2010-07-19 18:25 ` Rik van Riel 2010-07-19 22:14 ` Johannes Weiner 2 siblings, 1 reply; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:19 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote: > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. While it is not quite as deep as it skips the filesystem allocator and extent mapping code it can still be quite deep for swap given that it still has to traverse the whole I/O stack. Probably not worth worrying about now, but we need to keep an eye on it. The patch looks fine to me anyway. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 14:19 ` Christoph Hellwig @ 2010-07-19 14:26 ` Mel Gorman 0 siblings, 0 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 14:26 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:19:34AM -0400, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote: > > As the call-chain for writing anonymous pages is not expected to be deep > > and they are not cleaned by flusher threads, anonymous pages are still > > written back in direct reclaim. > > While it is not quite as deep as it skips the filesystem allocator and > extent mapping code it can still be quite deep for swap given that it > still has to traverse the whole I/O stack. Probably not worth worrying > about now, but we need to keep an eye on it. > Agreed that we need to keep an eye on it. If this ever becomes a problem, we're going to need to consider a flusher for anonymous pages. If you look at the figures, we are still doing a lot of writeback of anonymous pages. Granted, the layout of swap sucks anyway but it's something to keep at the back of the mind. > The patch looks fine to me anyway. > Thanks. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman 2010-07-19 14:19 ` Christoph Hellwig @ 2010-07-19 18:25 ` Rik van Riel 2010-07-19 22:14 ` Johannes Weiner 2 siblings, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 18:25 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > When memory is under enough pressure, a process may enter direct > reclaim to free pages in the same manner kswapd does. If a dirty page is > encountered during the scan, this page is written to backing storage using > mapping->writepage. This can result in very deep call stacks, particularly > if the target storage or filesystem are complex. It has already been observed > on XFS that the stack overflows but the problem is not XFS-specific. > > This patch prevents direct reclaim writing back filesystem pages by checking > if current is kswapd or the page is anonymous before writing back. If the > dirty pages cannot be written back, they are placed back on the LRU lists > for either background writing by the BDI threads or kswapd. If in direct > lumpy reclaim and dirty pages are encountered, the process will stall for > the background flusher before trying to reclaim the pages again. > > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. > > Signed-off-by: Mel Gorman<mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman 2010-07-19 14:19 ` Christoph Hellwig 2010-07-19 18:25 ` Rik van Riel @ 2010-07-19 22:14 ` Johannes Weiner 2010-07-20 13:45 ` Mel Gorman 2 siblings, 1 reply; 87+ messages in thread From: Johannes Weiner @ 2010-07-19 22:14 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli Hi Mel, On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote: > @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, > return PAGE_SUCCESS; > } Did you forget to delete the worker code from pageout() which is now in write_reclaim_page()? > - return PAGE_CLEAN; > + return write_reclaim_page(page, mapping, sync_writeback); > } > > /* > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > pagevec_free(&freed_pvec); > } > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > +#define MAX_SWAP_CLEAN_WAIT 50 > + > /* > * shrink_page_list() returns the number of reclaimed pages > */ > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > struct scan_control *sc, > enum pageout_io sync_writeback) > { > - LIST_HEAD(ret_pages); > LIST_HEAD(free_pages); > - int pgactivate = 0; > + LIST_HEAD(putback_pages); > + LIST_HEAD(dirty_pages); > + int pgactivate; > + int dirty_isolated = 0; > + unsigned long nr_dirty; > unsigned long nr_reclaimed = 0; > > + pgactivate = 0; > cond_resched(); > > +restart_dirty: > + nr_dirty = 0; > while (!list_empty(page_list)) { > enum page_references references; > struct address_space *mapping; > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > } > > - if (PageDirty(page)) { > + if (PageDirty(page)) { > + /* > + * If the caller cannot writeback pages, dirty pages > + * are put on a separate list for cleaning by either > + * a flusher thread or kswapd > + */ > + if (!reclaim_can_writeback(sc, page)) { > + list_add(&page->lru, &dirty_pages); > + unlock_page(page); > + nr_dirty++; > + goto keep_dirty; > + } > + > if (references == PAGEREF_RECLAIM_CLEAN) > goto keep_locked; > if (!may_enter_fs) > @@ -852,13 +928,39 @@ activate_locked: > keep_locked: > unlock_page(page); > keep: > - list_add(&page->lru, &ret_pages); > + list_add(&page->lru, &putback_pages); > +keep_dirty: > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > } > > + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { > + /* > + * Wakeup a flusher thread to clean at least as many dirty > + * pages as encountered by direct reclaim. Wait on congestion > + * to throttle processes cleaning dirty pages > + */ > + wakeup_flusher_threads(nr_dirty); > + congestion_wait(BLK_RW_ASYNC, HZ/10); > + > + /* > + * As lumpy reclaim and memcg targets specific pages, wait on > + * them to be cleaned and try reclaim again. > + */ > + if (sync_writeback == PAGEOUT_IO_SYNC || > + sc->mem_cgroup != NULL) { > + dirty_isolated++; > + list_splice(&dirty_pages, page_list); > + INIT_LIST_HEAD(&dirty_pages); > + goto restart_dirty; > + } > + } I think it would turn out more natural to just return dirty pages on page_list and have the whole looping logic in shrink_inactive_list(). 
Mixing dirty pages with other 'please try again' pages is probably not so bad anyway, it means we could retry all temporary unavailable pages instead of twiddling thumbs over that particular bunch of pages until the flushers catch up. What do you think? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-19 22:14 ` Johannes Weiner @ 2010-07-20 13:45 ` Mel Gorman 2010-07-20 22:02 ` Johannes Weiner 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-20 13:45 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote: > Hi Mel, > > On Mon, Jul 19, 2010 at 02:11:26PM +0100, Mel Gorman wrote: > > @@ -406,7 +461,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, > > return PAGE_SUCCESS; > > } > > Did you forget to delete the worker code from pageout() which is now > in write_reclaim_page()? > Damn, a snarl during the final rebase when collapsing patches together that I missed when re-reading. Sorry :( > > - return PAGE_CLEAN; > > + return write_reclaim_page(page, mapping, sync_writeback); > > } > > > > /* > > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > pagevec_free(&freed_pvec); > > } > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > +#define MAX_SWAP_CLEAN_WAIT 50 > > + > > /* > > * shrink_page_list() returns the number of reclaimed pages > > */ > > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > struct scan_control *sc, > > enum pageout_io sync_writeback) > > { > > - LIST_HEAD(ret_pages); > > LIST_HEAD(free_pages); > > - int pgactivate = 0; > > + LIST_HEAD(putback_pages); > > + LIST_HEAD(dirty_pages); > > + int pgactivate; > > + int dirty_isolated = 0; > > + unsigned long nr_dirty; > > unsigned long nr_reclaimed = 0; > > > > + pgactivate = 0; > > cond_resched(); > > > > +restart_dirty: > > + nr_dirty = 0; > > while (!list_empty(page_list)) { > > enum page_references references; > > struct address_space *mapping; > > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > } > > } > > > > - if (PageDirty(page)) { > > + if (PageDirty(page)) { > > + /* > > + * If the caller cannot writeback pages, dirty pages > > + * are put on a separate list for cleaning by either > > + * a flusher thread or kswapd > > + */ > > + if (!reclaim_can_writeback(sc, page)) { > > + list_add(&page->lru, &dirty_pages); > > + unlock_page(page); > > + nr_dirty++; > > + goto keep_dirty; > > + } > > + > > if (references == PAGEREF_RECLAIM_CLEAN) > > goto keep_locked; > > if (!may_enter_fs) > > @@ -852,13 +928,39 @@ activate_locked: > > keep_locked: > > unlock_page(page); > > keep: > > - list_add(&page->lru, &ret_pages); > > + list_add(&page->lru, &putback_pages); > > +keep_dirty: > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > } > > > > + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { > > + /* > > + * Wakeup a flusher thread to clean at least as many dirty > > + * pages as encountered by direct reclaim. Wait on congestion > > + * to throttle processes cleaning dirty pages > > + */ > > + wakeup_flusher_threads(nr_dirty); > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > + > > + /* > > + * As lumpy reclaim and memcg targets specific pages, wait on > > + * them to be cleaned and try reclaim again. 
> > + */ > > + if (sync_writeback == PAGEOUT_IO_SYNC || > > + sc->mem_cgroup != NULL) { > > + dirty_isolated++; > > + list_splice(&dirty_pages, page_list); > > + INIT_LIST_HEAD(&dirty_pages); > > + goto restart_dirty; > > + } > > + } > > I think it would turn out more natural to just return dirty pages on > page_list and have the whole looping logic in shrink_inactive_list(). > > Mixing dirty pages with other 'please try again' pages is probably not > so bad anyway, it means we could retry all temporary unavailable pages > instead of twiddling thumbs over that particular bunch of pages until > the flushers catch up. > > What do you think? > It's worth considering! It won't be very tidy but it's workable. The reason it is not tidy is that dirty pages and pages that couldn't be paged will be on the same list so they whole lot will need to be recycled. We'd record in scan_control though that there were pages that need to be retried and loop based on that value. That is managable though. The reason why I did it this way was because of lumpy reclaim and memcg requiring specific pages. I considered lumpy reclaim to be the more common case. In that case, it's removing potentially a large number of pages from the LRU that are contiguous. If some of those are dirty and it selects more contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the system even worse than it currently does when the system is under load. Hence, this wait and retry loop is done instead of returning and isolating more pages. For memcg, the concern was different. It is depending on flusher threads to clean its pages, kswapd does not operate on the list and it can't clean pages itself because the stack may overflow. If the memcg has many dirty pages, one process in the container could isolate all the dirty pages in the list forcing others to reclaim clean pages regardless of age. This could be very disruptive but looping like this throttling processes that encounter dirty pages instead of isolating more. For lumpy, I don't think we should return and isolate more pages, it's too disruptive. For memcg, I think it could possibly get an advantage but there is a nasty corner case if the container is mostly dirty - it depends on how memcg handles dirty_ratio I guess. Is it worth it at this point? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-20 13:45 ` Mel Gorman @ 2010-07-20 22:02 ` Johannes Weiner 2010-07-21 11:36 ` Johannes Weiner 2010-07-21 11:52 ` Mel Gorman 0 siblings, 2 replies; 87+ messages in thread From: Johannes Weiner @ 2010-07-20 22:02 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote: > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote: > > > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > > pagevec_free(&freed_pvec); > > > } > > > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > > +#define MAX_SWAP_CLEAN_WAIT 50 > > > + > > > /* > > > * shrink_page_list() returns the number of reclaimed pages > > > */ > > > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > struct scan_control *sc, > > > enum pageout_io sync_writeback) > > > { > > > - LIST_HEAD(ret_pages); > > > LIST_HEAD(free_pages); > > > - int pgactivate = 0; > > > + LIST_HEAD(putback_pages); > > > + LIST_HEAD(dirty_pages); > > > + int pgactivate; > > > + int dirty_isolated = 0; > > > + unsigned long nr_dirty; > > > unsigned long nr_reclaimed = 0; > > > > > > + pgactivate = 0; > > > cond_resched(); > > > > > > +restart_dirty: > > > + nr_dirty = 0; > > > while (!list_empty(page_list)) { > > > enum page_references references; > > > struct address_space *mapping; > > > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > } > > > } > > > > > > - if (PageDirty(page)) { > > > + if (PageDirty(page)) { > > > + /* > > > + * If the caller cannot writeback pages, dirty pages > > > + * are put on a separate list for cleaning by either > > > + * a flusher thread or kswapd > > > + */ > > > + if (!reclaim_can_writeback(sc, page)) { > > > + list_add(&page->lru, &dirty_pages); > > > + unlock_page(page); > > > + nr_dirty++; > > > + goto keep_dirty; > > > + } > > > + > > > if (references == PAGEREF_RECLAIM_CLEAN) > > > goto keep_locked; > > > if (!may_enter_fs) > > > @@ -852,13 +928,39 @@ activate_locked: > > > keep_locked: > > > unlock_page(page); > > > keep: > > > - list_add(&page->lru, &ret_pages); > > > + list_add(&page->lru, &putback_pages); > > > +keep_dirty: > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > } > > > > > > + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { > > > + /* > > > + * Wakeup a flusher thread to clean at least as many dirty > > > + * pages as encountered by direct reclaim. Wait on congestion > > > + * to throttle processes cleaning dirty pages > > > + */ > > > + wakeup_flusher_threads(nr_dirty); > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > + > > > + /* > > > + * As lumpy reclaim and memcg targets specific pages, wait on > > > + * them to be cleaned and try reclaim again. > > > + */ > > > + if (sync_writeback == PAGEOUT_IO_SYNC || > > > + sc->mem_cgroup != NULL) { > > > + dirty_isolated++; > > > + list_splice(&dirty_pages, page_list); > > > + INIT_LIST_HEAD(&dirty_pages); > > > + goto restart_dirty; > > > + } > > > + } > > > > I think it would turn out more natural to just return dirty pages on > > page_list and have the whole looping logic in shrink_inactive_list(). 
> > > > Mixing dirty pages with other 'please try again' pages is probably not > > so bad anyway, it means we could retry all temporary unavailable pages > > instead of twiddling thumbs over that particular bunch of pages until > > the flushers catch up. > > > > What do you think? > > > > It's worth considering! It won't be very tidy but it's workable. The reason > it is not tidy is that dirty pages and pages that couldn't be paged will be > on the same list so they whole lot will need to be recycled. We'd record in > scan_control though that there were pages that need to be retried and loop > based on that value. That is managable though. Recycling all of them is what I had in mind, yeah. But... > The reason why I did it this way was because of lumpy reclaim and memcg > requiring specific pages. I considered lumpy reclaim to be the more common > case. In that case, it's removing potentially a large number of pages from > the LRU that are contiguous. If some of those are dirty and it selects more > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the > system even worse than it currently does when the system is under load. Hence, > this wait and retry loop is done instead of returning and isolating more pages. I think here we missed each other. I don't want the loop to be _that_ much more in the outer scope that isolation is repeated as well. What I had in mind is the attached patch. It is not tested and hacked up rather quickly due to time constraints, sorry, but you should get the idea. I hope I did not miss anything fundamental. Note that since only kswapd enters pageout() anymore, everything depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync cycles for kswapd. Just to mitigate the WTF-count on the patch :-) Hannes --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -386,21 +386,17 @@ static pageout_t pageout(struct page *pa ClearPageReclaim(page); return PAGE_ACTIVATE; } - - /* - * Wait on writeback if requested to. This happens when - * direct reclaiming a large contiguous area and the - * first attempt to free a range of pages fails. - */ - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) - wait_on_page_writeback(page); - if (!PageWriteback(page)) { /* synchronous write or broken a_ops? */ ClearPageReclaim(page); } trace_mm_vmscan_writepage(page, page_is_file_cache(page), + /* + * Humm. Only kswapd comes here and for + * kswapd there never is a PAGEOUT_IO_SYNC + * cycle... + */ sync_writeback == PAGEOUT_IO_SYNC); inc_zone_page_state(page, NR_VMSCAN_WRITE); return PAGE_SUCCESS; @@ -643,12 +639,14 @@ static noinline_for_stack void free_page * shrink_page_list() returns the number of reclaimed pages */ static unsigned long shrink_page_list(struct list_head *page_list, - struct scan_control *sc, - enum pageout_io sync_writeback) + struct scan_control *sc, + enum pageout_io sync_writeback, + int *dirty_seen) { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); int pgactivate = 0; + unsigned long nr_dirty = 0; unsigned long nr_reclaimed = 0; cond_resched(); @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st enum page_references references; struct address_space *mapping; struct page *page; - int may_enter_fs; + int may_pageout; cond_resched(); @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st if (page_mapped(page) || PageSwapCache(page)) sc->nr_scanned++; - may_enter_fs = (sc->gfp_mask & __GFP_FS) || + /* + * To prevent stack overflows, only kswapd can enter + * the filesystem. Swap IO is always fine (for now). 
+ */ + may_pageout = current_is_kswapd() || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); if (PageWriteback(page)) { + int may_wait; /* * Synchronous reclaim is performed in two passes, * first an asynchronous pass over the list to @@ -693,7 +696,8 @@ static unsigned long shrink_page_list(st * for any page for which writeback has already * started. */ - if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) + may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout; + if (sync_writeback == PAGEOUT_IO_SYNC && may_wait) wait_on_page_writeback(page); else goto keep_locked; @@ -719,7 +723,7 @@ static unsigned long shrink_page_list(st goto keep_locked; if (!add_to_swap(page)) goto activate_locked; - may_enter_fs = 1; + may_pageout = 1; } mapping = page_mapping(page); @@ -742,9 +746,11 @@ static unsigned long shrink_page_list(st } if (PageDirty(page)) { + nr_dirty++; + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; - if (!may_enter_fs) + if (!may_pageout) goto keep_locked; if (!sc->may_writepage) goto keep_locked; @@ -860,6 +866,7 @@ keep: list_splice(&ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); + *dirty_seen = nr_dirty; return nr_reclaimed; } @@ -1232,6 +1239,9 @@ static noinline_for_stack void update_is reclaim_stat->recent_scanned[1] += *nr_file; } +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ +#define MAX_SWAP_CLEAN_WAIT 50 + /* * shrink_inactive_list() is a helper for shrink_zone(). It returns the number * of reclaimed pages @@ -1247,6 +1257,7 @@ shrink_inactive_list(unsigned long nr_to unsigned long nr_active; unsigned long nr_anon; unsigned long nr_file; + unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1295,26 +1306,34 @@ shrink_inactive_list(unsigned long nr_to spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); - + nr_reclaimed = shrink_page_list(&page_list, sc, + PAGEOUT_IO_ASYNC, + &nr_dirty); /* * If we are direct reclaiming for contiguous pages and we do * not reclaim everything in the list, try again and wait * for IO to complete. This will stall high-order allocations * but that should be acceptable to the caller */ - if (nr_reclaimed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) { + int dirty_retry = MAX_SWAP_CLEAN_WAIT; - /* - * The attempt at page out may have made some - * of the pages active, mark them inactive again. - */ - nr_active = clear_active_flags(&page_list, NULL); - count_vm_events(PGDEACTIVATE, nr_active); + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { + wakeup_flusher_threads(nr_dirty); + congestion_wait(BLK_RW_ASYNC, HZ/10); + /* + * The attempt at page out may have made some + * of the pages active, mark them inactive again. + * + * Humm. Still needed? + */ + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); + nr_reclaimed += shrink_page_list(&page_list, sc, + PAGEOUT_IO_SYNC, + &nr_dirty); + } } local_irq_disable(); ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-20 22:02 ` Johannes Weiner @ 2010-07-21 11:36 ` Johannes Weiner 2010-07-21 11:52 ` Mel Gorman 1 sibling, 0 replies; 87+ messages in thread From: Johannes Weiner @ 2010-07-21 11:36 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote: > On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote: > > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote: > > > I think it would turn out more natural to just return dirty pages on > > > page_list and have the whole looping logic in shrink_inactive_list(). > > > > > > Mixing dirty pages with other 'please try again' pages is probably not > > > so bad anyway, it means we could retry all temporary unavailable pages > > > instead of twiddling thumbs over that particular bunch of pages until > > > the flushers catch up. > > > > > > What do you think? > > > [...] > > The reason why I did it this way was because of lumpy reclaim and memcg > > requiring specific pages. I considered lumpy reclaim to be the more common > > case. In that case, it's removing potentially a large number of pages from > > the LRU that are contiguous. If some of those are dirty and it selects more > > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the > > system even worse than it currently does when the system is under load. Hence, > > this wait and retry loop is done instead of returning and isolating more pages. > > I think here we missed each other. I don't want the loop to be _that_ > much more in the outer scope that isolation is repeated as well. What > I had in mind is the attached patch. It is not tested and hacked up > rather quickly due to time constraints, sorry, but you should get the > idea. I hope I did not miss anything fundamental. > > Note that since only kswapd enters pageout() anymore, everything > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync > cycles for kswapd. Just to mitigate the WTF-count on the patch :-) Aaaaand direct reclaimers for swap, of course. Selfslap. Here is the patch again, sans the first hunk (and the type of @dirty_seen fixed): --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -643,12 +643,14 @@ static noinline_for_stack void free_page * shrink_page_list() returns the number of reclaimed pages */ static unsigned long shrink_page_list(struct list_head *page_list, - struct scan_control *sc, - enum pageout_io sync_writeback) + struct scan_control *sc, + enum pageout_io sync_writeback, + unsigned long *dirty_seen) { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); int pgactivate = 0; + unsigned long nr_dirty = 0; unsigned long nr_reclaimed = 0; cond_resched(); @@ -657,7 +659,7 @@ static unsigned long shrink_page_list(st enum page_references references; struct address_space *mapping; struct page *page; - int may_enter_fs; + int may_pageout; cond_resched(); @@ -681,10 +683,15 @@ static unsigned long shrink_page_list(st if (page_mapped(page) || PageSwapCache(page)) sc->nr_scanned++; - may_enter_fs = (sc->gfp_mask & __GFP_FS) || + /* + * To prevent stack overflows, only kswapd can enter + * the filesystem. Swap IO is always fine (for now). 
+ */ + may_pageout = current_is_kswapd() || (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); if (PageWriteback(page)) { + int may_wait; /* * Synchronous reclaim is performed in two passes, * first an asynchronous pass over the list to @@ -693,7 +700,8 @@ static unsigned long shrink_page_list(st * for any page for which writeback has already * started. */ - if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) + may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout; + if (sync_writeback == PAGEOUT_IO_SYNC && may_wait) wait_on_page_writeback(page); else goto keep_locked; @@ -719,7 +727,7 @@ static unsigned long shrink_page_list(st goto keep_locked; if (!add_to_swap(page)) goto activate_locked; - may_enter_fs = 1; + may_pageout = 1; } mapping = page_mapping(page); @@ -742,9 +750,11 @@ static unsigned long shrink_page_list(st } if (PageDirty(page)) { + nr_dirty++; + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; - if (!may_enter_fs) + if (!may_pageout) goto keep_locked; if (!sc->may_writepage) goto keep_locked; @@ -860,6 +870,7 @@ keep: list_splice(&ret_pages, page_list); count_vm_events(PGACTIVATE, pgactivate); + *dirty_seen = nr_dirty; return nr_reclaimed; } @@ -1232,6 +1243,9 @@ static noinline_for_stack void update_is reclaim_stat->recent_scanned[1] += *nr_file; } +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ +#define MAX_SWAP_CLEAN_WAIT 50 + /* * shrink_inactive_list() is a helper for shrink_zone(). It returns the number * of reclaimed pages @@ -1247,6 +1261,7 @@ shrink_inactive_list(unsigned long nr_to unsigned long nr_active; unsigned long nr_anon; unsigned long nr_file; + unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1295,26 +1310,32 @@ shrink_inactive_list(unsigned long nr_to spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); - + nr_reclaimed = shrink_page_list(&page_list, sc, + PAGEOUT_IO_ASYNC, + &nr_dirty); /* * If we are direct reclaiming for contiguous pages and we do * not reclaim everything in the list, try again and wait * for IO to complete. This will stall high-order allocations * but that should be acceptable to the caller */ - if (nr_reclaimed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) { + int dirty_retry = MAX_SWAP_CLEAN_WAIT; - /* - * The attempt at page out may have made some - * of the pages active, mark them inactive again. - */ - nr_active = clear_active_flags(&page_list, NULL); - count_vm_events(PGDEACTIVATE, nr_active); + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { + wakeup_flusher_threads(nr_dirty); + congestion_wait(BLK_RW_ASYNC, HZ/10); + /* + * The attempt at page out may have made some + * of the pages active, mark them inactive again. + */ + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); + nr_reclaimed += shrink_page_list(&page_list, sc, + PAGEOUT_IO_SYNC, + &nr_dirty); + } } local_irq_disable(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-20 22:02 ` Johannes Weiner 2010-07-21 11:36 ` Johannes Weiner @ 2010-07-21 11:52 ` Mel Gorman 2010-07-21 12:01 ` KAMEZAWA Hiroyuki 2010-07-21 13:04 ` Johannes Weiner 1 sibling, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-21 11:52 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote: > On Tue, Jul 20, 2010 at 02:45:56PM +0100, Mel Gorman wrote: > > On Tue, Jul 20, 2010 at 12:14:20AM +0200, Johannes Weiner wrote: > > > > @@ -639,6 +694,9 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > > > pagevec_free(&freed_pvec); > > > > } > > > > > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > > > +#define MAX_SWAP_CLEAN_WAIT 50 > > > > + > > > > /* > > > > * shrink_page_list() returns the number of reclaimed pages > > > > */ > > > > @@ -646,13 +704,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > > struct scan_control *sc, > > > > enum pageout_io sync_writeback) > > > > { > > > > - LIST_HEAD(ret_pages); > > > > LIST_HEAD(free_pages); > > > > - int pgactivate = 0; > > > > + LIST_HEAD(putback_pages); > > > > + LIST_HEAD(dirty_pages); > > > > + int pgactivate; > > > > + int dirty_isolated = 0; > > > > + unsigned long nr_dirty; > > > > unsigned long nr_reclaimed = 0; > > > > > > > > + pgactivate = 0; > > > > cond_resched(); > > > > > > > > +restart_dirty: > > > > + nr_dirty = 0; > > > > while (!list_empty(page_list)) { > > > > enum page_references references; > > > > struct address_space *mapping; > > > > @@ -741,7 +805,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > > } > > > > } > > > > > > > > - if (PageDirty(page)) { > > > > + if (PageDirty(page)) { > > > > + /* > > > > + * If the caller cannot writeback pages, dirty pages > > > > + * are put on a separate list for cleaning by either > > > > + * a flusher thread or kswapd > > > > + */ > > > > + if (!reclaim_can_writeback(sc, page)) { > > > > + list_add(&page->lru, &dirty_pages); > > > > + unlock_page(page); > > > > + nr_dirty++; > > > > + goto keep_dirty; > > > > + } > > > > + > > > > if (references == PAGEREF_RECLAIM_CLEAN) > > > > goto keep_locked; > > > > if (!may_enter_fs) > > > > @@ -852,13 +928,39 @@ activate_locked: > > > > keep_locked: > > > > unlock_page(page); > > > > keep: > > > > - list_add(&page->lru, &ret_pages); > > > > + list_add(&page->lru, &putback_pages); > > > > +keep_dirty: > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > } > > > > > > > > + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { > > > > + /* > > > > + * Wakeup a flusher thread to clean at least as many dirty > > > > + * pages as encountered by direct reclaim. Wait on congestion > > > > + * to throttle processes cleaning dirty pages > > > > + */ > > > > + wakeup_flusher_threads(nr_dirty); > > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > + > > > > + /* > > > > + * As lumpy reclaim and memcg targets specific pages, wait on > > > > + * them to be cleaned and try reclaim again. 
> > > > + */ > > > > + if (sync_writeback == PAGEOUT_IO_SYNC || > > > > + sc->mem_cgroup != NULL) { > > > > + dirty_isolated++; > > > > + list_splice(&dirty_pages, page_list); > > > > + INIT_LIST_HEAD(&dirty_pages); > > > > + goto restart_dirty; > > > > + } > > > > + } > > > > > > I think it would turn out more natural to just return dirty pages on > > > page_list and have the whole looping logic in shrink_inactive_list(). > > > > > > Mixing dirty pages with other 'please try again' pages is probably not > > > so bad anyway, it means we could retry all temporary unavailable pages > > > instead of twiddling thumbs over that particular bunch of pages until > > > the flushers catch up. > > > > > > What do you think? > > > > > > > It's worth considering! It won't be very tidy but it's workable. The reason > > it is not tidy is that dirty pages and pages that couldn't be paged will be > > on the same list so they whole lot will need to be recycled. We'd record in > > scan_control though that there were pages that need to be retried and loop > > based on that value. That is managable though. > > Recycling all of them is what I had in mind, yeah. But... > > > The reason why I did it this way was because of lumpy reclaim and memcg > > requiring specific pages. I considered lumpy reclaim to be the more common > > case. In that case, it's removing potentially a large number of pages from > > the LRU that are contiguous. If some of those are dirty and it selects more > > contiguous ranges for reclaim, I'd worry that lumpy reclaim would trash the > > system even worse than it currently does when the system is under load. Hence, > > this wait and retry loop is done instead of returning and isolating more pages. > > I think here we missed each other. I don't want the loop to be _that_ > much more in the outer scope that isolation is repeated as well. My bad. > What > I had in mind is the attached patch. It is not tested and hacked up > rather quickly due to time constraints, sorry, but you should get the > idea. I hope I did not miss anything fundamental. > > Note that since only kswapd enters pageout() anymore, everything > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync > cycles for kswapd. Just to mitigate the WTF-count on the patch :-) > Anon page writeback can enter pageout. See static inline bool reclaim_can_writeback(struct scan_control *sc, struct page *page) { return !page_is_file_cache(page) || current_is_kswapd(); } So the logic still applies. > Hannes > > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -386,21 +386,17 @@ static pageout_t pageout(struct page *pa > ClearPageReclaim(page); > return PAGE_ACTIVATE; > } > - > - /* > - * Wait on writeback if requested to. This happens when > - * direct reclaiming a large contiguous area and the > - * first attempt to free a range of pages fails. > - */ > - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) > - wait_on_page_writeback(page); > - I'm assuming this should still remain because it can apply to anon page writeback (i.e. being swapped)? > if (!PageWriteback(page)) { > /* synchronous write or broken a_ops? */ > ClearPageReclaim(page); > } > trace_mm_vmscan_writepage(page, > page_is_file_cache(page), > + /* > + * Humm. Only kswapd comes here and for > + * kswapd there never is a PAGEOUT_IO_SYNC > + * cycle... 
> + */ > sync_writeback == PAGEOUT_IO_SYNC); > inc_zone_page_state(page, NR_VMSCAN_WRITE); To clarify, see the following example of writeback stats - the anon sync I/O in particular Direct reclaim pages scanned 156940 150720 145472 142254 Direct reclaim write file async I/O 2472 0 0 0 Direct reclaim write anon async I/O 29281 27195 27968 25519 Direct reclaim write file sync I/O 1943 0 0 0 Direct reclaim write anon sync I/O 11777 12488 10835 4806 > return PAGE_SUCCESS; > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page > * shrink_page_list() returns the number of reclaimed pages > */ > static unsigned long shrink_page_list(struct list_head *page_list, > - struct scan_control *sc, > - enum pageout_io sync_writeback) > + struct scan_control *sc, > + enum pageout_io sync_writeback, > + int *dirty_seen) > { > LIST_HEAD(ret_pages); > LIST_HEAD(free_pages); > int pgactivate = 0; > + unsigned long nr_dirty = 0; > unsigned long nr_reclaimed = 0; > > cond_resched(); > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st > enum page_references references; > struct address_space *mapping; > struct page *page; > - int may_enter_fs; > + int may_pageout; > > cond_resched(); > > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st > if (page_mapped(page) || PageSwapCache(page)) > sc->nr_scanned++; > > - may_enter_fs = (sc->gfp_mask & __GFP_FS) || > + /* > + * To prevent stack overflows, only kswapd can enter > + * the filesystem. Swap IO is always fine (for now). > + */ > + may_pageout = current_is_kswapd() || > (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); > We lost the __GFP_FS check and it's vaguely possible kswapd could call the allocator with GFP_NOFS. While you check it before wait_on_page_writeback it needs to be checked before calling pageout(). I toyed around with creating a may_pageout that took everything into account but I couldn't convince myself there was no holes or serious change in functionality. > if (PageWriteback(page)) { > + int may_wait; > /* > * Synchronous reclaim is performed in two passes, > * first an asynchronous pass over the list to > @@ -693,7 +696,8 @@ static unsigned long shrink_page_list(st > * for any page for which writeback has already > * started. > */ > - if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs) > + may_wait = (sc->gfp_mask & __GFP_FS) || may_pageout; > + if (sync_writeback == PAGEOUT_IO_SYNC && may_wait) > wait_on_page_writeback(page); > else > goto keep_locked; > @@ -719,7 +723,7 @@ static unsigned long shrink_page_list(st > goto keep_locked; > if (!add_to_swap(page)) > goto activate_locked; > - may_enter_fs = 1; > + may_pageout = 1; > } > > mapping = page_mapping(page); > @@ -742,9 +746,11 @@ static unsigned long shrink_page_list(st > } > > if (PageDirty(page)) { > + nr_dirty++; > + > if (references == PAGEREF_RECLAIM_CLEAN) > goto keep_locked; > - if (!may_enter_fs) > + if (!may_pageout) > goto keep_locked; > if (!sc->may_writepage) > goto keep_locked; > @@ -860,6 +866,7 @@ keep: > > list_splice(&ret_pages, page_list); > count_vm_events(PGACTIVATE, pgactivate); > + *dirty_seen = nr_dirty; > return nr_reclaimed; > } > > @@ -1232,6 +1239,9 @@ static noinline_for_stack void update_is > reclaim_stat->recent_scanned[1] += *nr_file; > } > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > +#define MAX_SWAP_CLEAN_WAIT 50 > + > /* > * shrink_inactive_list() is a helper for shrink_zone(). 
It returns the number > * of reclaimed pages > @@ -1247,6 +1257,7 @@ shrink_inactive_list(unsigned long nr_to > unsigned long nr_active; > unsigned long nr_anon; > unsigned long nr_file; > + unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > @@ -1295,26 +1306,34 @@ shrink_inactive_list(unsigned long nr_to > > spin_unlock_irq(&zone->lru_lock); > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > - > + nr_reclaimed = shrink_page_list(&page_list, sc, > + PAGEOUT_IO_ASYNC, > + &nr_dirty); > /* > * If we are direct reclaiming for contiguous pages and we do > * not reclaim everything in the list, try again and wait > * for IO to complete. This will stall high-order allocations > * but that should be acceptable to the caller > */ > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + if (!current_is_kswapd() && sc->lumpy_reclaim_mode || sc->mem_cgroup) { > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > - /* > - * The attempt at page out may have made some > - * of the pages active, mark them inactive again. > - */ > - nr_active = clear_active_flags(&page_list, NULL); > - count_vm_events(PGDEACTIVATE, nr_active); > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > + wakeup_flusher_threads(nr_dirty); > + congestion_wait(BLK_RW_ASYNC, HZ/10); > + /* > + * The attempt at page out may have made some > + * of the pages active, mark them inactive again. > + * > + * Humm. Still needed? > + */ > + nr_active = clear_active_flags(&page_list, NULL); > + count_vm_events(PGDEACTIVATE, nr_active); > I don't see why it would be removed. > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > + nr_reclaimed += shrink_page_list(&page_list, sc, > + PAGEOUT_IO_SYNC, > + &nr_dirty); > + } > } > > local_irq_disable(); Ok, is this closer to what you had in mind? ==== CUT HERE ==== [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim When memory is under enough pressure, a process may enter direct reclaim to free pages in the same manner kswapd does. If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This can result in very deep call stacks, particularly if the target storage or filesystem are complex. It has already been observed on XFS that the stack overflows but the problem is not XFS-specific. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd or the page is anonymous before writing back. If the dirty pages cannot be written back, they are placed back on the LRU lists for either background writing by the BDI threads or kswapd. If in direct lumpy reclaim and dirty pages are encountered, the process will stall for the background flusher before trying to reclaim the pages again. As the call-chain for writing anonymous pages is not expected to be deep and they are not cleaned by flusher threads, anonymous pages are still written back in direct reclaim. 
Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> diff --git a/mm/vmscan.c b/mm/vmscan.c index 6587155..e3a5816 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -323,6 +323,51 @@ typedef enum { PAGE_CLEAN, } pageout_t; +int write_reclaim_page(struct page *page, struct address_space *mapping, + enum pageout_io sync_writeback) +{ + int res; + struct writeback_control wbc = { + .sync_mode = WB_SYNC_NONE, + .nr_to_write = SWAP_CLUSTER_MAX, + .range_start = 0, + .range_end = LLONG_MAX, + .nonblocking = 1, + .for_reclaim = 1, + }; + + if (!clear_page_dirty_for_io(page)) + return PAGE_CLEAN; + + SetPageReclaim(page); + res = mapping->a_ops->writepage(page, &wbc); + if (res < 0) + handle_write_error(mapping, page, res); + if (res == AOP_WRITEPAGE_ACTIVATE) { + ClearPageReclaim(page); + return PAGE_ACTIVATE; + } + + /* + * Wait on writeback if requested to. This happens when + * direct reclaiming a large contiguous area and the + * first attempt to free a range of pages fails. + */ + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) + wait_on_page_writeback(page); + + if (!PageWriteback(page)) { + /* synchronous write or broken a_ops? */ + ClearPageReclaim(page); + } + trace_mm_vmscan_writepage(page, + page_is_file_cache(page), + sync_writeback == PAGEOUT_IO_SYNC); + inc_zone_page_state(page, NR_VMSCAN_WRITE); + + return PAGE_SUCCESS; +} + /* * pageout is called by shrink_page_list() for each dirty page. * Calls ->writepage(). @@ -367,46 +412,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, if (!may_write_to_queue(mapping->backing_dev_info)) return PAGE_KEEP; - if (clear_page_dirty_for_io(page)) { - int res; - struct writeback_control wbc = { - .sync_mode = WB_SYNC_NONE, - .nr_to_write = SWAP_CLUSTER_MAX, - .range_start = 0, - .range_end = LLONG_MAX, - .nonblocking = 1, - .for_reclaim = 1, - }; - - SetPageReclaim(page); - res = mapping->a_ops->writepage(page, &wbc); - if (res < 0) - handle_write_error(mapping, page, res); - if (res == AOP_WRITEPAGE_ACTIVATE) { - ClearPageReclaim(page); - return PAGE_ACTIVATE; - } - - /* - * Wait on writeback if requested to. This happens when - * direct reclaiming a large contiguous area and the - * first attempt to free a range of pages fails. - */ - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) - wait_on_page_writeback(page); - - if (!PageWriteback(page)) { - /* synchronous write or broken a_ops? 
*/ - ClearPageReclaim(page); - } - trace_mm_vmscan_writepage(page, - page_is_file_cache(page), - sync_writeback == PAGEOUT_IO_SYNC); - inc_zone_page_state(page, NR_VMSCAN_WRITE); - return PAGE_SUCCESS; - } - - return PAGE_CLEAN; + return write_reclaim_page(page, mapping, sync_writeback); } /* @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) pagevec_free(&freed_pvec); } +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ +#define MAX_SWAP_CLEAN_WAIT 50 + /* * shrink_page_list() returns the number of reclaimed pages */ static unsigned long shrink_page_list(struct list_head *page_list, struct scan_control *sc, - enum pageout_io sync_writeback) + enum pageout_io sync_writeback, + unsigned long *nr_still_dirty) { - LIST_HEAD(ret_pages); LIST_HEAD(free_pages); - int pgactivate = 0; + LIST_HEAD(putback_pages); + LIST_HEAD(dirty_pages); + int pgactivate; + unsigned long nr_dirty = 0; unsigned long nr_reclaimed = 0; + pgactivate = 0; cond_resched(); while (!list_empty(page_list)) { @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list, } } - if (PageDirty(page)) { + if (PageDirty(page)) { + /* + * Only kswapd can writeback filesystem pages to + * avoid risk of stack overflow + */ + if (page_is_file_cache(page) && !current_is_kswapd()) { + list_add(&page->lru, &dirty_pages); + unlock_page(page); + nr_dirty++; + goto keep_dirty; + } + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; if (!may_enter_fs) @@ -852,13 +876,19 @@ activate_locked: keep_locked: unlock_page(page); keep: - list_add(&page->lru, &ret_pages); + list_add(&page->lru, &putback_pages); +keep_dirty: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } free_page_list(&free_pages); - list_splice(&ret_pages, page_list); + if (nr_dirty) { + *nr_still_dirty = nr_dirty; + list_splice(&dirty_pages, page_list); + } + list_splice(&putback_pages, page_list); + count_vm_events(PGACTIVATE, pgactivate); return nr_reclaimed; } @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, unsigned long nr_active; unsigned long nr_anon; unsigned long nr_file; + unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, + &nr_dirty); /* - * If we are direct reclaiming for contiguous pages and we do + * If specific pages are needed such as with direct reclaiming + * for contiguous pages or for memory containers and we do * not reclaim everything in the list, try again and wait - * for IO to complete. This will stall high-order allocations - * but that should be acceptable to the caller + * for IO to complete. This will stall callers that require + * specific pages but it should be acceptable to the caller */ - if (nr_reclaimed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (sc->may_writepage && !current_is_kswapd() && + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { + int dirty_retry = MAX_SWAP_CLEAN_WAIT; - /* - * The attempt at page out may have made some - * of the pages active, mark them inactive again. 
- */ - nr_active = clear_active_flags(&page_list, NULL); - count_vm_events(PGDEACTIVATE, nr_active); + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); + congestion_wait(BLK_RW_ASYNC, HZ/10); - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); + /* + * The attempt at page out may have made some + * of the pages active, mark them inactive again. + */ + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); + + nr_reclaimed += shrink_page_list(&page_list, sc, + PAGEOUT_IO_SYNC, &nr_dirty); + } } local_irq_disable(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 11:52 ` Mel Gorman @ 2010-07-21 12:01 ` KAMEZAWA Hiroyuki 2010-07-21 14:27 ` Mel Gorman 2010-07-21 13:04 ` Johannes Weiner 1 sibling, 1 reply; 87+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-07-21 12:01 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, 21 Jul 2010 12:52:50 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > ==== CUT HERE ==== > [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim > > When memory is under enough pressure, a process may enter direct > reclaim to free pages in the same manner kswapd does. If a dirty page is > encountered during the scan, this page is written to backing storage using > mapping->writepage. This can result in very deep call stacks, particularly > if the target storage or filesystem are complex. It has already been observed > on XFS that the stack overflows but the problem is not XFS-specific. > > This patch prevents direct reclaim writing back filesystem pages by checking > if current is kswapd or the page is anonymous before writing back. If the > dirty pages cannot be written back, they are placed back on the LRU lists > for either background writing by the BDI threads or kswapd. If in direct > lumpy reclaim and dirty pages are encountered, the process will stall for > the background flusher before trying to reclaim the pages again. > > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > Acked-by: Rik van Riel <riel@redhat.com> > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 6587155..e3a5816 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -323,6 +323,51 @@ typedef enum { > PAGE_CLEAN, > } pageout_t; > > +int write_reclaim_page(struct page *page, struct address_space *mapping, > + enum pageout_io sync_writeback) > +{ > + int res; > + struct writeback_control wbc = { > + .sync_mode = WB_SYNC_NONE, > + .nr_to_write = SWAP_CLUSTER_MAX, > + .range_start = 0, > + .range_end = LLONG_MAX, > + .nonblocking = 1, > + .for_reclaim = 1, > + }; > + > + if (!clear_page_dirty_for_io(page)) > + return PAGE_CLEAN; > + > + SetPageReclaim(page); > + res = mapping->a_ops->writepage(page, &wbc); > + if (res < 0) > + handle_write_error(mapping, page, res); > + if (res == AOP_WRITEPAGE_ACTIVATE) { > + ClearPageReclaim(page); > + return PAGE_ACTIVATE; > + } > + > + /* > + * Wait on writeback if requested to. This happens when > + * direct reclaiming a large contiguous area and the > + * first attempt to free a range of pages fails. > + */ > + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) > + wait_on_page_writeback(page); > + > + if (!PageWriteback(page)) { > + /* synchronous write or broken a_ops? */ > + ClearPageReclaim(page); > + } > + trace_mm_vmscan_writepage(page, > + page_is_file_cache(page), > + sync_writeback == PAGEOUT_IO_SYNC); > + inc_zone_page_state(page, NR_VMSCAN_WRITE); > + > + return PAGE_SUCCESS; > +} > + > /* > * pageout is called by shrink_page_list() for each dirty page. > * Calls ->writepage(). 
> @@ -367,46 +412,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, > if (!may_write_to_queue(mapping->backing_dev_info)) > return PAGE_KEEP; > > - if (clear_page_dirty_for_io(page)) { > - int res; > - struct writeback_control wbc = { > - .sync_mode = WB_SYNC_NONE, > - .nr_to_write = SWAP_CLUSTER_MAX, > - .range_start = 0, > - .range_end = LLONG_MAX, > - .nonblocking = 1, > - .for_reclaim = 1, > - }; > - > - SetPageReclaim(page); > - res = mapping->a_ops->writepage(page, &wbc); > - if (res < 0) > - handle_write_error(mapping, page, res); > - if (res == AOP_WRITEPAGE_ACTIVATE) { > - ClearPageReclaim(page); > - return PAGE_ACTIVATE; > - } > - > - /* > - * Wait on writeback if requested to. This happens when > - * direct reclaiming a large contiguous area and the > - * first attempt to free a range of pages fails. > - */ > - if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC) > - wait_on_page_writeback(page); > - > - if (!PageWriteback(page)) { > - /* synchronous write or broken a_ops? */ > - ClearPageReclaim(page); > - } > - trace_mm_vmscan_writepage(page, > - page_is_file_cache(page), > - sync_writeback == PAGEOUT_IO_SYNC); > - inc_zone_page_state(page, NR_VMSCAN_WRITE); > - return PAGE_SUCCESS; > - } > - > - return PAGE_CLEAN; > + return write_reclaim_page(page, mapping, sync_writeback); > } > > /* > @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > pagevec_free(&freed_pvec); > } > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > +#define MAX_SWAP_CLEAN_WAIT 50 > + > /* > * shrink_page_list() returns the number of reclaimed pages > */ > static unsigned long shrink_page_list(struct list_head *page_list, > struct scan_control *sc, > - enum pageout_io sync_writeback) > + enum pageout_io sync_writeback, > + unsigned long *nr_still_dirty) > { > - LIST_HEAD(ret_pages); > LIST_HEAD(free_pages); > - int pgactivate = 0; > + LIST_HEAD(putback_pages); > + LIST_HEAD(dirty_pages); > + int pgactivate; > + unsigned long nr_dirty = 0; > unsigned long nr_reclaimed = 0; > > + pgactivate = 0; > cond_resched(); > > while (!list_empty(page_list)) { > @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > } > > - if (PageDirty(page)) { > + if (PageDirty(page)) { > + /* > + * Only kswapd can writeback filesystem pages to > + * avoid risk of stack overflow > + */ > + if (page_is_file_cache(page) && !current_is_kswapd()) { > + list_add(&page->lru, &dirty_pages); > + unlock_page(page); > + nr_dirty++; > + goto keep_dirty; > + } > + > if (references == PAGEREF_RECLAIM_CLEAN) > goto keep_locked; > if (!may_enter_fs) > @@ -852,13 +876,19 @@ activate_locked: > keep_locked: > unlock_page(page); > keep: > - list_add(&page->lru, &ret_pages); > + list_add(&page->lru, &putback_pages); > +keep_dirty: > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > } > > free_page_list(&free_pages); > > - list_splice(&ret_pages, page_list); > + if (nr_dirty) { > + *nr_still_dirty = nr_dirty; > + list_splice(&dirty_pages, page_list); > + } > + list_splice(&putback_pages, page_list); > + > count_vm_events(PGACTIVATE, pgactivate); > return nr_reclaimed; > } > @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_active; > unsigned long nr_anon; > unsigned long nr_file; > + unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > @@ -1293,26 +1324,34 @@ 
shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > spin_unlock_irq(&zone->lru_lock); > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > + &nr_dirty); > > /* > - * If we are direct reclaiming for contiguous pages and we do > + * If specific pages are needed such as with direct reclaiming > + * for contiguous pages or for memory containers and we do > * not reclaim everything in the list, try again and wait > - * for IO to complete. This will stall high-order allocations > - * but that should be acceptable to the caller > + * for IO to complete. This will stall callers that require > + * specific pages but it should be acceptable to the caller > */ > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + if (sc->may_writepage && !current_is_kswapd() && > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; Hmm, ok. I see what will happen to memcg. But, hmm, memcg will have to select to enter this rounine based on the result of 1st memory reclaim. > > - /* > - * The attempt at page out may have made some > - * of the pages active, mark them inactive again. > - */ > - nr_active = clear_active_flags(&page_list, NULL); > - count_vm_events(PGDEACTIVATE, nr_active); > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > + congestion_wait(BLK_RW_ASYNC, HZ/10); > Congestion wait is required ?? Where the congestion happens ? I'm sorry you already have some other trick in other patch. > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > + /* > + * The attempt at page out may have made some > + * of the pages active, mark them inactive again. > + */ > + nr_active = clear_active_flags(&page_list, NULL); > + count_vm_events(PGDEACTIVATE, nr_active); > + > + nr_reclaimed += shrink_page_list(&page_list, sc, > + PAGEOUT_IO_SYNC, &nr_dirty); > + } Just a question. This PAGEOUT_IO_SYNC has some meanings ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 12:01 ` KAMEZAWA Hiroyuki @ 2010-07-21 14:27 ` Mel Gorman 2010-07-21 23:57 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-21 14:27 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote: > > <SNIP> > > > > /* > > - * If we are direct reclaiming for contiguous pages and we do > > + * If specific pages are needed such as with direct reclaiming > > + * for contiguous pages or for memory containers and we do > > * not reclaim everything in the list, try again and wait > > - * for IO to complete. This will stall high-order allocations > > - * but that should be acceptable to the caller > > + * for IO to complete. This will stall callers that require > > + * specific pages but it should be acceptable to the caller > > */ > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > - sc->lumpy_reclaim_mode) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + if (sc->may_writepage && !current_is_kswapd() && > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > Hmm, ok. I see what will happen to memcg. Thanks > But, hmm, memcg will have to select to enter this rounine based on > the result of 1st memory reclaim. > It has the option of igoring pages being dirtied but I worry that the container could be filled with dirty pages waiting for flushers to do something. > > > > - /* > > - * The attempt at page out may have made some > > - * of the pages active, mark them inactive again. > > - */ > > - nr_active = clear_active_flags(&page_list, NULL); > > - count_vm_events(PGDEACTIVATE, nr_active); > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > Congestion wait is required ?? Where the congestion happens ? > I'm sorry you already have some other trick in other patch. > It's to wait for the IO to occur. > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > + /* > > + * The attempt at page out may have made some > > + * of the pages active, mark them inactive again. > > + */ > > + nr_active = clear_active_flags(&page_list, NULL); > > + count_vm_events(PGDEACTIVATE, nr_active); > > + > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > + PAGEOUT_IO_SYNC, &nr_dirty); > > + } > > Just a question. This PAGEOUT_IO_SYNC has some meanings ? > Yes, in pageout it will wait on pages currently being written back to be cleaned before trying to reclaim them. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 14:27 ` Mel Gorman @ 2010-07-21 23:57 ` KAMEZAWA Hiroyuki 2010-07-22 9:19 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-07-21 23:57 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, 21 Jul 2010 15:27:10 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote: > > But, hmm, memcg will have to select to enter this rounine based on > > the result of 1st memory reclaim. > > > > It has the option of igoring pages being dirtied but I worry that the > container could be filled with dirty pages waiting for flushers to do > something. I'll prepare dirty_ratio for memcg. It's not easy but requested by I/O cgroup guys, too... > > > > > > > - /* > > > - * The attempt at page out may have made some > > > - * of the pages active, mark them inactive again. > > > - */ > > > - nr_active = clear_active_flags(&page_list, NULL); > > > - count_vm_events(PGDEACTIVATE, nr_active); > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > > Congestion wait is required ?? Where the congestion happens ? > > I'm sorry you already have some other trick in other patch. > > > > It's to wait for the IO to occur. > 1 tick penalty seems too large. I hope we can have some waitqueue in future. > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > + /* > > > + * The attempt at page out may have made some > > > + * of the pages active, mark them inactive again. > > > + */ > > > + nr_active = clear_active_flags(&page_list, NULL); > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > + > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > + } > > > > Just a question. This PAGEOUT_IO_SYNC has some meanings ? > > > > Yes, in pageout it will wait on pages currently being written back to be > cleaned before trying to reclaim them. > Hmm. IIUC, this routine is called only when !current_is_kswapd() and pageout is done only whne current_is_kswapd(). So, this seems .... Wrong ? Thanks, -Kame -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 23:57 ` KAMEZAWA Hiroyuki @ 2010-07-22 9:19 ` Mel Gorman 2010-07-22 9:22 ` KAMEZAWA Hiroyuki 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-22 9:19 UTC (permalink / raw) To: KAMEZAWA Hiroyuki Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Thu, Jul 22, 2010 at 08:57:34AM +0900, KAMEZAWA Hiroyuki wrote: > On Wed, 21 Jul 2010 15:27:10 +0100 > Mel Gorman <mel@csn.ul.ie> wrote: > > > On Wed, Jul 21, 2010 at 09:01:11PM +0900, KAMEZAWA Hiroyuki wrote: > > > > But, hmm, memcg will have to select to enter this rounine based on > > > the result of 1st memory reclaim. > > > > > > > It has the option of igoring pages being dirtied but I worry that the > > container could be filled with dirty pages waiting for flushers to do > > something. > > I'll prepare dirty_ratio for memcg. It's not easy but requested by I/O cgroup > guys, too... > I can see why it might be difficult. Dirty pages are not being counted on a per-container basis. It would require additional infrastructure to count it or a lot of scanning. > > > > > > > > > > > - /* > > > > - * The attempt at page out may have made some > > > > - * of the pages active, mark them inactive again. > > > > - */ > > > > - nr_active = clear_active_flags(&page_list, NULL); > > > > - count_vm_events(PGDEACTIVATE, nr_active); > > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > > > > > Congestion wait is required ?? Where the congestion happens ? > > > I'm sorry you already have some other trick in other patch. > > > > > > > It's to wait for the IO to occur. > > > > 1 tick penalty seems too large. I hope we can have some waitqueue in future. > congestion_wait() if congestion occurs goes onto a waitqueue that is woken if congestion clears. I didn't measure it this time around but I doubt it waits for HZ/10 much of the time. > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > > + /* > > > > + * The attempt at page out may have made some > > > > + * of the pages active, mark them inactive again. > > > > + */ > > > > + nr_active = clear_active_flags(&page_list, NULL); > > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > > + > > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > > + } > > > > > > Just a question. This PAGEOUT_IO_SYNC has some meanings ? > > > > > > > Yes, in pageout it will wait on pages currently being written back to be > > cleaned before trying to reclaim them. > > > Hmm. IIUC, this routine is called only when !current_is_kswapd() and > pageout is done only whne current_is_kswapd(). So, this seems .... > Wrong ? > Both direct reclaim and kswapd can reach shrink_inactive_list Direct reclaim do_try_to_free_pages -> shrink_zones -> shrink_zone -> shrink_list -> shrink_inactive list <--- the routine in question Kswapd balance_pgdat -> shrink_zone -> shrink_list -> shrink_inactive_list pageout() is still called by direct reclaim if the page is anon so it will synchronously wait on those if PAGEOUT_IO_SYNC is set. For either anon or file pages, if they are being currently written back, they will be waited on in shrink_page_list() if PAGEOUT_IO_SYNC. 
So it still has meaning. Did I miss something? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab
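For reference, the congestion_wait() call discussed in this exchange is a timed sleep on a global waitqueue rather than a hard 100ms delay. In kernels of this vintage it lives in mm/backing-dev.c and looks roughly like the sketch below (reconstructed from memory, so minor details may differ between releases); waiters are woken early by clear_bdi_congested() when a backing device drops below its congestion threshold:

    long congestion_wait(int sync, long timeout)
    {
            long ret;
            DEFINE_WAIT(wait);
            wait_queue_head_t *wqh = &congestion_wqh[sync];

            /* Sleep until woken by clear_bdi_congested() or the timeout expires */
            prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
            ret = io_schedule_timeout(timeout);
            finish_wait(wqh, &wait);

            return ret;
    }

A caller passing HZ/10 therefore sleeps for at most 100ms but returns as soon as congestion clears somewhere, which is the behaviour Mel describes above.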
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-22 9:19 ` Mel Gorman @ 2010-07-22 9:22 ` KAMEZAWA Hiroyuki 0 siblings, 0 replies; 87+ messages in thread From: KAMEZAWA Hiroyuki @ 2010-07-22 9:22 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Thu, 22 Jul 2010 10:19:30 +0100 Mel Gorman <mel@csn.ul.ie> wrote: > On Thu, Jul 22, 2010 at 08:57:34AM +0900, KAMEZAWA Hiroyuki wrote: > > On Wed, 21 Jul 2010 15:27:10 +0100 > > Mel Gorman <mel@csn.ul.ie> wrote: > > 1 tick penalty seems too large. I hope we can have some waitqueue in future. > > > > congestion_wait() if congestion occurs goes onto a waitqueue that is > woken if congestion clears. I didn't measure it this time around but I > doubt it waits for HZ/10 much of the time. > Okay. > > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > > > + /* > > > > > + * The attempt at page out may have made some > > > > > + * of the pages active, mark them inactive again. > > > > > + */ > > > > > + nr_active = clear_active_flags(&page_list, NULL); > > > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > > > + > > > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > > > + } > > > > > > > > Just a question. This PAGEOUT_IO_SYNC has some meanings ? > > > > > > > > > > Yes, in pageout it will wait on pages currently being written back to be > > > cleaned before trying to reclaim them. > > > > > Hmm. IIUC, this routine is called only when !current_is_kswapd() and > > pageout is done only whne current_is_kswapd(). So, this seems .... > > Wrong ? > > > > Both direct reclaim and kswapd can reach shrink_inactive_list > > Direct reclaim > do_try_to_free_pages > -> shrink_zones > -> shrink_zone > -> shrink_list > -> shrink_inactive list <--- the routine in question > > Kswapd > balance_pgdat > -> shrink_zone > -> shrink_list > -> shrink_inactive_list > > pageout() is still called by direct reclaim if the page is anon so it > will synchronously wait on those if PAGEOUT_IO_SYNC is set. Ah, ok. I missed that. Thank you for kindly clarification. Thanks, -Kame ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 11:52 ` Mel Gorman 2010-07-21 12:01 ` KAMEZAWA Hiroyuki @ 2010-07-21 13:04 ` Johannes Weiner 2010-07-21 13:38 ` Mel Gorman 1 sibling, 1 reply; 87+ messages in thread From: Johannes Weiner @ 2010-07-21 13:04 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 12:52:50PM +0100, Mel Gorman wrote: > On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote: > > What > > I had in mind is the attached patch. It is not tested and hacked up > > rather quickly due to time constraints, sorry, but you should get the > > idea. I hope I did not miss anything fundamental. > > > > Note that since only kswapd enters pageout() anymore, everything > > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync > > cycles for kswapd. Just to mitigate the WTF-count on the patch :-) > > > > Anon page writeback can enter pageout. See > > static inline bool reclaim_can_writeback(struct scan_control *sc, > struct page *page) > { > return !page_is_file_cache(page) || current_is_kswapd(); > } > > So the logic still applies. Yeah, I noticed it only after looking at it again this morning. My bad, it got a bit late when I wrote it. > > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page > > * shrink_page_list() returns the number of reclaimed pages > > */ > > static unsigned long shrink_page_list(struct list_head *page_list, > > - struct scan_control *sc, > > - enum pageout_io sync_writeback) > > + struct scan_control *sc, > > + enum pageout_io sync_writeback, > > + int *dirty_seen) > > { > > LIST_HEAD(ret_pages); > > LIST_HEAD(free_pages); > > int pgactivate = 0; > > + unsigned long nr_dirty = 0; > > unsigned long nr_reclaimed = 0; > > > > cond_resched(); > > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st > > enum page_references references; > > struct address_space *mapping; > > struct page *page; > > - int may_enter_fs; > > + int may_pageout; > > > > cond_resched(); > > > > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st > > if (page_mapped(page) || PageSwapCache(page)) > > sc->nr_scanned++; > > > > - may_enter_fs = (sc->gfp_mask & __GFP_FS) || > > + /* > > + * To prevent stack overflows, only kswapd can enter > > + * the filesystem. Swap IO is always fine (for now). > > + */ > > + may_pageout = current_is_kswapd() || > > (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); > > > > We lost the __GFP_FS check and it's vaguely possible kswapd could call the > allocator with GFP_NOFS. While you check it before wait_on_page_writeback it > needs to be checked before calling pageout(). I toyed around with > creating a may_pageout that took everything into account but I couldn't > convince myself there was no holes or serious change in functionality. Yeah, I checked balance_pgdat(), saw GFP_KERNEL and went for it. But it's probably better to keep such dependencies out. > Ok, is this closer to what you had in mind? IMHO this is (almost) ready to get merged, so I am including the nitpicking comments :-) > ==== CUT HERE ==== > [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim > > When memory is under enough pressure, a process may enter direct > reclaim to free pages in the same manner kswapd does. 
If a dirty page is > encountered during the scan, this page is written to backing storage using > mapping->writepage. This can result in very deep call stacks, particularly > if the target storage or filesystem are complex. It has already been observed > on XFS that the stack overflows but the problem is not XFS-specific. > > This patch prevents direct reclaim writing back filesystem pages by checking > if current is kswapd or the page is anonymous before writing back. If the > dirty pages cannot be written back, they are placed back on the LRU lists > for either background writing by the BDI threads or kswapd. If in direct > lumpy reclaim and dirty pages are encountered, the process will stall for > the background flusher before trying to reclaim the pages again. > > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > Acked-by: Rik van Riel <riel@redhat.com> > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 6587155..e3a5816 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c [...] Does factoring pageout() still make sense in this patch? It does not introduce a second callsite. > @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > pagevec_free(&freed_pvec); > } > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > +#define MAX_SWAP_CLEAN_WAIT 50 That's placed a bit randomly now that shrink_page_list() doesn't use it anymore. I moved it just above shrink_inactive_list() but maybe it would be better at the file's head? > /* > * shrink_page_list() returns the number of reclaimed pages > */ > static unsigned long shrink_page_list(struct list_head *page_list, > struct scan_control *sc, > - enum pageout_io sync_writeback) > + enum pageout_io sync_writeback, > + unsigned long *nr_still_dirty) > { > - LIST_HEAD(ret_pages); > LIST_HEAD(free_pages); > - int pgactivate = 0; > + LIST_HEAD(putback_pages); > + LIST_HEAD(dirty_pages); > + int pgactivate; > + unsigned long nr_dirty = 0; > unsigned long nr_reclaimed = 0; > > + pgactivate = 0; Spurious change? > cond_resched(); > > while (!list_empty(page_list)) { > @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > } > > - if (PageDirty(page)) { > + if (PageDirty(page)) { Ha! > + /* > + * Only kswapd can writeback filesystem pages to > + * avoid risk of stack overflow > + */ > + if (page_is_file_cache(page) && !current_is_kswapd()) { > + list_add(&page->lru, &dirty_pages); > + unlock_page(page); > + nr_dirty++; > + goto keep_dirty; > + } I don't understand why you keep the extra dirty list. Couldn't this just be `goto keep_locked'? > if (references == PAGEREF_RECLAIM_CLEAN) > goto keep_locked; > if (!may_enter_fs) > @@ -852,13 +876,19 @@ activate_locked: > keep_locked: > unlock_page(page); > keep: > - list_add(&page->lru, &ret_pages); > + list_add(&page->lru, &putback_pages); > +keep_dirty: > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > } > > free_page_list(&free_pages); > > - list_splice(&ret_pages, page_list); > + if (nr_dirty) { > + *nr_still_dirty = nr_dirty; You either have to set *nr_still_dirty unconditionally or (re)initialize the variable in shrink_inactive_list(). > + list_splice(&dirty_pages, page_list); > + } > + list_splice(&putback_pages, page_list); When we retry those pages, the dirty ones come last on the list. 
Was this maybe the intention behind collecting dirties separately? > @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_active; > unsigned long nr_anon; > unsigned long nr_file; > + unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > spin_unlock_irq(&zone->lru_lock); > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > + &nr_dirty); > > /* > - * If we are direct reclaiming for contiguous pages and we do > + * If specific pages are needed such as with direct reclaiming > + * for contiguous pages or for memory containers and we do > * not reclaim everything in the list, try again and wait > - * for IO to complete. This will stall high-order allocations > - * but that should be acceptable to the caller > + * for IO to complete. This will stall callers that require > + * specific pages but it should be acceptable to the caller > */ > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + if (sc->may_writepage && !current_is_kswapd() && > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > - /* > - * The attempt at page out may have made some > - * of the pages active, mark them inactive again. > - */ > - nr_active = clear_active_flags(&page_list, NULL); > - count_vm_events(PGDEACTIVATE, nr_active); > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); Yup, minding laptop_mode (together with may_writepage). Agreed. > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > + /* > + * The attempt at page out may have made some > + * of the pages active, mark them inactive again. > + */ > + nr_active = clear_active_flags(&page_list, NULL); > + count_vm_events(PGDEACTIVATE, nr_active); > + > + nr_reclaimed += shrink_page_list(&page_list, sc, > + PAGEOUT_IO_SYNC, &nr_dirty); > + } > } > > local_irq_disable(); Thanks, Hannes ^ permalink raw reply [flat|nested] 87+ messages in thread
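The *nr_still_dirty nitpick above is the classic output-parameter hazard. The small userspace C sketch below (function and variable names are invented for the illustration) shows why the callee must store to the output unconditionally, or the caller must initialize its local, before the value can be trusted.

/*
 * Userspace illustration of the *nr_still_dirty nitpick: if the callee
 * only stores to the output parameter on some paths, the caller's local
 * variable is read uninitialized on the others. Either the callee must
 * store unconditionally or the caller must initialize nr_dirty itself.
 */
#include <stdio.h>

/* buggy shape: only writes the result when something was counted */
static void count_dirty_buggy(unsigned long dirty_seen,
			      unsigned long *nr_still_dirty)
{
	if (dirty_seen)
		*nr_still_dirty = dirty_seen;
}

/* fixed shape: the output is always well defined for the caller */
static void count_dirty_fixed(unsigned long dirty_seen,
			      unsigned long *nr_still_dirty)
{
	*nr_still_dirty = dirty_seen;
}

int main(void)
{
	unsigned long nr_dirty = 0xdeadbeef;	/* simulate an uninitialized local */

	count_dirty_buggy(0, &nr_dirty);
	printf("buggy callee, no dirty pages seen: nr_dirty = %#lx (garbage)\n", nr_dirty);

	nr_dirty = 0xdeadbeef;
	count_dirty_fixed(0, &nr_dirty);
	printf("fixed callee, no dirty pages seen: nr_dirty = %lu\n", nr_dirty);
	return 0;
}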
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 13:04 ` Johannes Weiner @ 2010-07-21 13:38 ` Mel Gorman 2010-07-21 14:28 ` Johannes Weiner 2010-07-26 8:29 ` Wu Fengguang 0 siblings, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-21 13:38 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 03:04:35PM +0200, Johannes Weiner wrote: > On Wed, Jul 21, 2010 at 12:52:50PM +0100, Mel Gorman wrote: > > On Wed, Jul 21, 2010 at 12:02:18AM +0200, Johannes Weiner wrote: > > > What > > > I had in mind is the attached patch. It is not tested and hacked up > > > rather quickly due to time constraints, sorry, but you should get the > > > idea. I hope I did not miss anything fundamental. > > > > > > Note that since only kswapd enters pageout() anymore, everything > > > depending on PAGEOUT_IO_SYNC in there is moot, since there are no sync > > > cycles for kswapd. Just to mitigate the WTF-count on the patch :-) > > > > > > > Anon page writeback can enter pageout. See > > > > static inline bool reclaim_can_writeback(struct scan_control *sc, > > struct page *page) > > { > > return !page_is_file_cache(page) || current_is_kswapd(); > > } > > > > So the logic still applies. > > Yeah, I noticed it only after looking at it again this morning. My > bad, it got a bit late when I wrote it. > No worries, in an earlier version anon and file writeback were both blocked and I suspect that was in the back of your mind somewhere. > > > @@ -643,12 +639,14 @@ static noinline_for_stack void free_page > > > * shrink_page_list() returns the number of reclaimed pages > > > */ > > > static unsigned long shrink_page_list(struct list_head *page_list, > > > - struct scan_control *sc, > > > - enum pageout_io sync_writeback) > > > + struct scan_control *sc, > > > + enum pageout_io sync_writeback, > > > + int *dirty_seen) > > > { > > > LIST_HEAD(ret_pages); > > > LIST_HEAD(free_pages); > > > int pgactivate = 0; > > > + unsigned long nr_dirty = 0; > > > unsigned long nr_reclaimed = 0; > > > > > > cond_resched(); > > > @@ -657,7 +655,7 @@ static unsigned long shrink_page_list(st > > > enum page_references references; > > > struct address_space *mapping; > > > struct page *page; > > > - int may_enter_fs; > > > + int may_pageout; > > > > > > cond_resched(); > > > > > > @@ -681,10 +679,15 @@ static unsigned long shrink_page_list(st > > > if (page_mapped(page) || PageSwapCache(page)) > > > sc->nr_scanned++; > > > > > > - may_enter_fs = (sc->gfp_mask & __GFP_FS) || > > > + /* > > > + * To prevent stack overflows, only kswapd can enter > > > + * the filesystem. Swap IO is always fine (for now). > > > + */ > > > + may_pageout = current_is_kswapd() || > > > (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); > > > > > > > We lost the __GFP_FS check and it's vaguely possible kswapd could call the > > allocator with GFP_NOFS. While you check it before wait_on_page_writeback it > > needs to be checked before calling pageout(). I toyed around with > > creating a may_pageout that took everything into account but I couldn't > > convince myself there was no holes or serious change in functionality. > > Yeah, I checked balance_pgdat(), saw GFP_KERNEL and went for it. But > it's probably better to keep such dependencies out. > Ok. > > Ok, is this closer to what you had in mind? 
> > IMHO this is (almost) ready to get merged, so I am including the > nitpicking comments :-) > > > ==== CUT HERE ==== > > [PATCH] vmscan: Do not writeback filesystem pages in direct reclaim > > > > When memory is under enough pressure, a process may enter direct > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > encountered during the scan, this page is written to backing storage using > > mapping->writepage. This can result in very deep call stacks, particularly > > if the target storage or filesystem are complex. It has already been observed > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > if current is kswapd or the page is anonymous before writing back. If the > > dirty pages cannot be written back, they are placed back on the LRU lists > > for either background writing by the BDI threads or kswapd. If in direct > > lumpy reclaim and dirty pages are encountered, the process will stall for > > the background flusher before trying to reclaim the pages again. > > > > As the call-chain for writing anonymous pages is not expected to be deep > > and they are not cleaned by flusher threads, anonymous pages are still > > written back in direct reclaim. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > Acked-by: Rik van Riel <riel@redhat.com> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 6587155..e3a5816 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > [...] > > Does factoring pageout() still make sense in this patch? It does not > introduce a second callsite. > It's not necessary anymore and just obscures the patch. I collapsed it. > > @@ -639,18 +645,25 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > pagevec_free(&freed_pvec); > > } > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > +#define MAX_SWAP_CLEAN_WAIT 50 > > That's placed a bit randomly now that shrink_page_list() doesn't use > it anymore. I moved it just above shrink_inactive_list() but maybe it > would be better at the file's head? > I will move it to the top. > > /* > > * shrink_page_list() returns the number of reclaimed pages > > */ > > static unsigned long shrink_page_list(struct list_head *page_list, > > struct scan_control *sc, > > - enum pageout_io sync_writeback) > > + enum pageout_io sync_writeback, > > + unsigned long *nr_still_dirty) > > { > > - LIST_HEAD(ret_pages); > > LIST_HEAD(free_pages); > > - int pgactivate = 0; > > + LIST_HEAD(putback_pages); > > + LIST_HEAD(dirty_pages); > > + int pgactivate; > > + unsigned long nr_dirty = 0; > > unsigned long nr_reclaimed = 0; > > > > + pgactivate = 0; > > Spurious change? > Yes, was previously needed for the restart_dirty. Now it's a stupid change. > > cond_resched(); > > > > while (!list_empty(page_list)) { > > @@ -741,7 +754,18 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > } > > } > > > > - if (PageDirty(page)) { > > + if (PageDirty(page)) { > > Ha! > :) fixed. > > + /* > > + * Only kswapd can writeback filesystem pages to > > + * avoid risk of stack overflow > > + */ > > + if (page_is_file_cache(page) && !current_is_kswapd()) { > > + list_add(&page->lru, &dirty_pages); > > + unlock_page(page); > > + nr_dirty++; > > + goto keep_dirty; > > + } > > I don't understand why you keep the extra dirty list. Couldn't this > just be `goto keep_locked'? > Yep, because we are no longer looping to retry dirty pages. 
> > if (references == PAGEREF_RECLAIM_CLEAN) > > goto keep_locked; > > if (!may_enter_fs) > > @@ -852,13 +876,19 @@ activate_locked: > > keep_locked: > > unlock_page(page); > > keep: > > - list_add(&page->lru, &ret_pages); > > + list_add(&page->lru, &putback_pages); > > +keep_dirty: > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > } > > > > free_page_list(&free_pages); > > > > - list_splice(&ret_pages, page_list); > > + if (nr_dirty) { > > + *nr_still_dirty = nr_dirty; > > You either have to set *nr_still_dirty unconditionally or > (re)initialize the variable in shrink_inactive_list(). > Unconditionally happening now. > > + list_splice(&dirty_pages, page_list); > > + } > > + list_splice(&putback_pages, page_list); > > When we retry those pages, the dirty ones come last on the list. Was > this maybe the intention behind collecting dirties separately? > No, the intention was to only recycle dirty pages but it's not very important. > > @@ -1245,6 +1275,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > unsigned long nr_active; > > unsigned long nr_anon; > > unsigned long nr_file; > > + unsigned long nr_dirty; > > > > while (unlikely(too_many_isolated(zone, file, sc))) { > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > @@ -1293,26 +1324,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > > spin_unlock_irq(&zone->lru_lock); > > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > > + &nr_dirty); > > > > /* > > - * If we are direct reclaiming for contiguous pages and we do > > + * If specific pages are needed such as with direct reclaiming > > + * for contiguous pages or for memory containers and we do > > * not reclaim everything in the list, try again and wait > > - * for IO to complete. This will stall high-order allocations > > - * but that should be acceptable to the caller > > + * for IO to complete. This will stall callers that require > > + * specific pages but it should be acceptable to the caller > > */ > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > - sc->lumpy_reclaim_mode) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + if (sc->may_writepage && !current_is_kswapd() && > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > > > - /* > > - * The attempt at page out may have made some > > - * of the pages active, mark them inactive again. > > - */ > > - nr_active = clear_active_flags(&page_list, NULL); > > - count_vm_events(PGDEACTIVATE, nr_active); > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > Yup, minding laptop_mode (together with may_writepage). Agreed. > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > + /* > > + * The attempt at page out may have made some > > + * of the pages active, mark them inactive again. > > + */ > > + nr_active = clear_active_flags(&page_list, NULL); > > + count_vm_events(PGDEACTIVATE, nr_active); > > + > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > + PAGEOUT_IO_SYNC, &nr_dirty); > > + } > > } > > > > local_irq_disable(); > Here is an updated version. Thanks very much ==== CUT HERE ==== vmscan: Do not writeback filesystem pages in direct reclaim When memory is under enough pressure, a process may enter direct reclaim to free pages in the same manner kswapd does. 
If a dirty page is encountered during the scan, this page is written to backing storage using mapping->writepage. This can result in very deep call stacks, particularly if the target storage or filesystem are complex. It has already been observed on XFS that the stack overflows but the problem is not XFS-specific. This patch prevents direct reclaim writing back filesystem pages by checking if current is kswapd or the page is anonymous before writing back. If the dirty pages cannot be written back, they are placed back on the LRU lists for either background writing by the BDI threads or kswapd. If in direct lumpy reclaim and dirty pages are encountered, the process will stall for the background flusher before trying to reclaim the pages again. As the call-chain for writing anonymous pages is not expected to be deep and they are not cleaned by flusher threads, anonymous pages are still written back in direct reclaim. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> --- mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++---------------- 1 files changed, 39 insertions(+), 16 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 6587155..45d9934 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem); #define scanning_global_lru(sc) (1) #endif +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ +#define MAX_SWAP_CLEAN_WAIT 50 + static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, struct scan_control *sc) { @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) */ static unsigned long shrink_page_list(struct list_head *page_list, struct scan_control *sc, - enum pageout_io sync_writeback) + enum pageout_io sync_writeback, + unsigned long *nr_still_dirty) { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); int pgactivate = 0; + unsigned long nr_dirty = 0; unsigned long nr_reclaimed = 0; cond_resched(); @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (PageDirty(page)) { + /* + * Only kswapd can writeback filesystem pages to + * avoid risk of stack overflow + */ + if (page_is_file_cache(page) && !current_is_kswapd()) { + nr_dirty++; + goto keep_locked; + } + if (references == PAGEREF_RECLAIM_CLEAN) goto keep_locked; if (!may_enter_fs) @@ -858,7 +872,7 @@ keep: free_page_list(&free_pages); - list_splice(&ret_pages, page_list); + *nr_still_dirty = nr_dirty; count_vm_events(PGACTIVATE, pgactivate); return nr_reclaimed; } @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, unsigned long nr_active; unsigned long nr_anon; unsigned long nr_file; + unsigned long nr_dirty; while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, spin_unlock_irq(&zone->lru_lock); - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, + &nr_dirty); /* - * If we are direct reclaiming for contiguous pages and we do + * If specific pages are needed such as with direct reclaiming + * for contiguous pages or for memory containers and we do * not reclaim everything in the list, try again and wait - * for IO to complete. This will stall high-order allocations - * but that should be acceptable to the caller + * for IO to complete. 
This will stall callers that require + * specific pages but it should be acceptable to the caller */ - if (nr_reclaimed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { - congestion_wait(BLK_RW_ASYNC, HZ/10); + if (sc->may_writepage && !current_is_kswapd() && + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { + int dirty_retry = MAX_SWAP_CLEAN_WAIT; - /* - * The attempt at page out may have made some - * of the pages active, mark them inactive again. - */ - nr_active = clear_active_flags(&page_list, NULL); - count_vm_events(PGDEACTIVATE, nr_active); + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); + congestion_wait(BLK_RW_ASYNC, HZ/10); - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); + /* + * The attempt at page out may have made some + * of the pages active, mark them inactive again. + */ + nr_active = clear_active_flags(&page_list, NULL); + count_vm_events(PGDEACTIVATE, nr_active); + + nr_reclaimed += shrink_page_list(&page_list, sc, + PAGEOUT_IO_SYNC, &nr_dirty); + } } local_irq_disable(); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
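As a rough userspace model of the throttling loop in the patch above, the pattern is: retry up to MAX_SWAP_CLEAN_WAIT times, each time kicking the flusher and sleeping about HZ/10, for a worst case of roughly 5 seconds. The simulated flusher and the page counts below are invented for the demo and are not the kernel behaviour.

/*
 * Sketch of the throttling pattern in the patch above, modelled in
 * userspace: a direct reclaimer that needs specific pages retries up to
 * MAX_SWAP_CLEAN_WAIT times, each time asking a (simulated) flusher for
 * help and sleeping roughly HZ/10, i.e. about 5 seconds worst case.
 */
#include <stdio.h>
#include <unistd.h>

#define MAX_SWAP_CLEAN_WAIT 50

static unsigned long nr_dirty = 8;	/* dirty pages blocking the reclaimer */

/* stand-in for wakeup_flusher_threads(): cleans a few pages per call */
static void fake_wakeup_flusher_threads(unsigned long want)
{
	unsigned long cleaned = want < 3 ? want : 3;

	nr_dirty -= cleaned < nr_dirty ? cleaned : nr_dirty;
}

int main(void)
{
	unsigned long nr_taken = 32, nr_reclaimed = 24;	/* 8 pages still dirty */
	int dirty_retry = MAX_SWAP_CLEAN_WAIT;

	while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
		fake_wakeup_flusher_threads(nr_dirty);
		usleep(100 * 1000);		/* ~HZ/10 with HZ=100 */
		nr_reclaimed = nr_taken - nr_dirty; /* retry reclaim of now-clean pages */
		printf("retry %d: nr_dirty=%lu nr_reclaimed=%lu\n",
		       MAX_SWAP_CLEAN_WAIT - dirty_retry, nr_dirty, nr_reclaimed);
	}
	return 0;
}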
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 13:38 ` Mel Gorman @ 2010-07-21 14:28 ` Johannes Weiner 2010-07-21 14:31 ` Mel Gorman 2010-07-26 8:29 ` Wu Fengguang 1 sibling, 1 reply; 87+ messages in thread From: Johannes Weiner @ 2010-07-21 14:28 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote: > Here is an updated version. Thanks very much > > ==== CUT HERE ==== > vmscan: Do not writeback filesystem pages in direct reclaim > > When memory is under enough pressure, a process may enter direct > reclaim to free pages in the same manner kswapd does. If a dirty page is > encountered during the scan, this page is written to backing storage using > mapping->writepage. This can result in very deep call stacks, particularly > if the target storage or filesystem are complex. It has already been observed > on XFS that the stack overflows but the problem is not XFS-specific. > > This patch prevents direct reclaim writing back filesystem pages by checking > if current is kswapd or the page is anonymous before writing back. If the > dirty pages cannot be written back, they are placed back on the LRU lists > for either background writing by the BDI threads or kswapd. If in direct > lumpy reclaim and dirty pages are encountered, the process will stall for > the background flusher before trying to reclaim the pages again. > > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > Acked-by: Rik van Riel <riel@redhat.com> Cool! Except for one last tiny thing... > @@ -858,7 +872,7 @@ keep: > > free_page_list(&free_pages); > > - list_splice(&ret_pages, page_list); This will lose all retry pages forever, I think. > + *nr_still_dirty = nr_dirty; > count_vm_events(PGACTIVATE, pgactivate); > return nr_reclaimed; > } Otherwise, Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 14:28 ` Johannes Weiner @ 2010-07-21 14:31 ` Mel Gorman 2010-07-21 14:39 ` Johannes Weiner 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-21 14:31 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote: > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote: > > Here is an updated version. Thanks very much > > > > ==== CUT HERE ==== > > vmscan: Do not writeback filesystem pages in direct reclaim > > > > When memory is under enough pressure, a process may enter direct > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > encountered during the scan, this page is written to backing storage using > > mapping->writepage. This can result in very deep call stacks, particularly > > if the target storage or filesystem are complex. It has already been observed > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > if current is kswapd or the page is anonymous before writing back. If the > > dirty pages cannot be written back, they are placed back on the LRU lists > > for either background writing by the BDI threads or kswapd. If in direct > > lumpy reclaim and dirty pages are encountered, the process will stall for > > the background flusher before trying to reclaim the pages again. > > > > As the call-chain for writing anonymous pages is not expected to be deep > > and they are not cleaned by flusher threads, anonymous pages are still > > written back in direct reclaim. > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > Acked-by: Rik van Riel <riel@redhat.com> > > Cool! > > Except for one last tiny thing... > > > @@ -858,7 +872,7 @@ keep: > > > > free_page_list(&free_pages); > > > > - list_splice(&ret_pages, page_list); > > This will lose all retry pages forever, I think. > Above this is while (!list_empty(page_list)) { ... } page_list should be empty and keep_locked is putting the pages on ret_pages already so I think it's ok. > > + *nr_still_dirty = nr_dirty; > > count_vm_events(PGACTIVATE, pgactivate); > > return nr_reclaimed; > > } > > Otherwise, > Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> > Thanks! -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 14:31 ` Mel Gorman @ 2010-07-21 14:39 ` Johannes Weiner 2010-07-21 15:06 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Johannes Weiner @ 2010-07-21 14:39 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 03:31:19PM +0100, Mel Gorman wrote: > On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote: > > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote: > > > @@ -858,7 +872,7 @@ keep: > > > > > > free_page_list(&free_pages); > > > > > > - list_splice(&ret_pages, page_list); > > > > This will lose all retry pages forever, I think. > > > > Above this is > > while (!list_empty(page_list)) { > ... > } > > page_list should be empty and keep_locked is putting the pages on ret_pages > already so I think it's ok. But ret_pages is function-local. Putting them back on the then-empty page_list is to give them back to the caller, otherwise they are lost in a dead stack slot. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 14:39 ` Johannes Weiner @ 2010-07-21 15:06 ` Mel Gorman 0 siblings, 0 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-21 15:06 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Wed, Jul 21, 2010 at 04:39:56PM +0200, Johannes Weiner wrote: > On Wed, Jul 21, 2010 at 03:31:19PM +0100, Mel Gorman wrote: > > On Wed, Jul 21, 2010 at 04:28:44PM +0200, Johannes Weiner wrote: > > > On Wed, Jul 21, 2010 at 02:38:57PM +0100, Mel Gorman wrote: > > > > @@ -858,7 +872,7 @@ keep: > > > > > > > > free_page_list(&free_pages); > > > > > > > > - list_splice(&ret_pages, page_list); > > > > > > This will lose all retry pages forever, I think. > > > > > > > Above this is > > > > while (!list_empty(page_list)) { > > ... > > } > > > > page_list should be empty and keep_locked is putting the pages on ret_pages > > already so I think it's ok. > > But ret_pages is function-local. Putting them back on the then-empty > page_list is to give them back to the caller, otherwise they are lost > in a dead stack slot. > Bah, you're right, it is repaired now. /me slaps self. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 87+ messages in thread
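For readers following along, the bug being repaired in this exchange can be shown with a small userspace C model: pages that are "kept" go onto a function-local list, and unless that list is spliced back onto the caller's list before returning, the caller simply loses them. Plain singly linked lists stand in for the kernel's list_head here; none of this is the kernel code itself.

/*
 * Userspace model of the list_splice() bug: the "kept" pages live on a
 * function-local list, so forgetting to splice them back onto the
 * caller's (now empty) list leaves them unreachable by the caller.
 */
#include <stdio.h>
#include <stdlib.h>

struct node {
	int id;
	struct node *next;
};

static struct node *make(int id, struct node *next)
{
	struct node *n = malloc(sizeof(*n));

	n->id = id;
	n->next = next;
	return n;
}

/* moves every node onto a local "kept" list; splice_back decides its fate */
static void shrink_list(struct node **page_list, int splice_back)
{
	struct node *kept = NULL;

	while (*page_list) {
		struct node *n = *page_list;

		*page_list = n->next;
		n->next = kept;
		kept = n;		/* analogue of list_add(..., &ret_pages) */
	}
	if (splice_back)
		*page_list = kept;	/* analogue of list_splice(&ret_pages, page_list) */
	/* else: the kept pages die with this stack frame and are lost */
}

static int count(struct node *l)
{
	int c = 0;

	for (; l; l = l->next)
		c++;
	return c;
}

int main(void)
{
	struct node *list = make(1, make(2, make(3, NULL)));

	shrink_list(&list, 0);
	printf("without splice: caller sees %d pages\n", count(list));

	list = make(1, make(2, make(3, NULL)));
	shrink_list(&list, 1);
	printf("with splice:    caller sees %d pages\n", count(list));
	return 0;
}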
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-21 13:38 ` Mel Gorman 2010-07-21 14:28 ` Johannes Weiner @ 2010-07-26 8:29 ` Wu Fengguang 2010-07-26 9:12 ` Mel Gorman 1 sibling, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 8:29 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli > ==== CUT HERE ==== > vmscan: Do not writeback filesystem pages in direct reclaim > > When memory is under enough pressure, a process may enter direct > reclaim to free pages in the same manner kswapd does. If a dirty page is > encountered during the scan, this page is written to backing storage using > mapping->writepage. This can result in very deep call stacks, particularly > if the target storage or filesystem are complex. It has already been observed > on XFS that the stack overflows but the problem is not XFS-specific. > > This patch prevents direct reclaim writing back filesystem pages by checking > if current is kswapd or the page is anonymous before writing back. If the > dirty pages cannot be written back, they are placed back on the LRU lists > for either background writing by the BDI threads or kswapd. If in direct > lumpy reclaim and dirty pages are encountered, the process will stall for > the background flusher before trying to reclaim the pages again. > > As the call-chain for writing anonymous pages is not expected to be deep > and they are not cleaned by flusher threads, anonymous pages are still > written back in direct reclaim. This is also a good step towards reducing pageout() calls. For better IO performance the flusher threads should take more work from pageout(). 
> Signed-off-by: Mel Gorman <mel@csn.ul.ie> > Acked-by: Rik van Riel <riel@redhat.com> > --- > mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++---------------- > 1 files changed, 39 insertions(+), 16 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 6587155..45d9934 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem); > #define scanning_global_lru(sc) (1) > #endif > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > +#define MAX_SWAP_CLEAN_WAIT 50 > + > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, > struct scan_control *sc) > { > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > */ > static unsigned long shrink_page_list(struct list_head *page_list, > struct scan_control *sc, > - enum pageout_io sync_writeback) > + enum pageout_io sync_writeback, > + unsigned long *nr_still_dirty) > { > LIST_HEAD(ret_pages); > LIST_HEAD(free_pages); > int pgactivate = 0; > + unsigned long nr_dirty = 0; > unsigned long nr_reclaimed = 0; > > cond_resched(); > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, > } > > if (PageDirty(page)) { > + /* > + * Only kswapd can writeback filesystem pages to > + * avoid risk of stack overflow > + */ > + if (page_is_file_cache(page) && !current_is_kswapd()) { > + nr_dirty++; > + goto keep_locked; > + } > + > if (references == PAGEREF_RECLAIM_CLEAN) > goto keep_locked; > if (!may_enter_fs) > @@ -858,7 +872,7 @@ keep: > > free_page_list(&free_pages); > > - list_splice(&ret_pages, page_list); > + *nr_still_dirty = nr_dirty; > count_vm_events(PGACTIVATE, pgactivate); > return nr_reclaimed; > } > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > unsigned long nr_active; > unsigned long nr_anon; > unsigned long nr_file; > + unsigned long nr_dirty; > > while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > spin_unlock_irq(&zone->lru_lock); > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > + &nr_dirty); > > /* > - * If we are direct reclaiming for contiguous pages and we do > + * If specific pages are needed such as with direct reclaiming > + * for contiguous pages or for memory containers and we do > * not reclaim everything in the list, try again and wait > - * for IO to complete. This will stall high-order allocations > - * but that should be acceptable to the caller > + * for IO to complete. This will stall callers that require > + * specific pages but it should be acceptable to the caller > */ > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + if (sc->may_writepage && !current_is_kswapd() && > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > - /* > - * The attempt at page out may have made some > - * of the pages active, mark them inactive again. > - */ > - nr_active = clear_active_flags(&page_list, NULL); > - count_vm_events(PGDEACTIVATE, nr_active); > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > + wakeup_flusher_threads(laptop_mode ? 
0 : nr_dirty); > + congestion_wait(BLK_RW_ASYNC, HZ/10); It needs good luck for the flusher threads to "happen to" sync the dirty pages in our page_list. I'd rather take the logic as "there are too many dirty pages, shrink them to avoid some future pageout() calls and/or congestion_wait() stalls". So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it? > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > + /* > + * The attempt at page out may have made some > + * of the pages active, mark them inactive again. > + */ > + nr_active = clear_active_flags(&page_list, NULL); > + count_vm_events(PGDEACTIVATE, nr_active); > + > + nr_reclaimed += shrink_page_list(&page_list, sc, > + PAGEOUT_IO_SYNC, &nr_dirty); This shrink_page_list() won't be called at all if nr_dirty==0 and pageout() was called. This is a change of behavior. It can also be fixed by removing the loop. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-26 8:29 ` Wu Fengguang @ 2010-07-26 9:12 ` Mel Gorman 2010-07-26 11:19 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-26 9:12 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote: > > ==== CUT HERE ==== > > vmscan: Do not writeback filesystem pages in direct reclaim > > > > When memory is under enough pressure, a process may enter direct > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > encountered during the scan, this page is written to backing storage using > > mapping->writepage. This can result in very deep call stacks, particularly > > if the target storage or filesystem are complex. It has already been observed > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > if current is kswapd or the page is anonymous before writing back. If the > > dirty pages cannot be written back, they are placed back on the LRU lists > > for either background writing by the BDI threads or kswapd. If in direct > > lumpy reclaim and dirty pages are encountered, the process will stall for > > the background flusher before trying to reclaim the pages again. > > > > As the call-chain for writing anonymous pages is not expected to be deep > > and they are not cleaned by flusher threads, anonymous pages are still > > written back in direct reclaim. > > This is also a good step towards reducing pageout() calls. For better > IO performance the flusher threads should take more work from pageout(). > This is true for better IO performance all right but reclaim does require specific pages cleaned. The strict requirement is when lumpy reclaim is involved but a looser requirement is when any pages within a zone be cleaned. 
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > Acked-by: Rik van Riel <riel@redhat.com> > > --- > > mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++---------------- > > 1 files changed, 39 insertions(+), 16 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 6587155..45d9934 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem); > > #define scanning_global_lru(sc) (1) > > #endif > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > +#define MAX_SWAP_CLEAN_WAIT 50 > > + > > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, > > struct scan_control *sc) > > { > > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > */ > > static unsigned long shrink_page_list(struct list_head *page_list, > > struct scan_control *sc, > > - enum pageout_io sync_writeback) > > + enum pageout_io sync_writeback, > > + unsigned long *nr_still_dirty) > > { > > LIST_HEAD(ret_pages); > > LIST_HEAD(free_pages); > > int pgactivate = 0; > > + unsigned long nr_dirty = 0; > > unsigned long nr_reclaimed = 0; > > > > cond_resched(); > > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > } > > > > if (PageDirty(page)) { > > + /* > > + * Only kswapd can writeback filesystem pages to > > + * avoid risk of stack overflow > > + */ > > + if (page_is_file_cache(page) && !current_is_kswapd()) { > > + nr_dirty++; > > + goto keep_locked; > > + } > > + > > if (references == PAGEREF_RECLAIM_CLEAN) > > goto keep_locked; > > if (!may_enter_fs) > > @@ -858,7 +872,7 @@ keep: > > > > free_page_list(&free_pages); > > > > - list_splice(&ret_pages, page_list); > > + *nr_still_dirty = nr_dirty; > > count_vm_events(PGACTIVATE, pgactivate); > > return nr_reclaimed; > > } > > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > unsigned long nr_active; > > unsigned long nr_anon; > > unsigned long nr_file; > > + unsigned long nr_dirty; > > > > while (unlikely(too_many_isolated(zone, file, sc))) { > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > > spin_unlock_irq(&zone->lru_lock); > > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > > + &nr_dirty); > > > > /* > > - * If we are direct reclaiming for contiguous pages and we do > > + * If specific pages are needed such as with direct reclaiming > > + * for contiguous pages or for memory containers and we do > > * not reclaim everything in the list, try again and wait > > - * for IO to complete. This will stall high-order allocations > > - * but that should be acceptable to the caller > > + * for IO to complete. This will stall callers that require > > + * specific pages but it should be acceptable to the caller > > */ > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > - sc->lumpy_reclaim_mode) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + if (sc->may_writepage && !current_is_kswapd() && > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > > > - /* > > - * The attempt at page out may have made some > > - * of the pages active, mark them inactive again. 
> > - */ > > - nr_active = clear_active_flags(&page_list, NULL); > > - count_vm_events(PGDEACTIVATE, nr_active); > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > It needs good luck for the flusher threads to "happen to" sync the > dirty pages in our page_list. That is why I'm expecting the "shrink oldest inode" patchset to help. It still requires a certain amount of luck but callers that encounter dirty pages will be delayed. It's also because a certain amount of luck is required that the last patch in the series aims at reducing the number of dirty pages encountered by reclaim. The closer that is to 0, the less important the timing of flusher threads is. > I'd rather take the logic as "there are > too many dirty pages, shrink them to avoid some future pageout() calls > and/or congestion_wait() stalls". > What do you mean by shrink them? They cannot be reclaimed until they are clean. > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it? > This loop only applies to direct reclaimers in lumpy reclaim mode and memory containers. Both need specific pages to be cleaned and freed. Hence, the loop is to stall them and wait on flusher threads up to a point. Otherwise they can cause a reclaim storm of clean pages that can't be used. Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached but I am inferring this from timing data rather than a direct measurement. > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > + /* > > + * The attempt at page out may have made some > > + * of the pages active, mark them inactive again. > > + */ > > + nr_active = clear_active_flags(&page_list, NULL); > > + count_vm_events(PGDEACTIVATE, nr_active); > > + > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > + PAGEOUT_IO_SYNC, &nr_dirty); > > This shrink_page_list() won't be called at all if nr_dirty==0 and > pageout() was called. This is a change of behavior. It can also be > fixed by removing the loop. > The whole patch is a change of behaviour but in this case it also makes sense to focus on just the dirty pages. The first shrink_page_list decided that the pages could not be unmapped and reclaimed - probably because it was referenced. This is not likely to change during the loop. Testing with a version of the patch that processed the full list added significant stalls when sync writeback was involved. Testing time length was tripled in one case implying that this loop was continually reaching MAX_SWAP_CLEAN_WAIT. The intention of this loop is "wait on dirty pages to be cleaned" and it's a change of behaviour, but one that makes sense and testing indicates it's a good idea. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-26 9:12 ` Mel Gorman @ 2010-07-26 11:19 ` Wu Fengguang 2010-07-26 12:53 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 11:19 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote: > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote: > > > ==== CUT HERE ==== > > > vmscan: Do not writeback filesystem pages in direct reclaim > > > > > > When memory is under enough pressure, a process may enter direct > > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > > encountered during the scan, this page is written to backing storage using > > > mapping->writepage. This can result in very deep call stacks, particularly > > > if the target storage or filesystem are complex. It has already been observed > > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > > if current is kswapd or the page is anonymous before writing back. If the > > > dirty pages cannot be written back, they are placed back on the LRU lists > > > for either background writing by the BDI threads or kswapd. If in direct > > > lumpy reclaim and dirty pages are encountered, the process will stall for > > > the background flusher before trying to reclaim the pages again. > > > > > > As the call-chain for writing anonymous pages is not expected to be deep > > > and they are not cleaned by flusher threads, anonymous pages are still > > > written back in direct reclaim. > > > > This is also a good step towards reducing pageout() calls. For better > > IO performance the flusher threads should take more work from pageout(). > > > > This is true for better IO performance all right but reclaim does require > specific pages cleaned. The strict requirement is when lumpy reclaim is > involved but a looser requirement is when any pages within a zone be cleaned. Good point, I missed the lumpy reclaim requirement. It seems necessary to add a call to the flusher thread to writeback a specific inode range (that contains the current dirty page). This is a more reliable way to ensure both the strict and looser requirements: the current dirty page will guaranteed to be synced, and the inode will have good opportunity to contain more dirty pages in the zone, which can be freed quickly if tagged PG_reclaim. 
> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > Acked-by: Rik van Riel <riel@redhat.com> > > > --- > > > mm/vmscan.c | 55 +++++++++++++++++++++++++++++++++++++++---------------- > > > 1 files changed, 39 insertions(+), 16 deletions(-) > > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > > index 6587155..45d9934 100644 > > > --- a/mm/vmscan.c > > > +++ b/mm/vmscan.c > > > @@ -139,6 +139,9 @@ static DECLARE_RWSEM(shrinker_rwsem); > > > #define scanning_global_lru(sc) (1) > > > #endif > > > > > > +/* Direct lumpy reclaim waits up to 5 seconds for background cleaning */ > > > +#define MAX_SWAP_CLEAN_WAIT 50 > > > + > > > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone, > > > struct scan_control *sc) > > > { > > > @@ -644,11 +647,13 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages) > > > */ > > > static unsigned long shrink_page_list(struct list_head *page_list, > > > struct scan_control *sc, > > > - enum pageout_io sync_writeback) > > > + enum pageout_io sync_writeback, > > > + unsigned long *nr_still_dirty) > > > { > > > LIST_HEAD(ret_pages); > > > LIST_HEAD(free_pages); > > > int pgactivate = 0; > > > + unsigned long nr_dirty = 0; > > > unsigned long nr_reclaimed = 0; > > > > > > cond_resched(); > > > @@ -742,6 +747,15 @@ static unsigned long shrink_page_list(struct list_head *page_list, > > > } > > > > > > if (PageDirty(page)) { > > > + /* > > > + * Only kswapd can writeback filesystem pages to > > > + * avoid risk of stack overflow > > > + */ > > > + if (page_is_file_cache(page) && !current_is_kswapd()) { > > > + nr_dirty++; > > > + goto keep_locked; > > > + } > > > + > > > if (references == PAGEREF_RECLAIM_CLEAN) > > > goto keep_locked; > > > if (!may_enter_fs) > > > @@ -858,7 +872,7 @@ keep: > > > > > > free_page_list(&free_pages); > > > > > > - list_splice(&ret_pages, page_list); > > > + *nr_still_dirty = nr_dirty; > > > count_vm_events(PGACTIVATE, pgactivate); > > > return nr_reclaimed; > > > } > > > @@ -1245,6 +1259,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > unsigned long nr_active; > > > unsigned long nr_anon; > > > unsigned long nr_file; > > > + unsigned long nr_dirty; > > > > > > while (unlikely(too_many_isolated(zone, file, sc))) { > > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > > > > spin_unlock_irq(&zone->lru_lock); > > > > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > > > + &nr_dirty); > > > > > > /* > > > - * If we are direct reclaiming for contiguous pages and we do > > > + * If specific pages are needed such as with direct reclaiming > > > + * for contiguous pages or for memory containers and we do > > > * not reclaim everything in the list, try again and wait > > > - * for IO to complete. This will stall high-order allocations > > > - * but that should be acceptable to the caller > > > + * for IO to complete. 
This will stall callers that require > > > + * specific pages but it should be acceptable to the caller > > > */ > > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > > - sc->lumpy_reclaim_mode) { > > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > > + if (sc->may_writepage && !current_is_kswapd() && > > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > > > > > - /* > > > - * The attempt at page out may have made some > > > - * of the pages active, mark them inactive again. > > > - */ > > > - nr_active = clear_active_flags(&page_list, NULL); > > > - count_vm_events(PGDEACTIVATE, nr_active); > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > It needs good luck for the flusher threads to "happen to" sync the > > dirty pages in our page_list. > > That is why I'm expecting the "shrink oldest inode" patchset to help. It > still requires a certain amount of luck but callers that encounter dirty > pages will be delayed. > > It's also because a certain amount of luck is required that the last patch > in the series aims at reducing the number of dirty pages encountered by > reclaim. The closer that is to 0, the less important the timing of flusher > threads is. OK. > > I'd rather take the logic as "there are > > too many dirty pages, shrink them to avoid some future pageout() calls > > and/or congestion_wait() stalls". > > > > What do you mean by shrink them? They cannot be reclaimed until they are > clean. I mean we are freeing much more than nr_dirty pages. In this sense we are shrinking the number of dirty pages. Note that we are calling wakeup_flusher_threads(nr_dirty), however the real synced pages will be much more than nr_dirty, that is reasonable good behavior. > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it? > > > > This loop only applies to direct reclaimers in lumpy reclaim mode and > memory containers. Both need specific pages to be cleaned and freed. > Hence, the loop is to stall them and wait on flusher threads up to a > point. Otherwise they can cause a reclaim storm of clean pages that > can't be used. Agreed. We could call the flusher to sync the inode explicitly, as recommended above. This will clean and free (with PG_reclaim) the page in seconds. With reasonable waits here we may avoid reclaim storm effectively. > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached > but I am inferring this from timing data rather than a direct measurement. > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > + /* > > > + * The attempt at page out may have made some > > > + * of the pages active, mark them inactive again. > > > + */ > > > + nr_active = clear_active_flags(&page_list, NULL); > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > + > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > > This shrink_page_list() won't be called at all if nr_dirty==0 and > > pageout() was called. This is a change of behavior. It can also be > > fixed by removing the loop. > > > > The whole patch is a change of behaviour but in this case it also makes > sense to focus on just the dirty pages. The first shrink_page_list > decided that the pages could not be unmapped and reclaimed - probably > because it was referenced. This is not likely to change during the loop. Agreed. 
> Testing with a version of the patch that processed the full list added > significant stalls when sync writeback was involved. Testing time length > was tripled in one case implying that this loop was continually reaching > MAX_SWAP_CLEAN_WAIT. I'm OK with the change actually, this removes one not-that-user-friendly wait_on_page_writeback(). > The intention of this loop is "wait on dirty pages to be cleaned" and > it's a change of behaviour, but one that makes sense and testing > indicates it's a good idea. I mean, this loop may be unwinded. And we may need another loop to sync the inodes that contains the dirty pages. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
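The "sync a specific inode range" idea has a rough userspace analogue in sync_file_range(2), shown below purely as an illustration of range-scoped writeback. It is not the flusher-thread interface being discussed for the kernel, and the file path is an assumption of the demo.

/*
 * Userspace illustration only: start writeback for one byte range of a
 * single file, rather than flushing everything, which is the same basic
 * shape as the range-scoped inode writeback suggested above.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "testfile";
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* kick off writeback for the first 1 MiB of this file only */
	if (sync_file_range(fd, 0, 1 << 20, SYNC_FILE_RANGE_WRITE) < 0)
		perror("sync_file_range");

	close(fd);
	return 0;
}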
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-26 11:19 ` Wu Fengguang @ 2010-07-26 12:53 ` Mel Gorman 2010-07-26 13:03 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-26 12:53 UTC (permalink / raw) To: Wu Fengguang Cc: Johannes Weiner, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 07:19:53PM +0800, Wu Fengguang wrote: > On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote: > > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote: > > > > ==== CUT HERE ==== > > > > vmscan: Do not writeback filesystem pages in direct reclaim > > > > > > > > When memory is under enough pressure, a process may enter direct > > > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > > > encountered during the scan, this page is written to backing storage using > > > > mapping->writepage. This can result in very deep call stacks, particularly > > > > if the target storage or filesystem are complex. It has already been observed > > > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > > > if current is kswapd or the page is anonymous before writing back. If the > > > > dirty pages cannot be written back, they are placed back on the LRU lists > > > > for either background writing by the BDI threads or kswapd. If in direct > > > > lumpy reclaim and dirty pages are encountered, the process will stall for > > > > the background flusher before trying to reclaim the pages again. > > > > > > > > As the call-chain for writing anonymous pages is not expected to be deep > > > > and they are not cleaned by flusher threads, anonymous pages are still > > > > written back in direct reclaim. > > > > > > This is also a good step towards reducing pageout() calls. For better > > > IO performance the flusher threads should take more work from pageout(). > > > > > > > This is true for better IO performance all right but reclaim does require > > specific pages cleaned. The strict requirement is when lumpy reclaim is > > involved but a looser requirement is when any pages within a zone be cleaned. > > Good point, I missed the lumpy reclaim requirement. It seems necessary > to add a call to the flusher thread to writeback a specific inode range > (that contains the current dirty page). This is a more reliable way to > ensure both the strict and looser requirements: the current dirty page > will guaranteed to be synced, and the inode will have good opportunity > to contain more dirty pages in the zone, which can be freed quickly if > tagged PG_reclaim. > I'm not sure about an inode range. The window being considered is quite small and we might select ranges that are too small to be useful. However, taking the inodes into account makes sense. If wakeup_flusher_thread took a list of unique inodes that own dirty pages encountered by reclaim, it would then move inodes to the head of the queue rather than depending just on expired. 
> > > > <SNIP> > > > > > > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > > > > > > spin_unlock_irq(&zone->lru_lock); > > > > > > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > > > > + &nr_dirty); > > > > > > > > /* > > > > - * If we are direct reclaiming for contiguous pages and we do > > > > + * If specific pages are needed such as with direct reclaiming > > > > + * for contiguous pages or for memory containers and we do > > > > * not reclaim everything in the list, try again and wait > > > > - * for IO to complete. This will stall high-order allocations > > > > - * but that should be acceptable to the caller > > > > + * for IO to complete. This will stall callers that require > > > > + * specific pages but it should be acceptable to the caller > > > > */ > > > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > > > - sc->lumpy_reclaim_mode) { > > > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > + if (sc->may_writepage && !current_is_kswapd() && > > > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > > > > > > > - /* > > > > - * The attempt at page out may have made some > > > > - * of the pages active, mark them inactive again. > > > > - */ > > > > - nr_active = clear_active_flags(&page_list, NULL); > > > > - count_vm_events(PGDEACTIVATE, nr_active); > > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > It needs good luck for the flusher threads to "happen to" sync the > > > dirty pages in our page_list. > > > > That is why I'm expecting the "shrink oldest inode" patchset to help. It > > still requires a certain amount of luck but callers that encounter dirty > > pages will be delayed. > > > > It's also because a certain amount of luck is required that the last patch > > in the series aims at reducing the number of dirty pages encountered by > > reclaim. The closer that is to 0, the less important the timing of flusher > > threads is. > > OK. > > > > I'd rather take the logic as "there are > > > too many dirty pages, shrink them to avoid some future pageout() calls > > > and/or congestion_wait() stalls". > > > > > > > What do you mean by shrink them? They cannot be reclaimed until they are > > clean. > > I mean we are freeing much more than nr_dirty pages. In this sense we > are shrinking the number of dirty pages. Note that we are calling > wakeup_flusher_threads(nr_dirty), however the real synced pages will > be much more than nr_dirty, that is reasonable good behavior. > Ok. > > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it? > > > > > > > This loop only applies to direct reclaimers in lumpy reclaim mode and > > memory containers. Both need specific pages to be cleaned and freed. > > Hence, the loop is to stall them and wait on flusher threads up to a > > point. Otherwise they can cause a reclaim storm of clean pages that > > can't be used. > > Agreed. We could call the flusher to sync the inode explicitly, as > recommended above. This will clean and free (with PG_reclaim) the page > in seconds. With reasonable waits here we may avoid reclaim storm > effectively. > I'll follow this suggestion as a new patch. 
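For reference, stripped of the nested quoting, the retry loop under discussion is roughly:

	if (sc->may_writepage && !current_is_kswapd() &&
	    (sc->lumpy_reclaim_mode || sc->mem_cgroup)) {
		int dirty_retry = MAX_SWAP_CLEAN_WAIT;

		while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) {
			/* ask the flusher threads to clean pages, then wait */
			wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty);
			congestion_wait(BLK_RW_ASYNC, HZ/10);

			/* pageout() may have re-activated some pages */
			nr_active = clear_active_flags(&page_list, NULL);
			count_vm_events(PGDEACTIVATE, nr_active);

			nr_reclaimed += shrink_page_list(&page_list, sc,
						PAGEOUT_IO_SYNC, &nr_dirty);
		}
	}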
> > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached > > but I am inferring this from timing data rather than a direct measurement. > > > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > > + /* > > > > + * The attempt at page out may have made some > > > > + * of the pages active, mark them inactive again. > > > > + */ > > > > + nr_active = clear_active_flags(&page_list, NULL); > > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > > + > > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > > > > This shrink_page_list() won't be called at all if nr_dirty==0 and > > > pageout() was called. This is a change of behavior. It can also be > > > fixed by removing the loop. > > > > > > > The whole patch is a change of behaviour but in this case it also makes > > sense to focus on just the dirty pages. The first shrink_page_list > > decided that the pages could not be unmapped and reclaimed - probably > > because it was referenced. This is not likely to change during the loop. > > Agreed. > > > Testing with a version of the patch that processed the full list added > > significant stalls when sync writeback was involved. Testing time length > > was tripled in one case implying that this loop was continually reaching > > MAX_SWAP_CLEAN_WAIT. > > I'm OK with the change actually, this removes one not-that-user-friendly > wait_on_page_writeback(). > > > The intention of this loop is "wait on dirty pages to be cleaned" and > > it's a change of behaviour, but one that makes sense and testing > > indicates it's a good idea. > > I mean, this loop may be unwinded. And we may need another loop to > sync the inodes that contains the dirty pages. > I'm not quite sure what you mean here but I think it might tie into the idea of passing a list of inodes to the flusher threads. Lets see what that ends up looking like. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim 2010-07-26 12:53 ` Mel Gorman @ 2010-07-26 13:03 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 13:03 UTC (permalink / raw) To: Mel Gorman Cc: Johannes Weiner, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 08:53:26PM +0800, Mel Gorman wrote: > On Mon, Jul 26, 2010 at 07:19:53PM +0800, Wu Fengguang wrote: > > On Mon, Jul 26, 2010 at 05:12:27PM +0800, Mel Gorman wrote: > > > On Mon, Jul 26, 2010 at 04:29:35PM +0800, Wu Fengguang wrote: > > > > > ==== CUT HERE ==== > > > > > vmscan: Do not writeback filesystem pages in direct reclaim > > > > > > > > > > When memory is under enough pressure, a process may enter direct > > > > > reclaim to free pages in the same manner kswapd does. If a dirty page is > > > > > encountered during the scan, this page is written to backing storage using > > > > > mapping->writepage. This can result in very deep call stacks, particularly > > > > > if the target storage or filesystem are complex. It has already been observed > > > > > on XFS that the stack overflows but the problem is not XFS-specific. > > > > > > > > > > This patch prevents direct reclaim writing back filesystem pages by checking > > > > > if current is kswapd or the page is anonymous before writing back. If the > > > > > dirty pages cannot be written back, they are placed back on the LRU lists > > > > > for either background writing by the BDI threads or kswapd. If in direct > > > > > lumpy reclaim and dirty pages are encountered, the process will stall for > > > > > the background flusher before trying to reclaim the pages again. > > > > > > > > > > As the call-chain for writing anonymous pages is not expected to be deep > > > > > and they are not cleaned by flusher threads, anonymous pages are still > > > > > written back in direct reclaim. > > > > > > > > This is also a good step towards reducing pageout() calls. For better > > > > IO performance the flusher threads should take more work from pageout(). > > > > > > > > > > This is true for better IO performance all right but reclaim does require > > > specific pages cleaned. The strict requirement is when lumpy reclaim is > > > involved but a looser requirement is when any pages within a zone be cleaned. > > > > Good point, I missed the lumpy reclaim requirement. It seems necessary > > to add a call to the flusher thread to writeback a specific inode range > > (that contains the current dirty page). This is a more reliable way to > > ensure both the strict and looser requirements: the current dirty page > > will guaranteed to be synced, and the inode will have good opportunity > > to contain more dirty pages in the zone, which can be freed quickly if > > tagged PG_reclaim. > > > > I'm not sure about an inode range. The window being considered is quite small > and we might select ranges that are too small to be useful. However, taking We don't need to pass the range. We only pass the page offset, and let the flusher thread select the approach range that covers our target page. This guarantees the target page will be served. > the inodes into account makes sense. 
If wakeup_flusher_thread took a list > of unique inodes that own dirty pages encountered by reclaim, it would then > move inodes to the head of the queue rather than depending just on expired. The flusher thread may internally queue all older inodes than this one for IO. But sure it'd better serve the target inode first to avoid adding delays. > > > > > > <SNIP> > > > > > > > > > > @@ -1293,26 +1308,34 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > > > > > > > > > > spin_unlock_irq(&zone->lru_lock); > > > > > > > > > > - nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > > > > + nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC, > > > > > + &nr_dirty); > > > > > > > > > > /* > > > > > - * If we are direct reclaiming for contiguous pages and we do > > > > > + * If specific pages are needed such as with direct reclaiming > > > > > + * for contiguous pages or for memory containers and we do > > > > > * not reclaim everything in the list, try again and wait > > > > > - * for IO to complete. This will stall high-order allocations > > > > > - * but that should be acceptable to the caller > > > > > + * for IO to complete. This will stall callers that require > > > > > + * specific pages but it should be acceptable to the caller > > > > > */ > > > > > - if (nr_reclaimed < nr_taken && !current_is_kswapd() && > > > > > - sc->lumpy_reclaim_mode) { > > > > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > + if (sc->may_writepage && !current_is_kswapd() && > > > > > + (sc->lumpy_reclaim_mode || sc->mem_cgroup)) { > > > > > + int dirty_retry = MAX_SWAP_CLEAN_WAIT; > > > > > > > > > > - /* > > > > > - * The attempt at page out may have made some > > > > > - * of the pages active, mark them inactive again. > > > > > - */ > > > > > - nr_active = clear_active_flags(&page_list, NULL); > > > > > - count_vm_events(PGDEACTIVATE, nr_active); > > > > > + while (nr_reclaimed < nr_taken && nr_dirty && dirty_retry--) { > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > > > > > + congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > > > It needs good luck for the flusher threads to "happen to" sync the > > > > dirty pages in our page_list. > > > > > > That is why I'm expecting the "shrink oldest inode" patchset to help. It > > > still requires a certain amount of luck but callers that encounter dirty > > > pages will be delayed. > > > > > > It's also because a certain amount of luck is required that the last patch > > > in the series aims at reducing the number of dirty pages encountered by > > > reclaim. The closer that is to 0, the less important the timing of flusher > > > threads is. > > > > OK. > > > > > > I'd rather take the logic as "there are > > > > too many dirty pages, shrink them to avoid some future pageout() calls > > > > and/or congestion_wait() stalls". > > > > > > > > > > What do you mean by shrink them? They cannot be reclaimed until they are > > > clean. > > > > I mean we are freeing much more than nr_dirty pages. In this sense we > > are shrinking the number of dirty pages. Note that we are calling > > wakeup_flusher_threads(nr_dirty), however the real synced pages will > > be much more than nr_dirty, that is reasonable good behavior. > > > > Ok. > > > > > So the loop is likely to repeat MAX_SWAP_CLEAN_WAIT times. Let's remove it? > > > > > > > > > > This loop only applies to direct reclaimers in lumpy reclaim mode and > > > memory containers. Both need specific pages to be cleaned and freed. 
> > > Hence, the loop is to stall them and wait on flusher threads up to a > > > point. Otherwise they can cause a reclaim storm of clean pages that > > > can't be used. > > > > Agreed. We could call the flusher to sync the inode explicitly, as > > recommended above. This will clean and free (with PG_reclaim) the page > > in seconds. With reasonable waits here we may avoid reclaim storm > > effectively. > > > > I'll follow this suggestion as a new patch. > > > > Current tests have not indicated MAX_SWAP_CLEAN_WAIT is regularly reached > > > but I am inferring this from timing data rather than a direct measurement. > > > > > > > > - nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC); > > > > > + /* > > > > > + * The attempt at page out may have made some > > > > > + * of the pages active, mark them inactive again. > > > > > + */ > > > > > + nr_active = clear_active_flags(&page_list, NULL); > > > > > + count_vm_events(PGDEACTIVATE, nr_active); > > > > > + > > > > > + nr_reclaimed += shrink_page_list(&page_list, sc, > > > > > + PAGEOUT_IO_SYNC, &nr_dirty); > > > > > > > > This shrink_page_list() won't be called at all if nr_dirty==0 and > > > > pageout() was called. This is a change of behavior. It can also be > > > > fixed by removing the loop. > > > > > > > > > > The whole patch is a change of behaviour but in this case it also makes > > > sense to focus on just the dirty pages. The first shrink_page_list > > > decided that the pages could not be unmapped and reclaimed - probably > > > because it was referenced. This is not likely to change during the loop. > > > > Agreed. > > > > > Testing with a version of the patch that processed the full list added > > > significant stalls when sync writeback was involved. Testing time length > > > was tripled in one case implying that this loop was continually reaching > > > MAX_SWAP_CLEAN_WAIT. > > > > I'm OK with the change actually, this removes one not-that-user-friendly > > wait_on_page_writeback(). > > > > > The intention of this loop is "wait on dirty pages to be cleaned" and > > > it's a change of behaviour, but one that makes sense and testing > > > indicates it's a good idea. > > > > I mean, this loop may be unwinded. And we may need another loop to > > sync the inodes that contains the dirty pages. > > > > I'm not quite sure what you mean here but I think it might tie into the > idea of passing a list of inodes to the flusher threads. Lets see what > that ends up looking like. OK. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman ` (3 preceding siblings ...) 2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 18:27 ` Rik van Riel 2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman ` (2 subsequent siblings) 7 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman As only kswapd and memcg are writing back pages, there should be no danger of overflowing the stack. Allow the writing back of dirty pages in btrfs from the VM. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/btrfs/disk-io.c | 21 +-------------------- fs/btrfs/inode.c | 6 ------ 2 files changed, 1 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 34f7c37..e4aa547 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -696,26 +696,7 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc) int was_dirty; tree = &BTRFS_I(page->mapping->host)->io_tree; - if (!(current->flags & PF_MEMALLOC)) { - return extent_write_full_page(tree, page, - btree_get_extent, wbc); - } - - redirty_page_for_writepage(wbc, page); - eb = btrfs_find_tree_block(root, page_offset(page), - PAGE_CACHE_SIZE); - WARN_ON(!eb); - - was_dirty = test_and_set_bit(EXTENT_BUFFER_DIRTY, &eb->bflags); - if (!was_dirty) { - spin_lock(&root->fs_info->delalloc_lock); - root->fs_info->dirty_metadata_bytes += PAGE_CACHE_SIZE; - spin_unlock(&root->fs_info->delalloc_lock); - } - free_extent_buffer(eb); - - unlock_page(page); - return 0; + return extent_write_full_page(tree, page, btree_get_extent, wbc); } static int btree_writepages(struct address_space *mapping, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 1bff92a..5c0e604 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5859,12 +5859,6 @@ static int btrfs_writepage(struct page *page, struct writeback_control *wbc) { struct extent_io_tree *tree; - - if (current->flags & PF_MEMALLOC) { - redirty_page_for_writepage(wbc, page); - unlock_page(page); - return 0; - } tree = &BTRFS_I(page->mapping->host)->io_tree; return extent_write_full_page(tree, page, btrfs_get_extent, wbc); } -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages 2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman @ 2010-07-19 18:27 ` Rik van Riel 0 siblings, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 18:27 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > As only kswapd and memcg are writing back pages, there should be no > danger of overflowing the stack. Allow the writing back of dirty pages > in btrfs from the VM. > > Signed-off-by: Mel Gorman<mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman ` (4 preceding siblings ...) 2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 14:20 ` Christoph Hellwig 2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman 7 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman As only kswapd and memcg are writing back pages, there should be no danger of overflowing the stack. Allow the writing back of dirty pages in xfs from the VM. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/xfs/linux-2.6/xfs_aops.c | 15 --------------- 1 files changed, 0 insertions(+), 15 deletions(-) diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c index 34640d6..4c89db3 100644 --- a/fs/xfs/linux-2.6/xfs_aops.c +++ b/fs/xfs/linux-2.6/xfs_aops.c @@ -1333,21 +1333,6 @@ xfs_vm_writepage( trace_xfs_writepage(inode, page, 0); /* - * Refuse to write the page out if we are called from reclaim context. - * - * This is primarily to avoid stack overflows when called from deep - * used stacks in random callers for direct reclaim, but disabling - * reclaim for kswap is a nice side-effect as kswapd causes rather - * suboptimal I/O patters, too. - * - * This should really be done by the core VM, but until that happens - * filesystems like XFS, btrfs and ext4 have to take care of this - * by themselves. - */ - if (current->flags & PF_MEMALLOC) - goto out_fail; - - /* * We need a transaction if: * 1. There are delalloc buffers on the page * 2. The page is uptodate and we have unmapped buffers -- 1.7.1 ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages 2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman @ 2010-07-19 14:20 ` Christoph Hellwig 2010-07-19 14:43 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:20 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:28PM +0100, Mel Gorman wrote: > As only kswapd and memcg are writing back pages, there should be no > danger of overflowing the stack. Allow the writing back of dirty pages > in xfs from the VM. As pointed out during the discussion on one of your previous post memcg does pose a huge risk of stack overflows. In the XFS tree we've already relaxed the check to allow writeback from kswapd, and until the memcg situation we'll need to keep that check. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
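For reference, the relaxed check being described amounts to something like the following in the filesystem's ->writepage (a sketch of the pattern, not the exact XFS-tree code):

	/* Refuse writeback from direct reclaim, which is the stack-depth
	 * hazard, but still let kswapd clean pages. */
	if ((current->flags & PF_MEMALLOC) && !current_is_kswapd()) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}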
* Re: [PATCH 6/8] fs,xfs: Allow kswapd to writeback pages 2010-07-19 14:20 ` Christoph Hellwig @ 2010-07-19 14:43 ` Mel Gorman 0 siblings, 0 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 14:43 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:20:51AM -0400, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:28PM +0100, Mel Gorman wrote: > > As only kswapd and memcg are writing back pages, there should be no > > danger of overflowing the stack. Allow the writing back of dirty pages > > in xfs from the VM. > > As pointed out during the discussion on one of your previous post memcg > does pose a huge risk of stack overflows. I remember. This is partially to nudge the memcg people to see where they currently stand with alleviating the problem. > In the XFS tree we've already > relaxed the check to allow writeback from kswapd, and until the memcg > situation we'll need to keep that check. > If memcg remains a problem, I'll drop these two patches. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman ` (5 preceding siblings ...) 2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 14:21 ` Christoph Hellwig 2010-07-19 18:43 ` Rik van Riel 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman 7 siblings, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman From: Wu Fengguang <fengguang.wu@intel.com> A background flush work may run for ever. So it's reasonable for it to mimic the kupdate behavior of syncing old/expired inodes first. This behavior also makes sense from the perspective of page reclaim. File pages are added to the inactive list and promoted if referenced after one recycling. If not referenced, it's very easy for pages to be cleaned from reclaim context which is inefficient in terms of IO. If background flush is cleaning pages, it's best it cleans old pages to help minimise IO from reclaim. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/fs-writeback.c | 19 ++++++++++++++++--- 1 files changed, 16 insertions(+), 3 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d5be169..cc81c67 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -612,13 +612,14 @@ static long wb_writeback(struct bdi_writeback *wb, .range_cyclic = work->range_cyclic, }; unsigned long oldest_jif; + int expire_interval = msecs_to_jiffies(dirty_expire_interval * 10); + int fg_rounds = 0; long wrote = 0; struct inode *inode; - if (wbc.for_kupdate) { + if (wbc.for_kupdate || wbc.for_background) { wbc.older_than_this = &oldest_jif; - oldest_jif = jiffies - - msecs_to_jiffies(dirty_expire_interval * 10); + oldest_jif = jiffies - expire_interval; } if (!wbc.range_cyclic) { wbc.range_start = 0; @@ -649,6 +650,18 @@ static long wb_writeback(struct bdi_writeback *wb, work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write; wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write; + if (work->for_background && expire_interval && + ++fg_rounds && list_empty(&wb->b_io)) { + if (fg_rounds < 10) + expire_interval >>= 1; + if (expire_interval) + oldest_jif = jiffies - expire_interval; + else + wbc.older_than_this = 0; + fg_rounds = 0; + continue; + } + /* * If we consumed everything, see if we have more */ -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
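As an illustration of the intended effect (a standalone sketch, not kernel code; it assumes the default dirty_expire_interval of 30 seconds):

#include <stdio.h>

/* Background writeback starts by servicing inodes dirtied more than
 * dirty_expire_interval ago. Each time the per-round queue (wb->b_io) is
 * emptied, the hunk above halves that window so progressively younger
 * inodes become eligible, until the age filter is dropped altogether. */
int main(void)
{
	unsigned int expire_ms = 30 * 1000;	/* default dirty_expire_interval */
	int round = 0;

	while (expire_ms) {
		printf("round %2d: write inodes dirtied more than %u ms ago\n",
		       round++, expire_ms);
		expire_ms >>= 1;
	}
	printf("round %2d: no age filter, write any dirty inode\n", round);
	return 0;
}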
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman @ 2010-07-19 14:21 ` Christoph Hellwig 2010-07-19 14:40 ` Mel Gorman 2010-07-22 1:13 ` Wu Fengguang 2010-07-19 18:43 ` Rik van Riel 1 sibling, 2 replies; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:21 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote: > From: Wu Fengguang <fengguang.wu@intel.com> > > A background flush work may run for ever. So it's reasonable for it to > mimic the kupdate behavior of syncing old/expired inodes first. > > This behavior also makes sense from the perspective of page reclaim. > File pages are added to the inactive list and promoted if referenced > after one recycling. If not referenced, it's very easy for pages to be > cleaned from reclaim context which is inefficient in terms of IO. If > background flush is cleaning pages, it's best it cleans old pages to > help minimise IO from reclaim. Yes, we absolutely do this. Wu, do you have an improved version of the pending or should we put it in this version for now? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 14:21 ` Christoph Hellwig @ 2010-07-19 14:40 ` Mel Gorman 2010-07-19 14:48 ` Christoph Hellwig 2010-07-22 8:52 ` Wu Fengguang 2010-07-22 1:13 ` Wu Fengguang 1 sibling, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 14:40 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:21:45AM -0400, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote: > > From: Wu Fengguang <fengguang.wu@intel.com> > > > > A background flush work may run for ever. So it's reasonable for it to > > mimic the kupdate behavior of syncing old/expired inodes first. > > > > This behavior also makes sense from the perspective of page reclaim. > > File pages are added to the inactive list and promoted if referenced > > after one recycling. If not referenced, it's very easy for pages to be > > cleaned from reclaim context which is inefficient in terms of IO. If > > background flush is cleaning pages, it's best it cleans old pages to > > help minimise IO from reclaim. > > Yes, we absolutely do this. Do you mean we absolutely want to do this? > Wu, do you have an improved version of the > pending or should we put it in this version for now? > Some insight on how the other writeback changes that are being floated around might affect the number of dirty pages reclaim encounters would also be helpful. The tracepoints are there for people to figure it out but any help figuring it out is useful. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 14:40 ` Mel Gorman @ 2010-07-19 14:48 ` Christoph Hellwig 2010-07-22 8:52 ` Wu Fengguang 1 sibling, 0 replies; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:48 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 03:40:47PM +0100, Mel Gorman wrote: > > Yes, we absolutely do this. > > Do you mean we absolutely want to do this? Ermm yes, sorry. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 14:40 ` Mel Gorman 2010-07-19 14:48 ` Christoph Hellwig @ 2010-07-22 8:52 ` Wu Fengguang 2010-07-22 9:02 ` Wu Fengguang ` (2 more replies) 1 sibling, 3 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-22 8:52 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim > Some insight on how the other writeback changes that are being floated > around might affect the number of dirty pages reclaim encounters would also > be helpful. Here is an interesting related problem about the wait_on_page_writeback() call inside shrink_page_list(): http://lkml.org/lkml/2010/4/4/86 The problem is, wait_on_page_writeback() is called too early in the direct reclaim path, which blocks many random/unrelated processes when some slow (USB stick) writeback is on the way. A simple dd can easily create a big range of dirty pages in the LRU list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a typical desktop, which triggers the lumpy reclaim mode and hence wait_on_page_writeback(). I proposed this patch at the time, which was confirmed to solve the problem: --- linux-next.orig/mm/vmscan.c 2010-06-24 14:32:03.000000000 +0800 +++ linux-next/mm/vmscan.c 2010-07-22 16:12:34.000000000 +0800 @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p */ if (sc->order > PAGE_ALLOC_COSTLY_ORDER) sc->lumpy_reclaim_mode = 1; - else if (sc->order && priority < DEF_PRIORITY - 2) + else if (sc->order && priority < DEF_PRIORITY / 2) sc->lumpy_reclaim_mode = 1; else sc->lumpy_reclaim_mode = 0; However KOSAKI and Minchan raised concerns about raising the bar. I guess this new patch is more problem oriented and acceptable: --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis count_vm_events(PGDEACTIVATE, nr_active); nr_freed += shrink_page_list(&page_list, sc, - PAGEOUT_IO_SYNC); + priority < DEF_PRIORITY / 3 ? + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); } nr_reclaimed += nr_freed; Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 8:52 ` Wu Fengguang @ 2010-07-22 9:02 ` Wu Fengguang 2010-07-22 9:21 ` Wu Fengguang 2010-07-22 9:42 ` Mel Gorman 2 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-22 9:02 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim Sorry, please ignore this hack, it's non sense.. > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > count_vm_events(PGDEACTIVATE, nr_active); > > nr_freed += shrink_page_list(&page_list, sc, > - PAGEOUT_IO_SYNC); > + priority < DEF_PRIORITY / 3 ? > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > } > > nr_reclaimed += nr_freed; Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 8:52 ` Wu Fengguang 2010-07-22 9:02 ` Wu Fengguang @ 2010-07-22 9:21 ` Wu Fengguang 2010-07-22 10:48 ` Mel Gorman 2010-07-22 15:34 ` Minchan Kim 2010-07-22 9:42 ` Mel Gorman 2 siblings, 2 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-22 9:21 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim > I guess this new patch is more problem oriented and acceptable: > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > count_vm_events(PGDEACTIVATE, nr_active); > > nr_freed += shrink_page_list(&page_list, sc, > - PAGEOUT_IO_SYNC); > + priority < DEF_PRIORITY / 3 ? > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > } > > nr_reclaimed += nr_freed; This one looks better: --- vmscan: raise the bar to PAGEOUT_IO_SYNC stalls Fix "system goes totally unresponsive with many dirty/writeback pages" problem: http://lkml.org/lkml/2010/4/4/86 The root cause is, wait_on_page_writeback() is called too early in the direct reclaim path, which blocks many random/unrelated processes when some slow (USB stick) writeback is on the way. A simple dd can easily create a big range of dirty pages in the LRU list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a typical desktop, which triggers the lumpy reclaim mode and hence wait_on_page_writeback(). In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to the 22MB writeback and 190MB dirty pages. There can easily be a continuous range of 512KB dirty/writeback pages in the LRU, which will trigger the wait logic. To make it worse, when there are 50MB writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50 seconds. So only enter sync write&wait when priority goes below DEF_PRIORITY/3, or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait will hardly be triggered by pure dirty pages. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- mm/vmscan.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800 @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis * but that should be acceptable to the caller */ if (nr_freed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* ^ permalink raw reply [flat|nested] 87+ messages in thread
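For reference, the arithmetic behind "512MB/1024 = 512KB" above: a reclaim pass scans on the order of lru_size >> priority pages, so once priority has fallen to around DEF_PRIORITY - 2 (that is, 10 with DEF_PRIORITY == 12), a 512MB LRU list yields only about 512MB / 2^10 = 512KB of pages per pass. Against 190MB of dirty and 22MB of writeback pages, such a window can easily consist entirely of dirty or writeback pages, so every pass ends up waiting in wait_on_page_writeback().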
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 9:21 ` Wu Fengguang @ 2010-07-22 10:48 ` Mel Gorman 2010-07-23 9:45 ` Wu Fengguang 2010-07-22 15:34 ` Minchan Kim 1 sibling, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-22 10:48 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > I guess this new patch is more problem oriented and acceptable: > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > count_vm_events(PGDEACTIVATE, nr_active); > > > > nr_freed += shrink_page_list(&page_list, sc, > > - PAGEOUT_IO_SYNC); > > + priority < DEF_PRIORITY / 3 ? > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > } > > > > nr_reclaimed += nr_freed; > > This one looks better: > --- > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > Fix "system goes totally unresponsive with many dirty/writeback pages" > problem: > > http://lkml.org/lkml/2010/4/4/86 > > The root cause is, wait_on_page_writeback() is called too early in the > direct reclaim path, which blocks many random/unrelated processes when > some slow (USB stick) writeback is on the way. > So, what's the bet if lumpy reclaim is a factor that it's high-order-but-low-cost such as fork() that are getting caught by this since [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC] was introduced? That could manifest to the user as stalls creating new processes when under heavy IO. I would be surprised it would freeze the entire system but certainly any new work would feel very slow. > A simple dd can easily create a big range of dirty pages in the LRU > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > typical desktop, which triggers the lumpy reclaim mode and hence > wait_on_page_writeback(). > which triggers the lumpy reclaim mode for high-order allocations. lumpy reclaim mode is not something that is triggered just because priority is high. I think there is a second possibility for causing stalls as well that is unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may also result in stalls. If it is taking a long time to writeback dirty data, random processes could be getting stalled just because they happened to dirty data at the wrong time. This would be the case if the main dirtying process (e.g. dd) is not calling sync and dropping pages it's no longer using. > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > the 22MB writeback and 190MB dirty pages. There can easily be a > continuous range of 512KB dirty/writeback pages in the LRU, which will > trigger the wait logic. > > To make it worse, when there are 50MB writeback pages and USB 1.1 is > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50 > seconds. > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3, > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait > will hardly be triggered by pure dirty pages. 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > mm/vmscan.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800 > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis > * but that should be acceptable to the caller > */ > if (nr_freed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > This will also delay waiting on congestion for really high-order allocations such as huge pages, some video decoder and the like which really should be stalling. How about the following compile-tested diff? It takes the cost of the high-order allocation into account and the priority when deciding whether to synchronously wait or not. diff --git a/mm/vmscan.c b/mm/vmscan.c index 9c7e57c..d652e0c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file, } /* + * Returns true if the caller should stall on congestion and retry to clean + * the list of pages synchronously. + * + * If we are direct reclaiming for contiguous pages and we do not reclaim + * everything in the list, try again and wait for IO to complete. This + * will stall high-order allocations but that should be acceptable to + * the caller + */ +static inline bool should_reclaim_stall(unsigned long nr_taken, + unsigned long nr_freed, + int priority, + struct scan_control *sc) +{ + int lumpy_stall_priority; + + /* kswapd should not stall on sync IO */ + if (current_is_kswapd()) + return false; + + /* Only stall on lumpy reclaim */ + if (!sc->lumpy_reclaim_mode) + return false; + + /* If we have relaimed everything on the isolated list, no stall */ + if (nr_freed == nr_taken) + return false; + + /* + * For high-order allocations, there are two stall thresholds. + * High-cost allocations stall immediately where as lower + * order allocations such as stacks require the scanning + * priority to be much higher before stalling + */ + if (sc->order > PAGE_ALLOC_COSTLY_ORDER) + lumpy_stall_priority = DEF_PRIORITY; + else + lumpy_stall_priority = DEF_PRIORITY / 3; + + return priority <= lumpy_stall_priority; +} + +/* * shrink_inactive_list() is a helper for shrink_zone(). It returns the number * of reclaimed pages */ @@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, nr_scanned += nr_scan; nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); - /* - * If we are direct reclaiming for contiguous pages and we do - * not reclaim everything in the list, try again and wait - * for IO to complete. This will stall high-order allocations - * but that should be acceptable to the caller - */ - if (nr_freed < nr_taken && !current_is_kswapd() && - sc->lumpy_reclaim_mode) { + /* Check if we should syncronously wait for writeback */ + if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* ^ permalink raw reply related [flat|nested] 87+ messages in thread
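In effect, with DEF_PRIORITY == 12: allocations above PAGE_ALLOC_COSTLY_ORDER (order 3) may stall from the first reclaim pass, while order 1-3 requests such as fork() stacks only start stalling once priority has dropped to DEF_PRIORITY / 3 == 4, i.e. after a substantial amount of scanning has already failed to reclaim enough pages. kswapd never stalls here.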
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 10:48 ` Mel Gorman @ 2010-07-23 9:45 ` Wu Fengguang 2010-07-23 10:57 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-23 9:45 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote: > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > > I guess this new patch is more problem oriented and acceptable: > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > > count_vm_events(PGDEACTIVATE, nr_active); > > > > > > nr_freed += shrink_page_list(&page_list, sc, > > > - PAGEOUT_IO_SYNC); > > > + priority < DEF_PRIORITY / 3 ? > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > > } > > > > > > nr_reclaimed += nr_freed; > > > > This one looks better: > > --- > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > > > Fix "system goes totally unresponsive with many dirty/writeback pages" > > problem: > > > > http://lkml.org/lkml/2010/4/4/86 > > > > The root cause is, wait_on_page_writeback() is called too early in the > > direct reclaim path, which blocks many random/unrelated processes when > > some slow (USB stick) writeback is on the way. > > > > So, what's the bet if lumpy reclaim is a factor that it's > high-order-but-low-cost such as fork() that are getting caught by this since > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC] > was introduced? Sorry I'm a bit confused by your wording.. > That could manifest to the user as stalls creating new processes when under > heavy IO. I would be surprised it would freeze the entire system but certainly > any new work would feel very slow. > > > A simple dd can easily create a big range of dirty pages in the LRU > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > > typical desktop, which triggers the lumpy reclaim mode and hence > > wait_on_page_writeback(). > > > > which triggers the lumpy reclaim mode for high-order allocations. Exactly. Changelog updated. > lumpy reclaim mode is not something that is triggered just because priority > is high. Right. > I think there is a second possibility for causing stalls as well that is > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may > also result in stalls. If it is taking a long time to writeback dirty data, > random processes could be getting stalled just because they happened to dirty > data at the wrong time. This would be the case if the main dirtying process > (e.g. dd) is not calling sync and dropping pages it's no longer using. The dirty_limit throttling will slow down the dirty process to the writeback throughput. If a process is dirtying files on sda (HDD), it will be throttled at 80MB/s. If another process is dirtying files on sdb (USB 1.1), it will be throttled at 1MB/s. So dirty throttling will slow things down. However the slow down should be smooth (a series of 100ms stalls instead of a sudden 10s stall), and won't impact random processes (which does no read/write IO at all). 
> > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > > the 22MB writeback and 190MB dirty pages. There can easily be a > > continuous range of 512KB dirty/writeback pages in the LRU, which will > > trigger the wait logic. > > > > To make it worse, when there are 50MB writeback pages and USB 1.1 is > > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50 > > seconds. > > > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3, > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait > > will hardly be triggered by pure dirty pages. > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > --- > > mm/vmscan.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800 > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis > > * but that should be acceptable to the caller > > */ > > if (nr_freed < nr_taken && !current_is_kswapd() && > > - sc->lumpy_reclaim_mode) { > > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) { > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > This will also delay waiting on congestion for really high-order > allocations such as huge pages, some video decoder and the like which > really should be stalling. I absolutely agree that high order allocators should be somehow throttled. However given that one can easily create a large _continuous_ range of dirty LRU pages, let someone bumping all the way through the range sounds a bit cruel.. > How about the following compile-tested diff? > It takes the cost of the high-order allocation into account and the > priority when deciding whether to synchronously wait or not. Very nice patch. Thanks! Cheers, Fengguang > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 9c7e57c..d652e0c 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1110,6 +1110,48 @@ static int too_many_isolated(struct zone *zone, int file, > } > > /* > + * Returns true if the caller should stall on congestion and retry to clean > + * the list of pages synchronously. > + * > + * If we are direct reclaiming for contiguous pages and we do not reclaim > + * everything in the list, try again and wait for IO to complete. This > + * will stall high-order allocations but that should be acceptable to > + * the caller > + */ > +static inline bool should_reclaim_stall(unsigned long nr_taken, > + unsigned long nr_freed, > + int priority, > + struct scan_control *sc) > +{ > + int lumpy_stall_priority; > + > + /* kswapd should not stall on sync IO */ > + if (current_is_kswapd()) > + return false; > + > + /* Only stall on lumpy reclaim */ > + if (!sc->lumpy_reclaim_mode) > + return false; > + > + /* If we have relaimed everything on the isolated list, no stall */ > + if (nr_freed == nr_taken) > + return false; > + > + /* > + * For high-order allocations, there are two stall thresholds. > + * High-cost allocations stall immediately where as lower > + * order allocations such as stacks require the scanning > + * priority to be much higher before stalling > + */ > + if (sc->order > PAGE_ALLOC_COSTLY_ORDER) > + lumpy_stall_priority = DEF_PRIORITY; > + else > + lumpy_stall_priority = DEF_PRIORITY / 3; > + > + return priority <= lumpy_stall_priority; > +} > + > +/* > * shrink_inactive_list() is a helper for shrink_zone(). 
It returns the number > * of reclaimed pages > */ > @@ -1199,14 +1241,8 @@ static unsigned long shrink_inactive_list(unsigned long max_scan, > nr_scanned += nr_scan; > nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC); > > - /* > - * If we are direct reclaiming for contiguous pages and we do > - * not reclaim everything in the list, try again and wait > - * for IO to complete. This will stall high-order allocations > - * but that should be acceptable to the caller > - */ > - if (nr_freed < nr_taken && !current_is_kswapd() && > - sc->lumpy_reclaim_mode) { > + /* Check if we should syncronously wait for writeback */ > + if (should_reclaim_stall(nr_taken, nr_freed, priority, sc)) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* > > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-23 9:45 ` Wu Fengguang @ 2010-07-23 10:57 ` Mel Gorman 2010-07-23 11:49 ` Wu Fengguang 2010-07-25 10:43 ` KOSAKI Motohiro 0 siblings, 2 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-23 10:57 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote: > On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote: > > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > > > I guess this new patch is more problem oriented and acceptable: > > > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > > > count_vm_events(PGDEACTIVATE, nr_active); > > > > > > > > nr_freed += shrink_page_list(&page_list, sc, > > > > - PAGEOUT_IO_SYNC); > > > > + priority < DEF_PRIORITY / 3 ? > > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > > > } > > > > > > > > nr_reclaimed += nr_freed; > > > > > > This one looks better: > > > --- > > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > > > > > Fix "system goes totally unresponsive with many dirty/writeback pages" > > > problem: > > > > > > http://lkml.org/lkml/2010/4/4/86 > > > > > > The root cause is, wait_on_page_writeback() is called too early in the > > > direct reclaim path, which blocks many random/unrelated processes when > > > some slow (USB stick) writeback is on the way. > > > > > > > So, what's the bet if lumpy reclaim is a factor that it's > > high-order-but-low-cost such as fork() that are getting caught by this since > > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC] > > was introduced? > > Sorry I'm a bit confused by your wording.. > After reading the thread, I realised that fork() stalling could be a factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for high-order allocations such as those used by fork(). It might have been an oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily. > > That could manifest to the user as stalls creating new processes when under > > heavy IO. I would be surprised it would freeze the entire system but certainly > > any new work would feel very slow. > > > > > A simple dd can easily create a big range of dirty pages in the LRU > > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > > > typical desktop, which triggers the lumpy reclaim mode and hence > > > wait_on_page_writeback(). > > > > > > > which triggers the lumpy reclaim mode for high-order allocations. > > Exactly. Changelog updated. > > > lumpy reclaim mode is not something that is triggered just because priority > > is high. > > Right. > > > I think there is a second possibility for causing stalls as well that is > > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may > > also result in stalls. If it is taking a long time to writeback dirty data, > > random processes could be getting stalled just because they happened to dirty > > data at the wrong time. This would be the case if the main dirtying process > > (e.g. dd) is not calling sync and dropping pages it's no longer using. 
> > The dirty_limit throttling will slow down the dirty process to the > writeback throughput. If a process is dirtying files on sda (HDD), > it will be throttled at 80MB/s. If another process is dirtying files > on sdb (USB 1.1), it will be throttled at 1MB/s. > It will slow down the dirty process doing the dd, but can it also slow down other processes that just happened to dirty pages at the wrong time. > So dirty throttling will slow things down. However the slow down > should be smooth (a series of 100ms stalls instead of a sudden 10s > stall), and won't impact random processes (which does no read/write IO > at all). > Ok. > > > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > > > the 22MB writeback and 190MB dirty pages. There can easily be a > > > continuous range of 512KB dirty/writeback pages in the LRU, which will > > > trigger the wait logic. > > > > > > To make it worse, when there are 50MB writeback pages and USB 1.1 is > > > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50 > > > seconds. > > > > > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3, > > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait > > > will hardly be triggered by pure dirty pages. > > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > > --- > > > mm/vmscan.c | 4 ++-- > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800 > > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis > > > * but that should be acceptable to the caller > > > */ > > > if (nr_freed < nr_taken && !current_is_kswapd() && > > > - sc->lumpy_reclaim_mode) { > > > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) { > > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > > This will also delay waiting on congestion for really high-order > > allocations such as huge pages, some video decoder and the like which > > really should be stalling. > > I absolutely agree that high order allocators should be somehow throttled. > > However given that one can easily create a large _continuous_ range of > dirty LRU pages, let someone bumping all the way through the range > sounds a bit cruel.. > > > How about the following compile-tested diff? > > It takes the cost of the high-order allocation into account and the > > priority when deciding whether to synchronously wait or not. > > Very nice patch. Thanks! > Will you be picking it up or should I? The changelog should be more or less the same as yours and consider it Signed-off-by: Mel Gorman <mel@csn.ul.ie> It'd be nice if the original tester is still knocking around and willing to confirm the patch resolves his/her problem. I am running this patch on my desktop at the moment and it does feel a little smoother but it might be my imagination. I had trouble with odd stalls that I never pinned down and was attributing to the machine being commonly heavily loaded but I haven't noticed them today. It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC] Thanks > <SNIP> -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-23 10:57 ` Mel Gorman @ 2010-07-23 11:49 ` Wu Fengguang 2010-07-23 12:20 ` Wu Fengguang 2010-07-25 10:43 ` KOSAKI Motohiro 1 sibling, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-23 11:49 UTC (permalink / raw) To: Mel Gorman Cc: Andreas Mohr, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim On Fri, Jul 23, 2010 at 06:57:19PM +0800, Mel Gorman wrote: > On Fri, Jul 23, 2010 at 05:45:15PM +0800, Wu Fengguang wrote: > > On Thu, Jul 22, 2010 at 06:48:23PM +0800, Mel Gorman wrote: > > > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > > > > I guess this new patch is more problem oriented and acceptable: > > > > > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > > > > count_vm_events(PGDEACTIVATE, nr_active); > > > > > > > > > > nr_freed += shrink_page_list(&page_list, sc, > > > > > - PAGEOUT_IO_SYNC); > > > > > + priority < DEF_PRIORITY / 3 ? > > > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > > > > } > > > > > > > > > > nr_reclaimed += nr_freed; > > > > > > > > This one looks better: > > > > --- > > > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > > > > > > > Fix "system goes totally unresponsive with many dirty/writeback pages" > > > > problem: > > > > > > > > http://lkml.org/lkml/2010/4/4/86 > > > > > > > > The root cause is, wait_on_page_writeback() is called too early in the > > > > direct reclaim path, which blocks many random/unrelated processes when > > > > some slow (USB stick) writeback is on the way. > > > > > > > > > > So, what's the bet if lumpy reclaim is a factor that it's > > > high-order-but-low-cost such as fork() that are getting caught by this since > > > [78dc583d: vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC] > > > was introduced? > > > > Sorry I'm a bit confused by your wording.. > > > > After reading the thread, I realised that fork() stalling could be a > factor. That commit allows lumpy reclaim and PAGEOUT_IO_SYNC to be used for > high-order allocations such as those used by fork(). It might have been an > oversight to allow order-1 to use PAGEOUT_IO_SYNC too easily. That reads much clear. Thanks! I have the same feeling, hence the proposed patch. > > > That could manifest to the user as stalls creating new processes when under > > > heavy IO. I would be surprised it would freeze the entire system but certainly > > > any new work would feel very slow. > > > > > > > A simple dd can easily create a big range of dirty pages in the LRU > > > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > > > > typical desktop, which triggers the lumpy reclaim mode and hence > > > > wait_on_page_writeback(). > > > > > > > > > > which triggers the lumpy reclaim mode for high-order allocations. > > > > Exactly. Changelog updated. > > > > > lumpy reclaim mode is not something that is triggered just because priority > > > is high. > > > > Right. > > > > > I think there is a second possibility for causing stalls as well that is > > > unrelated to lumpy reclaim. Once dirty_limit is reached, new page faults may > > > also result in stalls. 
If it is taking a long time to writeback dirty data, > > > random processes could be getting stalled just because they happened to dirty > > > data at the wrong time. This would be the case if the main dirtying process > > > (e.g. dd) is not calling sync and dropping pages it's no longer using. > > > > The dirty_limit throttling will slow down the dirty process to the > > writeback throughput. If a process is dirtying files on sda (HDD), > > it will be throttled at 80MB/s. If another process is dirtying files > > on sdb (USB 1.1), it will be throttled at 1MB/s. > > > > It will slow down the dirty process doing the dd, but can it also slow > down other processes that just happened to dirty pages at the wrong > time. For the case of of a heavy dirtier (dd) and concurrent light dirtiers (some random processes), the light dirtiers won't be easily throttled. task_dirty_limit() handles that case well. It will give light dirtiers higher threshold than heavy dirtiers so that only the latter will be dirty throttled. > > So dirty throttling will slow things down. However the slow down > > should be smooth (a series of 100ms stalls instead of a sudden 10s > > stall), and won't impact random processes (which does no read/write IO > > at all). > > > > Ok. > > > > > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > > > > the 22MB writeback and 190MB dirty pages. There can easily be a > > > > continuous range of 512KB dirty/writeback pages in the LRU, which will > > > > trigger the wait logic. > > > > > > > > To make it worse, when there are 50MB writeback pages and USB 1.1 is > > > > writing them in 1MB/s, wait_on_page_writeback() may stuck for up to 50 > > > > seconds. > > > > > > > > So only enter sync write&wait when priority goes below DEF_PRIORITY/3, > > > > or 6.25% LRU. As the default dirty throttle ratio is 20%, sync write&wait > > > > will hardly be triggered by pure dirty pages. > > > > > > > > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > > > > --- > > > > mm/vmscan.c | 4 ++-- > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > > +++ linux-next/mm/vmscan.c 2010-07-22 17:03:47.000000000 +0800 > > > > @@ -1206,7 +1206,7 @@ static unsigned long shrink_inactive_lis > > > > * but that should be acceptable to the caller > > > > */ > > > > if (nr_freed < nr_taken && !current_is_kswapd() && > > > > - sc->lumpy_reclaim_mode) { > > > > + sc->lumpy_reclaim_mode && priority < DEF_PRIORITY / 3) { > > > > congestion_wait(BLK_RW_ASYNC, HZ/10); > > > > > > > > > > This will also delay waiting on congestion for really high-order > > > allocations such as huge pages, some video decoder and the like which > > > really should be stalling. > > > > I absolutely agree that high order allocators should be somehow throttled. > > However given that one can easily create a large _continuous_ range of > > dirty LRU pages, let someone bumping all the way through the range > > sounds a bit cruel.. Hmm. If such large range of dirty pages are approaching the end of LRU, it means the LRU lists are being scanned pretty fast, indicating a busy system and/or high memory pressure. So it seems reasonable to act cruel to really high order allocators -- they won't perform well under memory pressure after all, and only make things worse. > > > How about the following compile-tested diff? 
> > > It takes the cost of the high-order allocation into account and the > > > priority when deciding whether to synchronously wait or not. > > > > Very nice patch. Thanks! > > > > Will you be picking it up or should I? The changelog should be more or less > the same as yours and consider it > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> Thanks. I'll post the patch. > It'd be nice if the original tester is still knocking around and willing > to confirm the patch resolves his/her problem. I am running this patch on > my desktop at the moment and it does feel a little smoother but it might be > my imagination. I had trouble with odd stalls that I never pinned down and > was attributing to the machine being commonly heavily loaded but I haven't > noticed them today. Great. Just added CC to Andreas Mohr. > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > should use PAGEOUT_IO_SYNC] And Minchan, he has been following this issue too :) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-23 11:49 ` Wu Fengguang @ 2010-07-23 12:20 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-23 12:20 UTC (permalink / raw) To: Mel Gorman Cc: Peter Zijlstra, Andreas Mohr, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim > For the case of of a heavy dirtier (dd) and concurrent light dirtiers > (some random processes), the light dirtiers won't be easily throttled. > task_dirty_limit() handles that case well. It will give light dirtiers > higher threshold than heavy dirtiers so that only the latter will be > dirty throttled. The caveat is, the real dirty throttling threshold is not exactly the value specified by vm.dirty_ratio or vm.dirty_bytes. Instead it's some value slightly lower than it. That real value differs for each process, which is a nice trick to throttle heavy dirtiers first. If I remember it right, that's invented by Peter and Andrew. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
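The per-process threshold Fengguang mentions is what keeps the light dirtiers out of trouble in the scenario above. As a rough userspace model of the idea only (the real code lives in mm/page-writeback.c and uses a floating proportion of each task's recent dirtying; the 1/8 scale and the shares below are illustrative assumptions): each task sees the global limit reduced by up to a fraction of it in proportion to that task's share of recent dirtying, so the heavy dd hits its limit slightly before everyone else does.

#include <stdio.h>

/*
 * Toy model: effective_limit = global_limit - (global_limit / 8) * task_share,
 * where task_share is the task's fraction of recently dirtied pages.
 * Illustrative only; not the kernel's exact arithmetic.
 */
static unsigned long task_dirty_limit(unsigned long global_limit, double task_share)
{
	return global_limit - (unsigned long)((global_limit / 8) * task_share);
}

int main(void)
{
	unsigned long limit = 26214;	/* ~20% of 512MB, in 4KB pages */

	printf("heavy dirtier (95%% of recent dirtying): limit %lu pages\n",
	       task_dirty_limit(limit, 0.95));
	printf("light dirtier ( 2%% of recent dirtying): limit %lu pages\n",
	       task_dirty_limit(limit, 0.02));
	return 0;
}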
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-23 10:57 ` Mel Gorman 2010-07-23 11:49 ` Wu Fengguang @ 2010-07-25 10:43 ` KOSAKI Motohiro 2010-07-25 12:03 ` Minchan Kim 2010-07-26 3:08 ` Wu Fengguang 1 sibling, 2 replies; 87+ messages in thread From: KOSAKI Motohiro @ 2010-07-25 10:43 UTC (permalink / raw) To: Mel Gorman Cc: kosaki.motohiro, Wu Fengguang, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli, Minchan Kim Hi sorry for the delay. > Will you be picking it up or should I? The changelog should be more or less > the same as yours and consider it > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > It'd be nice if the original tester is still knocking around and willing > to confirm the patch resolves his/her problem. I am running this patch on > my desktop at the moment and it does feel a little smoother but it might be > my imagination. I had trouble with odd stalls that I never pinned down and > was attributing to the machine being commonly heavily loaded but I haven't > noticed them today. > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > should use PAGEOUT_IO_SYNC] My reviewing doesn't found any bug. however I think original thread have too many guess and we need to know reproduce way and confirm it. At least, we need three confirms. o original issue is still there? o DEF_PRIORITY/3 is best value? o Current approach have better performance than Wu's original proposal? (below) Anyway, please feel free to use my reviewed-by tag. Thanks. --- linux-next.orig/mm/vmscan.c 2010-06-24 14:32:03.000000000 +0800 +++ linux-next/mm/vmscan.c 2010-07-22 16:12:34.000000000 +0800 @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p */ if (sc->order > PAGE_ALLOC_COSTLY_ORDER) sc->lumpy_reclaim_mode = 1; - else if (sc->order && priority < DEF_PRIORITY - 2) + else if (sc->order && priority < DEF_PRIORITY / 2) sc->lumpy_reclaim_mode = 1; else sc->lumpy_reclaim_mode = 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-25 10:43 ` KOSAKI Motohiro @ 2010-07-25 12:03 ` Minchan Kim 2010-07-26 3:27 ` Wu Fengguang 2010-07-26 3:08 ` Wu Fengguang 1 sibling, 1 reply; 87+ messages in thread From: Minchan Kim @ 2010-07-25 12:03 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Mel Gorman, Wu Fengguang, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: > Hi > > sorry for the delay. > > > Will you be picking it up or should I? The changelog should be more or less > > the same as yours and consider it > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > It'd be nice if the original tester is still knocking around and willing > > to confirm the patch resolves his/her problem. I am running this patch on > > my desktop at the moment and it does feel a little smoother but it might be > > my imagination. I had trouble with odd stalls that I never pinned down and > > was attributing to the machine being commonly heavily loaded but I haven't > > noticed them today. > > > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > > should use PAGEOUT_IO_SYNC] > > My reviewing doesn't found any bug. however I think original thread have too many guess > and we need to know reproduce way and confirm it. > > At least, we need three confirms. > o original issue is still there? > o DEF_PRIORITY/3 is best value? I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? I guess system has 512M and 22M writeback pages. So you may determine it for skipping max 32M writeback pages. Is right? And I have a question of your below comment. "As the default dirty throttle ratio is 20%, sync write&wait will hardly be triggered by pure dirty pages" I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be related to dirty_ratio. It always can be changed by admin. Then do we have to determine magic value(DEF_PRIORITY/3) proportional to dirty_ratio? -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-25 12:03 ` Minchan Kim @ 2010-07-26 3:27 ` Wu Fengguang 2010-07-26 4:11 ` Minchan Kim 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 3:27 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote: > On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: > > Hi > > > > sorry for the delay. > > > > > Will you be picking it up or should I? The changelog should be more or less > > > the same as yours and consider it > > > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > > > > It'd be nice if the original tester is still knocking around and willing > > > to confirm the patch resolves his/her problem. I am running this patch on > > > my desktop at the moment and it does feel a little smoother but it might be > > > my imagination. I had trouble with odd stalls that I never pinned down and > > > was attributing to the machine being commonly heavily loaded but I haven't > > > noticed them today. > > > > > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > > > should use PAGEOUT_IO_SYNC] > > > > My reviewing doesn't found any bug. however I think original thread have too many guess > > and we need to know reproduce way and confirm it. > > > > At least, we need three confirms. > > o original issue is still there? > > o DEF_PRIORITY/3 is best value? > > I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? > I guess system has 512M and 22M writeback pages. > So you may determine it for skipping max 32M writeback pages. > Is right? For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages. Because shrink_inactive_list() first calls shrink_page_list(PAGEOUT_IO_ASYNC) then optionally shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be converted to writeback pages and then optionally be waited on. The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks a reasonable value. > And I have a question of your below comment. > > "As the default dirty throttle ratio is 20%, sync write&wait > will hardly be triggered by pure dirty pages" > > I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be > related to dirty_ratio. It always can be changed by admin. > Then do we have to determine magic value(DEF_PRIORITY/3) proportional to dirty_ratio? Yes DEF_PRIORITY/3 is already proportional to the _default_ dirty_ratio. We could do explicit comparison with dirty_ratio just in case dirty_ratio get changed by user. It's mainly a question of whether deserving to add such overheads and complexity. I'd prefer to keep the current simple form :) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
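For anyone following the arithmetic in this subthread: DEF_PRIORITY is 12 and the scan window at priority p is roughly the LRU size shifted right by p, so each proposed bar corresponds to a fixed fraction of the LRU that must already have been scanned without enough progress. The short program below (LRU sizes are assumed for illustration) reproduces the numbers quoted here: 512KB at priority DEF_PRIORITY - 2 on a 512MB desktop, 32MB at DEF_PRIORITY / 3, and the 1TB case that comes up later in the thread.

#include <stdio.h>

#define DEF_PRIORITY 12

int main(void)
{
	/* scan window at priority p is roughly lru_size >> p */
	const struct { const char *name; int priority; } bars[] = {
		{ "DEF_PRIORITY - 2", DEF_PRIORITY - 2 },	/* 1/1024 */
		{ "DEF_PRIORITY / 2", DEF_PRIORITY / 2 },	/* 1/64 */
		{ "DEF_PRIORITY / 3", DEF_PRIORITY / 3 },	/* 1/16 = 6.25% */
	};
	/* example LRU sizes in KB: 512MB desktop, 4GB, 1TB server */
	const unsigned long lru_kb[] = { 512UL << 10, 4UL << 20, 1UL << 30 };

	for (unsigned i = 0; i < sizeof(bars) / sizeof(bars[0]); i++) {
		printf("%s (priority %2d): 1/%lu of the LRU\n", bars[i].name,
		       bars[i].priority, 1UL << bars[i].priority);
		for (unsigned j = 0; j < sizeof(lru_kb) / sizeof(lru_kb[0]); j++)
			printf("  LRU %10lu KB -> scan window %10lu KB\n",
			       lru_kb[j], lru_kb[j] >> bars[i].priority);
	}
	return 0;
}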
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 3:27 ` Wu Fengguang @ 2010-07-26 4:11 ` Minchan Kim 2010-07-26 4:37 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Minchan Kim @ 2010-07-26 4:11 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote: >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: >> > Hi >> > >> > sorry for the delay. >> > >> > > Will you be picking it up or should I? The changelog should be more or less >> > > the same as yours and consider it >> > > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> >> > > >> > > It'd be nice if the original tester is still knocking around and willing >> > > to confirm the patch resolves his/her problem. I am running this patch on >> > > my desktop at the moment and it does feel a little smoother but it might be >> > > my imagination. I had trouble with odd stalls that I never pinned down and >> > > was attributing to the machine being commonly heavily loaded but I haven't >> > > noticed them today. >> > > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also >> > > should use PAGEOUT_IO_SYNC] >> > >> > My reviewing doesn't found any bug. however I think original thread have too many guess >> > and we need to know reproduce way and confirm it. >> > >> > At least, we need three confirms. >> > o original issue is still there? >> > o DEF_PRIORITY/3 is best value? >> >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? >> I guess system has 512M and 22M writeback pages. >> So you may determine it for skipping max 32M writeback pages. >> Is right? > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages. > Because shrink_inactive_list() first calls > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be > converted to writeback pages and then optionally be waited on. > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks > a reasonable value. Why do you think it's a reasonable value? I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%? I am not against you. Just out of curiosity and requires more explanation. It might be thing _only I_ don't know. :( > >> And I have a question of your below comment. >> >> "As the default dirty throttle ratio is 20%, sync write&wait >> will hardly be triggered by pure dirty pages" >> >> I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be >> related to dirty_ratio. It always can be changed by admin. >> Then do we have to determine magic value(DEF_PRIORITY/3) proportional to dirty_ratio? > > Yes DEF_PRIORITY/3 is already proportional to the _default_ > dirty_ratio. We could do explicit comparison with dirty_ratio > just in case dirty_ratio get changed by user. It's mainly a question > of whether deserving to add such overheads and complexity. I'd prefer > to keep the current simple form :) What I suggest is that couldn't we use recent_writeback/recent_scanned ratio? 
I think scan_control's new filed and counting wouldn't be a big overhead and complexity. I am not sure which ratio is best. but at least, it would make the logic scalable and sense to me. :) > > Thanks, > Fengguang > -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
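Nothing in this thread implements Minchan's recent_writeback/recent_scanned suggestion, but purely as a hypothetical sketch of the shape it could take: scan_control would accumulate how many of the scanned pages were found dirty or under writeback, and the stall decision would key off that ratio instead of a fixed priority bar. The field names and the 12.5% trigger below are invented for illustration.

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical fields; not part of the kernel's struct scan_control. */
struct scan_control_sketch {
	unsigned long nr_scanned;
	unsigned long nr_writeback;	/* scanned pages found dirty/writeback */
};

/*
 * Sketch of the idea: stall only when a noticeable share of what reclaim
 * recently scanned was dirty/writeback, rather than below DEF_PRIORITY / 3.
 */
static bool writeback_pressure_high(const struct scan_control_sketch *sc)
{
	if (sc->nr_scanned < 64)			/* too few samples to judge */
		return false;
	return sc->nr_writeback * 8 >= sc->nr_scanned;	/* >= 12.5% */
}

int main(void)
{
	struct scan_control_sketch light = { .nr_scanned = 1024, .nr_writeback = 40 };
	struct scan_control_sketch heavy = { .nr_scanned = 1024, .nr_writeback = 300 };

	printf("light writeback pressure -> stall? %d\n", writeback_pressure_high(&light));
	printf("heavy writeback pressure -> stall? %d\n", writeback_pressure_high(&heavy));
	return 0;
}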
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 4:11 ` Minchan Kim @ 2010-07-26 4:37 ` Wu Fengguang 2010-07-26 16:30 ` Minchan Kim 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 4:37 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote: > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote: > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: > >> > Hi > >> > > >> > sorry for the delay. > >> > > >> > > Will you be picking it up or should I? The changelog should be more or less > >> > > the same as yours and consider it > >> > > > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > >> > > > >> > > It'd be nice if the original tester is still knocking around and willing > >> > > to confirm the patch resolves his/her problem. I am running this patch on > >> > > my desktop at the moment and it does feel a little smoother but it might be > >> > > my imagination. I had trouble with odd stalls that I never pinned down and > >> > > was attributing to the machine being commonly heavily loaded but I haven't > >> > > noticed them today. > >> > > > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > >> > > should use PAGEOUT_IO_SYNC] > >> > > >> > My reviewing doesn't found any bug. however I think original thread have too many guess > >> > and we need to know reproduce way and confirm it. > >> > > >> > At least, we need three confirms. > >> > o original issue is still there? > >> > o DEF_PRIORITY/3 is best value? > >> > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? > >> I guess system has 512M and 22M writeback pages. > >> So you may determine it for skipping max 32M writeback pages. > >> Is right? > > > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages. > > Because shrink_inactive_list() first calls > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be > > converted to writeback pages and then optionally be waited on. > > > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks > > a reasonable value. > > Why do you think it's a reasonable value? > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%? > I am not against you. Just out of curiosity and requires more explanation. > It might be thing _only I_ don't know. :( It's more or less random selected. I'm also OK with 3.125%. It's an threshold to turn on some _last resort_ mechanism, so don't need to be optimal.. > > > >> And I have a question of your below comment. > >> > >> "As the default dirty throttle ratio is 20%, sync write&wait > >> will hardly be triggered by pure dirty pages" > >> > >> I am not sure exactly what you mean but at least DEF_PRIOIRTY/3 seems to be > >> related to dirty_ratio. It always can be changed by admin. > >> Then do we have to determine magic value(DEF_PRIORITY/3) proportional to dirty_ratio? > > > > Yes DEF_PRIORITY/3 is already proportional to the _default_ > > dirty_ratio. 
We could do explicit comparison with dirty_ratio > > just in case dirty_ratio get changed by user. It's mainly a question > > of whether deserving to add such overheads and complexity. I'd prefer > > to keep the current simple form :) > > What I suggest is that couldn't we use recent_writeback/recent_scanned ratio? > I think scan_control's new filed and counting wouldn't be a big > overhead and complexity. > I am not sure which ratio is best. but at least, it would make the > logic scalable and sense to me. :) ..and don't need to be elaborated :) Thanks, Fengguang -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 4:37 ` Wu Fengguang @ 2010-07-26 16:30 ` Minchan Kim 2010-07-26 22:48 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Minchan Kim @ 2010-07-26 16:30 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote: > On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote: > > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote: > > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: > > >> > Hi > > >> > > > >> > sorry for the delay. > > >> > > > >> > > Will you be picking it up or should I? The changelog should be more or less > > >> > > the same as yours and consider it > > >> > > > > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > >> > > > > >> > > It'd be nice if the original tester is still knocking around and willing > > >> > > to confirm the patch resolves his/her problem. I am running this patch on > > >> > > my desktop at the moment and it does feel a little smoother but it might be > > >> > > my imagination. I had trouble with odd stalls that I never pinned down and > > >> > > was attributing to the machine being commonly heavily loaded but I haven't > > >> > > noticed them today. > > >> > > > > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > > >> > > should use PAGEOUT_IO_SYNC] > > >> > > > >> > My reviewing doesn't found any bug. however I think original thread have too many guess > > >> > and we need to know reproduce way and confirm it. > > >> > > > >> > At least, we need three confirms. > > >> > o original issue is still there? > > >> > o DEF_PRIORITY/3 is best value? > > >> > > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? > > >> I guess system has 512M and 22M writeback pages. > > >> So you may determine it for skipping max 32M writeback pages. > > >> Is right? > > > > > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages. > > > Because shrink_inactive_list() first calls > > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally > > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be > > > converted to writeback pages and then optionally be waited on. > > > > > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks > > > a reasonable value. > > > > Why do you think it's a reasonable value? > > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%? > > I am not against you. Just out of curiosity and requires more explanation. > > It might be thing _only I_ don't know. :( > > It's more or less random selected. I'm also OK with 3.125%. It's an > threshold to turn on some _last resort_ mechanism, so don't need to be > optimal.. Okay. Why I had a question is that I don't want to add new magic value in VM without detailed comment. While I review the source code, I always suffer form it. :( Now we have a great tool called 'git'. Please write down why we select that number detaily when we add new magic value. :) Thanks, Wu. 
-- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 16:30 ` Minchan Kim @ 2010-07-26 22:48 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 22:48 UTC (permalink / raw) To: Minchan Kim Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 12:30:11AM +0800, Minchan Kim wrote: > On Mon, Jul 26, 2010 at 12:37:09PM +0800, Wu Fengguang wrote: > > On Mon, Jul 26, 2010 at 12:11:59PM +0800, Minchan Kim wrote: > > > On Mon, Jul 26, 2010 at 12:27 PM, Wu Fengguang <fengguang.wu@intel.com> wrote: > > > > On Sun, Jul 25, 2010 at 08:03:45PM +0800, Minchan Kim wrote: > > > >> On Sun, Jul 25, 2010 at 07:43:20PM +0900, KOSAKI Motohiro wrote: > > > >> > Hi > > > >> > > > > >> > sorry for the delay. > > > >> > > > > >> > > Will you be picking it up or should I? The changelog should be more or less > > > >> > > the same as yours and consider it > > > >> > > > > > >> > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > > >> > > > > > >> > > It'd be nice if the original tester is still knocking around and willing > > > >> > > to confirm the patch resolves his/her problem. I am running this patch on > > > >> > > my desktop at the moment and it does feel a little smoother but it might be > > > >> > > my imagination. I had trouble with odd stalls that I never pinned down and > > > >> > > was attributing to the machine being commonly heavily loaded but I haven't > > > >> > > noticed them today. > > > >> > > > > > >> > > It also needs an Acked-by or Reviewed-by from Kosaki Motohiro as it alters > > > >> > > logic he introduced in commit [78dc583: vmscan: low order lumpy reclaim also > > > >> > > should use PAGEOUT_IO_SYNC] > > > >> > > > > >> > My reviewing doesn't found any bug. however I think original thread have too many guess > > > >> > and we need to know reproduce way and confirm it. > > > >> > > > > >> > At least, we need three confirms. > > > >> > o original issue is still there? > > > >> > o DEF_PRIORITY/3 is best value? > > > >> > > > >> I agree. Wu, how do you determine DEF_PRIORITY/3 of LRU? > > > >> I guess system has 512M and 22M writeback pages. > > > >> So you may determine it for skipping max 32M writeback pages. > > > >> Is right? > > > > > > > > For 512M mem, DEF_PRIORITY/3 means 32M dirty _or_ writeback pages. > > > > Because shrink_inactive_list() first calls > > > > shrink_page_list(PAGEOUT_IO_ASYNC) then optionally > > > > shrink_page_list(PAGEOUT_IO_SYNC), so dirty pages will first be > > > > converted to writeback pages and then optionally be waited on. > > > > > > > > The dirty/writeback pages may go up to 512M*20% = 100M. So 32M looks > > > > a reasonable value. > > > > > > Why do you think it's a reasonable value? > > > I mean why isn't it good 12.5% or 3.125%? Why do you select 6.25%? > > > I am not against you. Just out of curiosity and requires more explanation. > > > It might be thing _only I_ don't know. :( > > > > It's more or less random selected. I'm also OK with 3.125%. It's an > > threshold to turn on some _last resort_ mechanism, so don't need to be > > optimal.. > > Okay. Why I had a question is that I don't want to add new magic value in > VM without detailed comment. > While I review the source code, I always suffer form it. :( > Now we have a great tool called 'git'. 
> Please write down why we select that number detaily when we add new > magic value. :) Good point. I'll do that :) Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-25 10:43 ` KOSAKI Motohiro 2010-07-25 12:03 ` Minchan Kim @ 2010-07-26 3:08 ` Wu Fengguang 2010-07-26 3:11 ` Rik van Riel 1 sibling, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 3:08 UTC (permalink / raw) To: KOSAKI Motohiro Cc: Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli, Minchan Kim KOSAKI, > My reviewing doesn't found any bug. however I think original thread have too many guess > and we need to know reproduce way and confirm it. > > At least, we need three confirms. > o original issue is still there? As long as the root cause is still there :) > o DEF_PRIORITY/3 is best value? There are no best value. I suspect the whole PAGEOUT_IO_SYNC and wait_on_page_writeback() approach is a terrible workaround and should be avoided as much as possible. This is why I lifted the bar from DEF_PRIORITY/2 to DEF_PRIORITY/3. wait_on_page_writeback() is bad because for a typical desktop, one single call may block 1-10 seconds (remember we are under memory pressure, which is almost always accompanied with busy disk IO, so the page will wait noticeable time in the IO queue). To make it worse, it is very possible there are 10 more dirty/writeback pages in the isolated pages(dirty pages are often clustered). This ends up with 10-100 seconds stall time. We do need some throttling under memory pressure. However stall time more than 1s is not acceptable. A simple congestion_wait() may be better, since it waits on _any_ IO completion (which will likely release a set of PG_reclaim pages) rather than one specific IO completion. This makes much smoother stall time. wait_on_page_writeback() shall really be the last resort. DEF_PRIORITY/3 means 1/16=6.25%, which is closer. Since dirty/writeback pages are such a bad factor under memory pressure, it may deserve to adaptively shrink dirty_limit as well. When short on memory, why not reduce the dirty/writeback page cache? This will not only consume memory, but also considerably improve IO efficiency and responsiveness. When the LRU lists are scanned fast (under memory pressure), it is likely lots of the dirty pages are caught by pageout(). Reducing the number of dirty pages reduces the pageout() invocations. > o Current approach have better performance than Wu's original proposal? (below) I guess it will have better user experience :) > Anyway, please feel free to use my reviewed-by tag. Thanks, Fengguang > --- linux-next.orig/mm/vmscan.c 2010-06-24 14:32:03.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 16:12:34.000000000 +0800 > @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p > */ > if (sc->order > PAGE_ALLOC_COSTLY_ORDER) > sc->lumpy_reclaim_mode = 1; > - else if (sc->order && priority < DEF_PRIORITY - 2) > + else if (sc->order && priority < DEF_PRIORITY / 2) > sc->lumpy_reclaim_mode = 1; > else > sc->lumpy_reclaim_mode = 0; -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
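To put rough numbers on the stall times discussed here: when reclaim waits on pages queued behind a writeback backlog, the worst-case wait is bounded only by the device throughput. A back-of-the-envelope calculation with the figures used in this thread (USB 1.1 at about 1MB/s, a disk at about 80MB/s; the MB counts are the ones quoted earlier):

#include <stdio.h>

/* Worst case: the whole backlog ahead of our pages has to drain first. */
static double stall_seconds(double backlog_mb, double throughput_mb_s)
{
	return backlog_mb / throughput_mb_s;
}

int main(void)
{
	printf("50 MB writeback @  1 MB/s -> up to %.0f s stalled\n",
	       stall_seconds(50, 1));
	printf("22 MB writeback @  1 MB/s -> up to %.0f s stalled\n",
	       stall_seconds(22, 1));
	printf("50 MB writeback @ 80 MB/s -> up to %.1f s stalled\n",
	       stall_seconds(50, 80));
	return 0;
}

This is the arithmetic behind preferring a bounded congestion_wait() over waiting on one specific slow page with wait_on_page_writeback().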
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 3:08 ` Wu Fengguang @ 2010-07-26 3:11 ` Rik van Riel 2010-07-26 3:17 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Rik van Riel @ 2010-07-26 3:11 UTC (permalink / raw) To: Wu Fengguang Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli, Minchan Kim On 07/25/2010 11:08 PM, Wu Fengguang wrote: > We do need some throttling under memory pressure. However stall time > more than 1s is not acceptable. A simple congestion_wait() may be > better, since it waits on _any_ IO completion (which will likely > release a set of PG_reclaim pages) rather than one specific IO > completion. This makes much smoother stall time. > wait_on_page_writeback() shall really be the last resort. > DEF_PRIORITY/3 means 1/16=6.25%, which is closer. I agree with the max 1 second stall time, but 6.25% of memory could be an awful lot of pages to scan on a system with 1TB of memory :) Not sure what the best approach is, just pointing out that DEF_PRIORITY/3 may be too much for large systems... -- All rights reversed -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-26 3:11 ` Rik van Riel @ 2010-07-26 3:17 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 3:17 UTC (permalink / raw) To: Rik van Riel Cc: KOSAKI Motohiro, Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, KAMEZAWA Hiroyuki, Andrew Morton, Andrea Arcangeli, Minchan Kim On Mon, Jul 26, 2010 at 11:11:37AM +0800, Rik van Riel wrote: > On 07/25/2010 11:08 PM, Wu Fengguang wrote: > > > We do need some throttling under memory pressure. However stall time > > more than 1s is not acceptable. A simple congestion_wait() may be > > better, since it waits on _any_ IO completion (which will likely > > release a set of PG_reclaim pages) rather than one specific IO > > completion. This makes much smoother stall time. > > wait_on_page_writeback() shall really be the last resort. > > DEF_PRIORITY/3 means 1/16=6.25%, which is closer. > > I agree with the max 1 second stall time, but 6.25% of > memory could be an awful lot of pages to scan on a system > with 1TB of memory :) I totally ignored the 1TB systems out of this topic, because in such systems, <PAGE_ALLOC_COSTLY_ORDER pages are easily available? :) > Not sure what the best approach is, just pointing out > that DEF_PRIORITY/3 may be too much for large systems... What if DEF_PRIORITY/3 is used under PAGE_ALLOC_COSTLY_ORDER? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 9:21 ` Wu Fengguang 2010-07-22 10:48 ` Mel Gorman @ 2010-07-22 15:34 ` Minchan Kim 2010-07-23 11:59 ` Wu Fengguang 1 sibling, 1 reply; 87+ messages in thread From: Minchan Kim @ 2010-07-22 15:34 UTC (permalink / raw) To: Wu Fengguang Cc: Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli Hi, Wu. Thanks for Cced me. AFAIR, we discussed this by private mail and didn't conclude yet. Let's start from beginning. On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > I guess this new patch is more problem oriented and acceptable: > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > count_vm_events(PGDEACTIVATE, nr_active); > > > > nr_freed += shrink_page_list(&page_list, sc, > > - PAGEOUT_IO_SYNC); > > + priority < DEF_PRIORITY / 3 ? > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > } > > > > nr_reclaimed += nr_freed; > > This one looks better: > --- > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > Fix "system goes totally unresponsive with many dirty/writeback pages" > problem: > > http://lkml.org/lkml/2010/4/4/86 > > The root cause is, wait_on_page_writeback() is called too early in the > direct reclaim path, which blocks many random/unrelated processes when > some slow (USB stick) writeback is on the way. > > A simple dd can easily create a big range of dirty pages in the LRU > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > typical desktop, which triggers the lumpy reclaim mode and hence > wait_on_page_writeback(). I see oom message. order is zero. How is lumpy reclaim work? For working lumpy reclaim, we have to meet priority < 10 and sc->order > 0. Please, clarify the problem. > > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > the 22MB writeback and 190MB dirty pages. There can easily be a What's 22MB and 190M? It would be better to explain more detail. I think the description has to be clear as summary of the problem without the above link. Thanks for taking out this problem, again. :) -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 15:34 ` Minchan Kim @ 2010-07-23 11:59 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-23 11:59 UTC (permalink / raw) To: Minchan Kim Cc: Mel Gorman, Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli Hi Minchan, On Thu, Jul 22, 2010 at 11:34:40PM +0800, Minchan Kim wrote: > Hi, Wu. > Thanks for Cced me. > > AFAIR, we discussed this by private mail and didn't conclude yet. > Let's start from beginning. OK. > On Thu, Jul 22, 2010 at 05:21:55PM +0800, Wu Fengguang wrote: > > > I guess this new patch is more problem oriented and acceptable: > > > > > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > > > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > > > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > > > count_vm_events(PGDEACTIVATE, nr_active); > > > > > > nr_freed += shrink_page_list(&page_list, sc, > > > - PAGEOUT_IO_SYNC); > > > + priority < DEF_PRIORITY / 3 ? > > > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > > > } > > > > > > nr_reclaimed += nr_freed; > > > > This one looks better: > > --- > > vmscan: raise the bar to PAGEOUT_IO_SYNC stalls > > > > Fix "system goes totally unresponsive with many dirty/writeback pages" > > problem: > > > > http://lkml.org/lkml/2010/4/4/86 > > > > The root cause is, wait_on_page_writeback() is called too early in the > > direct reclaim path, which blocks many random/unrelated processes when > > some slow (USB stick) writeback is on the way. > > > > A simple dd can easily create a big range of dirty pages in the LRU > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > > typical desktop, which triggers the lumpy reclaim mode and hence > > wait_on_page_writeback(). > > I see oom message. order is zero. OOM after applying this patch? It's not an obvious consequence. > How is lumpy reclaim work? > For working lumpy reclaim, we have to meet priority < 10 and sc->order > 0. > > Please, clarify the problem. This patch tries to respect the lumpy reclaim logic, and only raises the bar for sync writeback and IO wait. With Mel's change, it's only doing so for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations. Hopefully this will limit unexpected side effects. > > > > In Andreas' case, 512MB/1024 = 512KB, this is way too low comparing to > > the 22MB writeback and 190MB dirty pages. There can easily be a > > What's 22MB and 190M? The numbers are adapted from the OOM dmesg in http://lkml.org/lkml/2010/4/4/86 . The OOM is order 0 and GFP_KERNEL. > It would be better to explain more detail. > I think the description has to be clear as summary of the problem > without the above link. Good suggestion. I'll try. > Thanks for taking out this problem, again. :) Heh, I'm actually feeling guilty for the long delay! Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 8:52 ` Wu Fengguang 2010-07-22 9:02 ` Wu Fengguang 2010-07-22 9:21 ` Wu Fengguang @ 2010-07-22 9:42 ` Mel Gorman 2010-07-23 8:33 ` Wu Fengguang 2 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-22 9:42 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim On Thu, Jul 22, 2010 at 04:52:10PM +0800, Wu Fengguang wrote: > > Some insight on how the other writeback changes that are being floated > > around might affect the number of dirty pages reclaim encounters would also > > be helpful. > > Here is an interesting related problem about the wait_on_page_writeback() call > inside shrink_page_list(): > > http://lkml.org/lkml/2010/4/4/86 > > The problem is, wait_on_page_writeback() is called too early in the > direct reclaim path, which blocks many random/unrelated processes when > some slow (USB stick) writeback is on the way. > > A simple dd can easily create a big range of dirty pages in the LRU > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > typical desktop, which triggers the lumpy reclaim mode and hence > wait_on_page_writeback(). > Lumpy reclaim is for high-order allocations. A simple dd should not be triggering it regularly unless there was a lot of forking going on at the same time. Also, how would a random or unrelated process get blocked on writeback unless they were also doing high-order allocations? What was the source of the high-order allocations? > I proposed this patch at the time, which was confirmed to solve the problem: > > --- linux-next.orig/mm/vmscan.c 2010-06-24 14:32:03.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 16:12:34.000000000 +0800 > @@ -1650,7 +1650,7 @@ static void set_lumpy_reclaim_mode(int p > */ > if (sc->order > PAGE_ALLOC_COSTLY_ORDER) > sc->lumpy_reclaim_mode = 1; > - else if (sc->order && priority < DEF_PRIORITY - 2) > + else if (sc->order && priority < DEF_PRIORITY / 2) > sc->lumpy_reclaim_mode = 1; > else > sc->lumpy_reclaim_mode = 0; > > > However KOSAKI and Minchan raised concerns about raising the bar. > I guess this new patch is more problem oriented and acceptable: > > --- linux-next.orig/mm/vmscan.c 2010-07-22 16:36:58.000000000 +0800 > +++ linux-next/mm/vmscan.c 2010-07-22 16:39:57.000000000 +0800 > @@ -1217,7 +1217,8 @@ static unsigned long shrink_inactive_lis > count_vm_events(PGDEACTIVATE, nr_active); > > nr_freed += shrink_page_list(&page_list, sc, > - PAGEOUT_IO_SYNC); > + priority < DEF_PRIORITY / 3 ? > + PAGEOUT_IO_SYNC : PAGEOUT_IO_ASYNC); > } > I'm not seeing how this helps. It delays when lumpy reclaim waits on IO to clean contiguous ranges of pages. I'll read that full thread as I wasn't aware of it before. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-22 9:42 ` Mel Gorman @ 2010-07-23 8:33 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-23 8:33 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Minchan Kim Hi Mel, On Thu, Jul 22, 2010 at 05:42:09PM +0800, Mel Gorman wrote: > On Thu, Jul 22, 2010 at 04:52:10PM +0800, Wu Fengguang wrote: > > > Some insight on how the other writeback changes that are being floated > > > around might affect the number of dirty pages reclaim encounters would also > > > be helpful. > > > > Here is an interesting related problem about the wait_on_page_writeback() call > > inside shrink_page_list(): > > > > http://lkml.org/lkml/2010/4/4/86 I guess you've got the answers from the above thread, anyway here is the brief answers to your questions. > > > > The problem is, wait_on_page_writeback() is called too early in the > > direct reclaim path, which blocks many random/unrelated processes when > > some slow (USB stick) writeback is on the way. > > > > A simple dd can easily create a big range of dirty pages in the LRU > > list. Therefore priority can easily go below (DEF_PRIORITY - 2) in a > > typical desktop, which triggers the lumpy reclaim mode and hence > > wait_on_page_writeback(). > > > > Lumpy reclaim is for high-order allocations. A simple dd should not be > triggering it regularly unless there was a lot of forking going on at the > same time. dd could create the dirty file fast enough, so that no other processes are injecting pages into the LRU lists besides dd itself. So it's creating a large range of hard-to-reclaim LRU pages which will trigger this code + else if (sc->order && priority < DEF_PRIORITY - 2) + lumpy_reclaim = 1; > Also, how would a random or unrelated process get blocked on > writeback unless they were also doing high-order allocations? What was the > source of the high-order allocations? sc->order is 1 on fork(). Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
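Putting Wu's two answers together: a streaming dd keeps refilling the inactive list with hard-to-reclaim dirty pages, so priority drops below DEF_PRIORITY - 2, and every fork() in the background is an order-1 allocation, which satisfies the second branch of the set_lumpy_reclaim_mode() code quoted earlier in the thread. A small standalone model of that branch (constants stubbed to their usual values) shows which combinations switch lumpy reclaim on:

#include <stdio.h>

#define DEF_PRIORITY            12
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Model of the set_lumpy_reclaim_mode() logic quoted in this thread. */
static int lumpy_reclaim_mode(int order, int priority)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return 1;	/* really costly allocations: always lumpy */
	if (order && priority < DEF_PRIORITY - 2)
		return 1;	/* e.g. order-1 fork() once priority < 10 */
	return 0;
}

int main(void)
{
	printf("order 0, priority  9 -> lumpy %d\n", lumpy_reclaim_mode(0, 9));
	printf("order 1, priority 11 -> lumpy %d\n", lumpy_reclaim_mode(1, 11));
	printf("order 1, priority  9 -> lumpy %d\n", lumpy_reclaim_mode(1, 9));
	printf("order 4, priority 12 -> lumpy %d\n", lumpy_reclaim_mode(4, 12));
	return 0;
}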
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 14:21 ` Christoph Hellwig 2010-07-19 14:40 ` Mel Gorman @ 2010-07-22 1:13 ` Wu Fengguang 1 sibling, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-22 1:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Mel Gorman, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:21:45PM +0800, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:29PM +0100, Mel Gorman wrote: > > From: Wu Fengguang <fengguang.wu@intel.com> > > > > A background flush work may run for ever. So it's reasonable for it to > > mimic the kupdate behavior of syncing old/expired inodes first. > > > > This behavior also makes sense from the perspective of page reclaim. > > File pages are added to the inactive list and promoted if referenced > > after one recycling. If not referenced, it's very easy for pages to be > > cleaned from reclaim context which is inefficient in terms of IO. If > > background flush is cleaning pages, it's best it cleans old pages to > > help minimise IO from reclaim. > > Yes, we absolutely do this. Wu, do you have an improved version of the > pending or should we put it in this version for now? Sorry for the delay! The code looks a bit hacky, and there is a problem: it only decrease expire_interval and never increase/reset it. So it's possible when dirty workload first goes light then goes heavy, expire_interval may be reduced to 0 and never be able to grow up again. In the end we revert to the old behavior of ignoring dirtied_when totally. A more complete solution would be to make use of older_than_this not only for the kupdate case, but also for the background and sync cases. The policies can be most cleanly carried out in move_expired_inodes(). - kupdate: older_than_this = jiffies - 30s - background: older_than_this = TRY FROM (jiffies - 30s) TO (jiffies), UNTIL get some inodes to sync - sync: older_than_this = start time of sync I'll post an untested RFC patchset for the kupdate and background cases. The sync case will need two more patch series due to other problems. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
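The three policies Wu lists can be read as three different choices of the expiry cutoff handed to move_expired_inodes(). The sketch below models only that cutoff selection as an ordinary C program (jiffies, the 30-second expire interval and the have_expired() probe are all simulated stand-ins; the real writeback paths carry far more state than this):

#include <stdio.h>

#define HZ              100
#define EXPIRE_INTERVAL (30 * HZ)	/* ~30 second dirty expire time */

enum wb_reason { WB_KUPDATE, WB_BACKGROUND, WB_SYNC };

/*
 * Model of the proposed "older_than_this" selection:
 *  - kupdate:    only inodes dirtied more than 30s ago
 *  - background: start at 30s and relax toward "now" until work is found
 *  - sync:       everything dirtied before the sync started
 */
static unsigned long older_than_this(enum wb_reason reason, unsigned long now,
				     unsigned long sync_start,
				     int (*have_expired)(unsigned long cutoff))
{
	unsigned long cutoff;

	switch (reason) {
	case WB_KUPDATE:
		return now - EXPIRE_INTERVAL;
	case WB_BACKGROUND:
		for (cutoff = now - EXPIRE_INTERVAL; cutoff < now; cutoff += HZ)
			if (have_expired(cutoff))
				break;
		return cutoff;
	case WB_SYNC:
	default:
		return sync_start;
	}
}

/* Pretend the oldest dirty inode was dirtied 10 seconds before "now" (9000). */
static int have_expired(unsigned long cutoff)
{
	return cutoff >= 9000 - 10 * HZ;
}

int main(void)
{
	unsigned long now = 9000;

	printf("kupdate    cutoff: %lu\n", older_than_this(WB_KUPDATE, now, 0, have_expired));
	printf("background cutoff: %lu\n", older_than_this(WB_BACKGROUND, now, 0, have_expired));
	printf("sync       cutoff: %lu\n", older_than_this(WB_SYNC, now, 8500, have_expired));
	return 0;
}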
* Re: [PATCH 7/8] writeback: sync old inodes first in background writeback 2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman 2010-07-19 14:21 ` Christoph Hellwig @ 2010-07-19 18:43 ` Rik van Riel 1 sibling, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 18:43 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > From: Wu Fengguang<fengguang.wu@intel.com> > > A background flush work may run for ever. So it's reasonable for it to > mimic the kupdate behavior of syncing old/expired inodes first. > > This behavior also makes sense from the perspective of page reclaim. > File pages are added to the inactive list and promoted if referenced > after one recycling. If not referenced, it's very easy for pages to be > cleaned from reclaim context which is inefficient in terms of IO. If > background flush is cleaning pages, it's best it cleans old pages to > help minimise IO from reclaim. > > Signed-off-by: Wu Fengguang<fengguang.wu@intel.com> > Signed-off-by: Mel Gorman<mel@csn.ul.ie> Acked-by: Rik van Riel <riel@redhat.com> It can probably be optimized, but we really need something like this... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman ` (6 preceding siblings ...) 2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman @ 2010-07-19 13:11 ` Mel Gorman 2010-07-19 14:23 ` Christoph Hellwig ` (3 more replies) 7 siblings, 4 replies; 87+ messages in thread From: Mel Gorman @ 2010-07-19 13:11 UTC (permalink / raw) To: linux-kernel, linux-fsdevel, linux-mm Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli, Mel Gorman There are a number of cases where pages get cleaned but two of concern to this patch are; o When dirtying pages, processes may be throttled to clean pages if dirty_ratio is not met. o Pages belonging to inodes dirtied longer than dirty_writeback_centisecs get cleaned. The problem for reclaim is that dirty pages can reach the end of the LRU if pages are being dirtied slowly so that neither the throttling cleans them or a flusher thread waking periodically. Background flush is already cleaning old or expired inodes first but the expire time is too far in the future at the time of page reclaim. To mitigate future problems, this patch wakes flusher threads to clean 1.5 times the number of dirty pages encountered by reclaimers. The reasoning is that pages were being dirtied at a roughly constant rate recently so if N dirty pages were encountered in this scan block, we are likely to see roughly N dirty pages again soon so try keep the flusher threads ahead of reclaim. This is unfortunately very hand-wavy but there is not really a good way of quantifying how bad it is when reclaim encounters dirty pages other than "down with that sort of thing". Similarly, there is not an obvious way of figuring how what percentage of dirty pages are old in terms of LRU-age and should be cleaned. Ideally, the background flushers would only be cleaning pages belonging to the zone being scanned but it's not clear if this would be of benefit (less IO) or not (potentially less efficient IO if an inode is scattered across multiple zones). Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- mm/vmscan.c | 18 +++++++++++------- 1 files changed, 11 insertions(+), 7 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index bc50937..5763719 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -806,6 +806,8 @@ restart_dirty: } if (PageDirty(page)) { + nr_dirty++; + /* * If the caller cannot writeback pages, dirty pages * are put on a separate list for cleaning by either @@ -814,7 +816,6 @@ restart_dirty: if (!reclaim_can_writeback(sc, page)) { list_add(&page->lru, &dirty_pages); unlock_page(page); - nr_dirty++; goto keep_dirty; } @@ -933,13 +934,16 @@ keep_dirty: VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); } + /* + * If reclaim is encountering dirty pages, it may be because + * dirty pages are reaching the end of the LRU even though + * the dirty_ratio may be satisified. In this case, wake + * flusher threads to pro-actively clean some pages + */ + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); + if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { - /* - * Wakeup a flusher thread to clean at least as many dirty - * pages as encountered by direct reclaim. 
Wait on congestion - * to throttle processes cleaning dirty pages - */ - wakeup_flusher_threads(nr_dirty); + /* Throttle direct reclaimers cleaning pages */ congestion_wait(BLK_RW_ASYNC, HZ/10); /* -- 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman @ 2010-07-19 14:23 ` Christoph Hellwig 2010-07-19 14:37 ` Mel Gorman 2010-07-19 18:59 ` Rik van Riel ` (2 subsequent siblings) 3 siblings, 1 reply; 87+ messages in thread From: Christoph Hellwig @ 2010-07-19 14:23 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote: > + /* > + * If reclaim is encountering dirty pages, it may be because > + * dirty pages are reaching the end of the LRU even though > + * the dirty_ratio may be satisified. In this case, wake > + * flusher threads to pro-actively clean some pages > + */ > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > + Where is the laptop-mode magic coming from? And btw, at least currently wakeup_flusher_threads writes back nr_pages for each BDI, which might not be what you want. Then again probably no caller wants it, but I don't see an easy way to fix it. ^ permalink raw reply [flat|nested] 87+ messages in thread
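For reference, a simplified sketch of what wakeup_flusher_threads() did at the time, illustrating Christoph's point that the nr_pages request is applied to every BDI with dirty IO; bdi_queue_writeback() is a stand-in name for the per-BDI helper, not a real function, and the real code differs in detail.

    void wakeup_flusher_threads(long nr_pages)
    {
            struct backing_dev_info *bdi;

            if (!nr_pages)          /* 0 means "everything dirty" */
                    nr_pages = global_page_state(NR_FILE_DIRTY) +
                               global_page_state(NR_UNSTABLE_NFS);

            rcu_read_lock();
            list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
                    if (!bdi_has_dirty_io(bdi))
                            continue;
                    /* each BDI is asked for the full nr_pages */
                    bdi_queue_writeback(bdi, nr_pages);
            }
            rcu_read_unlock();
    }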
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 14:23 ` Christoph Hellwig @ 2010-07-19 14:37 ` Mel Gorman 2010-07-19 22:48 ` Johannes Weiner 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-19 14:37 UTC (permalink / raw) To: Christoph Hellwig Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote: > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote: > > + /* > > + * If reclaim is encountering dirty pages, it may be because > > + * dirty pages are reaching the end of the LRU even though > > + * the dirty_ratio may be satisified. In this case, wake > > + * flusher threads to pro-actively clean some pages > > + */ > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > + > > Where is the laptop-mode magic coming from? > It comes from other parts of page reclaim where writing pages is avoided by page reclaim where possible. Things like this wakeup_flusher_threads(laptop_mode ? 0 : total_scanned); and .may_writepage = !laptop_mode although the latter can get disabled too. Deleting the magic is an option which would trade IO efficiency for power efficiency but my current thinking is laptop mode preferred reduced power. > And btw, at least currently wakeup_flusher_threads writes back nr_pages > for each BDI, which might not be what you want. I saw you pointing that out in another thread all right although I can't remember the context. It's not exactly what I want but then again we really want writing back of pages from a particular zone which we don't get either. There did not seem to be an ideal here and this appeared to be "less bad" than the alternatives. > Then again probably > no caller wants it, but I don't see an easy way to fix it. > I didn't either but my writeback-foo is weak (getting better but still weak). I hoped to bring it up at MM Summit and maybe at the Filesystem Summit too to see what ideas exist to improve this. When this idea was first floated, you called it a band-aid and I prioritised writing back old inodes over this. How do you feel about this approach now? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 14:37 ` Mel Gorman @ 2010-07-19 22:48 ` Johannes Weiner 2010-07-20 14:10 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Johannes Weiner @ 2010-07-19 22:48 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote: > On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote: > > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote: > > > + /* > > > + * If reclaim is encountering dirty pages, it may be because > > > + * dirty pages are reaching the end of the LRU even though > > > + * the dirty_ratio may be satisified. In this case, wake > > > + * flusher threads to pro-actively clean some pages > > > + */ > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > + > > > > Where is the laptop-mode magic coming from? > > > > It comes from other parts of page reclaim where writing pages is avoided > by page reclaim where possible. Things like this > > wakeup_flusher_threads(laptop_mode ? 0 : total_scanned); Actually, it's not avoiding writing pages in laptop mode, instead it is lumping writeouts aggressively (as I wrote in my other mail, .nr_pages=0 means 'write everything') to keep disk spinups rare and make maximum use of them. > although the latter can get disabled too. Deleting the magic is an > option which would trade IO efficiency for power efficiency but my > current thinking is laptop mode preferred reduced power. Maybe couple your wakeup with sc->may_writepage? It is usually false for laptop_mode but direct reclaimers enable it at one point in do_try_to_free_pages() when it scanned more than 150% of the reclaim target, so you could use existing disk spin-up points instead of introducing new ones or disabling the heuristics in laptop mode. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
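The do_try_to_free_pages() logic Johannes refers to, condensed from the 2.6.35-era source; the 150% figure corresponds to the writeback_threshold below.

    writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
    if (total_scanned > writeback_threshold) {
            /* in laptop mode, nr_pages == 0: write out everything */
            wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
            sc->may_writepage = 1;
    }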
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 22:48 ` Johannes Weiner @ 2010-07-20 14:10 ` Mel Gorman 2010-07-20 22:05 ` Johannes Weiner 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-20 14:10 UTC (permalink / raw) To: Johannes Weiner Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote: > On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote: > > On Mon, Jul 19, 2010 at 10:23:49AM -0400, Christoph Hellwig wrote: > > > On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote: > > > > + /* > > > > + * If reclaim is encountering dirty pages, it may be because > > > > + * dirty pages are reaching the end of the LRU even though > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > + * flusher threads to pro-actively clean some pages > > > > + */ > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > + > > > > > > Where is the laptop-mode magic coming from? > > > > > > > It comes from other parts of page reclaim where writing pages is avoided > > by page reclaim where possible. Things like this > > > > wakeup_flusher_threads(laptop_mode ? 0 : total_scanned); > > Actually, it's not avoiding writing pages in laptop mode, instead it > is lumping writeouts aggressively (as I wrote in my other mail, > .nr_pages=0 means 'write everything') to keep disk spinups rare and > make maximum use of them. > You're right, 0 does mean flush everything - /me slaps self. It was introduced in 2.6.6 with the patch "[PATCH] laptop mode". Quoting from it Algorithm: the idea is to hold dirty data in memory for a long time, but to flush everything which has been accumulated if the disk happens to spin up for other reasons. So, the reason for the magic is half right - avoid excessive disk spin-ups but my reasoning for it was wrong. I thought it was avoiding a cleaning to save power. What it is actually intended to do is "if we are spinning up the disk anyway, do as much work as possible so it can spin down for longer later". Where it's wrong is that it should only wakeup flusher threads if dirty pages were encountered. What it's doing right now is potentially cleaning everything. It means I need to rerun all the tests and see if the number of pages encountered by page reclaim is really reduced or was it because I was calling wakeup_flusher_threads(0) when no dirty pages were encountered. > > although the latter can get disabled too. Deleting the magic is an > > option which would trade IO efficiency for power efficiency but my > > current thinking is laptop mode preferred reduced power. > > Maybe couple your wakeup with sc->may_writepage? It is usually false > for laptop_mode but direct reclaimers enable it at one point in > do_try_to_free_pages() when it scanned more than 150% of the reclaim > target, so you could use existing disk spin-up points instead of > introducing new ones or disabling the heuristics in laptop mode. > How about the following? if (nr_dirty && sc->may_writepage) wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); 1. Wakup flusher threads if dirty pages are encountered 2. For direct reclaim, only wake them up if may_writepage is set indicating that the system is ready to spin up disks and start reclaiming 3. 
In laptop_mode, flush everything to reduce future spin-ups -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-20 14:10 ` Mel Gorman @ 2010-07-20 22:05 ` Johannes Weiner 0 siblings, 0 replies; 87+ messages in thread From: Johannes Weiner @ 2010-07-20 22:05 UTC (permalink / raw) To: Mel Gorman Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 20, 2010 at 03:10:49PM +0100, Mel Gorman wrote: > On Tue, Jul 20, 2010 at 12:48:39AM +0200, Johannes Weiner wrote: > > On Mon, Jul 19, 2010 at 03:37:37PM +0100, Mel Gorman wrote: > > > although the latter can get disabled too. Deleting the magic is an > > > option which would trade IO efficiency for power efficiency but my > > > current thinking is laptop mode preferred reduced power. > > > > Maybe couple your wakeup with sc->may_writepage? It is usually false > > for laptop_mode but direct reclaimers enable it at one point in > > do_try_to_free_pages() when it scanned more than 150% of the reclaim > > target, so you could use existing disk spin-up points instead of > > introducing new ones or disabling the heuristics in laptop mode. > > > > How about the following? > > if (nr_dirty && sc->may_writepage) > wakeup_flusher_threads(laptop_mode ? 0 : > nr_dirty + nr_dirty / 2); > > > 1. Wakup flusher threads if dirty pages are encountered > 2. For direct reclaim, only wake them up if may_writepage is set > indicating that the system is ready to spin up disks and start > reclaiming > 3. In laptop_mode, flush everything to reduce future spin-ups Sounds like the sanest approach to me. Thanks. Hannes ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman 2010-07-19 14:23 ` Christoph Hellwig @ 2010-07-19 18:59 ` Rik van Riel 2010-07-19 22:26 ` Johannes Weiner 2010-07-26 7:28 ` Wu Fengguang 3 siblings, 0 replies; 87+ messages in thread From: Rik van Riel @ 2010-07-19 18:59 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Johannes Weiner, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On 07/19/2010 09:11 AM, Mel Gorman wrote: > There are a number of cases where pages get cleaned but two of concern > to this patch are; > o When dirtying pages, processes may be throttled to clean pages if > dirty_ratio is not met. > o Pages belonging to inodes dirtied longer than > dirty_writeback_centisecs get cleaned. > > The problem for reclaim is that dirty pages can reach the end of the LRU > if pages are being dirtied slowly so that neither the throttling cleans > them or a flusher thread waking periodically. I can't see a better way to do this without creating a way-too-big-to-merge patch series, and this patch should result in the right behaviour, so ... Acked-by: Rik van Riel <riel@redhat.com> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman 2010-07-19 14:23 ` Christoph Hellwig 2010-07-19 18:59 ` Rik van Riel @ 2010-07-19 22:26 ` Johannes Weiner 2010-07-26 7:28 ` Wu Fengguang 3 siblings, 0 replies; 87+ messages in thread From: Johannes Weiner @ 2010-07-19 22:26 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel, linux-fsdevel, linux-mm, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Christoph Hellwig, Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 02:11:30PM +0100, Mel Gorman wrote: > @@ -933,13 +934,16 @@ keep_dirty: > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > } > > + /* > + * If reclaim is encountering dirty pages, it may be because > + * dirty pages are reaching the end of the LRU even though > + * the dirty_ratio may be satisified. In this case, wake > + * flusher threads to pro-actively clean some pages > + */ > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); An argument of 0 means 'every dirty page in the system', I assume this is not what you wanted, right? Something like this? if (nr_dirty && !laptop_mode) wakeup_flusher_threads(nr_dirty + nr_dirty / 2); -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman ` (2 preceding siblings ...) 2010-07-19 22:26 ` Johannes Weiner @ 2010-07-26 7:28 ` Wu Fengguang 2010-07-26 9:26 ` Mel Gorman 3 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 7:28 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote: > There are a number of cases where pages get cleaned but two of concern > to this patch are; > o When dirtying pages, processes may be throttled to clean pages if > dirty_ratio is not met. > o Pages belonging to inodes dirtied longer than > dirty_writeback_centisecs get cleaned. > > The problem for reclaim is that dirty pages can reach the end of the LRU > if pages are being dirtied slowly so that neither the throttling cleans > them or a flusher thread waking periodically. > > Background flush is already cleaning old or expired inodes first but the > expire time is too far in the future at the time of page reclaim. To mitigate > future problems, this patch wakes flusher threads to clean 1.5 times the > number of dirty pages encountered by reclaimers. The reasoning is that pages > were being dirtied at a roughly constant rate recently so if N dirty pages > were encountered in this scan block, we are likely to see roughly N dirty > pages again soon so try keep the flusher threads ahead of reclaim. > > This is unfortunately very hand-wavy but there is not really a good way of > quantifying how bad it is when reclaim encounters dirty pages other than > "down with that sort of thing". Similarly, there is not an obvious way of > figuring how what percentage of dirty pages are old in terms of LRU-age and > should be cleaned. Ideally, the background flushers would only be cleaning > pages belonging to the zone being scanned but it's not clear if this would > be of benefit (less IO) or not (potentially less efficient IO if an inode > is scattered across multiple zones). > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > mm/vmscan.c | 18 +++++++++++------- > 1 files changed, 11 insertions(+), 7 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index bc50937..5763719 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -806,6 +806,8 @@ restart_dirty: > } > > if (PageDirty(page)) { > + nr_dirty++; > + > /* > * If the caller cannot writeback pages, dirty pages > * are put on a separate list for cleaning by either > @@ -814,7 +816,6 @@ restart_dirty: > if (!reclaim_can_writeback(sc, page)) { > list_add(&page->lru, &dirty_pages); > unlock_page(page); > - nr_dirty++; > goto keep_dirty; > } > > @@ -933,13 +934,16 @@ keep_dirty: > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > } > > + /* > + * If reclaim is encountering dirty pages, it may be because > + * dirty pages are reaching the end of the LRU even though > + * the dirty_ratio may be satisified. In this case, wake > + * flusher threads to pro-actively clean some pages > + */ > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); Ah it's very possible that nr_dirty==0 here! 
Then you are hitting the number of dirty pages down to 0 whether or not pageout() is called. Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is normally a small number, much smaller than MAX_WRITEBACK_PAGES. The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good for efficiency. And it seems good to let the flusher write much more than nr_dirty pages to safeguard a reasonable large vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to update the comments. Thanks, Fengguang > if (dirty_isolated < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) { > - /* > - * Wakeup a flusher thread to clean at least as many dirty > - * pages as encountered by direct reclaim. Wait on congestion > - * to throttle processes cleaning dirty pages > - */ > - wakeup_flusher_threads(nr_dirty); > + /* Throttle direct reclaimers cleaning pages */ > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* > -- > 1.7.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
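For scale, assuming 4K pages and the constants of the era:

    SWAP_CLUSTER_MAX         =   32 pages  (usual shrink_page_list() batch)
    nr_dirty + nr_dirty / 2  <=  48 pages  (typical wakeup argument above)
    MAX_WRITEBACK_PAGES      = 1024 pages  (4MB, one flusher batch)

so the amount requested from reclaim is normally well below what the flusher writes in a single pass anyway, which is Wu's point.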
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-26 7:28 ` Wu Fengguang @ 2010-07-26 9:26 ` Mel Gorman 2010-07-26 11:27 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-26 9:26 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 03:28:32PM +0800, Wu Fengguang wrote: > On Mon, Jul 19, 2010 at 09:11:30PM +0800, Mel Gorman wrote: > > There are a number of cases where pages get cleaned but two of concern > > to this patch are; > > o When dirtying pages, processes may be throttled to clean pages if > > dirty_ratio is not met. > > o Pages belonging to inodes dirtied longer than > > dirty_writeback_centisecs get cleaned. > > > > The problem for reclaim is that dirty pages can reach the end of the LRU > > if pages are being dirtied slowly so that neither the throttling cleans > > them or a flusher thread waking periodically. > > > > Background flush is already cleaning old or expired inodes first but the > > expire time is too far in the future at the time of page reclaim. To mitigate > > future problems, this patch wakes flusher threads to clean 1.5 times the > > number of dirty pages encountered by reclaimers. The reasoning is that pages > > were being dirtied at a roughly constant rate recently so if N dirty pages > > were encountered in this scan block, we are likely to see roughly N dirty > > pages again soon so try keep the flusher threads ahead of reclaim. > > > > This is unfortunately very hand-wavy but there is not really a good way of > > quantifying how bad it is when reclaim encounters dirty pages other than > > "down with that sort of thing". Similarly, there is not an obvious way of > > figuring how what percentage of dirty pages are old in terms of LRU-age and > > should be cleaned. Ideally, the background flushers would only be cleaning > > pages belonging to the zone being scanned but it's not clear if this would > > be of benefit (less IO) or not (potentially less efficient IO if an inode > > is scattered across multiple zones). > > > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > > --- > > mm/vmscan.c | 18 +++++++++++------- > > 1 files changed, 11 insertions(+), 7 deletions(-) > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index bc50937..5763719 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -806,6 +806,8 @@ restart_dirty: > > } > > > > if (PageDirty(page)) { > > + nr_dirty++; > > + > > /* > > * If the caller cannot writeback pages, dirty pages > > * are put on a separate list for cleaning by either > > @@ -814,7 +816,6 @@ restart_dirty: > > if (!reclaim_can_writeback(sc, page)) { > > list_add(&page->lru, &dirty_pages); > > unlock_page(page); > > - nr_dirty++; > > goto keep_dirty; > > } > > > > @@ -933,13 +934,16 @@ keep_dirty: > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > } > > > > + /* > > + * If reclaim is encountering dirty pages, it may be because > > + * dirty pages are reaching the end of the LRU even though > > + * the dirty_ratio may be satisified. In this case, wake > > + * flusher threads to pro-actively clean some pages > > + */ > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > Ah it's very possible that nr_dirty==0 here! 
Then you are hitting the > number of dirty pages down to 0 whether or not pageout() is called. > True, this has been fixed to only wakeup flusher threads when this is the file LRU, dirty pages have been encountered and the caller has sc->may_writepage. > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > for efficiency. > And it seems good to let the flusher write much more > than nr_dirty pages to safeguard a reasonable large > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > update the comments. > Ok, the reasoning had been to flush a number of pages that was related to the scanning rate but if that is inefficient for the flusher, I'll use MAX_WRITEBACK_PAGES. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
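The guard Mel describes ends up as the following hunk in the prototype posted later in this thread:

    if (file && nr_dirty_seen && sc->may_writepage)
            wakeup_flusher_threads(nr_writeback_pages(nr_dirty));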
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-26 9:26 ` Mel Gorman @ 2010-07-26 11:27 ` Wu Fengguang 2010-07-26 12:57 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 11:27 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli > > > @@ -933,13 +934,16 @@ keep_dirty: > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > } > > > > > > + /* > > > + * If reclaim is encountering dirty pages, it may be because > > > + * dirty pages are reaching the end of the LRU even though > > > + * the dirty_ratio may be satisified. In this case, wake > > > + * flusher threads to pro-actively clean some pages > > > + */ > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > number of dirty pages down to 0 whether or not pageout() is called. > > > > True, this has been fixed to only wakeup flusher threads when this is > the file LRU, dirty pages have been encountered and the caller has > sc->may_writepage. OK. > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > for efficiency. > > And it seems good to let the flusher write much more > > than nr_dirty pages to safeguard a reasonable large > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > update the comments. > > > > Ok, the reasoning had been to flush a number of pages that was related > to the scanning rate but if that is inefficient for the flusher, I'll > use MAX_WRITEBACK_PAGES. It would be better to pass something like (nr_dirty * N). MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is obviously too large as a parameter. When the batch size is increased to 128MB, the writeback code may be improved somehow to not exceed the nr_pages limit too much. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-26 11:27 ` Wu Fengguang @ 2010-07-26 12:57 ` Mel Gorman 2010-07-26 13:10 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-26 12:57 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > } > > > > > > > > + /* > > > > + * If reclaim is encountering dirty pages, it may be because > > > > + * dirty pages are reaching the end of the LRU even though > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > + * flusher threads to pro-actively clean some pages > > > > + */ > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > the file LRU, dirty pages have been encountered and the caller has > > sc->may_writepage. > > OK. > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > for efficiency. > > > And it seems good to let the flusher write much more > > > than nr_dirty pages to safeguard a reasonable large > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > update the comments. > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > to the scanning rate but if that is inefficient for the flusher, I'll > > use MAX_WRITEBACK_PAGES. > > It would be better to pass something like (nr_dirty * N). > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > obviously too large as a parameter. When the batch size is increased > to 128MB, the writeback code may be improved somehow to not exceed the > nr_pages limit too much. > What might be a useful value for N? 1.5 appears to work reasonably well to create a window of writeback ahead of the scanner but it's a bit arbitrary. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-26 12:57 ` Mel Gorman @ 2010-07-26 13:10 ` Wu Fengguang 2010-07-27 13:35 ` Mel Gorman 0 siblings, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-26 13:10 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote: > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > > } > > > > > > > > > > + /* > > > > > + * If reclaim is encountering dirty pages, it may be because > > > > > + * dirty pages are reaching the end of the LRU even though > > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > > + * flusher threads to pro-actively clean some pages > > > > > + */ > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > > the file LRU, dirty pages have been encountered and the caller has > > > sc->may_writepage. > > > > OK. > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > > for efficiency. > > > > And it seems good to let the flusher write much more > > > > than nr_dirty pages to safeguard a reasonable large > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > > update the comments. > > > > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > > to the scanning rate but if that is inefficient for the flusher, I'll > > > use MAX_WRITEBACK_PAGES. > > > > It would be better to pass something like (nr_dirty * N). > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > > obviously too large as a parameter. When the batch size is increased > > to 128MB, the writeback code may be improved somehow to not exceed the > > nr_pages limit too much. > > > > What might be a useful value for N? 1.5 appears to work reasonably well > to create a window of writeback ahead of the scanner but it's a bit > arbitrary. I'd recommend N to be a large value. It's no longer relevant now since we'll call the flusher to sync some range containing the target page. The flusher will then choose an N large enough (eg. 4MB) for efficient IO. It needs to be a large value, otherwise the vmscan code will quickly run into dirty pages again.. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-26 13:10 ` Wu Fengguang @ 2010-07-27 13:35 ` Mel Gorman 2010-07-27 14:24 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-27 13:35 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote: > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote: > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > > > } > > > > > > > > > > > > + /* > > > > > > + * If reclaim is encountering dirty pages, it may be because > > > > > > + * dirty pages are reaching the end of the LRU even though > > > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > > > + * flusher threads to pro-actively clean some pages > > > > > > + */ > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > > > the file LRU, dirty pages have been encountered and the caller has > > > > sc->may_writepage. > > > > > > OK. > > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > > > for efficiency. > > > > > And it seems good to let the flusher write much more > > > > > than nr_dirty pages to safeguard a reasonable large > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > > > update the comments. > > > > > > > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > > > to the scanning rate but if that is inefficient for the flusher, I'll > > > > use MAX_WRITEBACK_PAGES. > > > > > > It would be better to pass something like (nr_dirty * N). > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > > > obviously too large as a parameter. When the batch size is increased > > > to 128MB, the writeback code may be improved somehow to not exceed the > > > nr_pages limit too much. > > > > > > > What might be a useful value for N? 1.5 appears to work reasonably well > > to create a window of writeback ahead of the scanner but it's a bit > > arbitrary. > > I'd recommend N to be a large value. It's no longer relevant now since > we'll call the flusher to sync some range containing the target page. > The flusher will then choose an N large enough (eg. 4MB) for efficient > IO. It needs to be a large value, otherwise the vmscan code will > quickly run into dirty pages again.. > Ok, I took the 4MB at face value to be a "reasonable amount that should not cause congestion". The end result is #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT) #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX) static inline long nr_writeback_pages(unsigned long nr_dirty) { return laptop_mode ? 
0 : min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR)); } nr_writeback_pages(nr_dirty) is what gets passed to wakeup_flusher_threads(). Does that seem sensible? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 13:35 ` Mel Gorman @ 2010-07-27 14:24 ` Wu Fengguang 2010-07-27 14:34 ` Wu Fengguang 2010-07-27 14:38 ` Mel Gorman 0 siblings, 2 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-27 14:24 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote: > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote: > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote: > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > > > > } > > > > > > > > > > > > > > + /* > > > > > > > + * If reclaim is encountering dirty pages, it may be because > > > > > > > + * dirty pages are reaching the end of the LRU even though > > > > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > > > > + * flusher threads to pro-actively clean some pages > > > > > > > + */ > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > > > > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > > > > the file LRU, dirty pages have been encountered and the caller has > > > > > sc->may_writepage. > > > > > > > > OK. > > > > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > > > > for efficiency. > > > > > > And it seems good to let the flusher write much more > > > > > > than nr_dirty pages to safeguard a reasonable large > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > > > > update the comments. > > > > > > > > > > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > > > > to the scanning rate but if that is inefficient for the flusher, I'll > > > > > use MAX_WRITEBACK_PAGES. > > > > > > > > It would be better to pass something like (nr_dirty * N). > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > > > > obviously too large as a parameter. When the batch size is increased > > > > to 128MB, the writeback code may be improved somehow to not exceed the > > > > nr_pages limit too much. > > > > > > > > > > What might be a useful value for N? 1.5 appears to work reasonably well > > > to create a window of writeback ahead of the scanner but it's a bit > > > arbitrary. > > > > I'd recommend N to be a large value. It's no longer relevant now since > > we'll call the flusher to sync some range containing the target page. > > The flusher will then choose an N large enough (eg. 4MB) for efficient > > IO. It needs to be a large value, otherwise the vmscan code will > > quickly run into dirty pages again.. > > > > Ok, I took the 4MB at face value to be a "reasonable amount that should > not cause congestion". 
Under memory pressure, the disk should be busy/congested anyway. The big 4MB adds much work, however many of the pages may need to be synced in the near future anyway. It also requires more time to do the bigger IO, hence adding some latency, however the latency should be a small factor comparing to the IO queue time (which will be long for a busy disk). Overall expectation is, the more efficient IO, the more progress :) > The end result is > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT) > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX) > static inline long nr_writeback_pages(unsigned long nr_dirty) > { > return laptop_mode ? 0 : > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR)); > } > > nr_writeback_pages(nr_dirty) is what gets passed to > wakeup_flusher_threads(). Does that seem sensible? If you plan to keep wakeup_flusher_threads(), a simpler form may be sufficient, eg. laptop_mode ? 0 : (nr_dirty * 16) On top of this, we may write another patch to convert the wakeup_flusher_threads(bdi, nr_pages) call to some bdi_start_inode_writeback(inode, offset) call, to start more oriented writeback. When talking the 4MB optimization, I was referring to the internal implementation of bdi_start_inode_writeback(). Sorry for the missing context in the previous email. It may need a big patch to implement bdi_start_inode_writeback(). Would you like to try it, or leave the task to me? Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
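A hypothetical sketch of the interface Wu is proposing; nothing here exists in the tree at this point, and queue_targeted_writeback() is an invented placeholder for whatever mechanism would hand the range to the per-BDI flusher thread without blocking the reclaimer.

    #define RECLAIM_WRITE_CHUNK     (4UL << (20 - PAGE_SHIFT))  /* ~4MB in pages */

    void bdi_start_inode_writeback(struct inode *inode, pgoff_t offset)
    {
            /* align so that neighbouring dirty pages go out in one pass */
            pgoff_t start = round_down(offset, RECLAIM_WRITE_CHUNK);

            /*
             * The flusher, not the reclaimer, expands this to an
             * IO-efficient extent and does the actual writeback.
             */
            queue_targeted_writeback(inode, start,
                                     start + RECLAIM_WRITE_CHUNK - 1);
    }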
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 14:24 ` Wu Fengguang @ 2010-07-27 14:34 ` Wu Fengguang 2010-07-27 14:40 ` Mel Gorman 2010-07-27 14:38 ` Mel Gorman 1 sibling, 1 reply; 87+ messages in thread From: Wu Fengguang @ 2010-07-27 14:34 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli > If you plan to keep wakeup_flusher_threads(), a simpler form may be > sufficient, eg. > > laptop_mode ? 0 : (nr_dirty * 16) This number is not sensitive because the writeback code may well round it up to some more IO efficient value (currently 4MB). AFAIK the nr_pages parameters passed by all existing flusher callers are some rule-of-thumb value, and far from being an exact number. Thanks, Fengguang -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 14:34 ` Wu Fengguang @ 2010-07-27 14:40 ` Mel Gorman 2010-07-27 14:55 ` Wu Fengguang 0 siblings, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-27 14:40 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote: > > If you plan to keep wakeup_flusher_threads(), a simpler form may be > > sufficient, eg. > > > > laptop_mode ? 0 : (nr_dirty * 16) > > This number is not sensitive because the writeback code may well round > it up to some more IO efficient value (currently 4MB). AFAIK the > nr_pages parameters passed by all existing flusher callers are some > rule-of-thumb value, and far from being an exact number. > I get that it's a rule of thumb but decided I would still pass in some value related to nr_dirty that was bounded in some manner. Currently, that bound is 4MB but maybe it should have been bound to MAX_WRITEBACK_PAGES (which is 4MB for x86, but could be anything depending on the base page size). -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 14:40 ` Mel Gorman @ 2010-07-27 14:55 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-27 14:55 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 10:40:26PM +0800, Mel Gorman wrote: > On Tue, Jul 27, 2010 at 10:34:23PM +0800, Wu Fengguang wrote: > > > If you plan to keep wakeup_flusher_threads(), a simpler form may be > > > sufficient, eg. > > > > > > laptop_mode ? 0 : (nr_dirty * 16) > > > > This number is not sensitive because the writeback code may well round > > it up to some more IO efficient value (currently 4MB). AFAIK the > > nr_pages parameters passed by all existing flusher callers are some > > rule-of-thumb value, and far from being an exact number. > > > > I get that it's a rule of thumb but decided I would still pass in some value > related to nr_dirty that was bounded in some manner. > Currently, that bound is 4MB but maybe it should have been bound to > MAX_WRITEBACK_PAGES (which is 4MB for x86, but could be anything > depending on the base page size). I see your worry about much bigger page size making vmscan batch size > writeback batch size and it's a legitimate worry. Thanks, Fengguang ^ permalink raw reply [flat|nested] 87+ messages in thread
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 14:24 ` Wu Fengguang 2010-07-27 14:34 ` Wu Fengguang @ 2010-07-27 14:38 ` Mel Gorman 2010-07-27 15:21 ` Wu Fengguang 1 sibling, 1 reply; 87+ messages in thread From: Mel Gorman @ 2010-07-27 14:38 UTC (permalink / raw) To: Wu Fengguang Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote: > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote: > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote: > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote: > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > > > > > } > > > > > > > > > > > > > > > > + /* > > > > > > > > + * If reclaim is encountering dirty pages, it may be because > > > > > > > > + * dirty pages are reaching the end of the LRU even though > > > > > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > > > > > + * flusher threads to pro-actively clean some pages > > > > > > > > + */ > > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > > > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > > > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > > > > > > > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > > > > > the file LRU, dirty pages have been encountered and the caller has > > > > > > sc->may_writepage. > > > > > > > > > > OK. > > > > > > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > > > > > for efficiency. > > > > > > > And it seems good to let the flusher write much more > > > > > > > than nr_dirty pages to safeguard a reasonable large > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > > > > > update the comments. > > > > > > > > > > > > > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll > > > > > > use MAX_WRITEBACK_PAGES. > > > > > > > > > > It would be better to pass something like (nr_dirty * N). > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > > > > > obviously too large as a parameter. When the batch size is increased > > > > > to 128MB, the writeback code may be improved somehow to not exceed the > > > > > nr_pages limit too much. > > > > > > > > > > > > > What might be a useful value for N? 1.5 appears to work reasonably well > > > > to create a window of writeback ahead of the scanner but it's a bit > > > > arbitrary. > > > > > > I'd recommend N to be a large value. It's no longer relevant now since > > > we'll call the flusher to sync some range containing the target page. > > > The flusher will then choose an N large enough (eg. 4MB) for efficient > > > IO. 
It needs to be a large value, otherwise the vmscan code will > > > quickly run into dirty pages again.. > > > > > > > Ok, I took the 4MB at face value to be a "reasonable amount that should > > not cause congestion". > > Under memory pressure, the disk should be busy/congested anyway. Not necessarily. It could be streaming reads where pages are being added to the LRU quickly but not necessarily dominated by dirty pages. Due to the scanning rate, a dirty page may be encountered but it could be rare. > The big 4MB adds much work, however many of the pages may need to be > synced in the near future anyway. It also requires more time to do > the bigger IO, hence adding some latency, however the latency should > be a small factor comparing to the IO queue time (which will be long > for a busy disk). > > Overall expectation is, the more efficient IO, the more progress :) > Ok. > > The end result is > > > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT) > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX) > > static inline long nr_writeback_pages(unsigned long nr_dirty) > > { > > return laptop_mode ? 0 : > > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR)); > > } > > > > nr_writeback_pages(nr_dirty) is what gets passed to > > wakeup_flusher_threads(). Does that seem sensible? > > If you plan to keep wakeup_flusher_threads(), a simpler form may be > sufficient, eg. > > laptop_mode ? 0 : (nr_dirty * 16) > I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to lumpy reclaim. I wanted to firmly bound how much writeback was being requested - hence the mild complexity. > On top of this, we may write another patch to convert the > wakeup_flusher_threads(bdi, nr_pages) call to some > bdi_start_inode_writeback(inode, offset) call, to start more oriented > writeback. > I did a first pass at optimising based on prioritising inodes related to dirty pages. It's incredibly primitive and I have to sit down and see how the entire of writeback is put together to improve on it. Maybe you'll spot something simple or see if it's the totally wrong direction. Patch is below. > When talking the 4MB optimization, I was referring to the internal > implementation of bdi_start_inode_writeback(). Sorry for the missing > context in the previous email. > No worries, I was assuming it was something in mainline I didn't know yet :) > It may need a big patch to implement bdi_start_inode_writeback(). > Would you like to try it, or leave the task to me? > If you send me a patch, I can try it out but it's not my highest priority right now. I'm still looking to get writeback-from-reclaim down to a reasonable level without causing a large amount of churn. Here is the first pass anyway at kicking wakeup_flusher_threads() for inodes belonging to a list of pages. You'll note that I do nothing with page offset because I didn't spot a simple way of taking that information into account. It's also horrible from a locking perspective. So far, it's testing has been "it didn't crash". ==== CUT HERE ==== writeback: Prioritise dirty inodes encountered by reclaim for background flushing It is preferable that as few dirty pages as possible are dispatched for cleaning from the page reclaim path. When dirty pages are encountered by page reclaim, this patch marks the inodes that they should be dispatched immediately. When the background flusher runs, it moves such inodes immediately to the dispatch queue regardless of inode age. 
This is an early prototype. It could be optimised to not regularly take the inode lock repeatedly and ideally the page offset would also be taken into account. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++++++++++++- include/linux/fs.h | 5 ++- include/linux/writeback.h | 1 + mm/vmscan.c | 6 +++- 4 files changed, 59 insertions(+), 5 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 5a3c764..27a8b75 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue, LIST_HEAD(tmp); struct list_head *pos, *node; struct super_block *sb = NULL; - struct inode *inode; + struct inode *inode, *tinode; int do_sb_sort = 0; if (wbc->for_kupdate || wbc->for_background) { @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue, older_than_this = jiffies - expire_interval; } + /* Move inodes reclaim found at end of LRU to dispatch queue */ + list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) { + if (inode->i_state & I_DIRTY_RECLAIM) { + inode->i_state &= ~I_DIRTY_RECLAIM; + list_move(&inode->i_list, &tmp); + } + } + while (!list_empty(delaying_queue)) { inode = list_entry(delaying_queue->prev, struct inode, i_list); if (expire_interval && @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages) rcu_read_unlock(); } +/* + * Similar to wakeup_flusher_threads except prioritise inodes contained + * in the page_list regardless of age + */ +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list) +{ + struct page *page; + struct address_space *mapping; + struct inode *inode; + + list_for_each_entry(page, page_list, lru) { + if (!PageDirty(page)) + continue; + + lock_page(page); + mapping = page_mapping(page); + if (!mapping || mapping == &swapper_space) + goto unlock; + + /* + * Test outside the lock to see as if it is already set, taking + * the inode lock is a waste and the inode should be pinned by + * the lock_page + */ + inode = page->mapping->host; + if (inode->i_state & I_DIRTY_RECLAIM) + goto unlock; + + /* + * XXX: Yuck, has to be a way of batching this by not requiring + * the page lock to pin the inode + */ + spin_lock(&inode_lock); + inode->i_state |= I_DIRTY_RECLAIM; + spin_unlock(&inode_lock); +unlock: + unlock_page(page); + } + + wakeup_flusher_threads(nr_pages); +} + static noinline void block_dump___mark_inode_dirty(struct inode *inode) { if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) { diff --git a/include/linux/fs.h b/include/linux/fs.h index e29f0ed..8836698 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1585,8 +1585,8 @@ struct super_operations { /* * Inode state bits. Protected by inode_lock. * - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC, - * I_DIRTY_DATASYNC and I_DIRTY_PAGES. + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC, + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM. * * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, * until that flag is cleared. 
I_WILL_FREE, I_FREEING and I_CLEAR are set at @@ -1633,6 +1633,7 @@ struct super_operations { #define I_DIRTY_SYNC 1 #define I_DIRTY_DATASYNC 2 #define I_DIRTY_PAGES 4 +#define I_DIRTY_RECLAIM 256 #define __I_NEW 3 #define I_NEW (1 << __I_NEW) #define I_WILL_FREE 16 diff --git a/include/linux/writeback.h b/include/linux/writeback.h index 494edd6..73a4df2 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb, struct writeback_control *wbc); long wb_do_writeback(struct bdi_writeback *wb, int force_wait); void wakeup_flusher_threads(long nr_pages); +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list); /* writeback.h requires fs.h; it, too, is not included from here. */ static inline void wait_on_inode(struct inode *inode) diff --git a/mm/vmscan.c b/mm/vmscan.c index b66d1f5..bad1abf 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -901,7 +901,8 @@ keep: * laptop mode avoiding disk spin-ups */ if (file && nr_dirty_seen && sc->may_writepage) - wakeup_flusher_threads(nr_writeback_pages(nr_dirty)); + wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty), + page_list); *nr_still_dirty = nr_dirty; count_vm_events(PGACTIVATE, pgactivate); @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, list_add(&page->lru, &putback_list); } - wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); + wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty, + &page_list); congestion_wait(BLK_RW_ASYNC, HZ/10); /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 87+ messages in thread
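To make the bound Mel settles on above concrete, here is a minimal standalone sketch (userspace C, not code from the series) of the nr_writeback_pages() arithmetic. The PAGE_SHIFT and SWAP_CLUSTER_MAX values are the common x86 defaults and are assumptions here; the point is that the request passed to the flusher scales with nr_dirty but stays clamped at roughly 4MB worth of pages even when lumpy reclaim pushes nr_dirty above SWAP_CLUSTER_MAX.

/*
 * Standalone illustration (userspace, not kernel code) of the bounding
 * discussed above.  PAGE_SHIFT and SWAP_CLUSTER_MAX are assumed common
 * defaults, not values taken from the patch.
 */
#include <stdio.h>

#define PAGE_SHIFT		12			/* 4K pages (assumed) */
#define SWAP_CLUSTER_MAX	32UL			/* reclaim batch size (assumed) */
#define MAX_WRITEBACK		(4194304UL >> PAGE_SHIFT)		/* 1024 pages == 4MB */
#define WRITEBACK_FACTOR	(MAX_WRITEBACK / SWAP_CLUSTER_MAX)	/* 32 */

static unsigned long nr_writeback_pages(unsigned long nr_dirty, int laptop_mode)
{
	unsigned long want = nr_dirty * WRITEBACK_FACTOR;

	if (laptop_mode)
		return 0;	/* laptop mode: avoid spinning the disk up */
	return want < MAX_WRITEBACK ? want : MAX_WRITEBACK;
}

int main(void)
{
	unsigned long nr_dirty[] = { 1, 8, 32, 128, 512 };
	unsigned int i;

	/* 1 -> 32, 8 -> 256, 32 -> 1024, then clamped at 1024 pages (4MB) */
	for (i = 0; i < sizeof(nr_dirty) / sizeof(nr_dirty[0]); i++)
		printf("nr_dirty=%4lu -> ask flusher for %4lu pages\n",
		       nr_dirty[i], nr_writeback_pages(nr_dirty[i], 0));
	return 0;
}

Compiled and run, the sketch prints requests of 32, 256 and 1024 pages for nr_dirty of 1, 8 and 32, and stays pinned at 1024 pages (4MB) beyond that, which is the firm bound Mel describes.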
* Re: [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages 2010-07-27 14:38 ` Mel Gorman @ 2010-07-27 15:21 ` Wu Fengguang 0 siblings, 0 replies; 87+ messages in thread From: Wu Fengguang @ 2010-07-27 15:21 UTC (permalink / raw) To: Mel Gorman Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Andrew Morton, Andrea Arcangeli On Tue, Jul 27, 2010 at 10:38:05PM +0800, Mel Gorman wrote: > On Tue, Jul 27, 2010 at 10:24:13PM +0800, Wu Fengguang wrote: > > On Tue, Jul 27, 2010 at 09:35:13PM +0800, Mel Gorman wrote: > > > On Mon, Jul 26, 2010 at 09:10:08PM +0800, Wu Fengguang wrote: > > > > On Mon, Jul 26, 2010 at 08:57:17PM +0800, Mel Gorman wrote: > > > > > On Mon, Jul 26, 2010 at 07:27:09PM +0800, Wu Fengguang wrote: > > > > > > > > > @@ -933,13 +934,16 @@ keep_dirty: > > > > > > > > > VM_BUG_ON(PageLRU(page) || PageUnevictable(page)); > > > > > > > > > } > > > > > > > > > > > > > > > > > > + /* > > > > > > > > > + * If reclaim is encountering dirty pages, it may be because > > > > > > > > > + * dirty pages are reaching the end of the LRU even though > > > > > > > > > + * the dirty_ratio may be satisified. In this case, wake > > > > > > > > > + * flusher threads to pro-actively clean some pages > > > > > > > > > + */ > > > > > > > > > + wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty + nr_dirty / 2); > > > > > > > > > > > > > > > > Ah it's very possible that nr_dirty==0 here! Then you are hitting the > > > > > > > > number of dirty pages down to 0 whether or not pageout() is called. > > > > > > > > > > > > > > > > > > > > > > True, this has been fixed to only wakeup flusher threads when this is > > > > > > > the file LRU, dirty pages have been encountered and the caller has > > > > > > > sc->may_writepage. > > > > > > > > > > > > OK. > > > > > > > > > > > > > > Another minor issue is, the passed (nr_dirty + nr_dirty / 2) is > > > > > > > > normally a small number, much smaller than MAX_WRITEBACK_PAGES. > > > > > > > > The flusher will sync at least MAX_WRITEBACK_PAGES pages, this is good > > > > > > > > for efficiency. > > > > > > > > And it seems good to let the flusher write much more > > > > > > > > than nr_dirty pages to safeguard a reasonable large > > > > > > > > vmscan-head-to-first-dirty-LRU-page margin. So it would be enough to > > > > > > > > update the comments. > > > > > > > > > > > > > > > > > > > > > > Ok, the reasoning had been to flush a number of pages that was related > > > > > > > to the scanning rate but if that is inefficient for the flusher, I'll > > > > > > > use MAX_WRITEBACK_PAGES. > > > > > > > > > > > > It would be better to pass something like (nr_dirty * N). > > > > > > MAX_WRITEBACK_PAGES may be increased to 128MB in the future, which is > > > > > > obviously too large as a parameter. When the batch size is increased > > > > > > to 128MB, the writeback code may be improved somehow to not exceed the > > > > > > nr_pages limit too much. > > > > > > > > > > > > > > > > What might be a useful value for N? 1.5 appears to work reasonably well > > > > > to create a window of writeback ahead of the scanner but it's a bit > > > > > arbitrary. > > > > > > > > I'd recommend N to be a large value. It's no longer relevant now since > > > > we'll call the flusher to sync some range containing the target page. > > > > The flusher will then choose an N large enough (eg. 
4MB) for efficient > > > > IO. It needs to be a large value, otherwise the vmscan code will > > > > quickly run into dirty pages again.. > > > > > > > > > > Ok, I took the 4MB at face value to be a "reasonable amount that should > > > not cause congestion". > > > > Under memory pressure, the disk should be busy/congested anyway. > > Not necessarily. It could be streaming reads where pages are being added > to the LRU quickly but not necessarily dominated by dirty pages. Due to the > scanning rate, a dirty page may be encountered but it could be rare. Right. > > The big 4MB adds much work, however many of the pages may need to be > > synced in the near future anyway. It also requires more time to do > > the bigger IO, hence adding some latency, however the latency should > > be a small factor comparing to the IO queue time (which will be long > > for a busy disk). > > > > Overall expectation is, the more efficient IO, the more progress :) > > > > Ok. > > > > The end result is > > > > > > #define MAX_WRITEBACK (4194304UL >> PAGE_SHIFT) > > > #define WRITEBACK_FACTOR (MAX_WRITEBACK / SWAP_CLUSTER_MAX) > > > static inline long nr_writeback_pages(unsigned long nr_dirty) > > > { > > > return laptop_mode ? 0 : > > > min(MAX_WRITEBACK, (nr_dirty * WRITEBACK_FACTOR)); > > > } > > > > > > nr_writeback_pages(nr_dirty) is what gets passed to > > > wakeup_flusher_threads(). Does that seem sensible? > > > > If you plan to keep wakeup_flusher_threads(), a simpler form may be > > sufficient, eg. > > > > laptop_mode ? 0 : (nr_dirty * 16) > > > > I plan to keep wakeup_flusher_threads() for now. I didn't go with 16 because > while nr_dirty will usually be < SWAP_CLUSTER_MAX, it might not be due to lumpy > reclaim. I wanted to firmly bound how much writeback was being requested - > hence the mild complexity. OK. > > On top of this, we may write another patch to convert the > > wakeup_flusher_threads(bdi, nr_pages) call to some > > bdi_start_inode_writeback(inode, offset) call, to start more oriented > > writeback. > > > > I did a first pass at optimising based on prioritising inodes related to > dirty pages. It's incredibly primitive and I have to sit down and see > how the entire of writeback is put together to improve on it. Maybe > you'll spot something simple or see if it's the totally wrong direction. > Patch is below. The simplest style may be struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, .nr_to_write = MAX_WRITEBACK_PAGES, }; mapping->writeback_index = offset; return do_writepages(mapping, &wbc); But sure there will be many details to handle. > > When talking the 4MB optimization, I was referring to the internal > > implementation of bdi_start_inode_writeback(). Sorry for the missing > > context in the previous email. > > > > No worries, I was assuming it was something in mainline I didn't know > yet :) > > > It may need a big patch to implement bdi_start_inode_writeback(). > > Would you like to try it, or leave the task to me? > > > > If you send me a patch, I can try it out but it's not my highest > priority right now. I'm still looking to get writeback-from-reclaim down > to a reasonable level without causing a large amount of churn. OK. That's already a great work. > Here is the first pass anyway at kicking wakeup_flusher_threads() for > inodes belonging to a list of pages. You'll note that I do nothing with > page offset because I didn't spot a simple way of taking that > information into account. It's also horrible from a locking perspective. 
> So far, it's testing has been "it didn't crash". It seems a neat way to prioritize the inodes with a new flag I_DIRTY_RECLAIM. However it may require vastly different implementation when considering the offset. I'll try to work up a prototype tomorrow. Thanks, Fengguang > ==== CUT HERE ==== > writeback: Prioritise dirty inodes encountered by reclaim for background flushing > > It is preferable that as few dirty pages as possible are dispatched for > cleaning from the page reclaim path. When dirty pages are encountered by > page reclaim, this patch marks the inodes that they should be dispatched > immediately. When the background flusher runs, it moves such inodes immediately > to the dispatch queue regardless of inode age. > > This is an early prototype. It could be optimised to not regularly take > the inode lock repeatedly and ideally the page offset would also be > taken into account. > > Signed-off-by: Mel Gorman <mel@csn.ul.ie> > --- > fs/fs-writeback.c | 52 ++++++++++++++++++++++++++++++++++++++++++++- > include/linux/fs.h | 5 ++- > include/linux/writeback.h | 1 + > mm/vmscan.c | 6 +++- > 4 files changed, 59 insertions(+), 5 deletions(-) > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c > index 5a3c764..27a8b75 100644 > --- a/fs/fs-writeback.c > +++ b/fs/fs-writeback.c > @@ -221,7 +221,7 @@ static void move_expired_inodes(struct list_head *delaying_queue, > LIST_HEAD(tmp); > struct list_head *pos, *node; > struct super_block *sb = NULL; > - struct inode *inode; > + struct inode *inode, *tinode; > int do_sb_sort = 0; > > if (wbc->for_kupdate || wbc->for_background) { > @@ -229,6 +229,14 @@ static void move_expired_inodes(struct list_head *delaying_queue, > older_than_this = jiffies - expire_interval; > } > > + /* Move inodes reclaim found at end of LRU to dispatch queue */ > + list_for_each_entry_safe(inode, tinode, delaying_queue, i_list) { > + if (inode->i_state & I_DIRTY_RECLAIM) { > + inode->i_state &= ~I_DIRTY_RECLAIM; > + list_move(&inode->i_list, &tmp); > + } > + } > + > while (!list_empty(delaying_queue)) { > inode = list_entry(delaying_queue->prev, struct inode, i_list); > if (expire_interval && > @@ -906,6 +914,48 @@ void wakeup_flusher_threads(long nr_pages) > rcu_read_unlock(); > } > > +/* > + * Similar to wakeup_flusher_threads except prioritise inodes contained > + * in the page_list regardless of age > + */ > +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list) > +{ > + struct page *page; > + struct address_space *mapping; > + struct inode *inode; > + > + list_for_each_entry(page, page_list, lru) { > + if (!PageDirty(page)) > + continue; > + > + lock_page(page); > + mapping = page_mapping(page); > + if (!mapping || mapping == &swapper_space) > + goto unlock; > + > + /* > + * Test outside the lock to see as if it is already set, taking > + * the inode lock is a waste and the inode should be pinned by > + * the lock_page > + */ > + inode = page->mapping->host; > + if (inode->i_state & I_DIRTY_RECLAIM) > + goto unlock; > + > + /* > + * XXX: Yuck, has to be a way of batching this by not requiring > + * the page lock to pin the inode > + */ > + spin_lock(&inode_lock); > + inode->i_state |= I_DIRTY_RECLAIM; > + spin_unlock(&inode_lock); > +unlock: > + unlock_page(page); > + } > + > + wakeup_flusher_threads(nr_pages); > +} > + > static noinline void block_dump___mark_inode_dirty(struct inode *inode) > { > if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) { > diff --git a/include/linux/fs.h b/include/linux/fs.h > index e29f0ed..8836698 
100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1585,8 +1585,8 @@ struct super_operations { > /* > * Inode state bits. Protected by inode_lock. > * > - * Three bits determine the dirty state of the inode, I_DIRTY_SYNC, > - * I_DIRTY_DATASYNC and I_DIRTY_PAGES. > + * Four bits determine the dirty state of the inode, I_DIRTY_SYNC, > + * I_DIRTY_DATASYNC, I_DIRTY_PAGES and I_DIRTY_RECLAIM. > * > * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, > * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at > @@ -1633,6 +1633,7 @@ struct super_operations { > #define I_DIRTY_SYNC 1 > #define I_DIRTY_DATASYNC 2 > #define I_DIRTY_PAGES 4 > +#define I_DIRTY_RECLAIM 256 > #define __I_NEW 3 > #define I_NEW (1 << __I_NEW) > #define I_WILL_FREE 16 > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index 494edd6..73a4df2 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -64,6 +64,7 @@ void writeback_inodes_wb(struct bdi_writeback *wb, > struct writeback_control *wbc); > long wb_do_writeback(struct bdi_writeback *wb, int force_wait); > void wakeup_flusher_threads(long nr_pages); > +void wakeup_flusher_threads_pages(long nr_pages, struct list_head *page_list); > > /* writeback.h requires fs.h; it, too, is not included from here. */ > static inline void wait_on_inode(struct inode *inode) > diff --git a/mm/vmscan.c b/mm/vmscan.c > index b66d1f5..bad1abf 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -901,7 +901,8 @@ keep: > * laptop mode avoiding disk spin-ups > */ > if (file && nr_dirty_seen && sc->may_writepage) > - wakeup_flusher_threads(nr_writeback_pages(nr_dirty)); > + wakeup_flusher_threads_pages(nr_writeback_pages(nr_dirty), > + page_list); > > *nr_still_dirty = nr_dirty; > count_vm_events(PGACTIVATE, pgactivate); > @@ -1368,7 +1369,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone, > list_add(&page->lru, &putback_list); > } > > - wakeup_flusher_threads(laptop_mode ? 0 : nr_dirty); > + wakeup_flusher_threads_pages(laptop_mode ? 0 : nr_dirty, > + &page_list); > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 87+ messages in thread
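Wu's do_writepages() fragment above, fleshed out into a self-contained sketch of what a bdi_start_inode_writeback()-style helper might look like. This is an assumption about the eventual interface rather than code from the series: inode pinning, locking against reclaim and livelock avoidance are ignored, range_cyclic is set so that mappings going through write_cache_pages() actually start near writeback_index, and the 1024-page batch simply mirrors the flusher's MAX_WRITEBACK_PAGES of the time.

/*
 * Sketch only: roughly what a bdi_start_inode_writeback(inode, offset)
 * style call could reduce to for a single mapping.  Error handling,
 * inode pinning and livelock avoidance are deliberately omitted.
 */
#include <linux/fs.h>
#include <linux/writeback.h>

#define RECLAIM_WRITEBACK_BATCH	1024	/* ~4MB with 4K pages (assumed) */

static int reclaim_start_mapping_writeback(struct address_space *mapping,
					   pgoff_t offset)
{
	struct writeback_control wbc = {
		.sync_mode	= WB_SYNC_NONE,
		.nr_to_write	= RECLAIM_WRITEBACK_BATCH,
		.range_cyclic	= 1,	/* start scanning at writeback_index */
	};

	/* Begin the cyclic scan near the dirty page reclaim encountered */
	mapping->writeback_index = offset;

	return do_writepages(mapping, &wbc);
}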
Thread overview: 87+ messages (newest: 2010-07-27 15:21 UTC)

2010-07-19 13:11 [PATCH 0/8] Reduce writeback from page reclaim context V4 Mel Gorman
2010-07-19 13:11 ` [PATCH 1/8] vmscan: tracing: Roll up of patches currently in mmotm Mel Gorman
2010-07-19 13:11 ` [PATCH 2/8] vmscan: tracing: Update trace event to track if page reclaim IO is for anon or file pages Mel Gorman
2010-07-19 13:24 ` Rik van Riel
2010-07-19 14:15 ` Christoph Hellwig
2010-07-19 14:24 ` Mel Gorman
2010-07-19 14:26 ` Christoph Hellwig
2010-07-19 13:11 ` [PATCH 3/8] vmscan: tracing: Update post-processing script to distinguish between anon and file IO from page reclaim Mel Gorman
2010-07-19 13:32 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 4/8] vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2010-07-19 14:19 ` Christoph Hellwig
2010-07-19 14:26 ` Mel Gorman
2010-07-19 18:25 ` Rik van Riel
2010-07-19 22:14 ` Johannes Weiner
2010-07-20 13:45 ` Mel Gorman
2010-07-20 22:02 ` Johannes Weiner
2010-07-21 11:36 ` Johannes Weiner
2010-07-21 11:52 ` Mel Gorman
2010-07-21 12:01 ` KAMEZAWA Hiroyuki
2010-07-21 14:27 ` Mel Gorman
2010-07-21 23:57 ` KAMEZAWA Hiroyuki
2010-07-22 9:19 ` Mel Gorman
2010-07-22 9:22 ` KAMEZAWA Hiroyuki
2010-07-21 13:04 ` Johannes Weiner
2010-07-21 13:38 ` Mel Gorman
2010-07-21 14:28 ` Johannes Weiner
2010-07-21 14:31 ` Mel Gorman
2010-07-21 14:39 ` Johannes Weiner
2010-07-21 15:06 ` Mel Gorman
2010-07-26 8:29 ` Wu Fengguang
2010-07-26 9:12 ` Mel Gorman
2010-07-26 11:19 ` Wu Fengguang
2010-07-26 12:53 ` Mel Gorman
2010-07-26 13:03 ` Wu Fengguang
2010-07-19 13:11 ` [PATCH 5/8] fs,btrfs: Allow kswapd to writeback pages Mel Gorman
2010-07-19 18:27 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 6/8] fs,xfs: " Mel Gorman
2010-07-19 14:20 ` Christoph Hellwig
2010-07-19 14:43 ` Mel Gorman
2010-07-19 13:11 ` [PATCH 7/8] writeback: sync old inodes first in background writeback Mel Gorman
2010-07-19 14:21 ` Christoph Hellwig
2010-07-19 14:40 ` Mel Gorman
2010-07-19 14:48 ` Christoph Hellwig
2010-07-22 8:52 ` Wu Fengguang
2010-07-22 9:02 ` Wu Fengguang
2010-07-22 9:21 ` Wu Fengguang
2010-07-22 10:48 ` Mel Gorman
2010-07-23 9:45 ` Wu Fengguang
2010-07-23 10:57 ` Mel Gorman
2010-07-23 11:49 ` Wu Fengguang
2010-07-23 12:20 ` Wu Fengguang
2010-07-25 10:43 ` KOSAKI Motohiro
2010-07-25 12:03 ` Minchan Kim
2010-07-26 3:27 ` Wu Fengguang
2010-07-26 4:11 ` Minchan Kim
2010-07-26 4:37 ` Wu Fengguang
2010-07-26 16:30 ` Minchan Kim
2010-07-26 22:48 ` Wu Fengguang
2010-07-26 3:08 ` Wu Fengguang
2010-07-26 3:11 ` Rik van Riel
2010-07-26 3:17 ` Wu Fengguang
2010-07-22 15:34 ` Minchan Kim
2010-07-23 11:59 ` Wu Fengguang
2010-07-22 9:42 ` Mel Gorman
2010-07-23 8:33 ` Wu Fengguang
2010-07-22 1:13 ` Wu Fengguang
2010-07-19 18:43 ` Rik van Riel
2010-07-19 13:11 ` [PATCH 8/8] vmscan: Kick flusher threads to clean pages when reclaim is encountering dirty pages Mel Gorman
2010-07-19 14:23 ` Christoph Hellwig
2010-07-19 14:37 ` Mel Gorman
2010-07-19 22:48 ` Johannes Weiner
2010-07-20 14:10 ` Mel Gorman
2010-07-20 22:05 ` Johannes Weiner
2010-07-19 18:59 ` Rik van Riel
2010-07-19 22:26 ` Johannes Weiner
2010-07-26 7:28 ` Wu Fengguang
2010-07-26 9:26 ` Mel Gorman
2010-07-26 11:27 ` Wu Fengguang
2010-07-26 12:57 ` Mel Gorman
2010-07-26 13:10 ` Wu Fengguang
2010-07-27 13:35 ` Mel Gorman
2010-07-27 14:24 ` Wu Fengguang
2010-07-27 14:34 ` Wu Fengguang
2010-07-27 14:40 ` Mel Gorman
2010-07-27 14:55 ` Wu Fengguang
2010-07-27 14:38 ` Mel Gorman
2010-07-27 15:21 ` Wu Fengguang