linux-mm.kvack.org archive mirror
* [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
@ 2011-07-21 16:28 Mel Gorman
  2011-07-21 16:28 ` [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
                   ` (10 more replies)
  0 siblings, 11 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Warning: Long post with lots of figures. If you normally drink coffee
and you don't have a cup, get one or you may end up with a case of
keyboard face.

Changelog since v1
  o Drop prio-inode patch. There is now a dependency that the flusher
    threads find these dirty pages quickly.
  o Drop nr_vmscan_throttled counter
  o SetPageReclaim instead of deactivate_page which was wrong
  o Add warning to main filesystems if called from direct reclaim context
  o Add patch to completely disable filesystem writeback from reclaim

Testing from the XFS folk revealed that there is still too much
I/O from the end of the LRU in kswapd. Previously it was considered
acceptable by VM people for a small number of pages to be written
back from reclaim with testing generally showing about 0.3% of pages
reclaimed were written back (higher if memory was low). Whether writing
back even a small number of pages is acceptable has been heavily disputed
for quite some time, and Dave Chinner explained the problem well;

	It doesn't have to be a very high number to be a problem. IO
	is orders of magnitude slower than the CPU time it takes to
	flush a page, so the cost of making a bad flush decision is
	very high. And single page writeback from the LRU is almost
	always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;

	xfs tries to write it back if the requester is kswapd
	ext4 ignores the request if it's a delayed allocation
	btrfs ignores the request

As a result, each filesystem has different performance characteristics
when under memory pressure while many pages are being dirtied. In
some cases, the request is ignored entirely so the VM cannot depend
on the IO being dispatched.
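
For reference, the context check filesystems use in ->writepage to tell
kswapd and direct reclaim apart looks roughly like the sketch below. It
is modelled on the xfs and btrfs hunks later in this series and is
illustrative only; the redirty handling shown is what btrfs does.

	/*
	 * Sketch of the reclaim-context check (based on the hunks later in
	 * this series). PF_MEMALLOC without PF_KSWAPD means the call came
	 * from direct reclaim (or memcg reclaim) rather than kswapd.
	 */
	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) {
		/* Refuse to write from direct reclaim: redirty and return */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}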

The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in
progress and throttle reclaim appropriately when dirty pages are
encountered. The assumption is that the flushers will always write
pages faster than if reclaim issues the IO. The new problem is that
reclaim has very little control over how long it takes before a page in
a particular zone or container is cleaned, which is discussed later. A
secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.

Patch 1 disables writeback of filesystem pages from direct reclaim
	entirely. Anonymous pages are still written.

Patches 2-4 add warnings to XFS, ext4 and btrfs if called from
	direct reclaim. With patch 1, this "never happens" and
	is intended to catch regressions in this logic in the
	future.

Patch 5 disables writeback of filesystem pages from kswapd unless
	the priority is raised to the point where kswapd is considered
	to be in trouble.

Patch 6 throttles reclaimers if too many dirty pages are being
	encountered and the zones or backing devices are congested.

Patch 7 invalidates dirty pages found at the end of the LRU so they
	are reclaimed quickly after being written back rather than
	waiting for a reclaimer to find them.

Patch 8 disables writeback of filesystem pages from kswapd and
	depends entirely on the flusher threads for cleaning pages.
	This is potentially a problem if the flusher threads take a
	long time to wake or are not discovering the pages we need
	cleaned. By placing the patch last, it's more likely that
	bisection can catch if this situation occurs and can be
	easily reverted.

I consider this series to be orthogonal to the writeback work but
it is worth noting that the writeback work affects the viability of
patch 8 in particular.

I tested this on ext4 and xfs using fs_mark and a micro benchmark
that does a streaming write to a large mapping (exercises use-once
LRU logic) followed by streaming writes to a mix of anonymous and
file-backed mappings. The command line for fs_mark when booted with
512M looked something like

./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760

The number of files was adjusted depending on the amount of available
memory so that the total size of the files created was about 3xRAM. For multiple threads,
the -d switch is specified multiple times.
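
As a worked example, and assuming the -n switch is files per loop per
thread, the command above writes 150 files of 10485760 bytes (10MB)
per iteration, roughly 1.5GB or about 3 times the 512M of RAM, so the
working set comfortably exceeds memory.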

3 kernels are tested.

vanilla	3.0-rc6
kswapdwb-v2r5		patches 1-7
nokswapdwb-v2r5		patches 1-8

The test machine is x86-64 with an older generation of AMD processor
with 4 cores. The underlying storage was 4 disks configured as RAID-0
as this was the best configuration of storage I had available. Swap
is on a separate disk. Dirty ratio was tuned to 40% instead of the
default of 20%.

Testing was run with and without monitors to both verify that the
patches were operating as expected and that any performance gain was
real and not due to interference from monitors.

I've posted the raw reports for each filesystem at

http://www.csn.ul.ie/~mel/postings/reclaim-20110721

Unfortunately, the volume of data is excessive but here is a partial
summary of what was interesting for XFS.

512M1P-xfs           Files/s  mean         32.99 ( 0.00%)       35.16 ( 6.18%)       35.08 ( 5.94%)
512M1P-xfs           Elapsed Time fsmark           122.54               115.54               115.21
512M1P-xfs           Elapsed Time mmap-strm        105.09               104.44               106.12
512M-xfs             Files/s  mean         30.50 ( 0.00%)       33.30 ( 8.40%)       34.68 (12.06%)
512M-xfs             Elapsed Time fsmark           136.14               124.26               120.33
512M-xfs             Elapsed Time mmap-strm        154.68               145.91               138.83
512M-2X-xfs          Files/s  mean         28.48 ( 0.00%)       32.90 (13.45%)       32.83 (13.26%)
512M-2X-xfs          Elapsed Time fsmark           145.64               128.67               128.67
512M-2X-xfs          Elapsed Time mmap-strm        145.92               136.65               137.67
512M-4X-xfs          Files/s  mean         29.06 ( 0.00%)       32.82 (11.46%)       33.32 (12.81%)
512M-4X-xfs          Elapsed Time fsmark           153.69               136.74               135.11
512M-4X-xfs          Elapsed Time mmap-strm        159.47               128.64               132.59
512M-16X-xfs         Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
512M-16X-xfs         Elapsed Time fsmark           161.48               144.61               141.19
512M-16X-xfs         Elapsed Time mmap-strm        167.04               150.62               147.83

The difference between kswapd writing and not writing for fsmark
in many cases is marginal simply because kswapd was not reaching a
high enough priority to enter writeback. Memory is mostly consumed
by filesystem-backed pages so limiting the number of dirty pages
(dirty_ratio == 40) means that kswapd always makes forward progress
and avoids the OOM killer.
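As a rough sanity check: with dirty_ratio at 40, no more than about
40% of reclaimable memory, somewhere around 200M on this 512M machine,
can be dirty at once, leaving the bulk of the file LRU clean and
reclaimable without IO.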

For the streaming-write benchmark, it does make a small difference as
kswapd is reaching the higher priorities there due to a large number
of anonymous pages added to the mix. The performance difference is
marginal though as the number of filesystem pages written is about
1/50th of the number of anonymous pages written so it is drowned out.

I was initially worried about 512M-16X-xfs but it's well within the noise
looking at the standard deviations from
http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-no-monitor/global-dhp-512M-16X__writeback-reclaimdirty-xfs/hydra/comparison.html

Files/s  min          25.00 ( 0.00%)       31.10 (19.61%)       32.00 (21.88%)
Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
Files/s  stddev       28.65 ( 0.00%)       11.32 (-153.19%)       32.79 (12.62%)
Files/s  max         133.20 ( 0.00%)       81.60 (-63.24%)      154.00 (13.51%)

64 threads writing on a machine with 4 CPUs and 512M RAM has variable
performance, which is hardly surprising.

The streaming-write benchmarks all completed faster.

The tests were also run with mem=1024M and mem=4608M with the relative
performance improvement reduced as memory increases reflecting that
with enough memory there are fewer writes from reclaim as the flusher
threads have time to clean the page before it reaches the end of
the LRU.

Here are the same tests using ext4

512M1P-ext4          Files/s  mean         37.36 ( 0.00%)       37.10 (-0.71%)       37.66 ( 0.78%)
512M1P-ext4          Elapsed Time fsmark           108.93               109.91               108.61
512M1P-ext4          Elapsed Time mmap-strm        112.15               108.93               109.10
512M-ext4            Files/s  mean         30.83 ( 0.00%)       39.80 (22.54%)       32.74 ( 5.83%)
512M-ext4            Elapsed Time fsmark           368.07               322.55               328.80
512M-ext4            Elapsed Time mmap-strm        131.98               117.01               118.94
512M-2X-ext4         Files/s  mean         20.27 ( 0.00%)       22.75 (10.88%)       20.80 ( 2.52%)
512M-2X-ext4         Elapsed Time fsmark           518.06               493.74               479.21
512M-2X-ext4         Elapsed Time mmap-strm        131.32               126.64               117.05
512M-4X-ext4         Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
512M-4X-ext4         Elapsed Time fsmark           633.41               660.70               572.74
512M-4X-ext4         Elapsed Time mmap-strm        137.85               127.63               124.07
512M-16X-ext4        Files/s  mean         55.86 ( 0.00%)       69.90 (20.09%)       42.66 (-30.94%)
512M-16X-ext4        Elapsed Time fsmark           543.21               544.43               586.16
512M-16X-ext4        Elapsed Time mmap-strm        141.84               146.12               144.01

At first glance, the benefit for ext4 is less clear cut but this
is due to the standard deviation being very high. Take 512M-4X-ext4,
which shows a 45.63% regression, as an example:

Files/s  min           5.40 ( 0.00%)        4.10 (-31.71%)        6.50 (16.92%)
Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
Files/s  stddev       14.34 ( 0.00%)        8.04 (-78.46%)       14.50 ( 1.04%)
Files/s  max          54.30 ( 0.00%)       37.70 (-44.03%)       77.20 (29.66%)

The standard deviation is *massive*, meaning that the performance
loss is well within the noise. The main positive out of this is that
the streaming write benchmarks are generally better.

Where it does benefit is stalls in direct reclaim. Unlike xfs, ext4
can stall direct reclaim writing back pages. When I look at a separate
run using ftrace to gather more information, I see;

512M-ext4            Time stalled direct reclaim fsmark            0.36       0.30       0.31 
512M-ext4            Time stalled direct reclaim mmap-strm        36.88       7.48      36.24 
512M-4X-ext4         Time stalled direct reclaim fsmark            1.06       0.40       0.43 
512M-4X-ext4         Time stalled direct reclaim mmap-strm       102.68      33.18      23.99 
512M-16X-ext4        Time stalled direct reclaim fsmark            0.17       0.27       0.30 
512M-16X-ext4        Time stalled direct reclaim mmap-strm         9.80       2.62       1.28 
512M-32X-ext4        Time stalled direct reclaim fsmark            0.00       0.00       0.00 
512M-32X-ext4        Time stalled direct reclaim mmap-strm         2.27       0.51       1.26 

Time spent in direct reclaim is reduced, implying that bug reports
complaining about the system becoming jittery when copying large
files may also be helped.

To show what effect the patches are having, this is a more detailed
look at one of the tests running with monitoring enabled. It's booted
with mem=512M and the number of threads running is equal to the number
of CPU cores. The backing filesystem is XFS.

FS-Mark
                  fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6
                   rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
Files/s  min          27.30 ( 0.00%)       31.80 (14.15%)       31.40 (13.06%)
Files/s  mean         30.32 ( 0.00%)       34.34 (11.73%)       34.52 (12.18%)
Files/s  stddev        1.39 ( 0.00%)        1.06 (-31.96%)        1.20 (-16.05%)
Files/s  max          33.60 ( 0.00%)       36.00 ( 6.67%)       36.30 ( 7.44%)
Overhead min     1393832.00 ( 0.00%)  1793141.00 (-22.27%)  1133240.00 (23.00%)
Overhead mean    2423808.52 ( 0.00%)  2513297.40 (-3.56%)  1823398.44 (32.93%)
Overhead stddev   445880.26 ( 0.00%)   392952.66 (13.47%)   420498.38 ( 6.04%)
Overhead max     3359477.00 ( 0.00%)  3184889.00 ( 5.48%)  3016170.00 (11.38%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         53.26     52.27     51.88
Total Elapsed Time (seconds)                137.65    121.95    121.11

Average files per second is increased by a nice percentage that is
outside the noise.  This is also true when I look at the results
without monitoring although the relative performance gain is less.

Time to completion is reduced, which is always good, and it implies
that IO throughput was consistently higher. This is clearly visible at

http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-hydra.png
http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-smooth-hydra.png

kswapd CPU usage is also interesting

http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/kswapdcpu-comparison-smooth-hydra.png

Note how preventing kswapd from writing back dirty pages pushes up its
CPU usage as it scans more pages, but it does not get excessive due to
the throttling.

MMTests Statistics: vmstat
Page Ins                                   1481672   1352900   1105364
Page Outs                                 38397462  38337199  38366073
Swap Ins                                    351918    320883    258868
Swap Outs                                   132060    117715    123564
Direct pages scanned                        886587    968087    784109
Kswapd pages scanned                      18931089  18275983  18324613
Kswapd pages reclaimed                     8878200   8768648   8885482
Direct pages reclaimed                      883407    960496    781632
Kswapd efficiency                              46%       47%       48%
Kswapd velocity                         137530.614 149864.559 151305.532
Direct efficiency                              99%       99%       99%
Direct velocity                           6440.879  7938.393  6474.354
Percentage direct scans                         4%        5%        4%
Page writes by reclaim                      170014    117717    123510
Page reclaim invalidate                          0   1221396   1212857
Page reclaim throttled                           0         0         0
Slabs scanned                                23424     23680     23552
Direct inode steals                              0         0         0
Kswapd inode steals                           5560      5500      5584
Kswapd skipped wait                             20         3         5
Compaction stalls                                0         0         0
Compaction success                               0         0         0
Compaction failures                              0         0         0
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

These stats are based on information from /proc/vmstat

"Kswapd efficiency" is the percentage of pages reclaimed to pages
scanned. The higher the percentage is the better because a low
percentage implies that kswapd is scanning uselessly. As the workload
dirties memory heavily and the machine is small, the efficiency is low at
46% and marginally improves due to a reduced number of pages scanned.
As memory increases, so does the efficiency as one might expect as
the flushers have a chance to clean the pages in time.
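As a worked example from the vanilla column: 8878200 pages reclaimed
out of 18931089 scanned is about 0.469, reported as 46%.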

"Kswapd velocity" is the average number of pages scanned per
second. The patches increase this as it's no longer getting blocked on
page writes so it's expected but in general a higher velocity means
that kswapd is doing more work and consuming more CPU. In this case,
it is offset by the fact that fewer pages overall are scanned and
the test completes faster but it explains why CPU usage is higher.
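Again from the vanilla column, 18931089 pages scanned over 137.65
elapsed seconds works out at roughly 137530 pages per second, which is
where the velocity figure comes from.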

Page writes by reclaim is what is motivating this series. It goes
from 170014 pages to 123510 which is a big improvement and we'll see
later that these writes are for anonymous pages.

"Page reclaim invalided" is very high and implies that a large number
of dirty pages are reaching the end of the list quickly. Unfortunately,
this is somewhat unavoidable. Kswapd is scanning pages at a rate
of roughly 125000 (or 488M) a second on a 512M machine. The best
possible writing rate of the underlying storage is about 300M/second.
With the rate of reclaim exceeding the best possible writing speed,
the system is going to get throttled.

FTrace Reclaim Statistics: vmscan
                              fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6
                               rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
Direct reclaims                              16173      17605      14313 
Direct reclaim pages scanned                886587     968087     784109 
Direct reclaim pages reclaimed              883407     960496     781632 
Direct reclaim write file async I/O              0          0          0 
Direct reclaim write anon async I/O              0          0          0 
Direct reclaim write file sync I/O               0          0          0 
Direct reclaim write anon sync I/O               0          0          0 
Wake kswapd requests                         20699      22048      22893 
Kswapd wakeups                                  24         20         25 
Kswapd pages scanned                      18931089   18275983   18324613 
Kswapd pages reclaimed                     8878200    8768648    8885482 
Kswapd reclaim write file async I/O          37966          0          0 
Kswapd reclaim write anon async I/O         132062     117717     123567 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)         0.08       0.09       0.08 
Time kswapd awake (seconds)                 132.11     117.78     115.82 

Total pages scanned                       19817676  19244070  19108722
Total pages reclaimed                      9761607   9729144   9667114
%age total pages scanned/reclaimed          49.26%    50.56%    50.59%
%age total pages scanned/written             0.86%     0.61%     0.65%
%age  file pages scanned/written             0.19%     0.00%     0.00%
Percentage Time Spent Direct Reclaim         0.15%     0.17%     0.15%
Percentage Time kswapd Awake                95.98%    96.58%    95.63%

Despite kswapd having higher CPU usage, it spent less time awake which
is probably a reflection of the test completing faster. File writes
from kswapd were 0 with the patches applied implying that kswapd was
not getting to a priority high enough to start writing. The remaining
writes correlate almost exactly to nr_vmscan_write implying that all
writes were for anonymous pages.
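For reference, the 0.19% file figure corresponds to the 37966 file
pages written by kswapd out of 19817676 pages scanned in total; with
the patches applied it drops to 0.00%.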

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 0          0          0 
Direct time   congest     waited               0ms        0ms        0ms 
Direct full   congest     waited                 0          0          0 
Direct number conditional waited                 2         17          6 
Direct time   conditional waited               0ms        0ms        0ms 
Direct full   conditional waited                 0          0          0 
KSwapd number congest     waited                 4          8         10 
KSwapd time   congest     waited               4ms       20ms        8ms 
KSwapd full   congest     waited                 0          0          0 
KSwapd number conditional waited                 0      26036      26283 
KSwapd time   conditional waited               0ms       16ms        4ms 
KSwapd full   conditional waited                 0          0          0 

This is based on some of the writeback tracepoints. It's interesting
to note that while kswapd got throttled about 26000 times with all
patches applied, it spent negligible time asleep so probably just
called cond_resched().  This implies that neither the zone nor the
backing device was truly congested and the throttling was needed
simply to allow time for the pages to be written.

MICRO
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         32.57     31.18     30.52
Total Elapsed Time (seconds)                166.29    141.94    148.23

This test is in two stages. The first writes only to a file. The second
writes to a mix of anonymous and file mappings.  Time to completion
is improved and this is still true with monitoring disabled.

MMTests Statistics: vmstat
Page Ins                                  11018260  10668536  10792204
Page Outs                                 16632838  16468468  16449897
Swap Ins                                    296167    245878    256038
Swap Outs                                   221626    177922    179409
Direct pages scanned                       4129424   5172015   3686598
Kswapd pages scanned                       9152837   9000480   7909180
Kswapd pages reclaimed                     3388122   3284663   3371737
Direct pages reclaimed                      735425    765263    708713
Kswapd efficiency                              37%       36%       42%
Kswapd velocity                          55041.416 63410.455 53357.485
Direct efficiency                              17%       14%       19%
Direct velocity                          24832.666 36438.037 24870.795
Percentage direct scans                        31%       36%       31%
Page writes by reclaim                      347283    180065    179425
Page writes skipped                              0         0         0
Page reclaim invalidate                          0    864018    554666
Write invalidated                                0         0         0
Page reclaim throttled                           0         0         0
Slabs scanned                                14464     13696     13952
Direct inode steals                            470       864       934
Kswapd inode steals                            426       411       317
Kswapd skipped wait                           3255      3381      1437
Compaction stalls                                0         0         2
Compaction success                               0         0         1
Compaction failures                              0         0         1
Compaction pages moved                           0         0         0
Compaction move failure                          0         0         0

Kswapd efficiency is improved slightly. kswapd is operating at roughly
the same velocity but the number of pages scanned is far lower due
to the test completing faster.

Direct reclaim efficiency is improved slightly and it scans fewer pages
(again due to the lower time to completion).

Fewer pages are being written from reclaim.

FTrace Reclaim Statistics: vmscan
                   micro-3.0.0         3.0.0-rc6         3.0.0-rc6
                   rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
Direct reclaims                              14060      15425      13726 
Direct reclaim pages scanned               3596218    4621037    3613503 
Direct reclaim pages reclaimed              735425     765263     708713 
Direct reclaim write file async I/O          87264          0          0 
Direct reclaim write anon async I/O          10030       9127      15028 
Direct reclaim write file sync I/O               0          0          0 
Direct reclaim write anon sync I/O               0          0          0 
Wake kswapd requests                         10424      10346      10786 
Kswapd wakeups                                  22         22         14 
Kswapd pages scanned                       9041353    8889081    7895846 
Kswapd pages reclaimed                     3388122    3284663    3371737 
Kswapd reclaim write file async I/O           7277       1710          0 
Kswapd reclaim write anon async I/O         184205     159178     162367 
Kswapd reclaim write file sync I/O               0          0          0 
Kswapd reclaim write anon sync I/O               0          0          0 
Time stalled direct reclaim (seconds)        54.29       5.67      14.29 
Time kswapd awake (seconds)                 151.62     129.83     135.98 

Total pages scanned                       12637571  13510118  11509349
Total pages reclaimed                      4123547   4049926   4080450
%age total pages scanned/reclaimed          32.63%    29.98%    35.45%
%age total pages scanned/written             2.29%     1.26%     1.54%
%age  file pages scanned/written             0.75%     0.01%     0.00%
Percentage Time Spent Direct Reclaim        62.50%    15.39%    31.89%
Percentage Time kswapd Awake                91.18%    91.47%    91.74%

Time spent in direct reclaim is massively reduced, which is surprising
as this is XFS so it should not have been stalling writing
files anyway.  It's possible that the anon writes are completing
faster so time spent swapping is reduced.

With patches 1-7, kswapd still writes some pages because it reaches
higher priorities under memory pressure, but the number of pages it
writes is significantly reduced and is small relative to the number
written to swap. Patch 8 eliminates it entirely but the benefit is
not seen in the completion times as the number of writes is so small.

FTrace Reclaim Statistics: congestion_wait
Direct number congest     waited                 0          0          0 
Direct time   congest     waited               0ms        0ms        0ms 
Direct full   congest     waited                 0          0          0 
Direct number conditional waited             12345      37713      34841 
Direct time   conditional waited           12396ms      132ms      168ms 
Direct full   conditional waited                53          0          0 
KSwapd number congest     waited              4248       2957       2293 
KSwapd time   congest     waited           15320ms    10312ms    13416ms 
KSwapd full   congest     waited                31          1         21 
KSwapd number conditional waited                 0      15989      10410 
KSwapd time   conditional waited               0ms        0ms        0ms 
KSwapd full   conditional waited                 0          0          0 

Congestion is way down as direct reclaim conditional wait time is
reduced by about 12 seconds.

Overall, this looks good. Avoiding writes from kswapd improves
overall performance as expected and eliminating them entirely seems
to behave well.

Next I tested on a NUMA configuration of sorts. I don't have a real
NUMA machine so I booted the same machine with mem=4096M numa=fake=8
so each node is 512M. Again, the volume of information is high but
here is a summary of sorts based on a test run with monitors enabled.

4096M8N-xfs     Files/s  mean                    27.29 ( 0.00%)      27.35 ( 0.20%)   27.91 ( 2.22%)
4096M8N-xfs     Elapsed Time fsmark                     1402.55             1400.77          1382.92
4096M8N-xfs     Elapsed Time mmap-strm                   660.90              596.91           630.05
4096M8N-xfs     Kswapd efficiency fsmark                    72%                 71%              13%
4096M8N-xfs     Kswapd efficiency mmap-strm                 39%                 40%              31%
4096M8N-xfs     stalled direct reclaim fsmark              0.00                0.00             0.00
4096M8N-xfs     stalled direct reclaim mmap-strm          36.37               13.06            56.88
4096M8N-4X-xfs  Files/s  mean                    26.80 ( 0.00%)      26.41 (-1.47%)   26.40 (-1.53%)
4096M8N-4X-xfs  Elapsed Time fsmark                     1453.95             1460.62          1470.98
4096M8N-4X-xfs  Elapsed Time mmap-strm                   683.34              663.46           690.01
4096M8N-4X-xfs  Kswapd efficiency fsmark                    68%                 67%               8%
4096M8N-4X-xfs  Kswapd efficiency mmap-strm                 35%                 34%               6%
4096M8N-4X-xfs  stalled direct reclaim fsmark              0.00                0.00             0.00
4096M8N-4X-xfs  stalled direct reclaim mmap-strm          26.45               87.57            46.87
4096M8N-2X-xfs  Files/s  mean                    26.22 ( 0.00%)      26.70 ( 1.77%)   27.21 ( 3.62%)
4096M8N-2X-xfs  Elapsed Time fsmark                     1469.28             1439.30          1424.45
4096M8N-2X-xfs  Elapsed Time mmap-strm                   676.77              656.28           655.03
4096M8N-2X-xfs  Kswapd efficiency fsmark                    69%                 69%               9%
4096M8N-2X-xfs  Kswapd efficiency mmap-strm                 33%                 33%               7%
4096M8N-2X-xfs  stalled direct reclaim fsmark              0.00                0.00             0.00
4096M8N-2X-xfs  stalled direct reclaim mmap-strm          52.74               57.96           102.49
4096M8N-16X-xfs Files/s  mean                    25.78 ( 0.00%)       27.81 ( 7.32%)  48.52 (46.87%)
4096M8N-16X-xfs Elapsed Time fsmark                     1555.95             1554.78          1542.53
4096M8N-16X-xfs Elapsed Time mmap-strm                   770.01              763.62           844.55
4096M8N-16X-xfs Kswapd efficiency fsmark                    62%                 62%               7%
4096M8N-16X-xfs Kswapd efficiency mmap-strm                 38%                 37%              10%
4096M8N-16X-xfs stalled direct reclaim fsmark              0.12                0.01             0.05
4096M8N-16X-xfs stalled direct reclaim mmap-strm           1.07                1.09            63.32

The performance differences for fsmark are marginal because the number
of pages written from reclaim is pretty low with this much memory even
with NUMA enabled. At no point did fsmark enter direct reclaim to
try and write a page so it's all kswapd. What is important to note is
the "Kswapd efficiency". Once kswapd cannot write pages at all, its
efficiency drops rapidly for fsmark as it scans about 5-8 times more
pages waiting on flusher threads to clean a page from the correct node.
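To put the efficiency drop in numbers, falling from 72% to 13% means
scanning roughly 1/0.13, or about 7.7, pages per page reclaimed instead
of about 1.4, i.e. around 5.5 times more scanning; the 68% to 8% case
works out at more than 8 times.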

Kswapd not writing pages impairs direct reclaim performance for the
streaming writer test. Note the times stalled in direct reclaim. In
all cases, the time stalled in direct reclaim goes way up as both
direct reclaimers and kswapd get stalled waiting on pages to get
cleaned from the right node.

Fortunately, kswapd CPU usage does not go to 100% because of the
throttling. From the 4096M8N test for example, I see

KSwapd full   congest     waited               834        739        989
KSwapd number conditional waited                 0      68552     372275
KSwapd time   conditional waited               0ms       16ms     1684ms
KSwapd full   conditional waited                 0          0          0

With kswapd avoiding writes, it gets throttled lightly but when it
writes no pages at all, it gets throttled very heavily and sleeps.

ext4 tells a slightly different story

4096M8N-ext4         Files/s  mean               28.63 ( 0.00%)       30.58 ( 6.37%)   31.04 ( 7.76%)
4096M8N-ext4         Elapsed Time fsmark                1578.51              1551.99          1532.65
4096M8N-ext4         Elapsed Time mmap-strm              703.66               655.25           654.86
4096M8N-ext4         Kswapd efficiency fsmark               62%                  69%              68%
4096M8N-ext4         Kswapd efficiency mmap-strm            35%                  35%              35%
4096M8N-ext4         stalled direct reclaim fsmark         0.00                 0.00             0.00 
4096M8N-ext4         stalled direct reclaim mmap-strm     32.64                95.72           152.62 
4096M8N-2X-ext4      Files/s  mean               30.74 ( 0.00%)       28.49 (-7.89%)   28.79 (-6.75%)
4096M8N-2X-ext4      Elapsed Time fsmark                1466.62              1583.12          1580.07
4096M8N-2X-ext4      Elapsed Time mmap-strm              705.17               705.64           693.01
4096M8N-2X-ext4      Kswapd efficiency fsmark               68%                  68%              67%
4096M8N-2X-ext4      Kswapd efficiency mmap-strm            34%                  30%              18%
4096M8N-2X-ext4      stalled direct reclaim fsmark         0.00                 0.00             0.00 
4096M8N-2X-ext4      stalled direct reclaim mmap-strm    106.82                24.88            27.88 
4096M8N-4X-ext4      Files/s  mean               24.15 ( 0.00%)       23.18 (-4.18%)   23.94 (-0.89%)
4096M8N-4X-ext4      Elapsed Time fsmark                1848.41              1971.48          1867.07
4096M8N-4X-ext4      Elapsed Time mmap-strm              664.87               673.66           674.46
4096M8N-4X-ext4      Kswapd efficiency fsmark               62%                  65%              65%
4096M8N-4X-ext4      Kswapd efficiency mmap-strm            33%                  37%              15%
4096M8N-4X-ext4      stalled direct reclaim fsmark         0.18                 0.03             0.26 
4096M8N-4X-ext4      stalled direct reclaim mmap-strm    115.71                23.05            61.12 
4096M8N-16X-ext4     Files/s  mean                5.42 ( 0.00%)        5.43 ( 0.15%)    3.83 (-41.44%)
4096M8N-16X-ext4     Elapsed Time fsmark                9572.85              9653.66         11245.41
4096M8N-16X-ext4     Elapsed Time mmap-strm              752.88               750.38           769.19
4096M8N-16X-ext4     Kswapd efficiency fsmark               59%                  59%              61%
4096M8N-16X-ext4     Kswapd efficiency mmap-strm            34%                  34%              21%
4096M8N-16X-ext4     stalled direct reclaim fsmark         0.26                 0.65             0.26 
4096M8N-16X-ext4     stalled direct reclaim mmap-strm    177.48               125.91           196.92 

4096M8N-16X-ext4 with kswapd writing no pages collapsed in terms of
performance. Looking at the fsmark logs, in a number of iterations,
it was barely able to write files at all.

The apparent slowdown for fsmark in 4096M8N-2X-ext4 is well within
the noise but the reduced time spent in direct reclaim is very welcome.

Unlike xfs, it's less clear cut if direct reclaim performance is
impaired but in a few tests, preventing kswapd writing pages did
increase the time stalled.

Last test is that I've been running this series on my laptop since
Monday without any problem but it's rarely under serious memory
pressure. I see nr_vmscan_write is 0 and the number of pages
invalidated from the end of the LRU is only 10844 after 3 days so
it's not much of a test.

Overall, having kswapd avoiding writes does improve performance
which is not a surprise. Dave asked "do we even need IO at all from
reclaim?". On NUMA machines, the answer is "yes" unless the VM can
wake the flusher thread to clean a specific node. When kswapd never
writes, processes can stall for significant periods of time waiting on
flushers to clean the correct pages. If all writing is to be deferred
to flushers, they must ensure that many writes on one node do not
starve requests for cleaning pages on another node.

I'm currently of the opinion that we should consider merging patches
1-7 and discuss what is required before merging patch 8. How the
flushers can prioritise writing of pages belonging to a particular
zone can be tackled later, before all writes from reclaim are
disabled. There is already some work in this general area with the
possibility that series such as "writeback: moving expire targets for
background/kupdate works" could be extended to allow patch 8 to be
merged later even if the series needs work.

 fs/btrfs/disk-io.c          |    2 ++
 fs/btrfs/inode.c            |    2 ++
 fs/ext4/inode.c             |    6 +++++-
 fs/xfs/linux-2.6/xfs_aops.c |    9 +++++----
 include/linux/mmzone.h      |    1 +
 mm/vmscan.c                 |   34 +++++++++++++++++++++++++++++++---
 mm/vmstat.c                 |    1 +
 7 files changed, 47 insertions(+), 8 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 43+ messages in thread

* [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-31 15:06   ` Minchan Kim
  2011-07-21 16:28 ` [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages Mel Gorman
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

From: Mel Gorman <mel@csn.ul.ie>

When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty
page is encountered during the scan, this page is written to backing
storage using mapping->writepage.

This causes two problems. First, it can result in very deep call
stacks, particularly if the target storage or filesystem are complex.
Some filesystems ignore write requests from direct reclaim as a result.
The second is that a single-page flush is inefficient in terms of IO.
While there is an expectation that the elevator will merge requests,
this does not always happen. Quoting Christoph Hellwig;

	The elevator has a relatively small window it can operate on,
	and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to
swap as there is not the equivalent of a flusher thread for anonymous
pages. If the dirty pages cannot be written back, they are placed
back on the LRU lists. There is now a direct dependency on dirty page
balancing to prevent too many pages in the system being dirtied which
would prevent reclaim making forward progress.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    1 +
 mm/vmscan.c            |    9 +++++++++
 mm/vmstat.c            |    1 +
 3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f7c3eb..b70a0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,6 +100,7 @@ enum zone_stat_item {
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
+	NR_VMSCAN_WRITE_SKIP,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ed24b9..ee00c94 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -825,6 +825,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageDirty(page)) {
 			nr_dirty++;
 
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd()) {
+				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
+				goto keep_locked;
+			}
+
 			if (references == PAGEREF_RECLAIM_CLEAN)
 				goto keep_locked;
 			if (!may_enter_fs)
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 20c18b7..fd109f3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,6 +702,7 @@ const char * const vmstat_text[] = {
 	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
+	"nr_vmscan_write_skip",
 	"nr_writeback_temp",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
  2011-07-21 16:28 ` [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-24 11:32   ` Christoph Hellwig
  2011-07-21 16:28 ` [PATCH 3/8] ext4: " Mel Gorman
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Direct reclaim should never writeback pages. For now, handle the
situation and warn about it. Ultimately, this will be a BUG_ON.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/xfs/linux-2.6/xfs_aops.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 79ce38b..c33a439 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -930,12 +930,13 @@ xfs_vm_writepage(
 	 * random callers for direct reclaim or memcg reclaim.  We explicitly
 	 * allow reclaim from kswapd as the stack usage there is relatively low.
 	 *
-	 * This should really be done by the core VM, but until that happens
-	 * filesystems like XFS, btrfs and ext4 have to take care of this
-	 * by themselves.
+	 * This should never happen except in the case of a VM regression so
+	 * warn about it.
 	 */
-	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
+	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) {
+		WARN_ON_ONCE(1);
 		goto redirty;
+	}
 
 	/*
 	 * We need a transaction if there are delalloc or unwritten buffers
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
  2011-07-21 16:28 ` [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
  2011-07-21 16:28 ` [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-08-03 10:58   ` Johannes Weiner
  2011-07-21 16:28 ` [PATCH 4/8] btrfs: " Mel Gorman
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Direct reclaim should never writeback pages. Warn if an attempt
is made.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/ext4/inode.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e3126c0..95bb179 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2663,8 +2663,12 @@ static int ext4_writepage(struct page *page,
 		 * We don't want to do block allocation, so redirty
 		 * the page and return.  We may reach here when we do
 		 * a journal commit via journal_submit_inode_data_buffers.
-		 * We can also reach here via shrink_page_list
+		 * We can also reach here via shrink_page_list but it
+		 * should never be for direct reclaim so warn if that
+		 * happens
 		 */
+		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+								PF_MEMALLOC);
 		goto redirty_page;
 	}
 	if (commit_write)
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 4/8] btrfs: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 3/8] ext4: " Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-08-03 11:10   ` Johannes Weiner
  2011-07-21 16:28 ` [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Direct reclaim should never writeback pages. Warn if an attempt is
made. By rights, btrfs should be allowing writepage from kswapd if
it is failing to reclaim pages by any other means but it's outside
the scope of this patch.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c |    2 ++
 fs/btrfs/inode.c   |    2 ++
 2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..cc9c9cf 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -829,6 +829,8 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
 
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
 	if (!(current->flags & PF_MEMALLOC)) {
+		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+								PF_MEMALLOC);
 		return extent_write_full_page(tree, page,
 					      btree_get_extent, wbc);
 	}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3601f0a..07d6c27 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6259,6 +6259,8 @@ static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
 
 
 	if (current->flags & PF_MEMALLOC) {
+		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+								PF_MEMALLOC);
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
 		return 0;
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (3 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 4/8] btrfs: " Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-31 15:11   ` Minchan Kim
  2011-07-21 16:28 ` [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

It is preferable that no dirty pages are dispatched for cleaning from
the page reclaim path. At normal priorities, this patch prevents kswapd
writing pages.

However, page reclaim does have a requirement that pages be freed
in a particular zone. If it is failing to make sufficient progress
(reclaiming < SWAP_CLUSTER_MAX at any priority), the priority
is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
considered to be the point where kswapd is getting into trouble
reclaiming pages. If this priority is reached, kswapd will dispatch
pages for writing.
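In the hunk below, the write is skipped while priority >= DEF_PRIORITY - 2,
so kswapd only starts writing pages once the scan priority value has
dropped to DEF_PRIORITY - 3 or lower.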

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee00c94..cf7b501 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -719,7 +719,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
  */
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
-				      struct scan_control *sc)
+				      struct scan_control *sc,
+				      int priority)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -827,9 +828,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 			/*
 			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow
+			 * avoid risk of stack overflow but do not writeback
+			 * unless under significant pressure.
 			 */
-			if (page_is_file_cache(page) && !current_is_kswapd()) {
+			if (page_is_file_cache(page) &&
+					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
 				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
 				goto keep_locked;
 			}
@@ -1465,12 +1468,12 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
 	}
 
 	local_irq_disable();
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (4 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-31 15:17   ` Minchan Kim
  2011-08-03 11:19   ` Johannes Weiner
  2011-07-21 16:28 ` [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Workloads that are allocating frequently and writing files place a
large number of dirty pages on the LRU. With use-once logic, it is
possible for them to reach the end of the LRU quickly requiring the
reclaimer to scan more to find clean pages. Ordinarily, processes that
are dirtying memory will get throttled by dirty balancing but this
is a global heuristic and does not take into account that LRUs are
maintained on a per-zone basis. This can lead to a situation whereby
reclaim is scanning heavily, skipping over a large number of pages
under writeback and recycling them around the LRU consuming CPU.

This patch checks how many of the pages isolated from the LRU were
dirty. If enough of them are dirty, the process will be throttled if
the backing device is congested or the zone being scanned is marked
congested. The percentage that must be dirty depends on the priority.
At default priority, all of them must be dirty. At DEF_PRIORITY-1,
50% of them must be dirty, at DEF_PRIORITY-2, 25% and so on, i.e. as
pressure increases, the more likely it is that the process will get
throttled to allow the flusher threads to make some progress.
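The check in the hunk below (nr_dirty >= nr_taken >> (DEF_PRIORITY - priority))
gives exactly this scaling: at DEF_PRIORITY the threshold is all of
nr_taken, at DEF_PRIORITY-1 half of it, at DEF_PRIORITY-2 a quarter,
and so on.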

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   21 ++++++++++++++++++---
 1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index cf7b501..b0060f8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,7 +720,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
-				      int priority)
+				      int priority,
+				      unsigned long *ret_nr_dirty)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -971,6 +972,7 @@ keep_lumpy:
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
+	*ret_nr_dirty += nr_dirty;
 	return nr_reclaimed;
 }
 
@@ -1420,6 +1422,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 	unsigned long nr_taken;
 	unsigned long nr_anon;
 	unsigned long nr_file;
+	unsigned long nr_dirty = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1468,12 +1471,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc, priority);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+							priority, &nr_dirty);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
-		nr_reclaimed += shrink_page_list(&page_list, zone, sc, priority);
+		nr_reclaimed += shrink_page_list(&page_list, zone, sc,
+							priority, &nr_dirty);
 	}
 
 	local_irq_disable();
@@ -1483,6 +1488,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
 
+	/*
+	 * If we have encountered a high number of dirty pages then they
+	 * are reaching the end of the LRU too quickly and global limits are
+	 * not enough to throttle processes due to the page distribution
+	 * throughout zones. Scale the number of dirty pages that must be
+	 * dirty before being throttled to priority.
+	 */
+	if (nr_dirty && nr_dirty >= (nr_taken >> (DEF_PRIORITY-priority)))
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (5 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-22 12:53   ` Peter Zijlstra
  2011-08-03 11:26   ` Johannes Weiner
  2011-07-21 16:28 ` [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd Mel Gorman
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

When direct reclaim encounters a dirty page, it gets recycled around
the LRU for another cycle. This patch marks the page PageReclaim
similar to deactivate_page() so that the page gets reclaimed almost
immediately after the page gets cleaned. This is to avoid reclaiming
clean pages that are younger than a dirty page encountered at the
end of the LRU that might have been something like a use-once page.
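
For context, a simplified sketch (not part of the diff, based on the
existing end_page_writeback() path in this era) of how the hint set by
this patch is consumed once the IO completes:

	void end_page_writeback(struct page *page)
	{
		/*
		 * The PageReclaim hint is consumed here: the page is
		 * rotated to the tail of the inactive LRU so a reclaimer
		 * finds it almost as soon as it is clean.
		 */
		if (TestClearPageReclaim(page))
			rotate_reclaimable_page(page);

		/* ... PG_writeback is then cleared and waiters are woken ... */
	}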

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 +-
 mm/vmscan.c            |   10 +++++++++-
 mm/vmstat.c            |    2 +-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b70a0c0..30d1dd1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -100,7 +100,7 @@ enum zone_stat_item {
 	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_BOUNCE,
 	NR_VMSCAN_WRITE,
-	NR_VMSCAN_WRITE_SKIP,
+	NR_VMSCAN_INVALIDATE,
 	NR_WRITEBACK_TEMP,	/* Writeback using temporary buffers */
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0060f8..c3d8341 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -834,7 +834,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 */
 			if (page_is_file_cache(page) &&
 					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
-				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
+				/*
+				 * Immediately reclaim when written back.
+				 * Similar in principal to deactivate_page()
+				 * except we already have the page isolated
+				 * and know it's dirty
+				 */
+				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
+				SetPageReclaim(page);
+
 				goto keep_locked;
 			}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fd109f3..5bd2043 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -702,7 +702,7 @@ const char * const vmstat_text[] = {
 	"nr_unstable",
 	"nr_bounce",
 	"nr_vmscan_write",
-	"nr_vmscan_write_skip",
+	"nr_vmscan_invalidate",
 	"nr_writeback_temp",
 	"nr_isolated_anon",
 	"nr_isolated_file",
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (6 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
@ 2011-07-21 16:28 ` Mel Gorman
  2011-07-22 12:57   ` Peter Zijlstra
  2011-08-03 11:37   ` Johannes Weiner
  2011-07-26 11:20 ` [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Dave Chinner
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-21 16:28 UTC (permalink / raw)
  To: Linux-MM
  Cc: LKML, XFS, Dave Chinner, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, Mel Gorman

Assuming that flusher threads will always write back dirty pages promptly
then it is always faster for reclaimers to wait for flushers. This patch
prevents kswapd writing back any filesystem pages.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   15 ++++-----------
 1 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c3d8341..6023494 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -720,7 +720,6 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
 static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
-				      int priority,
 				      unsigned long *ret_nr_dirty)
 {
 	LIST_HEAD(ret_pages);
@@ -827,13 +826,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (PageDirty(page)) {
 			nr_dirty++;
 
-			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
-			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			/* Flusher must clean dirty filesystem-backed pages */
+			if (page_is_file_cache(page)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -1479,14 +1473,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
 	spin_unlock_irq(&zone->lru_lock);
 
-	nr_reclaimed = shrink_page_list(&page_list, zone, sc,
-							priority, &nr_dirty);
+	nr_reclaimed = shrink_page_list(&page_list, zone, sc, &nr_dirty);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, zone, sc,
-							priority, &nr_dirty);
+							&nr_dirty);
 	}
 
 	local_irq_disable();
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-21 16:28 ` [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
@ 2011-07-22 12:53   ` Peter Zijlstra
  2011-07-22 13:23     ` Mel Gorman
  2011-08-03 11:26   ` Johannes Weiner
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-22 12:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> When direct reclaim encounters a dirty page, it gets recycled around
> the LRU for another cycle. This patch marks the page PageReclaim
> similar to deactivate_page() so that the page gets reclaimed almost
> immediately after the page gets cleaned. This is to avoid reclaiming
> clean pages that are younger than a dirty page encountered at the
> end of the LRU that might have been something like a use-once page.
> 

> @@ -834,7 +834,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			 */
>  			if (page_is_file_cache(page) &&
>  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> +				/*
> +				 * Immediately reclaim when written back.
> +				 * Similar in principal to deactivate_page()
> +				 * except we already have the page isolated
> +				 * and know it's dirty
> +				 */
> +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> +				SetPageReclaim(page);
> +

I find the invalidate name somewhat confusing. It makes me think we'll
drop the page without writeback, like invalidatepage().


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd
  2011-07-21 16:28 ` [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd Mel Gorman
@ 2011-07-22 12:57   ` Peter Zijlstra
  2011-07-22 13:31     ` Mel Gorman
  2011-08-03 11:37   ` Johannes Weiner
  1 sibling, 1 reply; 43+ messages in thread
From: Peter Zijlstra @ 2011-07-22 12:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> Assuming that flusher threads will always write back dirty pages promptly
> then it is always faster for reclaimers to wait for flushers. This patch
> prevents kswapd writing back any filesystem pages. 

That is a somewhat short changelog for such a big assumption ;-)

I think it could use a few extra words to explain the need to clean pages
from @zone versus writeback picking whatever fits best on disk, and how
that works out wrt the assumption.

What requirements does this place on writeback, and how does it meet
them?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-22 12:53   ` Peter Zijlstra
@ 2011-07-22 13:23     ` Mel Gorman
  2011-07-31 15:24       ` Minchan Kim
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-22 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Fri, Jul 22, 2011 at 02:53:48PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> > When direct reclaim encounters a dirty page, it gets recycled around
> > the LRU for another cycle. This patch marks the page PageReclaim
> > similar to deactivate_page() so that the page gets reclaimed almost
> > immediately after the page gets cleaned. This is to avoid reclaiming
> > clean pages that are younger than a dirty page encountered at the
> > end of the LRU that might have been something like a use-once page.
> > 
> 
> > @@ -834,7 +834,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> >  			 */
> >  			if (page_is_file_cache(page) &&
> >  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > +				/*
> > +				 * Immediately reclaim when written back.
> > +				 * Similar in principal to deactivate_page()
> > +				 * except we already have the page isolated
> > +				 * and know it's dirty
> > +				 */
> > +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> > +				SetPageReclaim(page);
> > +
> 
> I find the invalidate name somewhat confusing. It makes me think we'll
> drop the page without writeback, like invalidatepage().

I wasn't that happy with it either to be honest but didn't think of a
better one at the time. nr_reclaim_deferred?

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd
  2011-07-22 12:57   ` Peter Zijlstra
@ 2011-07-22 13:31     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-22 13:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel,
	Minchan Kim

On Fri, Jul 22, 2011 at 02:57:12PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> > Assuming that flusher threads will always write back dirty pages promptly
> > then it is always faster for reclaimers to wait for flushers. This patch
> > prevents kswapd writing back any filesystem pages. 
> 
> That is a somewhat short changelog for such a big assumption ;-)
> 

That is an understatement but the impact of the patch is discussed in
detail in the leader. On NUMA, this patch has a negative impact so
I put no effort into the changelog. The patch is part of the series
because it was specifically asked for.

> I think it could use a few extra words to explain the need to clean pages
> from @zone versus writeback picking whatever fits best on disk, and how
> that works out wrt the assumption.
> 

At the time of writing the changelog, I knew that flushers were
not finding pages from the correct zones quickly enough in the NUMA
usecase. The changelog documents the assumption; testing shows it to
be false.

> What requirements does this place on writeback, and how does it meet
> them?

It places a requirement on writeback to prioritise pages from zones
under memory pressure, and that requirement is not met. I mention in
the leader that I think patch 8 should be dropped, which is why the
changelog sucks.
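
To illustrate the gap (a sketch only, using the 3.0-era interface):
the only help reclaim can ask for today is global, with no way to say
which zone or node the dirty pages should come from. The variable
nr_pages_to_write below is just a placeholder.

	/*
	 * Sketch: reclaim can kick the flusher threads globally, roughly
	 * what do_try_to_free_pages() does once enough pages have been
	 * scanned, but it cannot ask for pages from a particular zone or
	 * node to be cleaned first.
	 */
	wakeup_flusher_threads(nr_pages_to_write);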

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 ` [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages Mel Gorman
@ 2011-07-24 11:32   ` Christoph Hellwig
  2011-07-25  8:19     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Christoph Hellwig @ 2011-07-24 11:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Rik van Riel, Jan Kara, LKML, XFS, Christoph Hellwig,
	Minchan Kim, Wu Fengguang, Johannes Weiner

On Thu, Jul 21, 2011 at 05:28:44PM +0100, Mel Gorman wrote:
> --- a/fs/xfs/linux-2.6/xfs_aops.c
> +++ b/fs/xfs/linux-2.6/xfs_aops.c
> @@ -930,12 +930,13 @@ xfs_vm_writepage(
>  	 * random callers for direct reclaim or memcg reclaim.  We explicitly
>  	 * allow reclaim from kswapd as the stack usage there is relatively low.
>  	 *
> -	 * This should really be done by the core VM, but until that happens
> -	 * filesystems like XFS, btrfs and ext4 have to take care of this
> -	 * by themselves.
> +	 * This should never happen except in the case of a VM regression so
> +	 * warn about it.
>  	 */
> -	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
> +	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) {
> +		WARN_ON_ONCE(1);
>  		goto redirty;

The nicer way to write this is

	if (WARN_ON((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC))
		goto redirty;


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages
  2011-07-24 11:32   ` Christoph Hellwig
@ 2011-07-25  8:19     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-25  8:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linux-MM, Rik van Riel, Jan Kara, LKML, XFS, Minchan Kim,
	Wu Fengguang, Johannes Weiner

On Sun, Jul 24, 2011 at 07:32:00AM -0400, Christoph Hellwig wrote:
> On Thu, Jul 21, 2011 at 05:28:44PM +0100, Mel Gorman wrote:
> > --- a/fs/xfs/linux-2.6/xfs_aops.c
> > +++ b/fs/xfs/linux-2.6/xfs_aops.c
> > @@ -930,12 +930,13 @@ xfs_vm_writepage(
> >  	 * random callers for direct reclaim or memcg reclaim.  We explicitly
> >  	 * allow reclaim from kswapd as the stack usage there is relatively low.
> >  	 *
> > -	 * This should really be done by the core VM, but until that happens
> > -	 * filesystems like XFS, btrfs and ext4 have to take care of this
> > -	 * by themselves.
> > +	 * This should never happen except in the case of a VM regression so
> > +	 * warn about it.
> >  	 */
> > -	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC)
> > +	if ((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC) {
> > +		WARN_ON_ONCE(1);
> >  		goto redirty;
> 
> The nicer way to write this is
> 
> 	if (WARN_ON((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC))
> 		goto redirty;
> 

I wanted to avoid side effects if WARN_ON was compiled out, similar to
the care that is normally taken with BUG_ON, but it's unnecessary and
your version is far tidier. Do you really want WARN_ON used instead
of WARN_ON_ONCE()?
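
For illustration, the WARN_ON_ONCE() form of the suggestion would be
something like this (a sketch only, same xfs_vm_writepage() context as
the hunk above):

	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) == PF_MEMALLOC))
		goto redirty;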

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (7 preceding siblings ...)
  2011-07-21 16:28 ` [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd Mel Gorman
@ 2011-07-26 11:20 ` Dave Chinner
  2011-07-27  4:32 ` Minchan Kim
  2011-07-27 16:18 ` Minchan Kim
  10 siblings, 0 replies; 43+ messages in thread
From: Dave Chinner @ 2011-07-26 11:20 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Christoph Hellwig, Johannes Weiner,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
> Warning: Long post with lots of figures. If you normally drink coffee
> and you don't have a cup, get one or you may end up with a case of
> keyboard face.

[snip]

> Overall, having kswapd avoiding writes does improve performance
> which is not a surprise. Dave asked "do we even need IO at all from
> reclaim?". On NUMA machines, the answer is "yes" unless the VM can
> wake the flusher thread to clean a specific node.

Great answer, Mel. ;)

> When kswapd never
> writes, processes can stall for significant periods of time waiting on
> flushers to clean the correct pages. If all writing is to be deferred
> to flushers, it must ensure that many writes on one node would not
> starve requests for cleaning pages on another node.

Ok, so that's a direction we need to work towards, then.

> I'm currently of the opinion that we should consider merging patches
> 1-7 and discuss what is required before merging. It can be tackled
> later how the flushers can prioritise writing of pages belonging to
> a particular zone before disabling all writes from reclaim.

Sounds reasonable to me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (8 preceding siblings ...)
  2011-07-26 11:20 ` [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Dave Chinner
@ 2011-07-27  4:32 ` Minchan Kim
  2011-07-27  7:37   ` Mel Gorman
  2011-07-27 16:18 ` Minchan Kim
  10 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-27  4:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

Hi Mel,

On Fri, Jul 22, 2011 at 1:28 AM, Mel Gorman <mgorman@suse.de> wrote:
> Warning: Long post with lots of figures. If you normally drink coffee
> and you don't have a cup, get one or you may end up with a case of
> keyboard face.
>
> Changelog since v1
>  o Drop prio-inode patch. There is now a dependency that the flusher
>    threads find these dirty pages quickly.
>  o Drop nr_vmscan_throttled counter
>  o SetPageReclaim instead of deactivate_page which was wrong
>  o Add warning to main filesystems if called from direct reclaim context
>  o Add patch to completely disable filesystem writeback from reclaim
>
> Testing from the XFS folk revealed that there is still too much
> I/O from the end of the LRU in kswapd. Previously it was considered
> acceptable by VM people for a small number of pages to be written
> back from reclaim with testing generally showing about 0.3% of pages
> reclaimed were written back (higher if memory was low). That writing
> back a small number of pages is ok has been heavily disputed for
> quite some time and Dave Chinner explained it well;
>
>        It doesn't have to be a very high number to be a problem. IO
>        is orders of magnitude slower than the CPU time it takes to
>        flush a page, so the cost of making a bad flush decision is
>        very high. And single page writeback from the LRU is almost
>        always a bad flush decision.
>
> To complicate matters, filesystems respond very differently to requests
> from reclaim according to Christoph Hellwig;
>
>        xfs tries to write it back if the requester is kswapd
>        ext4 ignores the request if it's a delayed allocation
>        btrfs ignores the request
>
> As a result, each filesystem has different performance characteristics
> when under memory pressure and there are many pages being dirties. In
> some cases, the request is ignored entirely so the VM cannot depend
> on the IO being dispatched.
>
> The objective of this series to to reduce writing of filesystem-backed
> pages from reclaim, play nicely with writeback that is already in
> progress and throttle reclaim appropriately when dirty pages are
> encountered. The assumption is that the flushers will always write
> pages faster than if reclaim issues the IO. The new problem is that
> reclaim has very little control over how long before a page in a
> particular zone or container is cleaned which is discussed later. A
> secondary goal is to avoid the problem whereby direct reclaim splices
> two potentially deep call stacks together.
>
> Patch 1 disables writeback of filesystem pages from direct reclaim
>        entirely. Anonymous pages are still written.
>
> Patches 2-4 add warnings to XFS, ext4 and btrfs if called from
>        direct reclaim. With patch 1, this "never happens" and
>        is intended to catch regressions in this logic in the
>        future.
>
> Patch 5 disables writeback of filesystem pages from kswapd unless
>        the priority is raised to the point where kswapd is considered
>        to be in trouble.
>
> Patch 6 throttles reclaimers if too many dirty pages are being
>        encountered and the zones or backing devices are congested.
>
> Patch 7 invalidates dirty pages found at the end of the LRU so they
>        are reclaimed quickly after being written back rather than
>        waiting for a reclaimer to find them
>
> Patch 8 disables writeback of filesystem pages from kswapd and
>        depends entirely on the flusher threads for cleaning pages.
>        This is potentially a problem if the flusher threads take a
>        long time to wake or are not discovering the pages we need
>        cleaned. By placing the patch last, it's more likely that
>        bisection can catch if this situation occurs and can be
>        easily reverted.
>
> I consider this series to be orthogonal to the writeback work but
> it is worth noting that the writeback work affects the viability of
> patch 8 in particular.
>
> I tested this on ext4 and xfs using fs_mark and a micro benchmark
> that does a streaming write to a large mapping (exercises use-once
> LRU logic) followed by streaming writes to a mix of anonymous and
> file-backed mappings. The command line for fs_mark when booted with
> 512M looked something like
>
> ./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
>
> The number of files was adjusted depending on the amount of available
> memory so that the files created was about 3xRAM. For multiple threads,
> the -d switch is specified multiple times.
>
> 3 kernels are tested.
>
> vanilla 3.0-rc6
> kswapdwb-v2r5           patches 1-7
> nokswapdwb-v2r5         patches 1-8
>
> The test machine is x86-64 with an older generation of AMD processor
> with 4 cores. The underlying storage was 4 disks configured as RAID-0
> as this was the best configuration of storage I had available. Swap
> is on a separate disk. Dirty ratio was tuned to 40% instead of the
> default of 20%.
>
> Testing was run with and without monitors to both verify that the
> patches were operating as expected and that any performance gain was
> real and not due to interference from monitors.
>
> I've posted the raw reports for each filesystem at
>
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721
>
> Unfortunately, the volume of data is excessive but here is a partial
> summary of what was interesting for XFS.

Could you clarify the notation?
1P :  1 Processor?
512M: system memory size?
2X , 4X, 16X: the size of files created during test

>
> 512M1P-xfs           Files/s  mean         32.99 ( 0.00%)       35.16 ( 6.18%)       35.08 ( 5.94%)
> 512M1P-xfs           Elapsed Time fsmark           122.54               115.54               115.21
> 512M1P-xfs           Elapsed Time mmap-strm        105.09               104.44               106.12
> 512M-xfs             Files/s  mean         30.50 ( 0.00%)       33.30 ( 8.40%)       34.68 (12.06%)
> 512M-xfs             Elapsed Time fsmark           136.14               124.26               120.33
> 512M-xfs             Elapsed Time mmap-strm        154.68               145.91               138.83
> 512M-2X-xfs          Files/s  mean         28.48 ( 0.00%)       32.90 (13.45%)       32.83 (13.26%)
> 512M-2X-xfs          Elapsed Time fsmark           145.64               128.67               128.67
> 512M-2X-xfs          Elapsed Time mmap-strm        145.92               136.65               137.67
> 512M-4X-xfs          Files/s  mean         29.06 ( 0.00%)       32.82 (11.46%)       33.32 (12.81%)
> 512M-4X-xfs          Elapsed Time fsmark           153.69               136.74               135.11
> 512M-4X-xfs          Elapsed Time mmap-strm        159.47               128.64               132.59
> 512M-16X-xfs         Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
> 512M-16X-xfs         Elapsed Time fsmark           161.48               144.61               141.19
> 512M-16X-xfs         Elapsed Time mmap-strm        167.04               150.62               147.83
>



-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-27  4:32 ` Minchan Kim
@ 2011-07-27  7:37   ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-07-27  7:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Wed, Jul 27, 2011 at 01:32:17PM +0900, Minchan Kim wrote:
> >
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110721
> >
> > Unfortunately, the volume of data is excessive but here is a partial
> > summary of what was interesting for XFS.
> 
> Could you clarify the notation?
> 1P :  1 Processor?
> 512M: system memory size?
> 2X , 4X, 16X: the size of files created during test
> 

1P   == 1 Processor
512M == 512M RAM (mem=512M)
2X   == 2 x NUM_CPU fsmark threads

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
                   ` (9 preceding siblings ...)
  2011-07-27  4:32 ` Minchan Kim
@ 2011-07-27 16:18 ` Minchan Kim
  2011-07-28 11:38   ` Mel Gorman
  10 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-27 16:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
> Warning: Long post with lots of figures. If you normally drink coffee
> and you don't have a cup, get one or you may end up with a case of
> keyboard face.

At last, I get a coffee.

> 
> Changelog since v1
>   o Drop prio-inode patch. There is now a dependency that the flusher
>     threads find these dirty pages quickly.
>   o Drop nr_vmscan_throttled counter
>   o SetPageReclaim instead of deactivate_page which was wrong
>   o Add warning to main filesystems if called from direct reclaim context
>   o Add patch to completely disable filesystem writeback from reclaim

It seems to be going in a very desirable direction.

> 
> Testing from the XFS folk revealed that there is still too much
> I/O from the end of the LRU in kswapd. Previously it was considered
> acceptable by VM people for a small number of pages to be written
> back from reclaim with testing generally showing about 0.3% of pages
> reclaimed were written back (higher if memory was low). That writing
> back a small number of pages is ok has been heavily disputed for
> quite some time and Dave Chinner explained it well;
> 
> 	It doesn't have to be a very high number to be a problem. IO
> 	is orders of magnitude slower than the CPU time it takes to
> 	flush a page, so the cost of making a bad flush decision is
> 	very high. And single page writeback from the LRU is almost
> 	always a bad flush decision.
> 
> To complicate matters, filesystems respond very differently to requests
> from reclaim according to Christoph Hellwig;
> 
> 	xfs tries to write it back if the requester is kswapd
> 	ext4 ignores the request if it's a delayed allocation
> 	btrfs ignores the request
> 
> As a result, each filesystem has different performance characteristics
> when under memory pressure and there are many pages being dirties. In
> some cases, the request is ignored entirely so the VM cannot depend
> on the IO being dispatched.
> 
> The objective of this series to to reduce writing of filesystem-backed
> pages from reclaim, play nicely with writeback that is already in
> progress and throttle reclaim appropriately when dirty pages are
> encountered. The assumption is that the flushers will always write
> pages faster than if reclaim issues the IO. The new problem is that
> reclaim has very little control over how long before a page in a
> particular zone or container is cleaned which is discussed later. A
> secondary goal is to avoid the problem whereby direct reclaim splices
> two potentially deep call stacks together.
> 
> Patch 1 disables writeback of filesystem pages from direct reclaim
> 	entirely. Anonymous pages are still written.
> 
> Patches 2-4 add warnings to XFS, ext4 and btrfs if called from
> 	direct reclaim. With patch 1, this "never happens" and
> 	is intended to catch regressions in this logic in the
> 	future.
> 
> Patch 5 disables writeback of filesystem pages from kswapd unless
> 	the priority is raised to the point where kswapd is considered
> 	to be in trouble.
> 
> Patch 6 throttles reclaimers if too many dirty pages are being
> 	encountered and the zones or backing devices are congested.
> 
> Patch 7 invalidates dirty pages found at the end of the LRU so they
> 	are reclaimed quickly after being written back rather than
> 	waiting for a reclaimer to find them
> 
> Patch 8 disables writeback of filesystem pages from kswapd and
> 	depends entirely on the flusher threads for cleaning pages.
> 	This is potentially a problem if the flusher threads take a
> 	long time to wake or are not discovering the pages we need
> 	cleaned. By placing the patch last, it's more likely that
> 	bisection can catch if this situation occurs and can be
> 	easily reverted.

Patch ordering is good, too.

> 
> I consider this series to be orthogonal to the writeback work but
> it is worth noting that the writeback work affects the viability of
> patch 8 in particular.
> 
> I tested this on ext4 and xfs using fs_mark and a micro benchmark
> that does a streaming write to a large mapping (exercises use-once
> LRU logic) followed by streaming writes to a mix of anonymous and
> file-backed mappings. The command line for fs_mark when booted with
> 512M looked something like
> 
> ./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
> 
> The number of files was adjusted depending on the amount of available
> memory so that the files created was about 3xRAM. For multiple threads,
> the -d switch is specified multiple times.
> 
> 3 kernels are tested.
> 
> vanilla	3.0-rc6
> kswapdwb-v2r5		patches 1-7
> nokswapdwb-v2r5		patches 1-8
> 
> The test machine is x86-64 with an older generation of AMD processor
> with 4 cores. The underlying storage was 4 disks configured as RAID-0
> as this was the best configuration of storage I had available. Swap
> is on a separate disk. Dirty ratio was tuned to 40% instead of the
> default of 20%.
> 
> Testing was run with and without monitors to both verify that the
> patches were operating as expected and that any performance gain was
> real and not due to interference from monitors.

Wow, it seems these experiments took you a long time to finish.
Thanks for sharing the data.

> 
> I've posted the raw reports for each filesystem at
> 
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721
> 
> Unfortunately, the volume of data is excessive but here is a partial
> summary of what was interesting for XFS.
> 
> 512M1P-xfs           Files/s  mean         32.99 ( 0.00%)       35.16 ( 6.18%)       35.08 ( 5.94%)
> 512M1P-xfs           Elapsed Time fsmark           122.54               115.54               115.21
> 512M1P-xfs           Elapsed Time mmap-strm        105.09               104.44               106.12
> 512M-xfs             Files/s  mean         30.50 ( 0.00%)       33.30 ( 8.40%)       34.68 (12.06%)
> 512M-xfs             Elapsed Time fsmark           136.14               124.26               120.33
> 512M-xfs             Elapsed Time mmap-strm        154.68               145.91               138.83
> 512M-2X-xfs          Files/s  mean         28.48 ( 0.00%)       32.90 (13.45%)       32.83 (13.26%)
> 512M-2X-xfs          Elapsed Time fsmark           145.64               128.67               128.67
> 512M-2X-xfs          Elapsed Time mmap-strm        145.92               136.65               137.67
> 512M-4X-xfs          Files/s  mean         29.06 ( 0.00%)       32.82 (11.46%)       33.32 (12.81%)
> 512M-4X-xfs          Elapsed Time fsmark           153.69               136.74               135.11
> 512M-4X-xfs          Elapsed Time mmap-strm        159.47               128.64               132.59
> 512M-16X-xfs         Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
> 512M-16X-xfs         Elapsed Time fsmark           161.48               144.61               141.19
> 512M-16X-xfs         Elapsed Time mmap-strm        167.04               150.62               147.83
> 
> The difference between kswapd writing and not writing for fsmark
> in many cases is marginal simply because kswapd was not reaching a
> high enough priority to enter writeback. Memory is mostly consumed
> by filesystem-backed pages so limiting the number of dirty pages
> (dirty_ratio == 40) means that kswapd always makes forward progress
> and avoids the OOM killer.

Looks promising as most of the elapsed times are lower than vanilla.

> 
> For the streaming-write benchmark, it does make a small difference as
> kswapd is reaching the higher priorities there due to a large number
> of anonymous pages added to the mix. The performance difference is
> marginal though as the number of filesystem pages written is about
> 1/50th of the number of anonymous pages written so it is drowned out.

It does make sense.

> 
> I was initially worried about 512M-16X-xfs but it's well within the noise
> looking at the standard deviations from
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-no-monitor/global-dhp-512M-16X__writeback-reclaimdirty-xfs/hydra/comparison.html
> 
> Files/s  min          25.00 ( 0.00%)       31.10 (19.61%)       32.00 (21.88%)
> Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
> Files/s  stddev       28.65 ( 0.00%)       11.32 (-153.19%)       32.79 (12.62%)
> Files/s  max         133.20 ( 0.00%)       81.60 (-63.24%)      154.00 (13.51%)

Yes. It's within the noise so let's not worry about that.

> 
> 64 threads writing on a machine with 4 CPUs with 512M RAM has variable
> performance which is hardly surprising.

Fair enough.

> 
> The streaming-write benchmarks all completed faster.
> 
> The tests were also run with mem=1024M and mem=4608M with the relative
> performance improvement reduced as memory increases reflecting that
> with enough memory there are fewer writes from reclaim as the flusher
> threads have time to clean the page before it reaches the end of
> the LRU.
> 
> Here is the same tests except when using ext4
> 
> 512M1P-ext4          Files/s  mean         37.36 ( 0.00%)       37.10 (-0.71%)       37.66 ( 0.78%)
> 512M1P-ext4          Elapsed Time fsmark           108.93               109.91               108.61
> 512M1P-ext4          Elapsed Time mmap-strm        112.15               108.93               109.10
> 512M-ext4            Files/s  mean         30.83 ( 0.00%)       39.80 (22.54%)       32.74 ( 5.83%)
> 512M-ext4            Elapsed Time fsmark           368.07               322.55               328.80
> 512M-ext4            Elapsed Time mmap-strm        131.98               117.01               118.94
> 512M-2X-ext4         Files/s  mean         20.27 ( 0.00%)       22.75 (10.88%)       20.80 ( 2.52%)
> 512M-2X-ext4         Elapsed Time fsmark           518.06               493.74               479.21
> 512M-2X-ext4         Elapsed Time mmap-strm        131.32               126.64               117.05
> 512M-4X-ext4         Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
> 512M-4X-ext4         Elapsed Time fsmark           633.41               660.70               572.74
> 512M-4X-ext4         Elapsed Time mmap-strm        137.85               127.63               124.07
> 512M-16X-ext4        Files/s  mean         55.86 ( 0.00%)       69.90 (20.09%)       42.66 (-30.94%)
> 512M-16X-ext4        Elapsed Time fsmark           543.21               544.43               586.16
> 512M-16X-ext4        Elapsed Time mmap-strm        141.84               146.12               144.01
> 
> At first glance, the benefit for ext4 is less clear cut but this
> is due to the standard deviation being very high. Take 512M-4X-ext4
> showing a 45.63% regression for example and we see.
> 
> Files/s  min           5.40 ( 0.00%)        4.10 (-31.71%)        6.50 (16.92%)
> Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
> Files/s  stddev       14.34 ( 0.00%)        8.04 (-78.46%)       14.50 ( 1.04%)
> Files/s  max          54.30 ( 0.00%)       37.70 (-44.03%)       77.20 (29.66%)
> 
> The standard deviation is *massive* meaning that the performance
> loss is well within the noise. The main positive out of this is the

Yes.
ext4 seems to be very sensitive to the situation.

> streaming write benchmarks are generally better.
> 
> Where it does benefit is stalls in direct reclaim. Unlike xfs, ext4
> can stall direct reclaim writing back pages. When I look at a separate
> run using ftrace to gather more information, I see;
> 
> 512M-ext4            Time stalled direct reclaim fsmark            0.36       0.30       0.31 
> 512M-ext4            Time stalled direct reclaim mmap-strm        36.88       7.48      36.24 

This data is odd.
The elapsed times of experiments [2] and [3] are almost the same (117.01, 118.94) but the
direct reclaim stall time of [2] is much lower. Hmm??
Anyway, if we don't write out from kswapd, it seems we can enter the direct reclaim path many more times.

> 512M-4X-ext4         Time stalled direct reclaim fsmark            1.06       0.40       0.43 
> 512M-4X-ext4         Time stalled direct reclaim mmap-strm       102.68      33.18      23.99 
> 512M-16X-ext4        Time stalled direct reclaim fsmark            0.17       0.27       0.30 
> 512M-16X-ext4        Time stalled direct reclaim mmap-strm         9.80       2.62       1.28 
> 512M-32X-ext4        Time stalled direct reclaim fsmark            0.00       0.00       0.00 
> 512M-32X-ext4        Time stalled direct reclaim mmap-strm         2.27       0.51       1.26 
> 
> Time spent in direct reclaim is reduced implying that bug reports
> complaining about the system becoming jittery when copying large
> files may also be helped.

It would be a very good thing.

> 
> To show what effect the patches are having, this is a more detailed
> look at one of the tests running with monitoring enabled. It's booted
> with mem=512M and the number of threads running is equal to the number
> of CPU cores. The backing filesystem is XFS.
> 
> FS-Mark
>                   fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6
>                    rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
> Files/s  min          27.30 ( 0.00%)       31.80 (14.15%)       31.40 (13.06%)
> Files/s  mean         30.32 ( 0.00%)       34.34 (11.73%)       34.52 (12.18%)
> Files/s  stddev        1.39 ( 0.00%)        1.06 (-31.96%)        1.20 (-16.05%)
> Files/s  max          33.60 ( 0.00%)       36.00 ( 6.67%)       36.30 ( 7.44%)
> Overhead min     1393832.00 ( 0.00%)  1793141.00 (-22.27%)  1133240.00 (23.00%)
> Overhead mean    2423808.52 ( 0.00%)  2513297.40 (-3.56%)  1823398.44 (32.93%)
> Overhead stddev   445880.26 ( 0.00%)   392952.66 (13.47%)   420498.38 ( 6.04%)
> Overhead max     3359477.00 ( 0.00%)  3184889.00 ( 5.48%)  3016170.00 (11.38%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         53.26     52.27     51.88

What is User/Sys?

> Total Elapsed Time (seconds)                137.65    121.95    121.11
> 
> Average files per second is increased by a nice percentage that is
> outside the noise.  This is also true when I look at the results

Sure.

> without monitoring although the relative performance gain is less.
> 
> Time to completion is reduced which is always good and implies
> that IO was consistently higher. This is clearly visible at
> 
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-hydra.png
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-smooth-hydra.png
> 
> kswapd CPU usage is also interesting
> 
> http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/kswapdcpu-comparison-smooth-hydra.png
> 
> Note how preventing kswapd reclaiming dirty pages pushes up its CPU
> usage as it scans more pages but it does not get excessive due to
> the throttling.

Good to hear.
The concern with this patchset was early OOM kills due to too much scanning.
I can put that concern aside from now on.

> 
> MMTests Statistics: vmstat
> Page Ins                                   1481672   1352900   1105364
> Page Outs                                 38397462  38337199  38366073
> Swap Ins                                    351918    320883    258868
> Swap Outs                                   132060    117715    123564
> Direct pages scanned                        886587    968087    784109
> Kswapd pages scanned                      18931089  18275983  18324613
> Kswapd pages reclaimed                     8878200   8768648   8885482
> Direct pages reclaimed                      883407    960496    781632
> Kswapd efficiency                              46%       47%       48%
> Kswapd velocity                         137530.614 149864.559 151305.532
> Direct efficiency                              99%       99%       99%
> Direct velocity                           6440.879  7938.393  6474.354
> Percentage direct scans                         4%        5%        4%
> Page writes by reclaim                      170014    117717    123510
> Page reclaim invalidate                          0   1221396   1212857
> Page reclaim throttled                           0         0         0
> Slabs scanned                                23424     23680     23552
> Direct inode steals                              0         0         0
> Kswapd inode steals                           5560      5500      5584
> Kswapd skipped wait                             20         3         5
> Compaction stalls                                0         0         0
> Compaction success                               0         0         0
> Compaction failures                              0         0         0
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> These stats are based on information from /proc/vmstat
> 
> "Kswapd efficiency" is the percentage of pages reclaimed to pages
> scanned. The higher the percentage is the better because a low
> percentage implies that kswapd is scanning uselessly. As the workload
> dirties memory heavily and the machine is small, the efficiency is low at
> 46% and marginally improves due to a reduced number of pages scanned.
> As memory increases, so does the efficiency as one might expect as
> the flushers have a chance to clean the pages in time.
> 
> "Kswapd velocity" is the average number of pages scanned per
> second. The patches increase this as it's no longer getting blocked on
> page writes so it's expected but in general a higher velocity means
> that kswapd is doing more work and consuming more CPU. In this case,
> it is offset by the fact that fewer pages overall are scanned and
> the test completes faster but it explains why CPU usage is higher.

Fair enough.

> 
> Page writes by reclaim is what is motivating this series. It goes
> from 170014 pages to 123510 which is a big improvement and we'll see
> later that these writes are for anonymous pages.
> 
> "Page reclaim invalided" is very high and implies that a large number
> of dirty pages are reaching the end of the list quickly. Unfortunately,
> this is somewhat unavoidable. Kswapd is scanning pages at a rate
> of roughly 125000 (or 488M) a second on a 512M machine. The best
> possible writing rate of the underlying storage is about 300M/second.
> With the rate of reclaim exceeding the best possible writing speed,
> the system is going to get throttled.

Just out of curiosity.
What is 'Page reclaim throttled'?

> 
> FTrace Reclaim Statistics: vmscan
>                               fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6
>                                rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
> Direct reclaims                              16173      17605      14313 
> Direct reclaim pages scanned                886587     968087     784109 
> Direct reclaim pages reclaimed              883407     960496     781632 
> Direct reclaim write file async I/O              0          0          0 
> Direct reclaim write anon async I/O              0          0          0 
> Direct reclaim write file sync I/O               0          0          0 
> Direct reclaim write anon sync I/O               0          0          0 
> Wake kswapd requests                         20699      22048      22893 
> Kswapd wakeups                                  24         20         25 
> Kswapd pages scanned                      18931089   18275983   18324613 
> Kswapd pages reclaimed                     8878200    8768648    8885482 
> Kswapd reclaim write file async I/O          37966          0          0 
> Kswapd reclaim write anon async I/O         132062     117717     123567 
> Kswapd reclaim write file sync I/O               0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0 
> Time stalled direct reclaim (seconds)         0.08       0.09       0.08 
> Time kswapd awake (seconds)                 132.11     117.78     115.82 
> 
> Total pages scanned                       19817676  19244070  19108722
> Total pages reclaimed                      9761607   9729144   9667114
> %age total pages scanned/reclaimed          49.26%    50.56%    50.59%
> %age total pages scanned/written             0.86%     0.61%     0.65%
> %age  file pages scanned/written             0.19%     0.00%     0.00%
> Percentage Time Spent Direct Reclaim         0.15%     0.17%     0.15%
> Percentage Time kswapd Awake                95.98%    96.58%    95.63%
> 
> Despite kswapd having higher CPU usage, it spent less time awake which
> is probably a reflection of the test completing faster. File writes

Makes sense.

> from kswapd were 0 with the patches applied implying that kswapd was
> not getting to a priority high enough to start writing. The remaining
> writes correlate almost exactly to nr_vmscan_write implying that all
> writes were for anonymous pages.
> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                 0          0          0 
> Direct time   congest     waited               0ms        0ms        0ms 
> Direct full   congest     waited                 0          0          0 
> Direct number conditional waited                 2         17          6 
> Direct time   conditional waited               0ms        0ms        0ms 
> Direct full   conditional waited                 0          0          0 
> KSwapd number congest     waited                 4          8         10 
> KSwapd time   congest     waited               4ms       20ms        8ms 
> KSwapd full   congest     waited                 0          0          0 
> KSwapd number conditional waited                 0      26036      26283 
> KSwapd time   conditional waited               0ms       16ms        4ms 
> KSwapd full   conditional waited                 0          0          0 

What do congest and conditional mean?
Is congest trace_writeback_congestion_wait and conditional trace_writeback_wait_iff_congested?

> 
> This is based on some of the writeback tracepoints. It's interesting
> to note that while kswapd got throttled about 26000 times with all
> patches applied, it spent negligible time asleep so probably just
> called cond_resched().  This implies that the zone and the backing
> device are rarely truly congested and that throttling is necessary
> simply to allow the pages to be written.
> 
> MICRO
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         32.57     31.18     30.52
> Total Elapsed Time (seconds)                166.29    141.94    148.23
> 
> This test is in two stages. The first writes only to a file. The second
> writes to a mix of anonymous and file mappings.  Time to completion
> is improved and this is still true with monitoring disabled.

Good.

> 
> MMTests Statistics: vmstat
> Page Ins                                  11018260  10668536  10792204
> Page Outs                                 16632838  16468468  16449897
> Swap Ins                                    296167    245878    256038
> Swap Outs                                   221626    177922    179409
> Direct pages scanned                       4129424   5172015   3686598
> Kswapd pages scanned                       9152837   9000480   7909180
> Kswapd pages reclaimed                     3388122   3284663   3371737
> Direct pages reclaimed                      735425    765263    708713
> Kswapd efficiency                              37%       36%       42%
> Kswapd velocity                          55041.416 63410.455 53357.485
> Direct efficiency                              17%       14%       19%
> Direct velocity                          24832.666 36438.037 24870.795
> Percentage direct scans                        31%       36%       31%
> Page writes by reclaim                      347283    180065    179425
> Page writes skipped                              0         0         0
> Page reclaim invalidate                          0    864018    554666
> Write invalidated                                0         0         0
> Page reclaim throttled                           0         0         0
> Slabs scanned                                14464     13696     13952
> Direct inode steals                            470       864       934
> Kswapd inode steals                            426       411       317
> Kswapd skipped wait                           3255      3381      1437
> Compaction stalls                                0         0         2
> Compaction success                               0         0         1
> Compaction failures                              0         0         1
> Compaction pages moved                           0         0         0
> Compaction move failure                          0         0         0
> 
> Kswapd efficiency is improved slightly. kswapd is operating at roughly
> the same velocity but the number of pages scanned is far lower due
> to the test completing faster.
> 
> Direct reclaim efficiency is improved slightly and scanning fewer pages
> (again due to lower time to completion).
> 
> Fewer pages are being written from reclaim.
> 
> FTrace Reclaim Statistics: vmscan
>                    micro-3.0.0         3.0.0-rc6         3.0.0-rc6
>                    rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
> Direct reclaims                              14060      15425      13726 
> Direct reclaim pages scanned               3596218    4621037    3613503 
> Direct reclaim pages reclaimed              735425     765263     708713 
> Direct reclaim write file async I/O          87264          0          0 
> Direct reclaim write anon async I/O          10030       9127      15028 
> Direct reclaim write file sync I/O               0          0          0 
> Direct reclaim write anon sync I/O               0          0          0 
> Wake kswapd requests                         10424      10346      10786 
> Kswapd wakeups                                  22         22         14 
> Kswapd pages scanned                       9041353    8889081    7895846 
> Kswapd pages reclaimed                     3388122    3284663    3371737 
> Kswapd reclaim write file async I/O           7277       1710          0 
> Kswapd reclaim write anon async I/O         184205     159178     162367 
> Kswapd reclaim write file sync I/O               0          0          0 
> Kswapd reclaim write anon sync I/O               0          0          0 
> Time stalled direct reclaim (seconds)        54.29       5.67      14.29 
> Time kswapd awake (seconds)                 151.62     129.83     135.98 
> 
> Total pages scanned                       12637571  13510118  11509349
> Total pages reclaimed                      4123547   4049926   4080450
> %age total pages scanned/reclaimed          32.63%    29.98%    35.45%
> %age total pages scanned/written             2.29%     1.26%     1.54%
> %age  file pages scanned/written             0.75%     0.01%     0.00%
> Percentage Time Spent Direct Reclaim        62.50%    15.39%    31.89%
> Percentage Time kswapd Awake                91.18%    91.47%    91.74%
> 
> Time spent in direct reclaim is massively reduced which is surprising

Awesome!

> as this is XFS so it should not have been stalling writing files
> anyway.  It's possible that the anon writes are completing
> faster so time spent swapping is reduced.
> 
> With patches 1-7, kswapd still writes some pages because memory pressure
> pushes it to higher priorities, but the number of pages it writes is
> significantly reduced and is only a small percentage of the pages that
> were written to swap. Patch 8 eliminates it entirely but the benefit is
> not seen in the completion times as the number of writes is so small.

Yes. It seems patch 8's effect is very small in general;
it even increased direct reclaim time.

> 
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest     waited                 0          0          0 
> Direct time   congest     waited               0ms        0ms        0ms 
> Direct full   congest     waited                 0          0          0 
> Direct number conditional waited             12345      37713      34841 
> Direct time   conditional waited           12396ms      132ms      168ms 
> Direct full   conditional waited                53          0          0 
> KSwapd number congest     waited              4248       2957       2293 
> KSwapd time   congest     waited           15320ms    10312ms    13416ms 
> KSwapd full   congest     waited                31          1         21 
> KSwapd number conditional waited                 0      15989      10410 
> KSwapd time   conditional waited               0ms        0ms        0ms 
> KSwapd full   conditional waited                 0          0          0 
> 
> Congestion is way down as direct reclaim conditional wait time is
> reduced by about 12 seconds.
> 
> Overall, this looks good. Avoiding writes from kswapd improves
> overall performance as expected and eliminating them entirely seems
> to behave well.

I agree with you.

> 
> Next I tested on a NUMA configuration of sorts. I don't have a real
> NUMA machine so I booted the same machine with mem=4096M numa=fake=8
> so each node is 512M. Again, the volume of information is high but
> here is a summary of sorts based on a test run with monitors enabled.
> 
> 4096M8N-xfs     Files/s  mean                    27.29 ( 0.00%)      27.35 ( 0.20%)   27.91 ( 2.22%)
> 4096M8N-xfs     Elapsed Time fsmark                     1402.55             1400.77          1382.92
> 4096M8N-xfs     Elapsed Time mmap-strm                   660.90              596.91           630.05
> 4096M8N-xfs     Kswapd efficiency fsmark                    72%                 71%              13%
> 4096M8N-xfs     Kswapd efficiency mmap-strm                 39%                 40%              31%
> 4096M8N-xfs     stalled direct reclaim fsmark              0.00                0.00             0.00
> 4096M8N-xfs     stalled direct reclaim mmap-strm          36.37               13.06            56.88
> 4096M8N-4X-xfs  Files/s  mean                    26.80 ( 0.00%)      26.41 (-1.47%)   26.40 (-1.53%)
> 4096M8N-4X-xfs  Elapsed Time fsmark                     1453.95             1460.62          1470.98
> 4096M8N-4X-xfs  Elapsed Time mmap-strm                   683.34              663.46           690.01
> 4096M8N-4X-xfs  Kswapd efficiency fsmark                    68%                 67%               8%
> 4096M8N-4X-xfs  Kswapd efficiency mmap-strm                 35%                 34%               6%
> 4096M8N-4X-xfs  stalled direct reclaim fsmark              0.00                0.00             0.00
> 4096M8N-4X-xfs  stalled direct reclaim mmap-strm          26.45               87.57            46.87
> 4096M8N-2X-xfs  Files/s  mean                    26.22 ( 0.00%)      26.70 ( 1.77%)   27.21 ( 3.62%)
> 4096M8N-2X-xfs  Elapsed Time fsmark                     1469.28             1439.30          1424.45
> 4096M8N-2X-xfs  Elapsed Time mmap-strm                   676.77              656.28           655.03
> 4096M8N-2X-xfs  Kswapd efficiency fsmark                    69%                 69%               9%
> 4096M8N-2X-xfs  Kswapd efficiency mmap-strm                 33%                 33%               7%
> 4096M8N-2X-xfs  stalled direct reclaim fsmark              0.00                0.00             0.00
> 4096M8N-2X-xfs  stalled direct reclaim mmap-strm          52.74               57.96           102.49
> 4096M8N-16X-xfs Files/s  mean                    25.78 ( 0.00%)       27.81 ( 7.32%)  48.52 (46.87%)
> 4096M8N-16X-xfs Elapsed Time fsmark                     1555.95             1554.78          1542.53
> 4096M8N-16X-xfs Elapsed Time mmap-strm                   770.01              763.62           844.55
> 4096M8N-16X-xfs Kswapd efficiency fsmark                    62%                 62%               7%
> 4096M8N-16X-xfs Kswapd efficiency mmap-strm                 38%                 37%              10%
> 4096M8N-16X-xfs stalled direct reclaim fsmark              0.12                0.01             0.05
> 4096M8N-16X-xfs stalled direct reclaim mmap-strm           1.07                1.09            63.32
> 
> The performance differences for fsmark are marginal because the number
> of pages written from reclaim is pretty low with this much memory even
> with NUMA enabled. At no point did fsmark enter direct reclaim to
> try and write a page so it's all kswapd. What is important to note is
> the "Kswapd efficiency". Once kswapd cannot write pages at all, its
> efficiency drops rapidly for fsmark as it scans about 5-8 times more
> pages waiting on flusher threads to clean a page from the correct node.
> 
> Kswapd not writing pages impairs direct reclaim performance for the
> streaming writer test. Note the times stalled in direct reclaim. In
> all cases, the time stalled in direct reclaim goes way up as both
> direct reclaimers and kswapd get stalled waiting on pages to get
> cleaned from the right node.

Yes. The data is horrible.

> 
> Fortunately, kswapd CPU usage does not go to 100% because of the
> throttling. From the 4096M8N test for example, I see
> 
> KSwapd full   congest     waited               834        739        989
> KSwapd number conditional waited                 0      68552     372275
> KSwapd time   conditional waited               0ms       16ms     1684ms
> KSwapd full   conditional waited                 0          0          0
> 
> With kswapd avoiding writes, it gets throttled lightly but when it
> writes no pages at all, it gets throttled very heavily and sleeps.
> 
> ext4 tells a slightly different story
> 
> 4096M8N-ext4         Files/s  mean               28.63 ( 0.00%)       30.58 ( 6.37%)   31.04 ( 7.76%)
> 4096M8N-ext4         Elapsed Time fsmark                1578.51              1551.99          1532.65
> 4096M8N-ext4         Elapsed Time mmap-strm              703.66               655.25           654.86
> 4096M8N-ext4         Kswapd efficiency fsmark               62%                  69%              68%
> 4096M8N-ext4         Kswapd efficiency mmap-strm            35%                  35%              35%
> 4096M8N-ext4         stalled direct reclaim fsmark         0.00                 0.00             0.00 
> 4096M8N-ext4         stalled direct reclaim mmap-strm     32.64                95.72           152.62 
> 4096M8N-2X-ext4      Files/s  mean               30.74 ( 0.00%)       28.49 (-7.89%)   28.79 (-6.75%)
> 4096M8N-2X-ext4      Elapsed Time fsmark                1466.62              1583.12          1580.07
> 4096M8N-2X-ext4      Elapsed Time mmap-strm              705.17               705.64           693.01
> 4096M8N-2X-ext4      Kswapd efficiency fsmark               68%                  68%              67%
> 4096M8N-2X-ext4      Kswapd efficiency mmap-strm            34%                  30%              18%
> 4096M8N-2X-ext4      stalled direct reclaim fsmark         0.00                 0.00             0.00 
> 4096M8N-2X-ext4      stalled direct reclaim mmap-strm    106.82                24.88            27.88 
> 4096M8N-4X-ext4      Files/s  mean               24.15 ( 0.00%)       23.18 (-4.18%)   23.94 (-0.89%)
> 4096M8N-4X-ext4      Elapsed Time fsmark                1848.41              1971.48          1867.07
> 4096M8N-4X-ext4      Elapsed Time mmap-strm              664.87               673.66           674.46
> 4096M8N-4X-ext4      Kswapd efficiency fsmark               62%                  65%              65%
> 4096M8N-4X-ext4      Kswapd efficiency mmap-strm            33%                  37%              15%
> 4096M8N-4X-ext4      stalled direct reclaim fsmark         0.18                 0.03             0.26 
> 4096M8N-4X-ext4      stalled direct reclaim mmap-strm    115.71                23.05            61.12 
> 4096M8N-16X-ext4     Files/s  mean                5.42 ( 0.00%)        5.43 ( 0.15%)    3.83 (-41.44%)
> 4096M8N-16X-ext4     Elapsed Time fsmark                9572.85              9653.66         11245.41
> 4096M8N-16X-ext4     Elapsed Time mmap-strm              752.88               750.38           769.19
> 4096M8N-16X-ext4     Kswapd efficiency fsmark               59%                  59%              61%
> 4096M8N-16X-ext4     Kswapd efficiency mmap-strm            34%                  34%              21%
> 4096M8N-16X-ext4     stalled direct reclaim fsmark         0.26                 0.65             0.26 
> 4096M8N-16X-ext4     stalled direct reclaim mmap-strm    177.48               125.91           196.92 
> 
> 4096M8N-16X-ext4 with kswapd writing no pages collapsed in terms of
> performance. Looking at the fsmark logs, in a number of iterations,
> it was barely able to write files at all.
> 
> The apparent slowdown for fsmark in 4096M8N-2X-ext4 is well within
> the noise but the reduced time spent in direct reclaim is very welcome.

But 4096M8N-ext4 increased the time, and 4096M8N-2X-ext4 is within the noise
as you said, so I doubt its reliability.

> 
> Unlike xfs, it's less clear cut if direct reclaim performance is
> impaired but in a few tests, preventing kswapd writing pages did
> increase the time stalled.
> 
> Last test is that I've been running this series on my laptop since
> Monday without any problem but it's rarely under serious memory
> pressure. I see nr_vmscan_write is 0 and the number of pages
> invalidated from the end of the LRU is only 10844 after 3 days so
> it's not much of a test.
> 
> Overall, having kswapd avoiding writes does improve performance
> which is not a surprise. Dave asked "do we even need IO at all from
> reclaim?". On NUMA machines, the answer is "yes" unless the VM can
> wake the flusher thread to clean a specific node. When kswapd never
> writes, processes can stall for significant periods of time waiting on
> flushers to clean the correct pages. If all writing is to be deferred
> to flushers, it must ensure that many writes on one node would not
> starve requests for cleaning pages on another node.

It's a good answer. :)

> 
> I'm currently of the opinion that we should consider merging patches
> 1-7 and discuss what is required before merging patch 8. How the
> flushers could prioritise writing pages belonging to a particular zone
> can be tackled later, before all writes from reclaim are disabled.
> There is already some work in this general area and series such as
> "writeback: moving expire targets for background/kupdate works" could
> be extended to allow patch 8 to be merged later even if the series
> needs work.

I think you already know what we need (i.e., prioritising the pages in a zone).
In the NUMA case, patches 1-7 have a problem on ext4, so we should focus on NUMA in the remaining time.

An alternative to [prioritising the pages in a zone] might be Johannes's [mm: per-zone dirty limiting].
It might mitigate the NUMA problems.

Overall, I really welcome this approach and would like to see it merged into mmotm as soon as
possible to see the side effects on non-NUMA (I will add my reviewed-by soon).
In the NUMA case, we already know the problem, so I think it can be solved
before the series is sent to mainline.

It was a great time to see your data, and you made my coffee delicious. :)
You're a good barista.
Thanks for your great effort, Mel!

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-27 16:18 ` Minchan Kim
@ 2011-07-28 11:38   ` Mel Gorman
  2011-07-29  9:48     ` Minchan Kim
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-07-28 11:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 28, 2011 at 01:18:21AM +0900, Minchan Kim wrote:
> On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
> > Warning: Long post with lots of figures. If you normally drink coffee
> > and you don't have a cup, get one or you may end up with a case of
> > keyboard face.
> 
> At last, I get a coffee.
> 

Nice one.

> > <SNIP>
> > I consider this series to be orthogonal to the writeback work but
> > it is worth noting that the writeback work affects the viability of
> > patch 8 in particular.
> > 
> > I tested this on ext4 and xfs using fs_mark and a micro benchmark
> > that does a streaming write to a large mapping (exercises use-once
> > LRU logic) followed by streaming writes to a mix of anonymous and
> > file-backed mappings. The command line for fs_mark when booted with
> > 512M looked something like
> > 
> > ./fs_mark  -d  /tmp/fsmark-2676  -D  100  -N  150  -n  150  -L  25  -t  1  -S0  -s  10485760
> > 
> > The number of files was adjusted depending on the amount of available
> > memory so that the files created was about 3xRAM. For multiple threads,
> > the -d switch is specified multiple times.
> > 
> > 3 kernels are tested.
> > 
> > vanilla	3.0-rc6
> > kswapdwb-v2r5		patches 1-7
> > nokswapdwb-v2r5		patches 1-8
> > 
> > The test machine is x86-64 with an older generation of AMD processor
> > with 4 cores. The underlying storage was 4 disks configured as RAID-0
> > as this was the best configuration of storage I had available. Swap
> > is on a separate disk. Dirty ratio was tuned to 40% instead of the
> > default of 20%.
> > 
> > Testing was run with and without monitors to both verify that the
> > patches were operating as expected and that any performance gain was
> > real and not due to interference from monitors.
> 
> Wow, it seems it took you a long time to finish your experiments.

Yes, they take a long time to run.

> > I've posted the raw reports for each filesystem at
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110721
> > 
> > Unfortunately, the volume of data is excessive but here is a partial
> > summary of what was interesting for XFS.
> > 
> > 512M1P-xfs           Files/s  mean         32.99 ( 0.00%)       35.16 ( 6.18%)       35.08 ( 5.94%)
> > 512M1P-xfs           Elapsed Time fsmark           122.54               115.54               115.21
> > 512M1P-xfs           Elapsed Time mmap-strm        105.09               104.44               106.12
> > 512M-xfs             Files/s  mean         30.50 ( 0.00%)       33.30 ( 8.40%)       34.68 (12.06%)
> > 512M-xfs             Elapsed Time fsmark           136.14               124.26               120.33
> > 512M-xfs             Elapsed Time mmap-strm        154.68               145.91               138.83
> > 512M-2X-xfs          Files/s  mean         28.48 ( 0.00%)       32.90 (13.45%)       32.83 (13.26%)
> > 512M-2X-xfs          Elapsed Time fsmark           145.64               128.67               128.67
> > 512M-2X-xfs          Elapsed Time mmap-strm        145.92               136.65               137.67
> > 512M-4X-xfs          Files/s  mean         29.06 ( 0.00%)       32.82 (11.46%)       33.32 (12.81%)
> > 512M-4X-xfs          Elapsed Time fsmark           153.69               136.74               135.11
> > 512M-4X-xfs          Elapsed Time mmap-strm        159.47               128.64               132.59
> > 512M-16X-xfs         Files/s  mean         48.80 ( 0.00%)       41.80 (-16.77%)       56.61 (13.79%)
> > 512M-16X-xfs         Elapsed Time fsmark           161.48               144.61               141.19
> > 512M-16X-xfs         Elapsed Time mmap-strm        167.04               150.62               147.83
> > 
> > The difference between kswapd writing and not writing for fsmark
> > in many cases is marginal simply because kswapd was not reaching a
> > high enough priority to enter writeback. Memory is mostly consumed
> > by filesystem-backed pages so limiting the number of dirty pages
> > (dirty_ratio == 40) means that kswapd always makes forward progress
> > and avoids the OOM killer.
> 
> Looks promising as most of the elapsed times are lower than vanilla.
> 

Yes, although a lower elapsed time is not always better. For example, some
tests I run will execute a variable number of iterations trying to get a
good estimate of the true mean. These tests use a fixed number of
iterations, so a lower elapsed time implies higher throughput.

> > The streaming-write benchmarks all completed faster.
> > 
> > The tests were also run with mem=1024M and mem=4608M with the relative
> > performance improvement reduced as memory increases reflecting that
> > with enough memory there are fewer writes from reclaim as the flusher
> > threads have time to clean the page before it reaches the end of
> > the LRU.
> > 
> > Here is the same tests except when using ext4
> > 
> > 512M1P-ext4          Files/s  mean         37.36 ( 0.00%)       37.10 (-0.71%)       37.66 ( 0.78%)
> > 512M1P-ext4          Elapsed Time fsmark           108.93               109.91               108.61
> > 512M1P-ext4          Elapsed Time mmap-strm        112.15               108.93               109.10
> > 512M-ext4            Files/s  mean         30.83 ( 0.00%)       39.80 (22.54%)       32.74 ( 5.83%)
> > 512M-ext4            Elapsed Time fsmark           368.07               322.55               328.80
> > 512M-ext4            Elapsed Time mmap-strm        131.98               117.01               118.94
> > 512M-2X-ext4         Files/s  mean         20.27 ( 0.00%)       22.75 (10.88%)       20.80 ( 2.52%)
> > 512M-2X-ext4         Elapsed Time fsmark           518.06               493.74               479.21
> > 512M-2X-ext4         Elapsed Time mmap-strm        131.32               126.64               117.05
> > 512M-4X-ext4         Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
> > 512M-4X-ext4         Elapsed Time fsmark           633.41               660.70               572.74
> > 512M-4X-ext4         Elapsed Time mmap-strm        137.85               127.63               124.07
> > 512M-16X-ext4        Files/s  mean         55.86 ( 0.00%)       69.90 (20.09%)       42.66 (-30.94%)
> > 512M-16X-ext4        Elapsed Time fsmark           543.21               544.43               586.16
> > 512M-16X-ext4        Elapsed Time mmap-strm        141.84               146.12               144.01
> > 
> > At first glance, the benefit for ext4 is less clear cut but this
> > is due to the standard deviation being very high. Take 512M-4X-ext4
> > showing a 45.63% regression for example and we see.
> > 
> > Files/s  min           5.40 ( 0.00%)        4.10 (-31.71%)        6.50 (16.92%)
> > Files/s  mean         17.91 ( 0.00%)       12.30 (-45.63%)       16.58 (-8.06%)
> > Files/s  stddev       14.34 ( 0.00%)        8.04 (-78.46%)       14.50 ( 1.04%)
> > Files/s  max          54.30 ( 0.00%)       37.70 (-44.03%)       77.20 (29.66%)
> > 
> > The standard deviation is *massive* meaning that the performance
> > loss is well within the noise. The main positive out of this is that the
> 
> Yes.
> ext4 seems to be very sensitive to the situation.
> 

It'd be nice to have a theory as to why it is so variable but it could
be simply down to disk layout and seeks. I wasn't running blktrace to
see if that was the case. As this is RAID, it's also possible it is a
stride problem as I didn't specify stride= to mkfs.

> > streaming write benchmarks are generally better.
> > 
> > Where it does benefit is stalls in direct reclaim. Unlike xfs, ext4
> > can stall direct reclaim writing back pages. When I look at a separate
> > run using ftrace to gather more information, I see;
> > 
> > 512M-ext4            Time stalled direct reclaim fsmark            0.36       0.30       0.31 
> > 512M-ext4            Time stalled direct reclaim mmap-strm        36.88       7.48      36.24 
> 
> This data is odd.
> The elapsed times of experiments [2] and [3] are almost the same (117.01, 118.94) but the
> direct reclaim stall time of [2] is much lower. Hmm??

It could have been just luck on that particular run. These figures
don't tell us *which* process got stuck in direct reclaim for that
length of time. If it was one of the monitors recording stats for
example, it wouldn't affect the reported results. It could be figured
out from the trace data if I went back through it but it's probably
not worth the trouble.

> > 512M-4X-ext4         Time stalled direct reclaim fsmark            1.06       0.40       0.43 
> > 512M-4X-ext4         Time stalled direct reclaim mmap-strm       102.68      33.18      23.99 
> > 512M-16X-ext4        Time stalled direct reclaim fsmark            0.17       0.27       0.30 
> > 512M-16X-ext4        Time stalled direct reclaim mmap-strm         9.80       2.62       1.28 
> > 512M-32X-ext4        Time stalled direct reclaim fsmark            0.00       0.00       0.00 
> > 512M-32X-ext4        Time stalled direct reclaim mmap-strm         2.27       0.51       1.26 
> > 
> > Time spent in direct reclaim is reduced implying that bug reports
> > complaining about the system becoming jittery when copying large
> > files may also be helped.
> 
> It would be a very good thing.
> 

I'm currently running the same tests on a laptop using a USB stick for
storage to see if something useful comes out.

> > To show what effect the patches are having, this is a more detailed
> > look at one of the tests running with monitoring enabled. It's booted
> > with mem=512M and the number of threads running is equal to the number
> > of CPU cores. The backing filesystem is XFS.
> > 
> > FS-Mark
> >                   fsmark-3.0.0         3.0.0-rc6         3.0.0-rc6
> >                    rc6-vanilla      kswapwb-v2r5    nokswapwb-v2r5
> > Files/s  min          27.30 ( 0.00%)       31.80 (14.15%)       31.40 (13.06%)
> > Files/s  mean         30.32 ( 0.00%)       34.34 (11.73%)       34.52 (12.18%)
> > Files/s  stddev        1.39 ( 0.00%)        1.06 (-31.96%)        1.20 (-16.05%)
> > Files/s  max          33.60 ( 0.00%)       36.00 ( 6.67%)       36.30 ( 7.44%)
> > Overhead min     1393832.00 ( 0.00%)  1793141.00 (-22.27%)  1133240.00 (23.00%)
> > Overhead mean    2423808.52 ( 0.00%)  2513297.40 (-3.56%)  1823398.44 (32.93%)
> > Overhead stddev   445880.26 ( 0.00%)   392952.66 (13.47%)   420498.38 ( 6.04%)
> > Overhead max     3359477.00 ( 0.00%)  3184889.00 ( 5.48%)  3016170.00 (11.38%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)         53.26     52.27     51.88
> 
> What is User/Sys?
> 

The sum of the CPU-seconds spent in user and sys mode. Should have used
a + there :/

> > <SNIP>
> > without monitoring although the relative performance gain is less.
> > 
> > Time to completion is reduced which is always good and it implies
> > that IO was consistently higher; this is clearly visible at
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-hydra.png
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/blockio-comparison-smooth-hydra.png
> > 
> > kswapd CPU usage is also interesting
> > 
> > http://www.csn.ul.ie/~mel/postings/reclaim-20110721/html-run-monitor/global-dhp-512M__writeback-reclaimdirty-xfs/hydra/kswapdcpu-comparison-smooth-hydra.png
> > 
> > Note how preventing kswapd reclaiming dirty pages pushes up its CPU
> > usage as it scans more pages but it does not get excessive due to
> > the throttling.
> 
> Good to hear.
> The concern with this patchset was an early OOM kill due to too much scanning.
> I can throw that concern out from now on.
> 

At least, I haven't been able to trigger a premature OOM.

> > <SNIP>
> > Page writes by reclaim is what is motivating this series. It goes
> > from 170014 pages to 123510 which is a big improvement and we'll see
> > later that these writes are for anonymous pages.
> > 
> "Page reclaim invalidate" is very high and implies that a large number
> > of dirty pages are reaching the end of the list quickly. Unfortunately,
> > this is somewhat unavoidable. Kswapd is scanning pages at a rate
> > of roughly 125000 (or 488M) a second on a 512M machine. The best
> > possible writing rate of the underlying storage is about 300M/second.
> > With the rate of reclaim exceeding the best possible writing speed,
> > the system is going to get throttled.
> 
> Just out of curiosity.
> What is 'Page reclaim throttled'?
> 

It should have been deleted from this report. It used to be a vmstat
counter recording how many times patch 6 called wait_iff_congested(). It no
longer exists.
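
(As an aside on the figures quoted above, the 488M is simply the scan rate
converted to bytes: 125,000 pages/sec * 4096 bytes/page is roughly 488
MiB/sec, comfortably above the ~300M/sec the storage can write back.)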

> > <SNIP>
> > from kswapd were 0 with the patches applied implying that kswapd was
> > not getting to a priority high enough to start writing. The remaining
> > writes correlate almost exactly to nr_vmscan_write implying that all
> > writes were for anonymous pages.
> > 
> > FTrace Reclaim Statistics: congestion_wait
> > Direct number congest     waited                 0          0          0 
> > Direct time   congest     waited               0ms        0ms        0ms 
> > Direct full   congest     waited                 0          0          0 
> > Direct number conditional waited                 2         17          6 
> > Direct time   conditional waited               0ms        0ms        0ms 
> > Direct full   conditional waited                 0          0          0 
> > KSwapd number congest     waited                 4          8         10 
> > KSwapd time   congest     waited               4ms       20ms        8ms 
> > KSwapd full   congest     waited                 0          0          0 
> > KSwapd number conditional waited                 0      26036      26283 
> > KSwapd time   conditional waited               0ms       16ms        4ms 
> > KSwapd full   conditional waited                 0          0          0 
> 
What do congest and conditional mean?
Is congest trace_writeback_congestion_wait and conditional trace_writeback_wait_iff_congested?
> 

Yes.

> > <SNIP>
> > Next I tested on a NUMA configuration of sorts. I don't have a real
> > NUMA machine so I booted the same machine with mem=4096M numa=fake=8
> > so each node is 512M. Again, the volume of information is high but
> > here is a summary of sorts based on a test run with monitors enabled.
> > 
> > <XFS discussion snipped>
> >
> > With kswapd avoiding writes, it gets throttled lightly but when it
> > writes no pages at all, it gets throttled very heavily and sleeps.
> > 
> > ext4 tells a slightly different story
> > 
> > 4096M8N-ext4         Files/s  mean               28.63 ( 0.00%)       30.58 ( 6.37%)   31.04 ( 7.76%)
> > 4096M8N-ext4         Elapsed Time fsmark                1578.51              1551.99          1532.65
> > 4096M8N-ext4         Elapsed Time mmap-strm              703.66               655.25           654.86
> > 4096M8N-ext4         Kswapd efficiency fsmark               62%                  69%              68%
> > 4096M8N-ext4         Kswapd efficiency mmap-strm            35%                  35%              35%
> > 4096M8N-ext4         stalled direct reclaim fsmark         0.00                 0.00             0.00 
> > 4096M8N-ext4         stalled direct reclaim mmap-strm     32.64                95.72           152.62 
> > 4096M8N-2X-ext4      Files/s  mean               30.74 ( 0.00%)       28.49 (-7.89%)   28.79 (-6.75%)
> > 4096M8N-2X-ext4      Elapsed Time fsmark                1466.62              1583.12          1580.07
> > 4096M8N-2X-ext4      Elapsed Time mmap-strm              705.17               705.64           693.01
> > 4096M8N-2X-ext4      Kswapd efficiency fsmark               68%                  68%              67%
> > 4096M8N-2X-ext4      Kswapd efficiency mmap-strm            34%                  30%              18%
> > 4096M8N-2X-ext4      stalled direct reclaim fsmark         0.00                 0.00             0.00 
> > 4096M8N-2X-ext4      stalled direct reclaim mmap-strm    106.82                24.88            27.88 
> > 4096M8N-4X-ext4      Files/s  mean               24.15 ( 0.00%)       23.18 (-4.18%)   23.94 (-0.89%)
> > 4096M8N-4X-ext4      Elapsed Time fsmark                1848.41              1971.48          1867.07
> > 4096M8N-4X-ext4      Elapsed Time mmap-strm              664.87               673.66           674.46
> > 4096M8N-4X-ext4      Kswapd efficiency fsmark               62%                  65%              65%
> > 4096M8N-4X-ext4      Kswapd efficiency mmap-strm            33%                  37%              15%
> > 4096M8N-4X-ext4      stalled direct reclaim fsmark         0.18                 0.03             0.26 
> > 4096M8N-4X-ext4      stalled direct reclaim mmap-strm    115.71                23.05            61.12 
> > 4096M8N-16X-ext4     Files/s  mean                5.42 ( 0.00%)        5.43 ( 0.15%)    3.83 (-41.44%)
> > 4096M8N-16X-ext4     Elapsed Time fsmark                9572.85              9653.66         11245.41
> > 4096M8N-16X-ext4     Elapsed Time mmap-strm              752.88               750.38           769.19
> > 4096M8N-16X-ext4     Kswapd efficiency fsmark               59%                  59%              61%
> > 4096M8N-16X-ext4     Kswapd efficiency mmap-strm            34%                  34%              21%
> > 4096M8N-16X-ext4     stalled direct reclaim fsmark         0.26                 0.65             0.26 
> > 4096M8N-16X-ext4     stalled direct reclaim mmap-strm    177.48               125.91           196.92 
> > 
> > 4096M8N-16X-ext4 with kswapd writing no pages collapsed in terms of
> > performance. Looking at the fsmark logs, in a number of iterations,
> > it was barely able to write files at all.
> > 
> > The apparent slowdown for fsmark in 4096M8N-2X-ext4 is well within
> > the noise but the reduced time spent in direct reclaim is very welcome.
> 
> But 4096M8N-ext4 increased the time, and 4096M8N-2X-ext4 is within the noise
> as you said, so I doubt its reliability.
> 

Agreed. Again, it could be figured out which process is stalling but it
wouldn't tell us very much.

> > 
> > Unlike xfs, it's less clear cut if direct reclaim performance is
> > impaired but in a few tests, preventing kswapd writing pages did
> > increase the time stalled.
> > 
> > Last test is that I've been running this series on my laptop since
> > Monday without any problem but it's rarely under serious memory
> > pressure. I see nr_vmscan_write is 0 and the number of pages
> > invalidated from the end of the LRU is only 10844 after 3 days so
> > it's not much of a test.
> > 
> > Overall, having kswapd avoiding writes does improve performance
> > which is not a surprise. Dave asked "do we even need IO at all from
> > reclaim?". On NUMA machines, the answer is "yes" unless the VM can
> > wake the flusher thread to clean a specific node. When kswapd never
> > writes, processes can stall for significant periods of time waiting on
> > flushers to clean the correct pages. If all writing is to be deferred
> > to flushers, it must ensure that many writes on one node would not
> > starve requests for cleaning pages on another node.
> 
> It's a good answer. :)
> 

Thanks :)

> > I'm currently of the opinion that we should consider merging patches
> > 1-7 and discuss what is required before merging patch 8. How the
> > flushers could prioritise writing pages belonging to a particular zone
> > can be tackled later, before all writes from reclaim are disabled.
> > There is already some work in this general area and series such as
> > "writeback: moving expire targets for background/kupdate works" could
> > be extended to allow patch 8 to be merged later even if the series
> > needs work.
> 
> I think you already know what we need (i.e., prioritising the pages in a zone).
> In the NUMA case, patches 1-7 have a problem on ext4, so we should focus on NUMA in the remaining time.
> 

The slowdown for ext4 was within the noise but I'll run it again and
confirm that it really is not a problem.

> An alternative to [prioritising the pages in a zone] might be Johannes's [mm: per-zone dirty limiting].
> It might mitigate the NUMA problems.
> 

It might.

> Overall, I really welcome this approach and would like to see it merged into mmotm as soon as
> possible to see the side effects on non-NUMA (I will add my reviewed-by soon).
> In the NUMA case, we already know the problem, so I think it can be solved
> before the series is sent to mainline.
> 
> It was a great time to see your data, and you made my coffee delicious. :)
> You're a good barista.
> Thanks for your great effort, Mel!
> 

Thanks for your review.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-28 11:38   ` Mel Gorman
@ 2011-07-29  9:48     ` Minchan Kim
  2011-07-29  9:50       ` Minchan Kim
  0 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-29  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 28, 2011 at 12:38:52PM +0100, Mel Gorman wrote:
> On Thu, Jul 28, 2011 at 01:18:21AM +0900, Minchan Kim wrote:
> > On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
> > > Note how preventing kswapd reclaiming dirty pages pushes up its CPU

<snip>

> > > usage as it scans more pages but it does not get excessive due to
> > > the throttling.
> > 
> > Good to hear.
> > The concern with this patchset was an early OOM kill due to too much scanning.
> > I can throw that concern out from now on.
> > 
> 
> At least, I haven't been able to trigger a premature OOM.

AFAIR, Andrew had a premature OOM problem[1] but I couldn't track it down at that time.
I think this patch series might solve his problem. Even if it doesn't, it should at least not make
his problem worse.

Andrew, Could you test this patchset?

[1] https://lkml.org/lkml/2011/5/25/415
-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-29  9:48     ` Minchan Kim
@ 2011-07-29  9:50       ` Minchan Kim
  2011-07-29 13:41         ` Andrew Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-29  9:50 UTC (permalink / raw)
  To: Mel Gorman, Andrew Lutomirski
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

Sorry for missing the Cc.

On Fri, Jul 29, 2011 at 06:48:16PM +0900, Minchan Kim wrote:
> On Thu, Jul 28, 2011 at 12:38:52PM +0100, Mel Gorman wrote:
> > On Thu, Jul 28, 2011 at 01:18:21AM +0900, Minchan Kim wrote:
> > > On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
> > > > Note how preventing kswapd reclaiming dirty pages pushes up its CPU
> 
> <snip>
> 
> > > > usage as it scans more pages but it does not get excessive due to
> > > > the throttling.
> > > 
> > > Good to hear.
> > > The concern with this patchset was an early OOM kill due to too much scanning.
> > > I can throw that concern out from now on.
> > > 
> > 
> > At least, I haven't been able to trigger a premature OOM.
> 
> AFAIR, Andrew had a premature OOM problem[1] but I couldn't track it down at that time.
> I think this patch series might solve his problem. Even if it doesn't, it should at least not make
> his problem worse.
> 
> Andrew, Could you test this patchset?
> 
> [1] https://lkml.org/lkml/2011/5/25/415
> -- 
> Kind regards,
> Minchan Kim

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2
  2011-07-29  9:50       ` Minchan Kim
@ 2011-07-29 13:41         ` Andrew Lutomirski
  0 siblings, 0 replies; 43+ messages in thread
From: Andrew Lutomirski @ 2011-07-29 13:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Fri, Jul 29, 2011 at 5:50 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> Sorry for missing the Cc.
>
> On Fri, Jul 29, 2011 at 06:48:16PM +0900, Minchan Kim wrote:
>> On Thu, Jul 28, 2011 at 12:38:52PM +0100, Mel Gorman wrote:
>> > On Thu, Jul 28, 2011 at 01:18:21AM +0900, Minchan Kim wrote:
>> > > On Thu, Jul 21, 2011 at 05:28:42PM +0100, Mel Gorman wrote:
>> > > > Note how preventing kswapd reclaiming dirty pages pushes up its CPU
>>
>> <snip>
>>
>> > > > usage as it scans more pages but it does not get excessive due to
>> > > > the throttling.
>> > >
>> > > Good to hear.
>> > > The concern with this patchset was an early OOM kill due to too much scanning.
>> > > I can throw that concern out from now on.
>> > >
>> >
>> > At least, I haven't been able to trigger a premature OOM.
>>
>> AFAIR, Andrew had a premature OOM problem[1] but I couldn't track it down at that time.
>> I think this patch series might solve his problem. Even if it doesn't, it should at least not make
>> his problem worse.
>>
>> Andrew, Could you test this patchset?

Gladly, but not until Wednesday most likely.  I'm defending my thesis
on Monday :)

--Andy


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-21 16:28 ` [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
@ 2011-07-31 15:06   ` Minchan Kim
  2011-08-02 11:21     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-31 15:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 21, 2011 at 05:28:43PM +0100, Mel Gorman wrote:
> From: Mel Gorman <mel@csn.ul.ie>
> 
> When kswapd is failing to keep zones above the min watermark, a process
> will enter direct reclaim in the same manner kswapd does. If a dirty
> page is encountered during the scan, this page is written to backing
> storage using mapping->writepage.
> 
> This causes two problems. First, it can result in very deep call
> stacks, particularly if the target storage or filesystem are complex.
> Some filesystems ignore write requests from direct reclaim as a result.
> The second is that a single-page flush is inefficient in terms of IO.
> While there is an expectation that the elevator will merge requests,
> this does not always happen. Quoting Christoph Hellwig;
> 
> 	The elevator has a relatively small window it can operate on,
> 	and can never fix up a bad large scale writeback pattern.
> 
> This patch prevents direct reclaim writing back filesystem pages by
> checking if current is kswapd. Anonymous pages are still written to
> swap as there is not the equivalent of a flusher thread for anonymous
> pages. If the dirty pages cannot be written back, they are placed
> back on the LRU lists. There is now a direct dependency on dirty page
> balancing to prevent too many pages in the system being dirtied which
> would prevent reclaim making forward progress.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
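
For reference, a minimal sketch of the check the changelog above describes,
assuming the usual shrink_page_list() structure (the actual patch may place
it slightly differently):

	/*
	 * Sketch only: when a dirty page is found during the scan, only
	 * kswapd is allowed to write back file-backed pages.  Direct
	 * reclaim puts the page back on the LRU and relies on the
	 * flusher threads plus dirty page balancing instead.
	 */
	if (PageDirty(page)) {
		if (page_is_file_cache(page) && !current_is_kswapd())
			goto keep_locked;
		/* anonymous pages can still be written to swap here */
	}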

Nitpick.
We can change description of should_reclaim_stall.

"Returns true if the caller should wait to clean dirty/writeback pages"
->
"Returns true if direct reclaimer should wait to clean writeback pages"

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority
  2011-07-21 16:28 ` [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
@ 2011-07-31 15:11   ` Minchan Kim
  0 siblings, 0 replies; 43+ messages in thread
From: Minchan Kim @ 2011-07-31 15:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 21, 2011 at 05:28:47PM +0100, Mel Gorman wrote:
> It is preferable that no dirty pages are dispatched for cleaning from
> the page reclaim path. At normal priorities, this patch prevents kswapd
> writing pages.
> 
> However, page reclaim does have a requirement that pages be freed
> in a particular zone. If it is failing to make sufficient progress
> (reclaiming < SWAP_CLUSTER_MAX at any priority priority), the priority
> is raised to scan more pages. A priority of DEF_PRIORITY - 3 is
> considered to tbe the point where kswapd is getting into trouble
> reclaiming pages. If this priority is reached, kswapd will dispatch
> pages for writing.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
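
For illustration, the rule described above amounts to extending the patch 1
check with a priority threshold, much like the condition visible in the
patch 7 hunk quoted elsewhere in this thread (a sketch, not necessarily the
exact patch; lower priority values mean more reclaim pressure):

	/*
	 * Sketch: skip writing file pages unless we are kswapd and the
	 * priority has dropped to DEF_PRIORITY - 3 or below.
	 */
	if (page_is_file_cache(page) &&
	    (!current_is_kswapd() || priority >= DEF_PRIORITY - 2))
		goto keep_locked;	/* leave the page for the flushers */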

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-21 16:28 ` [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
@ 2011-07-31 15:17   ` Minchan Kim
  2011-08-03 11:19   ` Johannes Weiner
  1 sibling, 0 replies; 43+ messages in thread
From: Minchan Kim @ 2011-07-31 15:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Thu, Jul 21, 2011 at 05:28:48PM +0100, Mel Gorman wrote:
> Workloads that are allocating frequently and writing files place a
> large number of dirty pages on the LRU. With use-once logic, it is
> possible for them to reach the end of the LRU quickly requiring the
> reclaimer to scan more to find clean pages. Ordinarily, processes that
> are dirtying memory will get throttled by dirty balancing but this
> is a global heuristic and does not take into account that LRUs are
> maintained on a per-zone basis. This can lead to a situation whereby
> reclaim is scanning heavily, skipping over a large number of pages
> under writeback and recycling them around the LRU consuming CPU.
> 
> This patch checks how many of the pages isolated from the LRU were
> dirty. If a percentage of them are dirty, the process will be throttled
> if the backing device is congested or the zone being scanned is marked
> congested. The percentage that must be dirty depends on
> the priority. At default priority, all of them must be dirty. At
> DEF_PRIORITY-1, 50% of them must be dirty, DEF_PRIORITY-2, 25%
> etc. i.e.  as pressure increases the greater the likelihood the process
> will get throttled to allow the flusher threads to make some progress.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
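
To make the percentages concrete, here is a sketch of the throttling rule as
described (hypothetical variable names; wait_iff_congested() is the call Mel
mentions patch 6 makes):

	/*
	 * Sketch: nr_taken/nr_dirty come from the isolation pass.  The
	 * dirty threshold halves with each drop in priority, so at
	 * DEF_PRIORITY all isolated pages must be dirty, at
	 * DEF_PRIORITY - 1 half of them, and so on.
	 */
	if (nr_dirty && nr_dirty >= (nr_taken >> (DEF_PRIORITY - priority)))
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);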

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-22 13:23     ` Mel Gorman
@ 2011-07-31 15:24       ` Minchan Kim
  2011-08-02 11:25         ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Minchan Kim @ 2011-07-31 15:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Linux-MM, LKML, XFS, Dave Chinner,
	Christoph Hellwig, Johannes Weiner, Wu Fengguang, Jan Kara,
	Rik van Riel

On Fri, Jul 22, 2011 at 02:23:19PM +0100, Mel Gorman wrote:
> On Fri, Jul 22, 2011 at 02:53:48PM +0200, Peter Zijlstra wrote:
> > On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> > > When direct reclaim encounters a dirty page, it gets recycled around
> > > the LRU for another cycle. This patch marks the page PageReclaim
> > > similar to deactivate_page() so that the page gets reclaimed almost
> > > immediately after the page gets cleaned. This is to avoid reclaiming
> > > clean pages that are younger than a dirty page encountered at the
> > > end of the LRU that might have been something like a use-once page.
> > > 
> > 
> > > @@ -834,7 +834,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > >  			 */
> > >  			if (page_is_file_cache(page) &&
> > >  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > > -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > +				/*
> > > +				 * Immediately reclaim when written back.
> > > +				 * Similar in principal to deactivate_page()
> > > +				 * except we already have the page isolated
> > > +				 * and know it's dirty
> > > +				 */
> > > +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> > > +				SetPageReclaim(page);
> > > +
> > 
> > I find the invalidate name somewhat confusing. It makes me think we'll
> > drop the page without writeback, like invalidatepage().
> 
> I wasn't that happy with it either to be honest but didn't think of a
> better one at the time. nr_reclaim_deferred?

How about "NR_VMSCAN_IMMEDIATE_RECLAIM", like the comment in rotate_reclaimable_page()?

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind regards,
Minchan Kim


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim
  2011-07-31 15:06   ` Minchan Kim
@ 2011-08-02 11:21     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-02 11:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Johannes Weiner, Wu Fengguang, Jan Kara, Rik van Riel

On Mon, Aug 01, 2011 at 12:06:06AM +0900, Minchan Kim wrote:
> On Thu, Jul 21, 2011 at 05:28:43PM +0100, Mel Gorman wrote:
> > From: Mel Gorman <mel@csn.ul.ie>
> > 
> > When kswapd is failing to keep zones above the min watermark, a process
> > will enter direct reclaim in the same manner kswapd does. If a dirty
> > page is encountered during the scan, this page is written to backing
> > storage using mapping->writepage.
> > 
> > This causes two problems. First, it can result in very deep call
> > stacks, particularly if the target storage or filesystem are complex.
> > Some filesystems ignore write requests from direct reclaim as a result.
> > The second is that a single-page flush is inefficient in terms of IO.
> > While there is an expectation that the elevator will merge requests,
> > this does not always happen. Quoting Christoph Hellwig;
> > 
> > 	The elevator has a relatively small window it can operate on,
> > 	and can never fix up a bad large scale writeback pattern.
> > 
> > This patch prevents direct reclaim writing back filesystem pages by
> > checking if current is kswapd. Anonymous pages are still written to
> > swap as there is not the equivalent of a flusher thread for anonymous
> > pages. If the dirty pages cannot be written back, they are placed
> > back on the LRU lists. There is now a direct dependency on dirty page
> > balancing to prevent too many pages in the system being dirtied which
> > would prevent reclaim making forward progress.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> 

Thanks

> Nitpick.
> We can change description of should_reclaim_stall.
> 
> "Returns true if the caller should wait to clean dirty/writeback pages"
> ->
> "Returns true if direct reclaimer should wait to clean writeback pages"
> 

Not a nitpick. At least one check for RECLAIM_MODE_SYNC is no longer
reachable. I've added a new patch that updates the comment and has
synchronous direct reclaim wait on pages under writeback.
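
For context, a rough sketch of what "synchronous direct reclaim waits on
pages under writeback" could look like in shrink_page_list(), assuming the
3.0-era RECLAIM_MODE_SYNC flag (the new patch itself may differ):

	if (PageWriteback(page)) {
		/*
		 * Sketch: lumpy/sync reclaim waits for the IO to finish
		 * instead of issuing writepage itself; everyone else
		 * leaves the page for the flusher threads.
		 */
		if (sc->reclaim_mode & RECLAIM_MODE_SYNC)
			wait_on_page_writeback(page);
		else
			goto keep_locked;
	}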

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-31 15:24       ` Minchan Kim
@ 2011-08-02 11:25         ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-02 11:25 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Peter Zijlstra, Linux-MM, LKML, XFS, Dave Chinner,
	Christoph Hellwig, Johannes Weiner, Wu Fengguang, Jan Kara,
	Rik van Riel

On Mon, Aug 01, 2011 at 12:24:01AM +0900, Minchan Kim wrote:
> On Fri, Jul 22, 2011 at 02:23:19PM +0100, Mel Gorman wrote:
> > On Fri, Jul 22, 2011 at 02:53:48PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2011-07-21 at 17:28 +0100, Mel Gorman wrote:
> > > > When direct reclaim encounters a dirty page, it gets recycled around
> > > > the LRU for another cycle. This patch marks the page PageReclaim
> > > > similar to deactivate_page() so that the page gets reclaimed almost
> > > > immediately after the page gets cleaned. This is to avoid reclaiming
> > > > clean pages that are younger than a dirty page encountered at the
> > > > end of the LRU that might have been something like a use-once page.
> > > > 
> > > 
> > > > @@ -834,7 +834,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > > >  			 */
> > > >  			if (page_is_file_cache(page) &&
> > > >  					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > > > -				inc_zone_page_state(page, NR_VMSCAN_WRITE_SKIP);
> > > > +				/*
> > > > +				 * Immediately reclaim when written back.
> > > > +				 * Similar in principal to deactivate_page()
> > > > +				 * except we already have the page isolated
> > > > +				 * and know it's dirty
> > > > +				 */
> > > > +				inc_zone_page_state(page, NR_VMSCAN_INVALIDATE);
> > > > +				SetPageReclaim(page);
> > > > +
> > > 
> > > I find the invalidate name somewhat confusing. It makes me think we'll
> > > drop the page without writeback, like invalidatepage().
> > 
> > I wasn't that happy with it either to be honest but didn't think of a
> > better one at the time. nr_reclaim_deferred?
> 
> How about "NR_VMSCAN_IMMEDIATE_RECLAIM", like the comment in rotate_reclaimable_page()?
> 

Yeah, I guess. I find it a little misleading because the reclaim does
not happen immediately at the time the counter is incremented but it's
better than "invalidate".

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 ` [PATCH 3/8] ext4: " Mel Gorman
@ 2011-08-03 10:58   ` Johannes Weiner
  2011-08-03 11:06     ` Johannes Weiner
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 10:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:45PM +0100, Mel Gorman wrote:
> Direct reclaim should never writeback pages. Warn if an attempt
> is made.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: Johannes Weiner <jweiner@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-08-03 10:58   ` Johannes Weiner
@ 2011-08-03 11:06     ` Johannes Weiner
  2011-08-03 13:44       ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 11:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 12:58:19PM +0200, Johannes Weiner wrote:
> On Thu, Jul 21, 2011 at 05:28:45PM +0100, Mel Gorman wrote:
> > Direct reclaim should never writeback pages. Warn if an attempt
> > is made.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Acked-by: Johannes Weiner <jweiner@redhat.com>

Oops, too fast.

Shouldn't the WARN_ON() be at the top of the function, rather than
just warn when the write is deferred due to delalloc?
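
For illustration, the placement being suggested would look something like
this sketch (the WARN expression is the one already used elsewhere in the
series; the rest of the function is unchanged):

	/* At the very top of ext4_writepage(), before any delalloc logic: */
	WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
							PF_MEMALLOC);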


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] btrfs: Warn if direct reclaim tries to writeback pages
  2011-07-21 16:28 ` [PATCH 4/8] btrfs: " Mel Gorman
@ 2011-08-03 11:10   ` Johannes Weiner
  2011-08-03 13:45     ` Mel Gorman
  0 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 11:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:46PM +0100, Mel Gorman wrote:
> Direct reclaim should never writeback pages. Warn if an attempt is
> made. By rights, btrfs should be allowing writepage from kswapd if
> it is failing to reclaim pages by any other means but it's outside
> the scope of this patch.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  fs/btrfs/disk-io.c |    2 ++
>  fs/btrfs/inode.c   |    2 ++
>  2 files changed, 4 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 1ac8db5d..cc9c9cf 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -829,6 +829,8 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
>  
>  	tree = &BTRFS_I(page->mapping->host)->io_tree;
>  	if (!(current->flags & PF_MEMALLOC)) {
> +		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> +								PF_MEMALLOC);

Since this is the branch for PF_MEMALLOC being set, why not just
WARN_ON_ONCE(!(current->flags & PF_KSWAPD)) instead?

Minor nitpick, though, and I can understand if you just want to have
the conditionals be the same in every fs.
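
(For illustration, the two forms side by side; a sketch that assumes, as
above, a branch which is only reached with PF_MEMALLOC set:)

	if (current->flags & PF_MEMALLOC) {
		/* Form used in the patch: spells out both flags explicitly. */
		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
								PF_MEMALLOC);

		/* Equivalent here, since PF_MEMALLOC is already known to be set. */
		WARN_ON_ONCE(!(current->flags & PF_KSWAPD));

		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}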

Acked-by: Johannes Weiner <jweiner@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-07-21 16:28 ` [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
  2011-07-31 15:17   ` Minchan Kim
@ 2011-08-03 11:19   ` Johannes Weiner
  2011-08-03 13:56     ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 11:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:48PM +0100, Mel Gorman wrote:
> Workloads that are allocating frequently and writing files place a
> large number of dirty pages on the LRU. With use-once logic, it is
> possible for them to reach the end of the LRU quickly requiring the
> reclaimer to scan more to find clean pages. Ordinarily, processes that
> are dirtying memory will get throttled by dirty balancing but this
> is a global heuristic and does not take into account that LRUs are
> maintained on a per-zone basis. This can lead to a situation whereby
> reclaim is scanning heavily, skipping over a large number of pages
> under writeback and recycling them around the LRU consuming CPU.
> 
> This patch checks how many of the pages isolated from the LRU were
> dirty. If a high enough percentage of them are dirty, the process will
> be throttled if the backing device is congested or the zone being
> scanned is marked congested. The percentage that must be dirty depends
> on the priority: at default priority, all of them must be dirty; at
> DEF_PRIORITY-1, 50% of them; at DEF_PRIORITY-2, 25%; and so on. In
> other words, as pressure increases, so does the likelihood that the
> process will be throttled to allow the flusher threads to make progress.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   21 ++++++++++++++++++---
>  1 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cf7b501..b0060f8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -720,7 +720,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
>  static unsigned long shrink_page_list(struct list_head *page_list,
>  				      struct zone *zone,
>  				      struct scan_control *sc,
> -				      int priority)
> +				      int priority,
> +				      unsigned long *ret_nr_dirty)
>  {
>  	LIST_HEAD(ret_pages);
>  	LIST_HEAD(free_pages);
> @@ -971,6 +972,7 @@ keep_lumpy:
>  
>  	list_splice(&ret_pages, page_list);
>  	count_vm_events(PGACTIVATE, pgactivate);
> +	*ret_nr_dirty += nr_dirty;

Note that this includes anon pages, which means that swapping is
throttled as well.

I don't think it is a downside to throttle swapping during IO
congestion - waiting for pages under writeback to become reclaimable
is better than kicking off even more IO in this case as well - but the
changelog and the comments should include it, I guess.

Otherwise,
Acked-by: Johannes Weiner <jweiner@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-07-21 16:28 ` [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
  2011-07-22 12:53   ` Peter Zijlstra
@ 2011-08-03 11:26   ` Johannes Weiner
  2011-08-03 13:57     ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 11:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:49PM +0100, Mel Gorman wrote:
> When direct reclaim encounters a dirty page, it gets recycled around
> the LRU for another cycle. This patch marks the page PageReclaim
> similar to deactivate_page() so that the page gets reclaimed almost
> immediately after the page gets cleaned. This is to avoid reclaiming
> clean pages that are younger than a dirty page encountered at the
> end of the LRU that might have been something like a use-once page.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Apart from the naming of the counter (I like nr_reclaim_preferred),

Acked-by: Johannes Weiner <jweiner@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd
  2011-07-21 16:28 ` [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd Mel Gorman
  2011-07-22 12:57   ` Peter Zijlstra
@ 2011-08-03 11:37   ` Johannes Weiner
  2011-08-03 13:58     ` Mel Gorman
  1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 11:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Thu, Jul 21, 2011 at 05:28:50PM +0100, Mel Gorman wrote:
> Assuming that flusher threads will always write back dirty pages promptly,
> it is always faster for reclaimers to wait for the flushers. This patch
> prevents kswapd from writing back any filesystem pages.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Relying on the flushers may mean that every dirty page in the system
has to be written back before the pages from the zone of interest are
clean.

De-facto we have only one mechanism to stay on top of the dirty pages
from a per-zone perspective, and that is single-page writeout from
reclaim.

While we all agree that this sucks, we cannot remove it unless we
have a replacement that makes zones reclaimable in a reasonable time
frame (or keeps them reclaimable in the first place, which is what
per-zone dirty limits attempt to do).

As such, please include

Nacked-by: Johannes Weiner <jweiner@redhat.com>


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-08-03 11:06     ` Johannes Weiner
@ 2011-08-03 13:44       ` Mel Gorman
  2011-08-03 14:00         ` Johannes Weiner
  0 siblings, 1 reply; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 13:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 01:06:29PM +0200, Johannes Weiner wrote:
> On Wed, Aug 03, 2011 at 12:58:19PM +0200, Johannes Weiner wrote:
> > On Thu, Jul 21, 2011 at 05:28:45PM +0100, Mel Gorman wrote:
> > > Direct reclaim should never writeback pages. Warn if an attempt
> > > is made.
> > > 
> > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > 
> > Acked-by: Johannes Weiner <jweiner@redhat.com>
> 
> Oops, too fast.
> 
> Shouldn't the WARN_ON() be at the top of the function, rather than
> just warn when the write is deferred due to delalloc?

I thought it made more sense to put the warning at the point where ext4
would normally ignore ->writepage.

That said, in my current revision of the series, I've dropped these
patches altogether as page migration should be able to trigger the same
warnings but be called from paths that are of less concern for stack
overflows (or at the very least be looked at as a separate series).

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 4/8] btrfs: Warn if direct reclaim tries to writeback pages
  2011-08-03 11:10   ` Johannes Weiner
@ 2011-08-03 13:45     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 13:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 01:10:31PM +0200, Johannes Weiner wrote:
> On Thu, Jul 21, 2011 at 05:28:46PM +0100, Mel Gorman wrote:
> > Direct reclaim should never writeback pages. Warn if an attempt is
> > made. By rights, btrfs should be allowing writepage from kswapd if
> > it is failing to reclaim pages by any other means but it's outside
> > the scope of this patch.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  fs/btrfs/disk-io.c |    2 ++
> >  fs/btrfs/inode.c   |    2 ++
> >  2 files changed, 4 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 1ac8db5d..cc9c9cf 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -829,6 +829,8 @@ static int btree_writepage(struct page *page, struct writeback_control *wbc)
> >  
> >  	tree = &BTRFS_I(page->mapping->host)->io_tree;
> >  	if (!(current->flags & PF_MEMALLOC)) {
> > +		WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> > +								PF_MEMALLOC);
> 
> Since this is the branch for PF_MEMALLOC being set, why not just
> WARN_ON_ONCE(!(current->flags & PF_KSWAPD)) instead?
> 
> Minor nitpick, though, and I can understand if you just want to have
> the conditionals be the same in every fs.
> 

It was just copying the conditionals from the other filesystems, although
I admit your version would look nicer.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback
  2011-08-03 11:19   ` Johannes Weiner
@ 2011-08-03 13:56     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 13:56 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 01:19:40PM +0200, Johannes Weiner wrote:
> On Thu, Jul 21, 2011 at 05:28:48PM +0100, Mel Gorman wrote:
> > Workloads that are allocating frequently and writing files place a
> > large number of dirty pages on the LRU. With use-once logic, it is
> > possible for them to reach the end of the LRU quickly requiring the
> > reclaimer to scan more to find clean pages. Ordinarily, processes that
> > are dirtying memory will get throttled by dirty balancing but this
> > is a global heuristic and does not take into account that LRUs are
> > maintained on a per-zone basis. This can lead to a situation whereby
> > reclaim is scanning heavily, skipping over a large number of pages
> > under writeback and recycling them around the LRU consuming CPU.
> > 
> > This patch checks how many of the pages isolated from the LRU were
> > dirty. If a high enough percentage of them are dirty, the process will
> > be throttled if the backing device is congested or the zone being
> > scanned is marked congested. The percentage that must be dirty depends
> > on the priority: at default priority, all of them must be dirty; at
> > DEF_PRIORITY-1, 50% of them; at DEF_PRIORITY-2, 25%; and so on. In
> > other words, as pressure increases, so does the likelihood that the
> > process will be throttled to allow the flusher threads to make progress.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  mm/vmscan.c |   21 ++++++++++++++++++---
> >  1 files changed, 18 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cf7b501..b0060f8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -720,7 +720,8 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
> >  static unsigned long shrink_page_list(struct list_head *page_list,
> >  				      struct zone *zone,
> >  				      struct scan_control *sc,
> > -				      int priority)
> > +				      int priority,
> > +				      unsigned long *ret_nr_dirty)
> >  {
> >  	LIST_HEAD(ret_pages);
> >  	LIST_HEAD(free_pages);
> > @@ -971,6 +972,7 @@ keep_lumpy:
> >  
> >  	list_splice(&ret_pages, page_list);
> >  	count_vm_events(PGACTIVATE, pgactivate);
> > +	*ret_nr_dirty += nr_dirty;
> 
> Note that this includes anon pages, which means that swapping is
> throttled as well.
> 

Yes, it does. In the current revision of the series, I'm not using
nr_dirty as it throttles too aggressively. Instead, the number of pages
under writeback is counted and used for the throttling decision.
It still potentially includes anon pages, but that is reasonable.
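
Roughly, the throttling check then looks something like the following.
This is a sketch of the idea only; nr_writeback is an assumed name for
the count described above, not necessarily what the next revision uses:

	/*
	 * Throttle when "enough" of the isolated pages are under writeback:
	 * all of them at DEF_PRIORITY, half at DEF_PRIORITY-1, a quarter at
	 * DEF_PRIORITY-2 and so on.  wait_iff_congested() only sleeps if the
	 * zone/bdi is actually congested, otherwise it just yields.
	 */
	if (nr_writeback && nr_writeback >=
			(nr_taken >> (DEF_PRIORITY - priority)))
		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);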

> I don't think it is a downside to throttle swapping during IO
> congestion - waiting for pages under writeback to become reclaimable
> is better than kicking off even more IO in this case as well - but the
> changelog and the comments should include it, I guess.
> 

Fair point. I've updated the changelog accordingly. Thanks.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes
  2011-08-03 11:26   ` Johannes Weiner
@ 2011-08-03 13:57     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 13:57 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 01:26:30PM +0200, Johannes Weiner wrote:
> On Thu, Jul 21, 2011 at 05:28:49PM +0100, Mel Gorman wrote:
> > When direct reclaim encounters a dirty page, it gets recycled around
> > the LRU for another cycle. This patch marks the page PageReclaim
> > similar to deactivate_page() so that the page gets reclaimed almost
> > immediately after the page gets cleaned. This is to avoid reclaiming
> > clean pages that are younger than a dirty page encountered at the
> > end of the LRU that might have been something like a use-once page.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Apart from the naming of the counter (I like nr_reclaim_preferred),
> 

At the moment it's NR_VMSCAN_IMMEDIATE, and the name visible in
/proc/vmstat is nr_vmscan_immediate_reclaim.

> Acked-by: Johannes Weiner <jweiner@redhat.com>

Thanks.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd
  2011-08-03 11:37   ` Johannes Weiner
@ 2011-08-03 13:58     ` Mel Gorman
  0 siblings, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 13:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 01:37:06PM +0200, Johannes Weiner wrote:
> On Thu, Jul 21, 2011 at 05:28:50PM +0100, Mel Gorman wrote:
> > Assuming that flusher threads will always write back dirty pages promptly,
> > it is always faster for reclaimers to wait for the flushers. This patch
> > prevents kswapd from writing back any filesystem pages.
> > 
> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> 
> Relying on the flushers may mean that every dirty page in the system
> has to be written back before the pages from the zone of interest are
> clean.
> 

Yes.

> De-facto we have only one mechanism to stay on top of the dirty pages
> from a per-zone perspective, and that is single-page writeout from
> reclaim.
> 

Yes.

> While we all agree that this sucks, we cannot remove it unless we
> have a replacement that makes zones reclaimable in a reasonable time
> frame (or keeps them reclaimable in the first place, which is what
> per-zone dirty limits attempt to do).
> 
> As such, please include
> 
> Nacked-by: Johannes Weiner <jweiner@redhat.com>

I've already dropped the patch. If I could, I would have signed this at
the time as

Signed-off-but-naking-it-anyway: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-08-03 13:44       ` Mel Gorman
@ 2011-08-03 14:00         ` Johannes Weiner
  2011-08-03 14:18           ` Christoph Hellwig
  2011-08-03 14:35           ` Mel Gorman
  0 siblings, 2 replies; 43+ messages in thread
From: Johannes Weiner @ 2011-08-03 14:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 02:44:20PM +0100, Mel Gorman wrote:
> On Wed, Aug 03, 2011 at 01:06:29PM +0200, Johannes Weiner wrote:
> > On Wed, Aug 03, 2011 at 12:58:19PM +0200, Johannes Weiner wrote:
> > > On Thu, Jul 21, 2011 at 05:28:45PM +0100, Mel Gorman wrote:
> > > > Direct reclaim should never writeback pages. Warn if an attempt
> > > > is made.
> > > > 
> > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > 
> > > Acked-by: Johannes Weiner <jweiner@redhat.com>
> > 
> > Oops, too fast.
> > 
> > Shouldn't the WARN_ON() be at the top of the function, rather than
> > just warn when the write is deferred due to delalloc?
> 
> I thought it made more sense to put the warning at the point where ext4
> would normally ignore ->writepage.
> 
> That said, in my current revision of the series, I've dropped these
> patches altogether as page migration should be able to trigger the same
> warnings but be called from paths that are of less concern for stack
> overflows (or at the very least be looked at as a separate series).

Doesn't this only apply to btrfs, which has no .migratepage aop of its
own for file pages?  The others use buffer_migrate_page.

But if you dropped them anyway, it does not matter :)


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-08-03 14:00         ` Johannes Weiner
@ 2011-08-03 14:18           ` Christoph Hellwig
  2011-08-03 14:35           ` Mel Gorman
  1 sibling, 0 replies; 43+ messages in thread
From: Christoph Hellwig @ 2011-08-03 14:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Mel Gorman, Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim, josef

On Wed, Aug 03, 2011 at 04:00:19PM +0200, Johannes Weiner wrote:
> > That said, in my current revision of the series, I've dropped these
> > patches altogether as page migration should be able to trigger the same
> > warnings but be called from paths that are of less concern for stack
> > overflows (or at the very least be looked at as a separate series).
> 
> Doesn't this only apply to btrfs, which has no .migratepage aop of its
> own for file pages?  The others use buffer_migrate_page.
> 
> But if you dropped them anyway, it does not matter :)

Note that the mid-term plan is to kill ->writepage as an address space
method.  Besides the usage from reclaim and the callbacks to
write_cache_pages and write_one_page (which can be made explicit
arguments), the only remaining user is the above-mentioned fallback.

Josef, any chance you could switch btrfs over to implement a proper
->migratepage?
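
(For illustration, this is what the wiring typically looks like for a
filesystem whose data pages use buffer heads; it is a sketch rather than
a proposed patch, and whether buffer_migrate_page() or a btrfs-specific
helper is appropriate is exactly the open question:)

	static const struct address_space_operations btrfs_aops = {
		.writepage	= btrfs_writepage,
		/*
		 * With a real ->migratepage, clean page migration no longer
		 * has to fall back to ->writepage.  buffer_migrate_page() is
		 * the generic buffer-head helper; btrfs would likely need an
		 * implementation of its own here instead.
		 */
		.migratepage	= buffer_migrate_page,
		/* other methods omitted */
	};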


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH 3/8] ext4: Warn if direct reclaim tries to writeback pages
  2011-08-03 14:00         ` Johannes Weiner
  2011-08-03 14:18           ` Christoph Hellwig
@ 2011-08-03 14:35           ` Mel Gorman
  1 sibling, 0 replies; 43+ messages in thread
From: Mel Gorman @ 2011-08-03 14:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Linux-MM, LKML, XFS, Dave Chinner, Christoph Hellwig,
	Wu Fengguang, Jan Kara, Rik van Riel, Minchan Kim

On Wed, Aug 03, 2011 at 04:00:19PM +0200, Johannes Weiner wrote:
> On Wed, Aug 03, 2011 at 02:44:20PM +0100, Mel Gorman wrote:
> > On Wed, Aug 03, 2011 at 01:06:29PM +0200, Johannes Weiner wrote:
> > > On Wed, Aug 03, 2011 at 12:58:19PM +0200, Johannes Weiner wrote:
> > > > On Thu, Jul 21, 2011 at 05:28:45PM +0100, Mel Gorman wrote:
> > > > > Direct reclaim should never writeback pages. Warn if an attempt
> > > > > is made.
> > > > > 
> > > > > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > > > 
> > > > Acked-by: Johannes Weiner <jweiner@redhat.com>
> > > 
> > > Oops, too fast.
> > > 
> > > Shouldn't the WARN_ON() be at the top of the function, rather than
> > > just warn when the write is deferred due to delalloc?
> > 
> > I thought it made more sense to put the warning at the point where ext4
> > would normally ignore ->writepage.
> > 
> > That said, in my current revision of the series, I've dropped these
> > patches altogether as page migration should be able to trigger the same
> > warnings but be called from paths that are of less concern for stack
> > overflows (or at the very least be looked at as a separate series).
> 
> Doesn't this only apply to btrfs, which has no .migratepage aop of its
> own for file pages?  The others use buffer_migrate_page.
> 

Bah, you're right. It was btrfs I was looking at when I decided to drop
the patches, and I didn't think it through. I only needed to drop the
btrfs one.

> But if you dropped them anyway, it does not matter :)

I've put the xfs and ext4 checks back in. The ext4 check is still in the
same place.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2011-08-03 14:35 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-07-21 16:28 [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Mel Gorman
2011-07-21 16:28 ` [PATCH 1/8] mm: vmscan: Do not writeback filesystem pages in direct reclaim Mel Gorman
2011-07-31 15:06   ` Minchan Kim
2011-08-02 11:21     ` Mel Gorman
2011-07-21 16:28 ` [PATCH 2/8] xfs: Warn if direct reclaim tries to writeback pages Mel Gorman
2011-07-24 11:32   ` Christoph Hellwig
2011-07-25  8:19     ` Mel Gorman
2011-07-21 16:28 ` [PATCH 3/8] ext4: " Mel Gorman
2011-08-03 10:58   ` Johannes Weiner
2011-08-03 11:06     ` Johannes Weiner
2011-08-03 13:44       ` Mel Gorman
2011-08-03 14:00         ` Johannes Weiner
2011-08-03 14:18           ` Christoph Hellwig
2011-08-03 14:35           ` Mel Gorman
2011-07-21 16:28 ` [PATCH 4/8] btrfs: " Mel Gorman
2011-08-03 11:10   ` Johannes Weiner
2011-08-03 13:45     ` Mel Gorman
2011-07-21 16:28 ` [PATCH 5/8] mm: vmscan: Do not writeback filesystem pages in kswapd except in high priority Mel Gorman
2011-07-31 15:11   ` Minchan Kim
2011-07-21 16:28 ` [PATCH 6/8] mm: vmscan: Throttle reclaim if encountering too many dirty pages under writeback Mel Gorman
2011-07-31 15:17   ` Minchan Kim
2011-08-03 11:19   ` Johannes Weiner
2011-08-03 13:56     ` Mel Gorman
2011-07-21 16:28 ` [PATCH 7/8] mm: vmscan: Immediately reclaim end-of-LRU dirty pages when writeback completes Mel Gorman
2011-07-22 12:53   ` Peter Zijlstra
2011-07-22 13:23     ` Mel Gorman
2011-07-31 15:24       ` Minchan Kim
2011-08-02 11:25         ` Mel Gorman
2011-08-03 11:26   ` Johannes Weiner
2011-08-03 13:57     ` Mel Gorman
2011-07-21 16:28 ` [PATCH 8/8] mm: vmscan: Do not writeback filesystem pages from kswapd Mel Gorman
2011-07-22 12:57   ` Peter Zijlstra
2011-07-22 13:31     ` Mel Gorman
2011-08-03 11:37   ` Johannes Weiner
2011-08-03 13:58     ` Mel Gorman
2011-07-26 11:20 ` [RFC PATCH 0/8] Reduce filesystem writeback from page reclaim v2 Dave Chinner
2011-07-27  4:32 ` Minchan Kim
2011-07-27  7:37   ` Mel Gorman
2011-07-27 16:18 ` Minchan Kim
2011-07-28 11:38   ` Mel Gorman
2011-07-29  9:48     ` Minchan Kim
2011-07-29  9:50       ` Minchan Kim
2011-07-29 13:41         ` Andrew Lutomirski

This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).