* [PATCH 0/8] Reduce system disruption due to kswapd followup V3
@ 2013-05-29 23:17 Mel Gorman
  2013-05-29 23:17 ` [PATCH 1/8] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
                   ` (9 more replies)
  0 siblings, 10 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

tldr; Overall the system is getting less kicked in the face. Scan rates
	between zones are often more balanced than they used to be. There are
	now fewer writes from reclaim context and a reduction in IO wait
	times.

This series replaces all of the previous follow-up series. It was clear
that more of the stall logic needed to be in the same place so it is
comprehensible and easier to predict.

Changelog since V2
o Consolidate stall decisions into one place
o Add is_dirty_writeback for NFS
o Move accounting around

Further testing of the "Reduce system disruption due to kswapd" discovered
a few problems. First and foremost, it's possible for pages under writeback
to be freed which will lead to badness. Second, as pages were not being
swapped the file LRU was being scanned faster and clean file pages were
being reclaimed. In some cases this results in increased read IO to re-read
data from disk.  Third, more pages were being written from kswapd context
which can adversely affect IO performance. Lastly, it was observed that
PageDirty pages are not necessarily dirty on all filesystems (buffers can be
clean while PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g. ext3).
This disconnect confuses the reclaim stalling logic. This follow-up series
is aimed at these problems.

The tests were based on three kernels

vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
		kswapd" applied on top as per what should be in Andrew's tree
		right now
lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO
was in progress, as implemented by the parallel IO tests in MM Tests.
memcachetest benchmarks how many operations/second memcached
can service. It starts with no background IO on a freshly created ext4
filesystem and then re-runs the test with larger amounts of IO in the
background to roughly simulate a large copy in progress. The expectation
is that the IO should have little or no impact on memcachetest which is
running entirely in memory.

parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)

memcachetest is the transactions/second reported by memcachetest. In
        the vanilla kernel note that performance drops from around
        23K/sec to just over 4K/second when there is 2385M of IO going
        on in the background. With current mmotm, there is no collapse
	in performance and with this follow-up series there is little
	change.

swaptotal is the total amount of swap traffic. With mmotm and the follow-up
	series, the total amount of swapping is much reduced.


                                 3.9.0       3.9.0       3.9.0
                               vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
Minor Faults                  11160152    10706748    10622316
Major Faults                     46305         755         678
Swap Ins                        260249           0           0
Swap Outs                       683860          18          18
Direct pages scanned                 0         678        2520
Kswapd pages scanned           6046108     8814900     1639279
Kswapd pages reclaimed         1081954     1172267     1094635
Direct pages reclaimed               0         566        2304
Kswapd efficiency                  17%         13%         66%
Kswapd velocity               5217.560    7618.953    1414.879
Direct efficiency                 100%         83%         91%
Direct velocity                  0.000       0.586       2.175
Percentage direct scans             0%          0%          0%
Zone normal velocity          5105.086    6824.681     671.158
Zone dma32 velocity            112.473     794.858     745.896
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     1929612.000 6861768.000   32821.000
Page writes file               1245752     6861750       32803
Page writes anon                683860          18          18
Page reclaim immediate            7484          40         239
Sector Reads                   1130320       93996       86900
Sector Writes                 13508052    10823500    11804436
Page rescued immediate               0           0           0
Slabs scanned                    33536       27136       18560
Direct inode steals                  0           0           0
Kswapd inode steals               8641        1035           0
Kswapd skipped wait                  0           0           0
THP fault alloc                      8          37          33
THP collapse alloc                 508         552         515
THP splits                          24           1           1
THP fault fallback                   0           0           0
THP collapse fail                    0           0           0

There are a number of observations to make here

1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.

2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66%, indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages (the arithmetic behind these figures is spelled out after this list).

3. kswapd velocity is reduced, indicating that fewer pages are being scanned
   with the follow-up series as kswapd now stalls when the tail of the
   LRU queue is full of unqueued dirty pages. The stall gives flushers a
   chance to catch up so kswapd can reclaim clean pages when it wakes.

4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With mainline,
   you can see that the scanning activity is dominated by the Normal
   zone with over 45 times more scanning in Normal than the DMA32 zone.
   With the series currently in mmotm, the ratio is slightly better but it
   is still the case that the bulk of scanning is in the highest zone. With
   this follow-up series, the ratio of scanning between the Normal and
   DMA32 zone is roughly equal.

5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context, which is expected to adversely
   impact IO performance. With the follow-up patches, far fewer pages are
   written from kswapd context than with the mainline kernel.

6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no inodes
   were reclaimed.

7. Note that "Sectors Read" is drastically reduced, implying that the source
   data being used for the IO is not being aggressively discarded due to
   page reclaim skipping over dirty pages and reclaiming clean pages. Note
   that the reduction in reads could also be due to inode data not being
   re-read from disk after a slab shrink.
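
For reference, the arithmetic behind the efficiency and velocity figures in
points 2 and 4 works out as follows (assuming mmtests computes efficiency as
pages reclaimed per page scanned and velocity as pages scanned per second of
test time, which is consistent with the table above):

   kswapd efficiency (lessdisrupt-v7r10)    1094635 / 1639279  ~= 0.67, the 66% reported
   Normal:DMA32 scan velocity (vanilla)     5105.086 / 112.473 ~= 45, i.e. "over 45 times"
   Normal:DMA32 scan velocity (v7r10)       671.158 / 745.896  ~= 0.9, i.e. roughly equal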

                       3.9.0       3.9.0       3.9.0
                     vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
Mean sda-avgqz        166.99       32.09       33.44
Mean sda-await        853.64      192.76      185.43
Mean sda-r_await        6.31        9.24        5.97
Mean sda-w_await     2992.81      202.65      192.43
Max  sda-avgqz       1409.91      718.75      698.98
Max  sda-await       6665.74     3538.00     3124.23
Max  sda-r_await       58.96      111.95       58.00
Max  sda-w_await    28458.94     3977.29     3148.61

In light of the changes in writes from reclaim context, the number of
reads and Dave Chinner's concerns about IO performance, I took a closer
look at the IO stats for the test disk. A few observations:

1. The average queue size is reduced by the initial series and roughly
   the same with this follow up.

2. Average wait times for writes are reduced and as the IO
   is completing faster it at least implies that the gain is because
   flushers are writing the files efficiently instead of page reclaim
   getting in the way.

3. The reduction in maximum write latency is staggering: 28 seconds down
   to 3 seconds (see the conversion below).
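
Assuming the await columns are iostat's usual milliseconds, that maximum
write latency conversion is simply

   vanilla:              Max sda-w_await = 28458.94 ms ~= 28.5 seconds
   lessdisrupt-v7r10:    Max sda-w_await =  3148.61 ms ~=  3.1 seconds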


Jan Kara asked how NFS is affected by all of this. Unstable pages can
be taken into account, as one of the patches in the series shows, but it
is still the case that filesystems with unusual handling of dirty or
writeback pages could still be treated better.

Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.

I ran a longer-lived memcached test with IO going to NFS instead of a local disk.

parallelio
                                             3.9.0                       3.9.0                       3.9.0
                                           vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)

1. Performance does not collapse due to IO, which is good. IO is also completing
   faster. Note that with mmotm, IO completes in a third of the time and faster
   again with this series applied.

2. Swapping is reduced, although not eliminated. The figures for the follow-up
   look bad but it does vary a bit as the stalling is not perfect for NFS
   or filesystems like ext3 with unusual handling of dirty and writeback
   pages.

3. There are swapins, particularly with larger amounts of IO, indicating
   that active pages are being reclaimed. However, the number is much
   reduced.

                                 3.9.0       3.9.0       3.9.0
                               vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
Minor Faults                  36339175    35025445    35219699
Major Faults                    310964       27108       51887
Swap Ins                       2176399      173069      333316
Swap Outs                      3344050      357228      504824
Direct pages scanned              8972       77283       43242
Kswapd pages scanned          20899983     8939566    14772851
Kswapd pages reclaimed         6193156     5172605     5231026
Direct pages reclaimed            8450       73802       39514
Kswapd efficiency                  29%         57%         35%
Kswapd velocity               3929.743    1847.499    3058.840
Direct efficiency                  94%         95%         91%
Direct velocity                  1.687      15.972       8.954
Percentage direct scans             0%          0%          0%
Zone normal velocity          3721.907     939.103    2185.142
Zone dma32 velocity            209.522     924.368     882.651
Zone dma velocity                0.000       0.000       0.000
Page writes by reclaim     4082185.000  526319.000  537114.000
Page writes file                738135      169091       32290
Page writes anon               3344050      357228      504824
Page reclaim immediate            9524         170     5595843
Sector Reads                   8909900      861192     1483680
Sector Writes                 13428980     1488744     2076800
Page rescued immediate               0           0           0
Slabs scanned                    38016       31744       28672
Direct inode steals                  0           0           0
Kswapd inode steals                424           0           0
Kswapd skipped wait                  0           0           0
THP fault alloc                     14          15         119
THP collapse alloc                1767        1569        1618
THP splits                          30          29          25
THP fault fallback                   0           0           0
THP collapse fail                    8           5           0
Compaction stalls                   17          41         100
Compaction success                   7          31          95
Compaction failures                 10          10           5
Page migrate success              7083       22157       62217
Page migrate failure                 0           0           0
Compaction pages isolated        14847       48758      135830
Compaction migrate scanned       18328       48398      138929
Compaction free scanned        2000255      355827     1720269
Compaction cost                      7          24          68

I guess the main takeaway again is the much reduced number of page writes
from reclaim context and the reduced reads (rough percentages below).
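
Working from the table above:

   Page writes by reclaim:   537114 / 4082185 ~= 13% of vanilla (roughly an 87% reduction)
   Sector Reads:            1483680 / 8909900 ~= 17% of vanilla (roughly an 83% reduction)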

                       3.9.0       3.9.0       3.9.0
                     vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
Mean sda-avgqz         23.58        0.35        0.44
Mean sda-await        133.47       15.72       15.46
Mean sda-r_await        4.72        4.69        3.95
Mean sda-w_await      507.69       28.40       33.68
Max  sda-avgqz        680.60       12.25       23.14
Max  sda-await       3958.89      221.83      286.22
Max  sda-r_await       63.86       61.23       67.29
Max  sda-w_await    11710.38      883.57     1767.28

And as before, write wait times are much reduced.

 fs/block_dev.c              |   1 +
 fs/buffer.c                 |  34 +++++++++
 fs/ext3/inode.c             |   1 +
 fs/nfs/file.c               |  30 ++++++++
 include/linux/buffer_head.h |   3 +
 include/linux/fs.h          |   1 +
 mm/vmscan.c                 | 164 ++++++++++++++++++++++++++++++++------------
 7 files changed, 189 insertions(+), 45 deletions(-)

-- 
1.8.1.4


* [PATCH 1/8] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 2/8] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

The patch "mm: vmscan: Block kswapd if it is encountering pages
under writeback" stalls in congestion_wait it encounters a page under
writeback that is marked for immediate reclaim. Initially this was a
wait_on_page_writeback() but after the switch to congestion_wait(),
there is no guarantee the page has completed writeback and it can
be placed on a list for freeing.

This is a fix for
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback.patch

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b38ad..4a43c28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -766,8 +766,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    zone_is_reclaim_writeback(zone)) {
+				unlock_page(page);
 				congestion_wait(BLK_RW_ASYNC, HZ/10);
 				zone_clear_flag(zone, ZONE_WRITEBACK);
+				goto keep;
 
 			/* Case 2 above */
 			} else if (global_reclaim(sc) ||
-- 
1.8.1.4


* [PATCH 2/8] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
  2013-05-29 23:17 ` [PATCH 1/8] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 3/8] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to writeback pages from reclaim
context based on the number of dirty pages encountered. This situation
is flagged too easily and flushers are not given the chance to catch up
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced as
it happens within a PageDirty check which is nonsense as the dirty may have
been cleared for IO. The accounting is updated very late and pages that are
already under writeback, were reactivated, could not unmapped or could not
be released are all missed. Similarly, a page is considered congested for
reasons other than being congested and pages that cannot be written out
in the correct context are skipped. Finally, it considers stalling and
writing back filesystem pages due to encountering dirty anonymous pages
at the tail of the LRU which is dumb.

This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail of
the LRU were unqueued dirty pages. Before it starts writing filesystem pages,
it will stall to give flushers a chance to catch up. The decision on whether
to stall in wait_iff_congested is also now determined by dirty filesystem
pages only. Congestion is based on whether the underlying BDI is congested,
regardless of the context of the reclaiming process.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 48 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4a43c28..999ef0b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -669,6 +669,25 @@ static enum page_references page_check_references(struct page *page,
 	return PAGEREF_RECLAIM;
 }
 
+/* Check if a page is dirty or under writeback */
+static void page_check_dirty_writeback(struct page *page,
+				       bool *dirty, bool *writeback)
+{
+	/*
+	 * Anonymous pages are not handled by flushers and must be written
+	 * from reclaim context. Do not stall reclaim based on them
+	 */
+	if (!page_is_file_cache(page)) {
+		*dirty = false;
+		*writeback = false;
+		return;
+	}
+
+	/* By default assume that the page flags are accurate */
+	*dirty = PageDirty(page);
+	*writeback = PageWriteback(page);
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -697,6 +716,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		struct page *page;
 		int may_enter_fs;
 		enum page_references references = PAGEREF_RECLAIM_CLEAN;
+		bool dirty, writeback;
 
 		cond_resched();
 
@@ -725,6 +745,24 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
 
 		/*
+		 * The number of dirty pages determines if a zone is marked
+		 * reclaim_congested which affects wait_iff_congested. kswapd
+		 * will stall and start writing pages if the tail of the LRU
+		 * is all dirty unqueued pages.
+		 */
+		page_check_dirty_writeback(page, &dirty, &writeback);
+		if (dirty || writeback)
+			nr_dirty++;
+
+		if (dirty && !writeback)
+			nr_unqueued_dirty++;
+
+		/* Treat this page as congested if underlying BDI is */
+		mapping = page_mapping(page);
+		if (mapping && bdi_write_congested(mapping->backing_dev_info))
+			nr_congested++;
+
+		/*
 		 * If a page at the tail of the LRU is under writeback, there
 		 * are three cases to consider.
 		 *
@@ -819,9 +857,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (!add_to_swap(page, page_list))
 				goto activate_locked;
 			may_enter_fs = 1;
-		}
 
-		mapping = page_mapping(page);
+			/* Adding to swap updated mapping */
+			mapping = page_mapping(page);
+		}
 
 		/*
 		 * The page is mapped into the page tables of one or more
@@ -841,11 +880,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 		if (PageDirty(page)) {
-			nr_dirty++;
-
-			if (!PageWriteback(page))
-				nr_unqueued_dirty++;
-
 			/*
 			 * Only kswapd can writeback filesystem pages to
 			 * avoid risk of stack overflow but only writeback
@@ -876,7 +910,6 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/* Page is dirty, try to write it out here */
 			switch (pageout(page, mapping, sc)) {
 			case PAGE_KEEP:
-				nr_congested++;
 				goto keep_locked;
 			case PAGE_ACTIVATE:
 				goto activate_locked;
@@ -1318,7 +1351,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
-	unsigned long nr_dirty = 0;
+	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_writeback = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
@@ -1361,7 +1394,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return 0;
 
 	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
-					&nr_dirty, &nr_writeback, false);
+				&nr_unqueued_dirty, &nr_writeback, false);
 
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1416,11 +1449,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	/*
 	 * Similarly, if many dirty pages are encountered that are not
 	 * currently being written then flag that kswapd should start
-	 * writing back pages.
+	 * writing back pages and stall to give a chance for flushers
+	 * to catch up.
 	 */
-	if (global_reclaim(sc) && nr_dirty &&
-			nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority)))
+	if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
 		zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
+	}
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
-- 
1.8.1.4


* [PATCH 3/8] mm: vmscan: Stall page reclaim after a list of pages have been processed
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
  2013-05-29 23:17 ` [PATCH 1/8] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
  2013-05-29 23:17 ` [PATCH 2/8] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 4/8] mm: vmscan: Set zone flags before blocking Mel Gorman
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

Commit "mm: vmscan: Block kswapd if it is encountering pages under writeback"
blocks page reclaim if it encounters pages under writeback marked for
immediate reclaim. It blocks while pages are still isolated from the
LRU which is unnecessary. This patch defers the blocking until after the
isolated pages have been processed and tidies up some of the comments.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 49 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 999ef0b..5b1a79c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -697,6 +697,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      enum ttu_flags ttu_flags,
 				      unsigned long *ret_nr_unqueued_dirty,
 				      unsigned long *ret_nr_writeback,
+				      unsigned long *ret_nr_immediate,
 				      bool force_reclaim)
 {
 	LIST_HEAD(ret_pages);
@@ -707,6 +708,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_immediate = 0;
 
 	cond_resched();
 
@@ -773,8 +775,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 *    IO can complete. Waiting on the page itself risks an
 		 *    indefinite stall if it is impossible to writeback the
 		 *    page due to IO error or disconnected storage so instead
-		 *    block for HZ/10 or until some IO completes then clear the
-		 *    ZONE_WRITEBACK flag to recheck if the condition exists.
+		 *    note that the LRU is being scanned too quickly and the
+		 *    caller can stall after page list has been processed.
 		 *
 		 * 2) Global reclaim encounters a page, memcg encounters a
 		 *    page that is not marked for immediate reclaim or
@@ -804,10 +806,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    zone_is_reclaim_writeback(zone)) {
-				unlock_page(page);
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
-				zone_clear_flag(zone, ZONE_WRITEBACK);
-				goto keep;
+				nr_immediate++;
+				goto keep_locked;
 
 			/* Case 2 above */
 			} else if (global_reclaim(sc) ||
@@ -1033,6 +1033,7 @@ keep:
 	mem_cgroup_uncharge_end();
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
 	*ret_nr_writeback += nr_writeback;
+	*ret_nr_immediate += nr_immediate;
 	return nr_reclaimed;
 }
 
@@ -1044,7 +1045,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 		.priority = DEF_PRIORITY,
 		.may_unmap = 1,
 	};
-	unsigned long ret, dummy1, dummy2;
+	unsigned long ret, dummy1, dummy2, dummy3;
 	struct page *page, *next;
 	LIST_HEAD(clean_pages);
 
@@ -1057,7 +1058,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	ret = shrink_page_list(&clean_pages, zone, &sc,
 				TTU_UNMAP|TTU_IGNORE_ACCESS,
-				&dummy1, &dummy2, true);
+				&dummy1, &dummy2, &dummy3, true);
 	list_splice(&clean_pages, page_list);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -1353,6 +1354,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_taken;
 	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
 	struct zone *zone = lruvec_zone(lruvec);
@@ -1394,7 +1396,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return 0;
 
 	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
-				&nr_unqueued_dirty, &nr_writeback, false);
+			&nr_unqueued_dirty, &nr_writeback, &nr_immediate,
+			false);
 
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1447,14 +1450,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	}
 
 	/*
-	 * Similarly, if many dirty pages are encountered that are not
-	 * currently being written then flag that kswapd should start
-	 * writing back pages and stall to give a chance for flushers
-	 * to catch up.
+	 * memcg will stall in page writeback so only consider forcibly
+	 * stalling for global reclaim
 	 */
-	if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-		zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
+	if (global_reclaim(sc)) {
+		/*
+		 * If dirty pages are scanned that are not queued for IO, it
+		 * implies that flushers are not keeping up. In this case, flag
+		 * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing
+		 * pages from reclaim context. It will forcibly stall in the
+		 * next check.
+		 */
+		if (nr_unqueued_dirty == nr_taken)
+			zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY);
+
+		/*
+		 * In addition, if kswapd scans pages marked marked for
+		 * immediate reclaim and under writeback (nr_immediate), it
+		 * implies that pages are cycling through the LRU faster than
+		 * they are written so also forcibly stall.
+		 */
+		if (nr_unqueued_dirty == nr_taken || nr_immediate)
+			congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
-- 
1.8.1.4


* [PATCH 4/8] mm: vmscan: Set zone flags before blocking
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (2 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 3/8] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 5/8] mm: vmscan: Move direct reclaim wait_iff_congested into shrink_list Mel Gorman
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

In shrink_page_list a decision may be made to stall and flag a zone
as ZONE_WRITEBACK so that if a large number of unqueued dirty pages are
encountered later then the reclaimer will stall. Set ZONE_WRITEBACK before
potentially going to sleep so it is noticed sooner.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b1a79c..5f80d01 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1445,8 +1445,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 */
 	if (nr_writeback && nr_writeback >=
 			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 		zone_set_flag(zone, ZONE_WRITEBACK);
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
 	}
 
 	/*
-- 
1.8.1.4


* [PATCH 5/8] mm: vmscan: Move direct reclaim wait_iff_congested into shrink_list
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (3 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 4/8] mm: vmscan: Set zone flags before blocking Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 6/8] mm: vmscan: Treat pages marked for immediate reclaim as zone congestion Mel Gorman
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

shrink_inactive_list makes decisions on whether to stall based on the
number of dirty pages encountered. The wait_iff_congested() call in
do_try_to_free_pages does no such thing and is arbitrary.

This patch moves the decision on whether to set ZONE_CONGESTED and the
wait_iff_congested call into shrink_inactive_list. This keeps all the
decisions on whether to stall or not in the one place.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 62 ++++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f80d01..4898daf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -695,7 +695,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 				      struct zone *zone,
 				      struct scan_control *sc,
 				      enum ttu_flags ttu_flags,
+				      unsigned long *ret_nr_dirty,
 				      unsigned long *ret_nr_unqueued_dirty,
+				      unsigned long *ret_nr_congested,
 				      unsigned long *ret_nr_writeback,
 				      unsigned long *ret_nr_immediate,
 				      bool force_reclaim)
@@ -1017,20 +1019,13 @@ keep:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 
-	/*
-	 * Tag a zone as congested if all the dirty pages encountered were
-	 * backed by a congested BDI. In this case, reclaimers should just
-	 * back off and wait for congestion to clear because further reclaim
-	 * will encounter the same problem
-	 */
-	if (nr_dirty && nr_dirty == nr_congested && global_reclaim(sc))
-		zone_set_flag(zone, ZONE_CONGESTED);
-
 	free_hot_cold_page_list(&free_pages, 1);
 
 	list_splice(&ret_pages, page_list);
 	count_vm_events(PGACTIVATE, pgactivate);
 	mem_cgroup_uncharge_end();
+	*ret_nr_dirty += nr_dirty;
+	*ret_nr_congested += nr_congested;
 	*ret_nr_unqueued_dirty += nr_unqueued_dirty;
 	*ret_nr_writeback += nr_writeback;
 	*ret_nr_immediate += nr_immediate;
@@ -1045,7 +1040,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 		.priority = DEF_PRIORITY,
 		.may_unmap = 1,
 	};
-	unsigned long ret, dummy1, dummy2, dummy3;
+	unsigned long ret, dummy1, dummy2, dummy3, dummy4, dummy5;
 	struct page *page, *next;
 	LIST_HEAD(clean_pages);
 
@@ -1057,8 +1052,8 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	}
 
 	ret = shrink_page_list(&clean_pages, zone, &sc,
-				TTU_UNMAP|TTU_IGNORE_ACCESS,
-				&dummy1, &dummy2, &dummy3, true);
+			TTU_UNMAP|TTU_IGNORE_ACCESS,
+			&dummy1, &dummy2, &dummy3, &dummy4, &dummy5, true);
 	list_splice(&clean_pages, page_list);
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret);
 	return ret;
@@ -1352,6 +1347,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_taken;
+	unsigned long nr_dirty = 0;
+	unsigned long nr_congested = 0;
 	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_writeback = 0;
 	unsigned long nr_immediate = 0;
@@ -1396,8 +1393,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return 0;
 
 	nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
-			&nr_unqueued_dirty, &nr_writeback, &nr_immediate,
-			false);
+				&nr_dirty, &nr_unqueued_dirty, &nr_congested,
+				&nr_writeback, &nr_immediate,
+				false);
 
 	spin_lock_irq(&zone->lru_lock);
 
@@ -1431,7 +1429,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * same way balance_dirty_pages() manages.
 	 *
 	 * This scales the number of dirty pages that must be under writeback
-	 * before throttling depending on priority. It is a simple backoff
+	 * before a zone gets flagged ZONE_WRITEBACK. It is a simple backoff
 	 * function that has the most effect in the range DEF_PRIORITY to
 	 * DEF_PRIORITY-2 which is the priority reclaim is considered to be
 	 * in trouble and reclaim is considered to be in trouble.
@@ -1442,12 +1440,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 * ...
 	 * DEF_PRIORITY-6 For SWAP_CLUSTER_MAX isolated pages, throttle if any
 	 *                     isolated page is PageWriteback
+	 *
+	 * Once a zone is flagged ZONE_WRITEBACK, kswapd will count the number
+	 * of pages under pages flagged for immediate reclaim and stall if any
+	 * are encountered in the nr_immediate check below.
 	 */
 	if (nr_writeback && nr_writeback >=
-			(nr_taken >> (DEF_PRIORITY - sc->priority))) {
+			(nr_taken >> (DEF_PRIORITY - sc->priority)))
 		zone_set_flag(zone, ZONE_WRITEBACK);
-		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
-	}
 
 	/*
 	 * memcg will stall in page writeback so only consider forcibly
@@ -1455,6 +1455,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	 */
 	if (global_reclaim(sc)) {
 		/*
+		 * Tag a zone as congested if all the dirty pages scanned were
+		 * backed by a congested BDI and wait_iff_congested will stall.
+		 */
+		if (nr_dirty && nr_dirty == nr_congested)
+			zone_set_flag(zone, ZONE_CONGESTED);
+
+		/*
 		 * If dirty pages are scanned that are not queued for IO, it
 		 * implies that flushers are not keeping up. In this case, flag
 		 * the zone ZONE_TAIL_LRU_DIRTY and kswapd will start writing
@@ -1474,6 +1481,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 	}
 
+	/*
+	 * Stall direct reclaim for IO completions if underlying BDIs or zone
+	 * is congested. Allow kswapd to continue until it starts encountering
+	 * unqueued dirty pages or cycling through the LRU too quickly.
+	 */
+	if (!sc->hibernation_mode && !current_is_kswapd())
+		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
 		nr_scanned, nr_reclaimed,
@@ -2374,17 +2389,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
-
-		/* Take a nap, wait for some writeback to complete */
-		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    sc->priority < DEF_PRIORITY - 2) {
-			struct zone *preferred_zone;
-
-			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-						&cpuset_current_mems_allowed,
-						&preferred_zone);
-			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
-		}
 	} while (--sc->priority >= 0);
 
 out:
-- 
1.8.1.4


* [PATCH 6/8] mm: vmscan: Treat pages marked for immediate reclaim as zone congestion
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (4 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 5/8] mm: vmscan: Move direct reclaim wait_iff_congested into shrink_list Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 7/8] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

Currently a zone will only be marked congested if the underlying BDI
is congested, but if dirty pages are spread across zones it is possible
that an individual zone is full of dirty pages without being congested.
The impact is that the zone gets scanned very quickly, potentially reclaiming
really clean pages. This patch treats pages marked for immediate reclaim
as congested for the purposes of marking a zone ZONE_CONGESTED and
stalling in wait_iff_congested.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4898daf..bf47784 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -761,9 +761,15 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (dirty && !writeback)
 			nr_unqueued_dirty++;
 
-		/* Treat this page as congested if underlying BDI is */
+		/*
+		 * Treat this page as congested if the underlying BDI is or if
+		 * pages are cycling through the LRU so quickly that the
+		 * pages marked for immediate reclaim are making it to the
+		 * end of the LRU a second time.
+		 */
 		mapping = page_mapping(page);
-		if (mapping && bdi_write_congested(mapping->backing_dev_info))
+		if ((mapping && bdi_write_congested(mapping->backing_dev_info)) ||
+		    (writeback && PageReclaim(page)))
 			nr_congested++;
 
 		/*
-- 
1.8.1.4


* [PATCH 7/8] mm: vmscan: Take page buffers dirty and locked state into account
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (5 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 6/8] mm: vmscan: Treat pages marked for immediate reclaim as zone congestion Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-29 23:17 ` [PATCH 8/8] fs: nfs: Inform the VM about pages being committed or unstable Mel Gorman
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

Page reclaim keeps track of dirty and under-writeback pages and uses this
information to determine if wait_iff_congested() should stall or if kswapd
should begin writing back pages. This fails to account for buffer pages that
can be under writeback but not PageWriteback, which is the case for
filesystems like ext3 in ordered mode. Furthermore, PageDirty buffer pages
can have all their buffers clean, in which case writepage does no IO and the
page should not be accounted as congested.

This patch adds an address_space operation that filesystems may
optionally use to check if a page is really dirty or really under
writeback. An implementation for buffer_heads is added and used for
block devices and ext3 in ordered mode. By default the page flags
are obeyed.

Credit goes to Jan Kara for identifying that the page flags alone are
not sufficient for ext3 and sanity checking a number of ideas on how
the problem could be addressed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/block_dev.c              |  1 +
 fs/buffer.c                 | 34 ++++++++++++++++++++++++++++++++++
 fs/ext3/inode.c             |  1 +
 include/linux/buffer_head.h |  3 +++
 include/linux/fs.h          |  1 +
 mm/vmscan.c                 | 10 ++++++++++
 6 files changed, 50 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..9c8ebe4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1583,6 +1583,7 @@ static const struct address_space_operations def_blk_aops = {
 	.writepages	= generic_writepages,
 	.releasepage	= blkdev_releasepage,
 	.direct_IO	= blkdev_direct_IO,
+	.is_dirty_writeback = buffer_check_dirty_writeback,
 };
 
 const struct file_operations def_blk_fops = {
diff --git a/fs/buffer.c b/fs/buffer.c
index 1aa0836..4247aa9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -91,6 +91,40 @@ void unlock_buffer(struct buffer_head *bh)
 EXPORT_SYMBOL(unlock_buffer);
 
 /*
+ * Returns if the page has dirty or writeback buffers. If all the buffers
+ * are unlocked and clean then the PageDirty information is stale. If
+ * any of the pages are locked, it is assumed they are locked for IO.
+ */
+void buffer_check_dirty_writeback(struct page *page,
+				     bool *dirty, bool *writeback)
+{
+	struct buffer_head *head, *bh;
+	*dirty = false;
+	*writeback = false;
+
+	BUG_ON(!PageLocked(page));
+
+	if (!page_has_buffers(page))
+		return;
+
+	if (PageWriteback(page))
+		*writeback = true;
+
+	head = page_buffers(page);
+	bh = head;
+	do {
+		if (buffer_locked(bh))
+			*writeback = true;
+
+		if (buffer_dirty(bh))
+			*dirty = true;
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+}
+EXPORT_SYMBOL(buffer_check_dirty_writeback);
+
+/*
  * Block until a buffer comes unlocked.  This doesn't stop it
  * from becoming locked again - you have to lock it yourself
  * if you want to preserve its state.
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 23c7128..8e590bd 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1984,6 +1984,7 @@ static const struct address_space_operations ext3_ordered_aops = {
 	.direct_IO		= ext3_direct_IO,
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate  = block_is_partially_uptodate,
+	.is_dirty_writeback	= buffer_check_dirty_writeback,
 	.error_remove_page	= generic_error_remove_page,
 };
 
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6d9f5a2..d458880 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -139,6 +139,9 @@ BUFFER_FNS(Prio, prio)
 	})
 #define page_has_buffers(page)	PagePrivate(page)
 
+void buffer_check_dirty_writeback(struct page *page,
+				     bool *dirty, bool *writeback);
+
 /*
  * Declarations
  */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0a9a6766..96f857f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -380,6 +380,7 @@ struct address_space_operations {
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
+	void (*is_dirty_writeback) (struct page *, bool *, bool *);
 	int (*error_remove_page)(struct address_space *, struct page *);
 
 	/* swapfile support */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bf47784..c857943 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -673,6 +673,8 @@ static enum page_references page_check_references(struct page *page,
 static void page_check_dirty_writeback(struct page *page,
 				       bool *dirty, bool *writeback)
 {
+	struct address_space *mapping;
+
 	/*
 	 * Anonymous pages are not handled by flushers and must be written
 	 * from reclaim context. Do not stall reclaim based on them
@@ -686,6 +688,14 @@ static void page_check_dirty_writeback(struct page *page,
 	/* By default assume that the page flags are accurate */
 	*dirty = PageDirty(page);
 	*writeback = PageWriteback(page);
+
+	/* Verify dirty/writeback state if the filesystem supports it */
+	if (!page_has_private(page))
+		return;
+
+	mapping = page_mapping(page);
+	if (mapping && mapping->a_ops->is_dirty_writeback)
+		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
 /*
-- 
1.8.1.4


* [PATCH 8/8] fs: nfs: Inform the VM about pages being committed or unstable
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (6 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 7/8] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
@ 2013-05-29 23:17 ` Mel Gorman
  2013-05-30 10:10 ` [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
  2013-07-15 14:21 ` Hush Bensen
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-29 23:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

VM page reclaim uses dirty and writeback page states to determine if
flushers are cleaning pages too slowly and whether page reclaim should
stall waiting on flushers to catch up. Page state in NFS is a bit
more complex and a clean page can be unreclaimable due to being
unstable, which is effectively "dirty" from the perspective of the
VM in reclaim context. Similarly, if the inode is currently being
committed then it is similar to being under writeback.

This patch adds an is_dirty_writeback() handler for NFS that checks
if a page's backing inode is being committed, in which case the page should
be accounted as writeback, and whether the page has private state indicating
that it is effectively dirty.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/nfs/file.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index a87a44f..a4250a4 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -493,6 +493,35 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	return nfs_fscache_release_page(page, gfp);
 }
 
+static void nfs_check_dirty_writeback(struct page *page,
+				bool *dirty, bool *writeback)
+{
+	struct nfs_inode *nfsi;
+	struct address_space *mapping = page_file_mapping(page);
+
+	if (!mapping || PageSwapCache(page))
+		return;
+
+	/*
+	 * Check if an unstable page is currently being committed and
+	 * if so, have the VM treat it as if the page is under writeback
+	 * so it will not block due to pages that will shortly be freeable.
+	 */
+	nfsi = NFS_I(mapping->host);
+	if (test_bit(NFS_INO_COMMIT, &nfsi->flags)) {
+		*writeback = true;
+		return;
+	}
+
+	/*
+	 * If PagePrivate() is set, then the page is not freeable and as the
+	 * inode is not being committed, it's not going to be cleaned in the
+	 * near future so treat it as dirty
+	 */
+	if (PagePrivate(page))
+		*dirty = true;
+}
+
 /*
  * Attempt to clear the private state associated with a page when an error
  * occurs that requires the cached contents of an inode to be written back or
@@ -540,6 +569,7 @@ const struct address_space_operations nfs_file_aops = {
 	.direct_IO = nfs_direct_IO,
 	.migratepage = nfs_migrate_page,
 	.launder_page = nfs_launder_page,
+	.is_dirty_writeback = nfs_check_dirty_writeback,
 	.error_remove_page = generic_error_remove_page,
 #ifdef CONFIG_NFS_SWAP
 	.swap_activate = nfs_swap_activate,
-- 
1.8.1.4


* Re: [PATCH 0/8] Reduce system disruption due to kswapd followup V3
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (7 preceding siblings ...)
  2013-05-29 23:17 ` [PATCH 8/8] fs: nfs: Inform the VM about pages being committed or unstable Mel Gorman
@ 2013-05-30 10:10 ` Mel Gorman
  2013-07-15 14:21 ` Hush Bensen
  9 siblings, 0 replies; 11+ messages in thread
From: Mel Gorman @ 2013-05-30 10:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML

On Thu, May 30, 2013 at 12:17:29AM +0100, Mel Gorman wrote:
> tldr; Overall the system is getting less kicked in the face. Scan rates
> 	between zones is often more balanced than it used to be. There are
> 	now fewer writes from reclaim context and a reduction in IO wait
> 	times.
> 
> This series replaces all of the previous follow-up series. It was clear
> that more of the stall logic needed to be in the same place so it is
> comprehensible and easier to predict.
> 

There was some unfortunate crossover in timing: I see mmotm has pulled
in the previous follow-up series. It would probably be easiest to replace
these patches

mm-vmscan-stall-page-reclaim-and-writeback-pages-based-on-dirty-writepage-pages-encountered.patch
mm-vmscan-stall-page-reclaim-after-a-list-of-pages-have-been-processed.patch
mm-vmscan-take-page-buffers-dirty-and-locked-state-into-account.patch
mm-vmscan-stall-page-reclaim-and-writeback-pages-based-on-dirty-writepage-pages-encountered.patch

with patches 2-8 of this series. The fixup patch
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback-fix-2.patch
is still the same.

Sorry for the inconvenience.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 0/8] Reduce system disruption due to kswapd followup V3
  2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
                   ` (8 preceding siblings ...)
  2013-05-30 10:10 ` [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
@ 2013-07-15 14:21 ` Hush Bensen
  9 siblings, 0 replies; 11+ messages in thread
From: Hush Bensen @ 2013-07-15 14:21 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
	Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
	Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML

On 2013/5/30 7:17, Mel Gorman wrote:
> tldr; Overall the system is getting less kicked in the face. Scan rates
> 	between zones is often more balanced than it used to be. There are
> 	now fewer writes from reclaim context and a reduction in IO wait
> 	times.
>
> This series replaces all of the previous follow-up series. It was clear
> that more of the stall logic needed to be in the same place so it is
> comprehensible and easier to predict.
>
> Changelog since V2
> o Consolidate stall decisions into one place
> o Add is_dirty_writeback for NFS
> o Move accounting around
>
> Further testing of the "Reduce system disruption due to kswapd" discovered
> a few problems. First and foremost, it's possible for pages under writeback
> to be freed which will lead to badness. Second, as pages were not being
> swapped the file LRU was being scanned faster and clean file pages were
> being reclaimed. In some cases this results in increased read IO to re-read
> data from disk.  Third, more pages were being written from kswapd context
> which can adversly affect IO performance. Lastly, it was observed that
> PageDirty pages are not necessarily dirty on all filesystems (buffers can be
> clean while PageDirty is set and ->writepage generates no IO) and not all
> filesystems set PageWriteback when the page is being written (e.g. ext3).
> This disconnect confuses the reclaim stalling logic. This follow-up series
> is aimed at these problems.
>
> The tests were based on three kernels
>
> vanilla:	kernel 3.9 as that is what the current mmotm uses as a baseline
> mmotm-20130522	is mmotm as of 22nd May with "Reduce system disruption due to
> 		kswapd" applied on top as per what should be in Andrew's tree
> 		right now
> lessdisrupt-v7r10 is this follow-up series on top of the mmotm kernel
>
> The first test used memcached+memcachetest while some background IO
> was in progress as implemented by the parallel IO tests implement in
> MM Tests. memcachetest benchmarks how many operations/second memcached
> can service. It starts with no background IO on a freshly created ext4
> filesystem and then re-runs the test with larger amounts of IO in the
> background to roughly simulate a large copy in progress. The expectation
> is that the IO should have little or no impact on memcachetest which is
> running entirely in memory.
>
> parallelio
>                                              3.9.0                       3.9.0                       3.9.0
>                                            vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
> Ops memcachetest-0M             23117.00 (  0.00%)          22780.00 ( -1.46%)          22763.00 ( -1.53%)
> Ops memcachetest-715M           23774.00 (  0.00%)          23299.00 ( -2.00%)          22934.00 ( -3.53%)
> Ops memcachetest-2385M           4208.00 (  0.00%)          24154.00 (474.00%)          23765.00 (464.76%)
> Ops memcachetest-4055M           4104.00 (  0.00%)          25130.00 (512.33%)          24614.00 (499.76%)
> Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops io-duration-715M               12.00 (  0.00%)              7.00 ( 41.67%)              6.00 ( 50.00%)
> Ops io-duration-2385M             116.00 (  0.00%)             21.00 ( 81.90%)             21.00 ( 81.90%)
> Ops io-duration-4055M             160.00 (  0.00%)             36.00 ( 77.50%)             35.00 ( 78.12%)
> Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-715M             140138.00 (  0.00%)             18.00 ( 99.99%)             18.00 ( 99.99%)
> Ops swaptotal-2385M            385682.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-4055M            418029.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-715M                   144.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-2385M               134227.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-4055M               125618.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops minorfaults-0M            1536429.00 (  0.00%)        1531632.00 (  0.31%)        1533541.00 (  0.19%)
> Ops minorfaults-715M          1786996.00 (  0.00%)        1612148.00 (  9.78%)        1608832.00 (  9.97%)
> Ops minorfaults-2385M         1757952.00 (  0.00%)        1614874.00 (  8.14%)        1613541.00 (  8.21%)
> Ops minorfaults-4055M         1774460.00 (  0.00%)        1633400.00 (  7.95%)        1630881.00 (  8.09%)
> Ops majorfaults-0M                  1.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops majorfaults-715M              184.00 (  0.00%)            167.00 (  9.24%)            166.00 (  9.78%)
> Ops majorfaults-2385M           24444.00 (  0.00%)            155.00 ( 99.37%)             93.00 ( 99.62%)
> Ops majorfaults-4055M           21357.00 (  0.00%)            147.00 ( 99.31%)            134.00 ( 99.37%)
>
> memcachetest is the transactions/second reported by memcachetest. In
>         the vanilla kernel, note that performance drops from around
>         23K/sec to just over 4K/sec when there is 2385M of IO going
>         on in the background. With current mmotm, there is no collapse
>         in performance and with this follow-up series there is little
>         change.
>
> swaptotal is the total amount of swap traffic. With mmotm and the follow-up
> 	series, the total amount of swapping is much reduced.
>
>
>                                  3.9.0       3.9.0       3.9.0
>                                vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
> Minor Faults                  11160152    10706748    10622316
> Major Faults                     46305         755         678
> Swap Ins                        260249           0           0
> Swap Outs                       683860          18          18
> Direct pages scanned                 0         678        2520
> Kswapd pages scanned           6046108     8814900     1639279
> Kswapd pages reclaimed         1081954     1172267     1094635
> Direct pages reclaimed               0         566        2304
> Kswapd efficiency                  17%         13%         66%
> Kswapd velocity               5217.560    7618.953    1414.879
> Direct efficiency                 100%         83%         91%
> Direct velocity                  0.000       0.586       2.175
> Percentage direct scans             0%          0%          0%
> Zone normal velocity          5105.086    6824.681     671.158
> Zone dma32 velocity            112.473     794.858     745.896
> Zone dma velocity                0.000       0.000       0.000
> Page writes by reclaim     1929612.000 6861768.000   32821.000
> Page writes file               1245752     6861750       32803
> Page writes anon                683860          18          18
> Page reclaim immediate            7484          40         239
> Sector Reads                   1130320       93996       86900
> Sector Writes                 13508052    10823500    11804436
> Page rescued immediate               0           0           0
> Slabs scanned                    33536       27136       18560
> Direct inode steals                  0           0           0
> Kswapd inode steals               8641        1035           0
> Kswapd skipped wait                  0           0           0
> THP fault alloc                      8          37          33
> THP collapse alloc                 508         552         515
> THP splits                          24           1           1
> THP fault fallback                   0           0           0
> THP collapse fail                    0           0           0

Which mmtests config did you use for this one?

>
> There are a number of observations to make here
>
> 1. Swap outs are almost eliminated. Swap ins are 0 indicating that the
>    pages swapped were really unused anonymous pages. Related to that,
>    major faults are much reduced.
>
> 2. kswapd efficiency was impacted by the initial series but with these
>    follow-up patches, the efficiency is now at 66% indicating that far
>    fewer pages were skipped during scanning due to dirty or writeback
>    pages.
>
> 3. kswapd velocity is reduced, indicating that fewer pages are being scanned
>    with the follow-up series as kswapd now stalls when the tail of the
>    LRU queue is full of unqueued dirty pages. The stall gives flushers a
>    chance to catch up so kswapd can reclaim clean pages when it wakes.
>
> 4. In light of Zlatko's recent reports about zone scanning imbalances,
>    mmtests now reports scanning velocity on a per-zone basis. With mainline,
>    you can see that the scanning activity is dominated by the Normal
>    zone with over 45 times more scanning in Normal than the DMA32 zone.
>    With the series currently in mmotm, the ratio is slightly better but it
>    is still the case that the bulk of scanning is in the highest zone. With
>    this follow-up series, the ratio of scanning between the Normal and
>    DMA32 zone is roughly equal.
>
> 5. As Dave Chinner observed, the current patches in mmotm increased the
>    number of pages written from kswapd context which is expected to adversely
>    impact IO performance. With the follow-up patches, far fewer pages are
>    written from kswapd context than the mainline kernel.
>
> 6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
>    the follow-up series, there is less slab shrinking activity and no inodes
>    were reclaimed.
>
> 7. Note that "Sector Reads" is drastically reduced, implying that the source
>    data being used for the IO is not being aggressively discarded due to
>    page reclaim skipping over dirty pages and reclaiming clean pages. Note
>    that the reduction in reads could also be due to inode data not being
>    re-read from disk after a slab shrink.
>
>                        3.9.0       3.9.0       3.9.0
>                      vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
> Mean sda-avgqz        166.99       32.09       33.44
> Mean sda-await        853.64      192.76      185.43
> Mean sda-r_await        6.31        9.24        5.97
> Mean sda-w_await     2992.81      202.65      192.43
> Max  sda-avgqz       1409.91      718.75      698.98
> Max  sda-await       6665.74     3538.00     3124.23
> Max  sda-r_await       58.96      111.95       58.00
> Max  sda-w_await    28458.94     3977.29     3148.61
>
> In light of the changes in writes from reclaim context, the number of
> reads and Dave Chinner's concerns about IO performance, I took a closer
> look at the IO stats for the test disk. A few observations:
>
> 1. The average queue size is reduced by the initial series and is roughly
>    the same with this follow-up.
>
> 2. Average wait times for writes are reduced and, as the IO
>    is completing faster, it at least implies that the gain is because
>    flushers are writing the files efficiently instead of page reclaim
>    getting in the way.
>
> 3. The reduction in maximum write latency is staggering. 28 seconds down
>    to 3 seconds.
>
>
> Jan Kara asked how NFS is affected by all of this. Unstable pages can
> be taken into account as one of the patches in the series shows, but it
> is still the case that filesystems with unusual handling of dirty or
> writeback pages could be treated better.
>
> Tests like postmark, fsmark and largedd turned up nothing useful. On my test
> setup, pages are simply not being written back from reclaim context with or
> without the patches and there are no changes in performance. My test setup
> is probably just not strong enough network-wise to be really interesting.
>
> I ran a longer-lived memcached test with IO going to NFS instead of a local disk
>
> parallelio
>                                              3.9.0                       3.9.0                       3.9.0
>                                            vanilla          mm1-mmotm-20130522       mm1-lessdisrupt-v7r10
> Ops memcachetest-0M             23323.00 (  0.00%)          23241.00 ( -0.35%)          23321.00 ( -0.01%)
> Ops memcachetest-715M           25526.00 (  0.00%)          24763.00 ( -2.99%)          23242.00 ( -8.95%)
> Ops memcachetest-2385M           8814.00 (  0.00%)          26924.00 (205.47%)          23521.00 (166.86%)
> Ops memcachetest-4055M           5835.00 (  0.00%)          26827.00 (359.76%)          25560.00 (338.05%)
> Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops io-duration-715M               65.00 (  0.00%)             71.00 ( -9.23%)             11.00 ( 83.08%)
> Ops io-duration-2385M             129.00 (  0.00%)             94.00 ( 27.13%)             53.00 ( 58.91%)
> Ops io-duration-4055M             301.00 (  0.00%)            100.00 ( 66.78%)            108.00 ( 64.12%)
> Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swaptotal-715M              14394.00 (  0.00%)            949.00 ( 93.41%)             63.00 ( 99.56%)
> Ops swaptotal-2385M            401483.00 (  0.00%)          24437.00 ( 93.91%)          30118.00 ( 92.50%)
> Ops swaptotal-4055M            554123.00 (  0.00%)          35688.00 ( 93.56%)          63082.00 ( 88.62%)
> Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)              0.00 (  0.00%)
> Ops swapin-715M                  4522.00 (  0.00%)            560.00 ( 87.62%)             63.00 ( 98.61%)
> Ops swapin-2385M               169861.00 (  0.00%)           5026.00 ( 97.04%)          13917.00 ( 91.81%)
> Ops swapin-4055M               192374.00 (  0.00%)          10056.00 ( 94.77%)          25729.00 ( 86.63%)
> Ops minorfaults-0M            1445969.00 (  0.00%)        1520878.00 ( -5.18%)        1454024.00 ( -0.56%)
> Ops minorfaults-715M          1557288.00 (  0.00%)        1528482.00 (  1.85%)        1535776.00 (  1.38%)
> Ops minorfaults-2385M         1692896.00 (  0.00%)        1570523.00 (  7.23%)        1559622.00 (  7.87%)
> Ops minorfaults-4055M         1654985.00 (  0.00%)        1581456.00 (  4.44%)        1596713.00 (  3.52%)
> Ops majorfaults-0M                  0.00 (  0.00%)              1.00 (-99.00%)              0.00 (  0.00%)
> Ops majorfaults-715M              763.00 (  0.00%)            265.00 ( 65.27%)             75.00 ( 90.17%)
> Ops majorfaults-2385M           23861.00 (  0.00%)            894.00 ( 96.25%)           2189.00 ( 90.83%)
> Ops majorfaults-4055M           27210.00 (  0.00%)           1569.00 ( 94.23%)           4088.00 ( 84.98%)
>
> 1. Performance does not collapse due to IO which is good. IO is also completing
>    faster. Note that with mmotm, IO completes in a third of the time and faster
>    again with this series applied.
>
> 2. Swapping is reduced, although not eliminated. The figures for the follow-up
>    look bad but it does vary a bit as the stalling is not perfect for NFS
>    or filesystems like ext3 with unusual handling of dirty and writeback
>    pages.
>
> 3. There are swapins, particularly with larger amounts of IO, indicating
>    that active pages are being reclaimed. However, the number is much
>    reduced.
>
>                                  3.9.0       3.9.0       3.9.0
>                                vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
> Minor Faults                  36339175    35025445    35219699
> Major Faults                    310964       27108       51887
> Swap Ins                       2176399      173069      333316
> Swap Outs                      3344050      357228      504824
> Direct pages scanned              8972       77283       43242
> Kswapd pages scanned          20899983     8939566    14772851
> Kswapd pages reclaimed         6193156     5172605     5231026
> Direct pages reclaimed            8450       73802       39514
> Kswapd efficiency                  29%         57%         35%
> Kswapd velocity               3929.743    1847.499    3058.840
> Direct efficiency                  94%         95%         91%
> Direct velocity                  1.687      15.972       8.954
> Percentage direct scans             0%          0%          0%
> Zone normal velocity          3721.907     939.103    2185.142
> Zone dma32 velocity            209.522     924.368     882.651
> Zone dma velocity                0.000       0.000       0.000
> Page writes by reclaim     4082185.000  526319.000  537114.000
> Page writes file                738135      169091       32290
> Page writes anon               3344050      357228      504824
> Page reclaim immediate            9524         170     5595843
> Sector Reads                   8909900      861192     1483680
> Sector Writes                 13428980     1488744     2076800
> Page rescued immediate               0           0           0
> Slabs scanned                    38016       31744       28672
> Direct inode steals                  0           0           0
> Kswapd inode steals                424           0           0
> Kswapd skipped wait                  0           0           0
> THP fault alloc                     14          15         119
> THP collapse alloc                1767        1569        1618
> THP splits                          30          29          25
> THP fault fallback                   0           0           0
> THP collapse fail                    8           5           0
> Compaction stalls                   17          41         100
> Compaction success                   7          31          95
> Compaction failures                 10          10           5
> Page migrate success              7083       22157       62217
> Page migrate failure                 0           0           0
> Compaction pages isolated        14847       48758      135830
> Compaction migrate scanned       18328       48398      138929
> Compaction free scanned        2000255      355827     1720269
> Compaction cost                      7          24          68
>
> I guess the main takeaway again is the much-reduced page writes
> from reclaim context and the reduced reads.
>
>                        3.9.0       3.9.0       3.9.0
>                      vanilla   mm1-mmotm-20130522   mm1-lessdisrupt-v7r10
> Mean sda-avgqz         23.58        0.35        0.44
> Mean sda-await        133.47       15.72       15.46
> Mean sda-r_await        4.72        4.69        3.95
> Mean sda-w_await      507.69       28.40       33.68
> Max  sda-avgqz        680.60       12.25       23.14
> Max  sda-await       3958.89      221.83      286.22
> Max  sda-r_await       63.86       61.23       67.29
> Max  sda-w_await    11710.38      883.57     1767.28
>
> And as before, write wait times are much reduced.
>
>  fs/block_dev.c              |   1 +
>  fs/buffer.c                 |  34 +++++++++
>  fs/ext3/inode.c             |   1 +
>  fs/nfs/file.c               |  30 ++++++++
>  include/linux/buffer_head.h |   3 +
>  include/linux/fs.h          |   1 +
>  mm/vmscan.c                 | 164 ++++++++++++++++++++++++++++++++------------
>  7 files changed, 189 insertions(+), 45 deletions(-)
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2013-07-15 14:21 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-29 23:17 [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
2013-05-29 23:17 ` [PATCH 1/8] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
2013-05-29 23:17 ` [PATCH 2/8] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman
2013-05-29 23:17 ` [PATCH 3/8] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
2013-05-29 23:17 ` [PATCH 4/8] mm: vmscan: Set zone flags before blocking Mel Gorman
2013-05-29 23:17 ` [PATCH 5/8] mm: vmscan: Move direct reclaim wait_iff_congested into shrink_list Mel Gorman
2013-05-29 23:17 ` [PATCH 6/8] mm: vmscan: Treat pages marked for immediate reclaim as zone congestion Mel Gorman
2013-05-29 23:17 ` [PATCH 7/8] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
2013-05-29 23:17 ` [PATCH 8/8] fs: nfs: Inform the VM about pages being committed or unstable Mel Gorman
2013-05-30 10:10 ` [PATCH 0/8] Reduce system disruption due to kswapd followup V3 Mel Gorman
2013-07-15 14:21 ` Hush Bensen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).