* [PATCH 0/2] Reduce system disruption due to kswapd followup
@ 2013-05-27 13:02 Mel Gorman
  2013-05-27 13:02 ` [PATCH 1/4] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Mel Gorman @ 2013-05-27 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

tldr; Overall the system is getting less kicked in the face. Scan rates
between zones are often more balanced than they used to be. There are now
fewer writes from reclaim context and a reduction in IO wait times.
Performance on NFS could be further improved if it used a new aops callback
to identify unstable pages as "dirty".

Further testing of the "Reduce system disruption due to kswapd" series
discovered a few problems. First and foremost, it's possible for pages under
writeback to be freed, which will lead to badness. Second, as pages were not
being swapped, the file LRU was being scanned faster and clean file pages
were being reclaimed. In some cases this results in increased read IO to
re-read data from disk. Third, more pages were being written from kswapd
context, which can adversely affect IO performance. Lastly, it was observed
that PageDirty pages are not necessarily dirty on all filesystems (buffers
can be clean while PageDirty is set and ->writepage generates no IO) and
that not all filesystems set PageWriteback when the page is being written
(e.g. ext3). This disconnect confuses the reclaim stalling logic. This
follow-up series is aimed at these problems.

The tests were based on three kernels

vanilla:	 kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	 is mmotm as of 22nd May with "Reduce system disruption due
		 to kswapd" applied on top, as per what should be in Andrew's
		 tree right now
lessdisrupt-v6r4 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO was in
progress, as implemented by the parallel IO tests in MM Tests. memcachetest
benchmarks how many operations/second memcached can service. It starts with
no background IO on a freshly created ext4 filesystem and then re-runs the
test with larger amounts of IO in the background to roughly simulate a large
copy in progress. The expectation is that the IO should have little or no
impact on memcachetest, which is running entirely in memory.
parallelio
                                           3.9.0                 3.9.0                 3.9.0
                                         vanilla    mm1-mmotm-20130522  mm1-lessdisrupt-v6r4
Ops memcachetest-0M           23117.00 (  0.00%)    22780.00 ( -1.46%)    22833.00 ( -1.23%)
Ops memcachetest-715M         23774.00 (  0.00%)    23299.00 ( -2.00%)    23188.00 ( -2.46%)
Ops memcachetest-2385M         4208.00 (  0.00%)    24154.00 (474.00%)    23728.00 (463.88%)
Ops memcachetest-4055M         4104.00 (  0.00%)    25130.00 (512.33%)    24220.00 (490.16%)
Ops io-duration-0M                0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops io-duration-715M             12.00 (  0.00%)        7.00 ( 41.67%)        7.00 ( 41.67%)
Ops io-duration-2385M           116.00 (  0.00%)       21.00 ( 81.90%)       21.00 ( 81.90%)
Ops io-duration-4055M           160.00 (  0.00%)       36.00 ( 77.50%)       35.00 ( 78.12%)
Ops swaptotal-0M                  0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swaptotal-715M           140138.00 (  0.00%)       18.00 ( 99.99%)       18.00 ( 99.99%)
Ops swaptotal-2385M          385682.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swaptotal-4055M          418029.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swapin-0M                     0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swapin-715M                 144.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swapin-2385M             134227.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swapin-4055M             125618.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops minorfaults-0M          1536429.00 (  0.00%)  1531632.00 (  0.31%)  1580984.00 ( -2.90%)
Ops minorfaults-715M        1786996.00 (  0.00%)  1612148.00 (  9.78%)  1609175.00 (  9.95%)
Ops minorfaults-2385M       1757952.00 (  0.00%)  1614874.00 (  8.14%)  1612031.00 (  8.30%)
Ops minorfaults-4055M       1774460.00 (  0.00%)  1633400.00 (  7.95%)  1617945.00 (  8.82%)
Ops majorfaults-0M                1.00 (  0.00%)        0.00 (  0.00%)       22.00 (-2100.00%)
Ops majorfaults-715M            184.00 (  0.00%)      167.00 (  9.24%)      157.00 ( 14.67%)
Ops majorfaults-2385M         24444.00 (  0.00%)      155.00 ( 99.37%)      162.00 ( 99.34%)
Ops majorfaults-4055M         21357.00 (  0.00%)      147.00 ( 99.31%)      160.00 ( 99.25%)

memcachetest is the transactions/second reported by memcachetest. In the
vanilla kernel, note that performance drops from around 23K/sec to just over
4K/second when there is 2385M of IO going on in the background. With current
mmotm, there is no collapse in performance and with this follow-up series
there is little change.

swaptotal is the total amount of swap traffic. With mmotm and the follow-up
series, the total amount of swapping is much reduced.
                                  3.9.0                 3.9.0                 3.9.0
                                vanilla    mm1-mmotm-20130522  mm1-lessdisrupt-v6r4
Minor Faults                   11160152              10706748              10728322
Major Faults                      46305                   755                   787
Swap Ins                         260249                     0                     0
Swap Outs                        683860                    18                    18
Direct pages scanned                  0                   678                 21756
Kswapd pages scanned            6046108               8814900               1673198
Kswapd pages reclaimed          1081954               1172267               1089195
Direct pages reclaimed                0                   566                 19835
Kswapd efficiency                   17%                   13%                   65%
Kswapd velocity                5217.560              7618.953              1446.740
Direct efficiency                  100%                   83%                   91%
Direct velocity                   0.000                 0.586                18.811
Percentage direct scans              0%                    0%                    1%
Zone normal velocity           5105.086              6824.681               720.905
Zone dma32 velocity             112.473               794.858               744.646
Zone dma velocity                 0.000                 0.000                 0.000
Page writes by reclaim      1929612.000           6861768.000             25772.000
Page writes file                1245752               6861750                 25754
Page writes anon                 683860                    18                    18
Page reclaim immediate             7484                    40                   507
Sector Reads                    1130320                 93996                102788
Sector Writes                  13508052              10823500              10792360
Page rescued immediate                0                     0                     0
Slabs scanned                     33536                 27136                 36864
Direct inode steals                   0                     0                     0
Kswapd inode steals                8641                  1035                     0
Kswapd skipped wait                   0                     0                     0
THP fault alloc                       8                    37                    38
THP collapse alloc                  508                   552                   559
THP splits                           24                     1                     0
THP fault fallback                    0                     0                     0
THP collapse fail                     0                     0                     0
Compaction stalls                     0                     0                     3
Compaction success                    0                     0                     0
Compaction failures                   0                     0                     3
Page migrate success                  0                     0                     0
Page migrate failure                  0                     0                     0
Compaction pages isolated             0                     0                     0
Compaction migrate scanned            0                     0                     0
Compaction free scanned               0                     0                     0
Compaction cost                       0                     0                     0
NUMA PTE updates                      0                     0                     0
NUMA hint faults                      0                     0                     0
NUMA hint local faults                0                     0                     0
NUMA pages migrated                   0                     0                     0
AutoNUMA cost                         0                     0                     0

There are a number of observations to make here

1. Swap outs are almost eliminated. Swap ins are 0, indicating that the
   pages swapped were really unused anonymous pages. Related to that,
   major faults are much reduced.

2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 65%, indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.

3. kswapd velocity is reduced, indicating that fewer pages are being
   scanned with the follow-up series as kswapd now stalls when the tail
   of the LRU queue is full of unqueued dirty pages. The stall gives
   flushers a chance to catch up so kswapd can reclaim clean pages when
   it wakes.

4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With
   mainline, you can see that the scanning activity is dominated by the
   Normal zone with over 45 times more scanning in Normal than the DMA32
   zone. With the series currently in mmotm, the ratio is slightly better
   but it is still the case that the bulk of scanning is in the highest
   zone. With this follow-up series, the ratio of scanning between the
   Normal and DMA32 zones is roughly equal.

5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context, which is expected to
   adversely impact IO performance. With the follow-up patches, far fewer
   pages are written from kswapd context than in the mainline kernel.

6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no
   inodes were reclaimed.

7. Note that "Sector Reads" is drastically reduced, implying that the
   source data being used for the IO is not being aggressively discarded
   due to page reclaim skipping over dirty pages and reclaiming clean
   pages. Note that the reduction in reads could also be due to inode data
   not being re-read from disk after a slab shrink.
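A quick note on the derived figures in that table before moving on to the IO
stats: "efficiency" is pages reclaimed expressed as a percentage of pages
scanned, and the velocity figures are pages scanned per second of test time.
The sketch below shows the calculation; it is not the mmtests reporting code
itself and the elapsed time is an assumed value because it is not part of the
report.

/*
 * Sketch only: how the efficiency and velocity figures are derived from
 * the raw counters. elapsed_secs is an assumption (roughly 1159s,
 * back-derived from the vanilla kswapd velocity above).
 */
#include <stdio.h>

static void report_reclaim(const char *who, unsigned long scanned,
			   unsigned long reclaimed, double elapsed_secs)
{
	/* efficiency: fraction of scanned pages that were actually reclaimed */
	printf("%s efficiency %lu%%\n", who,
	       scanned ? (100 * reclaimed) / scanned : 100);
	/* velocity: pages scanned per second of test time */
	printf("%s velocity   %.3f\n", who, scanned / elapsed_secs);
}

int main(void)
{
	/* vanilla kernel kswapd figures from the table above */
	report_reclaim("kswapd", 6046108, 1081954, 1159.0);
	return 0;
}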
                              3.9.0              3.9.0              3.9.0
                            vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v6r4
Mean sda-avgqz               166.99              32.09              32.39
Mean sda-await               853.64             192.76             164.65
Mean sda-r_await               6.31               9.24               7.28
Mean sda-w_await            2992.81             202.65             171.99
Max sda-avgqz               1409.91             718.75             693.31
Max sda-await               6665.74            3538.00            2972.46
Max sda-r_await               58.96             111.95              84.04
Max sda-w_await            28458.94            3977.29            3002.72

In light of the changes in writes from reclaim context, the number of reads
and Dave Chinner's concerns about IO performance, I took a closer look at
the IO stats for the test disk. A few observations

1. The average queue size is reduced by the initial series and is roughly
   the same with this follow-up.

2. Average wait times for writes are massively reduced and, as the IO is
   completing faster, it at least implies that the gain is because flushers
   are writing the files efficiently instead of page reclaim getting in
   the way.

3. The reduction in average write latency is staggering: 28 seconds down
   to 3 seconds.

Jan Kara asked how NFS is affected by all of this. There is an open question
on whether the VM is treating unstable pages correctly and the answer is
"no, it's not". As unstable pages cannot be reclaimed, they should probably
be treated as dirty. An initial patch to do this exists but will be treated
as a follow-up to this series if this series gets pulled in. Tests indicate
that current behaviour is not as good as it could be but is still an
improvement.

Tests like postmark, fsmark and largedd showed up nothing useful. On my test
setup, pages are simply not being written back from reclaim context with or
without the patches and there are no changes in performance. My test setup
probably is just not strong enough network-wise to be really interesting.

I ran a longer-lived memcached test with IO going to NFS instead of a local
disk

                                           3.9.0                 3.9.0                 3.9.0
                                         vanilla    mm1-mmotm-20130522  mm1-lessdisrupt-v6r4
Ops memcachetest-0M           23323.00 (  0.00%)    23241.00 ( -0.35%)    23281.00 ( -0.18%)
Ops memcachetest-715M         25526.00 (  0.00%)    24763.00 ( -2.99%)    23654.00 ( -7.33%)
Ops memcachetest-2385M         8814.00 (  0.00%)    26924.00 (205.47%)    24034.00 (172.68%)
Ops memcachetest-4055M         5835.00 (  0.00%)    26827.00 (359.76%)    25293.00 (333.47%)
Ops io-duration-0M                0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops io-duration-715M             65.00 (  0.00%)       71.00 ( -9.23%)       14.00 ( 78.46%)
Ops io-duration-2385M           129.00 (  0.00%)       94.00 ( 27.13%)       43.00 ( 66.67%)
Ops io-duration-4055M           301.00 (  0.00%)      100.00 ( 66.78%)       75.00 ( 75.08%)
Ops swaptotal-0M                  0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swaptotal-715M            14394.00 (  0.00%)      949.00 ( 93.41%)     2232.00 ( 84.49%)
Ops swaptotal-2385M          401483.00 (  0.00%)    24437.00 ( 93.91%)    34772.00 ( 91.34%)
Ops swaptotal-4055M          554123.00 (  0.00%)    35688.00 ( 93.56%)    38432.00 ( 93.06%)
Ops swapin-0M                     0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
Ops swapin-715M                4522.00 (  0.00%)      560.00 ( 87.62%)       32.00 ( 99.29%)
Ops swapin-2385M             169861.00 (  0.00%)     5026.00 ( 97.04%)    11844.00 ( 93.03%)
Ops swapin-4055M             192374.00 (  0.00%)    10056.00 ( 94.77%)    13630.00 ( 92.91%)
Ops minorfaults-0M          1445969.00 (  0.00%)  1520878.00 ( -5.18%)  1526865.00 ( -5.59%)
Ops minorfaults-715M        1557288.00 (  0.00%)  1528482.00 (  1.85%)  1529207.00 (  1.80%)
Ops minorfaults-2385M       1692896.00 (  0.00%)  1570523.00 (  7.23%)  1569154.00 (  7.31%)
Ops minorfaults-4055M       1654985.00 (  0.00%)  1581456.00 (  4.44%)  1514596.00 (  8.48%)
Ops majorfaults-0M                0.00 (  0.00%)        1.00 (-99.00%)        2.00 (-99.00%)
Ops majorfaults-715M            763.00 (  0.00%)      265.00 ( 65.27%)       85.00 ( 88.86%)
Ops majorfaults-2385M         23861.00 (  0.00%)      894.00 ( 96.25%)     2241.00 ( 90.61%)
Ops majorfaults-4055M         27210.00 (  0.00%)     1569.00 ( 94.23%)     2543.00 ( 90.65%)

1. Performance does not collapse due to IO, which is good. IO is also
   completing faster. Note that with mmotm, IO completes in a third of the
   time and faster again with this series applied.

2. Swapping is reduced, although not eliminated.

3. There are swapins, particularly with larger amounts of IO, indicating
   that active pages are being reclaimed. However, the number is much
   reduced.

So the series helps even on NFS, where the VM is not accounting for unstable
pages; it's still an improvement. I'm not going through the vmstat figures in
detail but IO from reclaim context is a tenth of what it is in 3.9 with
balanced scanning between the zones.

                              3.9.0              3.9.0              3.9.0
                            vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v6r4
Mean sda-avgqz                23.58               0.35               0.56
Mean sda-await               133.47              15.72              17.06
Mean sda-r_await               4.72               4.69               5.49
Mean sda-w_await             507.69              28.40              35.07
Max sda-avgqz                680.60              12.25              71.45
Max sda-await               3958.89             221.83             379.46
Max sda-r_await               63.86              61.23              88.58
Max sda-w_await            11710.38             883.57            1858.22

And as before, wait times are much reduced.

 fs/block_dev.c              |  1 +
 fs/buffer.c                 | 34 ++++++++++++++++++
 fs/ext3/inode.c             |  1 +
 include/linux/buffer_head.h |  3 ++
 include/linux/fs.h          |  1 +
 mm/vmscan.c                 | 86 +++++++++++++++++++++++++++++++++++----------
 6 files changed, 108 insertions(+), 18 deletions(-)

-- 
1.8.1.4

Mel Gorman (4):
  mm: vmscan: Block kswapd if it is encountering pages under writeback -fix
  mm: vmscan: Stall page reclaim and writeback pages based on
    dirty/writepage pages encountered
  mm: vmscan: Stall page reclaim after a list of pages have been processed
  mm: vmscan: Take page buffers dirty and locked state into account

 fs/block_dev.c              |  1 +
 fs/buffer.c                 | 34 +++++++++++++++++
 fs/ext3/inode.c             |  1 +
 include/linux/buffer_head.h |  3 ++
 include/linux/fs.h          |  1 +
 mm/vmscan.c                 | 89 +++++++++++++++++++++++++++++++++++----------
 6 files changed, 110 insertions(+), 19 deletions(-)

-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread
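Returning to the NFS question above: the follow-up patch mentioned there
would most naturally plug into the same is_dirty_writeback address_space
operation that patch 4 of this series introduces. The fragment below is only
a sketch of the idea, not that follow-up patch; nfs_page_needs_commit() is a
hypothetical helper standing in for however NFS tracks pages that have been
written to the server but not yet committed (unstable pages).

/*
 * Sketch only, not the actual follow-up patch referred to above.
 * nfs_page_needs_commit() is a hypothetical helper; the point is simply
 * that pages awaiting a COMMIT cannot be reclaimed and so should be
 * reported to the VM as dirty rather than clean.
 */
static void nfs_check_dirty_writeback(struct page *page,
				      bool *dirty, bool *writeback)
{
	*dirty = PageDirty(page) || nfs_page_needs_commit(page);
	*writeback = PageWriteback(page);
}

Wired up via .is_dirty_writeback in NFS's address_space_operations, this
would let reclaim stall for unstable pages the same way it now stalls for
ordinary unqueued dirty pages.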
* [PATCH 1/4] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix
  2013-05-27 13:02 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman
@ 2013-05-27 13:02 ` Mel Gorman
  2013-05-27 13:02 ` [PATCH 2/4] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2013-05-27 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

The patch "mm: vmscan: Block kswapd if it is encountering pages under
writeback" stalls in congestion_wait when it encounters a page under
writeback that is marked for immediate reclaim. Initially this was a
wait_on_page_writeback() but after the switch to congestion_wait() there
is no guarantee the page has completed writeback, yet it could still be
placed on a list for freeing.

This is a fix for
mm-vmscan-block-kswapd-if-it-is-encountering-pages-under-writeback.patch

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1b38ad..4a43c28 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -766,8 +766,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			if (current_is_kswapd() &&
 			    PageReclaim(page) &&
 			    zone_is_reclaim_writeback(zone)) {
+				unlock_page(page);
 				congestion_wait(BLK_RW_ASYNC, HZ/10);
 				zone_clear_flag(zone, ZONE_WRITEBACK);
+				goto keep;
 
 			/* Case 2 above */
 			} else if (global_reclaim(sc) ||
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 8+ messages in thread
* [PATCH 2/4] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered
  2013-05-27 13:02 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman
  2013-05-27 13:02 ` [PATCH 1/4] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
@ 2013-05-27 13:02 ` Mel Gorman
  2013-05-27 13:02 ` [PATCH 3/4] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
  2013-05-27 13:02 ` [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
  3 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2013-05-27 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

The patch "mm: vmscan: Have kswapd writeback pages based on dirty pages
encountered, not priority" decides whether to write back pages from reclaim
context based on the number of dirty pages encountered. This situation is
flagged too easily and flushers are not given the chance to catch up,
resulting in more pages being written from reclaim context and potentially
impacting IO performance. The check for PageWriteback is also misplaced as
it happens within a PageDirty check, which is nonsense as the dirty bit may
have been cleared for IO. The accounting is updated very late and pages that
are already under writeback, were reactivated, could not be unmapped or
could not be released are all missed. Finally, it considers stalling and
writing back filesystem pages due to encountering dirty anonymous pages at
the tail of the LRU which is dumb.

This patch causes kswapd to begin writing filesystem pages from reclaim
context only if page reclaim found that all filesystem pages at the tail of
the LRU were unqueued dirty pages. Before it starts writing filesystem
pages, it will stall to give flushers a chance to catch up. The decision on
whether to stall in wait_iff_congested() is also now determined by dirty
filesystem pages only.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 52 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4a43c28..be8e445 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -669,6 +669,27 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; } +/* Check if a page is dirty or under writeback */ +static void page_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback) +{ + struct address_space *mapping; + + /* + * Anonymous pages are not handled by flushers and must be written + * from reclaim context.
Do not stall reclaim based on them + */ + if (!page_is_file_cache(page)) { + *dirty = false; + *writeback = false; + return; + } + + /* By default assume that the page flags are accurate */ + *dirty = PageDirty(page); + *writeback = PageWriteback(page); +} + /* * shrink_page_list() returns the number of reclaimed pages */ @@ -697,6 +718,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, struct page *page; int may_enter_fs; enum page_references references = PAGEREF_RECLAIM_CLEAN; + bool dirty, writeback; cond_resched(); @@ -725,6 +747,19 @@ static unsigned long shrink_page_list(struct list_head *page_list, (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO)); /* + * The number of dirty pages determines if a zone is marked + * reclaim_congested which affects wait_iff_congested. kswapd + * will stall and start writing pages if the tail of the LRU + * is all dirty unqueued pages. + */ + page_check_dirty_writeback(page, &dirty, &writeback); + if (dirty || writeback) + nr_dirty++; + + if (dirty && !writeback) + nr_unqueued_dirty++; + + /* * If a page at the tail of the LRU is under writeback, there * are three cases to consider. * @@ -841,11 +876,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, } if (PageDirty(page)) { - nr_dirty++; - - if (!PageWriteback(page)) - nr_unqueued_dirty++; - /* * Only kswapd can writeback filesystem pages to * avoid risk of stack overflow but only writeback @@ -1318,7 +1348,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, unsigned long nr_scanned; unsigned long nr_reclaimed = 0; unsigned long nr_taken; - unsigned long nr_dirty = 0; + unsigned long nr_unqueued_dirty = 0; unsigned long nr_writeback = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); @@ -1361,7 +1391,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_dirty, &nr_writeback, false); + &nr_unqueued_dirty, &nr_writeback, false); spin_lock_irq(&zone->lru_lock); @@ -1416,11 +1446,13 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, /* * Similarly, if many dirty pages are encountered that are not * currently being written then flag that kswapd should start - * writing back pages. + * writing back pages and stall to give a chance for flushers + * to catch up. */ - if (global_reclaim(sc) && nr_dirty && - nr_dirty >= (nr_taken >> (DEF_PRIORITY - sc->priority))) + if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) { + congestion_wait(BLK_RW_ASYNC, HZ/10); zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + } trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, zone_idx(zone), -- 1.8.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 3/4] mm: vmscan: Stall page reclaim after a list of pages have been processed 2013-05-27 13:02 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman 2013-05-27 13:02 ` [PATCH 1/4] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman 2013-05-27 13:02 ` [PATCH 2/4] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman @ 2013-05-27 13:02 ` Mel Gorman 2013-05-27 13:02 ` [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman 3 siblings, 0 replies; 8+ messages in thread From: Mel Gorman @ 2013-05-27 13:02 UTC (permalink / raw) To: Andrew Morton Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic, Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner, Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman Commit "mm: vmscan: Block kswapd if it is encountering pages under writeback" blocks page reclaim if it encounters pages under writeback marked for immediate reclaim. It blocks while pages are still isolated from the LRU which is necessary. This patch defers the blocking until after the isolated pages have been processed. Signed-off-by: Mel Gorman <mgorman@suse.de> --- mm/vmscan.c | 41 +++++++++++++++++++++++++---------------- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index be8e445..f576bcc 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -699,6 +699,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, enum ttu_flags ttu_flags, unsigned long *ret_nr_unqueued_dirty, unsigned long *ret_nr_writeback, + unsigned long *ret_nr_immediate, bool force_reclaim) { LIST_HEAD(ret_pages); @@ -709,6 +710,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, unsigned long nr_congested = 0; unsigned long nr_reclaimed = 0; unsigned long nr_writeback = 0; + unsigned long nr_immediate = 0; cond_resched(); @@ -770,8 +772,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, * IO can complete. Waiting on the page itself risks an * indefinite stall if it is impossible to writeback the * page due to IO error or disconnected storage so instead - * block for HZ/10 or until some IO completes then clear the - * ZONE_WRITEBACK flag to recheck if the condition exists. + * note that the LRU is being scanned too quickly and the + * caller can stall after page list has been processed. 
* * 2) Global reclaim encounters a page, memcg encounters a * page that is not marked for immediate reclaim or @@ -801,10 +803,8 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (current_is_kswapd() && PageReclaim(page) && zone_is_reclaim_writeback(zone)) { - unlock_page(page); - congestion_wait(BLK_RW_ASYNC, HZ/10); - zone_clear_flag(zone, ZONE_WRITEBACK); - goto keep; + nr_immediate++; + goto keep_locked; /* Case 2 above */ } else if (global_reclaim(sc) || @@ -1030,6 +1030,7 @@ keep: mem_cgroup_uncharge_end(); *ret_nr_unqueued_dirty += nr_unqueued_dirty; *ret_nr_writeback += nr_writeback; + *ret_nr_immediate += nr_immediate; return nr_reclaimed; } @@ -1041,7 +1042,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, .priority = DEF_PRIORITY, .may_unmap = 1, }; - unsigned long ret, dummy1, dummy2; + unsigned long ret, dummy1, dummy2, dummy3; struct page *page, *next; LIST_HEAD(clean_pages); @@ -1054,7 +1055,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, ret = shrink_page_list(&clean_pages, zone, &sc, TTU_UNMAP|TTU_IGNORE_ACCESS, - &dummy1, &dummy2, true); + &dummy1, &dummy2, &dummy3, true); list_splice(&clean_pages, page_list); __mod_zone_page_state(zone, NR_ISOLATED_FILE, -ret); return ret; @@ -1350,6 +1351,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, unsigned long nr_taken; unsigned long nr_unqueued_dirty = 0; unsigned long nr_writeback = 0; + unsigned long nr_immediate = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); struct zone *zone = lruvec_zone(lruvec); @@ -1391,7 +1393,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, return 0; nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP, - &nr_unqueued_dirty, &nr_writeback, false); + &nr_unqueued_dirty, &nr_writeback, &nr_immediate, false); spin_lock_irq(&zone->lru_lock); @@ -1444,14 +1446,21 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, } /* - * Similarly, if many dirty pages are encountered that are not - * currently being written then flag that kswapd should start - * writing back pages and stall to give a chance for flushers - * to catch up. + * Similarly, if pages marked for immediate reclaim and under writeback + * are encountered it implies that pages are cycling through the LRU + * faster than they can be written. If dirty pages are encountered that + * are not queued for IO, it implies that flushers are not keeping up. + * In this case, be more aggressive about stalling and start writing + * pages from reclaim context if necessary. */ - if (global_reclaim(sc) && nr_unqueued_dirty == nr_taken) { - congestion_wait(BLK_RW_ASYNC, HZ/10); - zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); + if (global_reclaim(sc)) { + if (nr_unqueued_dirty == nr_taken || nr_immediate) { + congestion_wait(BLK_RW_ASYNC, HZ/10); + zone_clear_flag(zone, ZONE_WRITEBACK); + } + + if (nr_unqueued_dirty == nr_taken) + zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY); } trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id, -- 1.8.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account
  2013-05-27 13:02 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman
                   ` (2 preceding siblings ...)
  2013-05-27 13:02 ` [PATCH 3/4] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
@ 2013-05-27 13:02 ` Mel Gorman
  2013-05-29 19:53   ` Andrew Morton
  3 siblings, 1 reply; 8+ messages in thread
From: Mel Gorman @ 2013-05-27 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

Page reclaim keeps track of dirty and under writeback pages and uses it to
determine if wait_iff_congested() should stall or if kswapd should begin
writing back pages. This fails to account for buffer pages that can be under
writeback but not PageWriteback which is the case for filesystems like ext3
ordered mode. Furthermore, PageDirty buffer pages can have all the buffers
clean and writepage does no IO so it should not be accounted as congested.

This patch adds an address_space operation that filesystems may optionally
use to check if a page is really dirty or really under writeback. An
implementation is provided for buffer_heads and is used for block
operations and ext3 in ordered mode. By default the page flags are obeyed.

Credit goes to Jan Kara for identifying that the page flags alone are not
sufficient for ext3 and sanity checking a number of ideas on how the
problem could be addressed.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/block_dev.c              |  1 +
 fs/buffer.c                 | 34 ++++++++++++++++++++++++++++++++++
 fs/ext3/inode.c             |  1 +
 include/linux/buffer_head.h |  3 +++
 include/linux/fs.h          |  1 +
 mm/vmscan.c                 |  8 ++++++++
 6 files changed, 48 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..9c8ebe4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1583,6 +1583,7 @@ static const struct address_space_operations def_blk_aops = { .writepages = generic_writepages, .releasepage = blkdev_releasepage, .direct_IO = blkdev_direct_IO, + .is_dirty_writeback = buffer_check_dirty_writeback, }; const struct file_operations def_blk_fops = {
diff --git a/fs/buffer.c b/fs/buffer.c
index 1aa0836..4247aa9 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -91,6 +91,40 @@ void unlock_buffer(struct buffer_head *bh) EXPORT_SYMBOL(unlock_buffer); /* + * Returns if the page has dirty or writeback buffers. If all the buffers + * are unlocked and clean then the PageDirty information is stale. If + * any of the pages are locked, it is assumed they are locked for IO. + */ +void buffer_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback) +{ + struct buffer_head *head, *bh; + *dirty = false; + *writeback = false; + + BUG_ON(!PageLocked(page)); + + if (!page_has_buffers(page)) + return; + + if (PageWriteback(page)) + *writeback = true; + + head = page_buffers(page); + bh = head; + do { + if (buffer_locked(bh)) + *writeback = true; + + if (buffer_dirty(bh)) + *dirty = true; + + bh = bh->b_this_page; + } while (bh != head); +} +EXPORT_SYMBOL(buffer_check_dirty_writeback); + +/* * Block until a buffer comes unlocked. This doesn't stop it * from becoming locked again - you have to lock it yourself * if you want to preserve its state.
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c index 23c7128..8e590bd 100644 --- a/fs/ext3/inode.c +++ b/fs/ext3/inode.c @@ -1984,6 +1984,7 @@ static const struct address_space_operations ext3_ordered_aops = { .direct_IO = ext3_direct_IO, .migratepage = buffer_migrate_page, .is_partially_uptodate = block_is_partially_uptodate, + .is_dirty_writeback = buffer_check_dirty_writeback, .error_remove_page = generic_error_remove_page, }; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 6d9f5a2..d458880 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -139,6 +139,9 @@ BUFFER_FNS(Prio, prio) }) #define page_has_buffers(page) PagePrivate(page) +void buffer_check_dirty_writeback(struct page *page, + bool *dirty, bool *writeback); + /* * Declarations */ diff --git a/include/linux/fs.h b/include/linux/fs.h index 0a9a6766..96f857f 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -380,6 +380,7 @@ struct address_space_operations { int (*launder_page) (struct page *); int (*is_partially_uptodate) (struct page *, read_descriptor_t *, unsigned long); + void (*is_dirty_writeback) (struct page *, bool *, bool *); int (*error_remove_page)(struct address_space *, struct page *); /* swapfile support */ diff --git a/mm/vmscan.c b/mm/vmscan.c index f576bcc..6237725 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -688,6 +688,14 @@ static void page_check_dirty_writeback(struct page *page, /* By default assume that the page flags are accurate */ *dirty = PageDirty(page); *writeback = PageWriteback(page); + + /* Verify dirty/writeback state if the filesystem supports it */ + if (!page_has_private(page)) + return; + + mapping = page_mapping(page); + if (mapping && mapping->a_ops->is_dirty_writeback) + mapping->a_ops->is_dirty_writeback(page, dirty, writeback); } /* -- 1.8.1.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 8+ messages in thread
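For comparison with the ext3 hookup above, opting another buffer_head-based
filesystem into the new check is a one-line addition to its
address_space_operations. The earlier v5 posting (appended at the end of
this thread) touched ext2, ext4, gfs2, ntfs, ocfs2 and xfs in this fashion.
The fragment below is only an illustrative sketch, not a complete aops
table.

/*
 * Sketch: a buffer_head-based filesystem opting in to the accurate
 * dirty/writeback check. Only the new line is shown; the rest of the
 * aops table is elided.
 */
static const struct address_space_operations example_aops = {
	/* ... the filesystem's existing methods ... */
	.is_dirty_writeback	= buffer_check_dirty_writeback,
};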
* Re: [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account
  2013-05-27 13:02 ` [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
@ 2013-05-29 19:53   ` Andrew Morton
  2013-05-29 22:28     ` Jan Kara
  0 siblings, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2013-05-29 19:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML

On Mon, 27 May 2013 14:02:58 +0100 Mel Gorman <mgorman@suse.de> wrote:

> Page reclaim keeps track of dirty and under writeback pages and uses it to
> determine if wait_iff_congested() should stall or if kswapd should begin
> writing back pages. This fails to account for buffer pages that can be under
> writeback but not PageWriteback which is the case for filesystems like ext3
> ordered mode. Furthermore, PageDirty buffer pages can have all the buffers
> clean and writepage does no IO so it should not be accounted as congested.

iirc, the PageDirty-all-buffers-clean state is pretty rare.  It might
not be worth bothering about?

> This patch adds an address_space operation that filesystems may optionally
> use to check if a page is really dirty or really under writeback.

address_space_operations methods are Documented in
Documentation/filesystems/vfs.txt ;)

> An implementation is provided for buffer_heads and is used for block
> operations and ext3 in ordered mode. By default the page flags are obeyed.
>
> Credit goes to Jan Kara for identifying that the page flags alone are not
> sufficient for ext3 and sanity checking a number of ideas on how the
> problem could be addressed.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread
* Re: [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account 2013-05-29 19:53 ` Andrew Morton @ 2013-05-29 22:28 ` Jan Kara 0 siblings, 0 replies; 8+ messages in thread From: Jan Kara @ 2013-05-29 22:28 UTC (permalink / raw) To: Andrew Morton Cc: Mel Gorman, Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic, Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner, Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML On Wed 29-05-13 12:53:56, Andrew Morton wrote: > On Mon, 27 May 2013 14:02:58 +0100 Mel Gorman <mgorman@suse.de> wrote: > > > Page reclaim keeps track of dirty and under writeback pages and uses it to > > determine if wait_iff_congested() should stall or if kswapd should begin > > writing back pages. This fails to account for buffer pages that can be under > > writeback but not PageWriteback which is the case for filesystems like ext3 > > ordered mode. Furthermore, PageDirty buffer pages can have all the buffers > > clean and writepage does no IO so it should not be accounted as congested. > > iirc, the PageDirty-all-buffers-clean state is pretty rare. It might > not be worth bothering about? Not true for ext3 in data=ordered mode. In some workloads, kjournald ends up writing most of the data during journal commit and that exactly leaves dirty pages with clean buffers. So in such setup lots of dirty pages can be of that strange kind... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH 0/2] Reduce system disruption due to kswapd followup
@ 2013-05-23  9:26 Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2013-05-23 9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jiri Slaby, Valdis Kletnieks, Rik van Riel, Zlatko Calusic,
      Johannes Weiner, dormando, Michal Hocko, Jan Kara, Dave Chinner,
      Kamezawa Hiroyuki, Linux-FSDevel, Linux-MM, LKML, Mel Gorman

Further testing of the "Reduce system disruption due to kswapd" series
discovered a few problems. First, as pages were not being swapped, the file
LRU was being scanned faster and clean file pages were being reclaimed,
resulting in some cases in larger amounts of read IO to re-read data from
disk. Second, more pages were being written from kswapd context, which can
adversely affect IO performance. Lastly, it was observed that PageDirty
pages are not necessarily dirty on all filesystems (buffers can be clean
while PageDirty is set and ->writepage generates no IO) and not all
filesystems set PageWriteback when the page is being written (e.g. ext3).
This disconnect confuses the reclaim stalling logic. This follow-up series
is aimed at these problems.

The tests were based on three kernels

vanilla:	 kernel 3.9 as that is what the current mmotm uses as a baseline
mmotm-20130522	 is mmotm as of 22nd May with "Reduce system disruption due
		 to kswapd" applied on top, as per what should be in Andrew's
		 tree right now
lessdisrupt-v5r4 is this follow-up series on top of the mmotm kernel

The first test used memcached+memcachetest while some background IO was in
progress, as implemented by the parallel IO tests in MM Tests. memcachetest
benchmarks how many operations/second memcached can service. It starts with
no background IO on a freshly created ext4 filesystem and then re-runs the
test with larger amounts of IO in the background to roughly simulate a large
copy in progress. The expectation is that the IO should have little or no
impact on memcachetest, which is running entirely in memory.
3.9.0 3.9.0 3.9.0 vanilla mm1-mmotm-20130522 mm1-lessdisrupt-v5r4 Ops memcachetest-0M 23117.00 ( 0.00%) 23088.00 ( -0.13%) 22815.00 ( -1.31%) Ops memcachetest-715M 23774.00 ( 0.00%) 23504.00 ( -1.14%) 23342.00 ( -1.82%) Ops memcachetest-2385M 4208.00 ( 0.00%) 23740.00 (464.16%) 24138.00 (473.62%) Ops memcachetest-4055M 4104.00 ( 0.00%) 24800.00 (504.29%) 24930.00 (507.46%) Ops io-duration-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops io-duration-715M 12.00 ( 0.00%) 7.00 ( 41.67%) 7.00 ( 41.67%) Ops io-duration-2385M 116.00 ( 0.00%) 21.00 ( 81.90%) 21.00 ( 81.90%) Ops io-duration-4055M 160.00 ( 0.00%) 37.00 ( 76.88%) 36.00 ( 77.50%) Ops swaptotal-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-715M 140138.00 ( 0.00%) 18.00 ( 99.99%) 18.00 ( 99.99%) Ops swaptotal-2385M 385682.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swaptotal-4055M 418029.00 ( 0.00%) 0.00 ( 0.00%) 2.00 (100.00%) Ops swapin-0M 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-715M 144.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-2385M 134227.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops swapin-4055M 125618.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops minorfaults-0M 1536429.00 ( 0.00%) 1533759.00 ( 0.17%) 1537248.00 ( -0.05%) Ops minorfaults-715M 1786996.00 ( 0.00%) 1606613.00 ( 10.09%) 1610854.00 ( 9.86%) Ops minorfaults-2385M 1757952.00 ( 0.00%) 1608201.00 ( 8.52%) 1614772.00 ( 8.14%) Ops minorfaults-4055M 1774460.00 ( 0.00%) 1620493.00 ( 8.68%) 1625930.00 ( 8.37%) Ops majorfaults-0M 1.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Ops majorfaults-715M 184.00 ( 0.00%) 159.00 ( 13.59%) 162.00 ( 11.96%) Ops majorfaults-2385M 24444.00 ( 0.00%) 108.00 ( 99.56%) 151.00 ( 99.38%) Ops majorfaults-4055M 21357.00 ( 0.00%) 218.00 ( 98.98%) 189.00 ( 99.12%) memcachetest is the transactions/second reported by memcachetest. In the vanilla kernel note that performance drops from around 23K/sec to just over 4K/second when there is 2385M of IO going on in the background. With current mmotm, there is no collapse in performance and with this follow-up series there is little change. swaptotal is the total amount of swap traffic. With mmotm and the follow-up series, the total amount of swapping is much reduced. 3.9.0 3.9.0 3.9.0 vanillamm1-mmotm-20130522mm1-lessdisrupt-v5r4 Minor Faults 11160152 10592704 10620743 Major Faults 46305 771 788 Swap Ins 260249 0 0 Swap Outs 683860 18 20 Direct pages scanned 0 0 850 Kswapd pages scanned 6046108 18523180 1598979 Kswapd pages reclaimed 1081954 1182759 1093766 Direct pages reclaimed 0 0 800 Kswapd efficiency 17% 6% 68% Kswapd velocity 5217.560 16027.810 1382.231 Direct efficiency 100% 100% 94% Direct velocity 0.000 0.000 0.735 Percentage direct scans 0% 0% 0% Zone normal velocity 5105.086 15217.472 636.579 Zone dma32 velocity 112.473 810.338 746.387 Zone dma velocity 0.000 0.000 0.000 Page writes by reclaim 1929612.00016620834.000 43115.000 Page writes file 1245752 16620816 43095 Page writes anon 683860 18 20 Page reclaim immediate 7484 70 147 Sector Reads 1130320 94964 97244 Sector Writes 13508052 11356812 11469072 Page rescued immediate 0 0 0 Slabs scanned 33536 27648 21120 Direct inode steals 0 0 0 Kswapd inode steals 8641 1495 0 Kswapd skipped wait 0 0 0 THP fault alloc 8 9 39 THP collapse alloc 508 476 378 THP splits 24 0 0 THP fault fallback 0 0 0 THP collapse fail 0 0 0 There are a number of observations to make here 1. Swap outs are almost eliminated. Swap ins are 0 indicating that the pages swapped were really unused anonymous pages. 
Related to that, major faults are much reduced.

2. kswapd efficiency was impacted by the initial series but with these
   follow-up patches, the efficiency is now at 66%, indicating that far
   fewer pages were skipped during scanning due to dirty or writeback
   pages.

3. kswapd velocity is reduced, indicating that fewer pages are being
   scanned with the follow-up series as kswapd now stalls when the tail
   of the LRU queue is full of unqueued dirty pages. The stall gives
   flushers a chance to catch up so kswapd can reclaim clean pages when
   it wakes.

4. In light of Zlatko's recent reports about zone scanning imbalances,
   mmtests now reports scanning velocity on a per-zone basis. With
   mainline, you can see that the scanning activity is dominated by the
   Normal zone with over 45 times more scanning in Normal than the DMA32
   zone. With the series currently in mmotm, the ratio is slightly better
   but it is still the case that the bulk of scanning is in the highest
   zone. With this follow-up series, the ratio of scanning between the
   Normal and DMA32 zones is roughly equal.

5. As Dave Chinner observed, the current patches in mmotm increased the
   number of pages written from kswapd context, which is expected to
   adversely impact IO performance. With the follow-up patches, far fewer
   pages are written from kswapd context than in the mainline kernel.

6. With the series in mmotm, fewer inodes were reclaimed by kswapd. With
   the follow-up series, there is less slab shrinking activity and no
   inodes were reclaimed.

7. Note that "Sectors Read" is drastically reduced, implying that the
   source data being used for the IO is not being aggressively discarded
   due to page reclaim skipping over dirty pages and reclaiming clean
   pages. Note that the reduction in reads could also be due to inode data
   not being re-read from disk after a slab shrink.

Overall, the system is getting less kicked in the face due to IO.

 fs/buffer.c                 | 34 ++++++++++++++++++++++++++++++++++
 fs/ext2/inode.c             |  1 +
 fs/ext3/inode.c             |  3 +++
 fs/ext4/inode.c             |  2 ++
 fs/gfs2/aops.c              |  2 ++
 fs/ntfs/aops.c              |  1 +
 fs/ocfs2/aops.c             |  1 +
 fs/xfs/xfs_aops.c           |  1 +
 include/linux/buffer_head.h |  3 +++
 include/linux/fs.h          |  1 +
 mm/vmscan.c                 | 45 ++++++++++++++++++++++++++++++++++++++-------
 11 files changed, 87 insertions(+), 7 deletions(-)

-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 8+ messages in thread
end of thread, other threads:[~2013-05-29 22:28 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-27 13:02 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman
2013-05-27 13:02 ` [PATCH 1/4] mm: vmscan: Block kswapd if it is encountering pages under writeback -fix Mel Gorman
2013-05-27 13:02 ` [PATCH 2/4] mm: vmscan: Stall page reclaim and writeback pages based on dirty/writepage pages encountered Mel Gorman
2013-05-27 13:02 ` [PATCH 3/4] mm: vmscan: Stall page reclaim after a list of pages have been processed Mel Gorman
2013-05-27 13:02 ` [PATCH 4/4] mm: vmscan: Take page buffers dirty and locked state into account Mel Gorman
2013-05-29 19:53   ` Andrew Morton
2013-05-29 22:28     ` Jan Kara

-- strict thread matches above, loose matches on Subject: below --
2013-05-23  9:26 [PATCH 0/2] Reduce system disruption due to kswapd followup Mel Gorman