* [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
@ 2016-09-09 9:59 Mel Gorman
2016-09-09 9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09 9:59 UTC (permalink / raw)
To: LKML, Linux-MM, Mel Gorman
Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko
This is a follow-on series from the thread "[lkp] [xfs] 68a9f5e700:
aim7.jobs-per-min -13.6% regression" with active parties cc'd. I've
pushed the series to git.kernel.org where the LKP robot should pick it
up automatically.
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaim-contention-v1r15
The progression of this series has been unsatisfactory. Dave originally
reported a problem with tree_lock contention and while it can be fixed
by pushing reclaim to direct reclaim, that slows swapping considerably and
is not a universal win. This series is the best balance I've found so far
between the swapping and large rewriter cases.
I never reliably reproduced the same contention that Dave did, so testing
is needed. Dave, ideally you would test patches 1+2 and patches 1+4, but
a test of patches 1+3 would also be nice if you have the time. Minimally,
I expect that patches 1+2 will help the swapping-to-fast-storage case
(LKP to confirm independently) and they may be worth considering on their
own even if Dave's test case is not helped.
drivers/block/brd.c | 1 +
mm/vmscan.c | 209 +++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 182 insertions(+), 28 deletions(-)
--
2.6.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-16 13:25   ` Peter Zijlstra
  2016-09-09  9:59 ` [PATCH 2/4] block, brd: Treat storage as non-rotational Mel Gorman
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

Pages unmapped during reclaim acquire/release the mapping->tree_lock for
every single page. There are two cases when it's likely that pages at the
tail of the LRU share the same mapping -- large amounts of IO to/from a
single file and swapping. This patch acquires the mapping->tree_lock for
multiple page removals.

To trigger heavy swapping, varying numbers of usemem instances were used
to read anonymous memory larger than the physical memory size. A UMA
machine was used with 4 fake NUMA nodes to increase interference from
kswapd. The swap device was backed by ramdisk using the brd driver. NUMA
balancing was disabled to limit interference.

                        4.8.0-rc5           4.8.0-rc5
                          vanilla            batch-v1
Amean    System-1    260.53 (  0.00%)    192.98 ( 25.93%)
Amean    System-3    179.59 (  0.00%)    198.33 (-10.43%)
Amean    System-5    205.71 (  0.00%)    105.22 ( 48.85%)
Amean    System-7    146.46 (  0.00%)     97.79 ( 33.23%)
Amean    System-8    275.37 (  0.00%)    149.39 ( 45.75%)
Amean    Elapsd-1    292.89 (  0.00%)    219.95 ( 24.90%)
Amean    Elapsd-3     69.47 (  0.00%)     79.02 (-13.74%)
Amean    Elapsd-5     54.12 (  0.00%)     29.88 ( 44.79%)
Amean    Elapsd-7     34.28 (  0.00%)     24.06 ( 29.81%)
Amean    Elapsd-8     57.98 (  0.00%)     33.34 ( 42.50%)

System is system CPU usage and Elapsd is the time to complete the
workload. Regardless of the thread count, the workload generally completes
faster, although there is a lot of variability as much more work is being
done under a single lock.

xfs_io and pwrite were used to rewrite a file multiple times to measure
any locking overhead reduction.

xfsio Time
                                               4.8.0-rc5           4.8.0-rc5
                                                 vanilla         batch-v1r18
Amean    pwrite-single-rewrite-async-System     49.19 (  0.00%)     49.49 ( -0.60%)
Amean    pwrite-single-rewrite-async-Elapsd    322.87 (  0.00%)    322.72 (  0.05%)

Unfortunately the difference here is well within the noise as the workload
is dominated by the cost of the IO. It may be the case that the benefit is
noticeable on faster storage or in KVM instances where the data may be
resident in the page cache of the host.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 142 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1e12a1ea9cf..f7beb573a594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -622,18 +622,47 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 }
 
 /*
+ * Finalise the mapping removal without the mapping lock held. The pages
+ * are placed on the free_list and the caller is expected to drop the
+ * final reference.
+ */
+static void finalise_remove_mapping_list(struct list_head *swapcache,
+            struct list_head *filecache,
+            void (*freepage)(struct page *),
+            struct list_head *free_list)
+{
+    struct page *page;
+
+    list_for_each_entry(page, swapcache, lru) {
+        swp_entry_t swap = { .val = page_private(page) };
+        swapcache_free(swap);
+        set_page_private(page, 0);
+    }
+
+    list_for_each_entry(page, filecache, lru)
+        freepage(page);
+
+    list_splice_init(swapcache, free_list);
+    list_splice_init(filecache, free_list);
+}
+
+enum remove_mapping {
+    REMOVED_FAIL,
+    REMOVED_SWAPCACHE,
+    REMOVED_FILECACHE
+};
+
+/*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-                bool reclaimed)
+static enum remove_mapping __remove_mapping(struct address_space *mapping,
+            struct page *page, bool reclaimed,
+            void (**freepage)(struct page *))
 {
-    unsigned long flags;
-
     BUG_ON(!PageLocked(page));
     BUG_ON(mapping != page_mapping(page));
 
-    spin_lock_irqsave(&mapping->tree_lock, flags);
     /*
      * The non racy check for a busy page.
      *
@@ -668,16 +697,17 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
     }
 
     if (PageSwapCache(page)) {
-        swp_entry_t swap = { .val = page_private(page) };
+        unsigned long swapval = page_private(page);
+        swp_entry_t swap = { .val = swapval };
         mem_cgroup_swapout(page, swap);
         __delete_from_swap_cache(page);
-        spin_unlock_irqrestore(&mapping->tree_lock, flags);
-        swapcache_free(swap);
+        set_page_private(page, swapval);
+        return REMOVED_SWAPCACHE;
     } else {
-        void (*freepage)(struct page *);
         void *shadow = NULL;
 
-        freepage = mapping->a_ops->freepage;
+        *freepage = mapping->a_ops->freepage;
+
         /*
          * Remember a shadow entry for reclaimed file cache in
          * order to detect refaults, thus thrashing, later on.
@@ -698,17 +728,76 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
             !mapping_exiting(mapping) && !dax_mapping(mapping))
             shadow = workingset_eviction(mapping, page);
         __delete_from_page_cache(page, shadow);
-        spin_unlock_irqrestore(&mapping->tree_lock, flags);
+        return REMOVED_FILECACHE;
+    }
 
-        if (freepage != NULL)
-            freepage(page);
+cannot_free:
+    return REMOVED_FAIL;
+}
+
+static unsigned long remove_mapping_list(struct list_head *mapping_list,
+            struct list_head *free_pages,
+            struct list_head *ret_pages)
+{
+    unsigned long flags;
+    struct address_space *mapping = NULL;
+    void (*freepage)(struct page *) = NULL;
+    LIST_HEAD(swapcache);
+    LIST_HEAD(filecache);
+    struct page *page, *tmp;
+    unsigned long nr_reclaimed = 0;
+
+continue_removal:
+    list_for_each_entry_safe(page, tmp, mapping_list, lru) {
+        /* Batch removals under one tree lock at a time */
+        if (mapping && page_mapping(page) != mapping)
+            continue;
+
+        list_del(&page->lru);
+        if (!mapping) {
+            mapping = page_mapping(page);
+            spin_lock_irqsave(&mapping->tree_lock, flags);
+        }
+
+        switch (__remove_mapping(mapping, page, true, &freepage)) {
+        case REMOVED_FILECACHE:
+            /*
+             * At this point, we have no other references and there
+             * is no way to pick any more up (removed from LRU,
+             * removed from pagecache). Can use non-atomic bitops
+             * now (and we obviously don't have to worry about
+             * waking up a process waiting on the page lock,
+             * because there are no references.
+             */
+            __ClearPageLocked(page);
+            if (freepage)
+                list_add(&page->lru, &filecache);
+            else
+                list_add(&page->lru, free_pages);
+            nr_reclaimed++;
+            break;
+        case REMOVED_SWAPCACHE:
+            /* See FILECACHE case as to why non-atomic is safe */
+            __ClearPageLocked(page);
+            list_add(&page->lru, &swapcache);
+            nr_reclaimed++;
+            break;
+        case REMOVED_FAIL:
+            unlock_page(page);
+            list_add(&page->lru, ret_pages);
+        }
     }
 
-    return 1;
+    if (mapping) {
+        spin_unlock_irqrestore(&mapping->tree_lock, flags);
+        finalise_remove_mapping_list(&swapcache, &filecache, freepage, free_pages);
+        mapping = NULL;
+    }
 
-cannot_free:
-    spin_unlock_irqrestore(&mapping->tree_lock, flags);
-    return 0;
+    if (!list_empty(mapping_list))
+        goto continue_removal;
+
+    return nr_reclaimed;
 }
 
 /*
@@ -719,16 +808,42 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-    if (__remove_mapping(mapping, page, false)) {
+    unsigned long flags;
+    LIST_HEAD(swapcache);
+    LIST_HEAD(filecache);
+    void (*freepage)(struct page *) = NULL;
+    swp_entry_t swap;
+    int ret = 0;
+
+    spin_lock_irqsave(&mapping->tree_lock, flags);
+    freepage = mapping->a_ops->freepage;
+    ret = __remove_mapping(mapping, page, false, &freepage);
+    spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+    if (ret != REMOVED_FAIL) {
         /*
          * Unfreezing the refcount with 1 rather than 2 effectively
          * drops the pagecache ref for us without requiring another
          * atomic operation.
          */
         page_ref_unfreeze(page, 1);
+    }
+
+    switch (ret) {
+    case REMOVED_FILECACHE:
+        if (freepage)
+            freepage(page);
+        return 1;
+    case REMOVED_SWAPCACHE:
+        swap.val = page_private(page);
+        swapcache_free(swap);
+        set_page_private(page, 0);
         return 1;
+    case REMOVED_FAIL:
+        return 0;
     }
-    return 0;
+
+    BUG();
 }
 
 /**
@@ -910,6 +1025,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
     LIST_HEAD(ret_pages);
     LIST_HEAD(free_pages);
+    LIST_HEAD(mapping_pages);
     int pgactivate = 0;
     unsigned long nr_unqueued_dirty = 0;
     unsigned long nr_dirty = 0;
@@ -1206,17 +1322,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
         }
 
 lazyfree:
-        if (!mapping || !__remove_mapping(mapping, page, true))
+        if (!mapping)
             goto keep_locked;
 
-        /*
-         * At this point, we have no other references and there is
-         * no way to pick any more up (removed from LRU, removed
-         * from pagecache). Can use non-atomic bitops now (and
-         * we obviously don't have to worry about waking up a process
-         * waiting on the page lock, because there are no references.
-         */
-        __ClearPageLocked(page);
+        list_add(&page->lru, &mapping_pages);
+        if (ret == SWAP_LZFREE)
+            count_vm_event(PGLAZYFREED);
+        continue;
+
 free_it:
         if (ret == SWAP_LZFREE)
             count_vm_event(PGLAZYFREED);
@@ -1251,6 +1364,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
         VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
     }
 
+    nr_reclaimed += remove_mapping_list(&mapping_pages, &free_pages, &ret_pages);
     mem_cgroup_uncharge_list(&free_pages);
     try_to_unmap_flush();
     free_hot_cold_page_list(&free_pages, true);
-- 
2.6.4
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
@ 2016-09-16 13:25   ` Peter Zijlstra
  2016-09-16 14:07     ` Peter Zijlstra
  2016-09-16 18:33     ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-16 13:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> every single page. There are two cases when it's likely that pages at the
> tail of the LRU share the same mapping -- large amounts of IO to/from a
> single file and swapping. This patch acquires the mapping->tree_lock for
> multiple page removals.

So, once upon a time, in a galaxy far away,.. I did a concurrent
pagecache patch set that replaced the tree_lock with a per page bit-
spinlock and fine grained locking in the radix tree.

I know the mm has changed quite a bit since, but would such an approach
still be feasible?

I cannot seem to find an online reference to a 'complete' version of
that patch set, but I did find the OLS paper on it and I did find some
copies on my local machines.
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 13:25   ` Peter Zijlstra
@ 2016-09-16 14:07     ` Peter Zijlstra
  2016-09-16 18:33     ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-16 14:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 03:25:06PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> > Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> > every single page. There are two cases when it's likely that pages at the
> > tail of the LRU share the same mapping -- large amounts of IO to/from a
> > single file and swapping. This patch acquires the mapping->tree_lock for
> > multiple page removals.
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page bit-
> spinlock and fine grained locking in the radix tree.
>
> I know the mm has changed quite a bit since, but would such an approach
> still be feasible?
>
> I cannot seem to find an online reference to a 'complete' version of
> that patch set, but I did find the OLS paper on it and I did find some
> copies on my local machines.

https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 13:25   ` Peter Zijlstra
  2016-09-16 14:07     ` Peter Zijlstra
@ 2016-09-16 18:33     ` Linus Torvalds
  2016-09-17  1:36       ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2016-09-16 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page bit-
> spinlock and fine grained locking in the radix tree.

I'd love to see the patch for that. I'd be a bit worried about extra
locking in the trivial cases (ie multi-level locking when we now take
just the single mapping lock), but if there is some smart reason why
that doesn't happen, then..

              Linus
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 18:33     ` Linus Torvalds
@ 2016-09-17  1:36       ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-17  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 11:33:00AM -0700, Linus Torvalds wrote:
> On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So, once upon a time, in a galaxy far away,.. I did a concurrent
> > pagecache patch set that replaced the tree_lock with a per page bit-
> > spinlock and fine grained locking in the radix tree.
>
> I'd love to see the patch for that. I'd be a bit worried about extra
> locking in the trivial cases (ie multi-level locking when we now take
> just the single mapping lock), but if there is some smart reason why
> that doesn't happen, then..

On average we'll likely take a few more locks, but it's not as bad as
having to take the whole tree depth every time, or even touching the
root lock most times.

There's two cases. The first: the modification is only done on a single
node (like insert). Here we do an RCU lookup of the node, lock it,
verify the node is still correct, do the modification and unlock, done.

The second case: the modification needs to then back up the tree (like
setting/clearing tags, delete). For this case we can determine on our
way down where the first node is we need to modify, lock that, verify,
and then lock all nodes down to the last. i.e. we lock a partial path.

I can send you the 2.6.31 patches if you're interested, but if you want
something that applies to a kernel from this decade I'll have to go
rewrite them, which will take a wee bit of time :-) Both the radix tree
code and the mm have changed somewhat.
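For illustration of the first case only, a minimal sketch of the RCU
lookup/lock/verify pattern described above. This is hypothetical code, not
taken from the referenced patch set or the current kernel API; the struct
pc_node type and the find_leaf() and node_is_valid() helpers are invented
names, assuming a tree where each node carries its own spinlock:

    #include <linux/spinlock.h>
    #include <linux/rcupdate.h>
    #include <linux/mm_types.h>

    /* Hypothetical fine-grained node replacing the single mapping->tree_lock */
    struct pc_node {
            spinlock_t node_lock;           /* per-node lock */
            void *slots[64];                /* child pointers or pages */
    };

    /* Invented helpers: a lockless RCU walk and a post-lock revalidation check */
    struct pc_node *find_leaf(struct pc_node *root, unsigned long index,
                              unsigned int *offset);
    bool node_is_valid(struct pc_node *node, unsigned long index);

    /* Case 1: modification confined to a single leaf node (e.g. insert) */
    static int insert_single(struct pc_node *root, unsigned long index,
                             struct page *page)
    {
            struct pc_node *node;
            unsigned int offset;

    retry:
            rcu_read_lock();
            node = find_leaf(root, index, &offset); /* no locks taken on the way down */
            spin_lock(&node->node_lock);
            if (!node_is_valid(node, index)) {      /* re-check after locking */
                    spin_unlock(&node->node_lock);
                    rcu_read_unlock();
                    goto retry;                     /* raced with a tree reshape */
            }
            node->slots[offset] = page;             /* the actual modification */
            spin_unlock(&node->node_lock);
            rcu_read_unlock();
            return 0;
    }

For the second case, the walk down would remember the highest node that
needs to change and take the locks from that node down to the leaf, so only
a partial path is held rather than the root or the whole tree.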
* [PATCH 2/4] block, brd: Treat storage as non-rotational
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09  9:59 ` [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock Mel Gorman
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

Unlike the rims of a punked out car, RAM does not spin. Ramdisk as
implemented by the brd driver is nevertheless treated as rotational
storage. When used as swap to simulate fast storage, swap uses the
algorithms for minimising seek times instead of the algorithms optimised
for SSD. When the tree_lock contention was reduced by the previous patch,
it was found that the workload was dominated by scan_swap_map(). This
patch has no practical application as swap-on-ramdisk is dumb as rocks,
but it's trivial to fix.

                        4.8.0-rc5           4.8.0-rc5
                         batch-v1    ramdisknonrot-v1
Amean    System-1    192.98 (  0.00%)    181.00 (  6.21%)
Amean    System-3    198.33 (  0.00%)     86.19 ( 56.54%)
Amean    System-5    105.22 (  0.00%)     67.43 ( 35.91%)
Amean    System-7     97.79 (  0.00%)     89.55 (  8.42%)
Amean    System-8    149.39 (  0.00%)    102.92 ( 31.11%)
Amean    Elapsd-1    219.95 (  0.00%)    209.23 (  4.88%)
Amean    Elapsd-3     79.02 (  0.00%)     36.93 ( 53.26%)
Amean    Elapsd-5     29.88 (  0.00%)     19.52 ( 34.69%)
Amean    Elapsd-7     24.06 (  0.00%)     21.93 (  8.84%)
Amean    Elapsd-8     33.34 (  0.00%)     23.63 ( 29.12%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/block/brd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..83a76a74e027 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -504,6 +504,7 @@ static struct brd_device *brd_alloc(int i)
     blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
     brd->brd_queue->limits.discard_zeroes_data = 1;
     queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
+    queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
     queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.6.4
* [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
  2016-09-09  9:59 ` [PATCH 2/4] block, brd: Treat storage as non-rotational Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09  9:59 ` [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention Mel Gorman
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

If there is a large reader/writer, it's possible for multiple kswapd
instances and the processes issuing IO to contend on a single
mapping->tree_lock. This patch causes all kswapd instances except one to
back off if contending on tree_lock. A sleeping kswapd instance will be
woken when one has made progress.

                        4.8.0-rc5           4.8.0-rc5
                 ramdisknonrot-v1        waitqueue-v1
Min      Elapsd-8     18.31 (  0.00%)     28.32 (-54.67%)
Amean    System-1    181.00 (  0.00%)    179.61 (  0.77%)
Amean    System-3     86.19 (  0.00%)     68.91 ( 20.05%)
Amean    System-5     67.43 (  0.00%)     93.09 (-38.05%)
Amean    System-7     89.55 (  0.00%)     90.98 ( -1.60%)
Amean    System-8    102.92 (  0.00%)    299.81 (-191.30%)
Amean    Elapsd-1    209.23 (  0.00%)    210.41 ( -0.57%)
Amean    Elapsd-3     36.93 (  0.00%)     33.89 (  8.25%)
Amean    Elapsd-5     19.52 (  0.00%)     25.19 (-29.08%)
Amean    Elapsd-7     21.93 (  0.00%)     18.45 ( 15.88%)
Amean    Elapsd-8     23.63 (  0.00%)     48.80 (-106.51%)

Note that unlike the previous patches this is not an unconditional win.
System CPU usage is generally higher because direct reclaim is used
instead of multiple competing kswapd instances. According to the stats,
there is 10 times more direct reclaim scanning and reclaim activity and
overall the workload takes longer to complete.

               4.8.0-rc5      4.8.0-rc5
        ramdisknonrot-v1   waitqueue-v1
User          473.24         462.40
System       3690.20        5127.32
Elapsed      2186.05        2364.08

The motivation for this patch was Dave Chinner reporting that an xfs_io
workload rewriting a single file spent a significant amount of time
spinning on the tree_lock. Local tests were inconclusive. On spinning
storage, the IO was so slow that the contention was not noticeable. When
xfs_io is backed by ramdisk to simulate fast storage then it can be
observed;

                                               4.8.0-rc5           4.8.0-rc5
                                        ramdisknonrot-v1        waitqueue-v1
Min      pwrite-single-rewrite-async-System      3.12 (  0.00%)      3.06 (  1.92%)
Min      pwrite-single-rewrite-async-Elapsd      3.25 (  0.00%)      3.17 (  2.46%)
Amean    pwrite-single-rewrite-async-System      3.32 (  0.00%)      3.23 (  2.67%)
Amean    pwrite-single-rewrite-async-Elapsd      3.42 (  0.00%)      3.33 (  2.71%)

               4.8.0-rc5      4.8.0-rc5
        ramdisknonrot-v1   waitqueue-v1
User            9.06           8.76
System        402.67         392.31
Elapsed       416.91         406.29

That's roughly a 2.5% drop in CPU usage overall. A test from Dave Chinner
with some data to support/reject this patch is highly desirable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7beb573a594..936070b0790e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -735,6 +735,9 @@ static enum remove_mapping __remove_mapping(struct address_space *mapping,
     return REMOVED_FAIL;
 }
 
+static unsigned long kswapd_exclusive = NUMA_NO_NODE;
+static DECLARE_WAIT_QUEUE_HEAD(kswapd_contended_wait);
+
 static unsigned long remove_mapping_list(struct list_head *mapping_list,
             struct list_head *free_pages,
             struct list_head *ret_pages)
@@ -755,8 +758,28 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
 
         list_del(&page->lru);
         if (!mapping) {
+            pg_data_t *pgdat = page_pgdat(page);
             mapping = page_mapping(page);
-            spin_lock_irqsave(&mapping->tree_lock, flags);
+
+            /* Account for trylock contentions in kswapd */
+            if (!current_is_kswapd() ||
+                pgdat->node_id == kswapd_exclusive) {
+                spin_lock_irqsave(&mapping->tree_lock, flags);
+            } else {
+                /* Account for contended pages and contended kswapds */
+                if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+                    /* Stall kswapd once for 10ms on contention */
+                    if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
+                        DEFINE_WAIT(wait);
+                        prepare_to_wait(&kswapd_contended_wait,
+                            &wait, TASK_INTERRUPTIBLE);
+                        io_schedule_timeout(HZ/100);
+                        finish_wait(&kswapd_contended_wait, &wait);
+                    }
+
+                    spin_lock_irqsave(&mapping->tree_lock, flags);
+                }
+            }
         }
 
         switch (__remove_mapping(mapping, page, true, &freepage)) {
@@ -3212,6 +3235,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
     unsigned long mark = high_wmark_pages(zone);
+    unsigned long nid;
 
     if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
         return false;
@@ -3223,6 +3247,12 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
     clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
     clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 
+    nid = zone->zone_pgdat->node_id;
+    if (nid == kswapd_exclusive) {
+        cmpxchg(&kswapd_exclusive, nid, NUMA_NO_NODE);
+        wake_up_interruptible(&kswapd_contended_wait);
+    }
+
     return true;
 }
-- 
2.6.4
* [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  ` (2 preceding siblings ...)
  2016-09-09  9:59 ` [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

If a heavy writer of a single file is forcing contention on the tree_lock
then it may be necessary to temporarily stall the direct writer to allow
kswapd to make progress. This patch marks a pgdat congested if tree_lock
is being contended at the tail of the LRU.

On a swap-intensive workload to ramdisk, the following is observed

usemem
                        4.8.0-rc5           4.8.0-rc5
                     waitqueue-v1    directcongest-v1
Amean    System-1    179.61 (  0.00%)    202.21 (-12.58%)
Amean    System-3     68.91 (  0.00%)    105.14 (-52.59%)
Amean    System-5     93.09 (  0.00%)     80.98 ( 13.01%)
Amean    System-7     90.98 (  0.00%)     81.07 ( 10.90%)
Amean    System-8    299.81 (  0.00%)    227.08 ( 24.26%)
Amean    Elapsd-1    210.41 (  0.00%)    236.56 (-12.43%)
Amean    Elapsd-3     33.89 (  0.00%)     46.78 (-38.06%)
Amean    Elapsd-5     25.19 (  0.00%)     23.33 (  7.38%)
Amean    Elapsd-7     18.45 (  0.00%)     17.18 (  6.91%)
Amean    Elapsd-8     48.80 (  0.00%)     38.09 ( 21.93%)

Note that system CPU usage is reduced for high thread counts but it is not
a universal win and it's known to be highly variable. The overall time
stats look like

               4.8.0-rc5         4.8.0-rc5
            waitqueue-v1  directcongest-v1
User          462.40            468.18
System       5127.32           4875.92
Elapsed      2364.08           2539.77

It takes longer to complete but uses less system CPU. The benefit is more
noticeable with xfs_io rewriting a file backed by ramdisk

                                               4.8.0-rc5           4.8.0-rc5
                                         waitqueue-v1r24 directcongest-v1r24
Amean    pwrite-single-rewrite-async-System      3.23 (  0.00%)      3.21 (  0.80%)
Amean    pwrite-single-rewrite-async-Elapsd      3.33 (  0.00%)      3.31 (  0.67%)

               4.8.0-rc5         4.8.0-rc5
            waitqueue-v1  directcongest-v1
User            8.76              9.25
System        392.31            389.10
Elapsed       406.29            403.74

As with the previous patch, a test from Dave Chinner would be necessary to
decide whether this patch is worthwhile. It seems reasonable to favour
workloads that are heavily writing files over those that are heavily
swapping, as the former situation is normal and reasonable while the
latter will never be optimal.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 936070b0790e..953df97abe0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -771,6 +771,15 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
                     /* Stall kswapd once for 10ms on contention */
                     if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
                         DEFINE_WAIT(wait);
+
+                        /*
+                         * Tag the pgdat as congested as it may
+                         * indicate contention with a heavy
+                         * writer that should stall on
+                         * wait_iff_congested.
+                         */
+                        set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
                         prepare_to_wait(&kswapd_contended_wait,
                             &wait, TASK_INTERRUPTIBLE);
                         io_schedule_timeout(HZ/100);
-- 
2.6.4
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  ` (3 preceding siblings ...)
  2016-09-09  9:59 ` [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention Mel Gorman
@ 2016-09-09 15:31 ` Linus Torvalds
  2016-09-09 16:19   ` Mel Gorman
  4 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2016-09-09 15:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The progression of this series has been unsatisfactory.

Yeah, I have to say that I particularly don't like patch #1. It's some
rather nasty complexity for dubious gains, and holding the lock for
longer times might have downsides.

And the numbers seem to not necessarily be in favor of patch #3 either,
which I would have otherwise been predisposed to like (ie it looks
fairly targeted and not very complex).

#2 seems trivially correct but largely irrelevant.

So I think this series is one of those "we need to find that it makes
a big positive impact" to make sense.

              Linus
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
@ 2016-09-09 16:19   ` Mel Gorman
  2016-09-09 18:16     ` Huang, Ying
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2016-09-09 16:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The progression of this series has been unsatisfactory.
>
> Yeah, I have to say that I particularly don't like patch #1.

There aren't many ways to make it prettier. Making it nicer is partially
hindered by the fact that tree_lock is IRQ-safe for IO completions, but
even if that was addressed there might be lock ordering issues.

> It's some
> rather nasty complexity for dubious gains, and holding the lock for
> longer times might have downsides.

Kswapd reclaim would delay a parallel truncation, for example. Doubtful
it matters but the possibility is there.

The gain in swapping is nice but ramdisk is excessively artificial. It
might matter if someone reported it made a big difference swapping to
faster storage like SSD or NVMe, although the cases where fast swap is
important are few -- an overcommitted host with multiple idle VMs and a
new active VM starting is the only one that springs to mind.

> So I think this series is one of those "we need to find that it makes
> a big positive impact" to make sense.

Agreed. I don't mind leaving it on the back burner unless Dave reports
it really helps or a new bug report about realistic tree_lock contention
shows up.

-- 
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09 16:19   ` Mel Gorman
@ 2016-09-09 18:16     ` Huang, Ying
  0 siblings, 0 replies; 12+ messages in thread
From: Huang, Ying @ 2016-09-09 18:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, LKML, Linux-MM, Dave Chinner, Ying Huang,
	Michal Hocko, Tim C. Chen, Dave Hansen, Andi Kleen

Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
>> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> >
>> > The progression of this series has been unsatisfactory.
>>
>> Yeah, I have to say that I particularly don't like patch #1.
>
> There aren't many ways to make it prettier. Making it nicer is partially
> hindered by the fact that tree_lock is IRQ-safe for IO completions, but
> even if that was addressed there might be lock ordering issues.
>
>> It's some
>> rather nasty complexity for dubious gains, and holding the lock for
>> longer times might have downsides.
>>
>
> Kswapd reclaim would delay a parallel truncation, for example. Doubtful
> it matters but the possibility is there.
>
> The gain in swapping is nice but ramdisk is excessively artificial. It
> might matter if someone reported it made a big difference swapping to
> faster storage like SSD or NVMe, although the cases where fast swap is
> important are few -- an overcommitted host with multiple idle VMs and a
> new active VM starting is the only one that springs to mind.

I will try to provide some data for an NVMe disk. I think the trend is
that disk performance is increasing quickly and will continue to do so
in the near future at least. We found we cannot saturate the latest NVMe
disks when swapping because of locking issues in the swap and page
reclaim paths.

Swap usage could be a "chicken and egg" problem. Because swap
performance is poor, nobody uses swap, and because nobody uses swap,
nobody works on improving its performance. With faster and faster
storage devices, swap could become more popular in the future if we
optimize its performance to catch up with the storage.

>> So I think this series is one of those "we need to find that it makes
>> a big positive impact" to make sense.
>>
>
> Agreed. I don't mind leaving it on the back burner unless Dave reports
> it really helps or a new bug report about realistic tree_lock contention
> shows up.

Best Regards,
Huang, Ying