From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 273376B0069 for ; Fri, 9 Sep 2016 05:59:38 -0400 (EDT) Received: by mail-wm0-f72.google.com with SMTP id g141so10389596wmd.0 for ; Fri, 09 Sep 2016 02:59:38 -0700 (PDT) Received: from outbound-smtp06.blacknight.com (outbound-smtp06.blacknight.com. [81.17.249.39]) by mx.google.com with ESMTPS id b75si2225517wma.30.2016.09.09.02.59.36 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 09 Sep 2016 02:59:36 -0700 (PDT) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp06.blacknight.com (Postfix) with ESMTPS id 3FE38989CE for ; Fri, 9 Sep 2016 09:59:36 +0000 (UTC) From: Mel Gorman Subject: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Date: Fri, 9 Sep 2016 10:59:31 +0100 Message-Id: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> Sender: owner-linux-mm@kvack.org List-ID: To: LKML , Linux-MM , Mel Gorman Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko This is a follow-on series from the thread "[lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression" with active parties cc'd. I've pushed the series to git.kernel.org where the LKP robot should pick it up automatically. git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaim-contention-v1r15 The progression of this series has been unsatisfactory. Dave originally reported a problem with tree_lock contention and while it can be fixed by pushing reclaim to direct reclaim, it slows swap considerably and was not a universal win. This series is the best balance I've found so far between the swapping and large rewriter cases. I never reliably produced the same contentions that Dave did so testing is needed. 
Dave, ideally you would test patches 1+2 and patches 1+4, but a test of patches 1+3 would also be nice if you have the time. Minimally, I expect that patches 1+2 will help the swapping-to-fast-storage case (LKP to confirm independently) and may be worth considering on their own even if Dave's test case is not helped.

 drivers/block/brd.c |   1 +
 mm/vmscan.c         | 209 +++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 182 insertions(+), 28 deletions(-)

-- 
2.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f69.google.com (mail-lf0-f69.google.com [209.85.215.69]) by kanga.kvack.org (Postfix) with ESMTP id 1D4556B0253 for ; Fri, 9 Sep 2016 05:59:39 -0400 (EDT)
Received: by mail-lf0-f69.google.com with SMTP id u14so42335488lfd.0 for ; Fri, 09 Sep 2016 02:59:39 -0700 (PDT)
Received: from outbound-smtp06.blacknight.com (outbound-smtp06.blacknight.com.
[81.17.249.39]) by mx.google.com with ESMTPS id hr7si2232118wjb.178.2016.09.09.02.59.36 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 09 Sep 2016 02:59:36 -0700 (PDT) Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp06.blacknight.com (Postfix) with ESMTPS id 7F093989D7 for ; Fri, 9 Sep 2016 09:59:36 +0000 (UTC) From: Mel Gorman Subject: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Date: Fri, 9 Sep 2016 10:59:32 +0100 Message-Id: <1473415175-20807-2-git-send-email-mgorman@techsingularity.net> In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> Sender: owner-linux-mm@kvack.org List-ID: To: LKML , Linux-MM , Mel Gorman Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko Pages unmapped during reclaim acquire/release the mapping->tree_lock for every single page. There are two cases when it's likely that pages at the tail of the LRU share the same mapping -- large amounts of IO to/from a single file and swapping. This patch acquires the mapping->tree_lock for multiple page removals. To trigger heavy swapping, varying numbers of usemem instances were used to read anonymous memory larger than the physical memory size. A UMA machine was used with 4 fake NUMA nodes to increase interference from kswapd. The swap device was backed by ramdisk using the brd driver. NUMA balancing was disabled to limit interference. 
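
For readers who want the batching idea in isolation, the following is a minimal userspace sketch, not the kernel code from this patch. It assumes a hypothetical `struct mapping` with a C11 `atomic_flag` standing in for `mapping->tree_lock` and a hypothetical `struct item` standing in for a page on the reclaim list; every name in it is illustrative. It compares taking the lock once per item with taking it once per run of items that share a mapping.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space: one lock per mapping. */
struct mapping {
	atomic_flag lock;          /* plays the role of tree_lock */
	unsigned long nr_locks;    /* lock acquisitions, kept for comparison */
	unsigned long nr_removed;  /* entries removed while holding the lock */
};

/* Hypothetical stand-in for a page queued for removal. */
struct item {
	struct mapping *mapping;
	int removed;
};

static void mapping_lock(struct mapping *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
		;	/* spin until the flag is released */
	m->nr_locks++;
}

static void mapping_unlock(struct mapping *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* Unbatched: one lock/unlock round trip for every item. */
static unsigned long remove_each(struct item *items, size_t n)
{
	unsigned long removed = 0;

	for (size_t i = 0; i < n; i++) {
		struct mapping *m = items[i].mapping;

		assert(!items[i].removed);
		mapping_lock(m);
		items[i].removed = 1;
		m->nr_removed++;
		mapping_unlock(m);
		removed++;
	}
	return removed;
}

/*
 * Batched: take the lock of the first unprocessed item's mapping, remove
 * every remaining item that shares that mapping, drop the lock, then
 * repeat until nothing is left.
 */
static unsigned long remove_batched(struct item *items, size_t n)
{
	unsigned long removed = 0;
	size_t start = 0;

	while (start < n) {
		struct mapping *m;

		if (items[start].removed) {
			start++;
			continue;
		}

		m = items[start].mapping;
		mapping_lock(m);
		for (size_t i = start; i < n; i++) {
			if (items[i].removed || items[i].mapping != m)
				continue;
			items[i].removed = 1;
			m->nr_removed++;
			removed++;
		}
		mapping_unlock(m);
	}
	return removed;
}
```

With five items spread over two mappings, `remove_batched()` takes each lock once, while `remove_each()` takes a lock five times. The trade-off is the one raised in the thread: fewer acquisitions, but each hold time is longer.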
                              4.8.0-rc5             4.8.0-rc5
                                vanilla              batch-v1
Amean    System-1       260.53 (  0.00%)      192.98 ( 25.93%)
Amean    System-3       179.59 (  0.00%)      198.33 (-10.43%)
Amean    System-5       205.71 (  0.00%)      105.22 ( 48.85%)
Amean    System-7       146.46 (  0.00%)       97.79 ( 33.23%)
Amean    System-8       275.37 (  0.00%)      149.39 ( 45.75%)
Amean    Elapsd-1       292.89 (  0.00%)      219.95 ( 24.90%)
Amean    Elapsd-3        69.47 (  0.00%)       79.02 (-13.74%)
Amean    Elapsd-5        54.12 (  0.00%)       29.88 ( 44.79%)
Amean    Elapsd-7        34.28 (  0.00%)       24.06 ( 29.81%)
Amean    Elapsd-8        57.98 (  0.00%)       33.34 ( 42.50%)

System is system CPU usage and elapsed time is the time to complete the workload. Regardless of the thread count, the workload generally completes faster although there is a lot of variability as much more work is being done under a single lock.

xfs_io with pwrite was used to rewrite a file multiple times to measure any locking overhead reduction.

xfsio Time
                                                        4.8.0-rc5             4.8.0-rc5
                                                          vanilla           batch-v1r18
Amean    pwrite-single-rewrite-async-System       49.19 (  0.00%)       49.49 ( -0.60%)
Amean    pwrite-single-rewrite-async-Elapsd      322.87 (  0.00%)      322.72 (  0.05%)

Unfortunately the difference here is well within the noise as the workload is dominated by the cost of the IO. It may be the case that the benefit is noticeable on faster storage or in KVM instances where the data may be resident in the page cache of the host.

Signed-off-by: Mel Gorman
---
 mm/vmscan.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 142 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1e12a1ea9cf..f7beb573a594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -622,18 +622,47 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 }
 
 /*
+ * Finalise the mapping removal without the mapping lock held. The pages
+ * are placed on the free_list and the caller is expected to drop the
+ * final reference.
+ */ +static void finalise_remove_mapping_list(struct list_head *swapcache, + struct list_head *filecache, + void (*freepage)(struct page *), + struct list_head *free_list) +{ + struct page *page; + + list_for_each_entry(page, swapcache, lru) { + swp_entry_t swap = { .val = page_private(page) }; + swapcache_free(swap); + set_page_private(page, 0); + } + + list_for_each_entry(page, filecache, lru) + freepage(page); + + list_splice_init(swapcache, free_list); + list_splice_init(filecache, free_list); +} + +enum remove_mapping { + REMOVED_FAIL, + REMOVED_SWAPCACHE, + REMOVED_FILECACHE +}; + +/* * Same as remove_mapping, but if the page is removed from the mapping, it * gets returned with a refcount of 0. */ -static int __remove_mapping(struct address_space *mapping, struct page *page, - bool reclaimed) +static enum remove_mapping __remove_mapping(struct address_space *mapping, + struct page *page, bool reclaimed, + void (**freepage)(struct page *)) { - unsigned long flags; - BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - spin_lock_irqsave(&mapping->tree_lock, flags); /* * The non racy check for a busy page. * @@ -668,16 +697,17 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, } if (PageSwapCache(page)) { - swp_entry_t swap = { .val = page_private(page) }; + unsigned long swapval = page_private(page); + swp_entry_t swap = { .val = swapval }; mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page); - spin_unlock_irqrestore(&mapping->tree_lock, flags); - swapcache_free(swap); + set_page_private(page, swapval); + return REMOVED_SWAPCACHE; } else { - void (*freepage)(struct page *); void *shadow = NULL; - freepage = mapping->a_ops->freepage; + *freepage = mapping->a_ops->freepage; + /* * Remember a shadow entry for reclaimed file cache in * order to detect refaults, thus thrashing, later on. 
@@ -698,17 +728,76 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, !mapping_exiting(mapping) && !dax_mapping(mapping)) shadow = workingset_eviction(mapping, page); __delete_from_page_cache(page, shadow); - spin_unlock_irqrestore(&mapping->tree_lock, flags); + return REMOVED_FILECACHE; + } - if (freepage != NULL) - freepage(page); +cannot_free: + return REMOVED_FAIL; +} + +static unsigned long remove_mapping_list(struct list_head *mapping_list, + struct list_head *free_pages, + struct list_head *ret_pages) +{ + unsigned long flags; + struct address_space *mapping = NULL; + void (*freepage)(struct page *) = NULL; + LIST_HEAD(swapcache); + LIST_HEAD(filecache); + struct page *page, *tmp; + unsigned long nr_reclaimed = 0; + +continue_removal: + list_for_each_entry_safe(page, tmp, mapping_list, lru) { + /* Batch removals under one tree lock at a time */ + if (mapping && page_mapping(page) != mapping) + continue; + + list_del(&page->lru); + if (!mapping) { + mapping = page_mapping(page); + spin_lock_irqsave(&mapping->tree_lock, flags); + } + + switch (__remove_mapping(mapping, page, true, &freepage)) { + case REMOVED_FILECACHE: + /* + * At this point, we have no other references and there + * is no way to pick any more up (removed from LRU, + * removed from pagecache). Can use non-atomic bitops + * now (and we obviously don't have to worry about + * waking up a process waiting on the page lock, + * because there are no references. 
+ */ + __ClearPageLocked(page); + if (freepage) + list_add(&page->lru, &filecache); + else + list_add(&page->lru, free_pages); + nr_reclaimed++; + break; + case REMOVED_SWAPCACHE: + /* See FILECACHE case as to why non-atomic is safe */ + __ClearPageLocked(page); + list_add(&page->lru, &swapcache); + nr_reclaimed++; + break; + case REMOVED_FAIL: + unlock_page(page); + list_add(&page->lru, ret_pages); + } } - return 1; + if (mapping) { + spin_unlock_irqrestore(&mapping->tree_lock, flags); + finalise_remove_mapping_list(&swapcache, &filecache, freepage, free_pages); + mapping = NULL; + } -cannot_free: - spin_unlock_irqrestore(&mapping->tree_lock, flags); - return 0; + if (!list_empty(mapping_list)) + goto continue_removal; + + return nr_reclaimed; } /* @@ -719,16 +808,42 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, */ int remove_mapping(struct address_space *mapping, struct page *page) { - if (__remove_mapping(mapping, page, false)) { + unsigned long flags; + LIST_HEAD(swapcache); + LIST_HEAD(filecache); + void (*freepage)(struct page *) = NULL; + swp_entry_t swap; + int ret = 0; + + spin_lock_irqsave(&mapping->tree_lock, flags); + freepage = mapping->a_ops->freepage; + ret = __remove_mapping(mapping, page, false, &freepage); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + + if (ret != REMOVED_FAIL) { /* * Unfreezing the refcount with 1 rather than 2 effectively * drops the pagecache ref for us without requiring another * atomic operation. 
*/ page_ref_unfreeze(page, 1); + } + + switch (ret) { + case REMOVED_FILECACHE: + if (freepage) + freepage(page); + return 1; + case REMOVED_SWAPCACHE: + swap.val = page_private(page); + swapcache_free(swap); + set_page_private(page, 0); return 1; + case REMOVED_FAIL: + return 0; } - return 0; + + BUG(); } /** @@ -910,6 +1025,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, { LIST_HEAD(ret_pages); LIST_HEAD(free_pages); + LIST_HEAD(mapping_pages); int pgactivate = 0; unsigned long nr_unqueued_dirty = 0; unsigned long nr_dirty = 0; @@ -1206,17 +1322,14 @@ static unsigned long shrink_page_list(struct list_head *page_list, } lazyfree: - if (!mapping || !__remove_mapping(mapping, page, true)) + if (!mapping) goto keep_locked; - /* - * At this point, we have no other references and there is - * no way to pick any more up (removed from LRU, removed - * from pagecache). Can use non-atomic bitops now (and - * we obviously don't have to worry about waking up a process - * waiting on the page lock, because there are no references. - */ - __ClearPageLocked(page); + list_add(&page->lru, &mapping_pages); + if (ret == SWAP_LZFREE) + count_vm_event(PGLAZYFREED); + continue; + free_it: if (ret == SWAP_LZFREE) count_vm_event(PGLAZYFREED); @@ -1251,6 +1364,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page); } + nr_reclaimed += remove_mapping_list(&mapping_pages, &free_pages, &ret_pages); mem_cgroup_uncharge_list(&free_pages); try_to_unmap_flush(); free_hot_cold_page_list(&free_pages, true); -- 2.6.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f72.google.com (mail-wm0-f72.google.com [74.125.82.72]) by kanga.kvack.org (Postfix) with ESMTP id 027A56B025E for ; Fri, 9 Sep 2016 05:59:41 -0400 (EDT)
Received: by mail-wm0-f72.google.com with SMTP id 1so10314189wmz.2 for ; Fri, 09 Sep 2016 02:59:40 -0700 (PDT)
Received: from outbound-smtp07.blacknight.com (outbound-smtp07.blacknight.com. [46.22.139.12]) by mx.google.com with ESMTPS id ty2si2229195wjb.223.2016.09.09.02.59.37 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 09 Sep 2016 02:59:37 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp07.blacknight.com (Postfix) with ESMTPS id B61491C184B for ; Fri, 9 Sep 2016 10:59:36 +0100 (IST)
From: Mel Gorman
Subject: [PATCH 2/4] block, brd: Treat storage as non-rotational
Date: Fri, 9 Sep 2016 10:59:33 +0100
Message-Id: <1473415175-20807-3-git-send-email-mgorman@techsingularity.net>
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko

Unlike the rims of a punked out car, RAM does not spin. Ramdisk as implemented by the brd driver is treated as rotational storage. When used as swap to simulate fast storage, swap uses the algorithms for minimising seek times instead of the algorithms optimised for SSD. When the tree_lock contention was reduced by the previous patch, it was found that the workload was dominated by scan_swap_map(). This patch has no practical application as swap-on-ramdisk is dumb as rocks but it's trivial to fix.
                              4.8.0-rc5             4.8.0-rc5
                               batch-v1      ramdisknonrot-v1
Amean    System-1       192.98 (  0.00%)      181.00 (  6.21%)
Amean    System-3       198.33 (  0.00%)       86.19 ( 56.54%)
Amean    System-5       105.22 (  0.00%)       67.43 ( 35.91%)
Amean    System-7        97.79 (  0.00%)       89.55 (  8.42%)
Amean    System-8       149.39 (  0.00%)      102.92 ( 31.11%)
Amean    Elapsd-1       219.95 (  0.00%)      209.23 (  4.88%)
Amean    Elapsd-3        79.02 (  0.00%)       36.93 ( 53.26%)
Amean    Elapsd-5        29.88 (  0.00%)       19.52 ( 34.69%)
Amean    Elapsd-7        24.06 (  0.00%)       21.93 (  8.84%)
Amean    Elapsd-8        33.34 (  0.00%)       23.63 ( 29.12%)

Signed-off-by: Mel Gorman
---
 drivers/block/brd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..83a76a74e027 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -504,6 +504,7 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
 	brd->brd_queue->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com [74.125.82.71]) by kanga.kvack.org (Postfix) with ESMTP id 2A2116B0260 for ; Fri, 9 Sep 2016 05:59:43 -0400 (EDT)
Received: by mail-wm0-f71.google.com with SMTP id g141so10391326wmd.0 for ; Fri, 09 Sep 2016 02:59:43 -0700 (PDT)
Received: from outbound-smtp06.blacknight.com (outbound-smtp06.blacknight.com.
[81.17.249.39]) by mx.google.com with ESMTPS id n64si2225375wmn.41.2016.09.09.02.59.37 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 09 Sep 2016 02:59:37 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp06.blacknight.com (Postfix) with ESMTPS id DCC78989DE for ; Fri, 9 Sep 2016 09:59:36 +0000 (UTC)
From: Mel Gorman
Subject: [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock
Date: Fri, 9 Sep 2016 10:59:34 +0100
Message-Id: <1473415175-20807-4-git-send-email-mgorman@techsingularity.net>
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko

If there is a large reader/writer, it's possible for multiple kswapd instances and the processes issuing IO to contend on a single mapping->tree_lock. This patch will cause all kswapd instances except one to back off if contending on tree_lock. A sleeping kswapd instance will be woken when one has made progress.

                              4.8.0-rc5             4.8.0-rc5
                       ramdisknonrot-v1          waitqueue-v1
Min      Elapsd-8        18.31 (  0.00%)       28.32 (-54.67%)
Amean    System-1       181.00 (  0.00%)      179.61 (  0.77%)
Amean    System-3        86.19 (  0.00%)       68.91 ( 20.05%)
Amean    System-5        67.43 (  0.00%)       93.09 (-38.05%)
Amean    System-7        89.55 (  0.00%)       90.98 ( -1.60%)
Amean    System-8       102.92 (  0.00%)      299.81 (-191.30%)
Amean    Elapsd-1       209.23 (  0.00%)      210.41 ( -0.57%)
Amean    Elapsd-3        36.93 (  0.00%)       33.89 (  8.25%)
Amean    Elapsd-5        19.52 (  0.00%)       25.19 (-29.08%)
Amean    Elapsd-7        21.93 (  0.00%)       18.45 ( 15.88%)
Amean    Elapsd-8        23.63 (  0.00%)       48.80 (-106.51%)

Note that, unlike the previous patches, this is not an unconditional win. System CPU usage is generally higher because direct reclaim is used instead of multiple competing kswapd instances.
According to the stats, there is 10 times more direct reclaim scanning and reclaim activity and overall the workload takes longer to complete.

                  4.8.0-rc5         4.8.0-rc5
           ramdisknonrot-v1      waitqueue-v1
User                 473.24            462.40
System              3690.20           5127.32
Elapsed             2186.05           2364.08

The motivation for this patch was Dave Chinner reporting that an xfs_io workload rewriting a single file spent a significant amount of time spinning on the tree_lock. Local tests were inconclusive. On spinning storage, the IO was so slow that the contention was not noticeable. When xfs_io is backed by ramdisk to simulate fast storage then it can be observed:

                                                        4.8.0-rc5             4.8.0-rc5
                                                 ramdisknonrot-v1          waitqueue-v1
Min      pwrite-single-rewrite-async-System        3.12 (  0.00%)        3.06 (  1.92%)
Min      pwrite-single-rewrite-async-Elapsd        3.25 (  0.00%)        3.17 (  2.46%)
Amean    pwrite-single-rewrite-async-System        3.32 (  0.00%)        3.23 (  2.67%)
Amean    pwrite-single-rewrite-async-Elapsd        3.42 (  0.00%)        3.33 (  2.71%)

                  4.8.0-rc5         4.8.0-rc5
           ramdisknonrot-v1      waitqueue-v1
User                   9.06              8.76
System               402.67            392.31
Elapsed              416.91            406.29

That's roughly a 2.5% drop in CPU usage overall. A test from Dave Chinner with some data to support/reject this patch is highly desirable.
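
The ownership hand-off described above (one kswapd claims the exclusive slot, the others back off, and the owner releases the slot once it has made progress) can be sketched in userspace with C11 atomics. This is only an illustration under stated assumptions, not the kernel code in the diff below: `NO_NODE`, `exclusive_node`, `must_back_off()` and `release_exclusive()` are hypothetical names, and `atomic_compare_exchange_strong()` stands in for the kernel's `cmpxchg()`. The 10ms sleep and the wait queue are deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_NODE (-1L)	/* stands in for NUMA_NO_NODE */

/* Node id of the one reclaimer allowed to keep contending, or NO_NODE. */
static _Atomic long exclusive_node = NO_NODE;

/*
 * Called by a reclaimer that lost a trylock on the contended lock.
 * Returns true if the caller should back off because a different node
 * already owns the exclusive slot, false if the caller owns it (either
 * already, or because it just claimed it).
 */
static bool must_back_off(long nid)
{
	long expected = NO_NODE;

	if (atomic_load(&exclusive_node) == nid)
		return false;

	/* Try to claim the free slot; on failure 'expected' holds the owner. */
	if (atomic_compare_exchange_strong(&exclusive_node, &expected, nid))
		return false;

	return expected != nid;
}

/*
 * Called once a node has made progress. Returns true if this node was
 * the owner and the slot is now free, which is the point at which any
 * sleepers would be woken.
 */
static bool release_exclusive(long nid)
{
	long expected = nid;

	return atomic_compare_exchange_strong(&exclusive_node, &expected, NO_NODE);
}
```

A caller for which `must_back_off()` returns true would sleep briefly and then take the lock unconditionally; only the owner's `release_exclusive()` call succeeds, so a non-owner cannot free the slot by accident.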
Signed-off-by: Mel Gorman --- mm/vmscan.c | 32 +++++++++++++++++++++++++++++++- 1 file changed, 31 insertions(+), 1 deletion(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index f7beb573a594..936070b0790e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -735,6 +735,9 @@ static enum remove_mapping __remove_mapping(struct address_space *mapping, return REMOVED_FAIL; } +static unsigned long kswapd_exclusive = NUMA_NO_NODE; +static DECLARE_WAIT_QUEUE_HEAD(kswapd_contended_wait); + static unsigned long remove_mapping_list(struct list_head *mapping_list, struct list_head *free_pages, struct list_head *ret_pages) @@ -755,8 +758,28 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list, list_del(&page->lru); if (!mapping) { + pg_data_t *pgdat = page_pgdat(page); mapping = page_mapping(page); - spin_lock_irqsave(&mapping->tree_lock, flags); + + /* Account for trylock contentions in kswapd */ + if (!current_is_kswapd() || + pgdat->node_id == kswapd_exclusive) { + spin_lock_irqsave(&mapping->tree_lock, flags); + } else { + /* Account for contended pages and contended kswapds */ + if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) { + /* Stall kswapd once for 10ms on contention */ + if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) { + DEFINE_WAIT(wait); + prepare_to_wait(&kswapd_contended_wait, + &wait, TASK_INTERRUPTIBLE); + io_schedule_timeout(HZ/100); + finish_wait(&kswapd_contended_wait, &wait); + } + + spin_lock_irqsave(&mapping->tree_lock, flags); + } + } } switch (__remove_mapping(mapping, page, true, &freepage)) { @@ -3212,6 +3235,7 @@ static void age_active_anon(struct pglist_data *pgdat, static bool zone_balanced(struct zone *zone, int order, int classzone_idx) { unsigned long mark = high_wmark_pages(zone); + unsigned long nid; if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx)) return false; @@ -3223,6 +3247,12 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx) 
 	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 
+	nid = zone->zone_pgdat->node_id;
+	if (nid == kswapd_exclusive) {
+		cmpxchg(&kswapd_exclusive, nid, NUMA_NO_NODE);
+		wake_up_interruptible(&kswapd_contended_wait);
+	}
+
 	return true;
 }
 
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f72.google.com (mail-lf0-f72.google.com [209.85.215.72]) by kanga.kvack.org (Postfix) with ESMTP id 67CD86B0261 for ; Fri, 9 Sep 2016 05:59:45 -0400 (EDT)
Received: by mail-lf0-f72.google.com with SMTP id s64so42336659lfs.1 for ; Fri, 09 Sep 2016 02:59:45 -0700 (PDT)
Received: from outbound-smtp05.blacknight.com (outbound-smtp05.blacknight.com. [81.17.249.38]) by mx.google.com with ESMTPS id ty2si2229214wjb.223.2016.09.09.02.59.37 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 09 Sep 2016 02:59:37 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp05.blacknight.com (Postfix) with ESMTPS id 16111989D7 for ; Fri, 9 Sep 2016 09:59:37 +0000 (UTC)
From: Mel Gorman
Subject: [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention
Date: Fri, 9 Sep 2016 10:59:35 +0100
Message-Id: <1473415175-20807-5-git-send-email-mgorman@techsingularity.net>
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko

If a heavy writer of a single file is forcing contention on the tree_lock then it may be necessary to temporarily stall the direct writer to allow kswapd to make progress.
This patch marks a pgdat congested if tree_lock is being contended on the tail of the LRU.

On a swap-intensive workload to ramdisk, the following is observed:

usemem
                              4.8.0-rc5             4.8.0-rc5
                           waitqueue-v1      directcongest-v1
Amean    System-1       179.61 (  0.00%)      202.21 (-12.58%)
Amean    System-3        68.91 (  0.00%)      105.14 (-52.59%)
Amean    System-5        93.09 (  0.00%)       80.98 ( 13.01%)
Amean    System-7        90.98 (  0.00%)       81.07 ( 10.90%)
Amean    System-8       299.81 (  0.00%)      227.08 ( 24.26%)
Amean    Elapsd-1       210.41 (  0.00%)      236.56 (-12.43%)
Amean    Elapsd-3        33.89 (  0.00%)       46.78 (-38.06%)
Amean    Elapsd-5        25.19 (  0.00%)       23.33 (  7.38%)
Amean    Elapsd-7        18.45 (  0.00%)       17.18 (  6.91%)
Amean    Elapsd-8        48.80 (  0.00%)       38.09 ( 21.93%)

Note that system CPU usage is reduced for high thread counts but it is not a universal win and it's known to be highly variable. The overall time stats look like:

                  4.8.0-rc5         4.8.0-rc5
               waitqueue-v1  directcongest-v1
User                 462.40            468.18
System              5127.32           4875.92
Elapsed             2364.08           2539.77

It takes longer to complete but uses less system CPU. The benefit is more noticeable with xfs_io rewriting a file backed by ramdisk:

                                                        4.8.0-rc5             4.8.0-rc5
                                                  waitqueue-v1r24   directcongest-v1r24
Amean    pwrite-single-rewrite-async-System        3.23 (  0.00%)        3.21 (  0.80%)
Amean    pwrite-single-rewrite-async-Elapsd        3.33 (  0.00%)        3.31 (  0.67%)

                  4.8.0-rc5         4.8.0-rc5
               waitqueue-v1  directcongest-v1
User                   8.76              9.25
System               392.31            389.10
Elapsed              406.29            403.74

As with the previous patch, a test from Dave Chinner would be necessary to decide whether this patch is worthwhile. It seems reasonable to favour workloads that are heavily writing files over those that are heavily swapping, as the former situation is normal and reasonable while the latter situation will never be optimal.
Signed-off-by: Mel Gorman --- mm/vmscan.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/mm/vmscan.c b/mm/vmscan.c index 936070b0790e..953df97abe0c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -771,6 +771,15 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list, /* Stall kswapd once for 10ms on contention */ if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) { DEFINE_WAIT(wait); + + /* + * Tag the pgdat as congested as it may + * indicate contention with a heavy + * writer that should stall on + * wait_iff_congested. + */ + set_bit(PGDAT_CONGESTED, &pgdat->flags); + prepare_to_wait(&kswapd_contended_wait, &wait, TASK_INTERRUPTIBLE); io_schedule_timeout(HZ/100); -- 2.6.4 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qk0-f200.google.com (mail-qk0-f200.google.com [209.85.220.200]) by kanga.kvack.org (Postfix) with ESMTP id DBD336B0069 for ; Fri, 9 Sep 2016 11:31:49 -0400 (EDT) Received: by mail-qk0-f200.google.com with SMTP id w204so154911381qka.3 for ; Fri, 09 Sep 2016 08:31:49 -0700 (PDT) Received: from mail-it0-x236.google.com (mail-it0-x236.google.com. 
[2607:f8b0:4001:c0b::236]) by mx.google.com with ESMTPS id j64si4456479ita.81.2016.09.09.08.31.28 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 09 Sep 2016 08:31:30 -0700 (PDT)
Received: by mail-it0-x236.google.com with SMTP id i184so19759924itf.1 for ; Fri, 09 Sep 2016 08:31:28 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
From: Linus Torvalds
Date: Fri, 9 Sep 2016 08:31:27 -0700
Message-ID:
Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: LKML , Linux-MM , Dave Chinner , Ying Huang , Michal Hocko

On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote:
>
> The progression of this series has been unsatisfactory.

Yeah, I have to say that I particularly don't like patch #1. It's some
rather nasty complexity for dubious gains, and holding the lock for
longer times might have downsides.

And the numbers seem to not necessarily be in favor of patch #3 either, which I would have otherwise been predisposed to like (ie it looks fairly targeted and not very complex).

#2 seems trivially correct but largely irrelevant.

So I think this series is one of those "we need to find that it makes
a big positive impact" to make sense.

Linus

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-lf0-f70.google.com (mail-lf0-f70.google.com [209.85.215.70]) by kanga.kvack.org (Postfix) with ESMTP id 166AB6B0069 for ; Fri, 9 Sep 2016 12:19:12 -0400 (EDT)
Received: by mail-lf0-f70.google.com with SMTP id n4so40175448lfb.3 for ; Fri, 09 Sep 2016 09:19:12 -0700 (PDT)
Received: from outbound-smtp10.blacknight.com (outbound-smtp10.blacknight.com. [46.22.139.15]) by mx.google.com with ESMTPS id d191si3600324wme.111.2016.09.09.09.19.10 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 09 Sep 2016 09:19:10 -0700 (PDT)
Received: from mail.blacknight.com (pemlinmail01.blacknight.ie [81.17.254.10]) by outbound-smtp10.blacknight.com (Postfix) with ESMTPS id 578801C18B7 for ; Fri, 9 Sep 2016 17:19:10 +0100 (IST)
Date: Fri, 9 Sep 2016 17:19:08 +0100
From: Mel Gorman
Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
Message-ID: <20160909161908.GG8119@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linus Torvalds
Cc: LKML , Linux-MM , Dave Chinner , Ying Huang , Michal Hocko

On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote:
> >
> > The progression of this series has been unsatisfactory.
>
> Yeah, I have to say that I particularly don't like patch #1.

There aren't many ways to make it prettier. Making it nicer is partially
hindered by the fact that tree_lock is IRQ-safe for IO completions but
even if that was addressed there might be lock ordering issues.

> It's some
> rather nasty complexity for dubious gains, and holding the lock for
> longer times might have downsides.
>

Kswapd reclaim would delay a parallel truncation for example.
Doubtful it matters but the possibility is there.

The gain in swapping is nice but ramdisk is excessively artificial. It might
matter if someone reported it made a big difference swapping to faster
storage like SSD or NVMe although the cases where fast swap is important
are few -- overcommitted host with multiple idle VMs with a new active VM
starting is the only one that springs to mind.

> So I think this series is one of those "we need to find that it makes
> a big positive impact" to make sense.
>

Agreed. I don't mind leaving it on the back burner unless Dave reports
it really helps or a new bug report about realistic tree_lock contention
shows up.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pf0-f200.google.com (mail-pf0-f200.google.com [209.85.192.200]) by kanga.kvack.org (Postfix) with ESMTP id 9931E6B0069 for ; Fri, 9 Sep 2016 14:16:38 -0400 (EDT)
Received: by mail-pf0-f200.google.com with SMTP id v67so200631616pfv.1 for ; Fri, 09 Sep 2016 11:16:38 -0700 (PDT)
Received: from mga01.intel.com (mga01.intel.com.
[192.55.52.88]) by mx.google.com with ESMTPS id o3si4963592pfb.55.2016.09.09.11.16.37 for (version=TLS1 cipher=AES128-SHA bits=128/128); Fri, 09 Sep 2016 11:16:37 -0700 (PDT) From: "Huang\, Ying" Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> <20160909161908.GG8119@techsingularity.net> Date: Fri, 09 Sep 2016 11:16:35 -0700 In-Reply-To: <20160909161908.GG8119@techsingularity.net> (Mel Gorman's message of "Fri, 9 Sep 2016 17:19:08 +0100") Message-ID: <8760q52b24.fsf@yhuang-mobile.sh.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=ascii Sender: owner-linux-mm@kvack.org List-ID: To: Mel Gorman Cc: Linus Torvalds , LKML , Linux-MM , Dave Chinner , Ying Huang , Michal Hocko , "Tim C. Chen" , Dave Hansen , Andi Kleen Mel Gorman writes: > On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote: >> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote: >> > >> > The progression of this series has been unsatisfactory. >> >> Yeah, I have to say that I particularly don't like patch #1. > > There isn't many ways to make it prettier. Making it nicer is partially > hindered by the fact that tree_lock is IRQ-safe for IO completions but > even if that was addressed there might be lock ordering issues. > >> It's some >> rather nasty complexity for dubious gains, and holding the lock for >> longer times might have downsides. >> > > Kswapd reclaim would delay a parallel truncation for example. Doubtful it > matters but the possibility is there. > > The gain in swapping is nice but ramdisk is excessively artifical. It might > matter if someone reported it made a big difference swapping to faster > storage like SSD or NVMe although the cases where fast swap is important > are few -- overcommitted host with multiple idle VMs with a new active VM > starting is the only one that springs to mind. I will try to provide some data for the NVMe disk. 
I think the trend is that the performance of the disk is increasing fast
and will continue in the near future at least. We found we cannot
saturate the latest NVMe disk when swapping because of locking issues in
the swap and page reclaim paths.

The swap usage problem could be a "Chicken and Egg" problem. Because
swap performance is poor, nobody uses swap, and because nobody uses swap,
nobody works on improving the performance of swap. With faster and
faster storage devices, swap could be more popular in the future if we
optimize its performance to catch up with the performance of the storage.

>> So I think this series is one of those "we need to find that it makes
>> a big positive impact" to make sense.
>>
>
> Agreed. I don't mind leaving it on the back burner unless Dave reports
> it really helps or a new bug report about realistic tree_lock contention
> shows up.

Best Regards,
Huang, Ying

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-it0-f71.google.com (mail-it0-f71.google.com [209.85.214.71]) by kanga.kvack.org (Postfix) with ESMTP id 6D7576B0069 for ; Fri, 16 Sep 2016 09:25:13 -0400 (EDT)
Received: by mail-it0-f71.google.com with SMTP id u18so63522769ita.2 for ; Fri, 16 Sep 2016 06:25:13 -0700 (PDT)
Received: from merlin.infradead.org (merlin.infradead.org.
[2001:4978:20e::2]) by mx.google.com with ESMTPS id k77si13074559iod.60.2016.09.16.06.25.12 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 16 Sep 2016 06:25:12 -0700 (PDT)
Date: Fri, 16 Sep 2016 15:25:06 +0200
From: Peter Zijlstra
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
Message-ID: <20160916132506.GB5035@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> <1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: LKML , Linux-MM , Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko

On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> every single page. There are two cases when it's likely that pages at the
> tail of the LRU share the same mapping -- large amounts of IO to/from a
> single file and swapping. This patch acquires the mapping->tree_lock for
> multiple page removals.

So, once upon a time, in a galaxy far away,.. I did a concurrent
pagecache patch set that replaced the tree_lock with a per page
bit-spinlock and fine grained locking in the radix tree.

I know the mm has changed quite a bit since, but would such an approach
still be feasible?

I cannot seem to find an online reference to a 'complete' version of
that patch set, but I did find the OLS paper on it and I did find some
copies on my local machines.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f72.google.com (mail-pa0-f72.google.com [209.85.220.72]) by kanga.kvack.org (Postfix) with ESMTP id 13F006B0253 for ; Fri, 16 Sep 2016 10:07:12 -0400 (EDT)
Received: by mail-pa0-f72.google.com with SMTP id fu14so87328844pad.0 for ; Fri, 16 Sep 2016 07:07:12 -0700 (PDT)
Received: from bombadil.infradead.org (bombadil.infradead.org. [2001:1868:205::9]) by mx.google.com with ESMTPS id m66si45316548pfc.281.2016.09.16.07.07.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 16 Sep 2016 07:07:11 -0700 (PDT)
Date: Fri, 16 Sep 2016 16:07:07 +0200
From: Peter Zijlstra
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
Message-ID: <20160916140707.GI5020@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> <1473415175-20807-2-git-send-email-mgorman@techsingularity.net> <20160916132506.GB5035@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160916132506.GB5035@twins.programming.kicks-ass.net>
Sender: owner-linux-mm@kvack.org
List-ID:
To: Mel Gorman
Cc: LKML , Linux-MM , Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko

On Fri, Sep 16, 2016 at 03:25:06PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> > Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> > every single page. There are two cases when it's likely that pages at the
> > tail of the LRU share the same mapping -- large amounts of IO to/from a
> > single file and swapping. This patch acquires the mapping->tree_lock for
> > multiple page removals.
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page
> bit-spinlock and fine grained locking in the radix tree.
>
> I know the mm has changed quite a bit since, but would such an approach
> still be feasible?
>
> I cannot seem to find an online reference to a 'complete' version of
> that patch set, but I did find the OLS paper on it and I did find some
> copies on my local machines.

https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-it0-f71.google.com (mail-it0-f71.google.com [209.85.214.71]) by kanga.kvack.org (Postfix) with ESMTP id BBAEA6B0069 for ; Fri, 16 Sep 2016 14:33:01 -0400 (EDT)
Received: by mail-it0-f71.google.com with SMTP id u18so84732319ita.2 for ; Fri, 16 Sep 2016 11:33:01 -0700 (PDT)
Received: from mail-oi0-x22d.google.com (mail-oi0-x22d.google.com.
[2607:f8b0:4003:c06::22d]) by mx.google.com with ESMTPS id l3si12027069otd.292.2016.09.16.11.33.00 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 16 Sep 2016 11:33:00 -0700 (PDT)
Received: by mail-oi0-x22d.google.com with SMTP id q188so121794747oia.3 for ; Fri, 16 Sep 2016 11:33:00 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <20160916132506.GB5035@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> <1473415175-20807-2-git-send-email-mgorman@techsingularity.net> <20160916132506.GB5035@twins.programming.kicks-ass.net>
From: Linus Torvalds
Date: Fri, 16 Sep 2016 11:33:00 -0700
Message-ID:
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
List-ID:
To: Peter Zijlstra
Cc: Mel Gorman , LKML , Linux-MM , Dave Chinner , Ying Huang , Michal Hocko

On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra wrote:
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page
> bit-spinlock and fine grained locking in the radix tree.

I'd love to see the patch for that. I'd be a bit worried about extra
locking in the trivial cases (ie multi-level locking when we now take
just the single mapping lock), but if there is some smart reason why
that doesn't happen, then..

Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mail-pa0-f70.google.com (mail-pa0-f70.google.com [209.85.220.70]) by kanga.kvack.org (Postfix) with ESMTP id 485BE6B0069 for ; Fri, 16 Sep 2016 21:36:12 -0400 (EDT)
Received: by mail-pa0-f70.google.com with SMTP id mi5so182608208pab.2 for ; Fri, 16 Sep 2016 18:36:12 -0700 (PDT)
Received: from bombadil.infradead.org (bombadil.infradead.org. [2001:1868:205::9]) by mx.google.com with ESMTPS id s86si6685937pfd.23.2016.09.16.18.36.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 16 Sep 2016 18:36:11 -0700 (PDT)
Date: Sat, 17 Sep 2016 03:36:06 +0200
From: Peter Zijlstra
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
Message-ID: <20160917013606.GM5016@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net> <1473415175-20807-2-git-send-email-mgorman@techsingularity.net> <20160916132506.GB5035@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-ID:
To: Linus Torvalds
Cc: Mel Gorman , LKML , Linux-MM , Dave Chinner , Ying Huang , Michal Hocko

On Fri, Sep 16, 2016 at 11:33:00AM -0700, Linus Torvalds wrote:
> On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra wrote:
> >
> > So, once upon a time, in a galaxy far away,.. I did a concurrent
> > pagecache patch set that replaced the tree_lock with a per page
> > bit-spinlock and fine grained locking in the radix tree.
>
> I'd love to see the patch for that. I'd be a bit worried about extra
> locking in the trivial cases (ie multi-level locking when we now take
> just the single mapping lock), but if there is some smart reason why
> that doesn't happen, then..

On average we'll likely take a few more locks, but it's not as bad as
having to take the whole tree depth every time, or even touching the root
lock most times.

There are two cases. In the first, the modification is only done on a
single node (like insert). Here we do an RCU lookup of the node, lock it,
verify the node is still correct, do the modification and unlock, done.

In the second case, the modification needs to then back up the tree (like
setting/clearing tags, delete). For this case we can determine on our way
down where the first node is that we need to modify, lock that, verify,
and then lock all nodes down to the last. i.e. we lock a partial path.

I can send you the 2.6.31 patches if you're interested, but if you want
something that applies to a kernel from this decade I'll have to go
rewrite them which will take a wee bit of time :-) Both the radix tree
code and the mm have changed somewhat.
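
For readers unfamiliar with the pattern, the first case described above (look up the node without a lock, lock it, verify it, modify, unlock) could look roughly like the following userspace C11 sketch. This is an illustration under stated assumptions, not code from the patch set mentioned in the mail: `struct node`, `struct tree`, `FANOUT`, `insert_item()` and `remove_leaf()` are hypothetical names, an acquire load stands in for the RCU lookup, a test-and-set flag stands in for the per-node bit-spinlock, and nodes are never freed, so no grace-period handling is shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FANOUT 4	/* hypothetical number of slots per node */

struct node {
	atomic_flag lock;	/* per-node lock instead of one tree-wide lock */
	bool dead;		/* set under the lock once the node is unlinked */
	void *slot[FANOUT];
};

struct tree {
	/* One level only, for brevity: root slots point at leaf nodes. */
	_Atomic(struct node *) leaf[FANOUT];
};

static void node_lock(struct node *n)
{
	while (atomic_flag_test_and_set_explicit(&n->lock, memory_order_acquire))
		;	/* spin */
}

static void node_unlock(struct node *n)
{
	atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/* Case 1: a modification confined to a single node, such as an insert. */
static bool insert_item(struct tree *t, unsigned int index, void *item)
{
	unsigned int l = index / FANOUT, s = index % FANOUT;

	assert(item != NULL);	/* NULL marks an empty slot in this sketch */
	if (l >= FANOUT)
		return false;

	for (;;) {
		/* Lockless lookup; stands in for the RCU lookup. */
		struct node *n = atomic_load_explicit(&t->leaf[l],
						      memory_order_acquire);
		if (!n)
			return false;	/* the sketch does not allocate nodes */

		node_lock(n);
		/* Verify the node is still the one reachable from the tree. */
		if (n->dead || atomic_load_explicit(&t->leaf[l],
						    memory_order_acquire) != n) {
			node_unlock(n);
			continue;	/* raced with a removal: retry */
		}
		n->slot[s] = item;
		node_unlock(n);
		return true;
	}
}

/* Unlink a leaf so that concurrent inserters fail their verification. */
static void remove_leaf(struct tree *t, unsigned int l)
{
	struct node *n = atomic_load_explicit(&t->leaf[l], memory_order_acquire);

	if (!n)
		return;
	node_lock(n);
	if (!n->dead) {
		n->dead = true;
		atomic_store_explicit(&t->leaf[l], NULL, memory_order_release);
	}
	node_unlock(n);
}
```

The point of the verify step is that the lookup and the lock are not atomic with respect to each other, so the node may have been unlinked in between; only the single node touched is ever locked, which is why the root is not contended in this case.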

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753626AbcIIJ7k (ORCPT ); Fri, 9 Sep 2016 05:59:40 -0400
Received: from outbound-smtp04.blacknight.com ([81.17.249.35]:38986 "EHLO outbound-smtp04.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751091AbcIIJ7i (ORCPT ); Fri, 9 Sep 2016 05:59:38 -0400
From: Mel Gorman
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko
Subject: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
Date: Fri, 9 Sep 2016 10:59:31 +0100
Message-Id: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.6.4
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

This is a follow-on series from the thread "[lkp] [xfs] 68a9f5e700:
aim7.jobs-per-min -13.6% regression" with active parties cc'd. I've
pushed the series to git.kernel.org where the LKP robot should pick it
up automatically.

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaim-contention-v1r15

The progression of this series has been unsatisfactory. Dave originally
reported a problem with tree_lock contention and while it can be fixed by
pushing reclaim to direct reclaim, it slows swap considerably and was not
a universal win. This series is the best balance I've found so far between
the swapping and large rewriter cases.

I never reliably produced the same contentions that Dave did so testing
is needed. Dave, ideally you would test patches 1+2 and patches 1+4 but
a test of patches 1+3 would also be nice if you have the time. Minimally,
I'm expecting that patches 1+2 will help the swapping-to-fast-storage case
(LKP to confirm independently) and may be worth considering on their own
even if Dave's test case is not helped.

 drivers/block/brd.c |   1 +
 mm/vmscan.c         | 209 +++++++++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 182 insertions(+), 28 deletions(-)

-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753932AbcIIJ74 (ORCPT ); Fri, 9 Sep 2016 05:59:56 -0400
Received: from outbound-smtp10.blacknight.com ([46.22.139.15]:44735 "EHLO outbound-smtp10.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751120AbcIIJ7j (ORCPT ); Fri, 9 Sep 2016 05:59:39 -0400
From: Mel Gorman
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko
Subject: [PATCH 2/4] block, brd: Treat storage as non-rotational
Date: Fri, 9 Sep 2016 10:59:33 +0100
Message-Id: <1473415175-20807-3-git-send-email-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.6.4
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Unlike the rims of a punked out car, RAM does not spin. Ramdisk as
implemented by the brd driver is treated as rotational storage. When used
as swap to simulate fast storage, swap uses the algorithms for minimising
seek times instead of the algorithms optimised for SSD. When the tree_lock
contention was reduced by the previous patch, it was found that the
workload was dominated by scan_swap_map(). This patch has no practical
application as swap-on-ramdisk is dumb as rocks but it's trivial to fix.

                           4.8.0-rc5             4.8.0-rc5
                            batch-v1      ramdisknonrot-v1
Amean    System-1      192.98 (  0.00%)      181.00 (  6.21%)
Amean    System-3      198.33 (  0.00%)       86.19 ( 56.54%)
Amean    System-5      105.22 (  0.00%)       67.43 ( 35.91%)
Amean    System-7       97.79 (  0.00%)       89.55 (  8.42%)
Amean    System-8      149.39 (  0.00%)      102.92 ( 31.11%)
Amean    Elapsd-1      219.95 (  0.00%)      209.23 (  4.88%)
Amean    Elapsd-3       79.02 (  0.00%)       36.93 ( 53.26%)
Amean    Elapsd-5       29.88 (  0.00%)       19.52 ( 34.69%)
Amean    Elapsd-7       24.06 (  0.00%)       21.93 (  8.84%)
Amean    Elapsd-8       33.34 (  0.00%)       23.63 ( 29.12%)

Signed-off-by: Mel Gorman
---
 drivers/block/brd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..83a76a74e027 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -504,6 +504,7 @@ static struct brd_device *brd_alloc(int i)
 	blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
 	brd->brd_queue->limits.discard_zeroes_data = 1;
 	queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
+	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
 	queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753963AbcIIJ75 (ORCPT ); Fri, 9 Sep 2016 05:59:57 -0400
Received: from outbound-smtp03.blacknight.com ([81.17.249.16]:38583 "EHLO outbound-smtp03.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750839AbcIIJ7i (ORCPT ); Fri, 9 Sep 2016 05:59:38 -0400
From: Mel Gorman
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko
Subject: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
Date: Fri, 9 Sep 2016 10:59:32 +0100
Message-Id: <1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.6.4
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References:
<1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Pages unmapped during reclaim acquire/release the mapping->tree_lock for
every single page. There are two cases when it's likely that pages at the
tail of the LRU share the same mapping -- large amounts of IO to/from a
single file and swapping. This patch acquires the mapping->tree_lock for
multiple page removals.

To trigger heavy swapping, varying numbers of usemem instances were used to
read anonymous memory larger than the physical memory size. A UMA machine
was used with 4 fake NUMA nodes to increase interference from kswapd. The
swap device was backed by ramdisk using the brd driver. NUMA balancing was
disabled to limit interference.

                           4.8.0-rc5             4.8.0-rc5
                             vanilla              batch-v1
Amean    System-1      260.53 (  0.00%)      192.98 ( 25.93%)
Amean    System-3      179.59 (  0.00%)      198.33 (-10.43%)
Amean    System-5      205.71 (  0.00%)      105.22 ( 48.85%)
Amean    System-7      146.46 (  0.00%)       97.79 ( 33.23%)
Amean    System-8      275.37 (  0.00%)      149.39 ( 45.75%)
Amean    Elapsd-1      292.89 (  0.00%)      219.95 ( 24.90%)
Amean    Elapsd-3       69.47 (  0.00%)       79.02 (-13.74%)
Amean    Elapsd-5       54.12 (  0.00%)       29.88 ( 44.79%)
Amean    Elapsd-7       34.28 (  0.00%)       24.06 ( 29.81%)
Amean    Elapsd-8       57.98 (  0.00%)       33.34 ( 42.50%)

System is system CPU usage and elapsed time is the time to complete the
workload. Regardless of the thread count, the workload generally completes
faster although there is a lot of variability as much more work is being
done under a single lock.

xfs_io and pwrite were used to rewrite a file multiple times to measure any
locking overhead reduction.

xfsio Time
                                                        4.8.0-rc5             4.8.0-rc5
                                                          vanilla           batch-v1r18
Amean    pwrite-single-rewrite-async-System         49.19 (  0.00%)       49.49 ( -0.60%)
Amean    pwrite-single-rewrite-async-Elapsd        322.87 (  0.00%)      322.72 (  0.05%)

Unfortunately the difference here is well within the noise as the workload
is dominated by the cost of the IO.
It may be the case that the benefit is noticeable on faster storage or in
KVM instances where the data may be resident in the page cache of the host.

Signed-off-by: Mel Gorman
---
 mm/vmscan.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 142 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1e12a1ea9cf..f7beb573a594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -622,18 +622,47 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 }
 
 /*
+ * Finalise the mapping removal without the mapping lock held. The pages
+ * are placed on the free_list and the caller is expected to drop the
+ * final reference.
+ */
+static void finalise_remove_mapping_list(struct list_head *swapcache,
+				struct list_head *filecache,
+				void (*freepage)(struct page *),
+				struct list_head *free_list)
+{
+	struct page *page;
+
+	list_for_each_entry(page, swapcache, lru) {
+		swp_entry_t swap = { .val = page_private(page) };
+		swapcache_free(swap);
+		set_page_private(page, 0);
+	}
+
+	list_for_each_entry(page, filecache, lru)
+		freepage(page);
+
+	list_splice_init(swapcache, free_list);
+	list_splice_init(filecache, free_list);
+}
+
+enum remove_mapping {
+	REMOVED_FAIL,
+	REMOVED_SWAPCACHE,
+	REMOVED_FILECACHE
+};
+
+/*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-			    bool reclaimed)
+static enum remove_mapping __remove_mapping(struct address_space *mapping,
+				struct page *page, bool reclaimed,
+				void (**freepage)(struct page *))
 {
-	unsigned long flags;
-
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	spin_lock_irqsave(&mapping->tree_lock, flags);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -668,16 +697,17 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	}
 
 	if (PageSwapCache(page)) {
-		swp_entry_t swap = { .val = page_private(page) };
+		unsigned long swapval = page_private(page);
+		swp_entry_t swap = { .val = swapval };
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		swapcache_free(swap);
+		set_page_private(page, swapval);
+		return REMOVED_SWAPCACHE;
 	} else {
-		void (*freepage)(struct page *);
 		void *shadow = NULL;
 
-		freepage = mapping->a_ops->freepage;
+		*freepage = mapping->a_ops->freepage;
+
 		/*
 		 * Remember a shadow entry for reclaimed file cache in
 		 * order to detect refaults, thus thrashing, later on.
@@ -698,17 +728,76 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		return REMOVED_FILECACHE;
+	}
 
-		if (freepage != NULL)
-			freepage(page);
+cannot_free:
+	return REMOVED_FAIL;
+}
+
+static unsigned long remove_mapping_list(struct list_head *mapping_list,
+					 struct list_head *free_pages,
+					 struct list_head *ret_pages)
+{
+	unsigned long flags;
+	struct address_space *mapping = NULL;
+	void (*freepage)(struct page *) = NULL;
+	LIST_HEAD(swapcache);
+	LIST_HEAD(filecache);
+	struct page *page, *tmp;
+	unsigned long nr_reclaimed = 0;
+
+continue_removal:
+	list_for_each_entry_safe(page, tmp, mapping_list, lru) {
+		/* Batch removals under one tree lock at a time */
+		if (mapping && page_mapping(page) != mapping)
+			continue;
+
+		list_del(&page->lru);
+		if (!mapping) {
+			mapping = page_mapping(page);
+			spin_lock_irqsave(&mapping->tree_lock, flags);
+		}
+
+		switch (__remove_mapping(mapping, page, true, &freepage)) {
+		case REMOVED_FILECACHE:
+			/*
+			 * At this point, we have no other references and there
+			 * is no way to pick any more up (removed from LRU,
+			 * removed from pagecache). Can use non-atomic bitops
+			 * now (and we obviously don't have to worry about
+			 * waking up a process waiting on the page lock,
+			 * because there are no references.
+			 */
+			__ClearPageLocked(page);
+			if (freepage)
+				list_add(&page->lru, &filecache);
+			else
+				list_add(&page->lru, free_pages);
+			nr_reclaimed++;
+			break;
+		case REMOVED_SWAPCACHE:
+			/* See FILECACHE case as to why non-atomic is safe */
+			__ClearPageLocked(page);
+			list_add(&page->lru, &swapcache);
+			nr_reclaimed++;
+			break;
+		case REMOVED_FAIL:
+			unlock_page(page);
+			list_add(&page->lru, ret_pages);
+		}
 	}
 
-	return 1;
+	if (mapping) {
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		finalise_remove_mapping_list(&swapcache, &filecache, freepage, free_pages);
+		mapping = NULL;
+	}
 
-cannot_free:
-	spin_unlock_irqrestore(&mapping->tree_lock, flags);
-	return 0;
+	if (!list_empty(mapping_list))
+		goto continue_removal;
+
+	return nr_reclaimed;
 }
 
 /*
@@ -719,16 +808,42 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page, false)) {
+	unsigned long flags;
+	LIST_HEAD(swapcache);
+	LIST_HEAD(filecache);
+	void (*freepage)(struct page *) = NULL;
+	swp_entry_t swap;
+	int ret = 0;
+
+	spin_lock_irqsave(&mapping->tree_lock, flags);
+	freepage = mapping->a_ops->freepage;
+	ret = __remove_mapping(mapping, page, false, &freepage);
+	spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+	if (ret != REMOVED_FAIL) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
 		 * atomic operation.
 		 */
 		page_ref_unfreeze(page, 1);
+	}
+
+	switch (ret) {
+	case REMOVED_FILECACHE:
+		if (freepage)
+			freepage(page);
+		return 1;
+	case REMOVED_SWAPCACHE:
+		swap.val = page_private(page);
+		swapcache_free(swap);
+		set_page_private(page, 0);
 		return 1;
+	case REMOVED_FAIL:
+		return 0;
 	}
-	return 0;
+
+	BUG();
 }
 
 /**
@@ -910,6 +1025,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(mapping_pages);
 	int pgactivate = 0;
 	unsigned long nr_unqueued_dirty = 0;
 	unsigned long nr_dirty = 0;
@@ -1206,17 +1322,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		}
 
 lazyfree:
-		if (!mapping || !__remove_mapping(mapping, page, true))
+		if (!mapping)
 			goto keep_locked;
 
-		/*
-		 * At this point, we have no other references and there is
-		 * no way to pick any more up (removed from LRU, removed
-		 * from pagecache). Can use non-atomic bitops now (and
-		 * we obviously don't have to worry about waking up a process
-		 * waiting on the page lock, because there are no references.
-		 */
-		__ClearPageLocked(page);
+		list_add(&page->lru, &mapping_pages);
+		if (ret == SWAP_LZFREE)
+			count_vm_event(PGLAZYFREED);
+		continue;
+
 free_it:
 		if (ret == SWAP_LZFREE)
 			count_vm_event(PGLAZYFREED);
@@ -1251,6 +1364,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
 
+	nr_reclaimed += remove_mapping_list(&mapping_pages, &free_pages, &ret_pages);
 	mem_cgroup_uncharge_list(&free_pages);
 	try_to_unmap_flush();
 	free_hot_cold_page_list(&free_pages, true);
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753896AbcIIJ7x (ORCPT ); Fri, 9 Sep 2016 05:59:53 -0400
Received: from outbound-smtp09.blacknight.com ([46.22.139.14]:33345 "EHLO outbound-smtp09.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751635AbcIIJ7j (ORCPT ); Fri, 9 Sep 2016 05:59:39 -0400
From: Mel Gorman
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko
Subject: [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention
Date: Fri, 9 Sep 2016 10:59:35 +0100
Message-Id: <1473415175-20807-5-git-send-email-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.6.4
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

If a heavy writer of a single file is forcing contention on the tree_lock
then it may be necessary to temporarily stall the direct writer to allow
kswapd to make progress. This patch marks a pgdat congested if tree_lock
is being contended on the tail of the LRU.

On a swap-intensive workload to ramdisk, the following is observed

usemem
                           4.8.0-rc5             4.8.0-rc5
                        waitqueue-v1      directcongest-v1
Amean    System-1      179.61 (  0.00%)      202.21 (-12.58%)
Amean    System-3       68.91 (  0.00%)      105.14 (-52.59%)
Amean    System-5       93.09 (  0.00%)       80.98 ( 13.01%)
Amean    System-7       90.98 (  0.00%)       81.07 ( 10.90%)
Amean    System-8      299.81 (  0.00%)      227.08 ( 24.26%)
Amean    Elapsd-1      210.41 (  0.00%)      236.56 (-12.43%)
Amean    Elapsd-3       33.89 (  0.00%)       46.78 (-38.06%)
Amean    Elapsd-5       25.19 (  0.00%)       23.33 (  7.38%)
Amean    Elapsd-7       18.45 (  0.00%)       17.18 (  6.91%)
Amean    Elapsd-8       48.80 (  0.00%)       38.09 ( 21.93%)

Note that system CPU usage is reduced for high thread counts but it is
not a universal win and it's known to be highly variable. The overall
time stats look like

           4.8.0-rc5        4.8.0-rc5
        waitqueue-v1 directcongest-v1
User          462.40           468.18
System       5127.32          4875.92
Elapsed      2364.08          2539.77

It takes longer to complete but uses less system CPU. The benefit is more
noticeable with xfs_io rewriting a file backed by ramdisk

                                                        4.8.0-rc5             4.8.0-rc5
                                                  waitqueue-v1r24   directcongest-v1r24
Amean    pwrite-single-rewrite-async-System          3.23 (  0.00%)        3.21 (  0.80%)
Amean    pwrite-single-rewrite-async-Elapsd          3.33 (  0.00%)        3.31 (  0.67%)

           4.8.0-rc5        4.8.0-rc5
        waitqueue-v1 directcongest-v1
User            8.76             9.25
System        392.31           389.10
Elapsed       406.29           403.74

As with the previous patch, a test from Dave Chinner would be necessary to
decide whether this patch is worthwhile. It seems reasonable to favour
workloads that are heavily writing files over those that are heavily
swapping as the former situation is normal and reasonable while the latter
situation will never be optimal.

Signed-off-by: Mel Gorman
---
 mm/vmscan.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 936070b0790e..953df97abe0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -771,6 +771,15 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
 					/* Stall kswapd once for 10ms on contention */
 					if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
 						DEFINE_WAIT(wait);
+
+						/*
+						 * Tag the pgdat as congested as it may
+						 * indicate contention with a heavy
+						 * writer that should stall on
+						 * wait_iff_congested.
+						 */
+						set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
 						prepare_to_wait(&kswapd_contended_wait,
 							&wait, TASK_INTERRUPTIBLE);
 						io_schedule_timeout(HZ/100);
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753784AbcIIJ7v (ORCPT ); Fri, 9 Sep 2016 05:59:51 -0400
Received: from outbound-smtp10.blacknight.com ([46.22.139.15]:57890 "EHLO outbound-smtp10.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752035AbcIIJ7j (ORCPT ); Fri, 9 Sep 2016 05:59:39 -0400
From: Mel Gorman
To: LKML , Linux-MM , Mel Gorman
Cc: Dave Chinner , Linus Torvalds , Ying Huang , Michal Hocko
Subject: [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock
Date: Fri, 9 Sep 2016 10:59:34 +0100
Message-Id: <1473415175-20807-4-git-send-email-mgorman@techsingularity.net>
X-Mailer: git-send-email 2.6.4
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

If there is a large reader/writer, it's possible for multiple kswapd
instances and the processes issuing IO to contend on a single
mapping->tree_lock. This patch will cause all kswapd instances except
one to back off if contending on tree_lock.
A sleeping kswapd instance will be woken when one has made progress.

                           4.8.0-rc5              4.8.0-rc5
                    ramdisknonrot-v1           waitqueue-v1
Min      Elapsd-8       18.31 (   0.00%)       28.32 ( -54.67%)
Amean    System-1      181.00 (   0.00%)      179.61 (   0.77%)
Amean    System-3       86.19 (   0.00%)       68.91 (  20.05%)
Amean    System-5       67.43 (   0.00%)       93.09 ( -38.05%)
Amean    System-7       89.55 (   0.00%)       90.98 (  -1.60%)
Amean    System-8      102.92 (   0.00%)      299.81 (-191.30%)
Amean    Elapsd-1      209.23 (   0.00%)      210.41 (  -0.57%)
Amean    Elapsd-3       36.93 (   0.00%)       33.89 (   8.25%)
Amean    Elapsd-5       19.52 (   0.00%)       25.19 ( -29.08%)
Amean    Elapsd-7       21.93 (   0.00%)       18.45 (  15.88%)
Amean    Elapsd-8       23.63 (   0.00%)       48.80 (-106.51%)

Note that unlike the previous patches, this is not an unconditional win.
System CPU usage is generally higher because direct reclaim is used instead
of multiple competing kswapd instances. According to the stats, there is
10 times more direct reclaim scanning and reclaim activity and overall
the workload takes longer to complete.

           4.8.0-rc5        4.8.0-rc5
    ramdisknonrot-v1     waitqueue-v1
User          473.24           462.40
System       3690.20          5127.32
Elapsed      2186.05          2364.08

The motivation for this patch was Dave Chinner reporting that an xfs_io
workload rewriting a single file spent a significant amount of time
spinning on the tree_lock. Local tests were inconclusive. On spinning
storage, the IO was so slow that it was not noticeable. When xfs_io is
backed by ramdisk to simulate fast storage then it can be observed;

                                                        4.8.0-rc5             4.8.0-rc5
                                                 ramdisknonrot-v1          waitqueue-v1
Min      pwrite-single-rewrite-async-System          3.12 (  0.00%)        3.06 (  1.92%)
Min      pwrite-single-rewrite-async-Elapsd          3.25 (  0.00%)        3.17 (  2.46%)
Amean    pwrite-single-rewrite-async-System          3.32 (  0.00%)        3.23 (  2.67%)
Amean    pwrite-single-rewrite-async-Elapsd          3.42 (  0.00%)        3.33 (  2.71%)

           4.8.0-rc5        4.8.0-rc5
    ramdisknonrot-v1     waitqueue-v1
User            9.06             8.76
System        402.67           392.31
Elapsed       416.91           406.29

That's roughly a 2.5% drop in CPU usage overall. A test from Dave Chinner
with some data to support/reject this patch is highly desirable.
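
The "all but one back off" rule described above can be sketched in userspace C11 as follows. This is an illustration only, not the patch itself: `exclusive_node`, `NO_NODE`, `holds_claim()`, `try_claim()` and `release_claim()` are hypothetical names, with `NO_NODE` playing the role of NUMA_NO_NODE and a C11 compare-and-exchange standing in for the kernel's cmpxchg(). The 10ms sleep and the waitqueue wake-up are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NO_NODE (-1L)	/* hypothetical stand-in for NUMA_NO_NODE */

/* Node id of the one reclaimer allowed to keep contending, if any. */
static _Atomic long exclusive_node = NO_NODE;

/* True if this node's reclaimer is the current exclusive one. */
static bool holds_claim(long nid)
{
	return atomic_load(&exclusive_node) == nid;
}

/*
 * Called after a failed trylock. Returns true if the claim was free and is
 * now held by nid; false means another node holds it and the caller should
 * sleep briefly before taking the lock the slow way.
 */
static bool try_claim(long nid)
{
	long expected = NO_NODE;

	assert(nid != NO_NODE);
	return atomic_compare_exchange_strong(&exclusive_node, &expected, nid);
}

/*
 * Called once the holder has made progress. Only the holder can release;
 * in the patch this is followed by waking the sleeping instances.
 */
static bool release_claim(long nid)
{
	long expected = nid;

	return atomic_compare_exchange_strong(&exclusive_node, &expected, NO_NODE);
}
```

A reclaimer would check `holds_claim()` first and take the lock unconditionally if it is the exclusive instance; otherwise it would try the lock and, on failure, call `try_claim()` to decide between proceeding and backing off.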

Signed-off-by: Mel Gorman
---
 mm/vmscan.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7beb573a594..936070b0790e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -735,6 +735,9 @@ static enum remove_mapping __remove_mapping(struct address_space *mapping,
 	return REMOVED_FAIL;
 }
 
+static unsigned long kswapd_exclusive = NUMA_NO_NODE;
+static DECLARE_WAIT_QUEUE_HEAD(kswapd_contended_wait);
+
 static unsigned long remove_mapping_list(struct list_head *mapping_list,
 					 struct list_head *free_pages,
 					 struct list_head *ret_pages)
@@ -755,8 +758,28 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
 		list_del(&page->lru);
 
 		if (!mapping) {
+			pg_data_t *pgdat = page_pgdat(page);
 			mapping = page_mapping(page);
-			spin_lock_irqsave(&mapping->tree_lock, flags);
+
+			/* Account for trylock contentions in kswapd */
+			if (!current_is_kswapd() ||
+			    pgdat->node_id == kswapd_exclusive) {
+				spin_lock_irqsave(&mapping->tree_lock, flags);
+			} else {
+				/* Account for contended pages and contended kswapds */
+				if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+					/* Stall kswapd once for 10ms on contention */
+					if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
+						DEFINE_WAIT(wait);
+						prepare_to_wait(&kswapd_contended_wait,
+							&wait, TASK_INTERRUPTIBLE);
+						io_schedule_timeout(HZ/100);
+						finish_wait(&kswapd_contended_wait, &wait);
+					}
+
+					spin_lock_irqsave(&mapping->tree_lock, flags);
+				}
+			}
 		}
 
 		switch (__remove_mapping(mapping, page, true, &freepage)) {
@@ -3212,6 +3235,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
 	unsigned long mark = high_wmark_pages(zone);
+	unsigned long nid;
 
 	if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
 		return false;
@@ -3223,6 +3247,12 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 	clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
 	clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 
+	nid = zone->zone_pgdat->node_id;
+	if (nid == kswapd_exclusive) {
+		cmpxchg(&kswapd_exclusive, nid, NUMA_NO_NODE);
+		wake_up_interruptible(&kswapd_contended_wait);
+	}
+
 	return true;
 }
 
-- 
2.6.4

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752729AbcIIPba (ORCPT ); Fri, 9 Sep 2016 11:31:30 -0400
Received: from mail-it0-f46.google.com ([209.85.214.46]:37071 "EHLO
	mail-it0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751817AbcIIPb3 (ORCPT ); Fri, 9 Sep 2016 11:31:29 -0400
MIME-Version: 1.0
In-Reply-To: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
From: Linus Torvalds
Date: Fri, 9 Sep 2016 08:31:27 -0700
X-Google-Sender-Auth: 0sAPV9MBVB4rx_G5zJ--fkY1FyY
Message-ID:
Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and
	reclaim of a single file v1
To: Mel Gorman
Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote:
>
> The progression of this series has been unsatisfactory.

Yeah, I have to say that I particularly don't like patch #1. It's some
rather nasty complexity for dubious gains, and holding the lock for
longer times might have downsides.

And the numbers seem to not necessarily be in favor of patch #3
either, which I would have otherwise been predisposed to like (ie it
looks fairly targeted and not very complex).

#2 seems trivially correct but largely irrelevant.

So I think this series is one of those "we need to find that it makes
a big positive impact" to make sense.

          Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752564AbcIIQTO (ORCPT ); Fri, 9 Sep 2016 12:19:14 -0400
Received: from outbound-smtp04.blacknight.com ([81.17.249.35]:49725 "EHLO
	outbound-smtp04.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751008AbcIIQTN (ORCPT ); Fri, 9 Sep 2016 12:19:13 -0400
Date: Fri, 9 Sep 2016 17:19:08 +0100
From: Mel Gorman
To: Linus Torvalds
Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko
Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and
	reclaim of a single file v1
Message-ID: <20160909161908.GG8119@techsingularity.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote:
> >
> > The progression of this series has been unsatisfactory.
>
> Yeah, I have to say that I particularly don't like patch #1.

There aren't many ways to make it prettier. Making it nicer is partially
hindered by the fact that tree_lock is IRQ-safe for IO completions but
even if that was addressed there might be lock ordering issues.

> It's some
> rather nasty complexity for dubious gains, and holding the lock for
> longer times might have downsides.
>

Kswapd reclaim would delay a parallel truncation for example. Doubtful it
matters but the possibility is there.

The gain in swapping is nice but ramdisk is excessively artificial.
It might matter if someone reported it made a big difference swapping to
faster storage like SSD or NVMe although the cases where fast swap is
important are few -- overcommitted host with multiple idle VMs with a new
active VM starting is the only one that springs to mind.

> So I think this series is one of those "we need to find that it makes
> a big positive impact" to make sense.
>

Agreed. I don't mind leaving it on the back burner unless Dave reports
it really helps or a new bug report about realistic tree_lock contention
shows up.

-- 
Mel Gorman
SUSE Labs

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754354AbcIISQj (ORCPT ); Fri, 9 Sep 2016 14:16:39 -0400
Received: from mga04.intel.com ([192.55.52.120]:48061 "EHLO mga04.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752115AbcIISQi (ORCPT ); Fri, 9 Sep 2016 14:16:38 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.30,306,1470726000"; d="scan'208";a="1047977475"
From: "Huang\, Ying"
To: Mel Gorman
Cc: Linus Torvalds, LKML, Linux-MM, "Dave Chinner", Ying Huang,
	"Michal Hocko", "Tim C.
	Chen", Dave Hansen, Andi Kleen
Subject: Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and
	reclaim of a single file v1
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
	<20160909161908.GG8119@techsingularity.net>
Date: Fri, 09 Sep 2016 11:16:35 -0700
In-Reply-To: <20160909161908.GG8119@techsingularity.net> (Mel Gorman's
	message of "Fri, 9 Sep 2016 17:19:08 +0100")
Message-ID: <8760q52b24.fsf@yhuang-mobile.sh.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.5 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Mel Gorman writes:

> On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
>> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman wrote:
>> >
>> > The progression of this series has been unsatisfactory.
>>
>> Yeah, I have to say that I particularly don't like patch #1.
>
> There aren't many ways to make it prettier. Making it nicer is partially
> hindered by the fact that tree_lock is IRQ-safe for IO completions but
> even if that was addressed there might be lock ordering issues.
>
>> It's some
>> rather nasty complexity for dubious gains, and holding the lock for
>> longer times might have downsides.
>>
>
> Kswapd reclaim would delay a parallel truncation for example. Doubtful it
> matters but the possibility is there.
>
> The gain in swapping is nice but ramdisk is excessively artificial. It might
> matter if someone reported it made a big difference swapping to faster
> storage like SSD or NVMe although the cases where fast swap is important
> are few -- overcommitted host with multiple idle VMs with a new active VM
> starting is the only one that springs to mind.

I will try to provide some data for the NVMe disk. I think the trend is
that the performance of the disk is increasing fast and will continue to
do so in the near future at least.
We found we cannot saturate the latest NVMe disk when swapping because of
locking issues in the swap and page reclaim paths.

The swap usage problem could be a "Chicken and Egg" problem. Because
swap performance is poor, nobody uses swap, and because nobody uses
swap, nobody works on improving swap performance. With faster and
faster storage devices, swap could be more popular in the future if we
optimize its performance to catch up with the performance of the storage.

>> So I think this series is one of those "we need to find that it makes
>> a big positive impact" to make sense.
>>
>
> Agreed. I don't mind leaving it on the back burner unless Dave reports
> it really helps or a new bug report about realistic tree_lock contention
> shows up.

Best Regards,
Huang, Ying

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S964806AbcIPNZz (ORCPT ); Fri, 16 Sep 2016 09:25:55 -0400
Received: from merlin.infradead.org ([205.233.59.134]:54918 "EHLO
	merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S933530AbcIPNZO (ORCPT ); Fri, 16 Sep 2016 09:25:14 -0400
Date: Fri, 16 Sep 2016 15:25:06 +0200
From: Peter Zijlstra
To: Mel Gorman
Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single
	lock during reclaim
Message-ID: <20160916132506.GB5035@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
	<1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> every single page. There are two cases when it's likely that pages at the
> tail of the LRU share the same mapping -- large amounts of IO to/from a
> single file and swapping. This patch acquires the mapping->tree_lock for
> multiple page removals.

So, once upon a time, in a galaxy far away,.. I did a concurrent
pagecache patch set that replaced the tree_lock with a per page
bit-spinlock and fine grained locking in the radix tree.

I know the mm has changed quite a bit since, but would such an approach
still be feasible?

I cannot seem to find an online reference to a 'complete' version of
that patch set, but I did find the OLS paper on it and I did find some
copies on my local machines.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933592AbcIPOHT (ORCPT ); Fri, 16 Sep 2016 10:07:19 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:47760 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1757411AbcIPOHM (ORCPT ); Fri, 16 Sep 2016 10:07:12 -0400
Date: Fri, 16 Sep 2016 16:07:07 +0200
From: Peter Zijlstra
To: Mel Gorman
Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single
	lock during reclaim
Message-ID: <20160916140707.GI5020@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
	<1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
	<20160916132506.GB5035@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160916132506.GB5035@twins.programming.kicks-ass.net>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 16, 2016 at
03:25:06PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> > Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> > every single page. There are two cases when it's likely that pages at the
> > tail of the LRU share the same mapping -- large amounts of IO to/from a
> > single file and swapping. This patch acquires the mapping->tree_lock for
> > multiple page removals.
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page
> bit-spinlock and fine grained locking in the radix tree.
>
> I know the mm has changed quite a bit since, but would such an approach
> still be feasible?
>
> I cannot seem to find an online reference to a 'complete' version of
> that patch set, but I did find the OLS paper on it and I did find some
> copies on my local machines.

https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S934067AbcIPSdS (ORCPT ); Fri, 16 Sep 2016 14:33:18 -0400
Received: from mail-oi0-f44.google.com ([209.85.218.44]:35902 "EHLO
	mail-oi0-f44.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1760246AbcIPSdG (ORCPT ); Fri, 16 Sep 2016 14:33:06 -0400
MIME-Version: 1.0
In-Reply-To: <20160916132506.GB5035@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
	<1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
	<20160916132506.GB5035@twins.programming.kicks-ass.net>
From: Linus Torvalds
Date: Fri, 16 Sep 2016 11:33:00 -0700
X-Google-Sender-Auth: aNd3ORq5HYDDr_mwjKgRnvUjJf0
Message-ID:
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single
	lock during reclaim
To: Peter Zijlstra
Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko
Content-Type: text/plain;
	charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra wrote:
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page
> bit-spinlock and fine grained locking in the radix tree.

I'd love to see the patch for that. I'd be a bit worried about extra
locking in the trivial cases (ie multi-level locking when we now take
just the single mapping lock), but if there is some smart reason why
that doesn't happen, then..

          Linus

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1757437AbcIQBgT (ORCPT ); Fri, 16 Sep 2016 21:36:19 -0400
Received: from bombadil.infradead.org ([198.137.202.9]:52764 "EHLO
	bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1756306AbcIQBgM (ORCPT ); Fri, 16 Sep 2016 21:36:12 -0400
Date: Sat, 17 Sep 2016 03:36:06 +0200
From: Peter Zijlstra
To: Linus Torvalds
Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko
Subject: Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single
	lock during reclaim
Message-ID: <20160917013606.GM5016@twins.programming.kicks-ass.net>
References: <1473415175-20807-1-git-send-email-mgorman@techsingularity.net>
	<1473415175-20807-2-git-send-email-mgorman@techsingularity.net>
	<20160916132506.GB5035@twins.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Sep 16, 2016 at 11:33:00AM -0700, Linus Torvalds wrote:
> On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra wrote:
> >
> > So, once upon a time, in a galaxy far away,.. I did a concurrent
> > pagecache patch set that replaced the tree_lock with a per page
> > bit-spinlock and fine grained locking in the radix tree.
>
> I'd love to see the patch for that. I'd be a bit worried about extra
> locking in the trivial cases (ie multi-level locking when we now take
> just the single mapping lock), but if there is some smart reason why
> that doesn't happen, then..

On average we'll likely take a few more locks, but it's not as bad as
having to take the whole tree depth every time, or even touching the
root lock most times.

There are two cases. In the first, the modification is only done on a
single node (like insert): here we do an RCU lookup of the node, lock it,
verify the node is still correct, do the modification and unlock, done.

In the second case, the modification then needs to walk back up the tree
(like setting/clearing tags, delete). For this case we can determine on
our way down where the first node we need to modify is, lock that, verify,
and then lock all nodes down to the last. i.e. we lock a partial path.

I can send you the 2.6.31 patches if you're interested, but if you want
something that applies to a kernel from this decade I'll have to go
rewrite them which will take a wee bit of time :-) Both the radix tree
code and the mm have changed somewhat.
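
The single-node case described above (lookup, lock, verify, modify,
unlock) can be illustrated with a small userspace sketch. This is not
the 2.6.31 patch set and not kernel code: it assumes pthread mutexes in
place of per-node locks, a C11 acquire load in place of the RCU lookup,
a one-level "tree", and nodes that are never freed. Every name in it
(sketch_node, sketch_tree, sketch_insert) is hypothetical.

```c
/*
 * Userspace sketch only. Assumptions: pthread mutexes stand in for the
 * per-node locks, an acquire load stands in for the RCU lookup, and
 * nodes are never freed, so the unlocked lookup cannot touch freed
 * memory. All names here are hypothetical.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 4

struct sketch_node {
	pthread_mutex_t lock;		/* per-node lock */
	bool dead;			/* set under lock once the node is unlinked */
	void *slot[SKETCH_SLOTS];
};

struct sketch_tree {
	/* One level only: each entry points at a leaf node or is NULL. */
	_Atomic(struct sketch_node *) leaf[SKETCH_SLOTS];
};

/*
 * Single-node modification: look the node up without a lock, lock it,
 * verify the tree still points at it, modify, unlock.
 * Returns false if there is no node or the slot is already occupied.
 * A node is expected to be unlinked before it is marked dead, so the
 * retry loop always observes a different pointer on the next pass.
 */
static bool sketch_insert(struct sketch_tree *tree, unsigned int idx,
			  unsigned int off, void *item)
{
	assert(idx < SKETCH_SLOTS && off < SKETCH_SLOTS);

	for (;;) {
		struct sketch_node *node;
		bool ok;

		node = atomic_load_explicit(&tree->leaf[idx],
					    memory_order_acquire);
		if (!node)
			return false;

		pthread_mutex_lock(&node->lock);

		/* Verify: the node may have been replaced before we locked it. */
		if (node->dead ||
		    atomic_load_explicit(&tree->leaf[idx],
					 memory_order_acquire) != node) {
			pthread_mutex_unlock(&node->lock);
			continue;	/* stale lookup, retry from the top */
		}

		ok = (node->slot[off] == NULL);
		if (ok)
			node->slot[off] = item;

		pthread_mutex_unlock(&node->lock);
		return ok;
	}
}
```

The point of the verify step is that the lookup ran without any lock
held, so the node found may no longer be the one the tree points at by
the time its lock is taken. The partial-path case (tags, delete) would
take more than one node lock and is not covered by this sketch.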