* [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
@ 2016-09-09 9:59 Mel Gorman
2016-09-09 9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
` (4 more replies)
0 siblings, 5 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09 9:59 UTC (permalink / raw)
To: LKML, Linux-MM, Mel Gorman
Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko
This is a follow-on series from the thread "[lkp] [xfs] 68a9f5e700:
aim7.jobs-per-min -13.6% regression" with active parties cc'd. I've
pushed the series to git.kernel.org where the LKP robot should pick it
up automatically.
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaim-contention-v1r15
The progression of this series has been unsatisfactory. Dave originally
reported a problem with tree_lock contention and while it can be fixed
by pushing reclaim to direct reclaim, that slows swapping considerably and
is not a universal win. This series is the best balance I've found so far
between the swapping and large rewriter cases.
I never reliably reproduced the same contention that Dave did, so testing
is needed. Dave, ideally you would test patches 1+2 and patches 1+4, but
a test of patches 1+3 would also be nice if you have the time. Minimally,
I expect that patches 1+2 will help the swapping-to-fast-storage case
(LKP to confirm independently) and they may be worth considering on their
own even if Dave's test case is not helped.
drivers/block/brd.c | 1 +
mm/vmscan.c | 209 +++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 182 insertions(+), 28 deletions(-)
--
2.6.4
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-16 13:25   ` Peter Zijlstra
  2016-09-09  9:59 ` [PATCH 2/4] block, brd: Treat storage as non-rotational Mel Gorman
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

Pages unmapped during reclaim acquire/release the mapping->tree_lock for
every single page. There are two cases when it's likely that pages at the
tail of the LRU share the same mapping -- large amounts of IO to/from a
single file and swapping. This patch acquires the mapping->tree_lock for
multiple page removals.

To trigger heavy swapping, varying numbers of usemem instances were used
to read anonymous memory larger than the physical memory size. A UMA
machine was used with 4 fake NUMA nodes to increase interference from
kswapd. The swap device was backed by ramdisk using the brd driver. NUMA
balancing was disabled to limit interference.

                        4.8.0-rc5           4.8.0-rc5
                          vanilla            batch-v1
Amean    System-1    260.53 (  0.00%)    192.98 ( 25.93%)
Amean    System-3    179.59 (  0.00%)    198.33 (-10.43%)
Amean    System-5    205.71 (  0.00%)    105.22 ( 48.85%)
Amean    System-7    146.46 (  0.00%)     97.79 ( 33.23%)
Amean    System-8    275.37 (  0.00%)    149.39 ( 45.75%)
Amean    Elapsd-1    292.89 (  0.00%)    219.95 ( 24.90%)
Amean    Elapsd-3     69.47 (  0.00%)     79.02 (-13.74%)
Amean    Elapsd-5     54.12 (  0.00%)     29.88 ( 44.79%)
Amean    Elapsd-7     34.28 (  0.00%)     24.06 ( 29.81%)
Amean    Elapsd-8     57.98 (  0.00%)     33.34 ( 42.50%)

System is system CPU usage and Elapsd is the time to complete the
workload. Regardless of the thread count, the workload generally completes
faster, although there is a lot of variability as much more work is being
done under a single lock.

xfs_io and pwrite were used to rewrite a file multiple times to measure
any locking overhead reduction.

xfsio Time
                                               4.8.0-rc5           4.8.0-rc5
                                                 vanilla         batch-v1r18
Amean    pwrite-single-rewrite-async-System     49.19 (  0.00%)     49.49 ( -0.60%)
Amean    pwrite-single-rewrite-async-Elapsd    322.87 (  0.00%)    322.72 (  0.05%)

Unfortunately the difference here is well within the noise as the workload
is dominated by the cost of the IO. It may be the case that the benefit is
noticeable on faster storage or in KVM instances where the data may be
resident in the page cache of the host.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 142 insertions(+), 28 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1e12a1ea9cf..f7beb573a594 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -622,18 +622,47 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 }
 
 /*
+ * Finalise the mapping removal without the mapping lock held. The pages
+ * are placed on the free_list and the caller is expected to drop the
+ * final reference.
+ */
+static void finalise_remove_mapping_list(struct list_head *swapcache,
+            struct list_head *filecache,
+            void (*freepage)(struct page *),
+            struct list_head *free_list)
+{
+    struct page *page;
+
+    list_for_each_entry(page, swapcache, lru) {
+        swp_entry_t swap = { .val = page_private(page) };
+        swapcache_free(swap);
+        set_page_private(page, 0);
+    }
+
+    list_for_each_entry(page, filecache, lru)
+        freepage(page);
+
+    list_splice_init(swapcache, free_list);
+    list_splice_init(filecache, free_list);
+}
+
+enum remove_mapping {
+    REMOVED_FAIL,
+    REMOVED_SWAPCACHE,
+    REMOVED_FILECACHE
+};
+
+/*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-                bool reclaimed)
+static enum remove_mapping __remove_mapping(struct address_space *mapping,
+            struct page *page, bool reclaimed,
+            void (**freepage)(struct page *))
 {
-    unsigned long flags;
-
     BUG_ON(!PageLocked(page));
     BUG_ON(mapping != page_mapping(page));
 
-    spin_lock_irqsave(&mapping->tree_lock, flags);
     /*
      * The non racy check for a busy page.
      *
@@ -668,16 +697,17 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
     }
 
     if (PageSwapCache(page)) {
-        swp_entry_t swap = { .val = page_private(page) };
+        unsigned long swapval = page_private(page);
+        swp_entry_t swap = { .val = swapval };
         mem_cgroup_swapout(page, swap);
         __delete_from_swap_cache(page);
-        spin_unlock_irqrestore(&mapping->tree_lock, flags);
-        swapcache_free(swap);
+        set_page_private(page, swapval);
+        return REMOVED_SWAPCACHE;
     } else {
-        void (*freepage)(struct page *);
         void *shadow = NULL;
 
-        freepage = mapping->a_ops->freepage;
+        *freepage = mapping->a_ops->freepage;
+
         /*
          * Remember a shadow entry for reclaimed file cache in
          * order to detect refaults, thus thrashing, later on.
@@ -698,17 +728,76 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
             !mapping_exiting(mapping) && !dax_mapping(mapping))
             shadow = workingset_eviction(mapping, page);
         __delete_from_page_cache(page, shadow);
-        spin_unlock_irqrestore(&mapping->tree_lock, flags);
+        return REMOVED_FILECACHE;
+    }
 
-        if (freepage != NULL)
-            freepage(page);
+cannot_free:
+    return REMOVED_FAIL;
+}
+
+static unsigned long remove_mapping_list(struct list_head *mapping_list,
+            struct list_head *free_pages,
+            struct list_head *ret_pages)
+{
+    unsigned long flags;
+    struct address_space *mapping = NULL;
+    void (*freepage)(struct page *) = NULL;
+    LIST_HEAD(swapcache);
+    LIST_HEAD(filecache);
+    struct page *page, *tmp;
+    unsigned long nr_reclaimed = 0;
+
+continue_removal:
+    list_for_each_entry_safe(page, tmp, mapping_list, lru) {
+        /* Batch removals under one tree lock at a time */
+        if (mapping && page_mapping(page) != mapping)
+            continue;
+
+        list_del(&page->lru);
+        if (!mapping) {
+            mapping = page_mapping(page);
+            spin_lock_irqsave(&mapping->tree_lock, flags);
+        }
+
+        switch (__remove_mapping(mapping, page, true, &freepage)) {
+        case REMOVED_FILECACHE:
+            /*
+             * At this point, we have no other references and there
+             * is no way to pick any more up (removed from LRU,
+             * removed from pagecache). Can use non-atomic bitops
+             * now (and we obviously don't have to worry about
+             * waking up a process waiting on the page lock,
+             * because there are no references.
+             */
+            __ClearPageLocked(page);
+            if (freepage)
+                list_add(&page->lru, &filecache);
+            else
+                list_add(&page->lru, free_pages);
+            nr_reclaimed++;
+            break;
+        case REMOVED_SWAPCACHE:
+            /* See FILECACHE case as to why non-atomic is safe */
+            __ClearPageLocked(page);
+            list_add(&page->lru, &swapcache);
+            nr_reclaimed++;
+            break;
+        case REMOVED_FAIL:
+            unlock_page(page);
+            list_add(&page->lru, ret_pages);
+        }
     }
 
-    return 1;
+    if (mapping) {
+        spin_unlock_irqrestore(&mapping->tree_lock, flags);
+        finalise_remove_mapping_list(&swapcache, &filecache, freepage, free_pages);
+        mapping = NULL;
+    }
 
-cannot_free:
-    spin_unlock_irqrestore(&mapping->tree_lock, flags);
-    return 0;
+    if (!list_empty(mapping_list))
+        goto continue_removal;
+
+    return nr_reclaimed;
 }
 
 /*
@@ -719,16 +808,42 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-    if (__remove_mapping(mapping, page, false)) {
+    unsigned long flags;
+    LIST_HEAD(swapcache);
+    LIST_HEAD(filecache);
+    void (*freepage)(struct page *) = NULL;
+    swp_entry_t swap;
+    int ret = 0;
+
+    spin_lock_irqsave(&mapping->tree_lock, flags);
+    freepage = mapping->a_ops->freepage;
+    ret = __remove_mapping(mapping, page, false, &freepage);
+    spin_unlock_irqrestore(&mapping->tree_lock, flags);
+
+    if (ret != REMOVED_FAIL) {
         /*
          * Unfreezing the refcount with 1 rather than 2 effectively
          * drops the pagecache ref for us without requiring another
          * atomic operation.
          */
         page_ref_unfreeze(page, 1);
+    }
+
+    switch (ret) {
+    case REMOVED_FILECACHE:
+        if (freepage)
+            freepage(page);
+        return 1;
+    case REMOVED_SWAPCACHE:
+        swap.val = page_private(page);
+        swapcache_free(swap);
+        set_page_private(page, 0);
         return 1;
+    case REMOVED_FAIL:
+        return 0;
     }
-    return 0;
+
+    BUG();
 }
 
 /**
@@ -910,6 +1025,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 {
     LIST_HEAD(ret_pages);
     LIST_HEAD(free_pages);
+    LIST_HEAD(mapping_pages);
     int pgactivate = 0;
     unsigned long nr_unqueued_dirty = 0;
     unsigned long nr_dirty = 0;
@@ -1206,17 +1322,14 @@ static unsigned long shrink_page_list(struct list_head *page_list,
         }
 
 lazyfree:
-        if (!mapping || !__remove_mapping(mapping, page, true))
+        if (!mapping)
             goto keep_locked;
 
-        /*
-         * At this point, we have no other references and there is
-         * no way to pick any more up (removed from LRU, removed
-         * from pagecache). Can use non-atomic bitops now (and
-         * we obviously don't have to worry about waking up a process
-         * waiting on the page lock, because there are no references.
-         */
-        __ClearPageLocked(page);
+        list_add(&page->lru, &mapping_pages);
+        if (ret == SWAP_LZFREE)
+            count_vm_event(PGLAZYFREED);
+        continue;
+
 free_it:
         if (ret == SWAP_LZFREE)
             count_vm_event(PGLAZYFREED);
@@ -1251,6 +1364,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
         VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
     }
 
+    nr_reclaimed += remove_mapping_list(&mapping_pages, &free_pages, &ret_pages);
     mem_cgroup_uncharge_list(&free_pages);
     try_to_unmap_flush();
     free_hot_cold_page_list(&free_pages, true);
-- 
2.6.4
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
@ 2016-09-16 13:25   ` Peter Zijlstra
  2016-09-16 14:07     ` Peter Zijlstra
  2016-09-16 18:33     ` Linus Torvalds
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-16 13:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> every single page. There are two cases when it's likely that pages at the
> tail of the LRU share the same mapping -- large amounts of IO to/from a
> single file and swapping. This patch acquires the mapping->tree_lock for
> multiple page removals.

So, once upon a time, in a galaxy far away,.. I did a concurrent
pagecache patch set that replaced the tree_lock with a per page bit-
spinlock and fine grained locking in the radix tree.

I know the mm has changed quite a bit since, but would such an approach
still be feasible?

I cannot seem to find an online reference to a 'complete' version of
that patch set, but I did find the OLS paper on it and I did find some
copies on my local machines.
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 13:25   ` Peter Zijlstra
@ 2016-09-16 14:07     ` Peter Zijlstra
  2016-09-16 18:33     ` Linus Torvalds
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-16 14:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: LKML, Linux-MM, Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 03:25:06PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 09, 2016 at 10:59:32AM +0100, Mel Gorman wrote:
> > Pages unmapped during reclaim acquire/release the mapping->tree_lock for
> > every single page. There are two cases when it's likely that pages at the
> > tail of the LRU share the same mapping -- large amounts of IO to/from a
> > single file and swapping. This patch acquires the mapping->tree_lock for
> > multiple page removals.
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page bit-
> spinlock and fine grained locking in the radix tree.
>
> I know the mm has changed quite a bit since, but would such an approach
> still be feasible?
>
> I cannot seem to find an online reference to a 'complete' version of
> that patch set, but I did find the OLS paper on it and I did find some
> copies on my local machines.

https://www.kernel.org/doc/ols/2007/ols2007v2-pages-311-318.pdf
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 13:25   ` Peter Zijlstra
  2016-09-16 14:07     ` Peter Zijlstra
@ 2016-09-16 18:33     ` Linus Torvalds
  2016-09-17  1:36       ` Peter Zijlstra
  1 sibling, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2016-09-16 18:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So, once upon a time, in a galaxy far away,.. I did a concurrent
> pagecache patch set that replaced the tree_lock with a per page bit-
> spinlock and fine grained locking in the radix tree.

I'd love to see the patch for that. I'd be a bit worried about extra
locking in the trivial cases (ie multi-level locking when we now take
just the single mapping lock), but if there is some smart reason why
that doesn't happen, then..

              Linus
* Re: [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim
  2016-09-16 18:33     ` Linus Torvalds
@ 2016-09-17  1:36       ` Peter Zijlstra
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2016-09-17  1:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 16, 2016 at 11:33:00AM -0700, Linus Torvalds wrote:
> On Fri, Sep 16, 2016 at 6:25 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So, once upon a time, in a galaxy far away,.. I did a concurrent
> > pagecache patch set that replaced the tree_lock with a per page bit-
> > spinlock and fine grained locking in the radix tree.
>
> I'd love to see the patch for that. I'd be a bit worried about extra
> locking in the trivial cases (ie multi-level locking when we now take
> just the single mapping lock), but if there is some smart reason why
> that doesn't happen, then..

On average we'll likely take a few more locks, but it's not as bad as
having to take the whole tree depth every time, or even touching the
root lock most times.

There's two cases. The first: the modification is only done on a single
node (like insert). Here we do an RCU lookup of the node, lock it,
verify the node is still correct, do the modification and unlock, done.

The second case: the modification needs to then back up the tree (like
setting/clearing tags, delete). For this case we can determine on our
way down where the first node is we need to modify, lock that, verify,
and then lock all nodes down to the last. i.e. we lock a partial path.

I can send you the 2.6.31 patches if you're interested, but if you want
something that applies to a kernel from this decade I'll have to go
rewrite them, which will take a wee bit of time :-) Both the radix tree
code and the mm have changed somewhat.
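For illustration of the first case only, a minimal sketch of the RCU
lookup/lock/verify pattern described above. This is hypothetical code, not
taken from the referenced patch set or the current kernel API; the struct
pc_node type and the find_leaf() and node_is_valid() helpers are invented
names, assuming a tree where each node carries its own spinlock:

    #include <linux/spinlock.h>
    #include <linux/rcupdate.h>
    #include <linux/mm_types.h>

    /* Hypothetical fine-grained node replacing the single mapping->tree_lock */
    struct pc_node {
            spinlock_t node_lock;           /* per-node lock */
            void *slots[64];                /* child pointers or pages */
    };

    /* Invented helpers: a lockless RCU walk and a post-lock revalidation check */
    struct pc_node *find_leaf(struct pc_node *root, unsigned long index,
                              unsigned int *offset);
    bool node_is_valid(struct pc_node *node, unsigned long index);

    /* Case 1: modification confined to a single leaf node (e.g. insert) */
    static int insert_single(struct pc_node *root, unsigned long index,
                             struct page *page)
    {
            struct pc_node *node;
            unsigned int offset;

    retry:
            rcu_read_lock();
            node = find_leaf(root, index, &offset); /* no locks taken on the way down */
            spin_lock(&node->node_lock);
            if (!node_is_valid(node, index)) {      /* re-check after locking */
                    spin_unlock(&node->node_lock);
                    rcu_read_unlock();
                    goto retry;                     /* raced with a tree reshape */
            }
            node->slots[offset] = page;             /* the actual modification */
            spin_unlock(&node->node_lock);
            rcu_read_unlock();
            return 0;
    }

For the second case, the walk down would remember the highest node that
needs to change and take the locks from that node down to the leaf, so only
a partial path is held rather than the root or the whole tree.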
* [PATCH 2/4] block, brd: Treat storage as non-rotational
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09  9:59 ` [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock Mel Gorman
  ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

Unlike the rims of a punked out car, RAM does not spin. Ramdisk as
implemented by the brd driver is nevertheless treated as rotational
storage. When used as swap to simulate fast storage, swap uses the
algorithms for minimising seek times instead of the algorithms optimised
for SSD. When the tree_lock contention was reduced by the previous patch,
it was found that the workload was dominated by scan_swap_map(). This
patch has no practical application as swap-on-ramdisk is dumb as rocks,
but it's trivial to fix.

                        4.8.0-rc5           4.8.0-rc5
                         batch-v1    ramdisknonrot-v1
Amean    System-1    192.98 (  0.00%)    181.00 (  6.21%)
Amean    System-3    198.33 (  0.00%)     86.19 ( 56.54%)
Amean    System-5    105.22 (  0.00%)     67.43 ( 35.91%)
Amean    System-7     97.79 (  0.00%)     89.55 (  8.42%)
Amean    System-8    149.39 (  0.00%)    102.92 ( 31.11%)
Amean    Elapsd-1    219.95 (  0.00%)    209.23 (  4.88%)
Amean    Elapsd-3     79.02 (  0.00%)     36.93 ( 53.26%)
Amean    Elapsd-5     29.88 (  0.00%)     19.52 ( 34.69%)
Amean    Elapsd-7     24.06 (  0.00%)     21.93 (  8.84%)
Amean    Elapsd-8     33.34 (  0.00%)     23.63 ( 29.12%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/block/brd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c76d4016eeb..83a76a74e027 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -504,6 +504,7 @@ static struct brd_device *brd_alloc(int i)
     blk_queue_max_discard_sectors(brd->brd_queue, UINT_MAX);
     brd->brd_queue->limits.discard_zeroes_data = 1;
     queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, brd->brd_queue);
+    queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);
 #ifdef CONFIG_BLK_DEV_RAM_DAX
     queue_flag_set_unlocked(QUEUE_FLAG_DAX, brd->brd_queue);
 #endif
-- 
2.6.4
* [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  2016-09-09  9:59 ` [PATCH 1/4] mm, vmscan: Batch removal of mappings under a single lock during reclaim Mel Gorman
  2016-09-09  9:59 ` [PATCH 2/4] block, brd: Treat storage as non-rotational Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09  9:59 ` [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention Mel Gorman
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

If there is a large reader/writer, it's possible for multiple kswapd
instances and the processes issuing IO to contend on a single
mapping->tree_lock. This patch causes all kswapd instances except one to
back off if contending on tree_lock. A sleeping kswapd instance will be
woken when one has made progress.

                        4.8.0-rc5           4.8.0-rc5
                 ramdisknonrot-v1        waitqueue-v1
Min      Elapsd-8     18.31 (  0.00%)     28.32 (-54.67%)
Amean    System-1    181.00 (  0.00%)    179.61 (  0.77%)
Amean    System-3     86.19 (  0.00%)     68.91 ( 20.05%)
Amean    System-5     67.43 (  0.00%)     93.09 (-38.05%)
Amean    System-7     89.55 (  0.00%)     90.98 ( -1.60%)
Amean    System-8    102.92 (  0.00%)    299.81 (-191.30%)
Amean    Elapsd-1    209.23 (  0.00%)    210.41 ( -0.57%)
Amean    Elapsd-3     36.93 (  0.00%)     33.89 (  8.25%)
Amean    Elapsd-5     19.52 (  0.00%)     25.19 (-29.08%)
Amean    Elapsd-7     21.93 (  0.00%)     18.45 ( 15.88%)
Amean    Elapsd-8     23.63 (  0.00%)     48.80 (-106.51%)

Note that unlike the previous patches this is not an unconditional win.
System CPU usage is generally higher because direct reclaim is used
instead of multiple competing kswapd instances. According to the stats,
there is 10 times more direct reclaim scanning and reclaim activity and
overall the workload takes longer to complete.

               4.8.0-rc5      4.8.0-rc5
        ramdisknonrot-v1   waitqueue-v1
User          473.24         462.40
System       3690.20        5127.32
Elapsed      2186.05        2364.08

The motivation for this patch was Dave Chinner reporting that an xfs_io
workload rewriting a single file spent a significant amount of time
spinning on the tree_lock. Local tests were inconclusive. On spinning
storage, the IO was so slow that the contention was not noticeable. When
xfs_io is backed by ramdisk to simulate fast storage then it can be
observed;

                                               4.8.0-rc5           4.8.0-rc5
                                        ramdisknonrot-v1        waitqueue-v1
Min      pwrite-single-rewrite-async-System      3.12 (  0.00%)      3.06 (  1.92%)
Min      pwrite-single-rewrite-async-Elapsd      3.25 (  0.00%)      3.17 (  2.46%)
Amean    pwrite-single-rewrite-async-System      3.32 (  0.00%)      3.23 (  2.67%)
Amean    pwrite-single-rewrite-async-Elapsd      3.42 (  0.00%)      3.33 (  2.71%)

               4.8.0-rc5      4.8.0-rc5
        ramdisknonrot-v1   waitqueue-v1
User            9.06           8.76
System        402.67         392.31
Elapsed       416.91         406.29

That's roughly a 2.5% drop in CPU usage overall. A test from Dave Chinner
with some data to support/reject this patch is highly desirable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7beb573a594..936070b0790e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -735,6 +735,9 @@ static enum remove_mapping __remove_mapping(struct address_space *mapping,
     return REMOVED_FAIL;
 }
 
+static unsigned long kswapd_exclusive = NUMA_NO_NODE;
+static DECLARE_WAIT_QUEUE_HEAD(kswapd_contended_wait);
+
 static unsigned long remove_mapping_list(struct list_head *mapping_list,
             struct list_head *free_pages,
             struct list_head *ret_pages)
@@ -755,8 +758,28 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
 
         list_del(&page->lru);
         if (!mapping) {
+            pg_data_t *pgdat = page_pgdat(page);
             mapping = page_mapping(page);
-            spin_lock_irqsave(&mapping->tree_lock, flags);
+
+            /* Account for trylock contentions in kswapd */
+            if (!current_is_kswapd() ||
+                pgdat->node_id == kswapd_exclusive) {
+                spin_lock_irqsave(&mapping->tree_lock, flags);
+            } else {
+                /* Account for contended pages and contended kswapds */
+                if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) {
+                    /* Stall kswapd once for 10ms on contention */
+                    if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
+                        DEFINE_WAIT(wait);
+                        prepare_to_wait(&kswapd_contended_wait,
+                            &wait, TASK_INTERRUPTIBLE);
+                        io_schedule_timeout(HZ/100);
+                        finish_wait(&kswapd_contended_wait, &wait);
+                    }
+
+                    spin_lock_irqsave(&mapping->tree_lock, flags);
+                }
+            }
         }
 
         switch (__remove_mapping(mapping, page, true, &freepage)) {
@@ -3212,6 +3235,7 @@ static void age_active_anon(struct pglist_data *pgdat,
 static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
 {
     unsigned long mark = high_wmark_pages(zone);
+    unsigned long nid;
 
     if (!zone_watermark_ok_safe(zone, order, mark, classzone_idx))
         return false;
@@ -3223,6 +3247,12 @@ static bool zone_balanced(struct zone *zone, int order, int classzone_idx)
     clear_bit(PGDAT_CONGESTED, &zone->zone_pgdat->flags);
     clear_bit(PGDAT_DIRTY, &zone->zone_pgdat->flags);
 
+    nid = zone->zone_pgdat->node_id;
+    if (nid == kswapd_exclusive) {
+        cmpxchg(&kswapd_exclusive, nid, NUMA_NO_NODE);
+        wake_up_interruptible(&kswapd_contended_wait);
+    }
+
     return true;
 }
-- 
2.6.4
* [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  ` (2 preceding siblings ...)
  2016-09-09  9:59 ` [PATCH 3/4] mm, vmscan: Stall kswapd if contending on tree_lock Mel Gorman
@ 2016-09-09  9:59 ` Mel Gorman
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
  4 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2016-09-09  9:59 UTC (permalink / raw)
  To: LKML, Linux-MM, Mel Gorman
  Cc: Dave Chinner, Linus Torvalds, Ying Huang, Michal Hocko

If a heavy writer of a single file is forcing contention on the tree_lock
then it may be necessary to temporarily stall the direct writer to allow
kswapd to make progress. This patch marks a pgdat congested if tree_lock
is being contended at the tail of the LRU.

On a swap-intensive workload to ramdisk, the following is observed

usemem
                        4.8.0-rc5           4.8.0-rc5
                     waitqueue-v1    directcongest-v1
Amean    System-1    179.61 (  0.00%)    202.21 (-12.58%)
Amean    System-3     68.91 (  0.00%)    105.14 (-52.59%)
Amean    System-5     93.09 (  0.00%)     80.98 ( 13.01%)
Amean    System-7     90.98 (  0.00%)     81.07 ( 10.90%)
Amean    System-8    299.81 (  0.00%)    227.08 ( 24.26%)
Amean    Elapsd-1    210.41 (  0.00%)    236.56 (-12.43%)
Amean    Elapsd-3     33.89 (  0.00%)     46.78 (-38.06%)
Amean    Elapsd-5     25.19 (  0.00%)     23.33 (  7.38%)
Amean    Elapsd-7     18.45 (  0.00%)     17.18 (  6.91%)
Amean    Elapsd-8     48.80 (  0.00%)     38.09 ( 21.93%)

Note that system CPU usage is reduced for high thread counts but it is not
a universal win and it's known to be highly variable. The overall time
stats look like

               4.8.0-rc5         4.8.0-rc5
            waitqueue-v1  directcongest-v1
User          462.40            468.18
System       5127.32           4875.92
Elapsed      2364.08           2539.77

It takes longer to complete but uses less system CPU. The benefit is more
noticeable with xfs_io rewriting a file backed by ramdisk

                                               4.8.0-rc5           4.8.0-rc5
                                         waitqueue-v1r24 directcongest-v1r24
Amean    pwrite-single-rewrite-async-System      3.23 (  0.00%)      3.21 (  0.80%)
Amean    pwrite-single-rewrite-async-Elapsd      3.33 (  0.00%)      3.31 (  0.67%)

               4.8.0-rc5         4.8.0-rc5
            waitqueue-v1  directcongest-v1
User            8.76              9.25
System        392.31            389.10
Elapsed       406.29            403.74

As with the previous patch, a test from Dave Chinner would be necessary to
decide whether this patch is worthwhile. It seems reasonable to favour
workloads that are heavily writing files over those that are heavily
swapping, as the former situation is normal and reasonable while the
latter will never be optimal.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/vmscan.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 936070b0790e..953df97abe0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -771,6 +771,15 @@ static unsigned long remove_mapping_list(struct list_head *mapping_list,
                     /* Stall kswapd once for 10ms on contention */
                     if (cmpxchg(&kswapd_exclusive, NUMA_NO_NODE, pgdat->node_id) != NUMA_NO_NODE) {
                         DEFINE_WAIT(wait);
+
+                        /*
+                         * Tag the pgdat as congested as it may
+                         * indicate contention with a heavy
+                         * writer that should stall on
+                         * wait_iff_congested.
+                         */
+                        set_bit(PGDAT_CONGESTED, &pgdat->flags);
+
                         prepare_to_wait(&kswapd_contended_wait,
                             &wait, TASK_INTERRUPTIBLE);
                         io_schedule_timeout(HZ/100);
-- 
2.6.4
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09  9:59 [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Mel Gorman
  ` (3 preceding siblings ...)
  2016-09-09  9:59 ` [PATCH 4/4] mm, vmscan: Potentially stall direct reclaimers on tree_lock contention Mel Gorman
@ 2016-09-09 15:31 ` Linus Torvalds
  2016-09-09 16:19   ` Mel Gorman
  4 siblings, 1 reply; 12+ messages in thread
From: Linus Torvalds @ 2016-09-09 15:31 UTC (permalink / raw)
  To: Mel Gorman; +Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> The progression of this series has been unsatisfactory.

Yeah, I have to say that I particularly don't like patch #1. It's some
rather nasty complexity for dubious gains, and holding the lock for
longer times might have downsides.

And the numbers seem to not necessarily be in favor of patch #3 either,
which I would have otherwise been predisposed to like (ie it looks
fairly targeted and not very complex).

#2 seems trivially correct but largely irrelevant.

So I think this series is one of those "we need to find that it makes
a big positive impact" to make sense.

              Linus
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09 15:31 ` [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1 Linus Torvalds
@ 2016-09-09 16:19   ` Mel Gorman
  2016-09-09 18:16     ` Huang, Ying
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2016-09-09 16:19 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML, Linux-MM, Dave Chinner, Ying Huang, Michal Hocko

On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > The progression of this series has been unsatisfactory.
>
> Yeah, I have to say that I particularly don't like patch #1.

There aren't many ways to make it prettier. Making it nicer is partially
hindered by the fact that tree_lock is IRQ-safe for IO completions, but
even if that was addressed there might be lock ordering issues.

> It's some
> rather nasty complexity for dubious gains, and holding the lock for
> longer times might have downsides.

Kswapd reclaim would delay a parallel truncation, for example. Doubtful
it matters but the possibility is there.

The gain in swapping is nice but ramdisk is excessively artificial. It
might matter if someone reported it made a big difference swapping to
faster storage like SSD or NVMe, although the cases where fast swap is
important are few -- an overcommitted host with multiple idle VMs and a
new active VM starting is the only one that springs to mind.

> So I think this series is one of those "we need to find that it makes
> a big positive impact" to make sense.

Agreed. I don't mind leaving it on the back burner unless Dave reports
it really helps or a new bug report about realistic tree_lock contention
shows up.

-- 
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 0/4] Reduce tree_lock contention during swap and reclaim of a single file v1
  2016-09-09 16:19   ` Mel Gorman
@ 2016-09-09 18:16     ` Huang, Ying
  0 siblings, 0 replies; 12+ messages in thread
From: Huang, Ying @ 2016-09-09 18:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, LKML, Linux-MM, Dave Chinner, Ying Huang,
	Michal Hocko, Tim C. Chen, Dave Hansen, Andi Kleen

Mel Gorman <mgorman@techsingularity.net> writes:

> On Fri, Sep 09, 2016 at 08:31:27AM -0700, Linus Torvalds wrote:
>> On Fri, Sep 9, 2016 at 2:59 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> >
>> > The progression of this series has been unsatisfactory.
>>
>> Yeah, I have to say that I particularly don't like patch #1.
>
> There aren't many ways to make it prettier. Making it nicer is partially
> hindered by the fact that tree_lock is IRQ-safe for IO completions, but
> even if that was addressed there might be lock ordering issues.
>
>> It's some
>> rather nasty complexity for dubious gains, and holding the lock for
>> longer times might have downsides.
>>
>
> Kswapd reclaim would delay a parallel truncation, for example. Doubtful
> it matters but the possibility is there.
>
> The gain in swapping is nice but ramdisk is excessively artificial. It
> might matter if someone reported it made a big difference swapping to
> faster storage like SSD or NVMe, although the cases where fast swap is
> important are few -- an overcommitted host with multiple idle VMs and a
> new active VM starting is the only one that springs to mind.

I will try to provide some data for an NVMe disk. I think the trend is
that disk performance is increasing quickly and will continue to do so
in the near future at least. We found we cannot saturate the latest NVMe
disks when swapping because of locking issues in the swap and page
reclaim paths.

Swap usage could be a "chicken and egg" problem. Because swap
performance is poor, nobody uses swap, and because nobody uses swap,
nobody works on improving its performance. With faster and faster
storage devices, swap could become more popular in the future if we
optimize its performance to catch up with the storage.

>> So I think this series is one of those "we need to find that it makes
>> a big positive impact" to make sense.
>>
>
> Agreed. I don't mind leaving it on the back burner unless Dave reports
> it really helps or a new bug report about realistic tree_lock contention
> shows up.

Best Regards,
Huang, Ying