[PATCH v2 0/7] Improving munlock() performance for large non-THP areas

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/7] Improving munlock() performance for large non-THP areas
@ 2013-08-19 12:23 Vlastimil Babka
  2013-08-19 12:23 ` [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type() Vlastimil Babka
                   ` (7 more replies)
  0 siblings, 8 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

Changes in version 2:

o Added Patch 7 that tries to avoid calling follow_page_mask() on each
  individual page where possible and instead obtain page reference through
  pte walk. Increases the total observed speedup in ideal case to 43%.

o Fixed a pgrescued counting bug in Patch 5.

o Removed the few likely/unlikely wrappers as suggested by JA?rn.

o The switch around call to __isolate_lru_page that used to be introduced in
  Patch 3 and removed in Patch 5 is now not introduced at all, as several
  people pointed out it was unnecessarily ugly.

The goal of this patch series is to improve performance of munlock() of large
mlocked memory areas on systems without THP. This is motivated by reported very
long times of crash recovery of processes with such areas, where munlock() can
take several seconds. See http://lwn.net/Articles/548108/

The work was driven by a simple benchmark (to be included in mmtests) that
mmaps() e.g. 56GB with MAP_LOCKED | MAP_POPULATE and measures the time of
munlock(). Profiling was performed by attaching operf --pid to the process
and sending a signal to trigger the munlock() part and then notify bach
the monitoring wrapper to stop operf, so that only munlock() appears in the
profile.

The profiles have shown that CPU time is spent mostly by atomic operations
and repeated locking per single pages. This series aims to reduce both, starting
from simpler to more complex changes.

Patch 1 performs a simple cleanup in putback_lru_page() so that page lru base
	type is not determined without being actually needed.

Patch 2 removes an unnecessary call to lru_add_drain() which drains the per-cpu
	pagevec after each munlocked page is put there.

Patch 3 changes munlock_vma_range() to use an on-stack pagevec for isolating
	multiple non-THP pages under a single lru_lock instead of locking and
	processing each page separately.

Patch 4 changes the NR_MLOCK accounting to be called only once per the pvec
	introduced by previous patch.

Patch 5 uses the introduced pagevec to batch also the work of putback_lru_page
	when possible, bypassing the per-cpu pvec and associated overhead.

Patch 6 removes a redundant get_page/put_page pair which saves costly atomic
	operations.

Patch 7 avoids calling follow_page_mask() on each individual page, and obtains
	multiple page references under a single page table lock where possible.

Measurements were made using 3.11-rc3 as a baseline.
The first set of measurements shows the possibly ideal conditions where
batching should help the most. All memory is allocated from a single NUMA
node and THP is disabled.

timedmunlock
                            3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                   0                     1                     2                     3                     4                     5                     6                     7
Elapsed min           3.38 (  0.00%)        3.39 ( -0.13%)        3.00 ( 11.33%)        2.70 ( 20.20%)        2.67 ( 21.11%)        2.37 ( 29.88%)        2.20 ( 34.91%)        1.91 ( 43.59%)
Elapsed mean          3.39 (  0.00%)        3.40 ( -0.23%)        3.01 ( 11.33%)        2.70 ( 20.26%)        2.67 ( 21.21%)        2.38 ( 29.88%)        2.21 ( 34.93%)        1.92 ( 43.46%)
Elapsed stddev        0.01 (  0.00%)        0.01 (-43.09%)        0.01 ( 15.42%)        0.01 ( 23.42%)        0.00 ( 89.78%)        0.01 ( -7.15%)        0.00 ( 76.69%)        0.02 (-91.77%)
Elapsed max           3.41 (  0.00%)        3.43 ( -0.52%)        3.03 ( 11.29%)        2.72 ( 20.16%)        2.67 ( 21.63%)        2.40 ( 29.50%)        2.21 ( 35.21%)        1.96 ( 42.39%)
Elapsed range         0.03 (  0.00%)        0.04 (-51.16%)        0.02 (  6.27%)        0.02 ( 14.67%)        0.00 ( 88.90%)        0.03 (-19.18%)        0.01 ( 73.70%)        0.06 (-113.35%

The second set of measurements simulates the worst possible conditions for
batching by using numactl --interleave, so that there is in fact only one page
per pagevec. Even in this case the series seems to improve performance thanks
to reduced atomic operations and removal of lru_add_drain().

timedmunlock
                            3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                   0                     1                     2                     3                     4                     5                     6                     7
Elapsed min           4.00 (  0.00%)        4.04 ( -0.93%)        3.87 (  3.37%)        3.72 (  6.94%)        3.81 (  4.72%)        3.69 (  7.82%)        3.64 (  8.92%)        3.41 ( 14.81%)
Elapsed mean          4.17 (  0.00%)        4.15 (  0.51%)        4.03 (  3.49%)        3.89 (  6.84%)        3.86 (  7.48%)        3.89 (  6.69%)        3.70 ( 11.27%)        3.48 ( 16.59%)
Elapsed stddev        0.16 (  0.00%)        0.08 ( 50.76%)        0.10 ( 41.58%)        0.16 (  4.59%)        0.05 ( 72.38%)        0.19 (-12.91%)        0.05 ( 68.09%)        0.06 ( 66.03%)
Elapsed max           4.34 (  0.00%)        4.32 (  0.56%)        4.19 (  3.62%)        4.12 (  5.15%)        3.91 (  9.88%)        4.12 (  5.25%)        3.80 ( 12.58%)        3.56 ( 18.08%)
Elapsed range         0.34 (  0.00%)        0.28 ( 17.91%)        0.32 (  6.45%)        0.40 (-15.73%)        0.10 ( 70.06%)        0.43 (-24.84%)        0.15 ( 55.32%)        0.15 ( 56.16%)

For completeness, a third set of measurements shows the situation where THP is
enabled and allocations are again done on a single NUMA node. Here munlock() is
already very fast thanks to huge pages, and thies series does not compromise
that performance. It seems that the removal of call to lru_add_drain() still
helps a bit.

timedmunlock
                            3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3              3.11-rc3
                                   0                     1                     2                     3                     4                     5                     6                     7
Elapsed min           0.01 (  0.00%)        0.01 ( -0.11%)        0.01 (  6.59%)        0.01 (  5.41%)        0.01 (  5.45%)        0.01 (  5.03%)        0.01 (  6.08%)        0.01 (  5.20%)
Elapsed mean          0.01 (  0.00%)        0.01 ( -0.27%)        0.01 (  6.39%)        0.01 (  5.30%)        0.01 (  5.32%)        0.01 (  5.03%)        0.01 (  5.97%)        0.01 (  5.22%)
Elapsed stddev        0.00 (  0.00%)        0.00 ( -9.59%)        0.00 ( 10.77%)        0.00 (  3.24%)        0.00 ( 24.42%)        0.00 ( 31.86%)        0.00 ( -7.46%)        0.00 (  6.11%)
Elapsed max           0.01 (  0.00%)        0.01 ( -0.01%)        0.01 (  6.83%)        0.01 (  5.42%)        0.01 (  5.79%)        0.01 (  5.53%)        0.01 (  6.08%)        0.01 (  5.26%)
Elapsed range         0.00 (  0.00%)        0.00 (  7.30%)        0.00 ( 24.38%)        0.00 (  6.10%)        0.00 ( 30.79%)        0.00 ( 42.52%)        0.00 (  6.11%)        0.00 ( 10.07%)

Vlastimil Babka (7):
  mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
  mm: munlock: remove unnecessary call to lru_add_drain()
  mm: munlock: batch non-THP page isolation and munlock+putback using
    pagevec
  mm: munlock: batch NR_MLOCK zone state updates
  mm: munlock: bypass per-cpu pvec for putback_lru_page
  mm: munlock: remove redundant get_page/put_page pair on the fast path
  mm: munlock: manual pte walk in fast path instead of
    follow_page_mask()

 mm/mlock.c  | 331 +++++++++++++++++++++++++++++++++++++++++++++++++++---------
 mm/vmscan.c |  12 +--
 2 files changed, 288 insertions(+), 55 deletions(-)

-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 14:48   ` Mel Gorman
  2013-08-19 12:23 ` [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain() Vlastimil Babka
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

In putback_lru_page() since commit c53954a092 (""mm: remove lru parameter from
__lru_cache_add and lru_cache_add_lru") it is no longer needed to determine lru
list via page_lru_base_type().

This patch replaces it with simple flag is_unevictable which says that the page
was put on the inevictable list. This is the only information that matters in
subsequent tests.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/vmscan.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2cff0d4..0fa537e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -545,7 +545,7 @@ int remove_mapping(struct address_space *mapping, struct page *page)
  */
 void putback_lru_page(struct page *page)
 {
-	int lru;
+	bool is_unevictable;
 	int was_unevictable = PageUnevictable(page);
 
 	VM_BUG_ON(PageLRU(page));
@@ -560,14 +560,14 @@ redo:
 		 * unevictable page on [in]active list.
 		 * We know how to handle that.
 		 */
-		lru = page_lru_base_type(page);
+		is_unevictable = false;
 		lru_cache_add(page);
 	} else {
 		/*
 		 * Put unevictable pages directly on zone's unevictable
 		 * list.
 		 */
-		lru = LRU_UNEVICTABLE;
+		is_unevictable = true;
 		add_page_to_unevictable_list(page);
 		/*
 		 * When racing with an mlock or AS_UNEVICTABLE clearing
@@ -587,7 +587,7 @@ redo:
 	 * page is on unevictable list, it never be freed. To avoid that,
 	 * check after we added it to the list, again.
 	 */
-	if (lru == LRU_UNEVICTABLE && page_evictable(page)) {
+	if (is_unevictable && page_evictable(page)) {
 		if (!isolate_lru_page(page)) {
 			put_page(page);
 			goto redo;
@@ -598,9 +598,9 @@ redo:
 		 */
 	}
 
-	if (was_unevictable && lru != LRU_UNEVICTABLE)
+	if (was_unevictable && !is_unevictable)
 		count_vm_event(UNEVICTABLE_PGRESCUED);
-	else if (!was_unevictable && lru == LRU_UNEVICTABLE)
+	else if (!was_unevictable && is_unevictable)
 		count_vm_event(UNEVICTABLE_PGCULLED);
 
 	put_page(page);		/* drop ref from isolate */
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type()
  2013-08-19 12:23 ` [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type() Vlastimil Babka
@ 2013-08-19 14:48   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 14:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:36PM +0200, Vlastimil Babka wrote:
> In putback_lru_page() since commit c53954a092 (""mm: remove lru parameter from
> __lru_cache_add and lru_cache_add_lru") it is no longer needed to determine lru
> list via page_lru_base_type().
> 
> This patch replaces it with simple flag is_unevictable which says that the page
> was put on the inevictable list. This is the only information that matters in
> subsequent tests.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain()
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
  2013-08-19 12:23 ` [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type() Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 14:48   ` Mel Gorman
  2013-08-19 12:23 ` [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec Vlastimil Babka
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

In munlock_vma_range(), lru_add_drain() is currently called in a loop before
each munlock_vma_page() call.
This is suboptimal for performance when munlocking many pages. The benefits
of per-cpu pagevec for batching the LRU putback are removed since the pagevec
only holds at most one page from the previous loop's iteration.

The lru_add_drain() call also does not serve any purposes for correctness - it
does not even drain pagavecs of all cpu's. The munlock code already expects
and handles situations where a page cannot be isolated from the LRU (e.g.
because it is on some per-cpu pagevec).

The history of the (not commented) call also suggest that it appears there as
an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate
munlock() treatment of THP pages") the call happened only once upon entering the
function. The commit has moved the call into the while loope. So while the
other changes in the commit improved munlock performance for THP pages, it
introduced the abovementioned suboptimal per-cpu pagevec usage.

Further in history, before commit 408e82b7 ("mm: munlock use follow_page"),
munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range
which performed both mlock and munlock depending on a flag. However, before
ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function
handled only mlock, not munlock. The lru_add_drain call thus comes from the
implementation in commit b291f000 ("mlock: mlocked pages are unevictable" and
was intended only for mlocking, not munlocking. The original intention of
draining the LRU pagevec at mlock time was to ensure the pages were on the LRU
before the lock operation so that they could be placed on the unevictable list
immediately. There is very little motivation to do the same in the munlock path
this, particularly for every single page.

This patch therefore removes the call completely. After removing the call, a
10% speedup was measured for munlock() of a 56GB large memory area with THP
disabled.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/mlock.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 79b7cf7..b85f1e8 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -247,7 +247,6 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 					&page_mask);
 		if (page && !IS_ERR(page)) {
 			lock_page(page);
-			lru_add_drain();
 			/*
 			 * Any THP page found by follow_page_mask() may have
 			 * gotten split before reaching munlock_vma_page(),
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain()
  2013-08-19 12:23 ` [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain() Vlastimil Babka
@ 2013-08-19 14:48   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 14:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:37PM +0200, Vlastimil Babka wrote:
> In munlock_vma_range(), lru_add_drain() is currently called in a loop before
> each munlock_vma_page() call.
> This is suboptimal for performance when munlocking many pages. The benefits
> of per-cpu pagevec for batching the LRU putback are removed since the pagevec
> only holds at most one page from the previous loop's iteration.
> 
> The lru_add_drain() call also does not serve any purposes for correctness - it
> does not even drain pagavecs of all cpu's. The munlock code already expects
> and handles situations where a page cannot be isolated from the LRU (e.g.
> because it is on some per-cpu pagevec).
> 
> The history of the (not commented) call also suggest that it appears there as
> an oversight rather than intentionally. Before commit ff6a6da6 ("mm: accelerate
> munlock() treatment of THP pages") the call happened only once upon entering the
> function. The commit has moved the call into the while loope. So while the
> other changes in the commit improved munlock performance for THP pages, it
> introduced the abovementioned suboptimal per-cpu pagevec usage.
> 
> Further in history, before commit 408e82b7 ("mm: munlock use follow_page"),
> munlock_vma_pages_range() was just a wrapper around __mlock_vma_pages_range
> which performed both mlock and munlock depending on a flag. However, before
> ba470de4 ("mmap: handle mlocked pages during map, remap, unmap") the function
> handled only mlock, not munlock. The lru_add_drain call thus comes from the
> implementation in commit b291f000 ("mlock: mlocked pages are unevictable" and
> was intended only for mlocking, not munlocking. The original intention of
> draining the LRU pagevec at mlock time was to ensure the pages were on the LRU
> before the lock operation so that they could be placed on the unevictable list
> immediately. There is very little motivation to do the same in the munlock path
> this, particularly for every single page.
> 
> This patch therefore removes the call completely. After removing the call, a
> 10% speedup was measured for munlock() of a 56GB large memory area with THP
> disabled.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
  2013-08-19 12:23 ` [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type() Vlastimil Babka
  2013-08-19 12:23 ` [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain() Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 14:58   ` Mel Gorman
  2013-08-19 22:38   ` Andrew Morton
  2013-08-19 12:23 ` [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates Vlastimil Babka
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop,
which results in repeated taking and releasing of the lru_lock spinlock for
isolating pages one by one. This patch batches the munlock operations using
an on-stack pagevec, so that isolation is done under single lru_lock. For THP
pages, the old behavior is preserved as they might be split while putting them
into the pagevec. After this patch, a 9% speedup was measured for munlocking
a 56GB large memory area with THP disabled.

A new function __munlock_pagevec() is introduced that takes a pagevec and:
1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats
can be also updated using the variant which assumes disabled interrupts.
2) It finishes the munlock and lru putback on all pages under their lock_page.
Note that previously, lock_page covered also the PageMlocked clearing and page
isolation, but it is not needed for those operations.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/mlock.c | 193 ++++++++++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 153 insertions(+), 40 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index b85f1e8..4a19838 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -11,6 +11,7 @@
 #include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/pagemap.h>
+#include <linux/pagevec.h>
 #include <linux/mempolicy.h>
 #include <linux/syscalls.h>
 #include <linux/sched.h>
@@ -18,6 +19,8 @@
 #include <linux/rmap.h>
 #include <linux/mmzone.h>
 #include <linux/hugetlb.h>
+#include <linux/memcontrol.h>
+#include <linux/mm_inline.h>
 
 #include "internal.h"
 
@@ -87,6 +90,47 @@ void mlock_vma_page(struct page *page)
 	}
 }
 
+/*
+ * Finish munlock after successful page isolation
+ *
+ * Page must be locked. This is a wrapper for try_to_munlock()
+ * and putback_lru_page() with munlock accounting.
+ */
+static void __munlock_isolated_page(struct page *page)
+{
+	int ret = SWAP_AGAIN;
+
+	/*
+	 * Optimization: if the page was mapped just once, that's our mapping
+	 * and we don't need to check all the other vmas.
+	 */
+	if (page_mapcount(page) > 1)
+		ret = try_to_munlock(page);
+
+	/* Did try_to_unlock() succeed or punt? */
+	if (ret != SWAP_MLOCK)
+		count_vm_event(UNEVICTABLE_PGMUNLOCKED);
+
+	putback_lru_page(page);
+}
+
+/*
+ * Accounting for page isolation fail during munlock
+ *
+ * Performs accounting when page isolation fails in munlock. There is nothing
+ * else to do because it means some other task has already removed the page
+ * from the LRU. putback_lru_page() will take care of removing the page from
+ * the unevictable list, if necessary. vmscan [page_referenced()] will move
+ * the page back to the unevictable list if some other vma has it mlocked.
+ */
+static void __munlock_isolation_failed(struct page *page)
+{
+	if (PageUnevictable(page))
+		count_vm_event(UNEVICTABLE_PGSTRANDED);
+	else
+		count_vm_event(UNEVICTABLE_PGMUNLOCKED);
+}
+
 /**
  * munlock_vma_page - munlock a vma page
  * @page - page to be unlocked
@@ -112,37 +156,10 @@ unsigned int munlock_vma_page(struct page *page)
 		unsigned int nr_pages = hpage_nr_pages(page);
 		mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
 		page_mask = nr_pages - 1;
-		if (!isolate_lru_page(page)) {
-			int ret = SWAP_AGAIN;
-
-			/*
-			 * Optimization: if the page was mapped just once,
-			 * that's our mapping and we don't need to check all the
-			 * other vmas.
-			 */
-			if (page_mapcount(page) > 1)
-				ret = try_to_munlock(page);
-			/*
-			 * did try_to_unlock() succeed or punt?
-			 */
-			if (ret != SWAP_MLOCK)
-				count_vm_event(UNEVICTABLE_PGMUNLOCKED);
-
-			putback_lru_page(page);
-		} else {
-			/*
-			 * Some other task has removed the page from the LRU.
-			 * putback_lru_page() will take care of removing the
-			 * page from the unevictable list, if necessary.
-			 * vmscan [page_referenced()] will move the page back
-			 * to the unevictable list if some other vma has it
-			 * mlocked.
-			 */
-			if (PageUnevictable(page))
-				count_vm_event(UNEVICTABLE_PGSTRANDED);
-			else
-				count_vm_event(UNEVICTABLE_PGMUNLOCKED);
-		}
+		if (!isolate_lru_page(page))
+			__munlock_isolated_page(page);
+		else
+			__munlock_isolation_failed(page);
 	}
 
 	return page_mask;
@@ -210,6 +227,70 @@ static int __mlock_posix_error_return(long retval)
 }
 
 /*
+ * Munlock a batch of pages from the same zone
+ *
+ * The work is split to two main phases. First phase clears the Mlocked flag
+ * and attempts to isolate the pages, all under a single zone lru lock.
+ * The second phase finishes the munlock only for pages where isolation
+ * succeeded.
+ */
+static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
+{
+	int i;
+	int nr = pagevec_count(pvec);
+
+	/* Phase 1: page isolation */
+	spin_lock_irq(&zone->lru_lock);
+	for (i = 0; i < nr; i++) {
+		struct page *page = pvec->pages[i];
+
+		if (TestClearPageMlocked(page)) {
+			struct lruvec *lruvec;
+			int lru;
+
+			/* we have disabled interrupts */
+			__mod_zone_page_state(zone, NR_MLOCK, -1);
+
+			if (PageLRU(page)) {
+				lruvec = mem_cgroup_page_lruvec(page, zone);
+				lru = page_lru(page);
+
+				get_page(page);
+				ClearPageLRU(page);
+				del_page_from_lru_list(page, lruvec, lru);
+			} else {
+				__munlock_isolation_failed(page);
+				goto skip_munlock;
+			}
+
+		} else {
+skip_munlock:
+			/*
+			 * We won't be munlocking this page in the next phase
+			 * but we still need to release the follow_page_mask()
+			 * pin.
+			 */
+			pvec->pages[i] = NULL;
+			put_page(page);
+		}
+	}
+	spin_unlock_irq(&zone->lru_lock);
+
+	/* Phase 2: page munlock and putback */
+	for (i = 0; i < nr; i++) {
+		struct page *page = pvec->pages[i];
+
+		if (page) {
+			lock_page(page);
+			__munlock_isolated_page(page);
+			unlock_page(page);
+			put_page(page); /* pin from follow_page_mask() */
+		}
+	}
+	pagevec_reinit(pvec);
+}
+
+/*
  * munlock_vma_pages_range() - munlock all pages in the vma range.'
  * @vma - vma containing range to be munlock()ed.
  * @start - start address in @vma of the range
@@ -230,11 +311,16 @@ static int __mlock_posix_error_return(long retval)
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
+	struct pagevec pvec;
+	struct zone *zone = NULL;
+
+	pagevec_init(&pvec, 0);
 	vma->vm_flags &= ~VM_LOCKED;
 
 	while (start < end) {
 		struct page *page;
 		unsigned int page_mask, page_increm;
+		struct zone *pagezone;
 
 		/*
 		 * Although FOLL_DUMP is intended for get_dump_page(),
@@ -246,20 +332,47 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
 					&page_mask);
 		if (page && !IS_ERR(page)) {
-			lock_page(page);
-			/*
-			 * Any THP page found by follow_page_mask() may have
-			 * gotten split before reaching munlock_vma_page(),
-			 * so we need to recompute the page_mask here.
-			 */
-			page_mask = munlock_vma_page(page);
-			unlock_page(page);
-			put_page(page);
+			pagezone = page_zone(page);
+			/* The whole pagevec must be in the same zone */
+			if (pagezone != zone) {
+				if (pagevec_count(&pvec))
+					__munlock_pagevec(&pvec, zone);
+				zone = pagezone;
+			}
+			if (PageTransHuge(page)) {
+				/*
+				 * THP pages are not handled by pagevec due
+				 * to their possible split (see below).
+				 */
+				if (pagevec_count(&pvec))
+					__munlock_pagevec(&pvec, zone);
+				lock_page(page);
+				/*
+				 * Any THP page found by follow_page_mask() may
+				 * have gotten split before reaching
+				 * munlock_vma_page(), so we need to recompute
+				 * the page_mask here.
+				 */
+				page_mask = munlock_vma_page(page);
+				unlock_page(page);
+				put_page(page); /* follow_page_mask() */
+			} else {
+				/*
+				 * Non-huge pages are handled in batches
+				 * via pagevec. The pin from
+				 * follow_page_mask() prevents them from
+				 * collapsing by THP.
+				 */
+				if (pagevec_add(&pvec, page) == 0)
+					__munlock_pagevec(&pvec, zone);
+			}
 		}
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
 		start += page_increm * PAGE_SIZE;
 		cond_resched();
 	}
+	if (pagevec_count(&pvec))
+		__munlock_pagevec(&pvec, zone);
 }
 
 /*
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec
  2013-08-19 12:23 ` [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec Vlastimil Babka
@ 2013-08-19 14:58   ` Mel Gorman
  2013-08-19 22:38   ` Andrew Morton
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 14:58 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:38PM +0200, Vlastimil Babka wrote:
> Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop,
> which results in repeated taking and releasing of the lru_lock spinlock for
> isolating pages one by one. This patch batches the munlock operations using
> an on-stack pagevec, so that isolation is done under single lru_lock. For THP
> pages, the old behavior is preserved as they might be split while putting them
> into the pagevec. After this patch, a 9% speedup was measured for munlocking
> a 56GB large memory area with THP disabled.
> 
> A new function __munlock_pagevec() is introduced that takes a pagevec and:
> 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats
> can be also updated using the variant which assumes disabled interrupts.
> 2) It finishes the munlock and lru putback on all pages under their lock_page.
> Note that previously, lock_page covered also the PageMlocked clearing and page
> isolation, but it is not needed for those operations.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec
  2013-08-19 12:23 ` [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec Vlastimil Babka
  2013-08-19 14:58   ` Mel Gorman
@ 2013-08-19 22:38   ` Andrew Morton
  2013-08-22 11:13     ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-08-19 22:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On Mon, 19 Aug 2013 14:23:38 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop,
> which results in repeated taking and releasing of the lru_lock spinlock for
> isolating pages one by one. This patch batches the munlock operations using
> an on-stack pagevec, so that isolation is done under single lru_lock. For THP
> pages, the old behavior is preserved as they might be split while putting them
> into the pagevec. After this patch, a 9% speedup was measured for munlocking
> a 56GB large memory area with THP disabled.
> 
> A new function __munlock_pagevec() is introduced that takes a pagevec and:
> 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats
> can be also updated using the variant which assumes disabled interrupts.
> 2) It finishes the munlock and lru putback on all pages under their lock_page.
> Note that previously, lock_page covered also the PageMlocked clearing and page
> isolation, but it is not needed for those operations.
> 
> ...
>
> +static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
> +{
> +	int i;
> +	int nr = pagevec_count(pvec);
> +
> +	/* Phase 1: page isolation */
> +	spin_lock_irq(&zone->lru_lock);
> +	for (i = 0; i < nr; i++) {
> +		struct page *page = pvec->pages[i];
> +
> +		if (TestClearPageMlocked(page)) {
> +			struct lruvec *lruvec;
> +			int lru;
> +
> +			/* we have disabled interrupts */
> +			__mod_zone_page_state(zone, NR_MLOCK, -1);
> +
> +			if (PageLRU(page)) {
> +				lruvec = mem_cgroup_page_lruvec(page, zone);
> +				lru = page_lru(page);
> +
> +				get_page(page);
> +				ClearPageLRU(page);
> +				del_page_from_lru_list(page, lruvec, lru);
> +			} else {
> +				__munlock_isolation_failed(page);
> +				goto skip_munlock;
> +			}
> +
> +		} else {
> +skip_munlock:
> +			/*
> +			 * We won't be munlocking this page in the next phase
> +			 * but we still need to release the follow_page_mask()
> +			 * pin.
> +			 */
> +			pvec->pages[i] = NULL;
> +			put_page(page);
> +		}
> +	}
> +	spin_unlock_irq(&zone->lru_lock);
> +
> +	/* Phase 2: page munlock and putback */
> +	for (i = 0; i < nr; i++) {
> +		struct page *page = pvec->pages[i];
> +
> +		if (page) {
> +			lock_page(page);
> +			__munlock_isolated_page(page);
> +			unlock_page(page);
> +			put_page(page); /* pin from follow_page_mask() */
> +		}
> +	}
> +	pagevec_reinit(pvec);

A minor thing: it would be a little neater if the pagevec_reinit() was
in the caller, munlock_vma_pages_range().  So the caller remains in
control of the state of the pagevec and the callee treats it in a
read-only fashion.

> +}
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec
  2013-08-19 22:38   ` Andrew Morton
@ 2013-08-22 11:13     ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-22 11:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On 08/20/2013 12:38 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:38 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> Currently, munlock_vma_range() calls munlock_vma_page on each page in a loop,
>> which results in repeated taking and releasing of the lru_lock spinlock for
>> isolating pages one by one. This patch batches the munlock operations using
>> an on-stack pagevec, so that isolation is done under single lru_lock. For THP
>> pages, the old behavior is preserved as they might be split while putting them
>> into the pagevec. After this patch, a 9% speedup was measured for munlocking
>> a 56GB large memory area with THP disabled.
>>
>> A new function __munlock_pagevec() is introduced that takes a pagevec and:
>> 1) It clears PageMlocked and isolates all pages under lru_lock. Zone page stats
>> can be also updated using the variant which assumes disabled interrupts.
>> 2) It finishes the munlock and lru putback on all pages under their lock_page.
>> Note that previously, lock_page covered also the PageMlocked clearing and page
>> isolation, but it is not needed for those operations.
>>
>> ...
>>
>> +static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
>> +{
>> +	int i;
>> +	int nr = pagevec_count(pvec);
>> +
>> +	/* Phase 1: page isolation */
>> +	spin_lock_irq(&zone->lru_lock);
>> +	for (i = 0; i < nr; i++) {
>> +		struct page *page = pvec->pages[i];
>> +
>> +		if (TestClearPageMlocked(page)) {
>> +			struct lruvec *lruvec;
>> +			int lru;
>> +
>> +			/* we have disabled interrupts */
>> +			__mod_zone_page_state(zone, NR_MLOCK, -1);
>> +
>> +			if (PageLRU(page)) {
>> +				lruvec = mem_cgroup_page_lruvec(page, zone);
>> +				lru = page_lru(page);
>> +
>> +				get_page(page);
>> +				ClearPageLRU(page);
>> +				del_page_from_lru_list(page, lruvec, lru);
>> +			} else {
>> +				__munlock_isolation_failed(page);
>> +				goto skip_munlock;
>> +			}
>> +
>> +		} else {
>> +skip_munlock:
>> +			/*
>> +			 * We won't be munlocking this page in the next phase
>> +			 * but we still need to release the follow_page_mask()
>> +			 * pin.
>> +			 */
>> +			pvec->pages[i] = NULL;
>> +			put_page(page);
>> +		}
>> +	}
>> +	spin_unlock_irq(&zone->lru_lock);
>> +
>> +	/* Phase 2: page munlock and putback */
>> +	for (i = 0; i < nr; i++) {
>> +		struct page *page = pvec->pages[i];
>> +
>> +		if (page) {
>> +			lock_page(page);
>> +			__munlock_isolated_page(page);
>> +			unlock_page(page);
>> +			put_page(page); /* pin from follow_page_mask() */
>> +		}
>> +	}
>> +	pagevec_reinit(pvec);
> 
> A minor thing: it would be a little neater if the pagevec_reinit() was
> in the caller, munlock_vma_pages_range().  So the caller remains in
> control of the state of the pagevec and the callee treats it in a
> read-only fashion.


Yeah that's right. Unfortunately the function may also modify the
pagevec by setting a page pointer to NULL. When it fails to isolate the
page, it does that to mark it as excluded from further phases of
processing. I'm not sure that it's worth to allocate an extra array for
such marking just to avoid the modification.

So maybe we could just clarify this in the function's comment?

--------------------->8----------------------------
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 22 Aug 2013 11:30:28 +0200
Subject: [PATCH v2 4/9]
 mm-munlock-batch-non-thp-page-isolation-and-munlockputback-using-pagevec-fix

Clarify in __munlock_pagevec() comment that pagegec is modified and
reinited.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mlock.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/mlock.c b/mm/mlock.c
index 4a19838..4b3fc72 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -233,6 +233,9 @@ static int __mlock_posix_error_return(long retval)
  * and attempts to isolate the pages, all under a single zone lru lock.
  * The second phase finishes the munlock only for pages where isolation
  * succeeded.
+ *
+ * Note that pvec is modified during the process. Before returning
+ * pagevec_reinit() is called on it.
  */
 static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
-- 1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
                   ` (2 preceding siblings ...)
  2013-08-19 12:23 ` [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 15:01   ` Mel Gorman
  2013-08-19 12:23 ` [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page Vlastimil Babka
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

Depending on previous batch which introduced batched isolation in
munlock_vma_range(), we can batch also the updates of NR_MLOCK
page stats. After the whole pagevec is processed for page isolation,
the stats are updated only once with the number of successful isolations.
There were however no measurable perfomance gains.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/mlock.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 4a19838..95c152d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -238,6 +238,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 {
 	int i;
 	int nr = pagevec_count(pvec);
+	int delta_munlocked = -nr;
 
 	/* Phase 1: page isolation */
 	spin_lock_irq(&zone->lru_lock);
@@ -248,9 +249,6 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			struct lruvec *lruvec;
 			int lru;
 
-			/* we have disabled interrupts */
-			__mod_zone_page_state(zone, NR_MLOCK, -1);
-
 			if (PageLRU(page)) {
 				lruvec = mem_cgroup_page_lruvec(page, zone);
 				lru = page_lru(page);
@@ -272,8 +270,10 @@ skip_munlock:
 			 */
 			pvec->pages[i] = NULL;
 			put_page(page);
+			delta_munlocked++;
 		}
 	}
+	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
 	spin_unlock_irq(&zone->lru_lock);
 
 	/* Phase 2: page munlock and putback */
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates
  2013-08-19 12:23 ` [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates Vlastimil Babka
@ 2013-08-19 15:01   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 15:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:39PM +0200, Vlastimil Babka wrote:
> Depending on previous batch which introduced batched isolation in
> munlock_vma_range(), we can batch also the updates of NR_MLOCK
> page stats. After the whole pagevec is processed for page isolation,
> the stats are updated only once with the number of successful isolations.
> There were however no measurable perfomance gains.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
                   ` (3 preceding siblings ...)
  2013-08-19 12:23 ` [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 15:05   ` Mel Gorman
  2013-08-19 22:45   ` Andrew Morton
  2013-08-19 12:23 ` [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path Vlastimil Babka
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

After introducing batching by pagevecs into munlock_vma_range(), we can further
improve performance by bypassing the copying into per-cpu pagevec and the
get_page/put_page pair associated with that. Instead we perform LRU putback
directly from our pagevec. However, this is possible only for single-mapped
pages that are evictable after munlock. Unevictable pages require rechecking
after putting on the unevictable list, so for those we fallback to
putback_lru_page(), hich handles that.

After this patch, a 13% speedup was measured for munlocking a 56GB large memory
area with THP disabled.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/mlock.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 95c152d..43c1828 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -227,6 +227,49 @@ static int __mlock_posix_error_return(long retval)
 }
 
 /*
+ * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec()
+ *
+ * The fast path is available only for evictable pages with single mapping.
+ * Then we can bypass the per-cpu pvec and get better performance.
+ * when mapcount > 1 we need try_to_munlock() which can fail.
+ * when !page_evictable(), we need the full redo logic of putback_lru_page to
+ * avoid leaving evictable page in unevictable list.
+ *
+ * In case of success, @page is added to @pvec and @pgrescued is incremented
+ * in case that the page was previously unevictable. @page is also unlocked.
+ */
+static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec,
+		int *pgrescued)
+{
+	VM_BUG_ON(PageLRU(page));
+	VM_BUG_ON(!PageLocked(page));
+
+	if (page_mapcount(page) <= 1 && page_evictable(page)) {
+		pagevec_add(pvec, page);
+		if (TestClearPageUnevictable(page))
+			(*pgrescued)++;
+		unlock_page(page);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Putback multiple evictable pages to the LRU
+ *
+ * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of
+ * the pages might have meanwhile become unevictable but that is OK.
+ */
+static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
+{
+	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
+	/* This includes put_page so we don't call it explicitly */
+	__pagevec_lru_add(pvec);
+	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
+}
+
+/*
  * Munlock a batch of pages from the same zone
  *
  * The work is split to two main phases. First phase clears the Mlocked flag
@@ -239,6 +282,8 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 	int i;
 	int nr = pagevec_count(pvec);
 	int delta_munlocked = -nr;
+	struct pagevec pvec_putback;
+	int pgrescued = 0;
 
 	/* Phase 1: page isolation */
 	spin_lock_irq(&zone->lru_lock);
@@ -276,17 +321,34 @@ skip_munlock:
 	__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
 	spin_unlock_irq(&zone->lru_lock);
 
-	/* Phase 2: page munlock and putback */
+	/* Phase 2: page munlock */
+	pagevec_init(&pvec_putback, 0);
 	for (i = 0; i < nr; i++) {
 		struct page *page = pvec->pages[i];
 
 		if (page) {
 			lock_page(page);
-			__munlock_isolated_page(page);
-			unlock_page(page);
-			put_page(page); /* pin from follow_page_mask() */
+			if (!__putback_lru_fast_prepare(page, &pvec_putback,
+					&pgrescued)) {
+				/* Slow path */
+				__munlock_isolated_page(page);
+				unlock_page(page);
+			}
 		}
 	}
+
+	/* Phase 3: page putback for pages that qualified for the fast path */
+	if (pagevec_count(&pvec_putback))
+		__putback_lru_fast(&pvec_putback, pgrescued);
+
+	/* Phase 4: put_page to return pin from follow_page_mask() */
+	for (i = 0; i < nr; i++) {
+		struct page *page = pvec->pages[i];
+
+		if (page)
+			put_page(page);
+	}
+
 	pagevec_reinit(pvec);
 }
 
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page
  2013-08-19 12:23 ` [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page Vlastimil Babka
@ 2013-08-19 15:05   ` Mel Gorman
  2013-08-19 22:45   ` Andrew Morton
  1 sibling, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 15:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:40PM +0200, Vlastimil Babka wrote:
> After introducing batching by pagevecs into munlock_vma_range(), we can further
> improve performance by bypassing the copying into per-cpu pagevec and the
> get_page/put_page pair associated with that. Instead we perform LRU putback
> directly from our pagevec. However, this is possible only for single-mapped
> pages that are evictable after munlock. Unevictable pages require rechecking
> after putting on the unevictable list, so for those we fallback to
> putback_lru_page(), hich handles that.
> 
> After this patch, a 13% speedup was measured for munlocking a 56GB large memory
> area with THP disabled.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page
  2013-08-19 12:23 ` [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page Vlastimil Babka
  2013-08-19 15:05   ` Mel Gorman
@ 2013-08-19 22:45   ` Andrew Morton
  2013-08-22 11:16     ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-08-19 22:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On Mon, 19 Aug 2013 14:23:40 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> After introducing batching by pagevecs into munlock_vma_range(), we can further
> improve performance by bypassing the copying into per-cpu pagevec and the
> get_page/put_page pair associated with that. Instead we perform LRU putback
> directly from our pagevec. However, this is possible only for single-mapped
> pages that are evictable after munlock. Unevictable pages require rechecking
> after putting on the unevictable list, so for those we fallback to
> putback_lru_page(), hich handles that.
> 
> After this patch, a 13% speedup was measured for munlocking a 56GB large memory
> area with THP disabled.
> 
> ...
>
> +static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
> +{
> +	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
> +	/* This includes put_page so we don't call it explicitly */

This had me confused for a sec.  __pagevec_lru_add() includes put_page,
so we don't call __pagevec_lru_add()?  That's the problem with the word
"it" - one often doesn't know what it refers to.

Clarity: 

--- a/mm/mlock.c~mm-munlock-bypass-per-cpu-pvec-for-putback_lru_page-fix
+++ a/mm/mlock.c
@@ -264,7 +264,10 @@ static bool __putback_lru_fast_prepare(s
 static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
 {
 	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
-	/* This includes put_page so we don't call it explicitly */
+	/*
+	 *__pagevec_lru_add() calls release_pages() so we don't call
+	 * put_page() explicitly
+	 */
 	__pagevec_lru_add(pvec);
 	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
 }

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page
  2013-08-19 22:45   ` Andrew Morton
@ 2013-08-22 11:16     ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-22 11:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On 08/20/2013 12:45 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:40 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> After introducing batching by pagevecs into munlock_vma_range(), we can further
>> improve performance by bypassing the copying into per-cpu pagevec and the
>> get_page/put_page pair associated with that. Instead we perform LRU putback
>> directly from our pagevec. However, this is possible only for single-mapped
>> pages that are evictable after munlock. Unevictable pages require rechecking
>> after putting on the unevictable list, so for those we fallback to
>> putback_lru_page(), hich handles that.
>>
>> After this patch, a 13% speedup was measured for munlocking a 56GB large memory
>> area with THP disabled.
>>
>> ...
>>
>> +static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
>> +{
>> +	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
>> +	/* This includes put_page so we don't call it explicitly */
> 
> This had me confused for a sec.  __pagevec_lru_add() includes put_page,
> so we don't call __pagevec_lru_add()?  That's the problem with the word
> "it" - one often doesn't know what it refers to.
> 
> Clarity: 
> 
> --- a/mm/mlock.c~mm-munlock-bypass-per-cpu-pvec-for-putback_lru_page-fix
> +++ a/mm/mlock.c
> @@ -264,7 +264,10 @@ static bool __putback_lru_fast_prepare(s
>  static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
>  {
>  	count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
> -	/* This includes put_page so we don't call it explicitly */
> +	/*
> +	 *__pagevec_lru_add() calls release_pages() so we don't call
> +	 * put_page() explicitly
> +	 */
>  	__pagevec_lru_add(pvec);
>  	count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
>  }

Yes this is definitely better, thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
                   ` (4 preceding siblings ...)
  2013-08-19 12:23 ` [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 15:07   ` Mel Gorman
  2013-08-19 12:23 ` [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask() Vlastimil Babka
  2013-08-19 22:48 ` [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Andrew Morton
  7 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

The performance of the fast path in munlock_vma_range() can be further improved
by avoiding atomic ops of a redundant get_page()/put_page() pair.

When calling get_page() during page isolation, we already have the pin from
follow_page_mask(). This pin will be then returned by __pagevec_lru_add(),
after which we do not reference the pages anymore.

After this patch, an 8% speedup was measured for munlocking a 56GB large memory
area with THP disabled.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: JA?rn Engel <joern@logfs.org>
---
 mm/mlock.c | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 43c1828..77ddd6a 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -297,8 +297,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 			if (PageLRU(page)) {
 				lruvec = mem_cgroup_page_lruvec(page, zone);
 				lru = page_lru(page);
-
-				get_page(page);
+				/*
+				 * We already have pin from follow_page_mask()
+				 * so we can spare the get_page() here.
+				 */
 				ClearPageLRU(page);
 				del_page_from_lru_list(page, lruvec, lru);
 			} else {
@@ -330,25 +332,25 @@ skip_munlock:
 			lock_page(page);
 			if (!__putback_lru_fast_prepare(page, &pvec_putback,
 					&pgrescued)) {
-				/* Slow path */
+				/*
+				 * Slow path. We don't want to lose the last
+				 * pin before unlock_page()
+				 */
+				get_page(page); /* for putback_lru_page() */
 				__munlock_isolated_page(page);
 				unlock_page(page);
+				put_page(page); /* from follow_page_mask() */
 			}
 		}
 	}
 
-	/* Phase 3: page putback for pages that qualified for the fast path */
+	/*
+	 * Phase 3: page putback for pages that qualified for the fast path
+	 * This will also call put_page() to return pin from follow_page_mask()
+	 */
 	if (pagevec_count(&pvec_putback))
 		__putback_lru_fast(&pvec_putback, pgrescued);
 
-	/* Phase 4: put_page to return pin from follow_page_mask() */
-	for (i = 0; i < nr; i++) {
-		struct page *page = pvec->pages[i];
-
-		if (page)
-			put_page(page);
-	}
-
 	pagevec_reinit(pvec);
 }
 
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path
  2013-08-19 12:23 ` [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path Vlastimil Babka
@ 2013-08-19 15:07   ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2013-08-19 15:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, J?rn Engel, Michel Lespinasse, Hugh Dickins,
	Rik van Riel, Johannes Weiner, Michal Hocko, linux-mm

On Mon, Aug 19, 2013 at 02:23:41PM +0200, Vlastimil Babka wrote:
> The performance of the fast path in munlock_vma_range() can be further improved
> by avoiding atomic ops of a redundant get_page()/put_page() pair.
> 
> When calling get_page() during page isolation, we already have the pin from
> follow_page_mask(). This pin will be then returned by __pagevec_lru_add(),
> after which we do not reference the pages anymore.
> 
> After this patch, an 8% speedup was measured for munlocking a 56GB large memory
> area with THP disabled.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Reviewed-by: Jorn Engel <joern@logfs.org>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
                   ` (5 preceding siblings ...)
  2013-08-19 12:23 ` [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path Vlastimil Babka
@ 2013-08-19 12:23 ` Vlastimil Babka
  2013-08-19 22:47   ` Andrew Morton
  2013-08-27 22:24   ` Andrew Morton
  2013-08-19 22:48 ` [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Andrew Morton
  7 siblings, 2 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-19 12:23 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm,
	Vlastimil Babka

Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
struct page. This entails repeated full page table translations and page table
lock taken for each page separately.

This patch attempts to avoid the costly follow_page_mask() where possible, by
iterating over ptes within single pmd under single page table lock. The first
pte is obtained by get_locked_pte() for non-THP page acquired by the initial
follow_page_mask(). The latter function is also used as a fallback in case
simple pte_present() and vm_normal_page() are not sufficient to obtain the
struct page.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mlock.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 11 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 77ddd6a..f9f21f4 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -377,33 +377,73 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 {
 	struct pagevec pvec;
 	struct zone *zone = NULL;
+	pte_t *pte = NULL;
+	spinlock_t *ptl;
+	unsigned long pmd_end;
 
 	pagevec_init(&pvec, 0);
 	vma->vm_flags &= ~VM_LOCKED;
 
 	while (start < end) {
-		struct page *page;
+		struct page *page = NULL;
 		unsigned int page_mask, page_increm;
 		struct zone *pagezone;
 
+		/* If we can, try pte walk instead of follow_page_mask() */
+		if (pte && start < pmd_end) {
+			pte++;
+			if (pte_present(*pte))
+				page = vm_normal_page(vma, start, *pte);
+			if (page) {
+				get_page(page);
+				page_mask = 0;
+			}
+		}
+
 		/*
-		 * Although FOLL_DUMP is intended for get_dump_page(),
-		 * it just so happens that its special treatment of the
-		 * ZERO_PAGE (returning an error instead of doing get_page)
-		 * suits munlock very well (and if somehow an abnormal page
-		 * has sneaked into the range, we won't oops here: great).
+		 * If we did sucessful pte walk step, use that page.
+		 * Otherwise (NULL pte, !pte_present or vm_normal_page failed
+		 * due to e.g. zero page), fallback to follow_page_mask() which
+		 * handles all exceptions.
 		 */
-		page = follow_page_mask(vma, start, FOLL_GET | FOLL_DUMP,
-					&page_mask);
+		if (!page) {
+			if (pte) {
+				pte_unmap_unlock(pte, ptl);
+				pte = NULL;
+			}
+
+			/*
+			 * Although FOLL_DUMP is intended for get_dump_page(),
+			 * it just so happens that its special treatment of the
+			 * ZERO_PAGE (returning an error instead of doing
+			 * get_page) suits munlock very well (and if somehow an
+			 * abnormal page has sneaked into the range, we won't
+			 * oops here: great).
+			 */
+			page = follow_page_mask(vma, start,
+					FOLL_GET | FOLL_DUMP, &page_mask);
+			pmd_end = pmd_addr_end(start, end);
+		}
+
 		if (page && !IS_ERR(page)) {
 			pagezone = page_zone(page);
 			/* The whole pagevec must be in the same zone */
 			if (pagezone != zone) {
-				if (pagevec_count(&pvec))
+				if (pagevec_count(&pvec)) {
+					if (pte) {
+						pte_unmap_unlock(pte, ptl);
+						pte = NULL;
+					}
 					__munlock_pagevec(&pvec, zone);
+				}
 				zone = pagezone;
 			}
 			if (PageTransHuge(page)) {
+				/* 
+				 * We could not have stumbled upon a THP page
+				 * using the pte walk.
+				 */
+				VM_BUG_ON(pte);
 				/*
 				 * THP pages are not handled by pagevec due
 				 * to their possible split (see below).
@@ -422,19 +462,36 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 				put_page(page); /* follow_page_mask() */
 			} else {
 				/*
+				 * Initialize pte walk for further pages. We
+				 * can do this here since we know the current
+				 * page is not THP.
+				 */
+				if (!pte)
+					pte = get_locked_pte(vma->vm_mm, start,
+							&ptl);
+				/*
 				 * Non-huge pages are handled in batches
 				 * via pagevec. The pin from
 				 * follow_page_mask() prevents them from
 				 * collapsing by THP.
 				 */
-				if (pagevec_add(&pvec, page) == 0)
+				if (pagevec_add(&pvec, page) == 0) {
+					if (pte) {
+						pte_unmap_unlock(pte, ptl);
+						pte = NULL;
+					}
 					__munlock_pagevec(&pvec, zone);
+				}
 			}
 		}
 		page_increm = 1 + (~(start >> PAGE_SHIFT) & page_mask);
 		start += page_increm * PAGE_SIZE;
-		cond_resched();
+		/* Don't resched while ptl is held */
+		if (!pte)
+			cond_resched();
 	}
+	if (pte)
+		pte_unmap_unlock(pte, ptl);
 	if (pagevec_count(&pvec))
 		__munlock_pagevec(&pvec, zone);
 }
-- 
1.8.1.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()
  2013-08-19 12:23 ` [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask() Vlastimil Babka
@ 2013-08-19 22:47   ` Andrew Morton
  2013-08-22 11:18     ` Vlastimil Babka
  2013-08-27 22:24   ` Andrew Morton
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-08-19 22:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On Mon, 19 Aug 2013 14:23:42 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
> struct page. This entails repeated full page table translations and page table
> lock taken for each page separately.
> 
> This patch attempts to avoid the costly follow_page_mask() where possible, by
> iterating over ptes within single pmd under single page table lock. The first
> pte is obtained by get_locked_pte() for non-THP page acquired by the initial
> follow_page_mask(). The latter function is also used as a fallback in case
> simple pte_present() and vm_normal_page() are not sufficient to obtain the
> struct page.

Patch #7 appears to provide significant performance gains, but the
improvement wasn't individually described here, unlike the other
patches.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()
  2013-08-19 22:47   ` Andrew Morton
@ 2013-08-22 11:18     ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-22 11:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On 08/20/2013 12:47 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:42 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
>> struct page. This entails repeated full page table translations and page table
>> lock taken for each page separately.
>>
>> This patch attempts to avoid the costly follow_page_mask() where possible, by
>> iterating over ptes within single pmd under single page table lock. The first
>> pte is obtained by get_locked_pte() for non-THP page acquired by the initial
>> follow_page_mask(). The latter function is also used as a fallback in case
>> simple pte_present() and vm_normal_page() are not sufficient to obtain the
>> struct page.
> 
> Patch #7 appears to provide significant performance gains, but the
> improvement wasn't individually described here, unlike the other
> patches.

Oops I forgot to mention this here. Can you please add the following to
the comment then? Thanks.

After this patch, a 13% speedup was measured for munlocking a 56GB large
memory area with THP disabled.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()
  2013-08-19 12:23 ` [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask() Vlastimil Babka
  2013-08-19 22:47   ` Andrew Morton
@ 2013-08-27 22:24   ` Andrew Morton
  2013-08-29 13:02     ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-08-27 22:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On Mon, 19 Aug 2013 14:23:42 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
> struct page. This entails repeated full page table translations and page table
> lock taken for each page separately.
> 
> This patch attempts to avoid the costly follow_page_mask() where possible, by
> iterating over ptes within single pmd under single page table lock. The first
> pte is obtained by get_locked_pte() for non-THP page acquired by the initial
> follow_page_mask(). The latter function is also used as a fallback in case
> simple pte_present() and vm_normal_page() are not sufficient to obtain the
> struct page.

mm/mlock.c: In function 'munlock_vma_pages_range':
mm/mlock.c:388: warning: 'pmd_end' may be used uninitialized in this function

As far as I can tell, this is notabug, but I'm not at all confident in
that - the protocol for locals `pte' and `pmd_end' is bizarre.

The function is fantastically hard to follow and deserves to be dragged
outside, shot repeatedly then burned.  Could you please, as a matter of
some urgency, take a look at rewriting the entire thing so that it is
less than completely insane?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask()
  2013-08-27 22:24   ` Andrew Morton
@ 2013-08-29 13:02     ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-29 13:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On 08/28/2013 12:24 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:42 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> Currently munlock_vma_pages_range() calls follow_page_mask() to obtain each
>> struct page. This entails repeated full page table translations and page table
>> lock taken for each page separately.
>>
>> This patch attempts to avoid the costly follow_page_mask() where possible, by
>> iterating over ptes within single pmd under single page table lock. The first
>> pte is obtained by get_locked_pte() for non-THP page acquired by the initial
>> follow_page_mask(). The latter function is also used as a fallback in case
>> simple pte_present() and vm_normal_page() are not sufficient to obtain the
>> struct page.
> 
> mm/mlock.c: In function 'munlock_vma_pages_range':
> mm/mlock.c:388: warning: 'pmd_end' may be used uninitialized in this function
> 
> As far as I can tell, this is notabug, but I'm not at all confident in
> that - the protocol for locals `pte' and `pmd_end' is bizarre.

I agree with both points.
 
> The function is fantastically hard to follow and deserves to be dragged
> outside, shot repeatedly then burned.

Aww, poor function, and it's all my fault. Let's put it on a diet instead...

> Could you please, as a matter of
> some urgency, take a look at rewriting the entire thing so that it is
> less than completely insane?

This patch replaces the following patch in the mm tree:
mm-munlock-manual-pte-walk-in-fast-path-instead-of-follow_page_mask.patch

Changelog since V2:
  o Split PTE walk to __munlock_pagevec_fill()
  o __munlock_pagevec() does not reinitialize the pagevec anymore
  o Use page_zone_id() for checking if pages are in the same zone (smaller
    overhead than page_zone())

The only small functional change is that previously failing the pte walk would
fall back to follow_page_mask() while continuing with the same partially filled
pagevec. Now, pagevec is munlocked immediately after pte walk fails. This means
that batching might be sometimes less effective, but it the gained simplicity
should be worth it.

--->8---

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/7] Improving munlock() performance for large non-THP areas
  2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
                   ` (6 preceding siblings ...)
  2013-08-19 12:23 ` [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask() Vlastimil Babka
@ 2013-08-19 22:48 ` Andrew Morton
  2013-08-22 11:21   ` Vlastimil Babka
  7 siblings, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2013-08-19 22:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On Mon, 19 Aug 2013 14:23:35 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:

> The goal of this patch series is to improve performance of munlock() of large
> mlocked memory areas on systems without THP. This is motivated by reported very
> long times of crash recovery of processes with such areas, where munlock() can
> take several seconds. See http://lwn.net/Articles/548108/

That was a very nice patchset.  Not bad for a first effort ;)

Thanks, and welcome.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2 0/7] Improving munlock() performance for large non-THP areas
  2013-08-19 22:48 ` [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Andrew Morton
@ 2013-08-22 11:21   ` Vlastimil Babka
  0 siblings, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2013-08-22 11:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jörn Engel, Michel Lespinasse, Hugh Dickins, Rik van Riel,
	Johannes Weiner, Mel Gorman, Michal Hocko, linux-mm

On 08/20/2013 12:48 AM, Andrew Morton wrote:
> On Mon, 19 Aug 2013 14:23:35 +0200 Vlastimil Babka <vbabka@suse.cz> wrote:
> 
>> The goal of this patch series is to improve performance of munlock() of large
>> mlocked memory areas on systems without THP. This is motivated by reported very
>> long times of crash recovery of processes with such areas, where munlock() can
>> take several seconds. See http://lwn.net/Articles/548108/
> 
> That was a very nice patchset.  Not bad for a first effort ;)
> 
> Thanks, and welcome.

Thanks for the quick review and acceptance :)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2013-08-29 13:02 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-19 12:23 [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Vlastimil Babka
2013-08-19 12:23 ` [PATCH v2 1/7] mm: putback_lru_page: remove unnecessary call to page_lru_base_type() Vlastimil Babka
2013-08-19 14:48   ` Mel Gorman
2013-08-19 12:23 ` [PATCH v2 2/7] mm: munlock: remove unnecessary call to lru_add_drain() Vlastimil Babka
2013-08-19 14:48   ` Mel Gorman
2013-08-19 12:23 ` [PATCH v2 3/7] mm: munlock: batch non-THP page isolation and munlock+putback using pagevec Vlastimil Babka
2013-08-19 14:58   ` Mel Gorman
2013-08-19 22:38   ` Andrew Morton
2013-08-22 11:13     ` Vlastimil Babka
2013-08-19 12:23 ` [PATCH v2 4/7] mm: munlock: batch NR_MLOCK zone state updates Vlastimil Babka
2013-08-19 15:01   ` Mel Gorman
2013-08-19 12:23 ` [PATCH v2 5/7] mm: munlock: bypass per-cpu pvec for putback_lru_page Vlastimil Babka
2013-08-19 15:05   ` Mel Gorman
2013-08-19 22:45   ` Andrew Morton
2013-08-22 11:16     ` Vlastimil Babka
2013-08-19 12:23 ` [PATCH v2 6/7] mm: munlock: remove redundant get_page/put_page pair on the fast path Vlastimil Babka
2013-08-19 15:07   ` Mel Gorman
2013-08-19 12:23 ` [PATCH v2 7/7] mm: munlock: manual pte walk in fast path instead of follow_page_mask() Vlastimil Babka
2013-08-19 22:47   ` Andrew Morton
2013-08-22 11:18     ` Vlastimil Babka
2013-08-27 22:24   ` Andrew Morton
2013-08-29 13:02     ` Vlastimil Babka
2013-08-19 22:48 ` [PATCH v2 0/7] Improving munlock() performance for large non-THP areas Andrew Morton
2013-08-22 11:21   ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).