* [PATCH 0 of 3] THP: mremap support and TLB optimization #3
@ 2011-08-06 16:58 aarcange
  2011-08-06 16:58 ` [PATCH 1 of 3] mremap: check for overflow using deltas aarcange
  ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread

From: aarcange @ 2011-08-06 16:58 UTC (permalink / raw)
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

Hello everyone,

here is a third version of the mremap optimizations, following the reviews
on the list. I haven't done the requested tristate cleanup yet, because I'd
rather not call more external functions than needed during mremap.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to
majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org
* [PATCH 1 of 3] mremap: check for overflow using deltas
  2011-08-06 16:58 ` aarcange

From: aarcange @ 2011-08-06 16:58 UTC (permalink / raw)
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

From: Andrea Arcangeli <aarcange@redhat.com>

Using "- 1" relies on old_end being page aligned and on PAGE_SIZE > 1.
Those are reasonable requirements, but the check remains obscure and looks
more like an off-by-one error than an overflow check. I feel this will
improve readability.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -141,9 +141,10 @@ unsigned long move_page_tables(struct vm
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
-		if (next - 1 > old_end)
-			next = old_end;
+		/* even if next overflowed, extent below will be ok */
 		extent = next - old_addr;
+		if (extent > old_end - old_addr)
+			extent = old_end - old_addr;
 		old_pmd = get_old_pmd(vma->vm_mm, old_addr);
 		if (!old_pmd)
 			continue;
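To see why the delta form survives address wrap-around, here is a small
standalone userspace sketch of the clamping logic from move_page_tables().
The fixed-width types and the 2 MiB PMD geometry are illustrative
assumptions (the common x86-64 values), not the kernel's actual definitions:

```c
#include <stdint.h>

/* Illustrative 2 MiB PMD geometry (x86-64 values, assumed here). */
#define PMD_SIZE 0x200000ULL
#define PMD_MASK (~(PMD_SIZE - 1))

/* Sketch of the post-patch extent computation: even if "next" wraps
 * to 0 at the very top of the address space, the unsigned deltas
 * compared below remain meaningful, so no "- 1" trick is needed. */
static uint64_t clamp_extent(uint64_t old_addr, uint64_t old_end)
{
	uint64_t next = (old_addr + PMD_SIZE) & PMD_MASK;
	uint64_t extent = next - old_addr;

	if (extent > old_end - old_addr)
		extent = old_end - old_addr;
	return extent;
}
```

For example, with old_addr = 0xFFFFFFFFFFE00000 and old_end only 0x1FF000
bytes above it, next wraps to 0 yet extent is still correctly clamped to
old_end - old_addr.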
* Re: [PATCH 1 of 3] mremap: check for overflow using deltas
  2011-08-08 8:20 ` Johannes Weiner

From: Johannes Weiner @ 2011-08-08 8:20 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:03PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Using "- 1" relies on old_end being page aligned and on PAGE_SIZE > 1.
> Those are reasonable requirements, but the check remains obscure and looks
> more like an off-by-one error than an overflow check. I feel this will
> improve readability.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>
* Re: [PATCH 1 of 3] mremap: check for overflow using deltas
  2011-08-10 10:48 ` Mel Gorman

From: Mel Gorman @ 2011-08-10 10:48 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:03PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Using "- 1" relies on old_end being page aligned and on PAGE_SIZE > 1.
> Those are reasonable requirements, but the check remains obscure and looks
> more like an off-by-one error than an overflow check. I feel this will
> improve readability.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

--
Mel Gorman
SUSE Labs
* Re: [PATCH 1 of 3] mremap: check for overflow using deltas
  2011-08-11 0:14 ` Rik van Riel

From: Rik van Riel @ 2011-08-11 0:14 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Hugh Dickins

On 08/06/2011 12:58 PM, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> Using "- 1" relies on old_end being page aligned and on PAGE_SIZE > 1.
> Those are reasonable requirements, but the check remains obscure and looks
> more like an off-by-one error than an overflow check. I feel this will
> improve readability.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed
* [PATCH 2 of 3] mremap: avoid sending one IPI per page
  2011-08-06 16:58 ` aarcange

From: aarcange @ 2011-08-06 16:58 UTC (permalink / raw)
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

From: Andrea Arcangeli <aarcange@redhat.com>

This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.

The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly, but this is not going to fundamentally change things. It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range, which sounds impractical and
flaky). If users want secondary MMUs not to lose access to a large region
under mremap, they should reduce the mremap size accordingly in userland
and run multiple calls. Overall this will run faster, so it's actually
going to reduce the time the region is under mremap for the primary MMU,
which should provide a net benefit to apps.

For KVM this is a noop, because guest physical memory is never mremapped;
there's just no point in ever moving it while the guest runs. One target
of this optimization is JVM GC (so it is unrelated to the mmu notifier
logic).

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -80,11 +80,7 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;
 
-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -111,7 +107,7 @@ static void move_ptes(struct vm_area_str
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
@@ -123,7 +119,6 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }
 
 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
@@ -134,10 +129,13 @@ unsigned long move_page_tables(struct vm
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
+	bool need_flush = false;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -158,7 +156,12 @@ unsigned long move_page_tables(struct vm
 			extent = LATENCY_LIMIT;
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
+		need_flush = true;
 	}
+	if (likely(need_flush))
+		flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);
 
 	return len + old_addr - old_end;	/* how much done */
 }
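The batching idea in this patch is independent of the kernel: clear the
entries locally, remember that something changed, and pay for one ranged
flush at the end instead of one flush per entry. A minimal userspace
sketch of the pattern, where flush_range() merely counts invocations
(standing in for the IPIs a ptep_clear_flush()-per-page loop would
trigger; all names here are illustrative, not kernel APIs):

```c
#include <stdbool.h>
#include <stddef.h>

static int flush_count;

/* Stand-in for an expensive ranged flush (flush_tlb_range() analogue). */
static void flush_range(size_t start, size_t end)
{
	(void)start;
	(void)end;
	flush_count++;
}

/* Pre-patch shape: one flush per moved entry (ptep_clear_flush()). */
static void move_entries_eager(int *src, int *dst, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		dst[i] = src[i];
		src[i] = 0;
		flush_range(i, i + 1);
	}
}

/* Post-patch shape: clear locally (ptep_get_and_clear() analogue),
 * then issue a single ranged flush once anything was moved. */
static void move_entries_batched(int *src, int *dst, size_t n)
{
	bool need_flush = false;

	for (size_t i = 0; i < n; i++) {
		dst[i] = src[i];
		src[i] = 0;
		need_flush = true;
	}
	if (need_flush)
		flush_range(0, n);
}
```

For 64 entries the eager variant pays 64 flushes and the batched variant
exactly one, which is the whole point of the patch; the trade-off is that
stale translations may survive until the final flush, hence the enlarged
invalidate_range_start/end window discussed above.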
* Re: [PATCH 2 of 3] mremap: avoid sending one IPI per page
  2011-08-08 8:23 ` Johannes Weiner

From: Johannes Weiner @ 2011-08-08 8:23 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:04PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
> flush_tlb_range() at the end of the loop, to avoid sending one IPI for
> each page.
>
> The mmu_notifier_invalidate_range_start/end section is enlarged
> accordingly, but this is not going to fundamentally change things. It was
> more by accident that the region under mremap was for the most part still
> available for secondary MMUs: the primary MMU was never allowed to
> reliably access that region for the duration of the mremap (modulo
> trapping SIGSEGV on the old address range, which sounds impractical and
> flaky). If users want secondary MMUs not to lose access to a large region
> under mremap, they should reduce the mremap size accordingly in userland
> and run multiple calls. Overall this will run faster, so it's actually
> going to reduce the time the region is under mremap for the primary MMU,
> which should provide a net benefit to apps.
>
> For KVM this is a noop, because guest physical memory is never mremapped;
> there's just no point in ever moving it while the guest runs. One target
> of this optimization is JVM GC (so it is unrelated to the mmu notifier
> logic).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>
* Re: [PATCH 2 of 3] mremap: avoid sending one IPI per page
  2011-08-10 10:55 ` Mel Gorman

From: Mel Gorman @ 2011-08-10 10:55 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:04PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
> flush_tlb_range() at the end of the loop, to avoid sending one IPI for
> each page.
>
> The mmu_notifier_invalidate_range_start/end section is enlarged
> accordingly, but this is not going to fundamentally change things. It was
> more by accident that the region under mremap was for the most part still
> available for secondary MMUs: the primary MMU was never allowed to
> reliably access that region for the duration of the mremap (modulo
> trapping SIGSEGV on the old address range, which sounds impractical and
> flaky). If users want secondary MMUs not to lose access to a large region
> under mremap, they should reduce the mremap size accordingly in userland
> and run multiple calls. Overall this will run faster, so it's actually
> going to reduce the time the region is under mremap for the primary MMU,
> which should provide a net benefit to apps.
>
> For KVM this is a noop, because guest physical memory is never mremapped;
> there's just no point in ever moving it while the guest runs. One target
> of this optimization is JVM GC (so it is unrelated to the mmu notifier
> logic).
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Mel Gorman <mgorman@suse.de>

--
Mel Gorman
SUSE Labs
* Re: [PATCH 2 of 3] mremap: avoid sending one IPI per page
  2011-08-11 0:16 ` Rik van Riel

From: Rik van Riel @ 2011-08-11 0:16 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Hugh Dickins

On 08/06/2011 12:58 PM, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
> flush_tlb_range() at the end of the loop, to avoid sending one IPI for
> each page.
>
> The mmu_notifier_invalidate_range_start/end section is enlarged
> accordingly, but this is not going to fundamentally change things. It was
> more by accident that the region under mremap was for the most part still
> available for secondary MMUs: the primary MMU was never allowed to
> reliably access that region for the duration of the mremap (modulo
> trapping SIGSEGV on the old address range, which sounds impractical and
> flaky). If users want secondary MMUs not to lose access to a large region
> under mremap

Userspace programs do not get reliable access to memory that is
currently being mremapped. This patch does not change that situation
in the least.

> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed
* [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-06 16:58 ` aarcange

From: aarcange @ 2011-08-06 16:58 UTC (permalink / raw)
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

From: Andrea Arcangeli <aarcange@redhat.com>

This adds THP support to mremap (it decreases the number of
split_huge_page() calls).

Here are also some benchmarks with a proggy like this:

===
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define SIZE (5UL*1024*1024*1024)

int main()
{
	static struct timeval oldstamp, newstamp;
	long diffsec;
	char *p, *p2, *p3, *p4;
	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
		perror("memalign"), exit(1);
	if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
		perror("memalign"), exit(1);

	memset(p, 0xff, SIZE);
	memset(p2, 0xff, SIZE);
	memset(p3, 0x77, 4096);
	gettimeofday(&oldstamp, NULL);
	p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
	gettimeofday(&newstamp, NULL);
	diffsec = newstamp.tv_sec - oldstamp.tv_sec;
	diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
	printf("usec %ld\n", diffsec);
	if (p == MAP_FAILED || p4 != p3)
	//if (p == MAP_FAILED)
		perror("mremap"), exit(1);
	if (memcmp(p4, p2, SIZE))
		printf("mremap bug\n"), exit(1);
	printf("ok\n");

	return 0;
}
===

THP on

 Performance counter stats for './largepage13' (3 runs):

        69195836  dTLB-loads         ( +-  3.546% )  (scaled from 50.30%)
           60708  dTLB-load-misses   ( +- 11.776% )  (scaled from 52.62%)
       676266476  dTLB-stores        ( +-  5.654% )  (scaled from 69.54%)
           29856  dTLB-store-misses  ( +-  4.081% )  (scaled from 89.22%)
      1055848782  iTLB-loads         ( +-  4.526% )  (scaled from 80.18%)
            8689  iTLB-load-misses   ( +-  2.987% )  (scaled from 58.20%)

     7.314454164  seconds time elapsed   ( +-  0.023% )

THP off

 Performance counter stats for './largepage13' (3 runs):

      1967379311  dTLB-loads         ( +-  0.506% )  (scaled from 60.59%)
         9238687  dTLB-load-misses   ( +- 22.547% )  (scaled from 61.87%)
      2014239444  dTLB-stores        ( +-  0.692% )  (scaled from 60.40%)
         3312335  dTLB-store-misses  ( +-  7.304% )  (scaled from 67.60%)
      6764372065  iTLB-loads         ( +-  0.925% )  (scaled from 79.00%)
            8202  iTLB-load-misses   ( +-  0.475% )  (scaled from 70.55%)

     9.693655243  seconds time elapsed   ( +-  0.069% )

grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0

thp_split 0 confirms no thp split despite plenty of hugepages allocated.

Measuring only the mremap time (so excluding the 3 long memsets and the
final 10GB memcmp):

THP on

usec 14824
usec 14862
usec 14859

THP off

usec 256416
usec 255981
usec 255847

With an older kernel without the mremap optimizations (the below patch
optimizes the non-THP version too):

THP on

usec 392107
usec 390237
usec 404124

THP off

usec 444294
usec 445237
usec 445820

I guess a threaded program, which sends more IPIs on large SMP systems,
would show an even larger difference.

All debug options are off except DEBUG_VM, to avoid skewing the results.

The only catch is that for a native 2M mremap, like the one above, both
the source and destination addresses must be 2M aligned, or the huge pmd
can't be moved without a split; but that is a hardware limitation.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,11 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			    unsigned long addr, unsigned long end,
 			    unsigned char *vec);
+extern int move_huge_pmd(struct vm_area_struct *vma,
+			 struct vm_area_struct *new_vma,
+			 unsigned long old_addr,
+			 unsigned long new_addr, unsigned long old_end,
+			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			   unsigned long addr, pgprot_t newprot);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1054,6 +1054,52 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }
 
+int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
+		  unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd)
+{
+	int ret = 0;
+	pmd_t pmd;
+
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~HPAGE_PMD_MASK) ||
+	    (new_addr & ~HPAGE_PMD_MASK) ||
+	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
+	    new_vma->vm_flags & VM_NOHUGEPAGE)
+		goto out;
+
+	/*
+	 * The destination pmd shouldn't be established, free_pgtables()
+	 * should have released it.
+	 */
+	if (!pmd_none(*new_pmd)) {
+		WARN_ON(1);
+		VM_BUG_ON(pmd_trans_huge(*new_pmd));
+		goto out;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*old_pmd))) {
+		if (pmd_trans_splitting(*old_pmd)) {
+			spin_unlock(&mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, old_pmd);
+			ret = -1;
+		} else {
+			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+			VM_BUG_ON(!pmd_none(*new_pmd));
+			set_pmd_at(mm, new_addr, new_pmd, pmd);
+			spin_unlock(&mm->page_table_lock);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		    unsigned long addr, pgprot_t newprot)
 {
diff --git a/mm/mremap.c b/mm/mremap.c
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,8 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;
 
 	pmd = pmd_offset(pud, addr);
-	split_huge_page_pmd(mm, pmd);
-	if (pmd_none_or_clear_bad(pmd))
+	if (pmd_none(*pmd))
 		return NULL;
 
 	return pmd;
@@ -65,8 +64,6 @@ static pmd_t *alloc_new_pmd(struct mm_st
 		return NULL;
 
 	VM_BUG_ON(pmd_trans_huge(*pmd));
-	if (pmd_none(*pmd) && __pte_alloc(mm, vma, pmd, addr))
-		return NULL;
 
 	return pmd;
 }
@@ -149,6 +146,23 @@ unsigned long move_page_tables(struct vm
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
+		if (pmd_trans_huge(*old_pmd)) {
+			int err = 0;
+			if (extent == HPAGE_PMD_SIZE)
+				err = move_huge_pmd(vma, new_vma, old_addr,
+						    new_addr, old_end,
+						    old_pmd, new_pmd);
+			if (err > 0) {
+				need_flush = true;
+				continue;
+			} else if (!err)
+				split_huge_page_pmd(vma->vm_mm, old_pmd);
+			VM_BUG_ON(pmd_trans_huge(*old_pmd));
+		}
+		if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
+						      new_pmd, new_addr))
+			break;
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;
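The 2M-alignment caveat at the end of the message can be checked in
isolation. Here is a hedged sketch of the eligibility test move_huge_pmd()
applies before relocating a huge pmd without a split; the function name
can_move_huge_pmd() and the 2 MiB HPAGE_PMD_SIZE (the x86-64 value) are
assumptions for illustration, not kernel code:

```c
/* Illustrative 2 MiB huge-page geometry (x86-64 value, assumed). */
#define HPAGE_PMD_SIZE 0x200000UL
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))

/* A huge pmd can only be moved whole when both source and destination
 * are HPAGE_PMD_SIZE aligned and a full huge page still fits before
 * old_end; otherwise the kernel falls back to split_huge_page_pmd(). */
static int can_move_huge_pmd(unsigned long old_addr,
			     unsigned long new_addr,
			     unsigned long old_end)
{
	if ((old_addr & ~HPAGE_PMD_MASK) ||
	    (new_addr & ~HPAGE_PMD_MASK) ||
	    (old_addr + HPAGE_PMD_SIZE) > old_end)
		return 0;
	return 1;
}
```

This is why the benchmark above aligns all three buffers with
posix_memalign(..., 2*1024*1024, ...): a misaligned source or destination
would force a split and lose the native 2M move.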
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-08 8:25 ` Johannes Weiner

From: Johannes Weiner @ 2011-08-08 8:25 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:05PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (it decreases the number of
> split_huge_page() calls).
>
> Here are also some benchmarks with a proggy like this:
>
> ...
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Johannes Weiner <jweiner@redhat.com>
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-10 11:01 ` Mel Gorman

From: Mel Gorman @ 2011-08-10 11:01 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Johannes Weiner, Rik van Riel, Hugh Dickins

On Sat, Aug 06, 2011 at 06:58:05PM +0200, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (it decreases the number of
> split_huge_page() calls).
>
> Here are also some benchmarks with a proggy like this:
>
> ...
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

It's a pity about the tristate, but if the alternative really is that bad
and results in unnecessary calls, it can be lived with, provided a comment
is stuck above move_huge_pmd() explaining the return values. Otherwise

Acked-by: Mel Gorman <mgorman@suse.de>

--
Mel Gorman
SUSE Labs
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-06 16:58 ` [PATCH 3 of 3] thp: mremap support and TLB optimization aarcange
  2011-08-08  8:25   ` Johannes Weiner
  2011-08-10 11:01   ` Mel Gorman
@ 2011-08-11  0:26   ` Rik van Riel
  2011-08-23 21:14   ` Andrew Morton
  3 siblings, 0 replies; 16+ messages in thread
From: Rik van Riel @ 2011-08-11 0:26 UTC (permalink / raw)
To: aarcange; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Hugh Dickins

On 08/06/2011 12:58 PM, aarcange@redhat.com wrote:
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (decreases the number of split_huge_page
> called).
>
> Here are also some benchmarks with a proggy like this:

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-06 16:58 ` [PATCH 3 of 3] thp: mremap support and TLB optimization aarcange
                     ` (2 preceding siblings ...)
  2011-08-11  0:26   ` Rik van Riel
@ 2011-08-23 21:14   ` Andrew Morton
  2011-08-23 22:13     ` Andrea Arcangeli
  3 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2011-08-23 21:14 UTC (permalink / raw)
To: aarcange
Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

On Sat, 06 Aug 2011 18:58:05 +0200 aarcange@redhat.com wrote:

> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (decreases the number of split_huge_page
> called).
>
> ...

I have some nitpicking.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1054,6 +1054,52 @@ int mincore_huge_pmd(struct vm_area_stru
>  	return ret;
>  }
>  
> +int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
> +		  unsigned long old_addr,
> +		  unsigned long new_addr, unsigned long old_end,
> +		  pmd_t *old_pmd, pmd_t *new_pmd)
> +{
> +	int ret = 0;
> +	pmd_t pmd;
> +
> +	struct mm_struct *mm = vma->vm_mm;
> +
> +	if ((old_addr & ~HPAGE_PMD_MASK) ||
> +	    (new_addr & ~HPAGE_PMD_MASK) ||
> +	    (old_addr + HPAGE_PMD_SIZE) > old_end ||

Can (old_addr + HPAGE_PMD_SIZE) wrap past zero?

> +	    new_vma->vm_flags & VM_NOHUGEPAGE)

This should be parenthesized, IMO, like the other sub-expressions.

> +		goto out;
> +
> +	/*
> +	 * The destination pmd shouldn't be established, free_pgtables()
> +	 * should have release it.
> +	 */
> +	if (!pmd_none(*new_pmd)) {
> +		WARN_ON(1);

We can use the WARN_ON return value trick here.

> +		VM_BUG_ON(pmd_trans_huge(*new_pmd));
> +		goto out;
> +	}
> +
> +	spin_lock(&mm->page_table_lock);
> +	if (likely(pmd_trans_huge(*old_pmd))) {
> +		if (pmd_trans_splitting(*old_pmd)) {
> +			spin_unlock(&mm->page_table_lock);
> +			wait_split_huge_page(vma->anon_vma, old_pmd);
> +			ret = -1;
> +		} else {
> +			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
> +			VM_BUG_ON(!pmd_none(*new_pmd));
> +			set_pmd_at(mm, new_addr, new_pmd, pmd);
> +			spin_unlock(&mm->page_table_lock);
> +			ret = 1;
> +		}
> +	} else
> +		spin_unlock(&mm->page_table_lock);

If the "if" part has braces, it's conventional to also add them to the
"else" part.

> +out:
> +	return ret;
> +}
> +

Result:

diff -puN mm/huge_memory.c~thp-mremap-support-and-tlb-optimization-fix mm/huge_memory.c
--- a/mm/huge_memory.c~thp-mremap-support-and-tlb-optimization-fix
+++ a/mm/huge_memory.c
@@ -1065,15 +1065,14 @@ int move_huge_pmd(struct vm_area_struct
 	if ((old_addr & ~HPAGE_PMD_MASK) ||
 	    (new_addr & ~HPAGE_PMD_MASK) ||
 	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
-	    new_vma->vm_flags & VM_NOHUGEPAGE)
+	    (new_vma->vm_flags & VM_NOHUGEPAGE))
 		goto out;
 
 	/*
 	 * The destination pmd shouldn't be established, free_pgtables()
 	 * should have release it.
 	 */
-	if (!pmd_none(*new_pmd)) {
-		WARN_ON(1);
+	if (!WARN_ON(pmd_none(*new_pmd))) {
 		VM_BUG_ON(pmd_trans_huge(*new_pmd));
 		goto out;
 	}
@@ -1091,9 +1090,9 @@ int move_huge_pmd(struct vm_area_struct
 			spin_unlock(&mm->page_table_lock);
 			ret = 1;
 		}
-	} else
+	} else {
 		spin_unlock(&mm->page_table_lock);
-
+	}
 out:
 	return ret;
 }
--- a/mm/mremap.c~thp-mremap-support-and-tlb-optimization-fix
+++ a/mm/mremap.c
@@ -155,13 +155,13 @@ unsigned long move_page_tables(struct vm
 		if (err > 0) {
 			need_flush = true;
 			continue;
-		} else if (!err)
+		} else if (!err) {
 			split_huge_page_pmd(vma->vm_mm, old_pmd);
+		}
 		VM_BUG_ON(pmd_trans_huge(*old_pmd));
 	}
 	if (pmd_none(*new_pmd) && __pte_alloc(new_vma->vm_mm, new_vma,
-					      new_pmd,
-					      new_addr))
+					      new_pmd, new_addr))
 		break;
 	next = (new_addr + PMD_SIZE) & PMD_MASK;
 	if (extent > next - new_addr)
_
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-23 21:14 ` Andrew Morton
@ 2011-08-23 22:13   ` Andrea Arcangeli
  2011-08-23 22:25     ` Andrea Arcangeli
  0 siblings, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2011-08-23 22:13 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

Hi Andrew,

On Tue, Aug 23, 2011 at 02:14:45PM -0700, Andrew Morton wrote:
> > +	if ((old_addr & ~HPAGE_PMD_MASK) ||
> > +	    (new_addr & ~HPAGE_PMD_MASK) ||
> > +	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
>
> Can (old_addr + HPAGE_PMD_SIZE) wrap past zero?

Good question. old_addr is hpage aligned so to overflow it'd need to
be exactly at address 0-HPAGE_PMD_SIZE. Can any userland map an
address there? I doubt and surely not x86* or sparc (currently THP is
only enabled on x86 anyway so answer is it can't wrap past zero). But
probably we should add a wrap check for other archs in the future
unless we have a real guarantee from all archs to avoid the check. I
only can guarantee about x86*.

> -	if (!pmd_none(*new_pmd)) {
> -		WARN_ON(1);
> +	if (!WARN_ON(pmd_none(*new_pmd))) {

WARN_ON(!pmd_none

Thanks for the cleanups!
* Re: [PATCH 3 of 3] thp: mremap support and TLB optimization
  2011-08-23 22:13 ` Andrea Arcangeli
@ 2011-08-23 22:25   ` Andrea Arcangeli
  0 siblings, 0 replies; 16+ messages in thread
From: Andrea Arcangeli @ 2011-08-23 22:25 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel, Hugh Dickins

On Wed, Aug 24, 2011 at 12:13:21AM +0200, Andrea Arcangeli wrote:
> Hi Andrew,
>
> On Tue, Aug 23, 2011 at 02:14:45PM -0700, Andrew Morton wrote:
> > > +	if ((old_addr & ~HPAGE_PMD_MASK) ||
> > > +	    (new_addr & ~HPAGE_PMD_MASK) ||
> > > +	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
> >
> > Can (old_addr + HPAGE_PMD_SIZE) wrap past zero?
>
> Good question. old_addr is hpage aligned so to overflow it'd need to
> be exactly at address 0-HPAGE_PMD_SIZE. Can any userland map an
> address there? I doubt and surely not x86* or sparc (currently THP is
> only enabled on x86 anyway so answer is it can't wrap past zero). But
> probably we should add a wrap check for other archs in the future
> unless we have a real guarantee from all archs to avoid the check. I
> only can guarantee about x86*.

I suggest to incorporate this into
thp-mremap-support-and-tlb-optimization-fix.patch (it's only for the
future but it's zero cost so it's certainly worth it...).

Untested...

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1066,7 +1066,7 @@ int move_huge_pmd(struct vm_area_struct
 	if ((old_addr & ~HPAGE_PMD_MASK) ||
 	    (new_addr & ~HPAGE_PMD_MASK) ||
-	    (old_addr + HPAGE_PMD_SIZE) > old_end ||
+	    old_end - old_addr < HPAGE_PMD_SIZE ||
 	    (new_vma->vm_flags & VM_NOHUGEPAGE))
 		goto out;
end of thread, other threads:[~2011-08-23 22:26 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --
2011-08-06 16:58 [PATCH 0 of 3] THP: mremap support and TLB optimization #3 aarcange
2011-08-06 16:58 ` [PATCH 1 of 3] mremap: check for overflow using deltas aarcange
2011-08-08  8:20   ` Johannes Weiner
2011-08-10 10:48   ` Mel Gorman
2011-08-11  0:14   ` Rik van Riel
2011-08-06 16:58 ` [PATCH 2 of 3] mremap: avoid sending one IPI per page aarcange
2011-08-08  8:23   ` Johannes Weiner
2011-08-10 10:55   ` Mel Gorman
2011-08-11  0:16   ` Rik van Riel
2011-08-06 16:58 ` [PATCH 3 of 3] thp: mremap support and TLB optimization aarcange
2011-08-08  8:25   ` Johannes Weiner
2011-08-10 11:01   ` Mel Gorman
2011-08-11  0:26   ` Rik van Riel
2011-08-23 21:14   ` Andrew Morton
2011-08-23 22:13     ` Andrea Arcangeli
2011-08-23 22:25       ` Andrea Arcangeli
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).