* [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-11  2:04 UTC
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel

Hello everyone,

I've been wondering why mremap is sending one IPI for each page that
it moves. I tried to remove that, so we send one IPI per vma/syscall
(not one per pte/page). I also added native THP support that avoids
calling split_huge_page unconditionally when both the source and
destination alignment allow a pmd_trans_huge to be preserved (the
mremap extension and truncation already preserved existing hugepages,
but the move into a new place didn't yet). If the destination
alignment isn't ok, split_huge_page is unavoidable, but that is a
userland/hardware limitation, not really something we can optimize
further in the kernel.

I've no real numbers yet (volanomark results are mostly unchanged:
it's a tiny bit faster, but that may be measurement error, and it
doesn't seem to call mremap enough; the thp_split number in
/proc/vmstat does seem to go down close to zero though, so maybe
other JIT workloads will benefit?). In the meantime I'm posting this
for review.

I'm not entirely sure this is safe at this point (I mean the tlb part
especially). Also note that if any arch needs the tlb flush after
ptep_get_and_clear, move_pte can provide it. The huge_memory.c part
has no move_pmd equivalent because the only arch that needs move_pte
(sparc64) doesn't support THP yet (I've no idea if sparc64 is one of
the candidates for future THP-capable archs; arm/ppcembedded should
make it eventually).

I applied this to my aa.git tree and I'm running it on all my systems
with no adverse effects for more than a day, so if you want to test,
the usual procedure works:

first: git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
or first: git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
later: git fetch; git checkout -f origin/master

===
Subject: thp: mremap support and TLB optimization

From: Andrea Arcangeli <aarcange@redhat.com>

This adds THP support to mremap (decreases the number of
split_huge_page calls).

This also replaces ptep_clear_flush with ptep_get_and_clear plus a
final flush_tlb_range, to send a single tlb flush IPI instead of one
IPI for each page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    3 +++
 mm/huge_memory.c        |   44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mremap.c             |   31 ++++++++++++++++++++++++-------
 3 files changed, 71 insertions(+), 7 deletions(-)

--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,9 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			unsigned char *vec);
+extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+			 unsigned long new_addr, unsigned long old_end,
+			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, pgprot_t newprot);

--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 	pmd = pmd_offset(pud, addr);
 	split_huge_page_pmd(mm, pmd);
-	if (pmd_none_or_clear_bad(pmd))
+	if (pmd_none(*pmd))
 		return NULL;

 	return pmd;
@@ -80,11 +80,7 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;

-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -112,7 +108,7 @@ static void move_ptes(struct vm_area_str
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
@@ -124,7 +120,6 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }

 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
@@ -139,6 +134,8 @@ unsigned long move_page_tables(struct vm
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);

+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -151,6 +148,23 @@ unsigned long move_page_tables(struct vm
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
+		if (pmd_trans_huge(*old_pmd)) {
+			int err = move_huge_pmd(vma, old_addr, new_addr,
+						old_end, old_pmd, new_pmd);
+			if (err > 0) {
+				old_addr += HPAGE_PMD_SIZE;
+				new_addr += HPAGE_PMD_SIZE;
+				continue;
+			}
+		}
+		/*
+		 * split_huge_page_pmd() must run outside the
+		 * pmd_trans_huge() block above because that check
+		 * is racy. split_huge_page_pmd() will recheck
+		 * pmd_trans_huge() but in a not racy way under the
+		 * page_table_lock.
+		 */
+		split_huge_page_pmd(vma->vm_mm, old_pmd);
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;
@@ -159,6 +173,9 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
 	}
+	flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);

 	return len + old_addr - old_end;	/* how much done */
 }

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1048,6 +1048,50 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }

+int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd)
+{
+	int ret = 0;
+	pmd_t pmd;
+
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~HPAGE_PMD_MASK) ||
+	    (new_addr & ~HPAGE_PMD_MASK) ||
+	    (old_addr + HPAGE_PMD_SIZE) > old_end)
+		goto out;
+
+	/* if the new area is all for our destination it must be unmapped */
+	VM_BUG_ON(!pmd_none(*new_pmd));
+	/* mostly to remember this locking isn't enough with filebacked vma */
+	VM_BUG_ON(vma->vm_file);
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*old_pmd))) {
+		if (pmd_trans_splitting(*old_pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			/*
+			 * It's not mandatory to wait here as the
+			 * caller will run split_huge_page_pmd(), but
+			 * this is faster and it will avoid the caller
+			 * to invoke __split_huge_page_pmd() (and to
+			 * take the page_table_lock again).
+			 */
+			wait_split_huge_page(vma->anon_vma, old_pmd);
+		} else {
+			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+			set_pmd_at(mm, new_addr, new_pmd, pmd);
+			spin_unlock(&mm->page_table_lock);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
		    unsigned long addr, pgprot_t newprot)
 {
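A minimal userspace sketch of the alignment point above, for illustration only:
it is not part of the patch, and MADV_HUGEPAGE, the 2MB hugepage size and the
thp_split counter are assumptions about an x86-64 kernel with THP enabled. Run
it once with no arguments and once with any argument, comparing thp_split in
/proc/vmstat before and after each run: only the unaligned destination should
force splits, since the huge pmds can be preserved when both source and
destination are 2MB-aligned.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE (2UL * 1024 * 1024)

/* Reserve extra address space, then return a 2MB-aligned address inside it. */
static void *aligned_reservation(size_t size)
{
	char *p = mmap(NULL, size + HPAGE, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	return (void *)(((unsigned long)p + HPAGE - 1) & ~(HPAGE - 1));
}

int main(int argc, char **argv)
{
	size_t size = 8 * HPAGE;
	int misalign = argc > 1;	/* any argument: use a non-2MB-aligned target */
	char *src, *dst;

	(void)argv;
	src = aligned_reservation(size);
	dst = aligned_reservation(size + HPAGE);
	if (!src || !dst)
		return 1;

	mprotect(src, size, PROT_READ | PROT_WRITE);
	madvise(src, size, MADV_HUGEPAGE);
	memset(src, 1, size);		/* fault in, hopefully as 2MB pages */

	if (misalign)
		dst += 4096;		/* 4k offset: every huge pmd must be split */

	if (mremap(src, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved %zu bytes to %p (%s destination)\n", size, (void *)dst,
	       misalign ? "unaligned" : "2MB-aligned");
	return 0;
}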
* Re: [PATCH] thp: mremap support and TLB optimization

From: Rik van Riel @ 2011-03-11 15:16 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner

On 03/10/2011 09:04 PM, Andrea Arcangeli wrote:
> Hello everyone,
>
> I've been wondering why mremap is sending one IPI for each page that
> it moves. I tried to remove that so we send an IPI for each
> vma/syscall (not for each pte/page). I also added native THP support
> without calling split_huge_page unconditionally if both the source and
> destination alignment allows a pmd_trans_huge to be preserved (the
> mremap extension and truncation already preserved existing hugepages
> but the move into new place didn't yet). If the destination alignment
> isn't ok, split_huge_page is unavoidable but that is an
> userland/hardware limitation, not really something we can optimize
> further in the kernel.
>
> I've no real numbers yet (volanomark results are mostly unchanged,
> it's a tinybit faster but it may be measurement error, and it doesn't
> seem to call mremap enough, but the thp_split number in /proc/vmstat
> seem to go down close to zero, maybe other JIT workloads will
> benefit?).

Reviewed-by: Rik van Riel <riel@redhat.com>

--
All rights reversed
* Re: [PATCH] thp: mremap support and TLB optimization

From: Hugh Dickins @ 2011-03-11 19:44 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I've been wondering why mremap is sending one IPI for each page that
> it moves. I tried to remove that so we send an IPI for each
> vma/syscall (not for each pte/page).

(It wouldn't usually have been sending an IPI for each page, only if
the mm were active on another cpu, but...)

That looks like a good optimization to me: I can't think of a good
reason for it to be the way it was, just it started out like that and
none of us ever thought to change it before. Plus it's always nice to
see the flush_tlb_range() afterwards complementing the
flush_cache_range() beforehand, as you now have in move_page_tables().

And don't forget that move_page_tables() is also used by exec's
shift_arg_pages(): no IPI saving there, but it should be more
efficient when exec'ing with many arguments.

Hugh
* Re: [PATCH] thp: mremap support and TLB optimization

From: Hugh Dickins @ 2011-03-11 20:25 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Fri, Mar 11, 2011 at 11:44 AM, Hugh Dickins <hughd@google.com> wrote:
> On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> I've been wondering why mremap is sending one IPI for each page that
>> it moves. I tried to remove that so we send an IPI for each
>> vma/syscall (not for each pte/page).
>
> (It wouldn't usually have been sending an IPI for each page, only if
> the mm were active on another cpu, but...)
>
> That looks like a good optimization to me: I can't think of a good
> reason for it to be the way it was, just it started out like that and
> none of us ever thought to change it before. Plus it's always nice to
> see the flush_tlb_range() afterwards complementing the
> flush_cache_range() beforehand, as you now have in move_page_tables().
>
> And don't forget that move_page_tables() is also used by exec's
> shift_arg_pages(): no IPI saving there, but it should be more
> efficient when exec'ing with many arguments.

Perhaps I should qualify that answer: although I still think it's the
right change to make (it matches mprotect, for example), and an
optimization in many cases, it will be a pessimization for anyone who
mremap moves unpopulated areas (I doubt that's common), and for anyone
who moves around single page areas (on x86 and probably some others).
But the exec args case has, I think, few useful tlb entries to lose
from the mm-wide tlb flush.

flush_tlb_range() ought to special case small areas, doing at most one
IPI, but up to some number of flush_tlb_one()s; but that would
certainly have to be another patch.

Hugh
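For illustration, a rough sketch of the special case Hugh suggests in the last
paragraph. This is not existing kernel code: the ceiling is arbitrary and the
helper names (__flush_tlb_one, local_flush_tlb, smp_call_function_many,
mm_cpumask) merely follow x86/generic conventions of the time. The point is
the policy: at most one IPI, with per-page invalidation on each CPU when the
range is small.

#define FLUSH_ONE_CEILING	33	/* arbitrary threshold, needs benchmarking */

struct flush_range_info {
	struct mm_struct *mm;
	unsigned long start, end;
};

/* Runs on each CPU in mm_cpumask(): purely local work, no further IPIs. */
static void do_flush_range_local(void *arg)
{
	struct flush_range_info *f = arg;
	unsigned long addr;

	/* A real implementation would also check this CPU is running f->mm. */
	if ((f->end - f->start) >> PAGE_SHIFT > FLUSH_ONE_CEILING) {
		local_flush_tlb();			/* full local flush */
	} else {
		for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
			__flush_tlb_one(addr);		/* invlpg per page */
	}
}

static void flush_tlb_range_sketch(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	struct flush_range_info f = {
		.mm	= vma->vm_mm,
		.start	= start,
		.end	= end,
	};

	preempt_disable();
	do_flush_range_local(&f);			/* this CPU */
	smp_call_function_many(mm_cpumask(f.mm),	/* one IPI for the rest */
			       do_flush_range_local, &f, 1);
	preempt_enable();
}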
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-12  4:28 UTC
To: Hugh Dickins; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Fri, Mar 11, 2011 at 12:25:42PM -0800, Hugh Dickins wrote:
> Perhaps I should qualify that answer: although I still think it's the
> right change to make (it matches mprotect, for example), and an
> optimization in many cases, it will be a pessimization for anyone who
> mremap moves unpopulated areas (I doubt that's common), and for anyone
> who moves around single page areas (on x86 and probably some others).
> But the exec args case has, I think, few useful tlb entries to lose
> from the mm-wide tlb flush.

I didn't consider the pessimization of totally unmapped areas a
real-life possibility. At least the mmu notifier isn't pessimized if
the pmd wasn't none but the ptes were none; only the TLB flush is
really pessimized in that case (if all old ptes are none, regardless
of the pmd). But the range TLB flush is really easy to optimize for
unmapped areas if we want: just skip the flush_tlb_range if all old
ptes were none and we actually changed nothing, right? Fixing the mmu
notifier the same way isn't possible, but that's not a concern, and
it will surely be fine for it to stay in move_page_tables.

So do we want one more branch to avoid one IPI if mremap runs on an
unmapped area? That's ok with me if it's a real-life possibility. At
the moment I think any app doing that is pretty stupid and shouldn't
call mremap in the first place; it should have used mmap(MAP_FIXED)
or a bigger mmap size in the first place... If we add a branch for
that case, maybe we should also printk if we detect it, in addition
to skipping the tlb flush.

> flush_tlb_range() ought to special case small areas, doing at most one
> IPI, but up to some number of flush_tlb_one()s; but that would
> certainly have to be another patch.

That's probably a good tradeoff. Even better would be if x86 were
extended to allow range flushes, so we don't have to do guesswork in
software.
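For illustration, a pte-granularity variant of the idea Andrea describes above
(skip the flush when all old ptes were none) might look roughly like the sketch
below. It is not from the patch, it omits the file-backed i_mmap locking and
lazy-mmu hooks for brevity, and the v2 patch later in the thread takes the
simpler route of a pmd-granularity need_flush flag instead.

/* Sketch only: move_ptes() reports whether it moved any live pte, so the
 * caller can skip flush_tlb_range() when nothing was actually moved. */
static bool move_ptes_sketch(struct vm_area_struct *vma, pmd_t *old_pmd,
			     unsigned long old_addr, unsigned long old_end,
			     struct vm_area_struct *new_vma, pmd_t *new_pmd,
			     unsigned long new_addr)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
	spinlock_t *old_ptl, *new_ptl;
	bool moved = false;

	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	new_pte = pte_offset_map(new_pmd, new_addr);
	new_ptl = pte_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		set_pte_at(mm, new_addr, new_pte, pte);
		moved = true;		/* at least one live pte was moved */
	}

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	pte_unmap(new_pte - 1);
	pte_unmap_unlock(old_pte - 1, old_ptl);
	/* caller: only flush_tlb_range() if some pmd ever returned true */
	return moved;
}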
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-12  4:02 UTC
To: Hugh Dickins; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

Hi Hugh,

On Fri, Mar 11, 2011 at 11:44:03AM -0800, Hugh Dickins wrote:
> On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > I've been wondering why mremap is sending one IPI for each page that
> > it moves. I tried to remove that so we send an IPI for each
> > vma/syscall (not for each pte/page).
>
> (It wouldn't usually have been sending an IPI for each page, only if
> the mm were active on another cpu, but...)

Correct, it mostly applies to threaded applications (but it also
applies to regular apps that migrate from one idle cpu to the next).
In these cases it's very likely to send IPIs for each page, especially
if some other thread is running on another CPU. The IPI won't alter
the mm_cpumask(). So it can make quite some performance difference in
some microbenchmark using threads (which I didn't try to run yet). But
more interesting than microbenchmarks is to see if this makes any
difference with real-life JITs.

> That looks like a good optimization to me: I can't think of a good
> reason for it to be the way it was, just it started out like that and
> none of us ever thought to change it before. Plus it's always nice to
> see the flush_tlb_range() afterwards complementing the
> flush_cache_range() beforehand, as you now have in move_page_tables().

Same here. I couldn't see a good reason for it to be the way it was.

> And don't forget that move_page_tables() is also used by exec's
> shift_arg_pages(): no IPI saving there, but it should be more
> efficient when exec'ing with many arguments.

Yep, I didn't forget it's also called from execve; that is an area we
had to fix too, with Mel, for the (hopefully) last migrate rmap SMP
race recently. I think the big saving is in the IPI reduction on
large-CPU systems with plenty of threads running during mremap; that
should be measurable. For execve I doubt it, because like you said
there's no IPI saving there, but it sure will help a bit there too.

On this very execve/move_page_tables topic, one thought I had last
time I read it is that I don't get why we don't randomize the top of
the stack address _before_ allocating the stack, instead of
randomizing it after it's created, which requires an mremap. There
must be a good reason for it, but I didn't search for it too hard
yet... so I may figure this out myself if I look into the execve
paths just a bit deeper (I assume there's a good reason for it;
otherwise my point is that we shouldn't have been calling
move_page_tables inside execve in the first place). Maybe the
randomization of the top of the stack seeds from something that is
known only after the stack exists, dunno yet. But that's a separate
issue...

Thanks a lot to you and Rik for reviewing it,
Andrea
* Re: [PATCH] thp: mremap support and TLB optimization

From: Johannes Weiner @ 2011-03-15  9:27 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Fri, Mar 11, 2011 at 03:04:10AM +0100, Andrea Arcangeli wrote:
> @@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
>
>  	pmd = pmd_offset(pud, addr);
>  	split_huge_page_pmd(mm, pmd);

Wasn't getting rid of this line the sole purpose of the patch? :)

> -	if (pmd_none_or_clear_bad(pmd))
> +	if (pmd_none(*pmd))
>  		return NULL;
>
>  	return pmd;

[...]

> @@ -151,6 +148,23 @@ unsigned long move_page_tables(struct vm
>  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
> +		if (pmd_trans_huge(*old_pmd)) {
> +			int err = move_huge_pmd(vma, old_addr, new_addr,
> +						old_end, old_pmd, new_pmd);
> +			if (err > 0) {
> +				old_addr += HPAGE_PMD_SIZE;
> +				new_addr += HPAGE_PMD_SIZE;
> +				continue;
> +			}
> +		}
> +		/*
> +		 * split_huge_page_pmd() must run outside the
> +		 * pmd_trans_huge() block above because that check
> +		 * is racy. split_huge_page_pmd() will recheck
> +		 * pmd_trans_huge() but in a not racy way under the
> +		 * page_table_lock.
> +		 */
> +		split_huge_page_pmd(vma->vm_mm, old_pmd);

I don't understand what we are racing against here. If we see a huge
pmd, it may split. But we hold mmap_sem in write mode; I don't see
how a regular pmd could become huge all of a sudden at this point.

Hannes
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-15 10:01 UTC
To: Johannes Weiner; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Tue, Mar 15, 2011 at 10:27:50AM +0100, Johannes Weiner wrote:
> On Fri, Mar 11, 2011 at 03:04:10AM +0100, Andrea Arcangeli wrote:
> > @@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
> >
> >  	pmd = pmd_offset(pud, addr);
> >  	split_huge_page_pmd(mm, pmd);
>
> Wasn't getting rid of this line the sole purpose of the patch? :)

Leftover that should have been deleted, right...

> > +	if (pmd_trans_huge(*old_pmd)) {
> > +		int err = move_huge_pmd(vma, old_addr, new_addr,
> > +					old_end, old_pmd, new_pmd);
> > +		if (err > 0) {
> > +			old_addr += HPAGE_PMD_SIZE;
> > +			new_addr += HPAGE_PMD_SIZE;
> > +			continue;
> > +		}
> > +	}
> > +	/*
> > +	 * split_huge_page_pmd() must run outside the
> > +	 * pmd_trans_huge() block above because that check
> > +	 * is racy. split_huge_page_pmd() will recheck
> > +	 * pmd_trans_huge() but in a not racy way under the
> > +	 * page_table_lock.
> > +	 */
> > +	split_huge_page_pmd(vma->vm_mm, old_pmd);
>
> I don't understand what we are racing against here. If we see a huge
> pmd, it may split. But we hold mmap_sem in write mode; I don't see
> how a regular pmd could become huge all of a sudden at this point.

Agreed, in fact it runs it without the lock too...

Does this look any better? This also optimizes away the tlb flush for
totally uninitialized areas.

===
Subject: thp: mremap support and TLB optimization

From: Andrea Arcangeli <aarcange@redhat.com>

This adds THP support to mremap (decreases the number of
split_huge_page calls).

This also replaces ptep_clear_flush with ptep_get_and_clear plus a
final flush_tlb_range, to send a single tlb flush IPI instead of one
IPI for each page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    3 +++
 mm/huge_memory.c        |   38 ++++++++++++++++++++++++++++++++++++++
 mm/mremap.c             |   29 +++++++++++++++++++++--------
 3 files changed, 62 insertions(+), 8 deletions(-)

--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,9 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			unsigned char *vec);
+extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+			 unsigned long new_addr, unsigned long old_end,
+			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, pgprot_t newprot);

--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,8 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;

 	pmd = pmd_offset(pud, addr);
-	split_huge_page_pmd(mm, pmd);
-	if (pmd_none_or_clear_bad(pmd))
+	if (pmd_none(*pmd))
 		return NULL;

 	return pmd;
@@ -80,11 +79,7 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;

-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -112,7 +107,7 @@ static void move_ptes(struct vm_area_str
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
@@ -124,7 +119,6 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }

 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
@@ -135,10 +129,13 @@ unsigned long move_page_tables(struct vm
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
+	bool need_flush = false;

 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);

+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -151,6 +148,18 @@ unsigned long move_page_tables(struct vm
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
+		need_flush = true;
+		if (pmd_trans_huge(*old_pmd)) {
+			int err = move_huge_pmd(vma, old_addr, new_addr,
+						old_end, old_pmd, new_pmd);
+			if (err > 0) {
+				old_addr += HPAGE_PMD_SIZE;
+				new_addr += HPAGE_PMD_SIZE;
+				continue;
+			} else if (!err)
+				__split_huge_page_pmd(vma->vm_mm, old_pmd);
+			VM_BUG_ON(pmd_trans_huge(*old_pmd));
+		}
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;
@@ -159,6 +168,10 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
 	}
+	if (likely(need_flush))
+		flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);

 	return len + old_addr - old_end;	/* how much done */
 }

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1048,6 +1048,44 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }

+int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd)
+{
+	int ret = 0;
+	pmd_t pmd;
+
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~HPAGE_PMD_MASK) ||
+	    (new_addr & ~HPAGE_PMD_MASK) ||
+	    (old_addr + HPAGE_PMD_SIZE) > old_end)
+		goto out;
+
+	/* if the new area is all for our destination it must be unmapped */
+	VM_BUG_ON(!pmd_none(*new_pmd));
+	/* mostly to remember this locking isn't enough with filebacked vma */
+	VM_BUG_ON(vma->vm_file);
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*old_pmd))) {
+		if (pmd_trans_splitting(*old_pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, old_pmd);
+			ret = -1;
+		} else {
+			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+			set_pmd_at(mm, new_addr, new_pmd, pmd);
+			spin_unlock(&mm->page_table_lock);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
		    unsigned long addr, pgprot_t newprot)
 {
* Re: [PATCH] thp: mremap support and TLB optimization

From: Johannes Weiner @ 2011-03-15 12:07 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Tue, Mar 15, 2011 at 11:01:07AM +0100, Andrea Arcangeli wrote:
> Does this look any better? This also optimizes away the tlb flush for
> totally uninitialized areas.

Looks perfect to me, thanks!

> Subject: thp: mremap support and TLB optimization
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (decreases the number of
> split_huge_page calls).
>
> This also replaces ptep_clear_flush with ptep_get_and_clear plus a
> final flush_tlb_range, to send a single tlb flush IPI instead of one
> IPI for each page.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Thread overview: 9+ messages

2011-03-11  2:04 [PATCH] thp: mremap support and TLB optimization Andrea Arcangeli
2011-03-11 15:16 ` Rik van Riel
2011-03-11 19:44 ` Hugh Dickins
2011-03-11 20:25   ` Hugh Dickins
2011-03-12  4:28     ` Andrea Arcangeli
2011-03-12  4:02   ` Andrea Arcangeli
2011-03-15  9:27 ` Johannes Weiner
2011-03-15 10:01   ` Andrea Arcangeli
2011-03-15 12:07     ` Johannes Weiner