* [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-11  2:04 UTC
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Rik van Riel

Hello everyone,

I've been wondering why mremap is sending one IPI for each page that
it moves. I tried to remove that, so we send one IPI per vma/syscall
(not one per pte/page). I also added native THP support that avoids
calling split_huge_page unconditionally when both the source and
destination alignment allow a pmd_trans_huge to be preserved (the
mremap extension and truncation already preserved existing hugepages,
but the move into a new place didn't yet). If the destination
alignment isn't ok, split_huge_page is unavoidable, but that is a
userland/hardware limitation, not really something we can optimize
further in the kernel.

I've no real numbers yet (volanomark results are mostly unchanged:
it's a tiny bit faster, but that may be measurement error, and it
doesn't seem to call mremap enough; the thp_split number in
/proc/vmstat does seem to go down close to zero though, so maybe
other JIT workloads will benefit?). In the meantime I'm posting this
for review.

I'm not entirely sure this is safe at this point (I mean the tlb part
especially). Also note that if any arch needs the tlb flush after
ptep_get_and_clear, move_pte can provide it. The huge_memory.c part
has no move_pmd equivalent because the only arch that needs move_pte
(sparc64) doesn't support THP yet (I've no idea if sparc64 is one of
the candidates for future THP-capable archs; arm/ppcembedded should
make it eventually).

I applied this to my aa.git tree and I'm running it on all my systems
with no adverse effects for more than a day, so if you want to test,
the usual procedure works:

first: git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
or first: git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
later: git fetch; git checkout -f origin/master

===
Subject: thp: mremap support and TLB optimization

From: Andrea Arcangeli <aarcange@redhat.com>

This adds THP support to mremap (decreases the number of
split_huge_page calls).

This also replaces ptep_clear_flush with ptep_get_and_clear plus a
final flush_tlb_range, to send a single tlb flush IPI instead of one
IPI for each page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    3 +++
 mm/huge_memory.c        |   44 ++++++++++++++++++++++++++++++++++++++++++++
 mm/mremap.c             |   31 ++++++++++++++++++++++++-------
 3 files changed, 71 insertions(+), 7 deletions(-)

--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,9 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			unsigned char *vec);
+extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+			 unsigned long new_addr, unsigned long old_end,
+			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, pgprot_t newprot);

--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 	pmd = pmd_offset(pud, addr);
 	split_huge_page_pmd(mm, pmd);
-	if (pmd_none_or_clear_bad(pmd))
+	if (pmd_none(*pmd))
 		return NULL;

 	return pmd;
@@ -80,11 +80,7 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;

-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -112,7 +108,7 @@ static void move_ptes(struct vm_area_str
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
@@ -124,7 +120,6 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }

 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
@@ -139,6 +134,8 @@ unsigned long move_page_tables(struct vm
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);

+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -151,6 +148,23 @@ unsigned long move_page_tables(struct vm
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
+		if (pmd_trans_huge(*old_pmd)) {
+			int err = move_huge_pmd(vma, old_addr, new_addr,
+						old_end, old_pmd, new_pmd);
+			if (err > 0) {
+				old_addr += HPAGE_PMD_SIZE;
+				new_addr += HPAGE_PMD_SIZE;
+				continue;
+			}
+		}
+		/*
+		 * split_huge_page_pmd() must run outside the
+		 * pmd_trans_huge() block above because that check
+		 * is racy. split_huge_page_pmd() will recheck
+		 * pmd_trans_huge() but in a not racy way under the
+		 * page_table_lock.
+		 */
+		split_huge_page_pmd(vma->vm_mm, old_pmd);
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;
@@ -159,6 +173,9 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
 	}
+	flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);

 	return len + old_addr - old_end;	/* how much done */
 }

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1048,6 +1048,50 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }

+int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd)
+{
+	int ret = 0;
+	pmd_t pmd;
+
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~HPAGE_PMD_MASK) ||
+	    (new_addr & ~HPAGE_PMD_MASK) ||
+	    (old_addr + HPAGE_PMD_SIZE) > old_end)
+		goto out;
+
+	/* if the new area is all for our destination it must be unmapped */
+	VM_BUG_ON(!pmd_none(*new_pmd));
+	/* mostly to remember this locking isn't enough with filebacked vma */
+	VM_BUG_ON(vma->vm_file);
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*old_pmd))) {
+		if (pmd_trans_splitting(*old_pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			/*
+			 * It's not mandatory to wait here as the
+			 * caller will run split_huge_page_pmd(), but
+			 * this is faster and it will avoid the caller
+			 * to invoke __split_huge_page_pmd() (and to
+			 * take the page_table_lock again).
+			 */
+			wait_split_huge_page(vma->anon_vma, old_pmd);
+		} else {
+			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+			set_pmd_at(mm, new_addr, new_pmd, pmd);
+			spin_unlock(&mm->page_table_lock);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
		    unsigned long addr, pgprot_t newprot)
 {
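A minimal userspace sketch of the alignment point above, for illustration only:
it is not part of the patch, and MADV_HUGEPAGE, the 2MB hugepage size and the
thp_split counter are assumptions about an x86-64 kernel with THP enabled. Run
it once with no arguments and once with any argument, comparing thp_split in
/proc/vmstat before and after each run: only the unaligned destination should
force splits, since the huge pmds can be preserved when both source and
destination are 2MB-aligned.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HPAGE (2UL * 1024 * 1024)

/* Reserve extra address space, then return a 2MB-aligned address inside it. */
static void *aligned_reservation(size_t size)
{
	char *p = mmap(NULL, size + HPAGE, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	return (void *)(((unsigned long)p + HPAGE - 1) & ~(HPAGE - 1));
}

int main(int argc, char **argv)
{
	size_t size = 8 * HPAGE;
	int misalign = argc > 1;	/* any argument: use a non-2MB-aligned target */
	char *src, *dst;

	(void)argv;
	src = aligned_reservation(size);
	dst = aligned_reservation(size + HPAGE);
	if (!src || !dst)
		return 1;

	mprotect(src, size, PROT_READ | PROT_WRITE);
	madvise(src, size, MADV_HUGEPAGE);
	memset(src, 1, size);		/* fault in, hopefully as 2MB pages */

	if (misalign)
		dst += 4096;		/* 4k offset: every huge pmd must be split */

	if (mremap(src, size, size, MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved %zu bytes to %p (%s destination)\n", size, (void *)dst,
	       misalign ? "unaligned" : "2MB-aligned");
	return 0;
}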
* Re: [PATCH] thp: mremap support and TLB optimization

From: Rik van Riel @ 2011-03-11 15:16 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner

On 03/10/2011 09:04 PM, Andrea Arcangeli wrote:
> Hello everyone,
>
> I've been wondering why mremap is sending one IPI for each page that
> it moves. I tried to remove that so we send an IPI for each
> vma/syscall (not for each pte/page). I also added native THP support
> without calling split_huge_page unconditionally if both the source and
> destination alignment allows a pmd_trans_huge to be preserved (the
> mremap extension and truncation already preserved existing hugepages
> but the move into new place didn't yet). If the destination alignment
> isn't ok, split_huge_page is unavoidable but that is an
> userland/hardware limitation, not really something we can optimize
> further in the kernel.
>
> I've no real numbers yet (volanomark results are mostly unchanged,
> it's a tinybit faster but it may be measurement error, and it doesn't
> seem to call mremap enough, but the thp_split number in /proc/vmstat
> seem to go down close to zero, maybe other JIT workloads will
> benefit?).

Reviewed-by: Rik van Riel <riel@redhat.com>

--
All rights reversed
* Re: [PATCH] thp: mremap support and TLB optimization

From: Hugh Dickins @ 2011-03-11 19:44 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I've been wondering why mremap is sending one IPI for each page that
> it moves. I tried to remove that so we send an IPI for each
> vma/syscall (not for each pte/page).

(It wouldn't usually have been sending an IPI for each page, only if
the mm were active on another cpu, but...)

That looks like a good optimization to me: I can't think of a good
reason for it to be the way it was, just it started out like that and
none of us ever thought to change it before. Plus it's always nice to
see the flush_tlb_range() afterwards complementing the
flush_cache_range() beforehand, as you now have in move_page_tables().

And don't forget that move_page_tables() is also used by exec's
shift_arg_pages(): no IPI saving there, but it should be more
efficient when exec'ing with many arguments.

Hugh
* Re: [PATCH] thp: mremap support and TLB optimization

From: Hugh Dickins @ 2011-03-11 20:25 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Fri, Mar 11, 2011 at 11:44 AM, Hugh Dickins <hughd@google.com> wrote:
> On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>>
>> I've been wondering why mremap is sending one IPI for each page that
>> it moves. I tried to remove that so we send an IPI for each
>> vma/syscall (not for each pte/page).
>
> (It wouldn't usually have been sending an IPI for each page, only if
> the mm were active on another cpu, but...)
>
> That looks like a good optimization to me: I can't think of a good
> reason for it to be the way it was, just it started out like that and
> none of us ever thought to change it before. Plus it's always nice to
> see the flush_tlb_range() afterwards complementing the
> flush_cache_range() beforehand, as you now have in move_page_tables().
>
> And don't forget that move_page_tables() is also used by exec's
> shift_arg_pages(): no IPI saving there, but it should be more
> efficient when exec'ing with many arguments.

Perhaps I should qualify that answer: although I still think it's the
right change to make (it matches mprotect, for example), and an
optimization in many cases, it will be a pessimization for anyone who
mremap moves unpopulated areas (I doubt that's common), and for anyone
who moves around single page areas (on x86 and probably some others).
But the exec args case has, I think, few useful tlb entries to lose
from the mm-wide tlb flush.

flush_tlb_range() ought to special case small areas, doing at most one
IPI, but up to some number of flush_tlb_one()s; but that would
certainly have to be another patch.

Hugh
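For illustration, a rough sketch of the special case Hugh suggests in the last
paragraph. This is not existing kernel code: the ceiling is arbitrary and the
helper names (__flush_tlb_one, local_flush_tlb, smp_call_function_many,
mm_cpumask) merely follow x86/generic conventions of the time. The point is
the policy: at most one IPI, with per-page invalidation on each CPU when the
range is small.

#define FLUSH_ONE_CEILING	33	/* arbitrary threshold, needs benchmarking */

struct flush_range_info {
	struct mm_struct *mm;
	unsigned long start, end;
};

/* Runs on each CPU in mm_cpumask(): purely local work, no further IPIs. */
static void do_flush_range_local(void *arg)
{
	struct flush_range_info *f = arg;
	unsigned long addr;

	/* A real implementation would also check this CPU is running f->mm. */
	if ((f->end - f->start) >> PAGE_SHIFT > FLUSH_ONE_CEILING) {
		local_flush_tlb();			/* full local flush */
	} else {
		for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
			__flush_tlb_one(addr);		/* invlpg per page */
	}
}

static void flush_tlb_range_sketch(struct vm_area_struct *vma,
				   unsigned long start, unsigned long end)
{
	struct flush_range_info f = {
		.mm	= vma->vm_mm,
		.start	= start,
		.end	= end,
	};

	preempt_disable();
	do_flush_range_local(&f);			/* this CPU */
	smp_call_function_many(mm_cpumask(f.mm),	/* one IPI for the rest */
			       do_flush_range_local, &f, 1);
	preempt_enable();
}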
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-12  4:28 UTC
To: Hugh Dickins; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

On Fri, Mar 11, 2011 at 12:25:42PM -0800, Hugh Dickins wrote:
> Perhaps I should qualify that answer: although I still think it's the
> right change to make (it matches mprotect, for example), and an
> optimization in many cases, it will be a pessimization for anyone who
> mremap moves unpopulated areas (I doubt that's common), and for anyone
> who moves around single page areas (on x86 and probably some others).
> But the exec args case has, I think, few useful tlb entries to lose
> from the mm-wide tlb flush.

I didn't consider the pessimization of totally unmapped areas a
real-life possibility. At least the mmu notifier isn't pessimized if
the pmd wasn't none but the ptes were none; only the TLB flush is
really pessimized in that case (if all old ptes are none, regardless
of the pmd). But the range TLB flush is really easy to optimize for
unmapped areas if we want: just skip the flush_tlb_range if all old
ptes were none and we actually changed nothing, right? Fixing the mmu
notifier the same way isn't possible, but that's not a concern, and
it will surely be fine for it to stay in move_page_tables.

So do we want one more branch to avoid one IPI if mremap runs on an
unmapped area? That's ok with me if it's a real-life possibility. At
the moment I think any app doing that is pretty stupid and shouldn't
call mremap in the first place; it should have used mmap(MAP_FIXED)
or a bigger mmap size in the first place... If we add a branch for
that case, maybe we should also printk if we detect it, in addition
to skipping the tlb flush.

> flush_tlb_range() ought to special case small areas, doing at most one
> IPI, but up to some number of flush_tlb_one()s; but that would
> certainly have to be another patch.

That's probably a good tradeoff. Even better would be if x86 were
extended to allow range flushes, so we don't have to do guesswork in
software.
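For illustration, a pte-granularity variant of the idea Andrea describes above
(skip the flush when all old ptes were none) might look roughly like the sketch
below. It is not from the patch, it omits the file-backed i_mmap locking and
lazy-mmu hooks for brevity, and the v2 patch later in the thread takes the
simpler route of a pmd-granularity need_flush flag instead.

/* Sketch only: move_ptes() reports whether it moved any live pte, so the
 * caller can skip flush_tlb_range() when nothing was actually moved. */
static bool move_ptes_sketch(struct vm_area_struct *vma, pmd_t *old_pmd,
			     unsigned long old_addr, unsigned long old_end,
			     struct vm_area_struct *new_vma, pmd_t *new_pmd,
			     unsigned long new_addr)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t *old_pte, *new_pte, pte;
	spinlock_t *old_ptl, *new_ptl;
	bool moved = false;

	old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
	new_pte = pte_offset_map(new_pmd, new_addr);
	new_ptl = pte_lockptr(mm, new_pmd);
	if (new_ptl != old_ptl)
		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);

	for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
				   new_pte++, new_addr += PAGE_SIZE) {
		if (pte_none(*old_pte))
			continue;
		pte = ptep_get_and_clear(mm, old_addr, old_pte);
		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
		set_pte_at(mm, new_addr, new_pte, pte);
		moved = true;		/* at least one live pte was moved */
	}

	if (new_ptl != old_ptl)
		spin_unlock(new_ptl);
	pte_unmap(new_pte - 1);
	pte_unmap_unlock(old_pte - 1, old_ptl);
	/* caller: only flush_tlb_range() if some pmd ever returned true */
	return moved;
}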
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-12  4:02 UTC
To: Hugh Dickins; +Cc: linux-mm, Mel Gorman, Johannes Weiner, Rik van Riel

Hi Hugh,

On Fri, Mar 11, 2011 at 11:44:03AM -0800, Hugh Dickins wrote:
> On Thu, Mar 10, 2011 at 6:04 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > I've been wondering why mremap is sending one IPI for each page that
> > it moves. I tried to remove that so we send an IPI for each
> > vma/syscall (not for each pte/page).
>
> (It wouldn't usually have been sending an IPI for each page, only if
> the mm were active on another cpu, but...)

Correct, it mostly applies to threaded applications (but it also
applies to regular apps that migrate from one idle cpu to the next).
In these cases it's very likely to send IPIs for each page, especially
if some other thread is running on another CPU. The IPI won't alter
the mm_cpumask(). So it can make quite some performance difference in
some microbenchmark using threads (which I didn't try to run yet). But
more interesting than microbenchmarks is to see if this makes any
difference with real-life JITs.

> That looks like a good optimization to me: I can't think of a good
> reason for it to be the way it was, just it started out like that and
> none of us ever thought to change it before. Plus it's always nice to
> see the flush_tlb_range() afterwards complementing the
> flush_cache_range() beforehand, as you now have in move_page_tables().

Same here. I couldn't see a good reason for it to be the way it was.

> And don't forget that move_page_tables() is also used by exec's
> shift_arg_pages(): no IPI saving there, but it should be more
> efficient when exec'ing with many arguments.

Yep, I didn't forget it's also called from execve; that is an area we
had to fix too, with Mel, for the (hopefully) last migrate rmap SMP
race recently. I think the big saving is in the IPI reduction on
large-CPU systems with plenty of threads running during mremap; that
should be measurable. For execve I doubt it, because like you said
there's no IPI saving there, but it sure will help a bit there too.

On this very execve/move_page_tables topic, one thought I had last
time I read it is that I don't get why we don't randomize the top of
the stack address _before_ allocating the stack, instead of
randomizing it after it's created, which requires an mremap. There
must be a good reason for it, but I didn't search for it too hard
yet... so I may figure this out myself if I look into the execve
paths just a bit deeper (I assume there's a good reason for it;
otherwise my point is that we shouldn't have been calling
move_page_tables inside execve in the first place). Maybe the
randomization of the top of the stack seeds from something that is
known only after the stack exists, dunno yet. But that's a separate
issue...

Thanks a lot to you and Rik for reviewing it,
Andrea
* Re: [PATCH] thp: mremap support and TLB optimization

From: Johannes Weiner @ 2011-03-15  9:27 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Fri, Mar 11, 2011 at 03:04:10AM +0100, Andrea Arcangeli wrote:
> @@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
>
>  	pmd = pmd_offset(pud, addr);
>  	split_huge_page_pmd(mm, pmd);

Wasn't getting rid of this line the sole purpose of the patch? :)

> -	if (pmd_none_or_clear_bad(pmd))
> +	if (pmd_none(*pmd))
>  		return NULL;
>
>  	return pmd;

[...]

> @@ -151,6 +148,23 @@ unsigned long move_page_tables(struct vm
>  		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
>  		if (!new_pmd)
>  			break;
> +		if (pmd_trans_huge(*old_pmd)) {
> +			int err = move_huge_pmd(vma, old_addr, new_addr,
> +						old_end, old_pmd, new_pmd);
> +			if (err > 0) {
> +				old_addr += HPAGE_PMD_SIZE;
> +				new_addr += HPAGE_PMD_SIZE;
> +				continue;
> +			}
> +		}
> +		/*
> +		 * split_huge_page_pmd() must run outside the
> +		 * pmd_trans_huge() block above because that check
> +		 * is racy. split_huge_page_pmd() will recheck
> +		 * pmd_trans_huge() but in a not racy way under the
> +		 * page_table_lock.
> +		 */
> +		split_huge_page_pmd(vma->vm_mm, old_pmd);

I don't understand what we are racing against here. If we see a huge
pmd, it may split. But we hold mmap_sem in write mode; I don't see
how a regular pmd could become huge all of a sudden at this point.

Hannes
* Re: [PATCH] thp: mremap support and TLB optimization

From: Andrea Arcangeli @ 2011-03-15 10:01 UTC
To: Johannes Weiner; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Tue, Mar 15, 2011 at 10:27:50AM +0100, Johannes Weiner wrote:
> On Fri, Mar 11, 2011 at 03:04:10AM +0100, Andrea Arcangeli wrote:
> > @@ -42,7 +42,7 @@ static pmd_t *get_old_pmd(struct mm_stru
> >
> >  	pmd = pmd_offset(pud, addr);
> >  	split_huge_page_pmd(mm, pmd);
>
> Wasn't getting rid of this line the sole purpose of the patch? :)

Leftover that should have been deleted, right...

> > +	if (pmd_trans_huge(*old_pmd)) {
> > +		int err = move_huge_pmd(vma, old_addr, new_addr,
> > +					old_end, old_pmd, new_pmd);
> > +		if (err > 0) {
> > +			old_addr += HPAGE_PMD_SIZE;
> > +			new_addr += HPAGE_PMD_SIZE;
> > +			continue;
> > +		}
> > +	}
> > +	/*
> > +	 * split_huge_page_pmd() must run outside the
> > +	 * pmd_trans_huge() block above because that check
> > +	 * is racy. split_huge_page_pmd() will recheck
> > +	 * pmd_trans_huge() but in a not racy way under the
> > +	 * page_table_lock.
> > +	 */
> > +	split_huge_page_pmd(vma->vm_mm, old_pmd);
>
> I don't understand what we are racing against here. If we see a huge
> pmd, it may split. But we hold mmap_sem in write mode; I don't see
> how a regular pmd could become huge all of a sudden at this point.

Agreed, in fact it runs it without the lock too...

Does this look any better? This also optimizes away the tlb flush for
totally uninitialized areas.

===
Subject: thp: mremap support and TLB optimization

From: Andrea Arcangeli <aarcange@redhat.com>

This adds THP support to mremap (decreases the number of
split_huge_page calls).

This also replaces ptep_clear_flush with ptep_get_and_clear plus a
final flush_tlb_range, to send a single tlb flush IPI instead of one
IPI for each page.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/huge_mm.h |    3 +++
 mm/huge_memory.c        |   38 ++++++++++++++++++++++++++++++++++++++
 mm/mremap.c             |   29 +++++++++++++++++++++--------
 3 files changed, 62 insertions(+), 8 deletions(-)

--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -22,6 +22,9 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			unsigned char *vec);
+extern int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+			 unsigned long new_addr, unsigned long old_end,
+			 pmd_t *old_pmd, pmd_t *new_pmd);
 extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, pgprot_t newprot);

--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -41,8 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
 		return NULL;

 	pmd = pmd_offset(pud, addr);
-	split_huge_page_pmd(mm, pmd);
-	if (pmd_none_or_clear_bad(pmd))
+	if (pmd_none(*pmd))
 		return NULL;

 	return pmd;
@@ -80,11 +79,7 @@ static void move_ptes(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_pte, *new_pte, pte;
 	spinlock_t *old_ptl, *new_ptl;
-	unsigned long old_start;

-	old_start = old_addr;
-	mmu_notifier_invalidate_range_start(vma->vm_mm,
-					    old_start, old_end);
 	if (vma->vm_file) {
 		/*
 		 * Subtle point from Rajesh Venkatasubramanian: before
@@ -112,7 +107,7 @@ static void move_ptes(struct vm_area_str
 				   new_pte++, new_addr += PAGE_SIZE) {
 		if (pte_none(*old_pte))
 			continue;
-		pte = ptep_clear_flush(vma, old_addr, old_pte);
+		pte = ptep_get_and_clear(mm, old_addr, old_pte);
 		pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
 		set_pte_at(mm, new_addr, new_pte, pte);
 	}
@@ -124,7 +119,6 @@ static void move_ptes(struct vm_area_str
 	pte_unmap_unlock(old_pte - 1, old_ptl);
 	if (mapping)
 		spin_unlock(&mapping->i_mmap_lock);
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_start, old_end);
 }

 #define LATENCY_LIMIT	(64 * PAGE_SIZE)
@@ -135,10 +129,13 @@ unsigned long move_page_tables(struct vm
 {
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
+	bool need_flush = false;

 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);

+	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
 		next = (old_addr + PMD_SIZE) & PMD_MASK;
@@ -151,6 +148,18 @@ unsigned long move_page_tables(struct vm
 		new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
 		if (!new_pmd)
 			break;
+		need_flush = true;
+		if (pmd_trans_huge(*old_pmd)) {
+			int err = move_huge_pmd(vma, old_addr, new_addr,
+						old_end, old_pmd, new_pmd);
+			if (err > 0) {
+				old_addr += HPAGE_PMD_SIZE;
+				new_addr += HPAGE_PMD_SIZE;
+				continue;
+			} else if (!err)
+				__split_huge_page_pmd(vma->vm_mm, old_pmd);
+			VM_BUG_ON(pmd_trans_huge(*old_pmd));
+		}
 		next = (new_addr + PMD_SIZE) & PMD_MASK;
 		if (extent > next - new_addr)
 			extent = next - new_addr;
@@ -159,6 +168,10 @@ unsigned long move_page_tables(struct vm
 		move_ptes(vma, old_pmd, old_addr, old_addr + extent,
 			  new_vma, new_pmd, new_addr);
 	}
+	if (likely(need_flush))
+		flush_tlb_range(vma, old_end-len, old_addr);
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);

 	return len + old_addr - old_end;	/* how much done */
 }

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1048,6 +1048,44 @@ int mincore_huge_pmd(struct vm_area_stru
 	return ret;
 }

+int move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+		  unsigned long new_addr, unsigned long old_end,
+		  pmd_t *old_pmd, pmd_t *new_pmd)
+{
+	int ret = 0;
+	pmd_t pmd;
+
+	struct mm_struct *mm = vma->vm_mm;
+
+	if ((old_addr & ~HPAGE_PMD_MASK) ||
+	    (new_addr & ~HPAGE_PMD_MASK) ||
+	    (old_addr + HPAGE_PMD_SIZE) > old_end)
+		goto out;
+
+	/* if the new area is all for our destination it must be unmapped */
+	VM_BUG_ON(!pmd_none(*new_pmd));
+	/* mostly to remember this locking isn't enough with filebacked vma */
+	VM_BUG_ON(vma->vm_file);
+
+	spin_lock(&mm->page_table_lock);
+	if (likely(pmd_trans_huge(*old_pmd))) {
+		if (pmd_trans_splitting(*old_pmd)) {
+			spin_unlock(&vma->vm_mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, old_pmd);
+			ret = -1;
+		} else {
+			pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
+			set_pmd_at(mm, new_addr, new_pmd, pmd);
+			spin_unlock(&mm->page_table_lock);
+			ret = 1;
+		}
+	} else
+		spin_unlock(&mm->page_table_lock);
+
+out:
+	return ret;
+}
+
 int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
		    unsigned long addr, pgprot_t newprot)
 {
* Re: [PATCH] thp: mremap support and TLB optimization

From: Johannes Weiner @ 2011-03-15 12:07 UTC
To: Andrea Arcangeli; +Cc: linux-mm, Mel Gorman, Rik van Riel

On Tue, Mar 15, 2011 at 11:01:07AM +0100, Andrea Arcangeli wrote:
> Does this look any better? This also optimizes away the tlb flush for
> totally uninitialized areas.

Looks perfect to me, thanks!

> Subject: thp: mremap support and TLB optimization
>
> From: Andrea Arcangeli <aarcange@redhat.com>
>
> This adds THP support to mremap (decreases the number of
> split_huge_page calls).
>
> This also replaces ptep_clear_flush with ptep_get_and_clear plus a
> final flush_tlb_range, to send a single tlb flush IPI instead of one
> IPI for each page.
>
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Thread overview: 9+ messages

2011-03-11  2:04 [PATCH] thp: mremap support and TLB optimization Andrea Arcangeli
2011-03-11 15:16 ` Rik van Riel
2011-03-11 19:44 ` Hugh Dickins
2011-03-11 20:25   ` Hugh Dickins
2011-03-12  4:28     ` Andrea Arcangeli
2011-03-12  4:02   ` Andrea Arcangeli
2011-03-15  9:27 ` Johannes Weiner
2011-03-15 10:01   ` Andrea Arcangeli
2011-03-15 12:07     ` Johannes Weiner