* [PATCH 0/3 v2] mmu_notifier: Allow to manage CPU external TLBs
From: Joerg Roedel @ 2014-07-29 16:18 UTC
To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner
Cc: Jerome Glisse, jroedel, Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu, Joerg Roedel

Changes V1->V2:

* Rebased to v3.16-rc7
* Added a call to ->invalidate_range() in __mmu_notifier_invalidate_range_end()
  so that a subsystem doesn't need to register an ->invalidate_range_end()
  call-back; subsystems will likely register either
  invalidate_range_start/end or invalidate_range, so that should be fine.
* Re-ordered the declarations a bit to reflect that invalidate_range is not
  only called between invalidate_range_start/end
* Updated the documentation to cover the case where invalidate_range is
  called outside of invalidate_range_start/end to flush page-table pages
  out of the TLB

Hi,

here is a patch-set to extend the mmu_notifiers in the Linux kernel to
allow managing CPU-external TLBs. Those TLBs may be implemented in IOMMUs
or in other external devices, e.g. ATS/PRI-capable PCI devices.

The problem with managing these TLBs is the semantics of the
invalidate_range_start/end call-backs currently available. Currently the
subsystem using mmu_notifiers has to guarantee that no new TLB entries are
established between invalidate_range_start/end. Furthermore, the
invalidate_range_start() function is called while all pages are still
mapped, and invalidate_range_end() when the pages are unmapped and already
freed.

So neither call-back can be used to safely flush a non-CPU TLB, because
_start() is called too early and _end() too late.

In the AMD IOMMUv2 driver this is currently implemented by assigning an
empty page-table to the external device between _start() and _end(). But
as tests have shown, this doesn't work: external devices don't re-fault
infinitely but enter a failure state after some time.

The next problem with this solution is that it causes an interrupt storm
of IO page faults to be handled while the empty page-table is assigned.

Furthermore, the _start()/end() notifiers only catch the moment when page
mappings are released, but not when page-table pages are freed. Catching
that is necessary for managing external TLBs when the page-table is shared
with the CPU.

To solve this situation I wrote a patch-set that introduces a new notifier
call-back: mmu_notifier_invalidate_range(). This notifier lifts the strict
requirement that no new references are taken in the range between _start()
and _end(). If the subsystem can't guarantee that no new references are
taken, it has to provide the invalidate_range() call-back to clear any new
references taken in there.

The new call-back is invoked between invalidate_range_start() and _end()
every time the VMM has to wipe out references to a range of pages. These
are usually the places where the CPU TLBs are flushed too, and where it is
important that this happens before invalidate_range_end() is called.

Any comments and review appreciated!
Thanks,

	Joerg

Joerg Roedel (3):
  mmu_notifier: Add mmu_notifier_invalidate_range()
  mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
  mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()

 include/linux/mmu_notifier.h | 75 +++++++++++++++++++++++++++++++++++++++++---
 kernel/events/uprobes.c      |  2 +-
 mm/fremap.c                  |  2 +-
 mm/huge_memory.c             |  9 +++---
 mm/hugetlb.c                 |  7 ++++-
 mm/ksm.c                     |  4 +--
 mm/memory.c                  |  3 +-
 mm/migrate.c                 |  3 +-
 mm/mmu_notifier.c            | 25 +++++++++++++++
 mm/rmap.c                    |  2 +-
 10 files changed, 115 insertions(+), 17 deletions(-)

-- 
1.9.1
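To illustrate how the new call-back is meant to be consumed, here is a
minimal, hypothetical sketch of a driver that manages an external
(device/IOMMU) TLB for a page-table shared with the CPU. It is not part of
the patch set; struct my_device, my_device_flush_tlb() and
my_device_bind_mm() are made-up names, and only the mmu_notifier API shown
in patches 1-3 is assumed:

#include <linux/mmu_notifier.h>

struct my_device {
	struct mmu_notifier mn;
	/* ... device state ... */
};

/* Hypothetical device-side flush primitive; not part of this series. */
static void my_device_flush_tlb(struct my_device *dev,
				unsigned long start, unsigned long end);

static void my_mn_invalidate_range(struct mmu_notifier *mn,
				   struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	struct my_device *dev = container_of(mn, struct my_device, mn);

	/* Can be called under the ptl spin-lock; must not sleep. */
	my_device_flush_tlb(dev, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range = my_mn_invalidate_range,
};

/* Bind a device context to a process address space. */
static int my_device_bind_mm(struct my_device *dev, struct mm_struct *mm)
{
	dev->mn.ops = &my_mn_ops;
	return mmu_notifier_register(&dev->mn, mm);
}

Because such a device walks the CPU page-table directly, it would not need
invalidate_range_start()/end() implementations; every place that flushes
the CPU TLB now also reaches my_mn_invalidate_range() before the pages are
freed.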
* [PATCH 1/3] mmu_notifier: Add mmu_notifier_invalidate_range()
From: Joerg Roedel @ 2014-07-29 16:18 UTC
To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner
Cc: Jerome Glisse, jroedel, Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

This notifier closes two important gaps in the current
invalidate_range_start()/end() notifiers. The _start() part is called
while all pages are still mapped, whereas the _end() notifier is called
when all pages are potentially unmapped and already freed. This makes it
impossible to manage external (non-CPU) hardware TLBs with MMU notifiers,
because there is no way to prevent the hardware from establishing new TLB
entries between the calls of these two functions. Yet that is exactly what
the existing notifiers require from the subsystem implementing them.

To allow managing external TLBs, the MMU notifiers need to catch the
moment when pages are unmapped but not yet freed. The new notifier catches
that moment and notifies the interested subsystem that the unmapped pages
are about to be freed; it is called between invalidate_range_start() and
_end().

For non-CPU TLBs it is also necessary to know when page-table pages are
freed. This is the second gap in the current mmu_notifiers. At those
events the new notifier is called as well, without
invalidate_range_start() and invalidate_range_end() around it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..1bac99c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -235,6 +235,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 		__mmu_notifier_invalidate_range_end(mm, start, end);
 }
 
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 	mm->mmu_notifier_mm = NULL;
@@ -326,6 +331,11 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 {
 }
 
+static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
+				unsigned long start, unsigned long end)
+{
+}
+
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
 {
 }
-- 
1.9.1
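Patch 2 below wires this stub into the VMM. As a condensed illustration of
the ordering the new hook is meant to provide (the function
example_clear_pte_and_notify() is a made-up name; it is essentially what
the ptep_clear_flush_notify() wrapper added in the next patch expands to):

/*
 * Condensed sketch of the intended ordering; not code from this series.
 */
static pte_t example_clear_pte_and_notify(struct vm_area_struct *vma,
					  unsigned long address, pte_t *ptep)
{
	unsigned long addr = address & PAGE_MASK;
	pte_t pte;

	/* Clear the PTE and flush the CPU TLB entry. */
	pte = ptep_clear_flush(vma, address, ptep);

	/* Tell external TLBs to drop their entries for this page... */
	mmu_notifier_invalidate_range(vma->vm_mm, addr, addr + PAGE_SIZE);

	/* ...so the old page may only be freed after this point. */
	return pte;
}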
* [PATCH 2/3] mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
From: Joerg Roedel @ 2014-07-29 16:18 UTC
To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner
Cc: Jerome Glisse, jroedel, Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

Add calls to the new mmu_notifier_invalidate_range() function to all
places in the VMM that need it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 28 ++++++++++++++++++++++++++++
 kernel/events/uprobes.c      |  2 +-
 mm/fremap.c                  |  2 +-
 mm/huge_memory.c             |  9 +++++----
 mm/hugetlb.c                 |  7 ++++++-
 mm/ksm.c                     |  4 ++--
 mm/memory.c                  |  3 ++-
 mm/migrate.c                 |  3 ++-
 mm/rmap.c                    |  2 +-
 9 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1bac99c..f760e95 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -273,6 +273,32 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 		__young;						\
 })
 
+#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
+({									\
+	unsigned long ___addr = __address & PAGE_MASK;			\
+	struct mm_struct *___mm = (__vma)->vm_mm;			\
+	pte_t ___pte;							\
+									\
+	___pte = ptep_clear_flush(__vma, __address, __ptep);		\
+	mmu_notifier_invalidate_range(___mm, ___addr,			\
+				      ___addr + PAGE_SIZE);		\
+									\
+	___pte;								\
+})
+
+#define pmdp_clear_flush_notify(__vma, __haddr, __pmd)			\
+({									\
+	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
+	struct mm_struct *___mm = (__vma)->vm_mm;			\
+	pmd_t ___pmd;							\
+									\
+	___pmd = pmdp_clear_flush(__vma, __haddr, __pmd);		\
+	mmu_notifier_invalidate_range(___mm, ___haddr,			\
+				      ___haddr + HPAGE_PMD_SIZE);	\
+									\
+	___pmd;								\
+})
+
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
* This is safe to start by updating the secondary MMUs, because the primary MMU @@ -346,6 +372,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) #define ptep_clear_flush_young_notify ptep_clear_flush_young #define pmdp_clear_flush_young_notify pmdp_clear_flush_young +#define ptep_clear_flush_notify ptep_clear_flush +#define pmdp_clear_flush_notify pmdp_clear_flush #define set_pte_at_notify set_pte_at #endif /* CONFIG_MMU_NOTIFIER */ diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 6f3254e..642262d 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, } flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush(vma, addr, ptep); + ptep_clear_flush_notify(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); page_remove_rmap(page); diff --git a/mm/fremap.c b/mm/fremap.c index 72b8fa3..9129013 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, if (pte_present(pte)) { flush_cache_page(vma, addr, pte_pfn(pte)); - pte = ptep_clear_flush(vma, addr, ptep); + pte = ptep_clear_flush_notify(vma, addr, ptep); page = vm_normal_page(vma, addr, pte); if (page) { if (pte_dirty(pte)) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 33514d8..b322c97 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1031,7 +1031,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, goto out_free_pages; VM_BUG_ON_PAGE(!PageHead(page), page); - pmdp_clear_flush(vma, haddr, pmd); + pmdp_clear_flush_notify(vma, haddr, pmd); /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(mm, pmd); @@ -1168,7 +1168,7 @@ alloc: pmd_t entry; entry = mk_huge_pmd(new_page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); - pmdp_clear_flush(vma, haddr, pmd); + pmdp_clear_flush_notify(vma, haddr, pmd); page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); update_mmu_cache_pmd(vma, address, pmd); @@ -1499,7 +1499,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, pmd_t entry; ret = 1; if (!prot_numa) { - entry = pmdp_get_and_clear(mm, addr, pmd); + entry = pmdp_get_and_clear_notify(mm, addr, pmd); if (pmd_numa(entry)) entry = pmd_mknonnuma(entry); entry = pmd_modify(entry, newprot); @@ -1631,6 +1631,7 @@ static int __split_huge_page_splitting(struct page *page, * serialize against split_huge_page*. 
*/ pmdp_splitting_flush(vma, address, pmd); + ret = 1; spin_unlock(ptl); } @@ -2793,7 +2794,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, pmd_t _pmd; int i; - pmdp_clear_flush(vma, haddr, pmd); + pmdp_clear_flush_notify(vma, haddr, pmd); /* leave pmd empty until pte is filled */ pgtable = pgtable_trans_huge_withdraw(mm, pmd); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9221c02..603851d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2602,8 +2602,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, } set_huge_pte_at(dst, addr, dst_pte, entry); } else { - if (cow) + if (cow) { huge_ptep_set_wrprotect(src, addr, src_pte); + mmu_notifier_invalidate_range(src, mmun_start, + mmun_end); + } entry = huge_ptep_get(src_pte); ptepage = pte_page(entry); get_page(ptepage); @@ -2911,6 +2914,7 @@ retry_avoidcopy: /* Break COW */ huge_ptep_clear_flush(vma, address, ptep); + mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, new_page, 1)); page_remove_rmap(old_page); @@ -3385,6 +3389,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * and that page table be reused and filled with junk. */ flush_tlb_range(vma, start, end); + mmu_notifier_invalidate_range(mm, start, end); mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); mmu_notifier_invalidate_range_end(mm, start, end); diff --git a/mm/ksm.c b/mm/ksm.c index 346ddc9..a73df3b 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -892,7 +892,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * this assure us that no O_DIRECT can happen after the check * or in the middle of the check. */ - entry = ptep_clear_flush(vma, addr, ptep); + entry = ptep_clear_flush_notify(vma, addr, ptep); /* * Check that no O_DIRECT or similar I/O is in progress on the * page @@ -960,7 +960,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, page_add_anon_rmap(kpage, vma, addr); flush_cache_page(vma, addr, pte_pfn(*ptep)); - ptep_clear_flush(vma, addr, ptep); + ptep_clear_flush_notify(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); page_remove_rmap(page); diff --git a/mm/memory.c b/mm/memory.c index 7e8d820..36daa2d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -236,6 +236,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) { tlb->need_flush = 0; tlb_flush(tlb); + mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb_table_flush(tlb); #endif @@ -2232,7 +2233,7 @@ gotten: * seen in the presence of one thread doing SMC and another * thread doing COW. 
*/ - ptep_clear_flush(vma, address, page_table); + ptep_clear_flush_notify(vma, address, page_table); page_add_new_anon_rmap(new_page, vma, address); /* * We call the notify macro here because, when using secondary diff --git a/mm/migrate.c b/mm/migrate.c index be6dbf9..d3fb8d0 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1875,7 +1875,7 @@ fail_putback: */ flush_cache_range(vma, mmun_start, mmun_end); page_add_anon_rmap(new_page, vma, mmun_start); - pmdp_clear_flush(vma, mmun_start, pmd); + pmdp_clear_flush_notify(vma, mmun_start, pmd); set_pmd_at(mm, mmun_start, pmd, entry); flush_tlb_range(vma, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); @@ -1883,6 +1883,7 @@ fail_putback: if (page_count(page) != 2) { set_pmd_at(mm, mmun_start, pmd, orig_entry); flush_tlb_range(vma, mmun_start, mmun_end); + mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); update_mmu_cache_pmd(vma, address, &entry); page_remove_rmap(new_page); goto fail_putback; diff --git a/mm/rmap.c b/mm/rmap.c index 22a4a76..8a0d02d 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1380,7 +1380,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush(vma, address, pte); + pteval = ptep_clear_flush_notify(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) { -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 6+ messages in thread
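The hugetlb_change_protection() hunk above shows the general pattern for
range-wide updates: the external TLBs have to be flushed after the CPU
TLBs but before invalidate_range_end(). A hypothetical, condensed example
of that ordering (example_wrprotect_range() is a made-up name, not code
from this series):

static void example_wrprotect_range(struct vm_area_struct *vma,
				    unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;

	mmu_notifier_invalidate_range_start(mm, start, end);

	/* ... walk the page-table and write-protect the PTEs ... */

	flush_tlb_range(vma, start, end);		/* CPU TLBs      */
	mmu_notifier_invalidate_range(mm, start, end);	/* external TLBs */

	mmu_notifier_invalidate_range_end(mm, start, end);
}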
* Re: [PATCH 2/3] mmu_notifier: Call mmu_notifier_invalidate_range() from VMM 2014-07-29 16:18 ` [PATCH 2/3] mmu_notifier: Call mmu_notifier_invalidate_range() from VMM Joerg Roedel @ 2014-08-16 12:55 ` Oded Gabbay 0 siblings, 0 replies; 6+ messages in thread From: Oded Gabbay @ 2014-08-16 12:55 UTC (permalink / raw) To: Joerg Roedel, Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner Cc: Jerome Glisse, jroedel, Jay.Cornwall, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu On 29/07/14 19:18, Joerg Roedel wrote: > From: Joerg Roedel <jroedel@suse.de> > > Add calls to the new mmu_notifier_invalidate_range() > function to all places if the VMM that need it. > > Signed-off-by: Joerg Roedel <jroedel@suse.de> > --- > include/linux/mmu_notifier.h | 28 ++++++++++++++++++++++++++++ > kernel/events/uprobes.c | 2 +- > mm/fremap.c | 2 +- > mm/huge_memory.c | 9 +++++---- > mm/hugetlb.c | 7 ++++++- > mm/ksm.c | 4 ++-- > mm/memory.c | 3 ++- > mm/migrate.c | 3 ++- > mm/rmap.c | 2 +- > 9 files changed, 48 insertions(+), 12 deletions(-) > > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h > index 1bac99c..f760e95 100644 > --- a/include/linux/mmu_notifier.h > +++ b/include/linux/mmu_notifier.h > @@ -273,6 +273,32 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > __young; \ > }) > > +#define ptep_clear_flush_notify(__vma, __address, __ptep) \ > +({ \ > + unsigned long ___addr = __address & PAGE_MASK; \ > + struct mm_struct *___mm = (__vma)->vm_mm; \ > + pte_t ___pte; \ > + \ > + ___pte = ptep_clear_flush(__vma, __address, __ptep); \ > + mmu_notifier_invalidate_range(___mm, ___addr, \ > + ___addr + PAGE_SIZE); \ > + \ > + ___pte; \ > +}) > + > +#define pmdp_clear_flush_notify(__vma, __haddr, __pmd) \ > +({ \ > + unsigned long ___haddr = __haddr & HPAGE_PMD_MASK; \ > + struct mm_struct *___mm = (__vma)->vm_mm; \ > + pmd_t ___pmd; \ > + \ > + ___pmd = pmdp_clear_flush(__vma, __haddr, __pmd); \ > + mmu_notifier_invalidate_range(___mm, ___haddr, \ > + ___haddr + HPAGE_PMD_SIZE); \ > + \ > + ___pmd; \ > +}) > + > /* > * set_pte_at_notify() sets the pte _after_ running the notifier. 
> * This is safe to start by updating the secondary MMUs, because the primary MMU > @@ -346,6 +372,8 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) > > #define ptep_clear_flush_young_notify ptep_clear_flush_young > #define pmdp_clear_flush_young_notify pmdp_clear_flush_young > +#define ptep_clear_flush_notify ptep_clear_flush > +#define pmdp_clear_flush_notify pmdp_clear_flush > #define set_pte_at_notify set_pte_at > > #endif /* CONFIG_MMU_NOTIFIER */ > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 6f3254e..642262d 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -186,7 +186,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, > } > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush(vma, addr, ptep); > + ptep_clear_flush_notify(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); > > page_remove_rmap(page); > diff --git a/mm/fremap.c b/mm/fremap.c > index 72b8fa3..9129013 100644 > --- a/mm/fremap.c > +++ b/mm/fremap.c > @@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, > > if (pte_present(pte)) { > flush_cache_page(vma, addr, pte_pfn(pte)); > - pte = ptep_clear_flush(vma, addr, ptep); > + pte = ptep_clear_flush_notify(vma, addr, ptep); > page = vm_normal_page(vma, addr, pte); > if (page) { > if (pte_dirty(pte)) > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 33514d8..b322c97 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1031,7 +1031,7 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, > goto out_free_pages; > VM_BUG_ON_PAGE(!PageHead(page), page); > > - pmdp_clear_flush(vma, haddr, pmd); > + pmdp_clear_flush_notify(vma, haddr, pmd); > /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > @@ -1168,7 +1168,7 @@ alloc: > pmd_t entry; > entry = mk_huge_pmd(new_page, vma->vm_page_prot); > entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); > - pmdp_clear_flush(vma, haddr, pmd); > + pmdp_clear_flush_notify(vma, haddr, pmd); > page_add_new_anon_rmap(new_page, vma, haddr); > set_pmd_at(mm, haddr, pmd, entry); > update_mmu_cache_pmd(vma, address, pmd); > @@ -1499,7 +1499,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, > pmd_t entry; > ret = 1; > if (!prot_numa) { > - entry = pmdp_get_and_clear(mm, addr, pmd); > + entry = pmdp_get_and_clear_notify(mm, addr, pmd); Where is pmdp_get_and_clear_notify() implemented ? I didn't find any implementation in this patch nor in linux-next. Oded > if (pmd_numa(entry)) > entry = pmd_mknonnuma(entry); > entry = pmd_modify(entry, newprot); > @@ -1631,6 +1631,7 @@ static int __split_huge_page_splitting(struct page *page, > * serialize against split_huge_page*. 
> */ > pmdp_splitting_flush(vma, address, pmd); > + > ret = 1; > spin_unlock(ptl); > } > @@ -2793,7 +2794,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, > pmd_t _pmd; > int i; > > - pmdp_clear_flush(vma, haddr, pmd); > + pmdp_clear_flush_notify(vma, haddr, pmd); > /* leave pmd empty until pte is filled */ > > pgtable = pgtable_trans_huge_withdraw(mm, pmd); > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 9221c02..603851d 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -2602,8 +2602,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, > } > set_huge_pte_at(dst, addr, dst_pte, entry); > } else { > - if (cow) > + if (cow) { > huge_ptep_set_wrprotect(src, addr, src_pte); > + mmu_notifier_invalidate_range(src, mmun_start, > + mmun_end); > + } > entry = huge_ptep_get(src_pte); > ptepage = pte_page(entry); > get_page(ptepage); > @@ -2911,6 +2914,7 @@ retry_avoidcopy: > > /* Break COW */ > huge_ptep_clear_flush(vma, address, ptep); > + mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); > set_huge_pte_at(mm, address, ptep, > make_huge_pte(vma, new_page, 1)); > page_remove_rmap(old_page); > @@ -3385,6 +3389,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, > * and that page table be reused and filled with junk. > */ > flush_tlb_range(vma, start, end); > + mmu_notifier_invalidate_range(mm, start, end); > mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex); > mmu_notifier_invalidate_range_end(mm, start, end); > > diff --git a/mm/ksm.c b/mm/ksm.c > index 346ddc9..a73df3b 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -892,7 +892,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, > * this assure us that no O_DIRECT can happen after the check > * or in the middle of the check. > */ > - entry = ptep_clear_flush(vma, addr, ptep); > + entry = ptep_clear_flush_notify(vma, addr, ptep); > /* > * Check that no O_DIRECT or similar I/O is in progress on the > * page > @@ -960,7 +960,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, > page_add_anon_rmap(kpage, vma, addr); > > flush_cache_page(vma, addr, pte_pfn(*ptep)); > - ptep_clear_flush(vma, addr, ptep); > + ptep_clear_flush_notify(vma, addr, ptep); > set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot)); > > page_remove_rmap(page); > diff --git a/mm/memory.c b/mm/memory.c > index 7e8d820..36daa2d 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -236,6 +236,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) > { > tlb->need_flush = 0; > tlb_flush(tlb); > + mmu_notifier_invalidate_range(tlb->mm, tlb->start, tlb->end); > #ifdef CONFIG_HAVE_RCU_TABLE_FREE > tlb_table_flush(tlb); > #endif > @@ -2232,7 +2233,7 @@ gotten: > * seen in the presence of one thread doing SMC and another > * thread doing COW. 
> */ > - ptep_clear_flush(vma, address, page_table); > + ptep_clear_flush_notify(vma, address, page_table); > page_add_new_anon_rmap(new_page, vma, address); > /* > * We call the notify macro here because, when using secondary > diff --git a/mm/migrate.c b/mm/migrate.c > index be6dbf9..d3fb8d0 100644 > --- a/mm/migrate.c > +++ b/mm/migrate.c > @@ -1875,7 +1875,7 @@ fail_putback: > */ > flush_cache_range(vma, mmun_start, mmun_end); > page_add_anon_rmap(new_page, vma, mmun_start); > - pmdp_clear_flush(vma, mmun_start, pmd); > + pmdp_clear_flush_notify(vma, mmun_start, pmd); > set_pmd_at(mm, mmun_start, pmd, entry); > flush_tlb_range(vma, mmun_start, mmun_end); > update_mmu_cache_pmd(vma, address, &entry); > @@ -1883,6 +1883,7 @@ fail_putback: > if (page_count(page) != 2) { > set_pmd_at(mm, mmun_start, pmd, orig_entry); > flush_tlb_range(vma, mmun_start, mmun_end); > + mmu_notifier_invalidate_range(mm, mmun_start, mmun_end); > update_mmu_cache_pmd(vma, address, &entry); > page_remove_rmap(new_page); > goto fail_putback; > diff --git a/mm/rmap.c b/mm/rmap.c > index 22a4a76..8a0d02d 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1380,7 +1380,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, > > /* Nuke the page table entry. */ > flush_cache_page(vma, address, pte_pfn(*pte)); > - pteval = ptep_clear_flush(vma, address, pte); > + pteval = ptep_clear_flush_notify(vma, address, pte); > > /* If nonlinear, store the file page offset in the pte. */ > if (page->index != linear_page_index(vma, address)) { > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread
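For reference, a wrapper of that name would presumably be the obvious
analog of the ptep_clear_flush_notify()/pmdp_clear_flush_notify() macros
added in this patch. The following is only an illustrative guess at the
missing definition, not code from this series:

#define pmdp_get_and_clear_notify(__mm, __haddr, __pmd)			\
({									\
	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
	pmd_t ___pmd;							\
									\
	___pmd = pmdp_get_and_clear(__mm, __haddr, __pmd);		\
	mmu_notifier_invalidate_range(__mm, ___haddr,			\
				      ___haddr + HPAGE_PMD_SIZE);	\
									\
	___pmd;								\
})

A matching "#define pmdp_get_and_clear_notify pmdp_get_and_clear" would
also be needed in the !CONFIG_MMU_NOTIFIER branch.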
* [PATCH 3/3] mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()
From: Joerg Roedel @ 2014-07-29 16:18 UTC
To: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner
Cc: Jerome Glisse, jroedel, Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu

From: Joerg Roedel <jroedel@suse.de>

Now that the mmu_notifier_invalidate_range() calls are in place, add the
call-back to allow subsystems to register against it.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 include/linux/mmu_notifier.h | 37 ++++++++++++++++++++++++++++++-----
 mm/mmu_notifier.c            | 25 +++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index f760e95..596ea08 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,11 +95,11 @@ struct mmu_notifier_ops {
 	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_sem and/or the
-	 * locks protecting the reverse maps are held. The subsystem
-	 * must guarantee that no additional references are taken to
-	 * the pages in the range established between the call to
-	 * invalidate_range_start() and the matching call to
-	 * invalidate_range_end().
+	 * locks protecting the reverse maps are held. If the subsystem
+	 * can't guarantee that no additional references are taken to
+	 * the pages in the range, it has to implement the
+	 * invalidate_range() notifier to remove any references taken
+	 * after invalidate_range_start().
 	 *
 	 * Invalidation of multiple concurrent ranges may be
 	 * optionally permitted by the driver. Either way the
@@ -141,6 +141,29 @@ struct mmu_notifier_ops {
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
+
+	/*
+	 * invalidate_range() is either called between
+	 * invalidate_range_start() and invalidate_range_end() when the
+	 * VM has to free pages that were unmapped, but before the
+	 * pages are actually freed, or outside of _start()/_end() when
+	 * page-table pages are about to be freed.
+	 *
+	 * If invalidate_range() is used to manage a non-CPU TLB with
+	 * shared page-tables, it is not necessary to implement the
+	 * invalidate_range_start()/end() notifiers, as
+	 * invalidate_range() already catches the points in time when
+	 * an external TLB range needs to be flushed.
+	 *
+	 * The invalidate_range() function is called under the ptl
+	 * spin-lock and is not allowed to sleep.
+	 *
+	 * Note that this function might be called with just a sub-range
+	 * of what was passed to invalidate_range_start()/end(), if
+	 * called between those functions.
+	 */
+	void (*invalidate_range)(struct mmu_notifier *mn, struct mm_struct *mm,
+				 unsigned long start, unsigned long end);
 };
 
 /*
@@ -184,6 +207,8 @@ extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
+extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end);
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
 {
@@ -238,6 +263,8 @@ static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_invalidate_range(mm, start, end);
 }
 
 static inline void mmu_notifier_mm_init(struct mm_struct *mm)
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 41cefdf..e8ecbed 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -173,6 +173,16 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		/*
+		 * Call invalidate_range() here too, to avoid the need for
+		 * the subsystem to register an invalidate_range_end()
+		 * call-back when it already registers invalidate_range().
+		 * Usually a subsystem registers either
+		 * invalidate_range_start()/end() or invalidate_range(), so
+		 * this adds no additional overhead (besides the pointer
+		 * check).
+		 */
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
 		if (mn->ops->invalidate_range_end)
 			mn->ops->invalidate_range_end(mn, mm, start, end);
 	}
@@ -180,6 +190,21 @@ void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end);
 
+void __mmu_notifier_invalidate_range(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	struct mmu_notifier *mn;
+	int id;
+
+	id = srcu_read_lock(&srcu);
+	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
+		if (mn->ops->invalidate_range)
+			mn->ops->invalidate_range(mn, mm, start, end);
+	}
+	srcu_read_unlock(&srcu, id);
+}
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range);
+
 static int do_mmu_notifier_register(struct mmu_notifier *mn,
 				    struct mm_struct *mm,
 				    int take_mmap_sem)
-- 
1.9.1
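To round out the registration sketch given after the cover letter, a user
of the new call-back still needs the usual notifier lifetime handling,
which this series does not change. Again a hypothetical fragment
(my_device, my_device_stop_dma() and my_device_unbind_mm() are made-up
names, and my_mn_invalidate_range() refers to the earlier sketch):

static void my_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct my_device *dev = container_of(mn, struct my_device, mn);

	/*
	 * The address space is going away (e.g. the process exited);
	 * stop device DMA before its TLB entries become stale.
	 */
	my_device_stop_dma(dev);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.release          = my_mn_release,
	.invalidate_range = my_mn_invalidate_range,
};

/* Explicit unbind path, if the device context is torn down first. */
static void my_device_unbind_mm(struct my_device *dev, struct mm_struct *mm)
{
	mmu_notifier_unregister(&dev->mn, mm);
}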
* Re: [PATCH 0/3 v2] mmu_notifier: Allow to manage CPU external TLBs 2014-07-29 16:18 [PATCH 0/3 v2] mmu_notifier: Allow to manage CPU external TLBs Joerg Roedel ` (2 preceding siblings ...) 2014-07-29 16:18 ` [PATCH 3/3] mmu_notifier: Add the call-back for mmu_notifier_invalidate_range() Joerg Roedel @ 2014-07-31 14:54 ` Jerome Glisse 3 siblings, 0 replies; 6+ messages in thread From: Jerome Glisse @ 2014-07-31 14:54 UTC (permalink / raw) To: Joerg Roedel Cc: Andrew Morton, Andrea Arcangeli, Peter Zijlstra, Rik van Riel, Hugh Dickins, Mel Gorman, Johannes Weiner, Jerome Glisse, jroedel, Jay.Cornwall, Oded.Gabbay, John.Bridgman, Suravee.Suthikulpanit, ben.sander, Jesse Barnes, David Woodhouse, linux-kernel, linux-mm, iommu On Tue, Jul 29, 2014 at 06:18:10PM +0200, Joerg Roedel wrote: > Changes V1->V2: > > * Rebase to v3.16-rc7 > * Added call of ->invalidate_range to > __mmu_notifier_invalidate_end() so that the subsystem > doesn't need to register an ->invalidate_end() call-back, > subsystems will likely either register > invalidate_range_start/end or invalidate_range, so that > should be fine. > * Re-orded declarations a bit to reflect that > invalidate_range is not only called between > invalidate_range_start/end > * Updated documentation to cover the case where > invalidate_range is called outside of > invalidate_range_start/end to flush page-table pages out > of the TLB > > Hi, > > here is a patch-set to extend the mmu_notifiers in the Linux > kernel to allow managing CPU external TLBs. Those TLBs may > be implemented in IOMMUs or any other external device, e.g. > ATS/PRI capable PCI devices. > > The problem with managing these TLBs are the semantics of > the invalidate_range_start/end call-backs currently > available. Currently the subsystem using mmu_notifiers has > to guarantee that no new TLB entries are established between > invalidate_range_start/end. Furthermore the > invalidate_range_start() function is called when all pages > are still mapped and invalidate_range_end() when the pages > are unmapped an already freed. > > So both call-backs can't be used to safely flush any non-CPU > TLB because _start() is called too early and _end() too > late. > > In the AMD IOMMUv2 driver this is currently implemented by > assigning an empty page-table to the external device between > _start() and _end(). But as tests have shown this doesn't > work as external devices don't re-fault infinitly but enter > a failure state after some time. > > Next problem with this solution is that it causes an > interrupt storm for IO page faults to be handled when an > empty page-table is assigned. > > Furthermore the _start()/end() notifiers only catch the > moment when page mappings are released, but not page-table > pages. But this is necessary for managing external TLBs when > the page-table is shared with the CPU. > > To solve this situation I wrote a patch-set to introduce a > new notifier call-back: mmu_notifer_invalidate_range(). This > notifier lifts the strict requirements that no new > references are taken in the range between _start() and > _end(). When the subsystem can't guarantee that any new > references are taken is has to provide the > invalidate_range() call-back to clear any new references in > there. > > It is called between invalidate_range_start() and _end() > every time the VMM has to wipe out any references to a > couple of pages. This are usually the places where the CPU > TLBs are flushed too and where its important that this > happens before invalidate_range_end() is called. 
>
> Any comments and review appreciated!

For the series:

Reviewed-by: Jerome Glisse <jglisse@redhat.com>

> Thanks,
>
> 	Joerg
>
> Joerg Roedel (3):
>   mmu_notifier: Add mmu_notifier_invalidate_range()
>   mmu_notifier: Call mmu_notifier_invalidate_range() from VMM
>   mmu_notifier: Add the call-back for mmu_notifier_invalidate_range()
>
>  include/linux/mmu_notifier.h | 75 +++++++++++++++++++++++++++++++++++++++++---
>  kernel/events/uprobes.c      |  2 +-
>  mm/fremap.c                  |  2 +-
>  mm/huge_memory.c             |  9 +++---
>  mm/hugetlb.c                 |  7 ++++-
>  mm/ksm.c                     |  4 +--
>  mm/memory.c                  |  3 +-
>  mm/migrate.c                 |  3 +-
>  mm/mmu_notifier.c            | 25 +++++++++++++++
>  mm/rmap.c                    |  2 +-
>  10 files changed, 115 insertions(+), 17 deletions(-)
>
> --
> 1.9.1