From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx124.postini.com [74.125.245.124]) by kanga.kvack.org (Postfix) with SMTP id F364D6B0078 for ; Sun, 26 Aug 2012 06:12:14 -0400 (EDT) From: Haggai Eran Subject: [PATCH 0/3] Enable clients to schedule in mmu_notifier methods Date: Sun, 26 Aug 2012 13:11:36 +0300 Message-Id: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrea Arcangeli , Peter Zijlstra , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Haggai Eran The following short patch series completes the support for allowing clients to sleep in mmu notifiers (specifically in invalidate_page and invalidate_range_start/end), adding on the work done by Andrea Arcangeli and Sagi Grimberg in http://marc.info/?l=linux-mm&m=133113297028676&w=3 This patchset is a preliminary step towards on-demand paging design to be added to the Infiniband stack. Our goal is to avoid pinning pages in memory regions registered for IB communication, so we need to get notifications for invalidations on such memory regions, and stop the hardware from continuing its access to the invalidated pages. The hardware operation that flushes the page tables can block, so we need to sleep until the hardware is guaranteed not to access these pages anymore. The first patch moves the mentioned notifier functions out of the PTL, and the other two patches prevent notifiers from sleeping between calls to tlb_gather_mmu and tlb_finish_mmu. I believe Peter Zijlstra has commented that patch 2 is no longer needed; if so, patch 3 is unnecessary for the same reason. Let's discuss this.
Regards, Haggai Eran Sagi Grimberg (3): mm: Move all mmu notifier invocations to be done outside the PT lock mm: Move the tlb flushing into free_pgtables mm: Move the tlb flushing inside of unmap vmas include/linux/mmu_notifier.h | 48 -------------------------------------------- mm/filemap_xip.c | 4 +++- mm/huge_memory.c | 32 +++++++++++++++++++++++------ mm/hugetlb.c | 15 ++++++++------ mm/memory.c | 25 +++++++++++++---------- mm/mmap.c | 7 ------- mm/rmap.c | 27 ++++++++++++++++++------- 7 files changed, 72 insertions(+), 86 deletions(-) -- 1.7.11.2 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx180.postini.com [74.125.245.180]) by kanga.kvack.org (Postfix) with SMTP id 7A0126B007B for ; Sun, 26 Aug 2012 06:12:25 -0400 (EDT) From: Haggai Eran Subject: [PATCH 1/3] mm: Move all mmu notifier invocations to be done outside the PT lock Date: Sun, 26 Aug 2012 13:11:37 +0300 Message-Id: <1345975899-2236-2-git-send-email-haggaie@mellanox.com> In-Reply-To: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrea Arcangeli , Peter Zijlstra , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Andrea Arcangeli , Haggai Eran From: Sagi Grimberg In order to allow sleeping during mmu notifier calls, we need to avoid invoking them under the page table spinlock. This patch solves the problem by calling invalidate_page notification after releasing the lock (but before freeing the page itself), or by wrapping the page invalidation with calls to invalidate_range_start and invalidate_range_end.
Signed-off-by: Andrea Arcangeli Signed-off-by: Sagi Grimberg Signed-off-by: Haggai Eran --- include/linux/mmu_notifier.h | 48 -------------------------------------------- mm/filemap_xip.c | 4 +++- mm/huge_memory.c | 32 +++++++++++++++++++++++------ mm/hugetlb.c | 15 ++++++++------ mm/memory.c | 10 ++++++--- mm/rmap.c | 27 ++++++++++++++++++------- 6 files changed, 65 insertions(+), 71 deletions(-) diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index ee2baf0..470a825 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -246,50 +246,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) __mmu_notifier_mm_destroy(mm); } -/* - * These two macros will sometime replace ptep_clear_flush. - * ptep_clear_flush is implemented as macro itself, so this also is - * implemented as a macro until ptep_clear_flush will converted to an - * inline function, to diminish the risk of compilation failure. The - * invalidate_page method over time can be moved outside the PT lock - * and these two macros can be later removed. 
- */ -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma->vm_mm, ___address); \ - __pte; \ -}) - -#define pmdp_clear_flush_notify(__vma, __address, __pmdp) \ -({ \ - pmd_t __pmd; \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ - mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ - (__address)+HPAGE_PMD_SIZE);\ - __pmd = pmdp_clear_flush(___vma, ___address, __pmdp); \ - mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ - (__address)+HPAGE_PMD_SIZE); \ - __pmd; \ -}) - -#define pmdp_splitting_flush_notify(__vma, __address, __pmdp) \ -({ \ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - VM_BUG_ON(__address & ~HPAGE_PMD_MASK); \ - mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address, \ - (__address)+HPAGE_PMD_SIZE);\ - pmdp_splitting_flush(___vma, ___address, __pmdp); \ - mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address, \ - (__address)+HPAGE_PMD_SIZE); \ -}) - #define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ ({ \ int __young; \ @@ -368,11 +324,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm) { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young #define pmdp_clear_flush_young_notify pmdp_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush -#define pmdp_clear_flush_notify pmdp_clear_flush -#define pmdp_splitting_flush_notify pmdp_splitting_flush #define set_pte_at_notify set_pte_at #endif /* CONFIG_MMU_NOTIFIER */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 13e013b..a002a6d 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -193,11 +193,13 @@ retry: if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page); dec_mm_counter(mm, MM_FILEPAGES); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 57c4b93..5a5b9e4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -868,12 +868,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, cond_resched(); } + mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); + spin_lock(&mm->page_table_lock); if (unlikely(!pmd_same(*pmd, orig_pmd))) goto out_free_pages; VM_BUG_ON(!PageHead(page)); - pmdp_clear_flush_notify(vma, haddr, pmd); + pmdp_clear_flush(vma, haddr, pmd); /* leave pmd empty until pte is filled */ pgtable = get_pmd_huge_pte(mm); @@ -896,6 +898,9 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm, page_remove_rmap(page); spin_unlock(&mm->page_table_lock); + mmu_notifier_invalidate_range_end(vma->vm_mm, haddr, + haddr + HPAGE_PMD_SIZE); + ret |= VM_FAULT_WRITE; put_page(page); @@ -904,6 +909,7 @@ out: out_free_pages: spin_unlock(&mm->page_table_lock); + mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); mem_cgroup_uncharge_start(); for (i = 0; i < HPAGE_PMD_NR; i++) { mem_cgroup_uncharge_page(pages[i]); @@ -970,20 +976,22 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR); __SetPageUptodate(new_page); + mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE); + spin_lock(&mm->page_table_lock); put_page(page); if (unlikely(!pmd_same(*pmd, orig_pmd))) { spin_unlock(&mm->page_table_lock); mem_cgroup_uncharge_page(new_page); put_page(new_page); - goto out; + goto out_mn; } else { pmd_t entry; 
VM_BUG_ON(!PageHead(page)); entry = mk_pmd(new_page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); entry = pmd_mkhuge(entry); - pmdp_clear_flush_notify(vma, haddr, pmd); + pmdp_clear_flush(vma, haddr, pmd); page_add_new_anon_rmap(new_page, vma, haddr); set_pmd_at(mm, haddr, pmd, entry); update_mmu_cache(vma, address, entry); @@ -991,10 +999,14 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, put_page(page); ret |= VM_FAULT_WRITE; } -out_unlock: spin_unlock(&mm->page_table_lock); +out_mn: + mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE); out: return ret; +out_unlock: + spin_unlock(&mm->page_table_lock); + return ret; } struct page *follow_trans_huge_pmd(struct mm_struct *mm, @@ -1208,6 +1220,8 @@ static int __split_huge_page_splitting(struct page *page, pmd_t *pmd; int ret = 0; + mmu_notifier_invalidate_range_start(mm, address, + address + HPAGE_PMD_SIZE); spin_lock(&mm->page_table_lock); pmd = page_check_address_pmd(page, mm, address, PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG); @@ -1219,10 +1233,12 @@ static int __split_huge_page_splitting(struct page *page, * and it won't wait on the anon_vma->root->mutex to * serialize against split_huge_page*. */ - pmdp_splitting_flush_notify(vma, address, pmd); + pmdp_splitting_flush(vma, address, pmd); ret = 1; } spin_unlock(&mm->page_table_lock); + mmu_notifier_invalidate_range_end(mm, address, + address + HPAGE_PMD_SIZE); return ret; } @@ -1937,6 +1953,8 @@ static void collapse_huge_page(struct mm_struct *mm, pte = pte_offset_map(pmd, address); ptl = pte_lockptr(mm, pmd); + mmu_notifier_invalidate_range_start(mm, address, + address + HPAGE_PMD_SIZE); spin_lock(&mm->page_table_lock); /* probably unnecessary */ /* * After this gup_fast can't run anymore. This also removes @@ -1944,8 +1962,10 @@ static void collapse_huge_page(struct mm_struct *mm, * huge and small TLB entries for the same virtual address * to avoid the risk of CPU bugs in that area. 
*/ - _pmd = pmdp_clear_flush_notify(vma, address, pmd); + _pmd = pmdp_clear_flush(vma, address, pmd); spin_unlock(&mm->page_table_lock); + mmu_notifier_invalidate_range_end(mm, address, + address + HPAGE_PMD_SIZE); spin_lock(ptl); isolated = __collapse_huge_page_isolate(vma, address, pte); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index bc72712..c569b97 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2611,6 +2611,9 @@ retry_avoidcopy: pages_per_huge_page(h)); __SetPageUptodate(new_page); + mmu_notifier_invalidate_range_start(mm, + address & huge_page_mask(h), + (address & huge_page_mask(h)) + huge_page_size(h)); /* * Retake the page_table_lock to check for racing updates * before the page tables are altered @@ -2619,9 +2622,6 @@ retry_avoidcopy: ptep = huge_pte_offset(mm, address & huge_page_mask(h)); if (likely(pte_same(huge_ptep_get(ptep), pte))) { /* Break COW */ - mmu_notifier_invalidate_range_start(mm, - address & huge_page_mask(h), - (address & huge_page_mask(h)) + huge_page_size(h)); huge_ptep_clear_flush(vma, address, ptep); set_huge_pte_at(mm, address, ptep, make_huge_pte(vma, new_page, 1)); @@ -2629,10 +2629,13 @@ retry_avoidcopy: hugepage_add_new_anon_rmap(new_page, vma, address); /* Make the old page be freed below */ new_page = old_page; - mmu_notifier_invalidate_range_end(mm, - address & huge_page_mask(h), - (address & huge_page_mask(h)) + huge_page_size(h)); } + spin_unlock(&mm->page_table_lock); + mmu_notifier_invalidate_range_end(mm, + address & huge_page_mask(h), + (address & huge_page_mask(h)) + huge_page_size(h)); + /* Caller expects lock to be held */ + spin_lock(&mm->page_table_lock); page_cache_release(new_page); page_cache_release(old_page); return 0; diff --git a/mm/memory.c b/mm/memory.c index 5736170..b657a2e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2516,7 +2516,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, spinlock_t *ptl, pte_t orig_pte) __releases(ptl) { - struct page *old_page, *new_page; + struct 
page *old_page, *new_page = NULL; pte_t entry; int ret = 0; int page_mkwrite = 0; @@ -2760,10 +2760,14 @@ gotten: } else mem_cgroup_uncharge_page(new_page); - if (new_page) - page_cache_release(new_page); unlock: pte_unmap_unlock(page_table, ptl); + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); + page_cache_release(new_page); + } if (old_page) { /* * Don't let another task, with possibly unlocked vma, diff --git a/mm/rmap.c b/mm/rmap.c index 0f3b7cd..f13e6cf 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -694,7 +694,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, unsigned long *vm_flags) { struct mm_struct *mm = vma->vm_mm; - int referenced = 0; + int referenced = 0, clear_flush_young = 0; if (unlikely(PageTransHuge(page))) { pmd_t *pmd; @@ -741,7 +741,8 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, goto out; } - if (ptep_clear_flush_young_notify(vma, address, pte)) { + clear_flush_young = 1; + if (ptep_clear_flush_young(vma, address, pte)) { /* * Don't treat a reference through a sequentially read * mapping as such. 
If the page has been used in @@ -757,6 +758,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma, (*mapcount)--; + if (clear_flush_young) + referenced += mmu_notifier_clear_flush_young(mm, address); + if (referenced) *vm_flags |= vma->vm_flags; out: @@ -929,7 +933,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, pte_t entry; flush_cache_page(vma, address, pte_pfn(*pte)); - entry = ptep_clear_flush_notify(vma, address, pte); + entry = ptep_clear_flush(vma, address, pte); entry = pte_wrprotect(entry); entry = pte_mkclean(entry); set_pte_at(mm, address, pte, entry); @@ -937,6 +941,9 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma, } pte_unmap_unlock(pte, ptl); + + if (ret) + mmu_notifier_invalidate_page(mm, address); out: return ret; } @@ -1256,7 +1263,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, /* Nuke the page table entry. */ flush_cache_page(vma, address, page_to_pfn(page)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* Move the dirty bit to the physical page now the pte is gone. */ if (pte_dirty(pteval)) @@ -1318,6 +1325,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, out_unmap: pte_unmap_unlock(pte, ptl); + if (ret != SWAP_FAIL) + mmu_notifier_invalidate_page(mm, address); out: return ret; @@ -1382,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, spinlock_t *ptl; struct page *page; unsigned long address; - unsigned long end; + unsigned long start, end; int ret = SWAP_AGAIN; int locked_vma = 0; @@ -1405,6 +1414,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, if (!pmd_present(*pmd)) return ret; + start = address; + mmu_notifier_invalidate_range_start(mm, start, end); + /* * If we can acquire the mmap_sem for read, and vma is VM_LOCKED, * keep the sem while scanning the cluster for mlocking pages. 
 @@ -1433,12 +1445,12 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, continue; /* don't unmap */ } - if (ptep_clear_flush_young_notify(vma, address, pte)) + if (ptep_clear_flush_young(vma, address, pte)) continue; /* Nuke the page table entry. */ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); /* If nonlinear, store the file page offset in the pte. */ if (page->index != linear_page_index(vma, address)) @@ -1454,6 +1466,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, (*mapcount)--; } pte_unmap_unlock(pte - 1, ptl); + mmu_notifier_invalidate_range_end(mm, start, end); if (locked_vma) up_read(&vma->vm_mm->mmap_sem); return ret; -- 1.7.11.2 From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx141.postini.com [74.125.245.141]) by kanga.kvack.org (Postfix) with SMTP id 2ED766B0080 for ; Sun, 26 Aug 2012 06:12:28 -0400 (EDT) From: Haggai Eran Subject: [PATCH 2/3] mm: Move the tlb flushing into free_pgtables Date: Sun, 26 Aug 2012 13:11:38 +0300 Message-Id: <1345975899-2236-3-git-send-email-haggaie@mellanox.com> In-Reply-To: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrea Arcangeli , Peter Zijlstra , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Christoph Lameter , Haggai Eran From: Sagi Grimberg The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush. 
Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one. Signed-off-by: Christoph Lameter Signed-off-by: Sagi Grimberg Signed-off-by: Haggai Eran --- mm/memory.c | 3 +++ mm/mmap.c | 4 ++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index b657a2e..e721432 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -553,6 +553,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, unlink_file_vma(vma); if (is_vm_hugetlb_page(vma)) { + tlb_gather_mmu(tlb, vma->vm_mm, 0); hugetlb_free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } else { @@ -566,9 +567,11 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma, unlink_anon_vmas(vma); unlink_file_vma(vma); } + tlb_gather_mmu(tlb, vma->vm_mm, 0); free_pgd_range(tlb, addr, vma->vm_end, floor, next? next->vm_start: ceiling); } + tlb_finish_mmu(tlb, addr, vma->vm_end); vma = next; } } diff --git a/mm/mmap.c b/mm/mmap.c index e3e8691..731da04 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1912,9 +1912,9 @@ static void unmap_region(struct mm_struct *mm, tlb_gather_mmu(&tlb, mm, 0); update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end); + tlb_finish_mmu(&tlb, start, end); free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, next ? next->vm_start : 0); - tlb_finish_mmu(&tlb, start, end); } /* @@ -2295,8 +2295,8 @@ void exit_mmap(struct mm_struct *mm) /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); tlb_finish_mmu(&tlb, 0, -1); + free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); /* * Walk the list again, actually closing and freeing it, -- 1.7.11.2 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx118.postini.com [74.125.245.118]) by kanga.kvack.org (Postfix) with SMTP id B2E2D6B0082 for ; Sun, 26 Aug 2012 06:12:30 -0400 (EDT) From: Haggai Eran Subject: [PATCH 3/3] mm: Move the tlb flushing inside of unmap vmas Date: Sun, 26 Aug 2012 13:11:39 +0300 Message-Id: <1345975899-2236-4-git-send-email-haggaie@mellanox.com> In-Reply-To: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: linux-mm@kvack.org Cc: Andrea Arcangeli , Peter Zijlstra , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Haggai Eran From: Sagi Grimberg This patch removes another hurdle preventing sleeping in mmu notifiers. It is based on the assumption that we cannot sleep between calls to tlb_gather_mmu, and tlb_finish_mmu. Signed-off-by: Sagi Grimberg Signed-off-by: Haggai Eran --- mm/memory.c | 12 ++++-------- mm/mmap.c | 7 ------- 2 files changed, 4 insertions(+), 15 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index e721432..ca2e0cd 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1327,6 +1327,9 @@ static void unmap_single_vma(struct mmu_gather *tlb, if (end <= vma->vm_start) return; + lru_add_drain(); + tlb_gather_mmu(tlb, vma->vm_mm, 0); + update_hiwater_rss(vma->vm_mm); if (vma->vm_file) uprobe_munmap(vma, start, end); @@ -1354,6 +1357,7 @@ static void unmap_single_vma(struct mmu_gather *tlb, } else unmap_page_range(tlb, vma, start, end, details); } + tlb_finish_mmu(tlb, start_addr, end_addr); } /** @@ -1402,14 +1406,10 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start, struct mmu_gather tlb; unsigned long end = start + size; - lru_add_drain(); - tlb_gather_mmu(&tlb, mm, 0); - update_hiwater_rss(mm); mmu_notifier_invalidate_range_start(mm, start, end); for ( ; vma && vma->vm_start < end; vma = vma->vm_next) 
unmap_single_vma(&tlb, vma, start, end, details); mmu_notifier_invalidate_range_end(mm, start, end); - tlb_finish_mmu(&tlb, start, end); } /** @@ -1428,13 +1428,9 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr struct mmu_gather tlb; unsigned long end = address + size; - lru_add_drain(); - tlb_gather_mmu(&tlb, mm, 0); - update_hiwater_rss(mm); mmu_notifier_invalidate_range_start(mm, address, end); unmap_single_vma(&tlb, vma, address, end, details); mmu_notifier_invalidate_range_end(mm, address, end); - tlb_finish_mmu(&tlb, address, end); } /** diff --git a/mm/mmap.c b/mm/mmap.c index 731da04..4b614fe 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1908,11 +1908,7 @@ static void unmap_region(struct mm_struct *mm, struct vm_area_struct *next = prev? prev->vm_next: mm->mmap; struct mmu_gather tlb; - lru_add_drain(); - tlb_gather_mmu(&tlb, mm, 0); - update_hiwater_rss(mm); unmap_vmas(&tlb, vma, start, end); - tlb_finish_mmu(&tlb, start, end); free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, next ? next->vm_start : 0); } @@ -2288,14 +2284,11 @@ void exit_mmap(struct mm_struct *mm) if (!vma) /* Can happen if dup_mmap() received an OOM */ return; - lru_add_drain(); flush_cache_mm(mm); - tlb_gather_mmu(&tlb, mm, 1); /* update_hiwater_rss(mm) here? but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ unmap_vmas(&tlb, vma, 0, -1); - tlb_finish_mmu(&tlb, 0, -1); free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0); /* -- 1.7.11.2 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx156.postini.com [74.125.245.156]) by kanga.kvack.org (Postfix) with SMTP id 3B2216B002B for ; Mon, 27 Aug 2012 00:19:40 -0400 (EDT) Received: from canuck.infradead.org ([2001:4978:20e::1]) by merlin.infradead.org with esmtps (Exim 4.76 #1 (Red Hat Linux)) id 1T5qna-0000Lr-NP for linux-mm@kvack.org; Mon, 27 Aug 2012 04:19:38 +0000 Received: from dhcp-089-099-019-018.chello.nl ([89.99.19.18] helo=dyad.programming.kicks-ass.net) by canuck.infradead.org with esmtpsa (Exim 4.76 #1 (Red Hat Linux)) id 1T5qna-0004qH-ER for linux-mm@kvack.org; Mon, 27 Aug 2012 04:19:38 +0000 Subject: Re: [PATCH 2/3] mm: Move the tlb flushing into free_pgtables From: Peter Zijlstra In-Reply-To: <1345975899-2236-3-git-send-email-haggaie@mellanox.com> References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> <1345975899-2236-3-git-send-email-haggaie@mellanox.com> Content-Type: text/plain; charset="UTF-8" Date: Mon, 27 Aug 2012 06:19:14 +0200 Message-ID: <1346041154.2296.1.camel@laptop> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Haggai Eran Cc: linux-mm@kvack.org, Andrea Arcangeli , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Christoph Lameter On Sun, 2012-08-26 at 13:11 +0300, Haggai Eran wrote: > > The conversion of the locks taken for reverse map scanning would > require taking sleeping locks in free_pgtables() and we cannot sleep > while gathering pages for a tlb flush. We can. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx122.postini.com [74.125.245.122]) by kanga.kvack.org (Postfix) with SMTP id 368206B002B for ; Tue, 28 Aug 2012 04:55:19 -0400 (EDT) Message-ID: <503C86DC.3040705@mellanox.com> Date: Tue, 28 Aug 2012 11:52:44 +0300 From: Haggai Eran MIME-Version: 1.0 Subject: Re: [PATCH 2/3] mm: Move the tlb flushing into free_pgtables References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> <1345975899-2236-3-git-send-email-haggaie@mellanox.com> <1346041154.2296.1.camel@laptop> In-Reply-To: <1346041154.2296.1.camel@laptop> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org List-ID: To: Peter Zijlstra Cc: "linux-mm@kvack.org" , Andrea Arcangeli , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz , Christoph Lameter On 27/08/2012 07:19, Peter Zijlstra wrote: > On Sun, 2012-08-26 at 13:11 +0300, Haggai Eran wrote: >> The conversion of the locks taken for reverse map scanning would >> require taking sleeping locks in free_pgtables() and we cannot sleep >> while gathering pages for a tlb flush. > We can. > After further reading I tend to agree. We can drop this patch and patch number 3 then and focus on the first patch in this set. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from psmtp.com (na3sys010amx139.postini.com [74.125.245.139]) by kanga.kvack.org (Postfix) with SMTP id 886CC6B0068 for ; Wed, 5 Sep 2012 10:23:09 -0400 (EDT) Date: Wed, 5 Sep 2012 17:24:15 +0300 From: "Michael S. 
Tsirkin" Subject: Re: [PATCH 0/3] Enable clients to schedule in mmu_notifier methods Message-ID: <20120905142415.GA10832@redhat.com> References: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1345975899-2236-1-git-send-email-haggaie@mellanox.com> Sender: owner-linux-mm@kvack.org List-ID: To: Haggai Eran Cc: linux-mm@kvack.org, Andrea Arcangeli , Peter Zijlstra , Xiao Guangrong , Andrew Morton , Sagi Grimberg , Or Gerlitz On Wed, Sep 05, 2012 at 04:35:49PM +0300, Haggai Eran wrote: > The following short patch series completes the support for allowing clients to > sleep in mmu notifiers (specifically in invalidate_page and > invalidate_range_start/end), adding on the work done by Andrea Arcangeli and > Sagi Grimberg in http://marc.info/?l=linux-mm&m=133113297028676&w=3 > > This patchset is a preliminary step towards on-demand paging design to be > added to the Infiniband stack. Our goal is to avoid pinning pages in > memory regions registered for IB communication, so we need to get > notifications for invalidations on such memory regions, and stop the hardware > from continuing its access to the invalidated pages. The hardware operation > that flushes the page tables can block, so we need to sleep until the hardware > is guaranteed not to access these pages anymore. Since people have been asking about the need for on demand paging in devices: this can be useful for KVM where we sometimes want to let guest directly (bypassing the hypervisor) control a virtual function of a PCI device (prevented by an iommu from accessing host memory). 
Currently this means host needs to pin all guest memory that *might* be used by this virtual function which breaks setups with memory overcommit; at the moment we can address this by means of ballooning - cooperative memory management - but this has some obvious problems: for example, to get some memory out of a low priority guest it needs to run so the balloon can get inflated. By comparison on demand paging would not require guest cooperation. The problem is not specific to Infiniband; addressing it by on demand paging does require hardware and host driver support though. If Infiniband drivers happen to be the first to implement it more power to them. -- MST
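The ordering that patch 1/3 describes for invalidate_page (clear the PTE under the page table lock, drop the lock, notify, and only then release the page) can be modelled in a few lines of userspace C. This is a rough sketch under stated assumptions, not kernel code: fake_page, fake_ptl_lock, fake_ptl_unlock, sleeping_invalidate_page, fake_page_release and unmap_one are hypothetical names invented for the illustration, and a plain boolean stands in for the spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; not a kernel type. */
struct fake_page {
	int refcount;
	bool freed;
};

static bool ptl_held;            /* models "page table spinlock is held" */
static int notifier_calls;       /* number of times the notifier has run */
static bool notified_after_free; /* set if the notifier ever saw a freed page */

static void fake_ptl_lock(void)
{
	assert(!ptl_held);
	ptl_held = true;
}

static void fake_ptl_unlock(void)
{
	assert(ptl_held);
	ptl_held = false;
}

/*
 * Models a notifier client that may sleep, e.g. while waiting for a device
 * to flush its own page tables. Sleeping is only legal outside the lock.
 */
static void sleeping_invalidate_page(const struct fake_page *page)
{
	assert(!ptl_held);           /* would be a "sleep in atomic" bug */
	if (page->freed)
		notified_after_free = true;
	notifier_calls++;
}

/* Models page_cache_release(): drop a reference, free on the last one. */
static void fake_page_release(struct fake_page *page)
{
	if (--page->refcount == 0)
		page->freed = true;
}

/* The ordering after patch 1/3: clear, unlock, notify, release. */
static void unmap_one(struct fake_page *page, bool *pte_present)
{
	fake_ptl_lock();
	*pte_present = false;            /* models ptep_clear_flush() */
	fake_ptl_unlock();               /* models pte_unmap_unlock() */
	sleeping_invalidate_page(page);  /* after unlock, before release */
	fake_page_release(page);
}
```

Calling unmap_one() on a page with a single reference runs the notifier exactly once, with the lock dropped and the page still allocated. Moving the notifier call above fake_ptl_unlock() trips the first assertion, which is the situation the removed ptep_clear_flush_notify macro created for a sleeping client.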