[PATCH V1 0/2] Enable clients to schedule in mmu

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods
@ 2012-09-04  8:41 Haggai Eran
  2012-09-04  8:41 ` [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Haggai Eran @ 2012-09-04  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Peter Zijlstra, Xiao Guangrong,
	Haggai Eran, Shachar Raindel, Sagi Grimberg, Or Gerlitz

> The following short patch series completes the support for allowing clients to
> sleep in mmu notifiers (specifically in invalidate_page and
> invalidate_range_start/end), adding on the work done by Andrea Arcangeli and
> Sagi Grimberg in http://marc.info/?l=linux-mm&m=133113297028676&w=3
>
> This patchset is a preliminary step towards on-demand paging design to be
> added to the Infiniband stack. Our goal is to avoid pinning pages in
> memory regions registered for IB communication, so we need to get
> notifications for invalidations on such memory regions, and stop the hardware
> from continuing its access to the invalidated pages. The hardware operation
> that flushes the page tables can block, so we need to sleep until the hardware
> is guaranteed not to access these pages anymore.

The first patch moves the mentioned notifier functions out of the PTL, and the
second patch changes the change_pte notification to stop calling
invalidate_page as a default.

Regards,
Haggai Eran

Changes from V0:
- Fixed a bug in patch 1 that prevented compilation without MMU notifiers.
- Dropped the patches 2 and 3 that were moving tlb_gather_mmu calls.
- Added a patch to handle invalidate_page being called from change_pte.

Haggai Eran (1):
  mm: Wrap calls to set_pte_at_notify with invalidate_range_start and
    invalidate_range_end

Sagi Grimberg (1):
  mm: Move all mmu notifier invocations to be done outside the PT lock

 include/linux/mmu_notifier.h | 47 --------------------------------------------
 kernel/events/uprobes.c      |  2 ++
 mm/filemap_xip.c             |  4 +++-
 mm/huge_memory.c             | 32 ++++++++++++++++++++++++------
 mm/hugetlb.c                 | 15 ++++++++------
 mm/ksm.c                     | 13 ++++++++++--
 mm/memory.c                  |  9 ++++++++-
 mm/mmu_notifier.c            |  6 ------
 mm/rmap.c                    | 27 ++++++++++++++++++-------
 9 files changed, 79 insertions(+), 76 deletions(-)

-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock
  2012-09-04  8:41 [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
@ 2012-09-04  8:41 ` Haggai Eran
  2012-09-04 22:07   ` Andrew Morton
  2012-09-04  8:41 ` [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
  2012-09-04 22:06 ` [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
  2 siblings, 1 reply; 12+ messages in thread
From: Haggai Eran @ 2012-09-04  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Peter Zijlstra, Xiao Guangrong,
	Haggai Eran, Shachar Raindel, Sagi Grimberg, Or Gerlitz,
	Andrea Arcangeli

From: Sagi Grimberg <sagig@mellanox.com>

In order to allow sleeping during mmu notifier calls, we need to avoid
invoking them under the page table spinlock. This patch solves the
problem by calling invalidate_page notification after releasing
the lock (but before freeing the page itself), or by wrapping the page
invalidation with calls to invalidate_range_begin and
invalidate_range_end.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 include/linux/mmu_notifier.h | 47 --------------------------------------------
 mm/filemap_xip.c             |  4 +++-
 mm/huge_memory.c             | 32 ++++++++++++++++++++++++------
 mm/hugetlb.c                 | 15 ++++++++------
 mm/memory.c                  | 10 +++++++---
 mm/rmap.c                    | 27 ++++++++++++++++++-------
 6 files changed, 65 insertions(+), 70 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ee2baf0..2702bcb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -246,50 +246,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 		__mmu_notifier_mm_destroy(mm);
 }
 
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is implemented as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
-	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
-	__pte;								\
-})
-
-#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	pmd_t __pmd;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-	__pmd;								\
-})
-
-#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-})
-
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -370,9 +326,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-#define pmdp_clear_flush_notify pmdp_clear_flush
-#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 13e013b..a002a6d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,11 +193,13 @@ retry:
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush_notify(vma, address, pte);
+			pteval = ptep_clear_flush(vma, address, pte);
 			page_remove_rmap(page);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
+			/* must invalidate_page _before_ freeing the page */
+			mmu_notifier_invalidate_page(mm, address);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..5a5b9e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -868,12 +868,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
 	VM_BUG_ON(!PageHead(page));
 
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = get_pmd_huge_pte(mm);
@@ -896,6 +898,9 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(&mm->page_table_lock);
 
+	mmu_notifier_invalidate_range_end(vma->vm_mm, haddr,
+					  haddr + HPAGE_PMD_SIZE);
+
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
 
@@ -904,6 +909,7 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -970,20 +976,22 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
+	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
+
 	spin_lock(&mm->page_table_lock);
 	put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(&mm->page_table_lock);
 		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-		goto out;
+		goto out_mn;
 	} else {
 		pmd_t entry;
 		VM_BUG_ON(!PageHead(page));
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		entry = pmd_mkhuge(entry);
-		pmdp_clear_flush_notify(vma, haddr, pmd);
+		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, entry);
@@ -991,10 +999,14 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		ret |= VM_FAULT_WRITE;
 	}
-out_unlock:
 	spin_unlock(&mm->page_table_lock);
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, haddr, haddr + HPAGE_PMD_SIZE);
 out:
 	return ret;
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
 }
 
 struct page *follow_trans_huge_pmd(struct mm_struct *mm,
@@ -1208,6 +1220,8 @@ static int __split_huge_page_splitting(struct page *page,
 	pmd_t *pmd;
 	int ret = 0;
 
+	mmu_notifier_invalidate_range_start(mm, address,
+					    address + HPAGE_PMD_SIZE);
 	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
@@ -1219,10 +1233,12 @@ static int __split_huge_page_splitting(struct page *page,
 		 * and it won't wait on the anon_vma->root->mutex to
 		 * serialize against split_huge_page*.
 		 */
-		pmdp_splitting_flush_notify(vma, address, pmd);
+		pmdp_splitting_flush(vma, address, pmd);
 		ret = 1;
 	}
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, address,
+					  address + HPAGE_PMD_SIZE);
 
 	return ret;
 }
@@ -1937,6 +1953,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	ptl = pte_lockptr(mm, pmd);
 
+	mmu_notifier_invalidate_range_start(mm, address,
+					    address + HPAGE_PMD_SIZE);
 	spin_lock(&mm->page_table_lock); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -1944,8 +1962,10 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * huge and small TLB entries for the same virtual address
 	 * to avoid the risk of CPU bugs in that area.
 	 */
-	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, address,
+					  address + HPAGE_PMD_SIZE);
 
 	spin_lock(ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..c569b97 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2611,6 +2611,9 @@ retry_avoidcopy:
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
+	mmu_notifier_invalidate_range_start(mm,
+		address & huge_page_mask(h),
+		(address & huge_page_mask(h)) + huge_page_size(h));
 	/*
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
@@ -2619,9 +2622,6 @@ retry_avoidcopy:
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
-		mmu_notifier_invalidate_range_start(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
@@ -2629,10 +2629,13 @@ retry_avoidcopy:
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
-		mmu_notifier_invalidate_range_end(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 	}
+	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm,
+		address & huge_page_mask(h),
+		(address & huge_page_mask(h)) + huge_page_size(h));
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..b657a2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2516,7 +2516,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		spinlock_t *ptl, pte_t orig_pte)
 	__releases(ptl)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int ret = 0;
 	int page_mkwrite = 0;
@@ -2760,10 +2760,14 @@ gotten:
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
-	if (new_page)
-		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (new_page) {
+		if (new_page == old_page)
+			/* cow happened, notify before releasing old_page */
+			mmu_notifier_invalidate_page(mm, address);
+		page_cache_release(new_page);
+	}
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..f13e6cf 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -694,7 +694,7 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			unsigned long *vm_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int referenced = 0;
+	int referenced = 0, clear_flush_young = 0;
 
 	if (unlikely(PageTransHuge(page))) {
 		pmd_t *pmd;
@@ -741,7 +741,8 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 			goto out;
 		}
 
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+		clear_flush_young = 1;
+		if (ptep_clear_flush_young(vma, address, pte)) {
 			/*
 			 * Don't treat a reference through a sequentially read
 			 * mapping as such.  If the page has been used in
@@ -757,6 +758,9 @@ int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 
 	(*mapcount)--;
 
+	if (clear_flush_young)
+		referenced += mmu_notifier_clear_flush_young(mm, address);
+
 	if (referenced)
 		*vm_flags |= vma->vm_flags;
 out:
@@ -929,7 +933,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		pte_t entry;
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush_notify(vma, address, pte);
+		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -937,6 +941,9 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	pte_unmap_unlock(pte, ptl);
+
+	if (ret)
+		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
 }
@@ -1256,7 +1263,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush_notify(vma, address, pte);
+	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -1318,6 +1325,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	if (ret != SWAP_FAIL)
+		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
 
@@ -1382,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
-	unsigned long end;
+	unsigned long start, end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
 
@@ -1405,6 +1414,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	if (!pmd_present(*pmd))
 		return ret;
 
+	start = address;
+	mmu_notifier_invalidate_range_start(mm, start, end);
+
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
 	 * keep the sem while scanning the cluster for mlocking pages.
@@ -1433,12 +1445,12 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 			continue;	/* don't unmap */
 		}
 
-		if (ptep_clear_flush_young_notify(vma, address, pte))
+		if (ptep_clear_flush_young(vma, address, pte))
 			continue;
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush_notify(vma, address, pte);
+		pteval = ptep_clear_flush(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
@@ -1454,6 +1466,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	mmu_notifier_invalidate_range_end(mm, start, end);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock
  2012-09-04  8:41 ` [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
@ 2012-09-04 22:07   ` Andrew Morton
  2012-09-06 14:34     ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2012-09-04 22:07 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Shachar Raindel, Sagi Grimberg, Or Gerlitz, Andrea Arcangeli

On Tue,  4 Sep 2012 11:41:20 +0300
Haggai Eran <haggaie@mellanox.com> wrote:

> From: Sagi Grimberg <sagig@mellanox.com>
> 
> In order to allow sleeping during mmu notifier calls, we need to avoid
> invoking them under the page table spinlock. This patch solves the
> problem by calling invalidate_page notification after releasing
> the lock (but before freeing the page itself), or by wrapping the page
> invalidation with calls to invalidate_range_begin and
> invalidate_range_end.
> 
>
> ...
>
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -868,12 +868,14 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  		cond_resched();
>  	}
>  
> +	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
> +
>  	spin_lock(&mm->page_table_lock);
>  	if (unlikely(!pmd_same(*pmd, orig_pmd)))
>  		goto out_free_pages;
>  	VM_BUG_ON(!PageHead(page));
>  
> -	pmdp_clear_flush_notify(vma, haddr, pmd);
> +	pmdp_clear_flush(vma, haddr, pmd);
>  	/* leave pmd empty until pte is filled */
>  
>  	pgtable = get_pmd_huge_pte(mm);
> @@ -896,6 +898,9 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
>  	page_remove_rmap(page);
>  	spin_unlock(&mm->page_table_lock);
>  
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, haddr,
> +					  haddr + HPAGE_PMD_SIZE);

But `haddr' has been altered by the time we get here.  We should have
saved the original value?

That's a thing I don't like about this patchset - it adds maintenance
overhead.  This need to retain values of local variables or incoming
args across lengthy code sequences is fragile.  We could easily break
your changes as the core code evolves, and it would take a long long
time before anyone noticed the breakage.

I'm wondering if it would be better to adopt the convention throughout
this patchset that mmu_notifier_invalidate_range_start() and
mmu_notifier_invalidate_range_end() always use their own locals.  ie:

	unsigned long mmun_start;	/* For mmu_notifiers */
	unsigned long mmun_end;		/* For mmu_notifiers */

	...

	mmun_start = ...;
	mmun_end = ...;

	...

	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

	...

	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);


This is verbose, but it is explicit and clear and more robust than what
you have.  It shouldn't generate any additional text or stack usage or
register usage unless gcc is having an especially stupid day.


>  	ret |= VM_FAULT_WRITE;
>  	put_page(page);
>  
>
> ...
>
> @@ -1382,7 +1391,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	spinlock_t *ptl;
>  	struct page *page;
>  	unsigned long address;
> -	unsigned long end;
> +	unsigned long start, end;

You'll note that this function uses the one-definition-per-line
convention, which has a few (smallish) advantages over
multiple-definitions-per-line.  One such advantage is that it leaves
room for a nice little comment at the RHS.  Take that as a hint ;)

>  	int ret = SWAP_AGAIN;
>  	int locked_vma = 0;
>  
> @@ -1405,6 +1414,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	if (!pmd_present(*pmd))
>  		return ret;
>  
> +	start = address;
> +	mmu_notifier_invalidate_range_start(mm, start, end);

`end' is used uninitialised in this function.

I'm surprised that it didn't generate a warning(?) and I worry about
the testing coverage?

>  	/*
>  	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
>  	 * keep the sem while scanning the cluster for mlocking pages.
>
> ...
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods
  2012-09-04 22:07   ` Andrew Morton
@ 2012-09-06 14:34     ` Haggai Eran
  2012-09-06 14:34       ` [PATCH V2 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
                         ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Haggai Eran @ 2012-09-06 14:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Or Gerlitz, Sagi Grimberg, Haggai Eran, Shachar Raindel,
	Liran Liss

> The following short patch series completes the support for allowing clients to
> sleep in mmu notifiers (specifically in invalidate_page and
> invalidate_range_start/end), adding on the work done by Andrea Arcangeli and
> Sagi Grimberg in http://marc.info/?l=linux-mm&m=133113297028676&w=3
>
> This patchset is a preliminary step towards on-demand paging design to be
> added to the Infiniband stack. Our goal is to avoid pinning pages in
> memory regions registered for IB communication, so we need to get
> notifications for invalidations on such memory regions, and stop the hardware
> from continuing its access to the invalidated pages. The hardware operation
> that flushes the page tables can block, so we need to sleep until the hardware
> is guaranteed not to access these pages anymore.
>
> The first patch moves the mentioned notifier functions out of the PTL, and the
> second patch changes the change_pte notification to stop calling
> invalidate_page as a default.

On Wed, 5 Sep 2012 01:07:42 +0300, Andrew Morton wrote:
> On Tue,  4 Sep 2012 11:41:20 +0300
> Haggai Eran <haggaie@mellanox.com> wrote:
>> @@ -1405,6 +1414,9 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>>  	if (!pmd_present(*pmd))
>>  		return ret;
>>  
>> +	start = address;
>> +	mmu_notifier_invalidate_range_start(mm, start, end);
> `end' is used uninitialised in this function.

I don't think it is. You might think so because the patch didn't initialize it
itself - it was already defined in this function. Anyway, to make it more clear
I've used your suggested convention with mmun_start/end in this function as
well as the others.

> I'm surprised that it didn't generate a warning(?) and I worry about
> the testing coverage?

I tried to test these patches by writing a small module that registered as an
mmu notifiers client. The module used might_sleep() in each notifier to verify
that it was called from a sleepable context. I then used a set of user space
tests that attempted to invoke various mmu notifiers. I had tests for:
* munmap
* fork and copy-on-write breaking (either with regular pages or huge pages)
* swapping out regular pages
* swapping out a nonlinear vma
* madvise with MADV_DONTNEED and with MADV_REMOVE
* KSM
* mremap
* mprotect
* transparent huge pages

The module exported the notifications to the user space programs, and it
checked that range invalidations came in matching pairs of begin and end,
but only after you wrote about the bug in V1 I noticed that I didn't have a
test for transparent huge pages COW breaking where the new huge page allocation
fails (do_huge_pmd_wp_page_fallback). Before sending V2 I've added a new test
for that, using fail_page_alloc.

Changes from V1:
- Add the motivation for on-demand paging in patch 1 changelog.

- Fix issues in patch 1 where invalidate_range_begin and invalidate_range_end
  are called with different arguments.

- Used the convention Andrew suggested in both patches to make it a little
  harder for such bugs to be introduced in the future.

- Dropped changes in patch 1 that moved calls to ptep_clear_flush_young_notify
  out of the PTL. The patch doesn't intend to make clear_flush_young
  notification sleepable, only invalidate_range_begin/end and invalidate_page.

Changes from V0:
- Fixed a bug in patch 1 that prevented compilation without MMU notifiers.
- Dropped the patches 2 and 3 that were moving tlb_gather_mmu calls.
- Added a patch to handle invalidate_page being called from change_pte.

Haggai Eran (1):
  mm: Wrap calls to set_pte_at_notify with invalidate_range_start and
    invalidate_range_end

Sagi Grimberg (1):
  mm: Move all mmu notifier invocations to be done outside the PT lock

 include/linux/mmu_notifier.h | 47 --------------------------------------------
 kernel/events/uprobes.c      |  5 +++++
 mm/filemap_xip.c             |  4 +++-
 mm/huge_memory.c             | 42 +++++++++++++++++++++++++++++++++------
 mm/hugetlb.c                 | 21 ++++++++++++--------
 mm/ksm.c                     | 21 ++++++++++++++++++--
 mm/memory.c                  | 25 ++++++++++++++++++-----
 mm/mmu_notifier.c            |  6 ------
 mm/mremap.c                  |  8 ++++++--
 mm/rmap.c                    | 18 ++++++++++++++---
 10 files changed, 117 insertions(+), 80 deletions(-)

-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V2 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock
  2012-09-06 14:34     ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
@ 2012-09-06 14:34       ` Haggai Eran
  2012-09-06 14:34       ` [PATCH V2 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
  2012-09-06 20:08       ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
  2 siblings, 0 replies; 12+ messages in thread
From: Haggai Eran @ 2012-09-06 14:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Or Gerlitz, Sagi Grimberg, Haggai Eran, Shachar Raindel,
	Liran Liss, Andrea Arcangeli

From: Sagi Grimberg <sagig@mellanox.com>

In order to allow sleeping during mmu notifier calls, we need to avoid
invoking them under the page table spinlock. This patch solves the
problem by calling invalidate_page notification after releasing
the lock (but before freeing the page itself), or by wrapping the page
invalidation with calls to invalidate_range_begin and
invalidate_range_end.

To prevent accidental changes to the invalidate_range_end arguments after
the call to invalidate_range_begin, the patch introduces a convention of
saving the arguments in consistently named locals:

> unsigned long mmun_start;	/* For mmu_notifiers */
> unsigned long mmun_end;	/* For mmu_notifiers */
>
> ...
>
> mmun_start = ...
> mmun_end = ...
> mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
>
> ...
>
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

The patch changes code to use this convention for all calls to
mmu_notifier_invalidate_range_start/end, except those where the calls are
close enough so that anyone who glances at the code can see the values
aren't changing.

This patchset is a preliminary step towards on-demand paging design to be
added to the RDMA stack.

On 05/09/2012 01:06, Andrew Morton wrote:
> Exactly why do we want on-demand paging for Infiniband?

Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.

> Why should anyone care?  What problems are users currently experiencing?

This can make programming with RDMA much simpler. Today, developers that
are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we might
be able to provide a single memory access key for each process that
would provide the entire process's address as one large memory region,
and the developers wouldn't need to register memory regions at all.

> Is there any prospect that any other subsystems will utilise these
> infrastructural changes?  If so, which and how, etc?

As for other subsystems, I understand that XPMEM wanted to sleep in MMU
notifiers, as Christoph Lameter wrote at
http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html
and perhaps Andrea knows about other use cases.

Scheduling in mmu notifications is required since we need to sync the
hardware with the secondary page tables change. A TLB flush of an IO
device is inherently slower than a CPU TLB flush, so our design works by
sending the invalidation request to the device, and waiting for an
interrupt before exiting the mmu notifier handler.

Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 include/linux/mmu_notifier.h | 47 --------------------------------------------
 mm/filemap_xip.c             |  4 +++-
 mm/huge_memory.c             | 42 +++++++++++++++++++++++++++++++++------
 mm/hugetlb.c                 | 21 ++++++++++++--------
 mm/memory.c                  | 24 +++++++++++++++-------
 mm/mremap.c                  |  8 ++++++--
 mm/rmap.c                    | 18 ++++++++++++++---
 7 files changed, 90 insertions(+), 74 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index ee2baf0..2702bcb 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -246,50 +246,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 		__mmu_notifier_mm_destroy(mm);
 }
 
-/*
- * These two macros will sometime replace ptep_clear_flush.
- * ptep_clear_flush is implemented as macro itself, so this also is
- * implemented as a macro until ptep_clear_flush will converted to an
- * inline function, to diminish the risk of compilation failure. The
- * invalidate_page method over time can be moved outside the PT lock
- * and these two macros can be later removed.
- */
-#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	pte_t __pte;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	__pte = ptep_clear_flush(___vma, ___address, __ptep);		\
-	mmu_notifier_invalidate_page(___vma->vm_mm, ___address);	\
-	__pte;								\
-})
-
-#define pmdp_clear_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	pmd_t __pmd;							\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	__pmd = pmdp_clear_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-	__pmd;								\
-})
-
-#define pmdp_splitting_flush_notify(__vma, __address, __pmdp)		\
-({									\
-	struct vm_area_struct *___vma = __vma;				\
-	unsigned long ___address = __address;				\
-	VM_BUG_ON(__address & ~HPAGE_PMD_MASK);				\
-	mmu_notifier_invalidate_range_start(___vma->vm_mm, ___address,	\
-					    (__address)+HPAGE_PMD_SIZE);\
-	pmdp_splitting_flush(___vma, ___address, __pmdp);		\
-	mmu_notifier_invalidate_range_end(___vma->vm_mm, ___address,	\
-					  (__address)+HPAGE_PMD_SIZE);	\
-})
-
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
 ({									\
 	int __young;							\
@@ -370,9 +326,6 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define pmdp_clear_flush_young_notify pmdp_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-#define pmdp_clear_flush_notify pmdp_clear_flush
-#define pmdp_splitting_flush_notify pmdp_splitting_flush
 #define set_pte_at_notify set_pte_at
 
 #endif /* CONFIG_MMU_NOTIFIER */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 13e013b..a002a6d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -193,11 +193,13 @@ retry:
 		if (pte) {
 			/* Nuke the page table entry. */
 			flush_cache_page(vma, address, pte_pfn(*pte));
-			pteval = ptep_clear_flush_notify(vma, address, pte);
+			pteval = ptep_clear_flush(vma, address, pte);
 			page_remove_rmap(page);
 			dec_mm_counter(mm, MM_FILEPAGES);
 			BUG_ON(pte_dirty(pteval));
 			pte_unmap_unlock(pte, ptl);
+			/* must invalidate_page _before_ freeing the page */
+			mmu_notifier_invalidate_page(mm, address);
 			page_cache_release(page);
 		}
 	}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..e7b40bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -832,6 +832,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	pmd_t _pmd;
 	int ret = 0, i;
 	struct page **pages;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	pages = kmalloc(sizeof(struct page *) * HPAGE_PMD_NR,
 			GFP_KERNEL);
@@ -868,12 +870,16 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 		cond_resched();
 	}
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, orig_pmd)))
 		goto out_free_pages;
 	VM_BUG_ON(!PageHead(page));
 
-	pmdp_clear_flush_notify(vma, haddr, pmd);
+	pmdp_clear_flush(vma, haddr, pmd);
 	/* leave pmd empty until pte is filled */
 
 	pgtable = get_pmd_huge_pte(mm);
@@ -896,6 +902,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
 	page_remove_rmap(page);
 	spin_unlock(&mm->page_table_lock);
 
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
 	ret |= VM_FAULT_WRITE;
 	put_page(page);
 
@@ -904,6 +912,7 @@ out:
 
 out_free_pages:
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	mem_cgroup_uncharge_start();
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		mem_cgroup_uncharge_page(pages[i]);
@@ -920,6 +929,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret = 0;
 	struct page *page, *new_page;
 	unsigned long haddr;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	VM_BUG_ON(!vma->anon_vma);
 	spin_lock(&mm->page_table_lock);
@@ -970,20 +981,24 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
 	__SetPageUptodate(new_page);
 
+	mmun_start = haddr;
+	mmun_end   = haddr + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	spin_lock(&mm->page_table_lock);
 	put_page(page);
 	if (unlikely(!pmd_same(*pmd, orig_pmd))) {
 		spin_unlock(&mm->page_table_lock);
 		mem_cgroup_uncharge_page(new_page);
 		put_page(new_page);
-		goto out;
+		goto out_mn;
 	} else {
 		pmd_t entry;
 		VM_BUG_ON(!PageHead(page));
 		entry = mk_pmd(new_page, vma->vm_page_prot);
 		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 		entry = pmd_mkhuge(entry);
-		pmdp_clear_flush_notify(vma, haddr, pmd);
+		pmdp_clear_flush(vma, haddr, pmd);
 		page_add_new_anon_rmap(new_page, vma, haddr);
 		set_pmd_at(mm, haddr, pmd, entry);
 		update_mmu_cache(vma, address, entry);
@@ -991,10 +1006,14 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		put_page(page);
 		ret |= VM_FAULT_WRITE;
 	}
-out_unlock:
 	spin_unlock(&mm->page_table_lock);
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
 	return ret;
+out_unlock:
+	spin_unlock(&mm->page_table_lock);
+	return ret;
 }
 
 struct page *follow_trans_huge_pmd(struct mm_struct *mm,
@@ -1207,7 +1226,11 @@ static int __split_huge_page_splitting(struct page *page,
 	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
 	int ret = 0;
+	/* For mmu_notifiers */
+	const unsigned long mmun_start = address;
+	const unsigned long mmun_end   = address + HPAGE_PMD_SIZE;
 
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock);
 	pmd = page_check_address_pmd(page, mm, address,
 				     PAGE_CHECK_ADDRESS_PMD_NOTSPLITTING_FLAG);
@@ -1219,10 +1242,11 @@ static int __split_huge_page_splitting(struct page *page,
 		 * and it won't wait on the anon_vma->root->mutex to
 		 * serialize against split_huge_page*.
 		 */
-		pmdp_splitting_flush_notify(vma, address, pmd);
+		pmdp_splitting_flush(vma, address, pmd);
 		ret = 1;
 	}
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	return ret;
 }
@@ -1849,6 +1873,8 @@ static void collapse_huge_page(struct mm_struct *mm,
 	spinlock_t *ptl;
 	int isolated;
 	unsigned long hstart, hend;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 #ifndef CONFIG_NUMA
@@ -1937,6 +1963,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	pte = pte_offset_map(pmd, address);
 	ptl = pte_lockptr(mm, pmd);
 
+	mmun_start = address;
+	mmun_end   = address + HPAGE_PMD_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	spin_lock(&mm->page_table_lock); /* probably unnecessary */
 	/*
 	 * After this gup_fast can't run anymore. This also removes
@@ -1944,8 +1973,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * huge and small TLB entries for the same virtual address
 	 * to avoid the risk of CPU bugs in that area.
 	 */
-	_pmd = pmdp_clear_flush_notify(vma, address, pmd);
+	_pmd = pmdp_clear_flush(vma, address, pmd);
 	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 
 	spin_lock(ptl);
 	isolated = __collapse_huge_page_isolate(vma, address, pte);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bc72712..aaa3b4b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2355,13 +2355,15 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	struct page *page;
 	struct hstate *h = hstate_vma(vma);
 	unsigned long sz = huge_page_size(h);
+	const unsigned long mmun_start = start;	/* For mmu_notifiers */
+	const unsigned long mmun_end   = end;	/* For mmu_notifiers */
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
 	BUG_ON(end & ~huge_page_mask(h));
 
 	tlb_start_vma(tlb, vma);
-	mmu_notifier_invalidate_range_start(mm, start, end);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 again:
 	spin_lock(&mm->page_table_lock);
 	for (address = start; address < end; address += sz) {
@@ -2425,7 +2427,7 @@ again:
 		if (address < end && !ref_page)
 			goto again;
 	}
-	mmu_notifier_invalidate_range_end(mm, start, end);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -2525,6 +2527,8 @@ static int hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct page *old_page, *new_page;
 	int avoidcopy;
 	int outside_reserve = 0;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_page = pte_page(pte);
 
@@ -2611,6 +2615,9 @@ retry_avoidcopy:
 			    pages_per_huge_page(h));
 	__SetPageUptodate(new_page);
 
+	mmun_start = address & huge_page_mask(h);
+	mmun_end   = (address & huge_page_mask(h)) + huge_page_size(h);
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	/*
 	 * Retake the page_table_lock to check for racing updates
 	 * before the page tables are altered
@@ -2619,9 +2626,6 @@ retry_avoidcopy:
 	ptep = huge_pte_offset(mm, address & huge_page_mask(h));
 	if (likely(pte_same(huge_ptep_get(ptep), pte))) {
 		/* Break COW */
-		mmu_notifier_invalidate_range_start(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 		huge_ptep_clear_flush(vma, address, ptep);
 		set_huge_pte_at(mm, address, ptep,
 				make_huge_pte(vma, new_page, 1));
@@ -2629,10 +2633,11 @@ retry_avoidcopy:
 		hugepage_add_new_anon_rmap(new_page, vma, address);
 		/* Make the old page be freed below */
 		new_page = old_page;
-		mmu_notifier_invalidate_range_end(mm,
-			address & huge_page_mask(h),
-			(address & huge_page_mask(h)) + huge_page_size(h));
 	}
+	spin_unlock(&mm->page_table_lock);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	/* Caller expects lock to be held */
+	spin_lock(&mm->page_table_lock);
 	page_cache_release(new_page);
 	page_cache_release(old_page);
 	return 0;
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..fe7948f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1039,6 +1039,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	unsigned long next;
 	unsigned long addr = vma->vm_start;
 	unsigned long end = vma->vm_end;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 	int ret;
 
 	/*
@@ -1071,8 +1073,12 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	if (is_cow_mapping(vma->vm_flags))
-		mmu_notifier_invalidate_range_start(src_mm, addr, end);
+	if (is_cow_mapping(vma->vm_flags)) {
+		mmun_start = addr;
+		mmun_end   = end;
+		mmu_notifier_invalidate_range_start(src_mm, mmun_start,
+						    mmun_end);
+	}
 
 	ret = 0;
 	dst_pgd = pgd_offset(dst_mm, addr);
@@ -1089,8 +1095,8 @@ int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow_mapping(vma->vm_flags))
-		mmu_notifier_invalidate_range_end(src_mm,
-						  vma->vm_start, end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
+						  mmun_end);
 	return ret;
 }
 
@@ -2516,7 +2522,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		spinlock_t *ptl, pte_t orig_pte)
 	__releases(ptl)
 {
-	struct page *old_page, *new_page;
+	struct page *old_page, *new_page = NULL;
 	pte_t entry;
 	int ret = 0;
 	int page_mkwrite = 0;
@@ -2760,10 +2766,14 @@ gotten:
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
-	if (new_page)
-		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
+	if (new_page) {
+		if (new_page == old_page)
+			/* cow happened, notify before releasing old_page */
+			mmu_notifier_invalidate_page(mm, address);
+		page_cache_release(new_page);
+	}
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mremap.c b/mm/mremap.c
index cc06d0e..0bf94d0 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -127,11 +127,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, next, old_end;
 	pmd_t *old_pmd, *new_pmd;
 	bool need_flush = false;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
-	mmu_notifier_invalidate_range_start(vma->vm_mm, old_addr, old_end);
+	mmun_start = old_addr;
+	mmun_end   = old_end;
+	mmu_notifier_invalidate_range_start(vma->vm_mm, mmun_start, mmun_end);
 
 	for (; old_addr < old_end; old_addr += extent, new_addr += extent) {
 		cond_resched();
@@ -175,7 +179,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	if (likely(need_flush))
 		flush_tlb_range(vma, old_end-len, old_addr);
 
-	mmu_notifier_invalidate_range_end(vma->vm_mm, old_end-len, old_end);
+	mmu_notifier_invalidate_range_end(vma->vm_mm, mmun_start, mmun_end);
 
 	return len + old_addr - old_end;	/* how much done */
 }
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..f0b4d54 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -929,7 +929,7 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		pte_t entry;
 
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		entry = ptep_clear_flush_notify(vma, address, pte);
+		entry = ptep_clear_flush(vma, address, pte);
 		entry = pte_wrprotect(entry);
 		entry = pte_mkclean(entry);
 		set_pte_at(mm, address, pte, entry);
@@ -937,6 +937,9 @@ static int page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	pte_unmap_unlock(pte, ptl);
+
+	if (ret)
+		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
 }
@@ -1256,7 +1259,7 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 	/* Nuke the page table entry. */
 	flush_cache_page(vma, address, page_to_pfn(page));
-	pteval = ptep_clear_flush_notify(vma, address, pte);
+	pteval = ptep_clear_flush(vma, address, pte);
 
 	/* Move the dirty bit to the physical page now the pte is gone. */
 	if (pte_dirty(pteval))
@@ -1318,6 +1321,8 @@ int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
+	if (ret != SWAP_FAIL)
+		mmu_notifier_invalidate_page(mm, address);
 out:
 	return ret;
 
@@ -1382,6 +1387,8 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	spinlock_t *ptl;
 	struct page *page;
 	unsigned long address;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 	unsigned long end;
 	int ret = SWAP_AGAIN;
 	int locked_vma = 0;
@@ -1405,6 +1412,10 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 	if (!pmd_present(*pmd))
 		return ret;
 
+	mmun_start = address;
+	mmun_end   = end;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	/*
 	 * If we can acquire the mmap_sem for read, and vma is VM_LOCKED,
 	 * keep the sem while scanning the cluster for mlocking pages.
@@ -1438,7 +1449,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 
 		/* Nuke the page table entry. */
 		flush_cache_page(vma, address, pte_pfn(*pte));
-		pteval = ptep_clear_flush_notify(vma, address, pte);
+		pteval = ptep_clear_flush(vma, address, pte);
 
 		/* If nonlinear, store the file page offset in the pte. */
 		if (page->index != linear_page_index(vma, address))
@@ -1454,6 +1465,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		(*mapcount)--;
 	}
 	pte_unmap_unlock(pte - 1, ptl);
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	if (locked_vma)
 		up_read(&vma->vm_mm->mmap_sem);
 	return ret;
-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH V2 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
  2012-09-06 14:34     ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
  2012-09-06 14:34       ` [PATCH V2 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
@ 2012-09-06 14:34       ` Haggai Eran
  2012-09-06 20:08       ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
  2 siblings, 0 replies; 12+ messages in thread
From: Haggai Eran @ 2012-09-06 14:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Or Gerlitz, Sagi Grimberg, Haggai Eran, Shachar Raindel,
	Liran Liss

In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its
direct calls, invalidate_page can also be called as a substitute for a
change_pte call, in case the notifier client hasn't implemented
change_pte.

This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.

Note that change_pte still cannot sleep after this patch, and that
clients implementing change_pte should not take action on it in case the
number of outstanding invalidate_range_start calls is larger than one,
otherwise they might miss a later invalidation.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 kernel/events/uprobes.c |  5 +++++
 mm/ksm.c                | 21 +++++++++++++++++++--
 mm/memory.c             | 17 +++++++++++------
 mm/mmu_notifier.c       |  6 ------
 4 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c08a22d..abe568b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -141,10 +141,14 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	int err;
+	/* For mmu_notifiers */
+	const unsigned long mmun_start = addr;
+	const unsigned long mmun_end   = addr + PAGE_SIZE;
 
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -173,6 +177,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..6f00463 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -709,15 +709,22 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	int swapped;
 	int err = -EFAULT;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
 		goto out;
 
 	BUG_ON(PageTransCompound(page));
+
+	mmun_start = addr;
+	mmun_end   = addr + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
-		goto out;
+		goto out_mn;
 
 	if (pte_write(*ptep) || pte_dirty(*ptep)) {
 		pte_t entry;
@@ -752,6 +759,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
 	return err;
 }
@@ -776,6 +785,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	spinlock_t *ptl;
 	unsigned long addr;
 	int err = -EFAULT;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	addr = page_address_in_vma(page, vma);
 	if (addr == -EFAULT)
@@ -794,10 +805,14 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd_present(*pmd))
 		goto out;
 
+	mmun_start = addr;
+	mmun_end   = addr + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
-		goto out;
+		goto out_mn;
 	}
 
 	get_page(kpage);
@@ -814,6 +829,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index fe7948f..3c88368 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2527,6 +2527,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	int ret = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
+	unsigned long mmun_start;	/* For mmu_notifiers */
+	unsigned long mmun_end;		/* For mmu_notifiers */
 
 	old_page = vm_normal_page(vma, address, orig_pte);
 	if (!old_page) {
@@ -2704,6 +2706,10 @@ gotten:
 	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
+	mmun_start = address & PAGE_MASK;
+	mmun_end   = (address & PAGE_MASK) + PAGE_SIZE;
+	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -2766,14 +2772,13 @@ gotten:
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
+	if (new_page)
+		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (new_page) {
-		if (new_page == old_page)
-			/* cow happened, notify before releasing old_page */
-			mmu_notifier_invalidate_page(mm, address);
-		page_cache_release(new_page);
-	}
+	if (new_page)
+		/* Only call the end notifier if the begin was called. */
+		mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 441dae0..89d84d0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -137,12 +137,6 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
 			mn->ops->change_pte(mn, mm, address, pte);
-		/*
-		 * Some drivers don't have change_pte,
-		 * so we must call invalidate_page in that case.
-		 */
-		else if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
 	}
 	srcu_read_unlock(&srcu, id);
 }
-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods
  2012-09-06 14:34     ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
  2012-09-06 14:34       ` [PATCH V2 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
  2012-09-06 14:34       ` [PATCH V2 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
@ 2012-09-06 20:08       ` Andrew Morton
  2 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2012-09-06 20:08 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Or Gerlitz, Sagi Grimberg, Shachar Raindel, Liran Liss

On Thu,  6 Sep 2012 17:34:53 +0300
Haggai Eran <haggaie@mellanox.com> wrote:

>  include/linux/mmu_notifier.h | 47 --------------------------------------------
>  kernel/events/uprobes.c      |  5 +++++
>  mm/filemap_xip.c             |  4 +++-
>  mm/huge_memory.c             | 42 +++++++++++++++++++++++++++++++++------
>  mm/hugetlb.c                 | 21 ++++++++++++--------
>  mm/ksm.c                     | 21 ++++++++++++++++++--
>  mm/memory.c                  | 25 ++++++++++++++++++-----
>  mm/mmu_notifier.c            |  6 ------
>  mm/mremap.c                  |  8 ++++++--
>  mm/rmap.c                    | 18 ++++++++++++++---
>  10 files changed, 117 insertions(+), 80 deletions(-)

ho hum, spose so - the maintenance overhead does look to be a bit less now.

I use an ancient gcc. Do you see these with newer gcc?

mm/memory.c: In function 'do_wp_page':
mm/memory.c:2529: warning: 'mmun_start' may be used uninitialized in this function
mm/memory.c:2530: warning: 'mmun_end' may be used uninitialized in this function
mm/memory.c: In function 'copy_page_range':
mm/memory.c:1042: warning: 'mmun_start' may be used uninitialized in this function
mm/memory.c:1043: warning: 'mmun_end' may be used uninitialized in this function

The copy_page_range() one is a bit of a worry.  We're assuming that the
return value of is_cow_mapping(vma->vm_flags) will not change.  It
would be pretty alarming if it *were* to change, but exactly what
guarantees this?


I fiddled a couple of minor things:

From: Andrew Morton <akpm@linux-foundation.org>
Subject: mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix

possible speed tweak in hugetlb_cow(), cleanups

Cc: Andrea Arcangeli <andrea@qumranet.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Liran Liss <liranl@mellanox.com>
Cc: Or Gerlitz <ogerlitz@mellanox.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Sagi Grimberg <sagig@mellanox.com>
Cc: Shachar Raindel <raindel@mellanox.com>
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    2 +-
 mm/memory.c  |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

--- a/mm/hugetlb.c~mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix
+++ a/mm/hugetlb.c
@@ -2616,7 +2616,7 @@ retry_avoidcopy:
 	__SetPageUptodate(new_page);
 
 	mmun_start = address & huge_page_mask(h);
-	mmun_end   = (address & huge_page_mask(h)) + huge_page_size(h);
+	mmun_end = mmun_start + huge_page_size(h);
 	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
 	/*
 	 * Retake the page_table_lock to check for racing updates
--- a/mm/memory.c~mm-move-all-mmu-notifier-invocations-to-be-done-outside-the-pt-lock-fix
+++ a/mm/memory.c
@@ -1096,8 +1096,7 @@ int copy_page_range(struct mm_struct *ds
 	} while (dst_pgd++, src_pgd++, addr = next, addr != end);
 
 	if (is_cow_mapping(vma->vm_flags))
-		mmu_notifier_invalidate_range_end(src_mm, mmun_start,
-						  mmun_end);
+		mmu_notifier_invalidate_range_end(src_mm, mmun_start, mmun_end);
 	return ret;
 }
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
  2012-09-04  8:41 [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
  2012-09-04  8:41 ` [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
@ 2012-09-04  8:41 ` Haggai Eran
  2012-09-04 22:07   ` Andrew Morton
  2012-09-04 22:06 ` [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
  2 siblings, 1 reply; 12+ messages in thread
From: Haggai Eran @ 2012-09-04  8:41 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrea Arcangeli, Andrew Morton, Peter Zijlstra, Xiao Guangrong,
	Haggai Eran, Shachar Raindel, Sagi Grimberg, Or Gerlitz

In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its
direct calls, invalidate_page can also be called as a substitute for a
change_pte call, in case the notifier client hasn't implemented
change_pte.

This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.

Note that change_pte still cannot sleep after this patch, and that
clients implementing change_pte should not take action on it in case the
number of outstanding invalidate_range_start calls is larger than one,
otherwise they might miss a later invalidation.

Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 kernel/events/uprobes.c |  2 ++
 mm/ksm.c                | 13 +++++++++++--
 mm/memory.c             | 15 +++++++++------
 mm/mmu_notifier.c       |  6 ------
 4 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c08a22d..8af2596 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -145,6 +145,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 	/* For try_to_free_swap() and munlock_vma_page() below */
 	lock_page(page);
 
+	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
 	err = -EAGAIN;
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
@@ -173,6 +174,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
 
 	err = 0;
  unlock:
+	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
 	unlock_page(page);
 	return err;
 }
diff --git a/mm/ksm.c b/mm/ksm.c
index 47c8853..7defc02 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -715,9 +715,12 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 		goto out;
 
 	BUG_ON(PageTransCompound(page));
+
+	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
+
 	ptep = page_check_address(page, mm, addr, &ptl, 0);
 	if (!ptep)
-		goto out;
+		goto out_mn;
 
 	if (pte_write(*ptep) || pte_dirty(*ptep)) {
 		pte_t entry;
@@ -752,6 +755,8 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
 
 out_unlock:
 	pte_unmap_unlock(ptep, ptl);
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
 out:
 	return err;
 }
@@ -794,10 +799,12 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 	if (!pmd_present(*pmd))
 		goto out;
 
+	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
+
 	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	if (!pte_same(*ptep, orig_pte)) {
 		pte_unmap_unlock(ptep, ptl);
-		goto out;
+		goto out_mn;
 	}
 
 	get_page(kpage);
@@ -814,6 +821,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
 
 	pte_unmap_unlock(ptep, ptl);
 	err = 0;
+out_mn:
+	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
 out:
 	return err;
 }
diff --git a/mm/memory.c b/mm/memory.c
index b657a2e..402a19e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2698,6 +2698,9 @@ gotten:
 	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
 		goto oom_free_new;
 
+	mmu_notifier_invalidate_range_start(mm, address & PAGE_MASK,
+					    (address & PAGE_MASK) + PAGE_SIZE);
+
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
@@ -2760,14 +2763,14 @@ gotten:
 	} else
 		mem_cgroup_uncharge_page(new_page);
 
+	if (new_page)
+		page_cache_release(new_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (new_page) {
-		if (new_page == old_page)
-			/* cow happened, notify before releasing old_page */
-			mmu_notifier_invalidate_page(mm, address);
-		page_cache_release(new_page);
-	}
+	if (new_page)
+		/* Only call the end notifier if the begin was called. */
+		mmu_notifier_invalidate_range_end(mm, address & PAGE_MASK,
+			(address & PAGE_MASK) + PAGE_SIZE);
 	if (old_page) {
 		/*
 		 * Don't let another task, with possibly unlocked vma,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 441dae0..89d84d0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -137,12 +137,6 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->change_pte)
 			mn->ops->change_pte(mn, mm, address, pte);
-		/*
-		 * Some drivers don't have change_pte,
-		 * so we must call invalidate_page in that case.
-		 */
-		else if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
 	}
 	srcu_read_unlock(&srcu, id);
 }
-- 
1.7.11.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end
  2012-09-04  8:41 ` [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
@ 2012-09-04 22:07   ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2012-09-04 22:07 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Shachar Raindel, Sagi Grimberg, Or Gerlitz

On Tue,  4 Sep 2012 11:41:21 +0300
Haggai Eran <haggaie@mellanox.com> wrote:

> In order to allow sleeping during invalidate_page mmu notifier calls, we
> need to avoid calling when holding the PT lock. In addition to its
> direct calls, invalidate_page can also be called as a substitute for a
> change_pte call, in case the notifier client hasn't implemented
> change_pte.
> 
> This patch drops the invalidate_page call from change_pte, and instead
> wraps all calls to change_pte with invalidate_range_start and
> invalidate_range_end calls.
> 
> Note that change_pte still cannot sleep after this patch, and that
> clients implementing change_pte should not take action on it in case the
> number of outstanding invalidate_range_start calls is larger than one,
> otherwise they might miss a later invalidation.
> 
> ...
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -145,6 +145,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  	/* For try_to_free_swap() and munlock_vma_page() below */
>  	lock_page(page);
>  
> +	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);
>  	err = -EAGAIN;
>  	ptep = page_check_address(page, mm, addr, &ptl, 0);
>  	if (!ptep)
> @@ -173,6 +174,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>  
>  	err = 0;
>   unlock:
> +	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
>  	unlock_page(page);
>  	return err;
>  }

Again, now I have to apply the patch and peer at the code to work out
whether `addr' got changed anywhere between these two calls.  And I
somehow need to ensure that `addr' will not get modified in the future
and I can't do that!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods
  2012-09-04  8:41 [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
  2012-09-04  8:41 ` [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
  2012-09-04  8:41 ` [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
@ 2012-09-04 22:06 ` Andrew Morton
  2012-09-05 12:55   ` Avi Kivity
  2012-09-05 14:01   ` Haggai Eran
  2 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2012-09-04 22:06 UTC (permalink / raw)
  To: Haggai Eran
  Cc: linux-mm, Andrea Arcangeli, Peter Zijlstra, Xiao Guangrong,
	Shachar Raindel, Sagi Grimberg, Or Gerlitz

On Tue,  4 Sep 2012 11:41:19 +0300
Haggai Eran <haggaie@mellanox.com> wrote:

> > This patchset is a preliminary step towards on-demand paging design to be
> > added to the Infiniband stack.

The above sentence is the most important part of the patchset.  Because
it answers the question "ytf is Haggai sending this stuff at me".

I'm unsure if the patchset adds runtime overhead but it does add
maintenance overhead (perhaps we can reduce this - see later emails). 
So we need to take a close look at what we're getting in return for
that overhead, please.

Exactly why do we want on-demand paging for Infiniband?  Why should
anyone care?  What problems are users currently experiencing?  How many
users and how serious are the problems and what if any workarounds are
available?

Is there any prospect that any other subsystems will utilise these
infrastructural changes?  If so, which and how, etc?

IOW, sell this code to us!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods
  2012-09-04 22:06 ` [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
@ 2012-09-05 12:55   ` Avi Kivity
  2012-09-05 14:01   ` Haggai Eran
  1 sibling, 0 replies; 12+ messages in thread
From: Avi Kivity @ 2012-09-05 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Haggai Eran, linux-mm, Andrea Arcangeli, Peter Zijlstra,
	Xiao Guangrong, Shachar Raindel, Sagi Grimberg, Or Gerlitz

On 09/05/2012 01:06 AM, Andrew Morton wrote:
> On Tue,  4 Sep 2012 11:41:19 +0300
> Haggai Eran <haggaie@mellanox.com> wrote:
> 
>> > This patchset is a preliminary step towards on-demand paging design to be
>> > added to the Infiniband stack.
> 
> The above sentence is the most important part of the patchset.  Because
> it answers the question "ytf is Haggai sending this stuff at me".
> 
> I'm unsure if the patchset adds runtime overhead but it does add
> maintenance overhead (perhaps we can reduce this - see later emails). 
> So we need to take a close look at what we're getting in return for
> that overhead, please.
> 
> Exactly why do we want on-demand paging for Infiniband?  Why should
> anyone care?  What problems are users currently experiencing?  How many
> users and how serious are the problems and what if any workarounds are
> available?
> 
> Is there any prospect that any other subsystems will utilise these
> infrastructural changes?  If so, which and how, etc?
> 
> 
> 
> IOW, sell this code to us!

kvm may be a buyer.  kvm::mmu_lock, which serializes guest page faults,
also protects long operations such as destroying large ranges.  It would
be good to convert it into a spinlock, but as it is used inside mmu
notifiers, this cannot be done.

(there are alternatives, such as keeping the spinlock and using a
generation counter to do the teardown in O(1), which is what the "may"
is doing up there).

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods
  2012-09-04 22:06 ` [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
  2012-09-05 12:55   ` Avi Kivity
@ 2012-09-05 14:01   ` Haggai Eran
  1 sibling, 0 replies; 12+ messages in thread
From: Haggai Eran @ 2012-09-05 14:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm@kvack.org, Andrea Arcangeli, Peter Zijlstra,
	Xiao Guangrong, Shachar Raindel, Sagi Grimberg, Or Gerlitz

On 05/09/2012 01:06, Andrew Morton wrote:
> Exactly why do we want on-demand paging for Infiniband?  

Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.

> Why should anyone care?  What problems are users currently experiencing?

This can make programming with RDMA much simpler. Today, developers that
are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we might
be able to provide a single memory access key for each process that
would provide the entire process's address as one large memory region,
and the developers wouldn't need to register memory regions at all.

> Is there any prospect that any other subsystems will utilise these
> infrastructural changes?  If so, which and how, etc?

As for other subsystems, I understand that XPMEM wanted to sleep in MMU
notifiers, as Christoph Lameter wrote at
http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html
and perhaps Andrea knows about other use cases.

Scheduling in mmu notifications is required since we need to sync the
hardware with the secondary page tables change. A TLB flush of an IO
device is inherently slower than a CPU TLB flush, so our design works by
sending the invalidation request to the device, and waiting for an
interrupt before exiting the mmu notifier handler.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2012-09-06 20:08 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-09-04  8:41 [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
2012-09-04  8:41 ` [PATCH V1 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
2012-09-04 22:07   ` Andrew Morton
2012-09-06 14:34     ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Haggai Eran
2012-09-06 14:34       ` [PATCH V2 1/2] mm: Move all mmu notifier invocations to be done outside the PT lock Haggai Eran
2012-09-06 14:34       ` [PATCH V2 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
2012-09-06 20:08       ` [PATCH V2 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
2012-09-04  8:41 ` [PATCH V1 2/2] mm: Wrap calls to set_pte_at_notify with invalidate_range_start and invalidate_range_end Haggai Eran
2012-09-04 22:07   ` Andrew Morton
2012-09-04 22:06 ` [PATCH V1 0/2] Enable clients to schedule in mmu_notifier methods Andrew Morton
2012-09-05 12:55   ` Avi Kivity
2012-09-05 14:01   ` Haggai Eran

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).