* [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-23 8:31 [PATCH 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
@ 2023-02-23 8:31 ` Yin Fengwei
2023-02-23 17:28 ` Matthew Wilcox
2023-02-23 8:31 ` [PATCH 2/5] rmap: move page unmap operation " Yin Fengwei
` (3 subsequent siblings)
4 siblings, 1 reply; 13+ messages in thread
From: Yin Fengwei @ 2023-02-23 8:31 UTC (permalink / raw)
To: linux-mm, akpm, willy; +Cc: fengwei.yin
This is a preparation for batched rmap updates of large folios.
There is no need to handle hugetlb inside the loop; just handle
hugetlb once and bail out early.
Almost no functional change, except for one change to the mm
counter update.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/rmap.c | 205 +++++++++++++++++++++++++++++++++---------------------
1 file changed, 126 insertions(+), 79 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 15ae24585fc4..e7aa63b800f7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1443,6 +1443,108 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
munlock_vma_folio(folio, vma, compound);
}
+static bool try_to_unmap_one_hugetlb(struct folio *folio,
+ struct vm_area_struct *vma, struct mmu_notifier_range range,
+ struct page_vma_mapped_walk pvmw, unsigned long address,
+ enum ttu_flags flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t pteval;
+ bool ret = true, anon = folio_test_anon(folio);
+
+ /*
+ * The try_to_unmap() is only passed a hugetlb page
+ * in the case where the hugetlb page is poisoned.
+ */
+ VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
+ /*
+ * huge_pmd_unshare may unmap an entire PMD page.
+ * There is no way of knowing exactly which PMDs may
+ * be cached for this mm, so we must flush them all.
+ * start/end were already adjusted above to cover this
+ * range.
+ */
+ flush_cache_range(vma, range.start, range.end);
+
+ /*
+ * To call huge_pmd_unshare, i_mmap_rwsem must be
+ * held in write mode. Caller needs to explicitly
+ * do this outside rmap routines.
+ *
+ * We also must hold hugetlb vma_lock in write mode.
+ * Lock order dictates acquiring vma_lock BEFORE
+ * i_mmap_rwsem. We can only try lock here and fail
+ * if unsuccessful.
+ */
+ if (!anon) {
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (!hugetlb_vma_trylock_write(vma)) {
+ ret = false;
+ goto out;
+ }
+ if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+ hugetlb_vma_unlock_write(vma);
+ flush_tlb_range(vma,
+ range.start, range.end);
+ mmu_notifier_invalidate_range(mm,
+ range.start, range.end);
+ /*
+ * The ref count of the PMD page was
+ * dropped which is part of the way map
+ * counting is done for shared PMDs.
+ * Return 'true' here. When there is
+ * no other sharing, huge_pmd_unshare
+ * returns false and we will unmap the
+ * actual page and drop map count
+ * to zero.
+ */
+ goto out;
+ }
+ hugetlb_vma_unlock_write(vma);
+ }
+ pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+
+ /*
+ * Now the pte is cleared. If this pte was uffd-wp armed,
+ * we may want to replace a none pte with a marker pte if
+ * it's file-backed, so we don't lose the tracking info.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+ /* Set the dirty flag on the folio now the pte is gone. */
+ if (pte_dirty(pteval))
+ folio_mark_dirty(folio);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
+ set_huge_pte_at(mm, address, pvmw.pte, pteval);
+ }
+
+ /*** try_to_unmap_one() called dec_mm_counter() when
+ * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) was
+ * not true, which looks incorrect. Change it to hugetlb_count_sub() here.
+ */
+ hugetlb_count_sub(folio_nr_pages(folio), mm);
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_put(folio);
+
+out:
+ return ret;
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -1506,86 +1608,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
+ address = pvmw.address;
+ if (folio_test_hugetlb(folio)) {
+ ret = try_to_unmap_one_hugetlb(folio, vma, range,
+ pvmw, address, flags);
+
+ /* no need to loop for hugetlb */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+
subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
- address = pvmw.address;
anon_exclusive = folio_test_anon(folio) &&
PageAnonExclusive(subpage);
- if (folio_test_hugetlb(folio)) {
- bool anon = folio_test_anon(folio);
-
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
/*
- * The try_to_unmap() is only passed a hugetlb page
- * in the case where the hugetlb page is poisoned.
+ * We clear the PTE but do not flush so potentially
+ * a remote CPU could still be writing to the folio.
+ * If the entry was previously clean then the
+ * architecture must guarantee that a clear->dirty
+ * transition on a cached TLB entry is written through
+ * and traps if the PTE is unmapped.
*/
- VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
- /*
- * huge_pmd_unshare may unmap an entire PMD page.
- * There is no way of knowing exactly which PMDs may
- * be cached for this mm, so we must flush them all.
- * start/end were already adjusted above to cover this
- * range.
- */
- flush_cache_range(vma, range.start, range.end);
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
- /*
- * To call huge_pmd_unshare, i_mmap_rwsem must be
- * held in write mode. Caller needs to explicitly
- * do this outside rmap routines.
- *
- * We also must hold hugetlb vma_lock in write mode.
- * Lock order dictates acquiring vma_lock BEFORE
- * i_mmap_rwsem. We can only try lock here and fail
- * if unsuccessful.
- */
- if (!anon) {
- VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
- if (!hugetlb_vma_trylock_write(vma)) {
- page_vma_mapped_walk_done(&pvmw);
- ret = false;
- break;
- }
- if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
- hugetlb_vma_unlock_write(vma);
- flush_tlb_range(vma,
- range.start, range.end);
- mmu_notifier_invalidate_range(mm,
- range.start, range.end);
- /*
- * The ref count of the PMD page was
- * dropped which is part of the way map
- * counting is done for shared PMDs.
- * Return 'true' here. When there is
- * no other sharing, huge_pmd_unshare
- * returns false and we will unmap the
- * actual page and drop map count
- * to zero.
- */
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- hugetlb_vma_unlock_write(vma);
- }
- pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
} else {
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /* Nuke the page table entry. */
- if (should_defer_flush(mm, flags)) {
- /*
- * We clear the PTE but do not flush so potentially
- * a remote CPU could still be writing to the folio.
- * If the entry was previously clean then the
- * architecture must guarantee that a clear->dirty
- * transition on a cached TLB entry is written through
- * and traps if the PTE is unmapped.
- */
- pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
- set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
- } else {
- pteval = ptep_clear_flush(vma, address, pvmw.pte);
- }
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
/*
@@ -1604,14 +1657,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- if (folio_test_hugetlb(folio)) {
- hugetlb_count_sub(folio_nr_pages(folio), mm);
- set_huge_pte_at(mm, address, pvmw.pte, pteval);
- } else {
- dec_mm_counter(mm, mm_counter(&folio->page));
- set_pte_at(mm, address, pvmw.pte, pteval);
- }
-
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
/*
* The guest indicated that the page content is of no
--
2.30.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-23 8:31 ` [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-02-23 17:28 ` Matthew Wilcox
2023-02-24 0:20 ` Mike Kravetz
2023-02-24 2:51 ` HORIGUCHI NAOYA(堀口 直也)
0 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2023-02-23 17:28 UTC (permalink / raw)
To: Yin Fengwei
Cc: linux-mm, akpm, Sidhartha Kumar, Mike Kravetz, Jane Chu,
Naoya Horiguchi
On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
> It's to prepare the batched rmap update for large folio.
> No need to looped handle hugetlb. Just handle hugetlb and
> bail out early.
>
> Almost no functional change. Just one change to mm counter
> update.
This looks like a very worthwhile cleanup in its own right. Adding
various hugetlb & memory poisoning experts for commentary on the mm
counter change (search for three asterisks)
> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> ---
> mm/rmap.c | 205 +++++++++++++++++++++++++++++++++---------------------
> 1 file changed, 126 insertions(+), 79 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 15ae24585fc4..e7aa63b800f7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1443,6 +1443,108 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> munlock_vma_folio(folio, vma, compound);
> }
>
> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> + struct vm_area_struct *vma, struct mmu_notifier_range range,
> + struct page_vma_mapped_walk pvmw, unsigned long address,
> + enum ttu_flags flags)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pte_t pteval;
> + bool ret = true, anon = folio_test_anon(folio);
> +
> + /*
> + * The try_to_unmap() is only passed a hugetlb page
> + * in the case where the hugetlb page is poisoned.
> + */
> + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> + /*
> + * huge_pmd_unshare may unmap an entire PMD page.
> + * There is no way of knowing exactly which PMDs may
> + * be cached for this mm, so we must flush them all.
> + * start/end were already adjusted above to cover this
> + * range.
> + */
> + flush_cache_range(vma, range.start, range.end);
> +
> + /*
> + * To call huge_pmd_unshare, i_mmap_rwsem must be
> + * held in write mode. Caller needs to explicitly
> + * do this outside rmap routines.
> + *
> + * We also must hold hugetlb vma_lock in write mode.
> + * Lock order dictates acquiring vma_lock BEFORE
> + * i_mmap_rwsem. We can only try lock here and fail
> + * if unsuccessful.
> + */
> + if (!anon) {
> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> + if (!hugetlb_vma_trylock_write(vma)) {
> + ret = false;
> + goto out;
> + }
> + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> + hugetlb_vma_unlock_write(vma);
> + flush_tlb_range(vma,
> + range.start, range.end);
> + mmu_notifier_invalidate_range(mm,
> + range.start, range.end);
> + /*
> + * The ref count of the PMD page was
> + * dropped which is part of the way map
> + * counting is done for shared PMDs.
> + * Return 'true' here. When there is
> + * no other sharing, huge_pmd_unshare
> + * returns false and we will unmap the
> + * actual page and drop map count
> + * to zero.
> + */
> + goto out;
> + }
> + hugetlb_vma_unlock_write(vma);
> + }
> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> +
> + /*
> + * Now the pte is cleared. If this pte was uffd-wp armed,
> + * we may want to replace a none pte with a marker pte if
> + * it's file-backed, so we don't lose the tracking info.
> + */
> + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> +
> + /* Set the dirty flag on the folio now the pte is gone. */
> + if (pte_dirty(pteval))
> + folio_mark_dirty(folio);
> +
> + /* Update high watermark before we lower rss */
> + update_hiwater_rss(mm);
> +
> + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
> + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> + }
> +
> + /*** try_to_unmap_one() called dec_mm_counter for
> + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
> + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
> + */
> + hugetlb_count_sub(folio_nr_pages(folio), mm);
> +
> + /*
> + * No need to call mmu_notifier_invalidate_range() it has be
> + * done above for all cases requiring it to happen under page
> + * table lock before mmu_notifier_invalidate_range_end()
> + *
> + * See Documentation/mm/mmu_notifier.rst
> + */
> + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_drain_local();
> + folio_put(folio);
> +
> +out:
> + return ret;
> +}
> +
> /*
> * @arg: enum ttu_flags will be passed to this argument
> */
> @@ -1506,86 +1608,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> break;
> }
>
> + address = pvmw.address;
> + if (folio_test_hugetlb(folio)) {
> + ret = try_to_unmap_one_hugetlb(folio, vma, range,
> + pvmw, address, flags);
> +
> + /* no need to loop for hugetlb */
> + page_vma_mapped_walk_done(&pvmw);
> + break;
> + }
> +
> subpage = folio_page(folio,
> pte_pfn(*pvmw.pte) - folio_pfn(folio));
> - address = pvmw.address;
> anon_exclusive = folio_test_anon(folio) &&
> PageAnonExclusive(subpage);
>
> - if (folio_test_hugetlb(folio)) {
> - bool anon = folio_test_anon(folio);
> -
> + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> + /* Nuke the page table entry. */
> + if (should_defer_flush(mm, flags)) {
> /*
> - * The try_to_unmap() is only passed a hugetlb page
> - * in the case where the hugetlb page is poisoned.
> + * We clear the PTE but do not flush so potentially
> + * a remote CPU could still be writing to the folio.
> + * If the entry was previously clean then the
> + * architecture must guarantee that a clear->dirty
> + * transition on a cached TLB entry is written through
> + * and traps if the PTE is unmapped.
> */
> - VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
> - /*
> - * huge_pmd_unshare may unmap an entire PMD page.
> - * There is no way of knowing exactly which PMDs may
> - * be cached for this mm, so we must flush them all.
> - * start/end were already adjusted above to cover this
> - * range.
> - */
> - flush_cache_range(vma, range.start, range.end);
> + pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>
> - /*
> - * To call huge_pmd_unshare, i_mmap_rwsem must be
> - * held in write mode. Caller needs to explicitly
> - * do this outside rmap routines.
> - *
> - * We also must hold hugetlb vma_lock in write mode.
> - * Lock order dictates acquiring vma_lock BEFORE
> - * i_mmap_rwsem. We can only try lock here and fail
> - * if unsuccessful.
> - */
> - if (!anon) {
> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> - if (!hugetlb_vma_trylock_write(vma)) {
> - page_vma_mapped_walk_done(&pvmw);
> - ret = false;
> - break;
> - }
> - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> - hugetlb_vma_unlock_write(vma);
> - flush_tlb_range(vma,
> - range.start, range.end);
> - mmu_notifier_invalidate_range(mm,
> - range.start, range.end);
> - /*
> - * The ref count of the PMD page was
> - * dropped which is part of the way map
> - * counting is done for shared PMDs.
> - * Return 'true' here. When there is
> - * no other sharing, huge_pmd_unshare
> - * returns false and we will unmap the
> - * actual page and drop map count
> - * to zero.
> - */
> - page_vma_mapped_walk_done(&pvmw);
> - break;
> - }
> - hugetlb_vma_unlock_write(vma);
> - }
> - pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> + set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> } else {
> - flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> - /* Nuke the page table entry. */
> - if (should_defer_flush(mm, flags)) {
> - /*
> - * We clear the PTE but do not flush so potentially
> - * a remote CPU could still be writing to the folio.
> - * If the entry was previously clean then the
> - * architecture must guarantee that a clear->dirty
> - * transition on a cached TLB entry is written through
> - * and traps if the PTE is unmapped.
> - */
> - pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> -
> - set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> - } else {
> - pteval = ptep_clear_flush(vma, address, pvmw.pte);
> - }
> + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> }
>
> /*
> @@ -1604,14 +1657,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>
> if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
> pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> - if (folio_test_hugetlb(folio)) {
> - hugetlb_count_sub(folio_nr_pages(folio), mm);
> - set_huge_pte_at(mm, address, pvmw.pte, pteval);
> - } else {
> - dec_mm_counter(mm, mm_counter(&folio->page));
> - set_pte_at(mm, address, pvmw.pte, pteval);
> - }
> -
> + dec_mm_counter(mm, mm_counter(&folio->page));
> + set_pte_at(mm, address, pvmw.pte, pteval);
> } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
> /*
> * The guest indicated that the page content is of no
> --
> 2.30.2
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-23 17:28 ` Matthew Wilcox
@ 2023-02-24 0:20 ` Mike Kravetz
2023-02-24 0:52 ` Yin, Fengwei
2023-02-24 2:51 ` HORIGUCHI NAOYA(堀口 直也)
1 sibling, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2023-02-24 0:20 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yin Fengwei, linux-mm, akpm, Sidhartha Kumar, Jane Chu,
Naoya Horiguchi
On 02/23/23 17:28, Matthew Wilcox wrote:
> On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
> > It's to prepare the batched rmap update for large folio.
> > No need to looped handle hugetlb. Just handle hugetlb and
> > bail out early.
> >
> > Almost no functional change. Just one change to mm counter
> > update.
>
> This looks like a very worthwhile cleanup in its own right. Adding
> various hugetlb & memory poisoning experts for commentary on the mm
> counter change (search for three asterisks)
The counter change looks correct to me.
However, there is a bunch of code in the else cases for
if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON))
in try_to_unmap_one. Are you sure none of it applies to hugetlb pages?
It seems like some of the code in the
} else if (folio_test_anon(folio)) {
could apply. pte_uffd_wp perhaps?
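(For reference, the tail of that anon branch currently looks roughly like
the snippet below; the same code is visible in the PATCH 2/5 hunk later in
this thread. Whether any of it needs a hugetlb counterpart is the open
question.)
	swp_pte = swp_entry_to_pte(entry);
	if (anon_exclusive)
		swp_pte = pte_swp_mkexclusive(swp_pte);
	if (pte_soft_dirty(pteval))
		swp_pte = pte_swp_mksoft_dirty(swp_pte);
	if (pte_uffd_wp(pteval))
		swp_pte = pte_swp_mkuffd_wp(swp_pte);
	set_pte_at(mm, address, pvmw.pte, swp_pte);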
Since you are cleaning up, I think the following may apply ...
> > +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> > + struct vm_area_struct *vma, struct mmu_notifier_range range,
> > + struct page_vma_mapped_walk pvmw, unsigned long address,
> > + enum ttu_flags flags)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pte_t pteval;
> > + bool ret = true, anon = folio_test_anon(folio);
> > +
> > + /*
> > + * The try_to_unmap() is only passed a hugetlb page
> > + * in the case where the hugetlb page is poisoned.
> > + */
> > + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > + /*
> > + * huge_pmd_unshare may unmap an entire PMD page.
> > + * There is no way of knowing exactly which PMDs may
> > + * be cached for this mm, so we must flush them all.
> > + * start/end were already adjusted above to cover this
> > + * range.
> > + */
> > + flush_cache_range(vma, range.start, range.end);
> > +
> > + /*
> > + * To call huge_pmd_unshare, i_mmap_rwsem must be
> > + * held in write mode. Caller needs to explicitly
> > + * do this outside rmap routines.
> > + *
> > + * We also must hold hugetlb vma_lock in write mode.
> > + * Lock order dictates acquiring vma_lock BEFORE
> > + * i_mmap_rwsem. We can only try lock here and fail
> > + * if unsuccessful.
> > + */
> > + if (!anon) {
> > + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > + if (!hugetlb_vma_trylock_write(vma)) {
> > + ret = false;
> > + goto out;
> > + }
> > + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > + hugetlb_vma_unlock_write(vma);
> > + flush_tlb_range(vma,
> > + range.start, range.end);
> > + mmu_notifier_invalidate_range(mm,
> > + range.start, range.end);
> > + /*
> > + * The ref count of the PMD page was
> > + * dropped which is part of the way map
> > + * counting is done for shared PMDs.
> > + * Return 'true' here. When there is
> > + * no other sharing, huge_pmd_unshare
> > + * returns false and we will unmap the
> > + * actual page and drop map count
> > + * to zero.
> > + */
> > + goto out;
> > + }
> > + hugetlb_vma_unlock_write(vma);
> > + }
> > + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > + /*
> > + * Now the pte is cleared. If this pte was uffd-wp armed,
> > + * we may want to replace a none pte with a marker pte if
> > + * it's file-backed, so we don't lose the tracking info.
> > + */
> > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > +
> > + /* Set the dirty flag on the folio now the pte is gone. */
> > + if (pte_dirty(pteval))
> > + folio_mark_dirty(folio);
> > +
> > + /* Update high watermark before we lower rss */
> > + update_hiwater_rss(mm);
> > +
> > + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
I'm guessing this can just be,
if (!(flags & TTU_IGNORE_HWPOISON))
as we only get here in the poison case as indicated by,
VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
above.
> > + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > + }
> > +
> > + /*** try_to_unmap_one() called dec_mm_counter for
> > + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
> > + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
> > + */
> > + hugetlb_count_sub(folio_nr_pages(folio), mm);
> > +
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() it has be
> > + * done above for all cases requiring it to happen under page
> > + * table lock before mmu_notifier_invalidate_range_end()
> > + *
> > + * See Documentation/mm/mmu_notifier.rst
> > + */
> > + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
Always compound/hugetlb, so could be:
page_remove_rmap(&folio->page, vma, true);
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_drain_local();
Since hugetlb pages are always memory resident, I do not think the above
is necessary.
> > + folio_put(folio);
> > +
> > +out:
> > + return ret;
> > +}
--
Mike Kravetz
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-24 0:20 ` Mike Kravetz
@ 2023-02-24 0:52 ` Yin, Fengwei
0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-02-24 0:52 UTC (permalink / raw)
To: Mike Kravetz, Matthew Wilcox
Cc: linux-mm, akpm, Sidhartha Kumar, Jane Chu, Naoya Horiguchi
On 2/24/2023 8:20 AM, Mike Kravetz wrote:
> On 02/23/23 17:28, Matthew Wilcox wrote:
>> On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
>>> It's to prepare the batched rmap update for large folio.
>>> No need to looped handle hugetlb. Just handle hugetlb and
>>> bail out early.
>>>
>>> Almost no functional change. Just one change to mm counter
>>> update.
>>
>> This looks like a very worthwhile cleanup in its own right. Adding
>> various hugetlb & memory poisoning experts for commentary on the mm
>> counter change (search for three asterisks)
>
> The counter change looks correct to me.
> However, there is a bunch of code in the else cases for
>
> if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON))
>
> in try_to_unmap_one. Are you sure none of it applies to hugetlb pages?
> It seems like some of the code in the
>
> } else if (folio_test_anon(folio)) {
>
> could apply. pte_uffd_wp perhaps?
In the "else if (pte_unused).." and "else if (folio_test_anon(folio)"
path, it uses set_pte_at or PAGE_SIZE for mmu_notifier_invalidate_range().
So I suppose hugetlb will not hit these two branches.
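Roughly, the if/else ladder in try_to_unmap_one() being discussed is
shaped like this (paraphrased for illustration only; the real code is in
the PATCH 2/5 hunk later in this thread):
	if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
		/* the only branch hugetlb used to reach */
	} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
		/* invalidates a PAGE_SIZE range - not a hugetlb path */
	} else if (folio_test_anon(folio)) {
		/* swap entry setup via set_pte_at() - not a hugetlb path */
	} else {
		/* locked file-backed folio, per-page mm counter update */
	}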
>
> Since you are cleaning up, I think the following make apply ...
>
>>> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
>>> + struct vm_area_struct *vma, struct mmu_notifier_range range,
>>> + struct page_vma_mapped_walk pvmw, unsigned long address,
>>> + enum ttu_flags flags)
>>> +{
>>> + struct mm_struct *mm = vma->vm_mm;
>>> + pte_t pteval;
>>> + bool ret = true, anon = folio_test_anon(folio);
>>> +
>>> + /*
>>> + * The try_to_unmap() is only passed a hugetlb page
>>> + * in the case where the hugetlb page is poisoned.
>>> + */
>>> + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
>>> + /*
>>> + * huge_pmd_unshare may unmap an entire PMD page.
>>> + * There is no way of knowing exactly which PMDs may
>>> + * be cached for this mm, so we must flush them all.
>>> + * start/end were already adjusted above to cover this
>>> + * range.
>>> + */
>>> + flush_cache_range(vma, range.start, range.end);
>>> +
>>> + /*
>>> + * To call huge_pmd_unshare, i_mmap_rwsem must be
>>> + * held in write mode. Caller needs to explicitly
>>> + * do this outside rmap routines.
>>> + *
>>> + * We also must hold hugetlb vma_lock in write mode.
>>> + * Lock order dictates acquiring vma_lock BEFORE
>>> + * i_mmap_rwsem. We can only try lock here and fail
>>> + * if unsuccessful.
>>> + */
>>> + if (!anon) {
>>> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>> + ret = false;
>>> + goto out;
>>> + }
>>> + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>>> + hugetlb_vma_unlock_write(vma);
>>> + flush_tlb_range(vma,
>>> + range.start, range.end);
>>> + mmu_notifier_invalidate_range(mm,
>>> + range.start, range.end);
>>> + /*
>>> + * The ref count of the PMD page was
>>> + * dropped which is part of the way map
>>> + * counting is done for shared PMDs.
>>> + * Return 'true' here. When there is
>>> + * no other sharing, huge_pmd_unshare
>>> + * returns false and we will unmap the
>>> + * actual page and drop map count
>>> + * to zero.
>>> + */
>>> + goto out;
>>> + }
>>> + hugetlb_vma_unlock_write(vma);
>>> + }
>>> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>>> +
>>> + /*
>>> + * Now the pte is cleared. If this pte was uffd-wp armed,
>>> + * we may want to replace a none pte with a marker pte if
>>> + * it's file-backed, so we don't lose the tracking info.
>>> + */
>>> + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>>> +
>>> + /* Set the dirty flag on the folio now the pte is gone. */
>>> + if (pte_dirty(pteval))
>>> + folio_mark_dirty(folio);
>>> +
>>> + /* Update high watermark before we lower rss */
>>> + update_hiwater_rss(mm);
>>> +
>>> + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
>
> I'm guessing this can just be,
> if (!(flags & TTU_IGNORE_HWPOISON))
>
> as we only get here in the poison case as indicated by,
> VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> above.
You are correct. Will remove the folio_test_hwpoison().
>
>>> + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
>>> + set_huge_pte_at(mm, address, pvmw.pte, pteval);
>>> + }
>>> +
>>> + /*** try_to_unmap_one() called dec_mm_counter for
>>> + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
>>> + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
>>> + */
>>> + hugetlb_count_sub(folio_nr_pages(folio), mm);
>>> +
>>> + /*
>>> + * No need to call mmu_notifier_invalidate_range() it has be
>>> + * done above for all cases requiring it to happen under page
>>> + * table lock before mmu_notifier_invalidate_range_end()
>>> + *
>>> + * See Documentation/mm/mmu_notifier.rst
>>> + */
>>> + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
>
> Always compound/hugetlb, so could be:
>
> page_remove_rmap(&folio->page, vma, true);
This is in patch 3 of this series.
>
>>> + if (vma->vm_flags & VM_LOCKED)
>>> + mlock_drain_local();
>
> Since hugetlb pages are always memory resident, I do not think the above
> is necessary.
OK. Will remove this piece of code.
Thanks a lot for the comments.
Regards
Yin, Fengwei
>
>>> + folio_put(folio);
>>> +
>>> +out:
>>> + return ret;
>>> +}
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-23 17:28 ` Matthew Wilcox
2023-02-24 0:20 ` Mike Kravetz
@ 2023-02-24 2:51 ` HORIGUCHI NAOYA(堀口 直也)
2023-02-24 4:41 ` Yin, Fengwei
2023-02-24 19:21 ` Mike Kravetz
1 sibling, 2 replies; 13+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2023-02-24 2:51 UTC (permalink / raw)
To: Yin Fengwei
Cc: Matthew Wilcox, linux-mm@kvack.org, akpm@linux-foundation.org,
Sidhartha Kumar, Mike Kravetz, Jane Chu
On Thu, Feb 23, 2023 at 05:28:10PM +0000, Matthew Wilcox wrote:
> On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
> > It's to prepare the batched rmap update for large folio.
> > No need to looped handle hugetlb. Just handle hugetlb and
> > bail out early.
> >
> > Almost no functional change. Just one change to mm counter
> > update.
>
> This looks like a very worthwhile cleanup in its own right. Adding
> various hugetlb & memory poisoning experts for commentary on the mm
> counter change (search for three asterisks)
>
> > Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
> > ---
> > mm/rmap.c | 205 +++++++++++++++++++++++++++++++++---------------------
> > 1 file changed, 126 insertions(+), 79 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 15ae24585fc4..e7aa63b800f7 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1443,6 +1443,108 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
> > munlock_vma_folio(folio, vma, compound);
> > }
> >
> > +static bool try_to_unmap_one_hugetlb(struct folio *folio,
> > + struct vm_area_struct *vma, struct mmu_notifier_range range,
> > + struct page_vma_mapped_walk pvmw, unsigned long address,
> > + enum ttu_flags flags)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + pte_t pteval;
> > + bool ret = true, anon = folio_test_anon(folio);
> > +
> > + /*
> > + * The try_to_unmap() is only passed a hugetlb page
> > + * in the case where the hugetlb page is poisoned.
> > + */
> > + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
> > + /*
> > + * huge_pmd_unshare may unmap an entire PMD page.
> > + * There is no way of knowing exactly which PMDs may
> > + * be cached for this mm, so we must flush them all.
> > + * start/end were already adjusted above to cover this
> > + * range.
> > + */
> > + flush_cache_range(vma, range.start, range.end);
> > +
> > + /*
> > + * To call huge_pmd_unshare, i_mmap_rwsem must be
> > + * held in write mode. Caller needs to explicitly
> > + * do this outside rmap routines.
> > + *
> > + * We also must hold hugetlb vma_lock in write mode.
> > + * Lock order dictates acquiring vma_lock BEFORE
> > + * i_mmap_rwsem. We can only try lock here and fail
> > + * if unsuccessful.
> > + */
> > + if (!anon) {
> > + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > + if (!hugetlb_vma_trylock_write(vma)) {
> > + ret = false;
> > + goto out;
> > + }
> > + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > + hugetlb_vma_unlock_write(vma);
> > + flush_tlb_range(vma,
> > + range.start, range.end);
> > + mmu_notifier_invalidate_range(mm,
> > + range.start, range.end);
> > + /*
> > + * The ref count of the PMD page was
> > + * dropped which is part of the way map
> > + * counting is done for shared PMDs.
> > + * Return 'true' here. When there is
> > + * no other sharing, huge_pmd_unshare
> > + * returns false and we will unmap the
> > + * actual page and drop map count
> > + * to zero.
> > + */
> > + goto out;
> > + }
> > + hugetlb_vma_unlock_write(vma);
> > + }
> > + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > + /*
> > + * Now the pte is cleared. If this pte was uffd-wp armed,
> > + * we may want to replace a none pte with a marker pte if
> > + * it's file-backed, so we don't lose the tracking info.
> > + */
> > + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
> > +
> > + /* Set the dirty flag on the folio now the pte is gone. */
> > + if (pte_dirty(pteval))
> > + folio_mark_dirty(folio);
> > +
> > + /* Update high watermark before we lower rss */
> > + update_hiwater_rss(mm);
> > +
> > + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
> > + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > + }
> > +
> > + /*** try_to_unmap_one() called dec_mm_counter for
> > + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
> > + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
> > + */
> > + hugetlb_count_sub(folio_nr_pages(folio), mm);
I have no objection to this change (moving hugetlb_count_sub() outside the
if), but I have a question related to this.
Generally TTU_IGNORE_HWPOISON is used to control the pte conversion based
on page dirtiness. But what it actually depends on is whether data loss
happens when the page is forcibly dropped, and for hugetlb pages that is
true regardless of the PageDirty flag.
So I think we can assume that try_to_unmap_one_hugetlb() is called with
TTU_IGNORE_HWPOISON clear. So maybe we don't need the if-check?
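Concretely, that would reduce the hwpoison handling in
try_to_unmap_one_hugetlb() to something like the sketch below (untested,
relying only on the VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio)
earlier in the function as the guard):
	/* always install the hwpoison entry; no flag or dirtiness check */
	pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
	set_huge_pte_at(mm, address, pvmw.pte, pteval);
	hugetlb_count_sub(folio_nr_pages(folio), mm);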
Otherwise, the cleanup looks good to me.
Thanks,
Naoya Horiguchi
> > +
> > + /*
> > + * No need to call mmu_notifier_invalidate_range() it has be
> > + * done above for all cases requiring it to happen under page
> > + * table lock before mmu_notifier_invalidate_range_end()
> > + *
> > + * See Documentation/mm/mmu_notifier.rst
> > + */
> > + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
> > + if (vma->vm_flags & VM_LOCKED)
> > + mlock_drain_local();
> > + folio_put(folio);
> > +
> > +out:
> > + return ret;
> > +}
> > +
> > /*
> > * @arg: enum ttu_flags will be passed to this argument
> > */
> > @@ -1506,86 +1608,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> > break;
> > }
> >
> > + address = pvmw.address;
> > + if (folio_test_hugetlb(folio)) {
> > + ret = try_to_unmap_one_hugetlb(folio, vma, range,
> > + pvmw, address, flags);
> > +
> > + /* no need to loop for hugetlb */
> > + page_vma_mapped_walk_done(&pvmw);
> > + break;
> > + }
> > +
> > subpage = folio_page(folio,
> > pte_pfn(*pvmw.pte) - folio_pfn(folio));
> > - address = pvmw.address;
> > anon_exclusive = folio_test_anon(folio) &&
> > PageAnonExclusive(subpage);
> >
> > - if (folio_test_hugetlb(folio)) {
> > - bool anon = folio_test_anon(folio);
> > -
> > + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > + /* Nuke the page table entry. */
> > + if (should_defer_flush(mm, flags)) {
> > /*
> > - * The try_to_unmap() is only passed a hugetlb page
> > - * in the case where the hugetlb page is poisoned.
> > + * We clear the PTE but do not flush so potentially
> > + * a remote CPU could still be writing to the folio.
> > + * If the entry was previously clean then the
> > + * architecture must guarantee that a clear->dirty
> > + * transition on a cached TLB entry is written through
> > + * and traps if the PTE is unmapped.
> > */
> > - VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
> > - /*
> > - * huge_pmd_unshare may unmap an entire PMD page.
> > - * There is no way of knowing exactly which PMDs may
> > - * be cached for this mm, so we must flush them all.
> > - * start/end were already adjusted above to cover this
> > - * range.
> > - */
> > - flush_cache_range(vma, range.start, range.end);
> > + pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> >
> > - /*
> > - * To call huge_pmd_unshare, i_mmap_rwsem must be
> > - * held in write mode. Caller needs to explicitly
> > - * do this outside rmap routines.
> > - *
> > - * We also must hold hugetlb vma_lock in write mode.
> > - * Lock order dictates acquiring vma_lock BEFORE
> > - * i_mmap_rwsem. We can only try lock here and fail
> > - * if unsuccessful.
> > - */
> > - if (!anon) {
> > - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
> > - if (!hugetlb_vma_trylock_write(vma)) {
> > - page_vma_mapped_walk_done(&pvmw);
> > - ret = false;
> > - break;
> > - }
> > - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> > - hugetlb_vma_unlock_write(vma);
> > - flush_tlb_range(vma,
> > - range.start, range.end);
> > - mmu_notifier_invalidate_range(mm,
> > - range.start, range.end);
> > - /*
> > - * The ref count of the PMD page was
> > - * dropped which is part of the way map
> > - * counting is done for shared PMDs.
> > - * Return 'true' here. When there is
> > - * no other sharing, huge_pmd_unshare
> > - * returns false and we will unmap the
> > - * actual page and drop map count
> > - * to zero.
> > - */
> > - page_vma_mapped_walk_done(&pvmw);
> > - break;
> > - }
> > - hugetlb_vma_unlock_write(vma);
> > - }
> > - pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> > + set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> > } else {
> > - flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > - /* Nuke the page table entry. */
> > - if (should_defer_flush(mm, flags)) {
> > - /*
> > - * We clear the PTE but do not flush so potentially
> > - * a remote CPU could still be writing to the folio.
> > - * If the entry was previously clean then the
> > - * architecture must guarantee that a clear->dirty
> > - * transition on a cached TLB entry is written through
> > - * and traps if the PTE is unmapped.
> > - */
> > - pteval = ptep_get_and_clear(mm, address, pvmw.pte);
> > -
> > - set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
> > - } else {
> > - pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > - }
> > + pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > }
> >
> > /*
> > @@ -1604,14 +1657,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >
> > if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
> > pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
> > - if (folio_test_hugetlb(folio)) {
> > - hugetlb_count_sub(folio_nr_pages(folio), mm);
> > - set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > - } else {
> > - dec_mm_counter(mm, mm_counter(&folio->page));
> > - set_pte_at(mm, address, pvmw.pte, pteval);
> > - }
> > -
> > + dec_mm_counter(mm, mm_counter(&folio->page));
> > + set_pte_at(mm, address, pvmw.pte, pteval);
> > } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
> > /*
> > * The guest indicated that the page content is of no
> > --
> > 2.30.2
> >
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-24 2:51 ` HORIGUCHI NAOYA(堀口 直也)
@ 2023-02-24 4:41 ` Yin, Fengwei
2023-02-24 19:21 ` Mike Kravetz
1 sibling, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-02-24 4:41 UTC (permalink / raw)
To: HORIGUCHI NAOYA(堀口 直也)
Cc: Matthew Wilcox, linux-mm@kvack.org, akpm@linux-foundation.org,
Sidhartha Kumar, Mike Kravetz, Jane Chu
On 2/24/2023 10:51 AM, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Feb 23, 2023 at 05:28:10PM +0000, Matthew Wilcox wrote:
>> On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
>>> It's to prepare the batched rmap update for large folio.
>>> No need to looped handle hugetlb. Just handle hugetlb and
>>> bail out early.
>>>
>>> Almost no functional change. Just one change to mm counter
>>> update.
>>
>> This looks like a very worthwhile cleanup in its own right. Adding
>> various hugetlb & memory poisoning experts for commentary on the mm
>> counter change (search for three asterisks)
>>
>>> Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
>>> ---
>>> mm/rmap.c | 205 +++++++++++++++++++++++++++++++++---------------------
>>> 1 file changed, 126 insertions(+), 79 deletions(-)
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index 15ae24585fc4..e7aa63b800f7 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -1443,6 +1443,108 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
>>> munlock_vma_folio(folio, vma, compound);
>>> }
>>>
>>> +static bool try_to_unmap_one_hugetlb(struct folio *folio,
>>> + struct vm_area_struct *vma, struct mmu_notifier_range range,
>>> + struct page_vma_mapped_walk pvmw, unsigned long address,
>>> + enum ttu_flags flags)
>>> +{
>>> + struct mm_struct *mm = vma->vm_mm;
>>> + pte_t pteval;
>>> + bool ret = true, anon = folio_test_anon(folio);
>>> +
>>> + /*
>>> + * The try_to_unmap() is only passed a hugetlb page
>>> + * in the case where the hugetlb page is poisoned.
>>> + */
>>> + VM_BUG_ON_FOLIO(!folio_test_hwpoison(folio), folio);
>>> + /*
>>> + * huge_pmd_unshare may unmap an entire PMD page.
>>> + * There is no way of knowing exactly which PMDs may
>>> + * be cached for this mm, so we must flush them all.
>>> + * start/end were already adjusted above to cover this
>>> + * range.
>>> + */
>>> + flush_cache_range(vma, range.start, range.end);
>>> +
>>> + /*
>>> + * To call huge_pmd_unshare, i_mmap_rwsem must be
>>> + * held in write mode. Caller needs to explicitly
>>> + * do this outside rmap routines.
>>> + *
>>> + * We also must hold hugetlb vma_lock in write mode.
>>> + * Lock order dictates acquiring vma_lock BEFORE
>>> + * i_mmap_rwsem. We can only try lock here and fail
>>> + * if unsuccessful.
>>> + */
>>> + if (!anon) {
>>> + VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>> + ret = false;
>>> + goto out;
>>> + }
>>> + if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>>> + hugetlb_vma_unlock_write(vma);
>>> + flush_tlb_range(vma,
>>> + range.start, range.end);
>>> + mmu_notifier_invalidate_range(mm,
>>> + range.start, range.end);
>>> + /*
>>> + * The ref count of the PMD page was
>>> + * dropped which is part of the way map
>>> + * counting is done for shared PMDs.
>>> + * Return 'true' here. When there is
>>> + * no other sharing, huge_pmd_unshare
>>> + * returns false and we will unmap the
>>> + * actual page and drop map count
>>> + * to zero.
>>> + */
>>> + goto out;
>>> + }
>>> + hugetlb_vma_unlock_write(vma);
>>> + }
>>> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>>> +
>>> + /*
>>> + * Now the pte is cleared. If this pte was uffd-wp armed,
>>> + * we may want to replace a none pte with a marker pte if
>>> + * it's file-backed, so we don't lose the tracking info.
>>> + */
>>> + pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
>>> +
>>> + /* Set the dirty flag on the folio now the pte is gone. */
>>> + if (pte_dirty(pteval))
>>> + folio_mark_dirty(folio);
>>> +
>>> + /* Update high watermark before we lower rss */
>>> + update_hiwater_rss(mm);
>>> +
>>> + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
>>> + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
>>> + set_huge_pte_at(mm, address, pvmw.pte, pteval);
>>> + }
>>> +
>>> + /*** try_to_unmap_one() called dec_mm_counter for
>>> + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
>>> + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
>>> + */
>>> + hugetlb_count_sub(folio_nr_pages(folio), mm);
>
> I have no objection to this change (moving hugetlb_count_sub() outside the
> if), but I have a question related to this.
>
> Generally TTU_IGNORE_HWPOISON is used to control the pte-conversion based
> on page dirtiness. But actually what it depends on is whether data lost
> happens when the page is forcibly dropped. And for hugetlb pages, that's
> true regardless of PageDirty flag on it.
> So I think we can assume that try_to_unmap_one_hugetlb() is called with
> TTU_IGNORE_HWPOISON clear. So maybe we don't need the if-check?
Thanks a lot for the detailed explanation. I will remove the if check if
there is no objection from other reviewers.
Regards
Yin, Fengwei
>
> Otherwise, the cleanup looks good to me.
>
> Thanks,
> Naoya Horiguchi
>
>>> +
>>> + /*
>>> + * No need to call mmu_notifier_invalidate_range() it has be
>>> + * done above for all cases requiring it to happen under page
>>> + * table lock before mmu_notifier_invalidate_range_end()
>>> + *
>>> + * See Documentation/mm/mmu_notifier.rst
>>> + */
>>> + page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
>>> + if (vma->vm_flags & VM_LOCKED)
>>> + mlock_drain_local();
>>> + folio_put(folio);
>>> +
>>> +out:
>>> + return ret;
>>> +}
>>> +
>>> /*
>>> * @arg: enum ttu_flags will be passed to this argument
>>> */
>>> @@ -1506,86 +1608,37 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>> break;
>>> }
>>>
>>> + address = pvmw.address;
>>> + if (folio_test_hugetlb(folio)) {
>>> + ret = try_to_unmap_one_hugetlb(folio, vma, range,
>>> + pvmw, address, flags);
>>> +
>>> + /* no need to loop for hugetlb */
>>> + page_vma_mapped_walk_done(&pvmw);
>>> + break;
>>> + }
>>> +
>>> subpage = folio_page(folio,
>>> pte_pfn(*pvmw.pte) - folio_pfn(folio));
>>> - address = pvmw.address;
>>> anon_exclusive = folio_test_anon(folio) &&
>>> PageAnonExclusive(subpage);
>>>
>>> - if (folio_test_hugetlb(folio)) {
>>> - bool anon = folio_test_anon(folio);
>>> -
>>> + flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>>> + /* Nuke the page table entry. */
>>> + if (should_defer_flush(mm, flags)) {
>>> /*
>>> - * The try_to_unmap() is only passed a hugetlb page
>>> - * in the case where the hugetlb page is poisoned.
>>> + * We clear the PTE but do not flush so potentially
>>> + * a remote CPU could still be writing to the folio.
>>> + * If the entry was previously clean then the
>>> + * architecture must guarantee that a clear->dirty
>>> + * transition on a cached TLB entry is written through
>>> + * and traps if the PTE is unmapped.
>>> */
>>> - VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
>>> - /*
>>> - * huge_pmd_unshare may unmap an entire PMD page.
>>> - * There is no way of knowing exactly which PMDs may
>>> - * be cached for this mm, so we must flush them all.
>>> - * start/end were already adjusted above to cover this
>>> - * range.
>>> - */
>>> - flush_cache_range(vma, range.start, range.end);
>>> + pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>>
>>> - /*
>>> - * To call huge_pmd_unshare, i_mmap_rwsem must be
>>> - * held in write mode. Caller needs to explicitly
>>> - * do this outside rmap routines.
>>> - *
>>> - * We also must hold hugetlb vma_lock in write mode.
>>> - * Lock order dictates acquiring vma_lock BEFORE
>>> - * i_mmap_rwsem. We can only try lock here and fail
>>> - * if unsuccessful.
>>> - */
>>> - if (!anon) {
>>> - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>> - if (!hugetlb_vma_trylock_write(vma)) {
>>> - page_vma_mapped_walk_done(&pvmw);
>>> - ret = false;
>>> - break;
>>> - }
>>> - if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>>> - hugetlb_vma_unlock_write(vma);
>>> - flush_tlb_range(vma,
>>> - range.start, range.end);
>>> - mmu_notifier_invalidate_range(mm,
>>> - range.start, range.end);
>>> - /*
>>> - * The ref count of the PMD page was
>>> - * dropped which is part of the way map
>>> - * counting is done for shared PMDs.
>>> - * Return 'true' here. When there is
>>> - * no other sharing, huge_pmd_unshare
>>> - * returns false and we will unmap the
>>> - * actual page and drop map count
>>> - * to zero.
>>> - */
>>> - page_vma_mapped_walk_done(&pvmw);
>>> - break;
>>> - }
>>> - hugetlb_vma_unlock_write(vma);
>>> - }
>>> - pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>>> + set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>>> } else {
>>> - flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
>>> - /* Nuke the page table entry. */
>>> - if (should_defer_flush(mm, flags)) {
>>> - /*
>>> - * We clear the PTE but do not flush so potentially
>>> - * a remote CPU could still be writing to the folio.
>>> - * If the entry was previously clean then the
>>> - * architecture must guarantee that a clear->dirty
>>> - * transition on a cached TLB entry is written through
>>> - * and traps if the PTE is unmapped.
>>> - */
>>> - pteval = ptep_get_and_clear(mm, address, pvmw.pte);
>>> -
>>> - set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
>>> - } else {
>>> - pteval = ptep_clear_flush(vma, address, pvmw.pte);
>>> - }
>>> + pteval = ptep_clear_flush(vma, address, pvmw.pte);
>>> }
>>>
>>> /*
>>> @@ -1604,14 +1657,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>
>>> if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
>>> pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
>>> - if (folio_test_hugetlb(folio)) {
>>> - hugetlb_count_sub(folio_nr_pages(folio), mm);
>>> - set_huge_pte_at(mm, address, pvmw.pte, pteval);
>>> - } else {
>>> - dec_mm_counter(mm, mm_counter(&folio->page));
>>> - set_pte_at(mm, address, pvmw.pte, pteval);
>>> - }
>>> -
>>> + dec_mm_counter(mm, mm_counter(&folio->page));
>>> + set_pte_at(mm, address, pvmw.pte, pteval);
>>> } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
>>> /*
>>> * The guest indicated that the page content is of no
>>> --
>>> 2.30.2
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-24 2:51 ` HORIGUCHI NAOYA(堀口 直也)
2023-02-24 4:41 ` Yin, Fengwei
@ 2023-02-24 19:21 ` Mike Kravetz
2023-02-26 11:44 ` Yin, Fengwei
1 sibling, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2023-02-24 19:21 UTC (permalink / raw)
To: HORIGUCHI NAOYA(堀口 直也)
Cc: Yin Fengwei, Matthew Wilcox, linux-mm@kvack.org,
akpm@linux-foundation.org, Sidhartha Kumar, Jane Chu
On 02/24/23 02:51, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Feb 23, 2023 at 05:28:10PM +0000, Matthew Wilcox wrote:
> > On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
> > > +
> > > + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
> > > + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
> > > + set_huge_pte_at(mm, address, pvmw.pte, pteval);
> > > + }
> > > +
> > > + /*** try_to_unmap_one() called dec_mm_counter for
> > > + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
> > > + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
> > > + */
> > > + hugetlb_count_sub(folio_nr_pages(folio), mm);
>
> I have no objection to this change (moving hugetlb_count_sub() outside the
> if), but I have a question related to this.
>
> Generally TTU_IGNORE_HWPOISON is used to control the pte-conversion based
> on page dirtiness. But actually what it depends on is whether data lost
> happens when the page is forcibly dropped. And for hugetlb pages, that's
> true regardless of PageDirty flag on it.
> So I think we can assume that try_to_unmap_one_hugetlb() is called with
> TTU_IGNORE_HWPOISON clear. So maybe we don't need the if-check?
Thanks for looking at this, Naoya!
In another reply I asked about the possibility of that if statement ever
being false for hugetlb pages. Looks like that can never happen. I went
back and looked at the memory failure/poison code just to be sure.
Yin,
Since we NEVER went down the (folio_test_hwpoison(folio) &&
!(flags & TTU_IGNORE_HWPOISON)) not-true path in the current code,
perhaps we do not need the comment about possibly calling dec_mm_counter.
--
Mike Kravetz
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function
2023-02-24 19:21 ` Mike Kravetz
@ 2023-02-26 11:44 ` Yin, Fengwei
0 siblings, 0 replies; 13+ messages in thread
From: Yin, Fengwei @ 2023-02-26 11:44 UTC (permalink / raw)
To: Mike Kravetz, HORIGUCHI NAOYA(堀口 直也)
Cc: Matthew Wilcox, linux-mm@kvack.org, akpm@linux-foundation.org,
Sidhartha Kumar, Jane Chu
On 2/25/2023 3:21 AM, Mike Kravetz wrote:
> On 02/24/23 02:51, HORIGUCHI NAOYA(堀口 直也) wrote:
>> On Thu, Feb 23, 2023 at 05:28:10PM +0000, Matthew Wilcox wrote:
>>> On Thu, Feb 23, 2023 at 04:31:56PM +0800, Yin Fengwei wrote:
>>>> +
>>>> + if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
>>>> + pteval = swp_entry_to_pte(make_hwpoison_entry(&folio->page));
>>>> + set_huge_pte_at(mm, address, pvmw.pte, pteval);
>>>> + }
>>>> +
>>>> + /*** try_to_unmap_one() called dec_mm_counter for
>>>> + * (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) not
>>>> + * true case, looks incorrect. Change it to hugetlb_count_sub() here.
>>>> + */
>>>> + hugetlb_count_sub(folio_nr_pages(folio), mm);
>>
>> I have no objection to this change (moving hugetlb_count_sub() outside the
>> if), but I have a question related to this.
>>
>> Generally TTU_IGNORE_HWPOISON is used to control the pte-conversion based
>> on page dirtiness. But actually what it depends on is whether data lost
>> happens when the page is forcibly dropped. And for hugetlb pages, that's
>> true regardless of PageDirty flag on it.
>> So I think we can assume that try_to_unmap_one_hugetlb() is called with
>> TTU_IGNORE_HWPOISON clear. So maybe we don't need the if-check?
>
> Thanks for looking at this Naoya!
>
> In another reply I asked about the possibility of that if statement ever
> being false for hugetlb pages. Looks like that can never happen. I went
> back and looked at the memory failure/poison code just to be sure.
>
> Yin,
> Since we NEVER took went down the (folio_test_hwpoison(folio) &&
> !(flags & TTU_IGNORE_HWPOISON)) not true path in the current code,
> perhaps we not need the comment possibly calling dec_mm_counter.
Sure. I am going to remove this line and also the comment about the mm
counter:
if (folio_test_hwpoison(folio) && !(flags & TTU_IGNORE_HWPOISON)) {
I will send out a new version after a few more days to see whether there
are comments on the other patches in this series.
Thanks all for sharing the comments.
Regards
Yin, Fengwei
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/5] rmap: move page unmap operation to dedicated function
2023-02-23 8:31 [PATCH 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-02-23 8:31 ` [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
@ 2023-02-23 8:31 ` Yin Fengwei
2023-02-23 8:31 ` [PATCH 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
` (2 subsequent siblings)
4 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-02-23 8:31 UTC (permalink / raw)
To: linux-mm, akpm, willy; +Cc: fengwei.yin
No functional change; the code is just reorganized.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
mm/rmap.c | 369 ++++++++++++++++++++++++++++--------------------------
1 file changed, 194 insertions(+), 175 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index e7aa63b800f7..879e90bbf6aa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1545,17 +1545,204 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
return ret;
}
+static bool try_to_unmap_one_page(struct folio *folio,
+ struct vm_area_struct *vma, struct mmu_notifier_range range,
+ struct page_vma_mapped_walk pvmw, unsigned long address,
+ enum ttu_flags flags)
+{
+ bool anon_exclusive, ret = true;
+ struct page *subpage;
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t pteval;
+
+ subpage = folio_page(folio,
+ pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ anon_exclusive = folio_test_anon(folio) &&
+ PageAnonExclusive(subpage);
+
+ flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+ /* Nuke the page table entry. */
+ if (should_defer_flush(mm, flags)) {
+ /*
+ * We clear the PTE but do not flush so potentially
+ * a remote CPU could still be writing to the folio.
+ * If the entry was previously clean then the
+ * architecture must guarantee that a clear->dirty
+ * transition on a cached TLB entry is written through
+ * and traps if the PTE is unmapped.
+ */
+ pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+
+ set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+ } else {
+ pteval = ptep_clear_flush(vma, address, pvmw.pte);
+ }
+
+ /*
+ * Now the pte is cleared. If this pte was uffd-wp armed,
+ * we may want to replace a none pte with a marker pte if
+ * it's file-backed, so we don't lose the tracking info.
+ */
+ pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
+
+ /* Set the dirty flag on the folio now the pte is gone. */
+ if (pte_dirty(pteval))
+ folio_mark_dirty(folio);
+
+ /* Update high watermark before we lower rss */
+ update_hiwater_rss(mm);
+
+ if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
+ /*
+ * The guest indicated that the page content is of no
+ * interest anymore. Simply discard the pte, vmscan
+ * will take care of the rest.
+ * A future reference will then fault in a new zero
+ * page. When userfaultfd is active, we must not drop
+ * this page though, as its main user (postcopy
+ * migration) will not expect userfaults on already
+ * copied pages.
+ */
+ dec_mm_counter(mm, mm_counter(&folio->page));
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else if (folio_test_anon(folio)) {
+ swp_entry_t entry = { .val = page_private(subpage) };
+ pte_t swp_pte;
+ /*
+ * Store the swap location in the pte.
+ * See handle_pte_fault() ...
+ */
+ if (unlikely(folio_test_swapbacked(folio) !=
+ folio_test_swapcache(folio))) {
+ WARN_ON_ONCE(1);
+ ret = false;
+ /* We have to invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ /* MADV_FREE page check */
+ if (!folio_test_swapbacked(folio)) {
+ int ref_count, map_count;
+
+ /*
+ * Synchronize with gup_pte_range():
+ * - clear PTE; barrier; read refcount
+ * - inc refcount; barrier; read PTE
+ */
+ smp_mb();
+
+ ref_count = folio_ref_count(folio);
+ map_count = folio_mapcount(folio);
+
+ /*
+ * Order reads for page refcount and dirty flag
+ * (see comments in __remove_mapping()).
+ */
+ smp_rmb();
+
+ /*
+ * The only page refs must be one from isolation
+ * plus the rmap(s) (dropped by discard:).
+ */
+ if (ref_count == 1 + map_count &&
+ !folio_test_dirty(folio)) {
+ /* Invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm,
+ address, address + PAGE_SIZE);
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
+ /*
+ * If the folio was redirtied, it cannot be
+ * discarded. Remap the page to page table.
+ */
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ folio_set_swapbacked(folio);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ if (swap_duplicate(entry) < 0) {
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+ if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+
+ /* See page_try_share_anon_rmap(): clear PTE first. */
+ if (anon_exclusive &&
+ page_try_share_anon_rmap(subpage)) {
+ swap_free(entry);
+ set_pte_at(mm, address, pvmw.pte, pteval);
+ ret = false;
+ page_vma_mapped_walk_done(&pvmw);
+ goto discard;
+ }
+ if (list_empty(&mm->mmlist)) {
+ spin_lock(&mmlist_lock);
+ if (list_empty(&mm->mmlist))
+ list_add(&mm->mmlist, &init_mm.mmlist);
+ spin_unlock(&mmlist_lock);
+ }
+ dec_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
+ swp_pte = swp_entry_to_pte(entry);
+ if (anon_exclusive)
+ swp_pte = pte_swp_mkexclusive(swp_pte);
+ if (pte_soft_dirty(pteval))
+ swp_pte = pte_swp_mksoft_dirty(swp_pte);
+ if (pte_uffd_wp(pteval))
+ swp_pte = pte_swp_mkuffd_wp(swp_pte);
+ set_pte_at(mm, address, pvmw.pte, swp_pte);
+ /* Invalidate as we cleared the pte */
+ mmu_notifier_invalidate_range(mm, address,
+ address + PAGE_SIZE);
+ } else {
+ /*
+ * This is a locked file-backed folio,
+ * so it cannot be removed from the page
+ * cache and replaced by a new folio before
+ * mmu_notifier_invalidate_range_end, so no
+ * concurrent thread might update its page table
+ * to point at a new folio while a device is
+ * still using this folio.
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ dec_mm_counter(mm, mm_counter_file(&folio->page));
+ }
+
+discard:
+ return ret;
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
- struct mm_struct *mm = vma->vm_mm;
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- pte_t pteval;
struct page *subpage;
- bool anon_exclusive, ret = true;
+ bool ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
@@ -1620,179 +1807,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
subpage = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
- anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
-
- flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
- /* Nuke the page table entry. */
- if (should_defer_flush(mm, flags)) {
- /*
- * We clear the PTE but do not flush so potentially
- * a remote CPU could still be writing to the folio.
- * If the entry was previously clean then the
- * architecture must guarantee that a clear->dirty
- * transition on a cached TLB entry is written through
- * and traps if the PTE is unmapped.
- */
- pteval = ptep_get_and_clear(mm, address, pvmw.pte);
-
- set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
- } else {
- pteval = ptep_clear_flush(vma, address, pvmw.pte);
- }
-
- /*
- * Now the pte is cleared. If this pte was uffd-wp armed,
- * we may want to replace a none pte with a marker pte if
- * it's file-backed, so we don't lose the tracking info.
- */
- pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
-
- /* Set the dirty flag on the folio now the pte is gone. */
- if (pte_dirty(pteval))
- folio_mark_dirty(folio);
-
- /* Update high watermark before we lower rss */
- update_hiwater_rss(mm);
-
- if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
- pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- dec_mm_counter(mm, mm_counter(&folio->page));
- set_pte_at(mm, address, pvmw.pte, pteval);
- } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
- /*
- * The guest indicated that the page content is of no
- * interest anymore. Simply discard the pte, vmscan
- * will take care of the rest.
- * A future reference will then fault in a new zero
- * page. When userfaultfd is active, we must not drop
- * this page though, as its main user (postcopy
- * migration) will not expect userfaults on already
- * copied pages.
- */
- dec_mm_counter(mm, mm_counter(&folio->page));
- /* We have to invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- } else if (folio_test_anon(folio)) {
- swp_entry_t entry = { .val = page_private(subpage) };
- pte_t swp_pte;
- /*
- * Store the swap location in the pte.
- * See handle_pte_fault() ...
- */
- if (unlikely(folio_test_swapbacked(folio) !=
- folio_test_swapcache(folio))) {
- WARN_ON_ONCE(1);
- ret = false;
- /* We have to invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- /* MADV_FREE page check */
- if (!folio_test_swapbacked(folio)) {
- int ref_count, map_count;
-
- /*
- * Synchronize with gup_pte_range():
- * - clear PTE; barrier; read refcount
- * - inc refcount; barrier; read PTE
- */
- smp_mb();
-
- ref_count = folio_ref_count(folio);
- map_count = folio_mapcount(folio);
-
- /*
- * Order reads for page refcount and dirty flag
- * (see comments in __remove_mapping()).
- */
- smp_rmb();
-
- /*
- * The only page refs must be one from isolation
- * plus the rmap(s) (dropped by discard:).
- */
- if (ref_count == 1 + map_count &&
- !folio_test_dirty(folio)) {
- /* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm,
- address, address + PAGE_SIZE);
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
- }
-
- /*
- * If the folio was redirtied, it cannot be
- * discarded. Remap the page to page table.
- */
- set_pte_at(mm, address, pvmw.pte, pteval);
- folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
-
- if (swap_duplicate(entry) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
+ ret = try_to_unmap_one_page(folio, vma,
+ range, pvmw, address, flags);
+ if (!ret)
+ break;
- /* See page_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- page_try_share_anon_rmap(subpage)) {
- swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- break;
- }
- if (list_empty(&mm->mmlist)) {
- spin_lock(&mmlist_lock);
- if (list_empty(&mm->mmlist))
- list_add(&mm->mmlist, &init_mm.mmlist);
- spin_unlock(&mmlist_lock);
- }
- dec_mm_counter(mm, MM_ANONPAGES);
- inc_mm_counter(mm, MM_SWAPENTS);
- swp_pte = swp_entry_to_pte(entry);
- if (anon_exclusive)
- swp_pte = pte_swp_mkexclusive(swp_pte);
- if (pte_soft_dirty(pteval))
- swp_pte = pte_swp_mksoft_dirty(swp_pte);
- if (pte_uffd_wp(pteval))
- swp_pte = pte_swp_mkuffd_wp(swp_pte);
- set_pte_at(mm, address, pvmw.pte, swp_pte);
- /* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
- } else {
- /*
- * This is a locked file-backed folio,
- * so it cannot be removed from the page
- * cache and replaced by a new folio before
- * mmu_notifier_invalidate_range_end, so no
- * concurrent thread might update its page table
- * to point at a new folio while a device is
- * still using this folio.
- *
- * See Documentation/mm/mmu_notifier.rst
- */
- dec_mm_counter(mm, mm_counter_file(&folio->page));
- }
-discard:
/*
* No need to call mmu_notifier_invalidate_range() it has be
* done above for all cases requiring it to happen under page
--
2.30.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 3/5] rmap: cleanup exit path of try_to_unmap_one_page()
2023-02-23 8:31 [PATCH 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
2023-02-23 8:31 ` [PATCH 1/5] rmap: move hugetlb try_to_unmap to dedicated function Yin Fengwei
2023-02-23 8:31 ` [PATCH 2/5] rmap: move page unmap operation " Yin Fengwei
@ 2023-02-23 8:31 ` Yin Fengwei
2023-02-23 8:31 ` [PATCH 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
2023-02-23 8:32 ` [PATCH 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
4 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-02-23 8:31 UTC (permalink / raw)
To: linux-mm, akpm, willy; +Cc: fengwei.yin
Clean up the exit path of try_to_unmap_one_page() by removing
some duplicated code.
Move page_vma_mapped_walk_done() back to try_to_unmap_one().
Rename subpage to page, as a folio has no concept of a subpage.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
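The shape of the cleanup can be sketched with a small standalone C
model (hypothetical names, userspace only, not the kernel code):
failure sites jump to a shared label that restores the pte instead of
duplicating the restore at every site, mirroring the new
exit_restore_pte/exit labels below.

#include <stdbool.h>
#include <stdio.h>

/* Sketch of the consolidated exit path: one restore site, two labels. */
static bool unmap_page_model(int *pte, bool warn_case, bool fail_swap)
{
        int pteval = *pte;              /* stands in for the saved pteval */

        *pte = 0;                       /* "clear" the pte */

        if (warn_case)
                goto exit;              /* pte stays cleared, like the WARN_ON_ONCE case */
        if (fail_swap)
                goto exit_restore_pte;  /* was: set_pte_at(); ret = false; ... */

        return true;                    /* success: pte stays cleared */

exit_restore_pte:
        *pte = pteval;                  /* the one place that re-installs the pte */
exit:
        return false;
}

int main(void)
{
        int pte = 42;
        bool ok = unmap_page_model(&pte, false, true);

        printf("ok=%d pte=%d\n", ok, pte);      /* ok=0 pte=42: pte restored */
        return 0;
}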
mm/rmap.c | 74 ++++++++++++++++++++++---------------------------------
1 file changed, 30 insertions(+), 44 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 879e90bbf6aa..097774c809a0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1536,7 +1536,7 @@ static bool try_to_unmap_one_hugetlb(struct folio *folio,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(&folio->page, vma, folio_test_hugetlb(folio));
+ page_remove_rmap(&folio->page, vma, true);
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
folio_put(folio);
@@ -1550,15 +1550,13 @@ static bool try_to_unmap_one_page(struct folio *folio,
struct page_vma_mapped_walk pvmw, unsigned long address,
enum ttu_flags flags)
{
- bool anon_exclusive, ret = true;
- struct page *subpage;
+ bool anon_exclusive;
+ struct page *page;
struct mm_struct *mm = vma->vm_mm;
pte_t pteval;
- subpage = folio_page(folio,
- pte_pfn(*pvmw.pte) - folio_pfn(folio));
- anon_exclusive = folio_test_anon(folio) &&
- PageAnonExclusive(subpage);
+ page = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio));
+ anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
/* Nuke the page table entry. */
@@ -1586,15 +1584,14 @@ static bool try_to_unmap_one_page(struct folio *folio,
pte_install_uffd_wp_if_needed(vma, address, pvmw.pte, pteval);
/* Set the dirty flag on the folio now the pte is gone. */
- if (pte_dirty(pteval))
+ if (pte_dirty(pteval) && !folio_test_dirty(folio))
folio_mark_dirty(folio);
/* Update high watermark before we lower rss */
update_hiwater_rss(mm);
- if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
- pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
- dec_mm_counter(mm, mm_counter(&folio->page));
+ if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
+ pteval = swp_entry_to_pte(make_hwpoison_entry(page));
set_pte_at(mm, address, pvmw.pte, pteval);
} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
/*
@@ -1607,12 +1604,11 @@ static bool try_to_unmap_one_page(struct folio *folio,
* migration) will not expect userfaults on already
* copied pages.
*/
- dec_mm_counter(mm, mm_counter(&folio->page));
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
} else if (folio_test_anon(folio)) {
- swp_entry_t entry = { .val = page_private(subpage) };
+ swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;
/*
* Store the swap location in the pte.
@@ -1621,12 +1617,10 @@ static bool try_to_unmap_one_page(struct folio *folio,
if (unlikely(folio_test_swapbacked(folio) !=
folio_test_swapcache(folio))) {
WARN_ON_ONCE(1);
- ret = false;
/* We have to invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm, address,
address + PAGE_SIZE);
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit;
}
/* MADV_FREE page check */
@@ -1658,7 +1652,6 @@ static bool try_to_unmap_one_page(struct folio *folio,
/* Invalidate as we cleared the pte */
mmu_notifier_invalidate_range(mm,
address, address + PAGE_SIZE);
- dec_mm_counter(mm, MM_ANONPAGES);
goto discard;
}
@@ -1666,43 +1659,30 @@ static bool try_to_unmap_one_page(struct folio *folio,
* If the folio was redirtied, it cannot be
* discarded. Remap the page to page table.
*/
- set_pte_at(mm, address, pvmw.pte, pteval);
folio_set_swapbacked(folio);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
- if (swap_duplicate(entry) < 0) {
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
- }
+ if (swap_duplicate(entry) < 0)
+ goto exit_restore_pte;
+
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
/* See page_try_share_anon_rmap(): clear PTE first. */
- if (anon_exclusive &&
- page_try_share_anon_rmap(subpage)) {
+ if (anon_exclusive && page_try_share_anon_rmap(page)) {
swap_free(entry);
- set_pte_at(mm, address, pvmw.pte, pteval);
- ret = false;
- page_vma_mapped_walk_done(&pvmw);
- goto discard;
+ goto exit_restore_pte;
}
+
if (list_empty(&mm->mmlist)) {
spin_lock(&mmlist_lock);
if (list_empty(&mm->mmlist))
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, MM_ANONPAGES);
inc_mm_counter(mm, MM_SWAPENTS);
swp_pte = swp_entry_to_pte(entry);
if (anon_exclusive)
@@ -1713,8 +1693,7 @@ static bool try_to_unmap_one_page(struct folio *folio,
swp_pte = pte_swp_mkuffd_wp(swp_pte);
set_pte_at(mm, address, pvmw.pte, swp_pte);
/* Invalidate as we cleared the pte */
- mmu_notifier_invalidate_range(mm, address,
- address + PAGE_SIZE);
+ mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
} else {
/*
* This is a locked file-backed folio,
@@ -1727,11 +1706,16 @@ static bool try_to_unmap_one_page(struct folio *folio,
*
* See Documentation/mm/mmu_notifier.rst
*/
- dec_mm_counter(mm, mm_counter_file(&folio->page));
}
discard:
- return ret;
+ dec_mm_counter(vma->vm_mm, mm_counter(&folio->page));
+ return true;
+
+exit_restore_pte:
+ set_pte_at(mm, address, pvmw.pte, pteval);
+exit:
+ return false;
}
/*
@@ -1809,8 +1793,10 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
ret = try_to_unmap_one_page(folio, vma,
range, pvmw, address, flags);
- if (!ret)
+ if (!ret) {
+ page_vma_mapped_walk_done(&pvmw);
break;
+ }
/*
* No need to call mmu_notifier_invalidate_range() it has be
@@ -1819,7 +1805,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
*
* See Documentation/mm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, vma, folio_test_hugetlb(folio));
+ page_remove_rmap(subpage, vma, false);
if (vma->vm_flags & VM_LOCKED)
mlock_drain_local();
folio_put(folio);
--
2.30.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 4/5] rmap: add folio_remove_rmap_range()
2023-02-23 8:31 [PATCH 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
` (2 preceding siblings ...)
2023-02-23 8:31 ` [PATCH 3/5] rmap: cleanup exit path of try_to_unmap_one_page() Yin Fengwei
@ 2023-02-23 8:31 ` Yin Fengwei
2023-02-23 8:32 ` [PATCH 5/5] try_to_unmap_one: batched remove rmap, update folio refcount Yin Fengwei
4 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-02-23 8:31 UTC (permalink / raw)
To: linux-mm, akpm, willy; +Cc: fengwei.yin
folio_remove_rmap_range() allows taking down the pte mappings of a
specific range of pages within a folio. Compared to
page_remove_rmap(), it batches the __lruvec_stat updates for large
folios.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
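A minimal userspace sketch of the batching idea (all names
hypothetical, not the kernel implementation): per-page mapcounts are
still dropped one by one, but the statistics update is folded into a
single call for the whole range.

#include <stdio.h>

#define NR_PAGES 8

static int mapcount[NR_PAGES];          /* 0 = mapped once; -1 after decrement = last mapping gone */
static long nr_mapped_stat = NR_PAGES;  /* stands in for the NR_*MAPPED lruvec counter */

/* Stands in for the lruvec stat update: one call per range, not per page. */
static void stat_mod_model(long delta)
{
        nr_mapped_stat += delta;
}

/* Sketch of folio_remove_rmap_range(): loop the mapcounts, batch the stat. */
static void remove_rmap_range_model(int first, int nr)
{
        int removed = 0;

        for (int i = first; i < first + nr; i++) {
                if (--mapcount[i] < 0)  /* last mapping of this page is gone */
                        removed++;
        }
        if (removed)
                stat_mod_model(-removed);
}

int main(void)
{
        remove_rmap_range_model(2, 4);
        printf("mapped pages stat: %ld\n", nr_mapped_stat);     /* 4 */
        return 0;
}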
include/linux/rmap.h | 4 +++
mm/rmap.c | 58 +++++++++++++++++++++++++++++++++-----------
2 files changed, 48 insertions(+), 14 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index a4570da03e58..d7a51b96f379 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -200,6 +200,10 @@ void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void folio_remove_rmap_range(struct folio *, struct page *,
+ unsigned int nr_pages, struct vm_area_struct *,
+ bool compound);
+
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
diff --git a/mm/rmap.c b/mm/rmap.c
index 097774c809a0..3680765b7ec8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1357,23 +1357,25 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
}
/**
- * page_remove_rmap - take down pte mapping from a page
- * @page: page to remove mapping from
+ * folio_remove_rmap_range - take down pte mapping from a range of pages
+ * @folio: folio to remove mapping from
+ * @page: the first page to take the pte mapping down from
+ * @nr_pages: the number of pages whose pte mappings will be taken down
* @vma: the vm area from which the mapping is removed
* @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
- bool compound)
+void folio_remove_rmap_range(struct folio *folio, struct page *page,
+ unsigned int nr_pages, struct vm_area_struct *vma,
+ bool compound)
{
- struct folio *folio = page_folio(page);
atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
- bool last;
+ int nr = 0, nr_pmdmapped = 0, last;
enum node_stat_item idx;
- VM_BUG_ON_PAGE(compound && !PageHead(page), page);
+ VM_BUG_ON_FOLIO(compound && (nr_pages != folio_nr_pages(folio)), folio);
+ VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
/* Hugetlb pages are not counted in NR_*MAPPED */
if (unlikely(folio_test_hugetlb(folio))) {
@@ -1384,12 +1386,16 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
/* Is page being unmapped by PTE? Is this its last map to be removed? */
if (likely(!compound)) {
- last = atomic_add_negative(-1, &page->_mapcount);
- nr = last;
- if (last && folio_test_large(folio)) {
- nr = atomic_dec_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
+ do {
+ last = atomic_add_negative(-1, &page->_mapcount);
+ if (last && folio_test_large(folio)) {
+ last = atomic_dec_return_relaxed(mapped);
+ last = (last < COMPOUND_MAPPED);
+ }
+
+ if (last)
+ nr++;
+ } while (page++, --nr_pages > 0);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */
@@ -1443,6 +1449,30 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
munlock_vma_folio(folio, vma, compound);
}
+/**
+ * page_remove_rmap - take down pte mapping from a page
+ * @page: page to remove mapping from
+ * @vma: the vm area from which the mapping is removed
+ * @compound: uncharge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
+ bool compound)
+{
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages;
+
+ VM_BUG_ON_FOLIO(compound && (page != &folio->page), folio);
+
+ if (likely(!compound))
+ nr_pages = 1;
+ else
+ nr_pages = folio_nr_pages(folio);
+
+ folio_remove_rmap_range(folio, page, nr_pages, vma, compound);
+}
+
static bool try_to_unmap_one_hugetlb(struct folio *folio,
struct vm_area_struct *vma, struct mmu_notifier_range range,
struct page_vma_mapped_walk pvmw, unsigned long address,
--
2.30.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 5/5] try_to_unmap_one: batched remove rmap, update folio refcount
2023-02-23 8:31 [PATCH 0/5] batched remove rmap in try_to_unmap_one() Yin Fengwei
` (3 preceding siblings ...)
2023-02-23 8:31 ` [PATCH 4/5] rmap: add folio_remove_rmap_range() Yin Fengwei
@ 2023-02-23 8:32 ` Yin Fengwei
4 siblings, 0 replies; 13+ messages in thread
From: Yin Fengwei @ 2023-02-23 8:32 UTC (permalink / raw)
To: linux-mm, akpm, willy; +Cc: fengwei.yin
If unmapping one page fails, or the vma walk will skip the next pte,
or the vma walk will end at the next pte, remove the rmap for the
accumulated batch and update the folio refcount.
Signed-off-by: Yin Fengwei <fengwei.yin@intel.com>
---
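A toy userspace model of the batching pattern used here (hypothetical
names, not kernel code): successfully unmapped ptes are only counted,
and the accumulated rmap removal plus refcount update are flushed when
an unmap fails or when the walk is about to skip ahead or end.

#include <stdbool.h>
#include <stdio.h>

/* Stands in for folio_remove_rmap_and_update_count(). */
static void flush_batch_model(int start, int count)
{
        if (count)
                printf("remove rmap for %d pages starting at pte %d\n",
                       count, start);
}

/* Sketch of the walk loop: count successes, flush on skip/end/failure. */
static void walk_model(const bool *unmap_ok, const bool *skip_next, int nr)
{
        int start = -1, count = 0;

        for (int i = 0; i < nr; i++) {
                if (start < 0)
                        start = i;
                if (!unmap_ok[i]) {             /* try_to_unmap_one_page() failed */
                        flush_batch_model(start, count);
                        return;
                }
                count++;
                if (skip_next[i]) {             /* pvmw_walk_skip_or_end_on_next() */
                        flush_batch_model(start, count);
                        count = 0;
                        start = -1;
                }
        }
}

int main(void)
{
        bool ok[]   = { true, true, true, true };
        bool skip[] = { false, true, false, true };

        walk_model(ok, skip, 4);        /* flushes batches of 2 and 2 */
        return 0;
}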
include/linux/rmap.h | 1 +
mm/page_vma_mapped.c | 30 +++++++++++++++++++++++++++
mm/rmap.c | 48 ++++++++++++++++++++++++++++++++++----------
3 files changed, 68 insertions(+), 11 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index d7a51b96f379..568801ee8d6a 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -424,6 +424,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
}
bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw);
/*
* Used by swapoff to help locate where page is expected in vma.
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..19e997dfb5c6 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -291,6 +291,36 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return false;
}
+/**
+ * pvmw_walk_skip_or_end_on_next - check if next pte will be skipped or
+ * end the walk
+ * @pvmw: pointer to struct page_vma_mapped_walk.
+ *
+ * This function can only be called with the correct pte lock held
+ */
+bool pvmw_walk_skip_or_end_on_next(struct page_vma_mapped_walk *pvmw)
+{
+ unsigned long address = pvmw->address + PAGE_SIZE;
+
+ if (address >= vma_address_end(pvmw))
+ return true;
+
+ if ((address & (PMD_SIZE - PAGE_SIZE)) == 0)
+ return true;
+
+ if (pte_none(*pvmw->pte))
+ return true;
+
+ pvmw->pte++;
+ if (!check_pte(pvmw)) {
+ pvmw->pte--;
+ return true;
+ }
+ pvmw->pte--;
+
+ return false;
+}
+
/**
* page_mapped_in_vma - check whether a page is really mapped in a VMA
* @page: the page to test
diff --git a/mm/rmap.c b/mm/rmap.c
index 3680765b7ec8..7156b804d424 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1748,6 +1748,26 @@ static bool try_to_unmap_one_page(struct folio *folio,
return false;
}
+static void folio_remove_rmap_and_update_count(struct folio *folio,
+ struct page *start, struct vm_area_struct *vma, int count)
+{
+ if (count == 0)
+ return;
+
+ /*
+ * No need to call mmu_notifier_invalidate_range() it has be
+ * done above for all cases requiring it to happen under page
+ * table lock before mmu_notifier_invalidate_range_end()
+ *
+ * See Documentation/mm/mmu_notifier.rst
+ */
+ folio_remove_rmap_range(folio, start, count, vma,
+ folio_test_hugetlb(folio));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_drain_local();
+ folio_ref_sub(folio, count);
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -1755,10 +1775,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
unsigned long address, void *arg)
{
DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
- struct page *subpage;
+ struct page *start = NULL;
bool ret = true;
struct mmu_notifier_range range;
enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ int count = 0;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1819,26 +1840,31 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
- subpage = folio_page(folio,
+ if (!start)
+ start = folio_page(folio,
pte_pfn(*pvmw.pte) - folio_pfn(folio));
ret = try_to_unmap_one_page(folio, vma,
range, pvmw, address, flags);
if (!ret) {
+ folio_remove_rmap_and_update_count(folio,
+ start, vma, count);
page_vma_mapped_walk_done(&pvmw);
break;
}
+ count++;
/*
- * No need to call mmu_notifier_invalidate_range() it has be
- * done above for all cases requiring it to happen under page
- * table lock before mmu_notifier_invalidate_range_end()
- *
- * See Documentation/mm/mmu_notifier.rst
+ * If the next pte will be skipped in page_vma_mapped_walk() or
+ * the walk will end at it, remove the rmap for the batch and
+ * update the folio refcount. We can't do it after the walk
+ * returns false because the pte lock will no longer be held.
*/
- page_remove_rmap(subpage, vma, false);
- if (vma->vm_flags & VM_LOCKED)
- mlock_drain_local();
- folio_put(folio);
+ if (pvmw_walk_skip_or_end_on_next(&pvmw)) {
+ folio_remove_rmap_and_update_count(folio,
+ start, vma, count);
+ count = 0;
+ start = NULL;
+ }
}
mmu_notifier_invalidate_range_end(&range);
--
2.30.2
^ permalink raw reply related [flat|nested] 13+ messages in thread