[PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	chrisl@kernel.org, kasong@tencent.com, hughd@google.com,
	liam@infradead.org
Cc: Dev Jain <dev.jain@arm.com>,
	riel@surriel.com, vbabka@kernel.org, harry@kernel.org,
	jannh@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com, qi.zheng@linux.dev, shakeel.butt@linux.dev,
	baohua@kernel.org, axelrasmussen@google.com, yuanchu@google.com,
	weixugc@google.com, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	bhe@redhat.com, youngjun.park@lge.com,
	baolin.wang@linux.alibaba.com, pfalcato@suse.de,
	ryan.roberts@arm.com, anshuman.khandual@arm.com
Subject: [PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one
Date: Tue, 26 May 2026 12:06:25 +0530	[thread overview]
Message-ID: <20260526063635.61721-3-dev.jain@arm.com> (raw)
In-Reply-To: <20260526063635.61721-1-dev.jain@arm.com>

Simplify try_to_unmap_one by separating out the hugetlb parts into
try_to_unmap_hugetlb_one.

To understand the correctness of the refactoring, the following points
are noted:

1. try_to_unmap() is called for hugetlb folios only when they are
   hwpoisoned.

2. A hugetlb VMA cannot be mlocked.

3. The pvmw API sets pvmw.pte to the base of the hugetlb folio (pvmw.pmd
   is NULL).

4. We won't ever process a softleaf entry that encodes a hugetlb folio;
   hugetlb folios are never swapped out, migration entries will be skipped
   (PVMW_MIGRATION not passed) and device-exclusive does not work for
   hugetlb.

5. uffd-wp bit is lost when converting pvmw.pte to hwpoison softleaf
   (therefore no need to call pte_install_uffd_wp_if_needed after
   clearing pvmw.pte)

6. TTU_HWPOISON is always present; for it to not be present, either folio
   has to be in swapcache, or mapping_can_writeback() is true (see
   unmap_poisoned_folio), none of which is true for hugetlb folios.

7. hugetlb uses separate counters from normal rss counters, therefore
   update_highwater_rss() need not be called.

While at it:

 - Change VM_BUG_* to VM_WARN_*.

 - Do not declare variables which are only used once

 - Assert that the subpage derived by the pvmw walk is exactly the head
   page. This is because try_to_unmap() does not remember the specific
   subpage which was hwpoisoned, and, since we cannot munmap/mremap
   across a hugetlb folio partially, the first pte mapping the hugetlb
   folio (in case of a contpte or contpmd mapped folio) cannot ever point
   to an intermediate page.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/rmap.c | 203 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 121 insertions(+), 82 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 430c91c8fe2ae..06ab1158d4cd1 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1978,6 +1978,121 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 				     FPB_RESPECT_WRITE | FPB_RESPECT_SOFT_DIRTY);
 }
 
+static bool __try_to_unmap_hugetlb_one(struct folio *folio,
+		struct vm_area_struct *vma, struct page_vma_mapped_walk *pvmw,
+		struct mmu_notifier_range *range, enum ttu_flags flags)
+{
+	unsigned long hsz = huge_page_size(hstate_vma(vma));
+	unsigned long address = pvmw->address;
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *subpage;
+	bool ret = true;
+	pte_t pteval;
+
+	if (!page_vma_mapped_walk(pvmw))
+		return true;
+
+	pteval = ptep_get(pvmw->pte);
+	VM_WARN_ON(!pte_present(pteval));
+	subpage = folio_page(folio, pte_pfn(pteval) - folio_pfn(folio));
+	VM_WARN_ON(folio_page(folio, 0) != subpage);
+
+	/*
+	 * huge_pmd_unshare may unmap an entire PMD page. There is no way of
+	 * knowing exactly which PMDs may be cached for this mm, so we must
+	 * flush them all. start/end were already adjusted above to cover this
+	 * range.
+	 */
+	flush_cache_range(vma, range->start, range->end);
+
+	/*
+	 * To call huge_pmd_unshare, i_mmap_rwsem must be held in write mode.
+	 * Caller needs to explicitly do this outside rmap routines.
+	 *
+	 * We also must hold hugetlb vma_lock in write mode. Lock order dictates
+	 * acquiring vma_lock BEFORE i_mmap_rwsem. We can only try lock here and
+	 * fail if unsuccessful.
+	 */
+	if (!folio_test_anon(folio)) {
+		struct mmu_gather tlb;
+
+		VM_WARN_ON(!(flags & TTU_RMAP_LOCKED));
+		if (!hugetlb_vma_trylock_write(vma)) {
+			ret = false;
+			goto walk_done;
+		}
+
+		tlb_gather_mmu_vma(&tlb, vma);
+		if (huge_pmd_unshare(&tlb, vma, address, pvmw->pte)) {
+			hugetlb_vma_unlock_write(vma);
+			huge_pmd_unshare_flush(&tlb, vma);
+			tlb_finish_mmu(&tlb);
+			/*
+			 * The PMD table was unmapped, consequently unmapping
+			 * the folio.
+			 */
+			goto walk_done;
+		}
+		hugetlb_vma_unlock_write(vma);
+		tlb_finish_mmu(&tlb);
+	}
+	pteval = huge_ptep_clear_flush(vma, address, pvmw->pte);
+	if (pte_dirty(pteval))
+		folio_mark_dirty(folio);
+
+	VM_WARN_ON(!(flags & TTU_HWPOISON));
+	pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
+	hugetlb_count_sub(folio_nr_pages(folio), mm);
+	set_huge_pte_at(mm, address, pvmw->pte, pteval, hsz);
+	hugetlb_remove_rmap(folio);
+	folio_put_refs(folio, 1);
+
+walk_done:
+	page_vma_mapped_walk_done(pvmw);
+	return ret;
+}
+
+static bool try_to_unmap_hugetlb_one(struct folio *folio,
+		struct vm_area_struct *vma, unsigned long address, void *arg)
+{
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	struct mmu_notifier_range range;
+	enum ttu_flags flags = (enum ttu_flags)(long)arg;
+	bool ret;
+
+	/*
+	 * The try_to_unmap() is only passed a hugetlb folio in the case
+	 * where the hugetlb folio contains a poisoned page.
+	 */
+	VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
+
+	/*
+	 * When racing against e.g. zap_pte_range() on another cpu,
+	 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
+	 * try_to_unmap() may return before folio_mapped() has become false,
+	 * if page table locking is skipped: use TTU_SYNC to wait for that.
+	 */
+	if (flags & TTU_SYNC)
+		pvmw.flags = PVMW_SYNC;
+
+	/*
+	 * For hugetlb, it could be much worse than THP if we need pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the folio can not be freed in this function as call of
+	 * try_to_unmap() must hold a reference on the folio.
+	 */
+	range.end = vma_address_end(&pvmw);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+				address, range.end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+	mmu_notifier_invalidate_range_start(&range);
+	ret = __try_to_unmap_hugetlb_one(folio, vma, &pvmw, &range,
+					 flags);
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1993,7 +2108,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	enum ttu_flags flags = (enum ttu_flags)(long)arg;
 	unsigned long nr_pages = 1, end_addr;
 	unsigned long pfn;
-	unsigned long hsz = 0;
 	int ptes = 0;
 
 	/*
@@ -2007,8 +2121,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
-	 * For hugetlb, it could be much worse if we need to do pud
-	 * invalidation in the case of pmd sharing.
 	 *
 	 * Note that the folio can not be freed in this function as call of
 	 * try_to_unmap() must hold a reference on the folio.
@@ -2016,17 +2128,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	range.end = vma_address_end(&pvmw);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
 				address, range.end);
-	if (folio_test_hugetlb(folio)) {
-		/*
-		 * If sharing is possible, start and end will be adjusted
-		 * accordingly.
-		 */
-		adjust_range_if_pmd_sharing_possible(vma, &range.start,
-						     &range.end);
-
-		/* We need the huge page size for set_huge_pte_at() */
-		hsz = huge_page_size(hstate_vma(vma));
-	}
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -2104,7 +2205,6 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			const softleaf_t entry = softleaf_from_pte(pteval);
 
 			pfn = softleaf_to_pfn(entry);
-			VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
 		}
 
 		subpage = folio_page(folio, pfn - folio_pfn(folio));
@@ -2112,59 +2212,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		anon_exclusive = folio_test_anon(folio) &&
 				 PageAnonExclusive(subpage);
 
-		if (folio_test_hugetlb(folio)) {
-			bool anon = folio_test_anon(folio);
-
-			/*
-			 * The try_to_unmap() is only passed a hugetlb folio
-			 * in the case where the hugetlb folio contains a
-			 * poisoned page.
-			 */
-			VM_WARN_ON_FOLIO(!folio_contain_hwpoisoned_page(folio), folio);
-			/*
-			 * huge_pmd_unshare may unmap an entire PMD page.
-			 * There is no way of knowing exactly which PMDs may
-			 * be cached for this mm, so we must flush them all.
-			 * start/end were already adjusted above to cover this
-			 * range.
-			 */
-			flush_cache_range(vma, range.start, range.end);
-
-			/*
-			 * To call huge_pmd_unshare, i_mmap_rwsem must be
-			 * held in write mode.  Caller needs to explicitly
-			 * do this outside rmap routines.
-			 *
-			 * We also must hold hugetlb vma_lock in write mode.
-			 * Lock order dictates acquiring vma_lock BEFORE
-			 * i_mmap_rwsem.  We can only try lock here and fail
-			 * if unsuccessful.
-			 */
-			if (!anon) {
-				struct mmu_gather tlb;
-
-				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
-				if (!hugetlb_vma_trylock_write(vma))
-					goto walk_abort;
-
-				tlb_gather_mmu_vma(&tlb, vma);
-				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					huge_pmd_unshare_flush(&tlb, vma);
-					tlb_finish_mmu(&tlb);
-					/*
-					 * The PMD table was unmapped,
-					 * consequently unmapping the folio.
-					 */
-					goto walk_done;
-				}
-				hugetlb_vma_unlock_write(vma);
-				tlb_finish_mmu(&tlb);
-			}
-			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-			if (pte_dirty(pteval))
-				folio_mark_dirty(folio);
-		} else if (likely(pte_present(pteval))) {
+		if (likely(pte_present(pteval))) {
 			nr_pages = folio_unmap_pte_batch(folio, &pvmw, flags, pteval);
 			end_addr = address + nr_pages * PAGE_SIZE;
 			flush_cache_range(vma, address, end_addr);
@@ -2201,14 +2249,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 
 		if (folio_contain_hwpoisoned_page(folio) && (flags & TTU_HWPOISON)) {
 			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
-			if (folio_test_hugetlb(folio)) {
-				hugetlb_count_sub(folio_nr_pages(folio), mm);
-				set_huge_pte_at(mm, address, pvmw.pte, pteval,
-						hsz);
-			} else {
-				dec_mm_counter(mm, mm_counter(folio));
-				set_pte_at(mm, address, pvmw.pte, pteval);
-			}
+			dec_mm_counter(mm, mm_counter(folio));
+			set_pte_at(mm, address, pvmw.pte, pteval);
 		} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
 			   !userfaultfd_armed(vma)) {
 			/*
@@ -2341,11 +2383,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			add_mm_counter(mm, mm_counter_file(folio), -nr_pages);
 		}
 discard:
-		if (unlikely(folio_test_hugetlb(folio))) {
-			hugetlb_remove_rmap(folio);
-		} else {
-			folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
-		}
+		folio_remove_rmap_ptes(folio, subpage, nr_pages, vma);
 		if (vma->vm_flags & VM_LOCKED)
 			mlock_drain_local();
 		folio_put_refs(folio, nr_pages);
@@ -2393,7 +2431,8 @@ static int folio_not_mapped(struct folio *folio)
 void try_to_unmap(struct folio *folio, enum ttu_flags flags)
 {
 	struct rmap_walk_control rwc = {
-		.rmap_one = try_to_unmap_one,
+		.rmap_one = folio_test_hugetlb(folio) ?
+				try_to_unmap_hugetlb_one : try_to_unmap_one,
 		.arg = (void *)flags,
 		.done = folio_not_mapped,
 		.anon_lock = folio_lock_anon_vma_read,
-- 
2.34.1

next prev parent reply	other threads:[~2026-05-26  6:37 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-26  6:36 [PATCH v4 00/12] Optimize anonymous large folio unmapping Dev Jain
2026-05-26  6:36 ` [PATCH v4 01/12] mm/rmap: convert page -> folio for hwpoison checks Dev Jain
2026-05-26  6:36 ` Dev Jain [this message]
2026-05-26  6:36 ` [PATCH v4 03/12] mm/rmap: refactor some code around lazyfree folio unmapping Dev Jain
2026-05-26  6:36 ` [PATCH v4 04/12] mm/memory: Batch set uffd-wp markers during zapping Dev Jain
2026-05-26  6:36 ` [PATCH v4 05/12] mm/rmap: batch unmap folios belonging to uffd-wp VMAs Dev Jain
2026-05-26  6:36 ` [PATCH v4 06/12] mm/swap: rename subpage->page in folio_dup_swap/folio_put_swap Dev Jain
2026-05-26  6:36 ` [PATCH v4 07/12] mm/swapfile: Add batched version of folio_dup_swap Dev Jain
2026-05-26  6:36 ` [PATCH v4 08/12] mm/swapfile: Add batched version of folio_put_swap Dev Jain
2026-05-26  6:36 ` [PATCH v4 09/12] mm/rmap: Add batched version of folio_try_share_anon_rmap_pte Dev Jain
2026-05-26  6:36 ` [PATCH v4 10/12] mm/rmap: refactor anon folio unmap in try_to_unmap_one Dev Jain
2026-05-26  6:36 ` [PATCH v4 11/12] mm/mprotect: drop 'sub' from page_anon_exclusive_sub_batch Dev Jain
2026-05-26  6:36 ` [PATCH v4 12/12] mm/rmap: enable batch unmapping of anonymous folios Dev Jain
2026-05-28 16:50 ` [PATCH v4 00/12] Optimize anonymous large folio unmapping Dev Jain

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:430c91c8fe2a dfblob:06ab1158d4cd )
 OR (
bs:"[PATCH v4 02/12] mm/rmap: Add try_to_unmap_hugetlb_one" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260526063635.61721-3-dev.jain@arm.com \
    --to=dev.jain@arm.com \
    --cc=akpm@linux-foundation.org \
    --cc=anshuman.khandual@arm.com \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=bhe@redhat.com \
    --cc=chrisl@kernel.org \
    --cc=david@kernel.org \
    --cc=harry@kernel.org \
    --cc=hughd@google.com \
    --cc=jannh@google.com \
    --cc=kasong@tencent.com \
    --cc=liam@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@suse.com \
    --cc=nphamcs@gmail.com \
    --cc=pfalcato@suse.de \
    --cc=qi.zheng@linux.dev \
    --cc=riel@surriel.com \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=shakeel.butt@linux.dev \
    --cc=shikemeng@huaweicloud.com \
    --cc=surenb@google.com \
    --cc=vbabka@kernel.org \
    --cc=weixugc@google.com \
    --cc=youngjun.park@lge.com \
    --cc=yuanchu@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.