* [PATCH 0/5] Misc rework on hugetlb_fault
@ 2025-06-12 13:46 Oscar Salvador
2025-06-12 13:46 ` [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping Oscar Salvador
` (5 more replies)
0 siblings, 6 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:46 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
RFC -> v1:
- Stop looking up folio in the pagecache for detecting a COW
on a private mapping.
- Document the locks
Hi,
This is a new version of the RFC I sent a couple of weeks ago [1].
It's the conclusion of the discussions that occurred in that thread.
Patch#1 is the fix for the deadlock.
Patch#2 is to document why we take the lock, so none of us have to spend
time again in figuring that out.
Patch#3-5 are a bit of cleanup (removing obsolete comments etc.)
More detail in every patch.
Oscar Salvador (5):
mm,hugetlb: Change mechanism to detect a COW on private mapping
mm,hugetlb: Document the reason to lock the folio in the faulting path
mm,hugetlb: Convert anon_rmap into boolean
mm,hugetlb: Drop obsolete comment about non-present pte and second
faults
mm,hugetlb: Drop unlikelys from hugetlb_fault
mm/hugetlb.c | 98 +++++++++++++++++++---------------------------------
1 file changed, 36 insertions(+), 62 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
@ 2025-06-12 13:46 ` Oscar Salvador
2025-06-13 13:52 ` David Hildenbrand
2025-06-12 13:46 ` [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path Oscar Salvador
` (4 subsequent siblings)
5 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:46 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
hugetlb_wp() checks whether the process is trying to COW on a private mapping
in order to know whether the reservation for that address was already consumed
or not.
If it was consumed and we are the owner of the mapping, the folio will have to
be unmapped from the other processes.
Currently, that check is done by looking up the folio in the pagecache and
comparing it to the folio mapped in our pagetables.
If they differ, it means we already mapped it privately before, consuming a
reservation on the way.
All we are interested in is whether the mapped folio is anonymous, so we can
simplify and check for that instead.
Also, we transition from a trylock to a folio_lock, since the former was only
needed when hugetlb_fault() had to lock both folios, in order to avoid deadlock.
Closes: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 70 +++++++++++++---------------------------------------
1 file changed, 17 insertions(+), 53 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8746ed2fec13..175edafeec67 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6152,8 +6152,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* cannot race with other handlers or page migration.
* Keep the pte_same checks anyway to make transition from the mutex easier.
*/
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
- struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
@@ -6215,16 +6214,17 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
PageAnonExclusive(&old_folio->page), &old_folio->page);
/*
- * If the process that created a MAP_PRIVATE mapping is about to
- * perform a COW due to a shared page count, attempt to satisfy
- * the allocation without using the existing reserves. The pagecache
- * page is used to determine if the reserve at this address was
- * consumed or not. If reserves were used, a partial faulted mapping
- * at the time of fork() could consume its reserves on COW instead
- * of the full address range.
+ * If the process that created a MAP_PRIVATE mapping is about to perform
+ * a COW due to a shared page count, attempt to satisfy the allocation
+ * without using the existing reserves.
+ * In order to determine whether this is a COW on a MAP_PRIVATE mapping it
+ * is enough to check whether the old_folio is anonymous. This means that
+ * the reserve for this address was consumed. If reserves were used, a
+ * partial faulted mapping at the time of fork() could consume its reserves
+ * on COW instead of the full address range.
*/
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
- old_folio != pagecache_folio)
+ folio_test_anon(old_folio))
cow_from_owner = true;
folio_get(old_folio);
@@ -6603,7 +6603,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
hugetlb_count_add(pages_per_huge_page(h), mm);
if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
- ret = hugetlb_wp(folio, vmf);
+ ret = hugetlb_wp(vmf);
}
spin_unlock(vmf->ptl);
@@ -6670,11 +6670,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
vm_fault_t ret;
u32 hash;
- struct folio *folio = NULL;
- struct folio *pagecache_folio = NULL;
+ struct folio *folio;
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
- int need_wait_lock = 0;
struct vm_fault vmf = {
.vma = vma,
.address = address & huge_page_mask(h),
@@ -6769,8 +6767,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* If we are going to COW/unshare the mapping later, we examine the
* pending reservations for this page now. This will ensure that any
* allocations necessary to record that reservation occur outside the
- * spinlock. Also lookup the pagecache page now as it is used to
- * determine if a reservation has been consumed.
+ * spinlock.
*/
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
@@ -6780,11 +6777,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
/* Just decrements count, does not deallocate */
vma_end_reservation(h, vma, vmf.address);
-
- pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
- vmf.pgoff);
- if (IS_ERR(pagecache_folio))
- pagecache_folio = NULL;
}
vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
@@ -6798,10 +6790,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
(flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
if (!userfaultfd_wp_async(vma)) {
spin_unlock(vmf.ptl);
- if (pagecache_folio) {
- folio_unlock(pagecache_folio);
- folio_put(pagecache_folio);
- }
hugetlb_vma_unlock_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return handle_userfault(&vmf, VM_UFFD_WP);
@@ -6813,23 +6801,14 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Fallthrough to CoW */
}
- /*
- * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
- * pagecache_folio, so here we need take the former one
- * when folio != pagecache_folio or !pagecache_folio.
- */
+ /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
folio = page_folio(pte_page(vmf.orig_pte));
- if (folio != pagecache_folio)
- if (!folio_trylock(folio)) {
- need_wait_lock = 1;
- goto out_ptl;
- }
-
+ folio_lock(folio);
folio_get(folio);
if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
if (!huge_pte_write(vmf.orig_pte)) {
- ret = hugetlb_wp(pagecache_folio, &vmf);
+ ret = hugetlb_wp(&vmf);
goto out_put_page;
} else if (likely(flags & FAULT_FLAG_WRITE)) {
vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6840,16 +6819,10 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, vmf.address, vmf.pte);
out_put_page:
- if (folio != pagecache_folio)
- folio_unlock(folio);
+ folio_unlock(folio);
folio_put(folio);
out_ptl:
spin_unlock(vmf.ptl);
-
- if (pagecache_folio) {
- folio_unlock(pagecache_folio);
- folio_put(pagecache_folio);
- }
out_mutex:
hugetlb_vma_unlock_read(vma);
@@ -6861,15 +6834,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
vma_end_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- /*
- * Generally it's safe to hold refcount during waiting page lock. But
- * here we just wait to defer the next page fault to avoid busy loop and
- * the page is not used after unlocked before returning from the current
- * page fault. So we are safe from accessing freed page, even if we wait
- * here without taking refcount.
- */
- if (need_wait_lock)
- folio_wait_locked(folio);
return ret;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
2025-06-12 13:46 ` [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping Oscar Salvador
@ 2025-06-12 13:46 ` Oscar Salvador
2025-06-13 13:56 ` David Hildenbrand
2025-06-12 13:46 ` [PATCH 3/5] mm,hugetlb: Convert anon_rmap into boolean Oscar Salvador
` (3 subsequent siblings)
5 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:46 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
Recent conversations showed that there was a misunderstanding about why we
were locking the folio prior to calling hugetlb_wp().
Document explicitly why we need to take the lock, explaining along the way that
although the timespan for the locking of anonymous and file folios differs,
it would require major surgery to represent that difference with the current
code.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/hugetlb.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 175edafeec67..dfa09fc3b2c6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6537,6 +6537,10 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
}
new_pagecache_folio = true;
} else {
+ /*
+ * hugetlb_wp() expects the folio to be locked in order to
+ * check whether we can re-use this page exclusively for us.
+ */
folio_lock(folio);
anon_rmap = 1;
}
@@ -6801,7 +6805,19 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Fallthrough to CoW */
}
- /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
+ /*
+ * We need to lock the folio before calling hugetlb_wp().
+ * Either the folio is in the pagecache and we need to copy it over
+ * to another folio, so it must remain stable throughout the operation,
+ * or the folio is anonymous and we need to lock it in order to check
+ * whether we can re-use it and mark it exclusive for this process.
+ * The timespan for the lock differs depending on the type, since
+ * anonymous folios only need to hold the lock while checking whether we
+ * can re-use it, while we need to hold it throughout the copy in case
+ * we are dealing with a folio from a pagecache.
+ * Representing this difference would be tricky with the current code,
+ * so just hold the lock for the duration of hugetlb_wp().
+ */
folio = page_folio(pte_page(vmf.orig_pte));
folio_lock(folio);
folio_get(folio);
--
2.49.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 3/5] mm,hugetlb: Convert anon_rmap into boolean
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
2025-06-12 13:46 ` [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping Oscar Salvador
2025-06-12 13:46 ` [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path Oscar Salvador
@ 2025-06-12 13:46 ` Oscar Salvador
2025-06-13 13:48 ` David Hildenbrand
2025-06-12 13:47 ` [PATCH 4/5] mm,hugetlb: Drop obsolete comment about non-present pte and second faults Oscar Salvador
` (2 subsequent siblings)
5 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:46 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
We use anon_rmap as a boolean, so declare it as one.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/hugetlb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfa09fc3b2c6..62bc3808f99e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6431,7 +6431,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
struct mm_struct *mm = vma->vm_mm;
struct hstate *h = hstate_vma(vma);
vm_fault_t ret = VM_FAULT_SIGBUS;
- int anon_rmap = 0;
+ bool anon_rmap = false;
unsigned long size;
struct folio *folio;
pte_t new_pte;
@@ -6542,7 +6542,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
* check whether we can re-use this page exclusively for us.
*/
folio_lock(folio);
- anon_rmap = 1;
+ anon_rmap = true;
}
} else {
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 4/5] mm,hugetlb: Drop obsolete comment about non-present pte and second faults
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
` (2 preceding siblings ...)
2025-06-12 13:46 ` [PATCH 3/5] mm,hugetlb: Convert anon_rmap into boolean Oscar Salvador
@ 2025-06-12 13:47 ` Oscar Salvador
2025-06-12 13:47 ` [PATCH 5/5] mm,hugetlb: Drop unlikelys from hugetlb_fault Oscar Salvador
2025-06-13 8:55 ` [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
5 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:47 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
There is a comment in hugetlb_fault() that does not hold anymore.
This one:
/*
* vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
* point, so this check prevents the kernel from going below assuming
* that we have an active hugepage in pagecache. This goto expects
* the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
* check will properly handle it.
*/
This was written because back in the day we used to do:
hugetlb_fault () {
ptep = huge_pte_offset(...)
if (ptep) {
entry = huge_ptep_get(ptep)
if (unlikely(is_hugetlb_entry_migration(entry))
...
else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
...
}
...
...
/*
* entry could be a migration/hwpoison entry at this point, so this
* check prevents the kernel from going below assuming that we have
* a active hugepage in pagecache. This goto expects the 2nd page fault,
* and is_hugetlb_entry_(migration|hwpoisoned) check will properly
* handle it.
*/
if (!pte_present(entry))
goto out_mutex;
...
}
The code was designed to check for hwpoisoned/migration entries upfront,
and then bail out if further down the pte was not present anymore,
relying on the second fault to properly handle migration/hwpoison entries that
time around.
The way we handle this is different nowadays, so drop the misleading comment.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/hugetlb.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 62bc3808f99e..ad377e24b7d0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6742,13 +6742,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
ret = 0;
- /*
- * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
- * point, so this check prevents the kernel from going below assuming
- * that we have an active hugepage in pagecache. This goto expects
- * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
- * check will properly handle it.
- */
+ /* Not present, either a migration or a hwpoisoned entry */
if (!pte_present(vmf.orig_pte)) {
if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH 5/5] mm,hugetlb: Drop unlikelys from hugetlb_fault
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
` (3 preceding siblings ...)
2025-06-12 13:47 ` [PATCH 4/5] mm,hugetlb: Drop obsolete comment about non-present pte and second faults Oscar Salvador
@ 2025-06-12 13:47 ` Oscar Salvador
2025-06-13 8:55 ` [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
5 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-12 13:47 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel, Oscar Salvador
The unlikely hints predate an era where we were checking for hwpoisoned/migration
entries prior to checking whether the pte was present.
Currently, we check whether the pte is a migration/hwpoison entry after we
have checked that it is not present, so it must be either one or the other.
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/hugetlb.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ad377e24b7d0..d6b0f2b68beb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6744,7 +6744,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Not present, either a migration or a hwpoisoned entry */
if (!pte_present(vmf.orig_pte)) {
- if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
+ if (is_hugetlb_entry_migration(vmf.orig_pte)) {
/*
* Release the hugetlb fault lock now, but retain
* the vma lock, because it is needed to guard the
@@ -6755,7 +6755,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
migration_entry_wait_huge(vma, vmf.address, vmf.pte);
return 0;
- } else if (unlikely(is_hugetlb_entry_hwpoisoned(vmf.orig_pte)))
+ } else if (is_hugetlb_entry_hwpoisoned(vmf.orig_pte))
ret = VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
goto out_mutex;
--
2.49.0
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH 0/5] Misc rework on hugetlb_fault
2025-06-12 13:46 [PATCH 0/5] Misc rework on hugetlb_fault Oscar Salvador
` (4 preceding siblings ...)
2025-06-12 13:47 ` [PATCH 5/5] mm,hugetlb: Drop unlikelys from hugetlb_fault Oscar Salvador
@ 2025-06-13 8:55 ` Oscar Salvador
5 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-13 8:55 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Muchun Song, James Houghton, Peter Xu,
Gavin Guo, linux-mm, linux-kernel
On Thu, Jun 12, 2025 at 03:46:56PM +0200, Oscar Salvador wrote:
> This is a new version of the RFC I sent a couple of weeks ago [1].
> It's the conclusion of the discussions that occurred in that thread.
>
> Patch#1 is the fix for the deadlock.
> Patch#2 is to document why we take the lock, so none of us have to spend
> time again in figuring that out.
> Patch#3-5 are a bit of cleanup (removing obsolete comments etc.)
Andrew told me that this is very vague for a cover letter, and while this patchset
is sort of a refactor/cleanup/fixup mix, I acknowledge that it is far
from optimal for giving people some context.
So let me try again, and I apologize.
"
This patchset aims to give some love to the hugetlb faulting path, doing so
by removing obsolete comments that are no longer true, documenting in a clear
way the reason we take certain locks, and changing the mechanism we use to
determine whether we are COWing a private mapping already.
The most important patch of the series is #1, as it fixes a deadlock that
was described in [1], where two processes were holding the same lock
for the folio in the pagecache, and then deadlocked in the mutex.
Looking up and locking the folio in the pagecache was done to check whether
that folio was the same folio we had mapped in our pagetables, meaning that if it
was different we knew that we had already mapped that folio privately, so any
further CoW would be made on a private mapping, which led us to the question:
__Was the reservation for that address consumed?__
That is all we care about, because if it was indeed consumed and we are the
owner and we cannot allocate more folios, we need to unmap the folio from the
other processes' pagetables and make it exclusive for us.
We figured we do not need to look up the folio at all, and it is just enough to
check whether the folio we have mapped is anonymous, which means we mapped it
privately, so the reservation was indeed consumed.
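For reference, the check in hugetlb_wp() then boils down to this (taken
from patch#1):

	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
	    folio_test_anon(old_folio))
		cow_from_owner = true;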
Patch#2 expands a bit more on why we need to lock the folio mapped in our
pagetables, whether it is anonymous or not.
For anonymous folios we need the lock in order to check whether we can re-use
it exclusively, while for file folios we need to hold the lock to keep the
folio stable throughout the copy.
We also hold it even if we are read-faulting, just to make sure it does not go
away. For this last case, we are already covered by the hugetlb_mutex, but
in order to bring hugetlb closer to the general faulting path, let us rely on
the lock instead, as do_read_fault() does.
Patch#3 is merely a correctness issue if you will, while Patch#4 and
Patch#5 remove obsolete comments and assumptions that are no longer
true.
[1] https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
"
Shame on me, this should have been the cover letter.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 3/5] mm,hugetlb: Convert anon_rmap into boolean
2025-06-12 13:46 ` [PATCH 3/5] mm,hugetlb: Convert anon_rmap into boolean Oscar Salvador
@ 2025-06-13 13:48 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-13 13:48 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: Muchun Song, James Houghton, Peter Xu, Gavin Guo, linux-mm,
linux-kernel
On 12.06.25 15:46, Oscar Salvador wrote:
> We use anon_rmap as a boolean, so declare it as one.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
> mm/hugetlb.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfa09fc3b2c6..62bc3808f99e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6431,7 +6431,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> struct mm_struct *mm = vma->vm_mm;
> struct hstate *h = hstate_vma(vma);
> vm_fault_t ret = VM_FAULT_SIGBUS;
> - int anon_rmap = 0;
> + bool anon_rmap = false;
> unsigned long size;
> struct folio *folio;
> pte_t new_pte;
> @@ -6542,7 +6542,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> * check whether we can re-use this page exclusively for us.
> */
> folio_lock(folio);
> - anon_rmap = 1;
> + anon_rmap = true;
> }
> } else {
> /*
While at it, I would rename that to "new_anon_folio" or sth like that.
It's used to distinguish whether we create an anon or a file folio.
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping
2025-06-12 13:46 ` [PATCH 1/5] mm,hugetlb: Change mechanism to detect a COW on private mapping Oscar Salvador
@ 2025-06-13 13:52 ` David Hildenbrand
0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-13 13:52 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: Muchun Song, James Houghton, Peter Xu, Gavin Guo, linux-mm,
linux-kernel
On 12.06.25 15:46, Oscar Salvador wrote:
> hugetlb_wp() checks whether the process is trying to COW on a private mapping
> in order to know whether the reservation for that address was already consumed
> or not.
> If it was consumed and we are the owner of the mapping, the folio will have to
> be unmapped from the other processes.
>
> Currently, that check is done by looking up the folio in the pagecache and
> comparing it to the folio mapped in our pagetables.
> If they differ, it means we already mapped it privately before, consuming a
> reservation on the way.
> All we are interested in is whether the mapped folio is anonymous, so we can
> simplify and check for that instead.
>
> Also, we transition from a trylock to a folio_lock, since the former was only
> needed when hugetlb_fault() had to lock both folios, in order to avoid deadlock.
>
> Closes: https://lore.kernel.org/lkml/20250513093448.592150-1-gavinguo@igalia.com/
If there is a Closes: there should probably be a Fixes: and a Reported-by:
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Peter Xu <peterx@redhat.com>
> ---
Nothing jumped out at me; we'll learn if we missed anything soon, I guess :)
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-12 13:46 ` [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path Oscar Salvador
@ 2025-06-13 13:56 ` David Hildenbrand
2025-06-13 14:23 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-13 13:56 UTC (permalink / raw)
To: Oscar Salvador, Andrew Morton
Cc: Muchun Song, James Houghton, Peter Xu, Gavin Guo, linux-mm,
linux-kernel
On 12.06.25 15:46, Oscar Salvador wrote:
> Recent conversations showed that there was a misunderstanding about why we
> were locking the folio prior to calling hugetlb_wp().
> Document explicitly why we need to take the lock, explaining on the way that
> although the timespan for the locking of anonymous and file folios is different,
> it would require a major surgery to represent that difference with the current
> code.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
> mm/hugetlb.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 175edafeec67..dfa09fc3b2c6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6537,6 +6537,10 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
> }
> new_pagecache_folio = true;
> } else {
> + /*
> + * hugetlb_wp() expects the folio to be locked in order to
> + * check whether we can re-use this page exclusively for us.
> + */
> folio_lock(folio);
> anon_rmap = 1;
> }
> @@ -6801,7 +6805,19 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> /* Fallthrough to CoW */
> }
>
> - /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
> + /*
> + * We need to lock the folio before calling hugetlb_wp().
> + * Either the folio is in the pagecache and we need to copy it over
> + * to another file, so it must remain stable throughout the operation,
But as discussed, why is that the case? We don't need that for ordinary
pages, and existing folio mappings can already concurrently modify the page?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-13 13:56 ` David Hildenbrand
@ 2025-06-13 14:23 ` Oscar Salvador
2025-06-13 19:57 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-13 14:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Fri, Jun 13, 2025 at 03:56:15PM +0200, David Hildenbrand wrote:
> On 12.06.25 15:46, Oscar Salvador wrote:
> > - /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
> > + /*
> > + * We need to lock the folio before calling hugetlb_wp().
> > + * Either the folio is in the pagecache and we need to copy it over
> > + * to another folio, so it must remain stable throughout the operation,
>
> But as discussed, why is that the case? We don't need that for ordinary
> pages, and existing folio mappings can already concurrently modify the page?
The normal faulting path takes the lock when we fault in a file read-only or
to map it privately.
That is done via __do_fault() or do_cow_fault(), in __do_fault()->vma->vm_ops->fault().
E.g. filemap_fault() will locate the page and lock it.
And it will hold the lock during the entire operation; note that we only unlock
after we have called finish_fault().
The page can't go away because filemap_fault() also gets a reference on
it, so I guess it's to hold it stable.
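Roughly, the flow I mean is (just a sketch, not the exact code):

do_cow_fault()
  __do_fault()
    vma->vm_ops->fault()   /* e.g. filemap_fault(): takes a ref + locks the page */
  ...copy the pagecache page into the new private page...
  finish_fault()           /* map the new page into the pagetables */
  folio_unlock()           /* the pagecache page stays locked until here */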
Now, when we get a wp fault for an already private mapping, via do_wp_page(), we
only take the lock to check whether we can re-use the page for us.
We don't hold it during the copy.
The only reason we hold it in hugetlb for the entire hugetlb_wp() is that making
this distinction on the lock's timespan, depending on what we are dealing with,
would be tricky with the current code.
We would need to do some dancing of release-and-take-it depending on
what we want to do, and frankly it seems overcomplicated at this point.
I plan to do some more work in that area, and potentially glue it with
the generic faulting path (I have some ideas but I need some other works
to land first)
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-13 14:23 ` Oscar Salvador
@ 2025-06-13 19:57 ` David Hildenbrand
2025-06-13 21:47 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-13 19:57 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On 13.06.25 16:23, Oscar Salvador wrote:
> On Fri, Jun 13, 2025 at 03:56:15PM +0200, David Hildenbrand wrote:
>> On 12.06.25 15:46, Oscar Salvador wrote:
>>> - /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
>>> + /*
>>> + * We need to lock the folio before calling hugetlb_wp().
>>> + * Either the folio is in the pagecache and we need to copy it over
>>> + * to another folio, so it must remain stable throughout the operation,
>>
>> But as discussed, why is that the case? We don't need that for ordinary
>> pages, and existing folio mappings can already concurrently modify the page?
>
> The normal faulting path takes the lock when we fault in a file read-only or
> to map it privately.
> That is done via __do_fault() or do_cow_fault(), in __do_fault()->vma->vm_ops->fault().
> E.g. filemap_fault() will locate the page and lock it.
> And it will hold the lock during the entire operation; note that we only unlock
> after we have called finish_fault().
> The page can't go away because filemap_fault() also gets a reference on
> it, so I guess it's to hold it stable.
>
What I meant is:
Assume we have a pagecache page mapped into our page tables R/O
(MAP_PRIVATE mapping).
During a write fault on such a pagecache page, we end up in
do_wp_page()->wp_page_copy() we perform the copy via
__wp_page_copy_user() without the folio lock.
In wp_page_copy(), we retake the pt lock, to make sure that the page is
still mapped (pte_same). If the page is no longer mapped, we retry the
fault.
In that case, we only want to make sure that the folio is still mapped
after possibly dropping the page table lock in between.
As we are holding an additional folio reference in
do_wp_page()->wp_page_copy(), the folio cannot get freed concurrently.
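So the ordering there is roughly (a sketch of the above):

do_wp_page()
  folio_get(old_folio)     /* extra reference: the folio cannot get freed */
  wp_page_copy()
    __wp_page_copy_user()  /* copy without holding the folio lock */
    ...retake PTL, pte_same()...  /* still mapped? if not, retry the fault */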
There is indeed the do_cow_fault() path where we avoid faulting in the
pagecache page in the first place. So no page table reference, and I can
understand why we would need the folio lock there.
Regarding hugetlb_no_page(): I think we could drop the folio lock for a
pagecache folio after inserting the folio into the page table. Just like
do_wp_page()->wp_page_copy(), we would have to verify again under PTL if
the folio is still mapped
... which we already do through pte_same() checks?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-13 19:57 ` David Hildenbrand
@ 2025-06-13 21:47 ` Oscar Salvador
2025-06-14 9:07 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-13 21:47 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Fri, Jun 13, 2025 at 09:57:23PM +0200, David Hildenbrand wrote:
> What I meant is:
>
> Assume we have a pagecache page mapped into our page tables R/O (MAP_PRIVATE
> mapping).
>
> During a write fault on such a pagecache page, we end up in
> do_wp_page()->wp_page_copy() we perform the copy via __wp_page_copy_user()
> without the folio lock.
Yes, it would be similar to doing
hugetlb_fault()->hugetlb_no_page() which would map it R/O.
Then, if we write to it, we will go to hugetlb_wp().
Since it is a private mapping, we would only need to lock the folio to
see if we can re-use it (the wp_can_reuse_anon_folio() analog to
hugetlb).
> In wp_page_copy(), we retake the pt lock, to make sure that the page is
> still mapped (pte_same). If the page is no longer mapped, we retry the
> fault.
>
> In that case, we only want to make sure that the folio is still mapped after
> possibly dropping the page table lock in between.
>
> As we are holding an additional folio reference in
> do_wp_page()->wp_page_copy(), the folio cannot get freed concurrently.
>
>
> There is indeed the do_cow_fault() path where we avoid faulting in the
> pagecache page in the first place. So no page table reference, and I can
> understand why we would need the folio lock there.
But do_cow_fault() does take a reference via __do_fault()->filemap_fault().
> Regarding hugetlb_no_page(): I think we could drop the folio lock for a
> pagecache folio after inserting the folio into the page table. Just like
> do_wp_page()->wp_page_copy(), we would have to verify again under PTL if the
> folio is still mapped
>
> ... which we already do through pte_same() checks?
But that is somewhat similar to what we do in the generic faulting path.
Assume you fault in a file for a private mapping and do COW.
So, do_pte_missing()->do_fault()->do_cow_fault().
do_cow_fault()->__do_fault() will a) get a reference and b) lock the folio.
And then we will proceed with copying the file to the page we will map privately.
This would be something like hugetlb_fault()->hugetlb_no_page()->hugetlb_wp().
So we have to hold the lock throughout hugetlb_wp() for file pages we are copying
to private mappings.
Now, let us assume you map the file R/O. And after a while you write-fault to it.
In the generic faulting path, that will go through:
do_pte_missing()->do_fault()->do_read_fault()
do_wp_page()->wp_page_copy()
wp_page_copy(), which indeed doesn't hold the lock (but takes a reference).
Maybe it's because it's Friday, but I'm confused as to why
do_pte_missing()->do_fault()->do_cow_fault() holds the lock while do_wp_page() doesn't,
although it might be the file's page we have to copy.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-13 21:47 ` Oscar Salvador
@ 2025-06-14 9:07 ` Oscar Salvador
2025-06-16 9:22 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-14 9:07 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Fri, Jun 13, 2025 at 11:47:50PM +0200, Oscar Salvador wrote:
> Maybe it's because it's Friday, but I'm confused as to why
> do_pte_missing()->do_fault()->do_cow_fault() holds the lock while do_wp_page() doesn't,
> although it might be the file's page we have to copy.
Scratch that, I see my confusion.
The first time we map the file privately, the folio must remain stable.
But if we already mapped it privately before (R/O), and we write-fault on it,
it doesn't need to be stable (e.g. uptodate).
But I think my comment on hugetlb_no_page() still holds, because
hugetlb_fault->hugetlb_no_page->hugetlb_wp
would be similar to do_pte_missing->do_cow_fault, and in do_cow_fault() we hold
both the reference and the lock.
Where we might not need the lock is in hugetlb_fault->hugetlb_wp, which
would be similar to do_wp_page()->wp_page_copy().
Of course we will need to take it if it is an anonymous folio because we need
to check the re-use case.
So, it gets complicated because hugetlb_no_page() needs to call
hugetlb_wp() with the lock held in case it is a pagecache folio, and
at the same time hugetlb_wp() needs to take the lock if it is an anonymous
one for the re-use case.
So, all in all, I think that it is easier when both callers of hugetlb_wp()
hold the lock, so we do not have to do weird dances, and document why it is done.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-14 9:07 ` Oscar Salvador
@ 2025-06-16 9:22 ` David Hildenbrand
2025-06-16 14:10 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-16 9:22 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On 14.06.25 11:07, Oscar Salvador wrote:
> On Fri, Jun 13, 2025 at 11:47:50PM +0200, Oscar Salvador wrote:
>> Maybe it's because it's Friday, but I'm confused as to why
>> do_pte_missing()->do_fault()->do_cow_fault() holds the lock while do_wp_page() doesn't,
>> although it might be the file's page we have to copy.
>
> Scratch that, I see my confusion.
> The first time we map the file privately, the folio must remain stable.
> But if we already mapped it privately before (R/O), and we write-fault on it,
> it doesn't need to be stable (e.g. uptodate).
>
> But I think my comment on hugetlb_no_page() still holds, because
>
> hugetlb_fault->hugetlb_no_page->hugetlb_wp
>
> would be similar to do_pte_missing->do_cow_fault, and in do_cow_fault() we hold
> both the reference and the lock.
Well, there is an important difference:
hugetlb_fault->hugetlb_no_page->hugetlb_wp
already *mapped* the pagecache page into the page table.
See
if (anon_rmap)
hugetlb_add_new_anon_rmap(folio, vma, vmf->address);
else
hugetlb_add_file_rmap(folio);
So at that point it would be "stable" unless I am missing something?
So once we are in hugetlb_wp(), that path much rather corresponds to
do_wp_page()->wp_page_copy.
> Were we might not need the lock is in hugetlb_fault->hugetlb_wp, which
> would be similar to do_wp_page()->wp_page_copy.
Exactly.
> Of course we will need to take it if it is an anonymous folio because we need
> to check the re-use case.
Yes.
>
> So, it gets complicated because hugetlb_no_page() needs to call
> hugetlb_wp() with the lock held in case it is a pagecache folio,
Per above discussion: why? After we mapped the pagecache folio, we can
unlock the folio I think.
and
> at the same time hugetlb_wp() needs to take the lock if it is an anonymous
> one for the re-use case.
I think hugetlb_wp() really only needs the lock for the anon folio.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-16 9:22 ` David Hildenbrand
@ 2025-06-16 14:10 ` Oscar Salvador
2025-06-16 14:41 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-16 14:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Mon, Jun 16, 2025 at 11:22:43AM +0200, David Hildenbrand wrote:
> hugetlb_fault->hugetlb_no_page->hugetlb_wp
>
> already *mapped* the pagecache page into the page table.
>
> See
> if (anon_rmap)
> hugetlb_add_new_anon_rmap(folio, vma, vmf->address);
> else
> hugetlb_add_file_rmap(folio);
>
> So at that point it would be "stable" unless I am missing something?
>
> So once we are in hugetlb_wp(), that path much rather corresponds to
> do_wp_page()->wp_page_copy.
Yes, that's right.
That's something I've been thinking over the weekend.
E.g. do_cow_fault() first copies the page from the pagecache to a new one
and __then__ maps that page into the page tables.
In hugetlb_no_page->hugetlb_wp, the workflow is a bit different:
we first map it and then we copy it if we need to.
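Side by side, that is roughly (just a sketch):

do_cow_fault():                      hugetlb_no_page() + hugetlb_wp():
  lock the pagecache folio             lock the pagecache folio
  copy it into a new page              map it (set_huge_pte_at())
  map the new page (finish_fault())    copy it into a new page (hugetlb_wp())
  unlock                               unlock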
What do you mean by stable?
In the generic faulting path, we're not worried about the page going away
because we hold a reference, so I guess the lock must be to keep content stable?
I mean, yes, after we have mapped the page privately into the pagetables,
we no longer have any business with content integrity, so given this rule, yes,
I guess hugetlb_wp() wouldn't need the lock (for !anonymous) because we have
already mapped it privately at that point.
But there's something I don't fully understand and it makes me feel uneasy.
If the lock in the generic faulting path is to keep content stable till we
have mapped it privately, wouldn't it be more correct to also hold it
during the copy in hugetlb_wp(), to kinda emulate that?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-16 14:10 ` Oscar Salvador
@ 2025-06-16 14:41 ` David Hildenbrand
2025-06-17 10:03 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-16 14:41 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On 16.06.25 16:10, Oscar Salvador wrote:
> On Mon, Jun 16, 2025 at 11:22:43AM +0200, David Hildenbrand wrote:
>
>> hugetlb_fault->hugetlb_no_page->hugetlb_wp
>>
>> already *mapped* the pagecache page into the page table.
>>
>> See
>> if (anon_rmap)
>> hugetlb_add_new_anon_rmap(folio, vma, vmf->address);
>> else
>> hugetlb_add_file_rmap(folio);
>>
>> So at that point it would be "stable" unless I am missing something?
>>
>> So once we are in hugetlb_wp(), that path much rather corresponds to
>> do_wp_page()->wp_page_copy.
>
> Yes, that's right.
> That's something I've been thinking over the weekend.
>
> E.g. do_cow_fault() first copies the page from the pagecache to a new one
> and __then__ maps that page into the page tables.
> In hugetlb_no_page->hugetlb_wp, the workflow is a bit different:
>
> we first map it and then we copy it if we need to.
>
> What do you mean by stable?
The same "stable" you used in the doc, that I complained about ;)
> In the generic faulting path, we're not worried about the page going away
> because we hold a reference, so I guess the lock must be to keep content stable?
What you want to avoid, IIRC, is someone doing a truncation/reclaim on
the folio while you are mapping it.
Take a look at truncate_inode_pages_range() where we do a folio_lock()
around truncate_inode_folio().
In other words, while you hold the folio lock (and verified that the
folio was not truncated yet: for example, that folio->mapping is still
set), you know that it cannot get truncated concurrently -- without
holding other expensive locks.
Observe how truncate_cleanup_folio() calls
if (folio_mapped(folio))
unmap_mapping_folio(folio);
To remove all page table mappings.
So while holding the folio lock, new page table mappings are not
expected to appear (IIRC).
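IOW, the usual pattern is something like this (a sketch, as far as I
understand the common filemap code):

	folio_lock(folio);
	if (folio->mapping != mapping) {
		/* truncated while we were waiting for the lock: retry */
		folio_unlock(folio);
		...
	}
	/* from here on, truncation cannot complete until we unlock */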
>
> I mean, yes, after we have mapped the page privately into the pagetables,
> we don't have business about content-integrity anymore, so given this rule, yes,
> I guess hugetlb_wp() wouldn't need the lock (for !anonymous) because we already
> have mapped it privately at that point.
That's my understanding. And while holding the PTL it cannot get
unmapped. Whenever you temporarily drop the PTL, you have to do a
pte_same() check to make sure concurrent truncation didn't happen.
So far my understanding at least of common filemap code.
>
> But there's something I don't fully understand and it makes me feel uneasy.
> If the lock in the generic faulting path is to keep content stable till we
> have mapped it privately, wouldn't it be more correct to also hold it
> during the copy in hugetlb_wp(), to kinda emulate that?
As long as there is a page table mapping, it cannot get truncated. So if
you find a PTE under PTL that maps that folio, truncation could not have
happened.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-16 14:41 ` David Hildenbrand
@ 2025-06-17 10:03 ` Oscar Salvador
2025-06-17 11:27 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-17 10:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Mon, Jun 16, 2025 at 04:41:20PM +0200, David Hildenbrand wrote:
> On 16.06.25 16:10, Oscar Salvador wrote:
> > What do you mean by stable?
>
> The same "stable" you used in the doc, that I complained about ;)
Touche :-D
> > In the generic faulting path, we're not worried about the page going away
> > because we hold a reference, so I guess the lock must be to keep content stable?
>
> What you want to avoid, IIRC, is someone doing a truncation/reclaim on the
> folio while you are mapping it.
Ok, I see. I thought it was more about holding off writes, but this makes sense.
> Take a look at truncate_inode_pages_range() where we do a folio_lock()
> around truncate_inode_folio().
>
> In other words, while you hold the folio lock (and verified that the folio
> was not truncated yet: for example, that folio->mapping is still set), you
> know that it cannot get truncated concurrently -- without holding other
> expensive locks.
>
> Observe how truncate_cleanup_folio() calls
>
> if (folio_mapped(folio))
> unmap_mapping_folio(folio);
>
> To remove all page table mappings.
>
> So while holding the folio lock, new page table mappings are not expected to
> appear (IIRC).
Ah ok, so it's more that we don't end up mapping something that's not there
anymore (or something completely different).
> > I mean, yes, after we have mapped the page privately into the pagetables,
> > we no longer have any business with content integrity, so given this rule, yes,
> > I guess hugetlb_wp() wouldn't need the lock (for !anonymous) because we have
> > already mapped it privately at that point.
>
> That's my understanding. And while holding the PTL it cannot get unmapped.
> Whenever you temporarily drop the PTL, you have to do a pte_same() check to
> make sure concurrent truncation didn't happen.
Yap, hugetlb_wp() drops the locks temporarily when it needs to unmap the private
page from other processes, but then does the pte_same() check.
> So far my understanding at least of common filemap code.
>
> >
> > But there's something I don't fully understand and it makes me feel uneasy.
> > If the lock in the generic faulting path is to keep content stable till we
> > have mapped it privately, wouldn't it be more correct to also hold it
> > during the copy in hugetlb_wp(), to kinda emulate that?
> As long as there is a page table mapping, it cannot get truncated. So if you
> find a PTE under PTL that maps that folio, truncation could not have
> happened.
I see, this makes a lot of sense, thanks for walking me through it, David!
Alright then, with all this clear, we should:
- Not take any locks on hugetlb_fault()->hugetlb_wp(), hugetlb_wp() will take it
if it's an anonymous folio (re-use check)
- Drop the lock in hugetlb_no_page() after we have mapped the page in
the pagetables
- hugetlb_wp() will take the lock IFF the folio is anonymous
This will lead to something like the following:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dfa09fc3b2c6..4d48cda8a56d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
* in scenarios that used to work. As a side effect, there can still
* be leaks between processes, for example, with FOLL_GET users.
*/
+ if (folio_test_anon(old_folio))
+ folio_lock(old_folio);
if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
if (!PageAnonExclusive(&old_folio->page)) {
folio_move_anon_rmap(old_folio, vma);
@@ -6212,6 +6214,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
}
VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
PageAnonExclusive(&old_folio->page), &old_folio->page);
+ if (folio_test_anon(old_folio))
+ folio_unlock(old_folio);
/*
* If the process that created a MAP_PRIVATE mapping is about to perform
@@ -6537,11 +6541,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
}
new_pagecache_folio = true;
} else {
- /*
- * hugetlb_wp() expects the folio to be locked in order to
- * check whether we can re-use this page exclusively for us.
- */
- folio_lock(folio);
anon_rmap = 1;
}
} else {
@@ -6558,7 +6557,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
/* Check for page in userfault range. */
if (userfaultfd_minor(vma)) {
- folio_unlock(folio);
+ if (!anon_rmap)
+ folio_unlock(folio);
folio_put(folio);
/* See comment in userfaultfd_missing() block above */
if (!hugetlb_pte_stable(h, mm, vmf->address, vmf->pte, vmf->orig_pte)) {
@@ -6604,6 +6604,13 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
new_pte = huge_pte_mkuffd_wp(new_pte);
set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h));
+ /*
+ * This folio cannot have been truncated since we were holding the lock,
+ * and we just mapped it into the pagetables. Drop the lock now.
+ */
+ if (!anon_rmap)
+ folio_unlock(folio);
+
hugetlb_count_add(pages_per_huge_page(h), mm);
if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
/* Optimization, do the COW without a second fault */
@@ -6619,8 +6626,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
*/
if (new_folio)
folio_set_hugetlb_migratable(folio);
-
- folio_unlock(folio);
out:
hugetlb_vma_unlock_read(vma);
@@ -6639,8 +6644,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
backout_unlocked:
if (new_folio && !new_pagecache_folio)
restore_reserve_on_error(h, vma, vmf->address, folio);
-
- folio_unlock(folio);
+ if (!anon_rmap)
+ folio_unlock(folio);
folio_put(folio);
goto out;
}
@@ -6805,21 +6810,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Fallthrough to CoW */
}
- /*
- * We need to lock the folio before calling hugetlb_wp().
- * Either the folio is in the pagecache and we need to copy it over
- * to another file, so it must remain stable throughout the operation,
- * or the folio is anonymous and we need to lock it in order to check
- * whether we can re-use it and mark it exclusive for this process.
- * The timespan for the lock differs depending on the type, since
- * anonymous folios only need to hold the lock while checking whether we
- * can re-use it, while we need to hold it throughout the copy in case
- * we are dealing with a folio from a pagecache.
- * Representing this difference would be tricky with the current code,
- * so just hold the lock for the duration of hugetlb_wp().
- */
folio = page_folio(pte_page(vmf.orig_pte));
- folio_lock(folio);
folio_get(folio);
if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
@@ -6835,7 +6826,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, vmf.address, vmf.pte);
out_put_page:
- folio_unlock(folio);
folio_put(folio);
out_ptl:
spin_unlock(vmf.ptl);
This should be patch#2 with something like "Sorting out the locking" as
title, and maybe explaining a bit more why we take the lock in hugetlb_wp()
for anonymous folios.
What do you think?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 10:03 ` Oscar Salvador
@ 2025-06-17 11:27 ` David Hildenbrand
2025-06-17 12:04 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-17 11:27 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
>
> I see, this makes a lot of sense, thanks for walking me through it, David!
Well, I hope the same logic applies to hugetlb :D
> Alright, then, with all this clear now we should:
>
> - Not take any locks on hugetlb_fault()->hugetlb_wp(), hugetlb_wp() will take it
> if it's an anonymous folio (re-use check)
> - Drop the lock in hugetlb_no_page() after we have mapped the page in
> the pagetables
> - hugetlb_wp() will take the lock IFF the folio is anonymous
>
> This will lead to something like the following:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dfa09fc3b2c6..4d48cda8a56d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
> * in scenarios that used to work. As a side effect, there can still
> * be leaks between processes, for example, with FOLL_GET users.
> */
> + if (folio_test_anon(old_folio))
> + folio_lock(old_folio);
If holding the PTL, this would not work. You'd have to unlock PTL, lock
folio, retake PTL, check pte_same.
> if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
> if (!PageAnonExclusive(&old_folio->page)) {
> folio_move_anon_rmap(old_folio, vma);
> @@ -6212,6 +6214,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
> }
> VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
> PageAnonExclusive(&old_folio->page), &old_folio->page);
> + if (folio_test_anon(old_folio))
> + folio_unlock(old_folio);
[...]
>
> This should be patch#2 with something like "Sorting out locking" per
> title, and maybe explaining a bit more why the lock in hugelb_wp for
> anonymous folios.
Jup.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 11:27 ` David Hildenbrand
@ 2025-06-17 12:04 ` Oscar Salvador
2025-06-17 12:08 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-17 12:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Tue, Jun 17, 2025 at 01:27:18PM +0200, David Hildenbrand wrote:
> > @@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
> > * in scenarios that used to work. As a side effect, there can still
> > * be leaks between processes, for example, with FOLL_GET users.
> > */
> > + if (folio_test_anon(old_folio))
> > + folio_lock(old_folio);
>
> If holding the PTL, this would not work. You'd have to unlock PTL, lock
> folio, retake PTL, check pte_same.
Why so?
hugetlb_no_page() has already checked pte_same under PTL, then mapped the page
and called hugetlb_wp().
hugetlb_no_page
vmf->ptl = huge_pte_lock()
pte_same
set_huge_pte_at
hugetlb_wp
and in hugetlb_wp() we're still holding the PTL.
Why do we have to release PTL in order to lock the folio?
This folio can't have been unmapped because we're holding PTL, right?
And it can't have been truncated for the same reason.
Is it because of some lock-order issue?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 12:04 ` Oscar Salvador
@ 2025-06-17 12:08 ` David Hildenbrand
2025-06-17 12:10 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2025-06-17 12:08 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On 17.06.25 14:04, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 01:27:18PM +0200, David Hildenbrand wrote:
>>> @@ -6198,6 +6198,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
>>> * in scenarios that used to work. As a side effect, there can still
>>> * be leaks between processes, for example, with FOLL_GET users.
>>> */
>>> + if (folio_test_anon(old_folio))
>>> + folio_lock(old_folio);
>>
>> If holding the PTL, this would not work. You'd have to unlock PTL, lock
>> folio, retake PTL, check pte_same.
>
> Why so?
>
> hugetlb_no_page() has already checked pte_same under PTL, then mapped the page
> and called hugetlb_wp().
>
> hugetlb_no_page
> vmf->ptl = huge_pte_lock()
> pte_same
> set_huge_pte_at
> hugetlb_wp
>
> and in hugetlb_wp() we're still holding the PTL.
> Why do we have to release PTL in order to lock the folio?
> This folio can't have been unmapped because we're holding PTL, right?
> And it can't have been truncated for the same reason.
>
> Is it because of some lock-order issue?
folio lock is a sleeping lock, PTL is a spinlock. :)
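Put differently (a minimal, hypothetical sketch -- not from any patch -- of why
the naive ordering cannot work):

	spin_lock(vmf->ptl);	/* spinlock held: sleeping is forbidden */
	folio_lock(old_folio);	/* BUG: may sleep while atomic */

	/* the only safe shape is to drop the PTL around the sleeping lock: */
	spin_unlock(vmf->ptl);
	folio_lock(old_folio);	/* may sleep; no spinlock held anymore */
	spin_lock(vmf->ptl);
	/* the PTE may have changed while the PTL was dropped: recheck pte_same() */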
--
Cheers,
David / dhildenb
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 12:08 ` David Hildenbrand
@ 2025-06-17 12:10 ` Oscar Salvador
2025-06-17 12:50 ` Oscar Salvador
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-17 12:10 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Tue, Jun 17, 2025 at 02:08:16PM +0200, David Hildenbrand wrote:
> folio lock is a sleeping lock, PTL is a spinlock. :)
Lol yes, overlooked that totally.
And I also saw the comment from mm/rmap.c about locking order.
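For reference, the relevant part of that ordering (paraphrased and abridged
from the comment at the top of mm/rmap.c, not verbatim):

	mm->mmap_lock
	  folio lock
	    mapping->i_mmap_rwsem
	      anon_vma->rwsem
	        mm->page_table_lock or pte_lock (PTL)

i.e. the folio lock nests outside the PTL, so it has to be taken first.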
--
Oscar Salvador
SUSE Labs
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 12:10 ` Oscar Salvador
@ 2025-06-17 12:50 ` Oscar Salvador
2025-06-17 13:42 ` David Hildenbrand
0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2025-06-17 12:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Tue, Jun 17, 2025 at 02:10:09PM +0200, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 02:08:16PM +0200, David Hildenbrand wrote:
> > folio lock is a sleeping lock, PTL is a spinlock. :)
>
> Lol yes, overlooked that totally.
> And I also saw the comment from mm/rmap.c about locking order.
So, we could do something like this:
if (folio_test_anon(old_folio)) {
spin_unlock(vmf->ptl);
folio_lock(old_folio);
spin_lock(vmf->ptl);
vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
if (unlikely(!vmf->pte ||
!pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) {
spin_unlock(vmf->ptl);
folio_unlock(old_folio);
goto out_take_lock;
}
if (folio_mapcount(old_folio) == 1) {
if (!PageAnonExclusive(&old_folio->page)) {
folio_move_anon_rmap(old_folio, vma);
SetPageAnonExclusive(&old_folio->page);
}
if (likely(!unshare))
set_huge_ptep_maybe_writable(vma, vmf->address,
vmf->pte);
spin_unlock(vmf->ptl);
folio_unlock(old_folio);
goto out_take_lock;
}
VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
PageAnonExclusive(&old_folio->page), &old_folio->page);
spin_unlock(vmf->ptl);
folio_unlock(old_folio);
spin_lock(vmf->ptl);
vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
if (unlikely(!vmf->pte ||
!pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte)))
return 0;
}
Hopefully we can do some refactoring here, because I quite dislike the
unlock-lock-retake-unlock dance.
--
Oscar Salvador
SUSE Labs
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 12:50 ` Oscar Salvador
@ 2025-06-17 13:42 ` David Hildenbrand
2025-06-17 14:00 ` Oscar Salvador
2025-06-19 11:52 ` Oscar Salvador
0 siblings, 2 replies; 26+ messages in thread
From: David Hildenbrand @ 2025-06-17 13:42 UTC (permalink / raw)
To: Oscar Salvador
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On 17.06.25 14:50, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 02:10:09PM +0200, Oscar Salvador wrote:
>> On Tue, Jun 17, 2025 at 02:08:16PM +0200, David Hildenbrand wrote:
>>> folio lock is a sleeping lock, PTL is a spinlock. :)
>>
>> Lol yes, overlooked that totally.
>> And I also saw the comment from mm/rmap.c about locking order.
>
> So, we could do something like this:
>
> if (folio_test_anon(old_folio)) {
> spin_unlock(vmf->ptl);
> folio_lock(old_folio);
> spin_lock(vmf->ptl);
> vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
> if (unlikely(!vmf->pte ||
> !pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte))) {
> spin_unlock(vmf->ptl);
> folio_unlock(old_folio);
> goto out_take_lock;
> }
>
> if (folio_mapcount(old_folio) == 1) {
> if (!PageAnonExclusive(&old_folio->page)) {
> folio_move_anon_rmap(old_folio, vma);
> SetPageAnonExclusive(&old_folio->page);
> }
> if (likely(!unshare))
> set_huge_ptep_maybe_writable(vma, vmf->address,
> vmf->pte);
>
> spin_unlock(vmf->ptl);
> folio_unlock(old_folio);
> goto out_take_lock;
> }
> VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
> PageAnonExclusive(&old_folio->page), &old_folio->page);
> spin_unlock(vmf->ptl);
> folio_unlock(old_folio);
> spin_lock(vmf->ptl);
> vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
> if (unlikely(!vmf->pte ||
> !pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte)))
> return 0;
> }
>
> Hopefully we can do some refactoring here, because I quite dislike the
> unlock-lock-retake-unlock dance.
Yes. As an alternative, keep locking it in the caller and only unlock in
the !anon case?
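Something like the following sketch, assuming hugetlb_no_page() keeps taking
the folio lock up front (the real version is in the patch further down the
thread):

	/* hugetlb_no_page(): folio is already locked at this point */
	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
		/* file folios do not need the lock for hugetlb_wp() */
		if (!folio_test_anon(folio))
			folio_unlock(folio);
		ret = hugetlb_wp(vmf);
	}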
--
Cheers,
David / dhildenb
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 13:42 ` David Hildenbrand
@ 2025-06-17 14:00 ` Oscar Salvador
2025-06-19 11:52 ` Oscar Salvador
1 sibling, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-17 14:00 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Tue, Jun 17, 2025 at 03:42:09PM +0200, David Hildenbrand wrote:
> Yes. As an alternative, keep locking it in the caller and only unlock in the
> !anon case?
Yes, that crossed my mind too, and I think that would be the cleanest way.
Since hugetlb_no_page() will be the only one taking the lock, we can drop it
there for the !anon case, and put a fat comment (maybe not so fat :-)?)
explaining the deal.
Thanks for the insights, David!
--
Oscar Salvador
SUSE Labs
* Re: [PATCH 2/5] mm,hugetlb: Document the reason to lock the folio in the faulting path
2025-06-17 13:42 ` David Hildenbrand
2025-06-17 14:00 ` Oscar Salvador
@ 2025-06-19 11:52 ` Oscar Salvador
1 sibling, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2025-06-19 11:52 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Muchun Song, James Houghton, Peter Xu, Gavin Guo,
linux-mm, linux-kernel
On Tue, Jun 17, 2025 at 03:42:09PM +0200, David Hildenbrand wrote:
> Yes. As an alternative, keep locking it in the caller and only unlock in the
> !anon case?
This is what I came up with:
What do you think?
I just made sure that all hugetlb-LTP tests pass fine (after I fixed an
obvious mistake :-S)
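For the record, this was a standard LTP checkout, run with something along
the lines of (exact paths/flags may differ per setup):

	./runltp -f hugetlb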
From 3a0c53a00511abdcf5df53491bbb9295973f24f9 Mon Sep 17 00:00:00 2001
From: Oscar Salvador <osalvador@suse.de>
Date: Wed, 11 Jun 2025 10:05:34 +0200
Subject: [PATCH] mm,hugetlb: Sort out folio locking in the faulting path
Recent conversations showed that there was a misunderstanding about why we
were locking the folio prior to calling hugetlb_wp().
In fact, as soon as we have the folio mapped into the pagetables, we no longer
need to hold it locked, because we know that no concurrent truncation could have
happened.
There is only one case where the folio needs to be locked, and that is when we
are handling an anonymous folio, because hugetlb_wp() will check whether it can
re-use it exclusively for the process that is faulting it in.
So, pass the folio locked to hugetlb_wp() when that is the case.
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
mm/hugetlb.c | 43 +++++++++++++++++++++++++++++++++----------
1 file changed, 33 insertions(+), 10 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 175edafeec67..04049d0fb70d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6437,6 +6437,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
pte_t new_pte;
bool new_folio, new_pagecache_folio = false;
u32 hash = hugetlb_fault_mutex_hash(mapping, vmf->pgoff);
+ bool folio_locked = true;
/*
* Currently, we are forced to kill the process in the event the
@@ -6602,6 +6603,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
hugetlb_count_add(pages_per_huge_page(h), mm);
if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ /* No need to lock file folios. See comment in hugetlb_fault() */
+ if (!anon_rmap) {
+ folio_locked = false;
+ folio_unlock(folio);
+ }
/* Optimization, do the COW without a second fault */
ret = hugetlb_wp(vmf);
}
@@ -6616,7 +6622,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
if (new_folio)
folio_set_hugetlb_migratable(folio);
- folio_unlock(folio);
+ if (folio_locked)
+ folio_unlock(folio);
out:
hugetlb_vma_unlock_read(vma);
@@ -6636,7 +6643,8 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
if (new_folio && !new_pagecache_folio)
restore_reserve_on_error(h, vma, vmf->address, folio);
- folio_unlock(folio);
+ if (folio_locked)
+ folio_unlock(folio);
folio_put(folio);
goto out;
}
@@ -6670,7 +6678,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
vm_fault_t ret;
u32 hash;
- struct folio *folio;
+ struct folio *folio = NULL;
struct hstate *h = hstate_vma(vma);
struct address_space *mapping;
struct vm_fault vmf = {
@@ -6687,6 +6695,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* be hard to debug if called functions make assumptions
*/
};
+ bool folio_locked = false;
/*
* Serialize hugepage allocation and instantiation, so that we don't
@@ -6801,13 +6810,24 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
/* Fallthrough to CoW */
}
- /* hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) */
- folio = page_folio(pte_page(vmf.orig_pte));
- folio_lock(folio);
- folio_get(folio);
-
if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
if (!huge_pte_write(vmf.orig_pte)) {
+ /*
+ * Anonymous folios need to be locked, since hugetlb_wp()
+ * checks whether we can re-use the folio exclusively
+ * in case we are the only user.
+ */
+ folio = page_folio(pte_page(vmf.orig_pte));
+ folio_get(folio);
+ if (folio_test_anon(folio)) {
+ spin_unlock(vmf.ptl);
+ folio_lock(folio);
+ folio_locked = true;
+ spin_lock(vmf.ptl);
+ if (unlikely(!pte_same(vmf.orig_pte, huge_ptep_get(mm,
+ vmf.address, vmf.pte))))
+ goto out_put_page;
+ }
ret = hugetlb_wp(&vmf);
goto out_put_page;
} else if (likely(flags & FAULT_FLAG_WRITE)) {
@@ -6819,8 +6839,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
flags & FAULT_FLAG_WRITE))
update_mmu_cache(vma, vmf.address, vmf.pte);
out_put_page:
- folio_unlock(folio);
- folio_put(folio);
+ if (folio) {
+ if (folio_locked)
+ folio_unlock(folio);
+ folio_put(folio);
+ }
out_ptl:
spin_unlock(vmf.ptl);
out_mutex:
--
2.49.0
--
Oscar Salvador
SUSE Labs