linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] Clean up locking in hugetlb faulting code
@ 2025-06-02 14:16 Oscar Salvador
  2025-06-02 14:16 ` [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp Oscar Salvador
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-06-02 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu,
	Gavin Guo, linux-mm, linux-kernel, Oscar Salvador

Hi all,

This RFC is the culmination of the discussion that happened in [1].
TLDR: No one really knew what the locks were protecting us against, and
whether we needed them at all.

Some research showed that most of them were introduced at a time when
truncation was not serialized with the mutex, as it is today, so we were
relying on the lock to keep the page from going away from the pagecache.
More details can be found in patch#1.

That covers the locks, but I also started to look at the references
we take in hugetlb_fault and hugetlb_wp, as it seems to me we take
more than we actually need; that will come once we manage to sort this out.

I ran hugetlb LTP tests and nothing screamed, and I also plan to run selftests
later on.

@Gavin: Could you please run your syzkaller with this patchset applied and
see whether you can trigger anything?

Special thanks to David and Peter Xu, who have been helping out with this mess.

[1] https://lore.kernel.org/linux-mm/aDeBUXCRLRZobHq0@localhost.localdomain/T/#md02880ebc2c679678b7f326c5e9e93992428e124

Oscar Salvador (3):
  mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  mm, hugetlb: Update comments in hugetlb_fault
  mm, hugetlb: Drop unlikelys from hugetlb_fault

 include/linux/hugetlb.h |  12 +++++
 mm/hugetlb.c            | 117 +++++++++++++++++-----------------------
 2 files changed, 62 insertions(+), 67 deletions(-)

-- 
2.49.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-02 14:16 [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Oscar Salvador
@ 2025-06-02 14:16 ` Oscar Salvador
  2025-06-02 15:14   ` Peter Xu
  2025-06-02 14:16 ` [RFC PATCH 2/3] mm, hugetlb: Update comments in hugetlb_fault Oscar Salvador
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2025-06-02 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu,
	Gavin Guo, linux-mm, linux-kernel, Oscar Salvador

Recent events surfaced the fact that it is not clear why we are taking
certain locks in the hugetlb faulting path.
More specifically: pagecache_folio's lock and the folio lock in hugetlb_fault,
and the folio lock in hugetlb_no_page.

When digging into the history, I saw that those locks were taken to protect
us against truncation, which back then was not serialized with the
hugetlb_fault_mutex as it is today.

For example, the lock in hugetlb_no_page, looking at the comment above:

 /*
  * Use page lock to guard against racing truncation
  * before we get page_table_lock.
  */
 new_folio = false;
 folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);

which was added by 'commit 4c8872659772 ("[PATCH] hugetlb: demand fault handler")'.
Back when that commit was added (2005), truncate_hugepages looked something like:

 truncate_hugepages
  lock_page(page)
   truncate_huge_page(page) <- it also removed it from the pagecache
  unlock_page(page)

While today we have

 remove_inode_hugepages
  mutex_lock(&hugetlb_fault_mutex_table[hash])
   remove_inode_single_folio
    folio_lock(folio)
     hugetlb_delete_from_page_cache
    folio_unlock
  mutex_unlock(&hugetlb_fault_mutex_table[hash])

And the same happened with the lock for pagecache_folio in hugetlb_fault(),
which was introduced in 'commit 04f2cbe35699 ("hugetlb: guarantee that COW
faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed")',
when we did not have serialization against truncation yet.

The only worrisome part of dropping the locks that I could see is when
cow_from_owner is true and we cannot allocate another hugetlb page, because then
we drop all the locks, try to unmap the page from the other processes, and then
re-take the locks.

        hash = hugetlb_fault_mutex_hash(mapping, idx);
        hugetlb_vma_unlock_read(vma);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);

        unmap_ref_private(mm, vma, &old_folio->page,
                        vmf->address);

        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        hugetlb_vma_lock_read(vma);
        spin_lock(vmf->ptl);

So, there is a small window where we are not holding the lock.

In this window, someone might have truncated the file (aka pagecache_folio)
and called hugetlb_unreserve_pages().
But I do not think it matters, for the following reasons:
1) We only retry in case the pte has not changed, which means that old_folio
   is still old_folio (see the sketch below).
2) And if the original file got truncated in that window and reserves were
   adjusted, alloc_hugetlb_folio() will catch this under the lock when we
   retry again.
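
For reference, this is roughly what the retry path in hugetlb_wp() looks like
once the locks are re-taken (paraphrased, not the literal code):

        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        hugetlb_vma_lock_read(vma);
        spin_lock(vmf->ptl);
        vmf->pte = hugetlb_walk(vma, vmf->address, huge_page_size(h));
        if (likely(vmf->pte &&
                   pte_same(huge_ptep_get(mm, vmf->address, vmf->pte), pte)))
                /* pte unchanged: old_folio is still mapped, retry the CoW */
                goto retry_avoidcopy;
        /* otherwise someone else resolved the fault for us, nothing to do */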

Another concern was brought up by James Houghton about the UFFDIO_CONTINUE
case, and whether we could end up mapping a hugetlb page which has not been
zeroed yet.
But mfill_atomic_hugetlb(), which calls hugetlb_mfill_atomic_pte(), holds the
mutex throughout the operation, so we cannot race with truncation/instantiation
either.

E.g: hugetlbfs_fallocate() will allocate the new hugetlb folio and zero it under
the mutex.
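
For completeness, this is roughly the serialization done in
mfill_atomic_hugetlb() (paraphrased from memory, so the argument list may
differ slightly):

        idx = linear_page_index(dst_vma, dst_addr);
        mapping = dst_vma->vm_file->f_mapping;
        hash = hugetlb_fault_mutex_hash(mapping, idx);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        hugetlb_vma_lock_read(dst_vma);

        /* look up/allocate dst_pte and fill it in */
        err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr,
                                       src_addr, flags, &folio);

        hugetlb_vma_unlock_read(dst_vma);
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);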

So, there should be no races.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/hugetlb.h | 12 +++++
 mm/hugetlb.c            | 98 ++++++++++++++++++-----------------------
 2 files changed, 55 insertions(+), 55 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 42f374e828a2..a176724e1204 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -811,6 +811,12 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h)
 	return huge_page_size(h) / 512;
 }
 
+static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
+				struct address_space *mapping, pgoff_t idx)
+{
+	return filemap_get_folio(mapping, idx << huge_page_order(h));
+}
+
 static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
 				struct address_space *mapping, pgoff_t idx)
 {
@@ -1088,6 +1094,12 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio
 	return NULL;
 }
 
+static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
+				struct address_space *mapping, pgoff_t idx)
+{
+	return NULL;
+}
+
 static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
 				struct address_space *mapping, pgoff_t idx)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8746ed2fec13..f7bef660ef94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6146,25 +6146,28 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
 	i_mmap_unlock_write(mapping);
 }
 
+enum cow_context {
+	HUGETLB_FAULT_CONTEXT,
+	HUGETLB_NO_PAGE_CONTEXT,
+};
+
 /*
- * hugetlb_wp() should be called with page lock of the original hugepage held.
  * Called with hugetlb_fault_mutex_table held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
-static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
-		       struct vm_fault *vmf)
+static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
 	struct hstate *h = hstate_vma(vma);
-	struct folio *old_folio;
-	struct folio *new_folio;
 	bool cow_from_owner = 0;
 	vm_fault_t ret = 0;
 	struct mmu_notifier_range range;
+	struct folio *old_folio, *new_folio, *pagecache_folio;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 
 	/*
 	 * Never handle CoW for uffd-wp protected pages.  It should be only
@@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 	 * we run out of free hugetlb folios: we would have to kill processes
 	 * in scenarios that used to work. As a side effect, there can still
 	 * be leaks between processes, for example, with FOLL_GET users.
+	 *
+	 * We need to take the lock here because the folio might be undergoing a
+	 * migration. e.g: see folio_try_share_anon_rmap_pte.
 	 */
+	folio_lock(old_folio);
 	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
 		if (!PageAnonExclusive(&old_folio->page)) {
 			folio_move_anon_rmap(old_folio, vma);
@@ -6209,22 +6216,44 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 						     vmf->pte);
 
 		delayacct_wpcopy_end();
+		folio_unlock(old_folio);
 		return 0;
 	}
 	VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
 		       PageAnonExclusive(&old_folio->page), &old_folio->page);
+	folio_unlock(old_folio);
+
 
+	/*
+	 * We can be called from two different contexts: hugetlb_no_page or
+	 * hugetlb_fault.
+	 * When called from the latter, we try to find the original folio in
+	 * the pagecache and compare it against the current folio we have mapped
+	 * in the pagetables. If it differs, it means that this process already
+	 * CoWed and mapped the folio privately, so we know that a reservation
+	 * was already consumed.
+	 * This will be later used to determine whether we need to unmap the folio
+	 * from other processes in case we fail to allocate a new folio.
+	 */
+	if (context == HUGETLB_FAULT_CONTEXT) {
+		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
+		if (IS_ERR(pagecache_folio))
+			pagecache_folio = NULL;
+		else
+			folio_put(pagecache_folio);
+	}
 	/*
 	 * If the process that created a MAP_PRIVATE mapping is about to
 	 * perform a COW due to a shared page count, attempt to satisfy
 	 * the allocation without using the existing reserves. The pagecache
-	 * page is used to determine if the reserve at this address was
+	 * folio is used to determine if the reserve at this address was already
 	 * consumed or not. If reserves were used, a partial faulted mapping
 	 * at the time of fork() could consume its reserves on COW instead
 	 * of the full address range.
 	 */
-	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
-			old_folio != pagecache_folio)
+	if (context == HUGETLB_FAULT_CONTEXT &&
+	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
+	    old_folio != pagecache_folio)
 		cow_from_owner = true;
 
 	folio_get(old_folio);
@@ -6245,7 +6274,6 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
 		 * may get SIGKILLed if it later faults.
 		 */
 		if (cow_from_owner) {
-			struct address_space *mapping = vma->vm_file->f_mapping;
 			pgoff_t idx;
 			u32 hash;
 
@@ -6451,11 +6479,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	}
 
 	/*
-	 * Use page lock to guard against racing truncation
-	 * before we get page_table_lock.
+	 * hugetlb_fault_mutex_table guards us against truncation,
+	 * so we do not need to take locks on the folio.
 	 */
 	new_folio = false;
-	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
+	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
 	if (IS_ERR(folio)) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		if (vmf->pgoff >= size)
@@ -6521,6 +6549,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 		if (vma->vm_flags & VM_MAYSHARE) {
 			int err = hugetlb_add_to_page_cache(folio, mapping,
 							vmf->pgoff);
+			folio_unlock(folio);
 			if (err) {
 				/*
 				 * err can't be -EEXIST which implies someone
@@ -6537,7 +6566,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 			}
 			new_pagecache_folio = true;
 		} else {
-			folio_lock(folio);
 			anon_rmap = 1;
 		}
 	} else {
@@ -6603,7 +6631,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	hugetlb_count_add(pages_per_huge_page(h), mm);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
 		/* Optimization, do the COW without a second fault */
-		ret = hugetlb_wp(folio, vmf);
+		ret = hugetlb_wp(vmf, HUGETLB_NO_PAGE_CONTEXT);
 	}
 
 	spin_unlock(vmf->ptl);
@@ -6615,8 +6643,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 */
 	if (new_folio)
 		folio_set_hugetlb_migratable(folio);
-
-	folio_unlock(folio);
 out:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6636,7 +6662,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	if (new_folio && !new_pagecache_folio)
 		restore_reserve_on_error(h, vma, vmf->address, folio);
 
-	folio_unlock(folio);
 	folio_put(folio);
 	goto out;
 }
@@ -6671,7 +6696,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	vm_fault_t ret;
 	u32 hash;
 	struct folio *folio = NULL;
-	struct folio *pagecache_folio = NULL;
 	struct hstate *h = hstate_vma(vma);
 	struct address_space *mapping;
 	int need_wait_lock = 0;
@@ -6780,11 +6804,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		/* Just decrements count, does not deallocate */
 		vma_end_reservation(h, vma, vmf.address);
-
-		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
-							     vmf.pgoff);
-		if (IS_ERR(pagecache_folio))
-			pagecache_folio = NULL;
 	}
 
 	vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
@@ -6798,10 +6817,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
 		if (!userfaultfd_wp_async(vma)) {
 			spin_unlock(vmf.ptl);
-			if (pagecache_folio) {
-				folio_unlock(pagecache_folio);
-				folio_put(pagecache_folio);
-			}
 			hugetlb_vma_unlock_read(vma);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			return handle_userfault(&vmf, VM_UFFD_WP);
@@ -6813,23 +6828,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* Fallthrough to CoW */
 	}
 
-	/*
-	 * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
-	 * pagecache_folio, so here we need take the former one
-	 * when folio != pagecache_folio or !pagecache_folio.
-	 */
 	folio = page_folio(pte_page(vmf.orig_pte));
-	if (folio != pagecache_folio)
-		if (!folio_trylock(folio)) {
-			need_wait_lock = 1;
-			goto out_ptl;
-		}
-
 	folio_get(folio);
 
 	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
 		if (!huge_pte_write(vmf.orig_pte)) {
-			ret = hugetlb_wp(pagecache_folio, &vmf);
+			ret = hugetlb_wp(&vmf, HUGETLB_FAULT_CONTEXT);
 			goto out_put_page;
 		} else if (likely(flags & FAULT_FLAG_WRITE)) {
 			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
@@ -6840,16 +6844,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 						flags & FAULT_FLAG_WRITE))
 		update_mmu_cache(vma, vmf.address, vmf.pte);
 out_put_page:
-	if (folio != pagecache_folio)
-		folio_unlock(folio);
 	folio_put(folio);
 out_ptl:
 	spin_unlock(vmf.ptl);
-
-	if (pagecache_folio) {
-		folio_unlock(pagecache_folio);
-		folio_put(pagecache_folio);
-	}
 out_mutex:
 	hugetlb_vma_unlock_read(vma);
 
@@ -6861,15 +6858,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		vma_end_read(vma);
 
 	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-	/*
-	 * Generally it's safe to hold refcount during waiting page lock. But
-	 * here we just wait to defer the next page fault to avoid busy loop and
-	 * the page is not used after unlocked before returning from the current
-	 * page fault. So we are safe from accessing freed page, even if we wait
-	 * here without taking refcount.
-	 */
-	if (need_wait_lock)
-		folio_wait_locked(folio);
 	return ret;
 }
 
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 2/3] mm, hugetlb: Update comments in hugetlb_fault
  2025-06-02 14:16 [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Oscar Salvador
  2025-06-02 14:16 ` [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp Oscar Salvador
@ 2025-06-02 14:16 ` Oscar Salvador
  2025-06-02 14:16 ` [RFC PATCH 3/3] mm, hugetlb: Drop unlikelys from hugetlb_fault Oscar Salvador
  2025-06-16  3:21 ` [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Gavin Guo
  3 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-06-02 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu,
	Gavin Guo, linux-mm, linux-kernel, Oscar Salvador

There are some comments in the hugetlb faulting path that no longer hold
because the code has changed.
The most prominent is this one:

 /*
  * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
  * point, so this check prevents the kernel from going below assuming
  * that we have an active hugepage in pagecache. This goto expects
  * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
  * check will properly handle it.
  */

This was written because back in the day we used to do:

 hugetlb_fault
  ptep = huge_pte_offset(...)
  if (ptep) {
    entry = huge_ptep_get(ptep)
    if (unlikely(is_hugetlb_entry_migration(entry))
        ...
    else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
        ...

  ...
  ...

 /*
  * entry could be a migration/hwpoison entry at this point, so this
  * check prevents the kernel from going below assuming that we have
  * a active hugepage in pagecache. This goto expects the 2nd page fault,
  * and is_hugetlb_entry_(migration|hwpoisoned) check will properly
  * handle it.
  */
 if (!pte_present(entry))
         goto out_mutex;
 ...

The code was designed to check for hwpoisoned/migration entries upfront
and then bail out if, further down, the pte was no longer present,
relying on the second fault so that the checks at the beginning would
handle it for real this time.

That flow has changed, so this comment no longer holds.
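
Today the flow is, roughly (paraphrased, not the literal code):

        vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte);
        if (!pte_present(vmf.orig_pte)) {
                if (is_hugetlb_entry_migration(vmf.orig_pte)) {
                        /* drop the fault mutex and wait for the migration */
                        ...
                        return 0;
                } else if (is_hugetlb_entry_hwpoisoned(vmf.orig_pte))
                        ret = VM_FAULT_HWPOISON_LARGE |
                              VM_FAULT_SET_HINDEX(hstate_index(h));
                goto out_mutex;
        }

i.e. migration/hwpoison entries are handled right where the pte is read,
without relying on a second fault.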
Also, mention in hugetlb_fault() that besides allocation and instantiation,
the mutex also serializes against truncation.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f7bef660ef94..6ef90958839f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6715,9 +6715,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	};
 
 	/*
-	 * Serialize hugepage allocation and instantiation, so that we don't
-	 * get spurious allocation failures if two CPUs race to instantiate
-	 * the same page in the page cache.
+	 * hugetlb_fault_mutex_hash serializes allocation, instantiation and
+	 * truncation.
 	 */
 	mapping = vma->vm_file->f_mapping;
 	hash = hugetlb_fault_mutex_hash(mapping, vmf.pgoff);
@@ -6765,11 +6764,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	ret = 0;
 
 	/*
-	 * vmf.orig_pte could be a migration/hwpoison vmf.orig_pte at this
-	 * point, so this check prevents the kernel from going below assuming
-	 * that we have an active hugepage in pagecache. This goto expects
-	 * the 2nd page fault, and is_hugetlb_entry_(migration|hwpoisoned)
-	 * check will properly handle it.
+	 * If it is not present, it means we are dealing either with a migration
+	 * or hwpoisoned entry.
 	 */
 	if (!pte_present(vmf.orig_pte)) {
 		if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
@@ -6793,8 +6789,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * If we are going to COW/unshare the mapping later, we examine the
 	 * pending reservations for this page now. This will ensure that any
 	 * allocations necessary to record that reservation occur outside the
-	 * spinlock. Also lookup the pagecache page now as it is used to
-	 * determine if a reservation has been consumed.
+	 * spinlock.
 	 */
 	if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
 	    !(vma->vm_flags & VM_MAYSHARE) && !huge_pte_write(vmf.orig_pte)) {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [RFC PATCH 3/3] mm, hugetlb: Drop unlikelys from hugetlb_fault
  2025-06-02 14:16 [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Oscar Salvador
  2025-06-02 14:16 ` [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp Oscar Salvador
  2025-06-02 14:16 ` [RFC PATCH 2/3] mm, hugetlb: Update comments in hugetlb_fault Oscar Salvador
@ 2025-06-02 14:16 ` Oscar Salvador
  2025-06-16  3:21 ` [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Gavin Guo
  3 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-06-02 14:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu,
	Gavin Guo, linux-mm, linux-kernel, Oscar Salvador

The unlikelys predate an era when we were checking for
hwpoisoned/migration entries without knowing whether the pte was
present.

The code-flow has changed and we do not do that anymore; besides, the
unlikelys do not bring any advantage at all.
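
For reference, unlikely() is just a static branch-prediction hint for the
compiler, roughly (simplified from include/linux/compiler.h):

 #define likely(x)	__builtin_expect(!!(x), 1)
 #define unlikely(x)	__builtin_expect(!!(x), 0)

so dropping it does not change behaviour; it only removes a hint that no
longer reflects how the code is structured.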

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ef90958839f..02ede63e7e5e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6768,7 +6768,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * or hwpoisoned entry.
 	 */
 	if (!pte_present(vmf.orig_pte)) {
-		if (unlikely(is_hugetlb_entry_migration(vmf.orig_pte))) {
+		if (is_hugetlb_entry_migration(vmf.orig_pte)) {
 			/*
 			 * Release the hugetlb fault lock now, but retain
 			 * the vma lock, because it is needed to guard the
@@ -6779,7 +6779,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			migration_entry_wait_huge(vma, vmf.address, vmf.pte);
 			return 0;
-		} else if (unlikely(is_hugetlb_entry_hwpoisoned(vmf.orig_pte)))
+		} else if (is_hugetlb_entry_hwpoisoned(vmf.orig_pte))
 			ret = VM_FAULT_HWPOISON_LARGE |
 			    VM_FAULT_SET_HINDEX(hstate_index(h));
 		goto out_mutex;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-02 14:16 ` [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp Oscar Salvador
@ 2025-06-02 15:14   ` Peter Xu
  2025-06-02 20:47     ` Oscar Salvador
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2025-06-02 15:14 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Mon, Jun 02, 2025 at 04:16:08PM +0200, Oscar Salvador wrote:
> Recent events surfaced the fact that it is not clear why are we taking
> certain locks in the hugetlb faulting path.
> More specifically pagecache_folio's lock and folio lock in hugetlb_fault,
> and folio lock in hugetlb_no_page.
> 
> When digging in the history, I saw that those locks were taken to protect
> us against truncation, which back then was not serialized with the
> hugetlb_fault_mutex as it is today.
> 
> For example, the lock in hugetlb_no_page, looking at the comment above:
> 
>  /*
>   * Use page lock to guard against racing truncation
>   * before we get page_table_lock.
>   */
>  new_folio = false;
>  folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> 
> which was added by 'commit 4c8872659772 ("[PATCH] hugetlb: demand fault handler")'.
> Back when that commit was added (2005), truncate_hugepages looked something like:
> 
>  truncate_hugepages
>   lock_page(page)
>    truncate_huge_page(page) <- it also removed it from the pagecache
>   unlock_page(page)
> 
> While today we have
> 
>  remove_inode_hugepages
>   mutex_lock(&hugetlb_fault_mutex_table[hash])
>    remove_inode_single_folio
>     folio_lock(folio)
>      hugetlb_delete_from_page_cache
>     folio_unlock
>   mutex_unlock(&hugetlb_fault_mutex_table[hash])
> 
> And the same happened with the lock for pagecache_folio in hugetlb_fault(),
> which was introduced in 'commit 04f2cbe35699 ("hugetlb: guarantee that COW
> faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed")',
> when we did not have serialization against truncation yet.
> 
> The only worrisome part of dropping the locks that I considered is when
> cow_from_owner is true and we cannot allocate another hugetlb page, because then
> we drop all the locks, try to unmap the page from the other processes, and then
> we re-take the locks again.
> 
>         hash = hugetlb_fault_mutex_hash(mapping, idx);
>         hugetlb_vma_unlock_read(vma);
>         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> 
>         unmap_ref_private(mm, vma, &old_folio->page,
>                         vmf->address);
> 
>         mutex_lock(&hugetlb_fault_mutex_table[hash]);
>         hugetlb_vma_lock_read(vma);
>         spin_lock(vmf->ptl);
> 
> So, there is a small window where we are not holding the lock.
> 
> In this window, someone might have truncated the file (aka pagecache_folio),
> and call hugetlb_unreserve_pages().
> But I do not think it matters for the following reasons
> 1) we only retry in case the pte has not changed, which means that old_folio
>    still is old_folio.
> 2) And if the original file got truncated in that window and reserves were
>    adjusted, alloc_hugetlb_folio() will catch this under the lock when we
>    retry again.
> 
> Another concern was brought up by James Houghton, about UFFDIO_CONTINUE
> case, and whether we could end up mapping a hugetlb page which has not been
> zeroed yet.
> But mfill_atomic_hugetlb(), which calls hugetlb_mfill_atomic_pte(), holds the
> mutex throughout the operation, so we cannot race with truncation/instantiation
> either.
> 
> E.g: hugetlbfs_fallocate() will allocate the new hugetlb folio and zero it under
> the mutex.

It makes me feel nervous when we move more things over to the fault mutex - I
had a feeling it's being abused.

IIRC the fault mutex was initially introduced only to solve one problem on
concurrent allocations.  I confess I didn't follow or yet dig into all
history, though.  From that POV, mfill_atomic_hugetlb() is indeed
reasonable to still take it because it's user-guided fault injections.  I'm
not yet sure about other places, e.g., for truncations.

Meanwhile, IIUC this patch can at least be split into more than one, in
which case the 1st patch should only change the behavior of
pagecache_folio, rather than the rest - if we really want to move more
things over to fault mutex, we can do that in the 2nd+ patches.

I'd suggest we stick with fixing the pagecache_folio issue first, which can be
much easier and looks like a lock ordering issue only (while we can still
resolve it by removing one lock, but only for pagecache_folio, not the rest yet).

> 
> So, there should be no races.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/hugetlb.h | 12 +++++
>  mm/hugetlb.c            | 98 ++++++++++++++++++-----------------------
>  2 files changed, 55 insertions(+), 55 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 42f374e828a2..a176724e1204 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -811,6 +811,12 @@ static inline unsigned int blocks_per_huge_page(struct hstate *h)
>  	return huge_page_size(h) / 512;
>  }
>  
> +static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
> +				struct address_space *mapping, pgoff_t idx)
> +{
> +	return filemap_get_folio(mapping, idx << huge_page_order(h));
> +}
> +
>  static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
>  				struct address_space *mapping, pgoff_t idx)
>  {
> @@ -1088,6 +1094,12 @@ static inline struct hugepage_subpool *hugetlb_folio_subpool(struct folio *folio
>  	return NULL;
>  }
>  
> +static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
> +				struct address_space *mapping, pgoff_t idx)
> +{
> +	return NULL;
> +}
> +
>  static inline struct folio *filemap_lock_hugetlb_folio(struct hstate *h,
>  				struct address_space *mapping, pgoff_t idx)
>  {
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8746ed2fec13..f7bef660ef94 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6146,25 +6146,28 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
>  	i_mmap_unlock_write(mapping);
>  }
>  
> +enum cow_context {
> +	HUGETLB_FAULT_CONTEXT,
> +	HUGETLB_NO_PAGE_CONTEXT,
> +};

I'm not sure this is required; it looks like it's currently needed only
because we want to detect whether the pagecache folio matched.

More below.

> +
>  /*
> - * hugetlb_wp() should be called with page lock of the original hugepage held.
>   * Called with hugetlb_fault_mutex_table held and pte_page locked so we
>   * cannot race with other handlers or page migration.
>   * Keep the pte_same checks anyway to make transition from the mutex easier.
>   */
> -static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> -		       struct vm_fault *vmf)
> +static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct mm_struct *mm = vma->vm_mm;
>  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
>  	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
>  	struct hstate *h = hstate_vma(vma);
> -	struct folio *old_folio;
> -	struct folio *new_folio;
>  	bool cow_from_owner = 0;
>  	vm_fault_t ret = 0;
>  	struct mmu_notifier_range range;
> +	struct folio *old_folio, *new_folio, *pagecache_folio;
> +	struct address_space *mapping = vma->vm_file->f_mapping;
>  
>  	/*
>  	 * Never handle CoW for uffd-wp protected pages.  It should be only
> @@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  	 * we run out of free hugetlb folios: we would have to kill processes
>  	 * in scenarios that used to work. As a side effect, there can still
>  	 * be leaks between processes, for example, with FOLL_GET users.
> +	 *
> +	 * We need to take the lock here because the folio might be undergoing a
> +	 * migration. e.g: see folio_try_share_anon_rmap_pte.
>  	 */

I agree we'd better keep the old_folio locked as of now, to be further
discussed, but I'm not 100% sure about the comment - doesn't the migration
path still need the pgtable lock to modify mapcounts?  I think we hold the
pgtable lock here.

> +	folio_lock(old_folio);
>  	if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
>  		if (!PageAnonExclusive(&old_folio->page)) {
>  			folio_move_anon_rmap(old_folio, vma);
> @@ -6209,22 +6216,44 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  						     vmf->pte);
>  
>  		delayacct_wpcopy_end();
> +		folio_unlock(old_folio);
>  		return 0;
>  	}
>  	VM_BUG_ON_PAGE(folio_test_anon(old_folio) &&
>  		       PageAnonExclusive(&old_folio->page), &old_folio->page);
> +	folio_unlock(old_folio);
> +
>  
> +	/*
> +	 * We can be called from two different contexts: hugetlb_no_page or
> +	 * hugetlb_fault.
> +	 * When called from the latter, we try to find the original folio in
> +	 * the pagecache and compare it against the current folio we have mapped
> +	 * in the pagetables. If it differs, it means that this process already
> +	 * CoWed and mapped the folio privately, so we know that a reservation
> +	 * was already consumed.
> +	 * This will be latter used to determine whether we need to unmap the folio
> +	 * from other processes in case we fail to allocate a new folio.
> +	 */
> +	if (context == HUGETLB_FAULT_CONTEXT) {
> +		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> +		if (IS_ERR(pagecache_folio))
> +			pagecache_folio = NULL;
> +		else
> +			folio_put(pagecache_folio);

So here we released the refcount but then we're referencing the pointer
below..  I don't know whether this is wise, looks like it's prone to
races..  If we want, we can compare the folios before releasing the
refcount.  Said that,...

> +	}
>  	/*
>  	 * If the process that created a MAP_PRIVATE mapping is about to
>  	 * perform a COW due to a shared page count, attempt to satisfy
>  	 * the allocation without using the existing reserves. The pagecache
> -	 * page is used to determine if the reserve at this address was
> +	 * folio is used to determine if the reserve at this address was already
>  	 * consumed or not. If reserves were used, a partial faulted mapping
>  	 * at the time of fork() could consume its reserves on COW instead
>  	 * of the full address range.
>  	 */
> -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> -			old_folio != pagecache_folio)
> +	if (context == HUGETLB_FAULT_CONTEXT &&
> +	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> +	    old_folio != pagecache_folio)

.. here I am actually thinking whether we can use folio_test_anon() and
completely avoid looking up the page cache.  Here, the ultimate goal is
trying to know whether this is a CoW on top of a private page.  Then IIUC
folio_test_anon(old_folio) is enough.

>  		cow_from_owner = true;
>  
>  	folio_get(old_folio);
> @@ -6245,7 +6274,6 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
>  		 * may get SIGKILLed if it later faults.
>  		 */
>  		if (cow_from_owner) {
> -			struct address_space *mapping = vma->vm_file->f_mapping;
>  			pgoff_t idx;
>  			u32 hash;
>  
> @@ -6451,11 +6479,11 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  	}
>  
>  	/*
> -	 * Use page lock to guard against racing truncation
> -	 * before we get page_table_lock.
> +	 * hugetlb_fault_mutex_table guards us against truncation,
> +	 * so we do not need to take locks on the folio.
>  	 */
>  	new_folio = false;
> -	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> +	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);

So this is part of the changes that I mentioned previously, that we should
avoid doing in one patch; actually I really think we should think twice on
using fault mutex explicitly for more things.  Actually I tend to go the
other way round: I used to think whether we can avoid some fault mutex in
some paths.  E.g. for UFFDIO_COPY it still makes sense to take fault mutex
because it faces similar challenge of concurrent allocations. However I'm
not sure about the truncate/hole-punch paths.  IIUC most file folios use
invalidate_lock for that, and I'd rather go the other way and use the same
lock in hugetlb code, keeping the fault mutex used as little as possible, so
that the semantics of each lock and what it protects are much clearer.
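
To illustrate what I mean - a rough sketch of the generic pattern (not
hugetlb code, and from memory, so take the details with a grain of salt):

    /* truncation / hole-punch side, e.g. in a filesystem's fallocate */
    filemap_invalidate_lock(inode->i_mapping);
    truncate_pagecache_range(inode, start, end);
    filemap_invalidate_unlock(inode->i_mapping);

    /* fault side, when a folio has to be instantiated / read in */
    filemap_invalidate_lock_shared(mapping);
    ...
    filemap_invalidate_unlock_shared(mapping);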

Thanks,

>  	if (IS_ERR(folio)) {
>  		size = i_size_read(mapping->host) >> huge_page_shift(h);
>  		if (vmf->pgoff >= size)
> @@ -6521,6 +6549,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  		if (vma->vm_flags & VM_MAYSHARE) {
>  			int err = hugetlb_add_to_page_cache(folio, mapping,
>  							vmf->pgoff);
> +			folio_unlock(folio);
>  			if (err) {
>  				/*
>  				 * err can't be -EEXIST which implies someone
> @@ -6537,7 +6566,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  			}
>  			new_pagecache_folio = true;
>  		} else {
> -			folio_lock(folio);
>  			anon_rmap = 1;
>  		}
>  	} else {
> @@ -6603,7 +6631,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  	hugetlb_count_add(pages_per_huge_page(h), mm);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
>  		/* Optimization, do the COW without a second fault */
> -		ret = hugetlb_wp(folio, vmf);
> +		ret = hugetlb_wp(vmf, HUGETLB_NO_PAGE_CONTEXT);
>  	}
>  
>  	spin_unlock(vmf->ptl);
> @@ -6615,8 +6643,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  	 */
>  	if (new_folio)
>  		folio_set_hugetlb_migratable(folio);
> -
> -	folio_unlock(folio);
>  out:
>  	hugetlb_vma_unlock_read(vma);
>  
> @@ -6636,7 +6662,6 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
>  	if (new_folio && !new_pagecache_folio)
>  		restore_reserve_on_error(h, vma, vmf->address, folio);
>  
> -	folio_unlock(folio);
>  	folio_put(folio);
>  	goto out;
>  }
> @@ -6671,7 +6696,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	vm_fault_t ret;
>  	u32 hash;
>  	struct folio *folio = NULL;
> -	struct folio *pagecache_folio = NULL;
>  	struct hstate *h = hstate_vma(vma);
>  	struct address_space *mapping;
>  	int need_wait_lock = 0;
> @@ -6780,11 +6804,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		}
>  		/* Just decrements count, does not deallocate */
>  		vma_end_reservation(h, vma, vmf.address);
> -
> -		pagecache_folio = filemap_lock_hugetlb_folio(h, mapping,
> -							     vmf.pgoff);
> -		if (IS_ERR(pagecache_folio))
> -			pagecache_folio = NULL;
>  	}
>  
>  	vmf.ptl = huge_pte_lock(h, mm, vmf.pte);
> @@ -6798,10 +6817,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) {
>  		if (!userfaultfd_wp_async(vma)) {
>  			spin_unlock(vmf.ptl);
> -			if (pagecache_folio) {
> -				folio_unlock(pagecache_folio);
> -				folio_put(pagecache_folio);
> -			}
>  			hugetlb_vma_unlock_read(vma);
>  			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>  			return handle_userfault(&vmf, VM_UFFD_WP);
> @@ -6813,23 +6828,12 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		/* Fallthrough to CoW */
>  	}
>  
> -	/*
> -	 * hugetlb_wp() requires page locks of pte_page(vmf.orig_pte) and
> -	 * pagecache_folio, so here we need take the former one
> -	 * when folio != pagecache_folio or !pagecache_folio.
> -	 */
>  	folio = page_folio(pte_page(vmf.orig_pte));
> -	if (folio != pagecache_folio)
> -		if (!folio_trylock(folio)) {
> -			need_wait_lock = 1;
> -			goto out_ptl;
> -		}
> -
>  	folio_get(folio);
>  
>  	if (flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
>  		if (!huge_pte_write(vmf.orig_pte)) {
> -			ret = hugetlb_wp(pagecache_folio, &vmf);
> +			ret = hugetlb_wp(&vmf, HUGETLB_FAULT_CONTEXT);
>  			goto out_put_page;
>  		} else if (likely(flags & FAULT_FLAG_WRITE)) {
>  			vmf.orig_pte = huge_pte_mkdirty(vmf.orig_pte);
> @@ -6840,16 +6844,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  						flags & FAULT_FLAG_WRITE))
>  		update_mmu_cache(vma, vmf.address, vmf.pte);
>  out_put_page:
> -	if (folio != pagecache_folio)
> -		folio_unlock(folio);
>  	folio_put(folio);
>  out_ptl:
>  	spin_unlock(vmf.ptl);
> -
> -	if (pagecache_folio) {
> -		folio_unlock(pagecache_folio);
> -		folio_put(pagecache_folio);
> -	}
>  out_mutex:
>  	hugetlb_vma_unlock_read(vma);
>  
> @@ -6861,15 +6858,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  		vma_end_read(vma);
>  
>  	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> -	/*
> -	 * Generally it's safe to hold refcount during waiting page lock. But
> -	 * here we just wait to defer the next page fault to avoid busy loop and
> -	 * the page is not used after unlocked before returning from the current
> -	 * page fault. So we are safe from accessing freed page, even if we wait
> -	 * here without taking refcount.
> -	 */
> -	if (need_wait_lock)
> -		folio_wait_locked(folio);
>  	return ret;
>  }
>  
> -- 
> 2.49.0
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-02 15:14   ` Peter Xu
@ 2025-06-02 20:47     ` Oscar Salvador
  2025-06-02 21:30       ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2025-06-02 20:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Mon, Jun 02, 2025 at 11:14:26AM -0400, Peter Xu wrote:
> It makes me feel nervous when we move more thing over to fault mutex - I
> had a feeling it's abused.
> 
> IIRC the fault mutex was inintially introduced only to solve one problem on
> concurrent allocations.  I confess I didn't follow or yet dig into all
> history, though.  From that POV, mfill_atomic_hugetlb() is indeed
> reasonable to still take it because it's user-guided fault injections.  I'm
> not yet sure about other places, e.g., for truncations.

Hi Peter,

taking the mutex for truncation/punching hole was added back in:

 commit b5cec28d36f5ee6b4e6f68a0a40aa1e4045d6d99
 Author: Mike Kravetz <mike.kravetz@oracle.com>
 Date:   Tue Sep 8 15:01:41 2015 -0700
 
     hugetlbfs: truncate_hugepages() takes a range of pages

when the fallocate functionality and the punching-mode were introduced.
E.g:

 commit c672c7f29f2fdb73e1f72911bf499675c81fcdbb
 Author: Mike Kravetz <mike.kravetz@oracle.com>
 Date:   Tue Sep 8 15:01:35 2015 -0700
 
     mm/hugetlb: expose hugetlb fault mutex for use by fallocate
 
     hugetlb page faults are currently synchronized by the table of mutexes
     (htlb_fault_mutex_table).  fallocate code will need to synchronize with
     the page fault code when it allocates or deletes pages.  Expose
     interfaces so that fallocate operations can be synchronized with page
     faults.  Minor name changes to be more consistent with other global
     hugetlb symbols.

Sadly, I cannot really see why it was added.
It was introduced from RFC v2 -> RFC v3:

 RFC v3:
  Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
    in current code
  fallocate allocation and hole punch is synchronized with page
    faults via existing mutex table
   hole punch uses existing hugetlb_vmtruncate_list instead of more
    generic unmap_mapping_range for unmapping
   Error handling for the case when region_del() fauils

But RFC v2 shows no discussion on that matter whatsoever, so go figure
[1].
I read that some tests were written, so I am not sure whether any of
those tests caught a race.

Looking at remove_inode_single_folio, there is a case though:

 /*
  * If folio is mapped, it was faulted in after being
  * unmapped in caller.  Unmap (again) while holding
  * the fault mutex.  The mutex will prevent faults
  * until we finish removing the folio.
  */
	if (unlikely(folio_mapped(folio)))
		hugetlb_unmap_file_folio(h, mapping, folio, index);

So, while the folio_lock that is taken further down guards us against
removing the page from the pagecache, I guess that we need to take the
mutex to guard us against parallel faults for the page.

Looking back at earlier code, we did not handle that race.

It was reworked in

 commit 4aae8d1c051ea00b456da6811bc36d1f69de5445
 Author: Mike Kravetz <mike.kravetz@oracle.com>
 Date:   Fri Jan 15 16:57:40 2016 -0800
 
     mm/hugetlbfs: unmap pages if page fault raced with hole punch

and the comment looked like:

 /*
  * If page is mapped, it was faulted in after being
  * unmapped in caller.  Unmap (again) now after taking
  * the fault mutex.  The mutex will prevent faults
  * until we finish removing the page.
  *
  * This race can only happen in the hole punch case.
  * Getting here in a truncate operation is a bug.
  */

So, it was placed to close the race.
Now, I do not know whether we could close that race somewhat differently,
but it seems risky, and having the mutex here seems like a good
trade-off.


> Meanwhile, IIUC this patch can at least be split into more than one, in
> which case the 1st patch should only change the behavior of
> pagecache_folio, rather than the rest - if we really want to move more
> things over to fault mutex, we can do that in the 2nd+ patches.

Fair enough. This RFC was mainly meant to kick off a discussion and decide
the direction we want to take.

> I'd suggest we stick with fixing pagecache_folio issue first, which can be
> much easier and looks like a lock ordering issue only (while we can still
> resolve it with removing on lock, but only on pagecache_folio not rest yet).

Yes, I think that should be enough to determine if we already mapped and
CoWed the folio privately.

> > +
> >  /*
> > - * hugetlb_wp() should be called with page lock of the original hugepage held.
> >   * Called with hugetlb_fault_mutex_table held and pte_page locked so we
> >   * cannot race with other handlers or page migration.
> >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> >   */
> > -static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > -		       struct vm_fault *vmf)
> > +static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> >  	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
> >  	struct hstate *h = hstate_vma(vma);
> > -	struct folio *old_folio;
> > -	struct folio *new_folio;
> >  	bool cow_from_owner = 0;
> >  	vm_fault_t ret = 0;
> >  	struct mmu_notifier_range range;
> > +	struct folio *old_folio, *new_folio, *pagecache_folio;
> > +	struct address_space *mapping = vma->vm_file->f_mapping;
> >  
> >  	/*
> >  	 * Never handle CoW for uffd-wp protected pages.  It should be only
> > @@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> >  	 * we run out of free hugetlb folios: we would have to kill processes
> >  	 * in scenarios that used to work. As a side effect, there can still
> >  	 * be leaks between processes, for example, with FOLL_GET users.
> > +	 *
> > +	 * We need to take the lock here because the folio might be undergoing a
> > +	 * migration. e.g: see folio_try_share_anon_rmap_pte.
> >  	 */
> 
> I agree we'd better keep the old_folio locked as of now to be further
> discussed, but I'm not 100% sure about the comment - doesn't migration path
> still need the pgtable lock to modify mapcounts?  I think we hold pgtable
> lock here.

Uhm, yes, looking closer I think you are right.
In hugetlb_fault, prior to calling hugetlb_wp we take the ptl spinlock,
and the migration path also takes it (via page_vma_mapped_walk), so we
should not be seeing any spurious data from hugetlb_remove_rmap, as we
are serialized.

> > +	/*
> > +	 * We can be called from two different contexts: hugetlb_no_page or
> > +	 * hugetlb_fault.
> > +	 * When called from the latter, we try to find the original folio in
> > +	 * the pagecache and compare it against the current folio we have mapped
> > +	 * in the pagetables. If it differs, it means that this process already
> > +	 * CoWed and mapped the folio privately, so we know that a reservation
> > +	 * was already consumed.
> > +	 * This will be latter used to determine whether we need to unmap the folio
> > +	 * from other processes in case we fail to allocate a new folio.
> > +	 */
> > +	if (context == HUGETLB_FAULT_CONTEXT) {
> > +		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> > +		if (IS_ERR(pagecache_folio))
> > +			pagecache_folio = NULL;
> > +		else
> > +			folio_put(pagecache_folio);
> 
> So here we released the refcount but then we're referencing the pointer
> below..  I don't know whether this is wise, looks like it's prone to
> races..  If we want, we can compare the folios before releasing the
> refcount.  Said that,...

We are holding the mutex so the folio cannot really go anywhere, but we
could folio_put() after the check as well.
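
Something along these lines (just a sketch; "reservation_consumed" stands in
for whatever we end up feeding into cow_from_owner):

    pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
    if (IS_ERR(pagecache_folio)) {
            pagecache_folio = NULL;
    } else {
            if (old_folio != pagecache_folio)
                    reservation_consumed = true;
            /* drop the reference only after the comparison */
            folio_put(pagecache_folio);
    }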

> > -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > -			old_folio != pagecache_folio)
> > +	if (context == HUGETLB_FAULT_CONTEXT &&
> > +	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > +	    old_folio != pagecache_folio)
> 
> .. here I am actually thinking whether we can use folio_test_anon() and
> completely avoid looking up the page cache.  Here, the ultimate goal is
> trying to know whether this is a CoW on top of a private page.  Then IIUC
> folio_test_anon(old_folio) is enough.

I have yet to check it fully, but this sounds reasonable to me.
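
If I got your idea right, something like this (untested sketch; an anon
old_folio in a MAP_PRIVATE mapping means this process already CoWed the
page, so the reservation at this address was already consumed):

    if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && folio_test_anon(old_folio))
            cow_from_owner = true;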

> >  	new_folio = false;
> > -	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> > +	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> 
> So this is part of the changes that I mentioned previously, that we should
> avoid doing in one patch; actually I really think we should think twice on
> using fault mutex explicitly for more things.  Actually I tend to go the
> other way round: I used to think whether we can avoid some fault mutex in
> some paths.  E.g. for UFFDIO_COPY it still makes sense to take fault mutex
> because it faces similar challenge of concurrent allocations. However I'm
> not sure about truncate/hole-punch lines.  IIUC most file folios use
> invalidate_lock for that, and I'd think going the other way to use the same
> lock in hugetlb code, then keep fault mutex as minimum used as possible,
> then the semantics of the lock and what it protects are much clearer.

I hear you, but as explained above, I am not entirely sure we can get
rid of the mutex in the truncation/punch-hole path, and if that is the
case, having both the lock and the mutex is definitely overkill.

I can split up the changes, I have no problem with that.
Moreover, we might be able to get away with checking for an anon folio
instead of the pagecache_folio comparison we have now.

 
Thanks a lot for the feedback Peter! 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-02 20:47     ` Oscar Salvador
@ 2025-06-02 21:30       ` Peter Xu
  2025-06-03 13:50         ` Oscar Salvador
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2025-06-02 21:30 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Mon, Jun 02, 2025 at 10:47:04PM +0200, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 11:14:26AM -0400, Peter Xu wrote:
> > It makes me feel nervous when we move more thing over to fault mutex - I
> > had a feeling it's abused.
> > 
> > IIRC the fault mutex was inintially introduced only to solve one problem on
> > concurrent allocations.  I confess I didn't follow or yet dig into all
> > history, though.  From that POV, mfill_atomic_hugetlb() is indeed
> > reasonable to still take it because it's user-guided fault injections.  I'm
> > not yet sure about other places, e.g., for truncations.
> 
> Hi Peter,
> 
> taking the mutex for truncation/punching hole was added back in:
> 
>  commit b5cec28d36f5ee6b4e6f68a0a40aa1e4045d6d99
>  Author: Mike Kravetz <mike.kravetz@oracle.com>
>  Date:   Tue Sep 8 15:01:41 2015 -0700
>  
>      hugetlbfs: truncate_hugepages() takes a range of pages
> 
> when the fallocate functionality and the punching-mode were introduced.
> E.g:
> 
>  commit c672c7f29f2fdb73e1f72911bf499675c81fcdbb
>  Author: Mike Kravetz <mike.kravetz@oracle.com>
>  Date:   Tue Sep 8 15:01:35 2015 -0700
>  
>      mm/hugetlb: expose hugetlb fault mutex for use by fallocate
>  
>      hugetlb page faults are currently synchronized by the table of mutexes
>      (htlb_fault_mutex_table).  fallocate code will need to synchronize with
>      the page fault code when it allocates or deletes pages.  Expose
>      interfaces so that fallocate operations can be synchronized with page
>      faults.  Minor name changes to be more consistent with other global
>      hugetlb symbols.
> 
> Sadly, I cannot really see why it was added.
> It was introduced from RFC v2 -> RFC v3:
> 
>  RFC v3:
>   Folded in patch for alloc_huge_page/hugetlb_reserve_pages race
>     in current code
>   fallocate allocation and hole punch is synchronized with page
>     faults via existing mutex table
>    hole punch uses existing hugetlb_vmtruncate_list instead of more
>     generic unmap_mapping_range for unmapping
>    Error handling for the case when region_del() fauils
> 
> But RFC v2 shows no discussions in that matter whatsoever, so go figure
> [1].
> I read that some tests were written, so I am not sure whether any of
> those tests caught a race.
> 
> Looking at remove_inode_single_folio, there is a case though:
> 
>  /*
>   * If folio is mapped, it was faulted in after being
>   * unmapped in caller.  Unmap (again) while holding
>   * the fault mutex.  The mutex will prevent faults
>   * until we finish removing the folio.
>   */
> 	if (unlikely(folio_mapped(folio)))
> 		hugetlb_unmap_file_folio(h, mapping, folio, index);
> 
> So, while the folio_lock that is taken further down guards us against
> removing the page from the pagecache, I guess that we need to take the
> mutex to guard us against parallel faults for the page.
> 
> Looking back at earlier code, we did not handle that race.
> 
> It was reworked in
> 
>  commit 4aae8d1c051ea00b456da6811bc36d1f69de5445
>  Author: Mike Kravetz <mike.kravetz@oracle.com>
>  Date:   Fri Jan 15 16:57:40 2016 -0800
>  
>      mm/hugetlbfs: unmap pages if page fault raced with hole punch
> 
> and the comment looked like:
> 
>  /*
>   * If page is mapped, it was faulted in after being
>   * unmapped in caller.  Unmap (again) now after taking
>   * the fault mutex.  The mutex will prevent faults
>   * until we finish removing the page.
>   *
>   * This race can only happen in the hole punch case.
>   * Getting here in a truncate operation is a bug.
>   */
> 
> So, it was placed to close the race.
> Now, I do not know whether we could close that race somewhat different,
> but it seems risky, and having the mutex here seems like a good
> trade-off.

Right, and thanks for the git digging as usual.  I would agree hugetlb is
more of a challenge than many other modules for git archaeology. :)

Even though I mentioned the invalidate_lock, I don't think I thought deeper
than that.  I just wish that, whenever possible, we keep moving hugetlb code
closer to generic code, so if that's the goal we may still want to have a
closer look one day at whether hugetlb can also use invalidate_lock.  Maybe
it isn't worthwhile in the end: invalidate_lock is currently a rwsem, which
normally at least allows concurrent faults, but that's currently what isn't
allowed in hugetlb anyway..

If we start to remove finer-grained locks, that work will be even harder,
and removing the folio lock in the fault path also moves hugetlbfs even
further away from other filesystems.  That might be slightly against what we
used to wish for, which is to make it closer to the others.  Meanwhile, I'm
also not yet sure of the benefit of not taking the folio lock throughout,
e.g. I don't expect perf would change at all even if the lock is avoided.
We may want to think about that too when doing so.

> 
> 
> > Meanwhile, IIUC this patch can at least be split into more than one, in
> > which case the 1st patch should only change the behavior of
> > pagecache_folio, rather than the rest - if we really want to move more
> > things over to fault mutex, we can do that in the 2nd+ patches.
> 
> Fair enough. This RFC was mainly for kicking off a discussion and decide
> the direction we want to take.
> 
> > I'd suggest we stick with fixing pagecache_folio issue first, which can be
> > much easier and looks like a lock ordering issue only (while we can still
> > resolve it with removing on lock, but only on pagecache_folio not rest yet).
> 
> Yes, I think that that should be enough to determine if we already mapped and cowed the
> folio privately.
> 
> > > +
> > >  /*
> > > - * hugetlb_wp() should be called with page lock of the original hugepage held.
> > >   * Called with hugetlb_fault_mutex_table held and pte_page locked so we
> > >   * cannot race with other handlers or page migration.
> > >   * Keep the pte_same checks anyway to make transition from the mutex easier.
> > >   */
> > > -static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > > -		       struct vm_fault *vmf)
> > > +static vm_fault_t hugetlb_wp(struct vm_fault *vmf, enum cow_context context)
> > >  {
> > >  	struct vm_area_struct *vma = vmf->vma;
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
> > >  	pte_t pte = huge_ptep_get(mm, vmf->address, vmf->pte);
> > >  	struct hstate *h = hstate_vma(vma);
> > > -	struct folio *old_folio;
> > > -	struct folio *new_folio;
> > >  	bool cow_from_owner = 0;
> > >  	vm_fault_t ret = 0;
> > >  	struct mmu_notifier_range range;
> > > +	struct folio *old_folio, *new_folio, *pagecache_folio;
> > > +	struct address_space *mapping = vma->vm_file->f_mapping;
> > >  
> > >  	/*
> > >  	 * Never handle CoW for uffd-wp protected pages.  It should be only
> > > @@ -6198,7 +6201,11 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> > >  	 * we run out of free hugetlb folios: we would have to kill processes
> > >  	 * in scenarios that used to work. As a side effect, there can still
> > >  	 * be leaks between processes, for example, with FOLL_GET users.
> > > +	 *
> > > +	 * We need to take the lock here because the folio might be undergoing a
> > > +	 * migration. e.g: see folio_try_share_anon_rmap_pte.
> > >  	 */
> > 
> > I agree we'd better keep the old_folio locked as of now to be further
> > discussed, but I'm not 100% sure about the comment - doesn't migration path
> > still need the pgtable lock to modify mapcounts?  I think we hold pgtable
> > lock here.
> 
> Uhm, yes, looking closer I think you are right.
> In hugetlb_fault, prior to call in hugetlb_wp we take the ptl spinlock,
> and the migration path also takes it (via page_vma_mapped_walk), so we
> should not be seeing any spurious data from hugetlb_remove_rmap, as we
> are serialized.
> 
> > > +	/*
> > > +	 * We can be called from two different contexts: hugetlb_no_page or
> > > +	 * hugetlb_fault.
> > > +	 * When called from the latter, we try to find the original folio in
> > > +	 * the pagecache and compare it against the current folio we have mapped
> > > +	 * in the pagetables. If it differs, it means that this process already
> > > +	 * CoWed and mapped the folio privately, so we know that a reservation
> > > +	 * was already consumed.
> > > +	 * This will be latter used to determine whether we need to unmap the folio
> > > +	 * from other processes in case we fail to allocate a new folio.
> > > +	 */
> > > +	if (context == HUGETLB_FAULT_CONTEXT) {
> > > +		pagecache_folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> > > +		if (IS_ERR(pagecache_folio))
> > > +			pagecache_folio = NULL;
> > > +		else
> > > +			folio_put(pagecache_folio);
> > 
> > So here we released the refcount but then we're referencing the pointer
> > below..  I don't know whether this is wise, looks like it's prone to
> > races..  If we want, we can compare the folios before releasing the
> > refcount.  Said that,...
> 
> We are holding the mutex so the folio cannot really go anywhere, but
> we could folio_put() after the check as well.
> 
> > > -	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > > -			old_folio != pagecache_folio)
> > > +	if (context == HUGETLB_FAULT_CONTEXT &&
> > > +	    is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
> > > +	    old_folio != pagecache_folio)
> > 
> > .. here I am actually thinking whether we can use folio_test_anon() and
> > completely avoid looking up the page cache.  Here, the ultimate goal is
> > trying to know whether this is a CoW on top of a private page.  Then IIUC
> > folio_test_anon(old_folio) is enough.
> 
> I yet have to fully check, but this sounds reasonable to me.
> 
> > >  	new_folio = false;
> > > -	folio = filemap_lock_hugetlb_folio(h, mapping, vmf->pgoff);
> > > +	folio = filemap_get_hugetlb_folio(h, mapping, vmf->pgoff);
> > 
> > So this is part of the changes that I mentioned previously, that we should
> > avoid doing in one patch; actually I really think we should think twice on
> > using fault mutex explicitly for more things.  Actually I tend to go the
> > other way round: I used to think whether we can avoid some fault mutex in
> > some paths.  E.g. for UFFDIO_COPY it still makes sense to take fault mutex
> > because it faces similar challenge of concurrent allocations. However I'm
> > not sure about truncate/hole-punch lines.  IIUC most file folios use
> > invalidate_lock for that, and I'd think going the other way to use the same
> > lock in hugetlb code, then keep fault mutex as minimum used as possible,
> > then the semantics of the lock and what it protects are much clearer.
> 
> I hear you, but as explained above, I am not entirely sure we can get
> rid of the mutex in the truncation/punch-hole path, and if that is the
> case, having both the lock and the mutex is definitely overkill.
> 
> I can split up the changes, I have no problem with that.

Thanks!  I hope that will also help whatever patch we end up with to land
sooner, once it can be verified to fix the issue.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-02 21:30       ` Peter Xu
@ 2025-06-03 13:50         ` Oscar Salvador
  2025-06-03 14:57           ` Peter Xu
  2025-06-03 15:12           ` David Hildenbrand
  0 siblings, 2 replies; 18+ messages in thread
From: Oscar Salvador @ 2025-06-03 13:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> Right, and thanks for the git digging as usual.  I would agree hugetlb is
> more challenge than many other modules on git archaeology. :)
> 
> Even if I mentioned the invalidate_lock, I don't think I thought deeper
> than that. I just wished whenever possible we still move hugetlb code
> closer to generic code, so if that's the goal we may still want to one day
> have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
> it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
> normally at least allows concurrent fault, but that's currently what isn't
> allowed in hugetlb anyway..
> 
> If we start to remove finer grained locks that work will be even harder,
> and removing folio lock in this case in fault path also brings hugetlbfs
> even further from other file systems.  That might be slightly against what
> we used to wish to do, which is to make it closer to others.  Meanwhile I'm
> also not yet sure the benefit of not taking folio lock all across, e.g. I
> don't expect perf would change at all even if lock is avoided.  We may want
> to think about that too when doing so.

Ok, I have to confess I was not looking at things from this perspective,
but when doing so, yes, you are right: we should strive to replace
hugetlb-specific code with generic code wherever we can.

I do not know about this case though; I am not sure what other options we
have for shutting out concurrent faults while performing another operation.
But it is something we should definitely look at.
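
(For the record, the generic pattern Peter is referring to is the mapping's
invalidate_lock rwsem; roughly like the below, without claiming that hugetlb
could adopt it as-is:

	/* truncation / hole-punch side */
	filemap_invalidate_lock(mapping);
	/* ... remove folios from the page cache ... */
	filemap_invalidate_unlock(mapping);

	/* fault side, e.g. filemap_fault() */
	filemap_invalidate_lock_shared(mapping);
	/* ... look up and map the folio ... */
	filemap_invalidate_unlock_shared(mapping);

so readers, i.e. faults, can run concurrently while truncation excludes them.)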

Wrt. the locks.
There were two locks: old_folio's (taken in hugetlb_fault) and
pagecache_folio's.
The concern was not about how much perf we leave on the table because of
these locks, as I am pretty sure it is next to zero; my drive was to
understand what they protect and why, because as the discussion showed,
none of us really had a good idea about it, and it turns out that this
goes back more than ~20 years.

Another topic is the lock on old_folio (so the one we copy from):
when we compare it to generic code, we do not take the lock there.
Looking at do_wp_page(), we do __get__ a reference on the folio we copy
from, but we do not take the lock, so AFAIU the lock seems to be there
only to please folio_move_anon_rmap() in hugetlb_wp().

Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
calls folio_move_anon_rmap() in case we can re-use the folio, it only
takes the lock before the call to folio_move_anon_rmap(), and then
unlocks it.

Which, I think, hugetlb should also do.

 do_wp_page
  wp_can_reuse_anon_folio ?
   : yes: folio_lock ; folio_move_anon_rmap ; folio_unlock
     bail out
   : no: get a reference on the folio and call wp_page_copy

So, this should be the lead that hugetlb follows.
As I said, it is not about the performance, and I agree that relying on
finer granularity locks is the way to go, but we need to understand
where and why, and with the current code from upstream, that is not
clear at all.

That is why I wanted to reduce the scope of old_folio's lock to what
actually needs it, which is this snippet:

  if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
           if (!PageAnonExclusive(&old_folio->page)) {
                    folio_move_anon_rmap(old_folio, vma);
                    SetPageAnonExclusive(&old_folio->page);
           }
           if (likely(!unshare))
                    set_huge_ptep_maybe_writable(vma, vmf->address,
                                                 vmf->pte);
  
           delayacct_wpcopy_end();
           return 0;
  }

I think it is important to 1) reduce it so that the lock wraps only what
actually needs to be within it and 2) document why, so no one has to put
on the gloves and start digging through the history again.
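
A rough sketch of what I have in mind (untested, only to illustrate the
scope; the symbols are the ones from the snippet above, and the trylock
mirrors what do_wp_page() does, falling back to the copy path if it fails):

	if (folio_test_anon(old_folio) && folio_trylock(old_folio)) {
		if (folio_mapcount(old_folio) == 1) {
			if (!PageAnonExclusive(&old_folio->page)) {
				folio_move_anon_rmap(old_folio, vma);
				SetPageAnonExclusive(&old_folio->page);
			}
			folio_unlock(old_folio);
			if (likely(!unshare))
				set_huge_ptep_maybe_writable(vma, vmf->address,
							     vmf->pte);
			delayacct_wpcopy_end();
			return 0;
		}
		folio_unlock(old_folio);
	}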

> Thanks!  I hope that'll also help whatever patch to land sooner, after it
> can be verified to fix the issue.

So, my plan is:

1) Fix the pagecache_folio issue in one patch (test for anon; I still need
   to check, but it should work)
2) Implement the 'filemap_get_hugetlb_folio' helper to get a reference
   without locking the folio
3) Reduce the scope of old_folio's lock

I want to make it clear that while I still want to add filemap_get_hugetlb_folio
and stop using the locking version, the reason is not to give more power to the
mutex, but to bring this closer to what do_wp_page() does.
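
For reference, the helper I have in mind is nothing more than the get
variant of the existing filemap_lock_hugetlb_folio(); something along these
lines (name and shape are just my proposal, nothing upstream yet):

	static inline struct folio *filemap_get_hugetlb_folio(struct hstate *h,
			struct address_space *mapping, pgoff_t idx)
	{
		return filemap_get_folio(mapping, idx << huge_page_order(h));
	}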

What do you think about it?
 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 13:50         ` Oscar Salvador
@ 2025-06-03 14:57           ` Peter Xu
  2025-06-03 15:08             ` David Hildenbrand
  2025-06-03 18:31             ` Peter Xu
  2025-06-03 15:12           ` David Hildenbrand
  1 sibling, 2 replies; 18+ messages in thread
From: Peter Xu @ 2025-06-03 14:57 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > Right, and thanks for the git digging as usual.  I would agree hugetlb is
> > more challenge than many other modules on git archaeology. :)
> > 
> > Even if I mentioned the invalidate_lock, I don't think I thought deeper
> > than that. I just wished whenever possible we still move hugetlb code
> > closer to generic code, so if that's the goal we may still want to one day
> > have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
> > it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
> > normally at least allows concurrent fault, but that's currently what isn't
> > allowed in hugetlb anyway..
> > 
> > If we start to remove finer grained locks that work will be even harder,
> > and removing folio lock in this case in fault path also brings hugetlbfs
> > even further from other file systems.  That might be slightly against what
> > we used to wish to do, which is to make it closer to others.  Meanwhile I'm
> > also not yet sure the benefit of not taking folio lock all across, e.g. I
> > don't expect perf would change at all even if lock is avoided.  We may want
> > to think about that too when doing so.
> 
> Ok, I have to confess I was not looking things from this perspective,
> but when doing so, yes, you are right, we should strive to find
> replacements wherever we can for not using hugetlb-specific code.
> 
> I do not know about this case though, not sure what other options do we
> have when trying to shut concurrent faults while doing other operation.
> But it is something we should definitely look at.
> 
> Wrt. to the lock.
> There were two locks, old_folio (taken in hugetlb_fault) and
> pagecache_folio one.

There're actually three places this patch touched; the 3rd one is
hugetlb_no_page(), in which case I also think we should lock the folio, not
only because file folios normally do it (see do_fault(), for example), but
also because that's exactly what James mentioned, I believe, about a possible
race with a !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along
the lines of:

		folio = alloc_hugetlb_folio(vma, vmf->address, false);
                ...
		folio_zero_user(folio, vmf->real_address);
		__folio_mark_uptodate(folio);

> The thing was not about worry as how much perf we leave on the table
> because of these locks, as I am pretty sure is next to 0, but my drive
> was to understand what are protection and why, because as the discussion
> showed, none of us really had a good idea about it and it turns out that this
> goes back more than ~20 years ago.
> 
> Another topic for the lock (old_folio, so the one we copy from),
> when we compare it to generic code, we do not take the lock there.
> Looking at do_wp_page(), we do __get__ a reference on the folio we copy
> from, but not the lock, so AFAIU, the lock seems only to please

Yes this is a good point; for CoW path alone maybe we don't need to lock
old_folio.

> folio_move_anon_rmap() from hugetlb_wp.
> 
> Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
> calls folio_move_anon_rmap() in case we can re-use the folio, it only
> takes the lock before the call to folio_move_anon_rmap(), and then
> unlocks it.

IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
for checking the swapcache/ksm stuff, which needs to be serialized with the
folio lock.

So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
deserves a data_race(), and IIUC it only works not because of the folio lock,
but because of how the anon_vma is managed as a tree as of now, so that as
long as we use WRITE_ONCE() even a race is benign (because the rmap walker
will either see a complete old anon_vma that includes the parent process's
anon_vma, or the child's).  What really protects the anon_vma should be the
anon_vma lock.  That can definitely be a separate topic.  I'm not sure whether
you'd like to dig this part out, but if you do I'd also be more than happy to
know whether my understanding needs correction here.. :)
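
(For context, the store in question is essentially a single pointer write;
paraphrasing folio_move_anon_rmap() from memory, so take the exact lines
with a grain of salt:

	void *anon_vma = (void *)vma->anon_vma + PAGE_MAPPING_ANON;

	WRITE_ONCE(folio->mapping, anon_vma);

which is why I think the question is really about what concurrent readers
of folio->mapping rely on.)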

In general, I still agree with you that if hugetlb CoW path can look closer
to do_wp_page then it's great.

> 
> Which, I think, hugetlb should also do.
> 
>  do_wp_page
>   wp_can_reuse_anon_folio ?
>    : yes: folio_lock ; folio_move_anon_rmap ; folio_unlock
>      bail out
>    : no: get a reference on the folio and call wp_page_copy
> 
> So, this should be the lead that hugetlb follows.
> As I said, it is not about the performance, and I agree that relying on
> finer granularity locks is the way to go, but we need to understand
> where and why, and with the current code from upstream, that is not
> clear at all.
> 
> That is why I wanted to reduce the scope of old_folio to what is
> actually needed, which is the snippet:
> 
>   if (folio_mapcount(old_folio) == 1 && folio_test_anon(old_folio)) {
>            if (!PageAnonExclusive(&old_folio->page)) {
>                     folio_move_anon_rmap(old_folio, vma);
>                     SetPageAnonExclusive(&old_folio->page);
>            }
>            if (likely(!unshare))
>                     set_huge_ptep_maybe_writable(vma, vmf->address,
>                                                  vmf->pte);
>   
>            delayacct_wpcopy_end();
>            return 0;
>   }
> 
> I think it is important to 1) reduce it to wrap what actually needs to
> be within the lock and 2) document why, so no one has to put the gloves
> and start digging in the history again.
> 
> > Thanks!  I hope that'll also help whatever patch to land sooner, after it
> > can be verified to fix the issue.
> 
> So, my plan is:
> 
> 1) Fix pagecache folio issue in one patch (test for anon, still need to
>    check but it should work)
> 2) implement the 'filemap_get_hugetlb_folio' thing to get a reference and not
>    lock it
> 3) reduce scope of old_folio
> 
> I want to make it clear that while I still want to add filemap_get_hugetlb_folio
> and stop using the lock version, the reason is not to give more power to the mutex,
> but to bring it closer to what do_wp_page does.
> 
> What do you think about it?

So in this case, as long as (2) won't change the folio locking for the
no_page path, I agree.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 14:57           ` Peter Xu
@ 2025-06-03 15:08             ` David Hildenbrand
  2025-06-03 15:46               ` Peter Xu
  2025-06-03 18:31             ` Peter Xu
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-03 15:08 UTC (permalink / raw)
  To: Peter Xu, Oscar Salvador
  Cc: Andrew Morton, Muchun Song, James Houghton, Gavin Guo, linux-mm,
	linux-kernel

On 03.06.25 16:57, Peter Xu wrote:
> On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
>> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
>>> Right, and thanks for the git digging as usual.  I would agree hugetlb is
>>> more challenge than many other modules on git archaeology. :)
>>>
>>> Even if I mentioned the invalidate_lock, I don't think I thought deeper
>>> than that. I just wished whenever possible we still move hugetlb code
>>> closer to generic code, so if that's the goal we may still want to one day
>>> have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
>>> it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
>>> normally at least allows concurrent fault, but that's currently what isn't
>>> allowed in hugetlb anyway..
>>>
>>> If we start to remove finer grained locks that work will be even harder,
>>> and removing folio lock in this case in fault path also brings hugetlbfs
>>> even further from other file systems.  That might be slightly against what
>>> we used to wish to do, which is to make it closer to others.  Meanwhile I'm
>>> also not yet sure the benefit of not taking folio lock all across, e.g. I
>>> don't expect perf would change at all even if lock is avoided.  We may want
>>> to think about that too when doing so.
>>
>> Ok, I have to confess I was not looking things from this perspective,
>> but when doing so, yes, you are right, we should strive to find
>> replacements wherever we can for not using hugetlb-specific code.
>>
>> I do not know about this case though, not sure what other options do we
>> have when trying to shut concurrent faults while doing other operation.
>> But it is something we should definitely look at.
>>
>> Wrt. to the lock.
>> There were two locks, old_folio (taken in hugetlb_fault) and
>> pagecache_folio one.
> 
> There're actually three places this patch touched, the 3rd one is
> hugetlb_no_page(), in which case I also think we should lock it, not only
> because file folios normally does it (see do_fault(), for example), but
> also that's exactly what James mentioned I believe on possible race of
> !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines:
> 
> 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
>                  ...
> 		folio_zero_user(folio, vmf->real_address);
> 		__folio_mark_uptodate(folio);
> 
>> The thing was not about worry as how much perf we leave on the table
>> because of these locks, as I am pretty sure is next to 0, but my drive
>> was to understand what are protection and why, because as the discussion
>> showed, none of us really had a good idea about it and it turns out that this
>> goes back more than ~20 years ago.
>>
>> Another topic for the lock (old_folio, so the one we copy from),
>> when we compare it to generic code, we do not take the lock there.
>> Looking at do_wp_page(), we do __get__ a reference on the folio we copy
>> from, but not the lock, so AFAIU, the lock seems only to please
> 
> Yes this is a good point; for CoW path alone maybe we don't need to lock
> old_folio.
> 
>> folio_move_anon_rmap() from hugetlb_wp.
>>
>> Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
>> calls folio_move_anon_rmap() in case we can re-use the folio, it only
>> takes the lock before the call to folio_move_anon_rmap(), and then
>> unlocks it.
> 
> IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
> for checking swapcache/ksm stuff which needs to be serialized with folio
> lock.
> 
> So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
> deserves a data_race() and IIUC it only work not because of the folio lock,
> but because of how anon_vma is managed as a tree as of now, so that as long
> as WRITE_ONCE() even a race is benign (because the rmap walker will either
> see a complete old anon_vma that includes the parent process's anon_vma, or
> the child's).  What really protects the anon_vma should really be anon_vma
> lock.. That can definitely be a separate topic.  I'm not sure whether you'd
> like to dig this part out, but if you do I'd also be more than happy to
> know whether my understanding needs correction here.. :)
> 
> In general, I still agree with you that if hugetlb CoW path can look closer
> to do_wp_page then it's great.


As stated elsewhere, the mapcount check + folio_move_anon_rmap need the 
folio lock.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 13:50         ` Oscar Salvador
  2025-06-03 14:57           ` Peter Xu
@ 2025-06-03 15:12           ` David Hildenbrand
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2025-06-03 15:12 UTC (permalink / raw)
  To: Oscar Salvador, Peter Xu
  Cc: Andrew Morton, Muchun Song, James Houghton, Gavin Guo, linux-mm,
	linux-kernel

On 03.06.25 15:50, Oscar Salvador wrote:
> On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
>> Right, and thanks for the git digging as usual.  I would agree hugetlb is
>> more challenge than many other modules on git archaeology. :)
>>
>> Even if I mentioned the invalidate_lock, I don't think I thought deeper
>> than that. I just wished whenever possible we still move hugetlb code
>> closer to generic code, so if that's the goal we may still want to one day
>> have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
>> it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
>> normally at least allows concurrent fault, but that's currently what isn't
>> allowed in hugetlb anyway..
>>
>> If we start to remove finer grained locks that work will be even harder,
>> and removing folio lock in this case in fault path also brings hugetlbfs
>> even further from other file systems.  That might be slightly against what
>> we used to wish to do, which is to make it closer to others.  Meanwhile I'm
>> also not yet sure the benefit of not taking folio lock all across, e.g. I
>> don't expect perf would change at all even if lock is avoided.  We may want
>> to think about that too when doing so.
> 
> Ok, I have to confess I was not looking things from this perspective,
> but when doing so, yes, you are right, we should strive to find
> replacements wherever we can for not using hugetlb-specific code.
> 
> I do not know about this case though, not sure what other options do we
> have when trying to shut concurrent faults while doing other operation.
> But it is something we should definitely look at.
> 
> Wrt. to the lock.
> There were two locks, old_folio (taken in hugetlb_fault) and
> pagecache_folio one.
> The thing was not about worry as how much perf we leave on the table
> because of these locks, as I am pretty sure is next to 0, but my drive
> was to understand what are protection and why, because as the discussion
> showed, none of us really had a good idea about it and it turns out that this
> goes back more than ~20 years ago.
> 
> Another topic for the lock (old_folio, so the one we copy from),
> when we compare it to generic code, we do not take the lock there.
> Looking at do_wp_page(), we do __get__ a reference on the folio we copy
> from, but not the lock, so AFAIU, the lock seems only to please
> folio_move_anon_rmap() from hugetlb_wp.
> 
> Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
> calls folio_move_anon_rmap() in case we can re-use the folio, it only
> takes the lock before the call to folio_move_anon_rmap(), and then
> unlocks it.

No.

It takes the lock around "folio_ref_count(folio) != 1" as well.

IOW, if the ref_count is 1, the mapcount must be <= 1, and as the page 
*is* mapped, we know the mapcount is >= 1.

So if the ref_count == mapcount == 1 and the folio is locked, we cannot 
have concurrent unmapping/splitting/migration of the folio that could 
affect the mapcount/refcount.
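
Roughly, the relevant shape of wp_can_reuse_anon_folio() is the below
(simplified, with the swapcache/KSM handling trimmed):

	if (!folio_trylock(folio))
		return false;
	/* The refcount check happens under the folio lock. */
	if (folio_ref_count(folio) != 1) {
		folio_unlock(folio);
		return false;
	}
	/* Sole reference, page is mapped, folio locked: safe to reuse. */
	folio_move_anon_rmap(folio, vma);
	folio_unlock(folio);
	return true;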

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 15:08             ` David Hildenbrand
@ 2025-06-03 15:46               ` Peter Xu
  2025-06-03 17:19                 ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2025-06-03 15:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Muchun Song, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 05:08:55PM +0200, David Hildenbrand wrote:
> On 03.06.25 16:57, Peter Xu wrote:
> > On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> > > On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > > > Right, and thanks for the git digging as usual.  I would agree hugetlb is
> > > > more challenge than many other modules on git archaeology. :)
> > > > 
> > > > Even if I mentioned the invalidate_lock, I don't think I thought deeper
> > > > than that. I just wished whenever possible we still move hugetlb code
> > > > closer to generic code, so if that's the goal we may still want to one day
> > > > have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
> > > > it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
> > > > normally at least allows concurrent fault, but that's currently what isn't
> > > > allowed in hugetlb anyway..
> > > > 
> > > > If we start to remove finer grained locks that work will be even harder,
> > > > and removing folio lock in this case in fault path also brings hugetlbfs
> > > > even further from other file systems.  That might be slightly against what
> > > > we used to wish to do, which is to make it closer to others.  Meanwhile I'm
> > > > also not yet sure the benefit of not taking folio lock all across, e.g. I
> > > > don't expect perf would change at all even if lock is avoided.  We may want
> > > > to think about that too when doing so.
> > > 
> > > Ok, I have to confess I was not looking things from this perspective,
> > > but when doing so, yes, you are right, we should strive to find
> > > replacements wherever we can for not using hugetlb-specific code.
> > > 
> > > I do not know about this case though, not sure what other options do we
> > > have when trying to shut concurrent faults while doing other operation.
> > > But it is something we should definitely look at.
> > > 
> > > Wrt. to the lock.
> > > There were two locks, old_folio (taken in hugetlb_fault) and
> > > pagecache_folio one.
> > 
> > There're actually three places this patch touched, the 3rd one is
> > hugetlb_no_page(), in which case I also think we should lock it, not only
> > because file folios normally does it (see do_fault(), for example), but
> > also that's exactly what James mentioned I believe on possible race of
> > !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines:
> > 
> > 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> >                  ...
> > 		folio_zero_user(folio, vmf->real_address);
> > 		__folio_mark_uptodate(folio);
> > 
> > > The thing was not about worry as how much perf we leave on the table
> > > because of these locks, as I am pretty sure is next to 0, but my drive
> > > was to understand what are protection and why, because as the discussion
> > > showed, none of us really had a good idea about it and it turns out that this
> > > goes back more than ~20 years ago.
> > > 
> > > Another topic for the lock (old_folio, so the one we copy from),
> > > when we compare it to generic code, we do not take the lock there.
> > > Looking at do_wp_page(), we do __get__ a reference on the folio we copy
> > > from, but not the lock, so AFAIU, the lock seems only to please
> > 
> > Yes this is a good point; for CoW path alone maybe we don't need to lock
> > old_folio.
> > 
> > > folio_move_anon_rmap() from hugetlb_wp.
> > > 
> > > Taking a look at do_wp_page()->wp_can_reuse_anon_folio() which also
> > > calls folio_move_anon_rmap() in case we can re-use the folio, it only
> > > takes the lock before the call to folio_move_anon_rmap(), and then
> > > unlocks it.
> > 
> > IMHO, do_wp_page() took the folio lock not for folio_move_anon_rmap(), but
> > for checking swapcache/ksm stuff which needs to be serialized with folio
> > lock.
> > 
> > So I'm not 100% confident on the folio_move_anon_rmap(), but I _think_ it
> > deserves a data_race() and IIUC it only work not because of the folio lock,
> > but because of how anon_vma is managed as a tree as of now, so that as long
> > as WRITE_ONCE() even a race is benign (because the rmap walker will either
> > see a complete old anon_vma that includes the parent process's anon_vma, or
> > the child's).  What really protects the anon_vma should really be anon_vma
> > lock.. That can definitely be a separate topic.  I'm not sure whether you'd
> > like to dig this part out, but if you do I'd also be more than happy to
> > know whether my understanding needs correction here.. :)
> > 
> > In general, I still agree with you that if hugetlb CoW path can look closer
> > to do_wp_page then it's great.
> 
> 
> As stated elsewhere, the mapcount check + folio_move_anon_rmap need the
> folio lock.

Could you elaborate on what would go wrong if we do folio_move_anon_rmap()
without the folio lock here?  Just to make sure we're on the same page: we
already have the pgtable lock held, and we have decided to reuse an anonymous
hugetlb page.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 15:46               ` Peter Xu
@ 2025-06-03 17:19                 ` David Hildenbrand
  2025-06-03 19:11                   ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2025-06-03 17:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: Oscar Salvador, Andrew Morton, Muchun Song, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

>> As stated elsewhere, the mapcount check + folio_move_anon_rmap need the
>> folio lock.
> 
> Could you elaborate what would go wrong if we do folio_move_anon_rmap()
> without folio lock here?  Just to make sure we're on the same page: we
> already have pgtable lock held, and we decided to reuse an anonymous
> hugetlb page.

For now we have

VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

right at the beginning of folio_move_anon_rmap().

That dates back to

commit c44b674323f4a2480dbeb65d4b487fa5f06f49e0
Author: Rik van Riel <riel@redhat.com>
Date:   Fri Mar 5 13:42:09 2010 -0800

     rmap: move exclusively owned pages to own anon_vma in do_wp_page()
     
     When the parent process breaks the COW on a page, both the original which
     is mapped at child and the new page which is mapped parent end up in that
     same anon_vma.  Generally this won't be a problem, but for some workloads
     it could preserve the O(N) rmap scanning complexity.
     
     A simple fix is to ensure that, when a page which is mapped child gets
     reused in do_wp_page, because we already are the exclusive owner, the page
     gets moved to our own exclusive child's anon_vma.


My recollection is that the folio lock protects folio->mapping. So relevant rmap walks
that hold the folio lock can assume that folio->mapping and
thereby folio_anon_vma() cannot change.

folio_lock_anon_vma_read() documents something regarding the folio lock protecting the
anon_vma.

I can only speculate that locking the folio is cheaper than locking the relevant anon_vma, and
that rmap code depends on that.


I'll note that in the introducing commit we didn't use the WRITE_ONCE, though. That was added in

commit 16f5e707d6f6f7644ff07e583b8f18c3dcc5499f
Author: Alex Shi <alexs@kernel.org>
Date:   Tue Dec 15 12:33:42 2020 -0800

     mm/rmap: stop store reordering issue on page->mapping

But I don't think that the folio lock was a replacement to that WRITE_ONCE.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 14:57           ` Peter Xu
  2025-06-03 15:08             ` David Hildenbrand
@ 2025-06-03 18:31             ` Peter Xu
  2025-06-10 14:13               ` Oscar Salvador
  1 sibling, 1 reply; 18+ messages in thread
From: Peter Xu @ 2025-06-03 18:31 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 10:57:21AM -0400, Peter Xu wrote:
> On Tue, Jun 03, 2025 at 03:50:54PM +0200, Oscar Salvador wrote:
> > On Mon, Jun 02, 2025 at 05:30:19PM -0400, Peter Xu wrote:
> > > Right, and thanks for the git digging as usual.  I would agree hugetlb is
> > > more challenge than many other modules on git archaeology. :)
> > > 
> > > Even if I mentioned the invalidate_lock, I don't think I thought deeper
> > > than that. I just wished whenever possible we still move hugetlb code
> > > closer to generic code, so if that's the goal we may still want to one day
> > > have a closer look at whether hugetlb can also use invalidate_lock.  Maybe
> > > it isn't worthwhile at last: invalidate_lock is currently a rwsem, which
> > > normally at least allows concurrent fault, but that's currently what isn't
> > > allowed in hugetlb anyway..
> > > 
> > > If we start to remove finer grained locks that work will be even harder,
> > > and removing folio lock in this case in fault path also brings hugetlbfs
> > > even further from other file systems.  That might be slightly against what
> > > we used to wish to do, which is to make it closer to others.  Meanwhile I'm
> > > also not yet sure the benefit of not taking folio lock all across, e.g. I
> > > don't expect perf would change at all even if lock is avoided.  We may want
> > > to think about that too when doing so.
> > 
> > Ok, I have to confess I was not looking things from this perspective,
> > but when doing so, yes, you are right, we should strive to find
> > replacements wherever we can for not using hugetlb-specific code.
> > 
> > I do not know about this case though, not sure what other options do we
> > have when trying to shut concurrent faults while doing other operation.
> > But it is something we should definitely look at.
> > 
> > Wrt. to the lock.
> > There were two locks, old_folio (taken in hugetlb_fault) and
> > pagecache_folio one.
> 
> There're actually three places this patch touched, the 3rd one is
> hugetlb_no_page(), in which case I also think we should lock it, not only
> because file folios normally does it (see do_fault(), for example), but
> also that's exactly what James mentioned I believe on possible race of
> !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines:
> 
> 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
>                 ...
> 		folio_zero_user(folio, vmf->real_address);
> 		__folio_mark_uptodate(folio);

I think I was wrong here, at least partly.  So what this patch changed is
only the lookup in the no_page path, hence what I said here doesn't apply.
This patch also mentioned in the commit message why James's concern was
ok - the fault mutex was held.  Yes, I agree. Actually, even without the
fault mutex, the folio is only injected into the page cache after being
marked uptodate, so it looks fine even without the mutex.

Though it's still true there are three paths to be discussed, which should
include no_page, and it still needs to be discussed whenever any of us wants
to remove the folio lock even in the lookup path.

For example, I'm not sure whether it's always thread-safe to do
folio_test_hwpoison() without it, even with the fault mutex.

Sorry for the confusion, Oscar.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 17:19                 ` David Hildenbrand
@ 2025-06-03 19:11                   ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2025-06-03 19:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, Andrew Morton, Muchun Song, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 07:19:13PM +0200, David Hildenbrand wrote:
> > > As stated elsewhere, the mapcount check + folio_move_anon_rmap need the
> > > folio lock.
> > 
> > Could you elaborate what would go wrong if we do folio_move_anon_rmap()
> > without folio lock here?  Just to make sure we're on the same page: we
> > already have pgtable lock held, and we decided to reuse an anonymous
> > hugetlb page.
> 
> For now we have
> 
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> 
> right at the beginning of folio_move_anon_rmap().
> 
> That dates back to
> 
> commit c44b674323f4a2480dbeb65d4b487fa5f06f49e0
> Author: Rik van Riel <riel@redhat.com>
> Date:   Fri Mar 5 13:42:09 2010 -0800
> 
>     rmap: move exclusively owned pages to own anon_vma in do_wp_page()
>     When the parent process breaks the COW on a page, both the original which
>     is mapped at child and the new page which is mapped parent end up in that
>     same anon_vma.  Generally this won't be a problem, but for some workloads
>     it could preserve the O(N) rmap scanning complexity.
>     A simple fix is to ensure that, when a page which is mapped child gets
>     reused in do_wp_page, because we already are the exclusive owner, the page
>     gets moved to our own exclusive child's anon_vma.
> 
> 
> My recollection is that the folio lock protects folio->mapping. So relevant rmap walks

Yes, I had a similar impression, but only for file folios, as per the comment
in rmap_walk_file().  For anonymous folios, it was never clear to me, as at
least the anon rmap walk doesn't seem to need the folio lock in some special
paths like damon/page_idle/folio_referenced.

[1]

> that hold the folio lock can assume that folio->mapping and
> thereby folio_anon_vma() cannot change.
> 
> folio_lock_anon_vma_read() documents something regarding the folio lock protecting the
> anon_vma.

Right, I remember that change, though I was expecting the comment to be
referring to the assert(locked) above.  Unfortunately we don't have more
clues about the folio lock, even though the change itself makes perfect
sense regardless, to double-check that the anon_vma cannot change (after
UFFDIO_MOVE).

> 
> I can only speculate that locking the folio is cheaper than locking the relevant anon_vma, and
> that rmap code depends on that.

I see this as two separate things to protect: folio->mapping, and the
anon_vma tree inside of it.

For now it looks like we're "almost" using the folio lock to protect
folio->mapping for anon folios; however, we can still read folio->mapping
without the folio lock, as discussed above [1].  The commit below should be
another sign of that, where Alex mentioned the WRITE_ONCE being needed for
page_idle.

IOW, even with the folio lock held, something can be reading folio->mapping
and further walking the anon_vma.

I had a long-standing feeling that it works out only because anon_vma
updates can only happen within parent/child processes, so they're
internally holding the same anon_vma lock (anon_vma_trylock_read(), which
takes the root lock).  Hence even if a race happened, it would still take
the same lock.
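
(That is, readers end up on the root's rwsem anyway; from include/linux/rmap.h,
roughly:

	static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
	{
		down_read(&anon_vma->root->rwsem);
	}

so parent and child walkers serialize on the same lock even if they start
from different anon_vma pointers.)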

I think it means that, as long as we have decided to reuse an anon page
(hugetlb or not, while holding the pgtable lock), updating folio->mapping
with or without the lock should keep working.  But maybe I missed something.
And it may not be extremely important so far either; taking the lock
doesn't seem to be bad anyway.  It's only some confusion I never figured
out myself.

Thanks,

> 
> 
> I'll note that in the introducing commit we didn't use the WRITE_ONCE, though. That was added in
> 
> commit 16f5e707d6f6f7644ff07e583b8f18c3dcc5499f
> Author: Alex Shi <alexs@kernel.org>
> Date:   Tue Dec 15 12:33:42 2020 -0800
> 
>     mm/rmap: stop store reordering issue on page->mapping
> 
> But I don't think that the folio lock was a replacement to that WRITE_ONCE.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-03 18:31             ` Peter Xu
@ 2025-06-10 14:13               ` Oscar Salvador
  2025-06-10 15:57                 ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2025-06-10 14:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 03, 2025 at 02:31:15PM -0400, Peter Xu wrote:
> > There're actually three places this patch touched, the 3rd one is
> > hugetlb_no_page(), in which case I also think we should lock it, not only
> > because file folios normally does it (see do_fault(), for example), but
> > also that's exactly what James mentioned I believe on possible race of
> > !uptodate hugetlb folio being injected by UFFDIO_CONTINUE, along the lines:
> > 
> > 		folio = alloc_hugetlb_folio(vma, vmf->address, false);
> >                 ...
> > 		folio_zero_user(folio, vmf->real_address);
> > 		__folio_mark_uptodate(folio);
> 
> I think I was wrong here at least partly..  So what this patch changed is
> only the lookup of the no_page path, hence what I said here doesn't apply.
> This patch also mentioned in the commit message on why James's concern was
> ok - the fault mutex was held.  Yes I agree. Actually even without fault
> mutex, the folio is only injected into page cache after mark uptodate.. so
> it looks fine even without the mutex.
> 
> Though it's still true there're three paths to be discussed, which should
> include no_page, and it's still needed to be discussed when any of us like
> to remove folio lock even in the lookup path.
> 
> For example, I'm not sure whether it's always thread safe to do
> folio_test_hwpoison() when without it, even with the fault mutex.
> 
> Sorry for the confusion, Oscar.

So, I'm having another go at this.
I started with replacing the check for cow-on-private-mappings [1].

I still have to figure out how to approach the other locks, though.
The thing is, we only need the lock in two cases:

1) The folio is the original folio in the pagecache: we need to lock it to
   copy it over to another page (do_fault->do_cow_fault locks the folio
   via filemap_fault).
   Although, truth be told, we even lock the folio on a read-fault in
   do_read_fault. Maybe that is just because __do_fault() ends up
   calling filemap_fault (so it doesn't differentiate?).
   But that is not a problem, just a note.
   So it is ok for hugetlb_no_page() to also take the lock when
   read-faulting.
   We need to take the lock here.

2) The folio is anonymous: we need to take the lock when we want to check
   whether we can re-use the folio in hugetlb_wp.
   The normal faulting path does it in wp_can_reuse_anon_folio().

Now, the scope for them is different.
The normal faulting path only locks the anonymous folio when calling
wp_can_reuse_anon_folio() (so it isn't held during the copy), while we
lock folios from the pagecache in filemap_fault() and keep the lock until
we have either a) mapped it into our pagetables (read-fault) or b)
copied it over to another page.

Representing this difference in timespan in hugetlb is rather tricky as
is.
Maybe it could be done if we refactored hugetlb_{fault,no_page,wp} to look
more like the non-hugetlb faulting path.
E.g.: hugetlb_fault could directly call hugetlb_wp if it sees that
FAULT_FLAG_WRITE is on, instead of doing a first pass through
hugetlb_no_page, and leave hugetlb_no_page for RO stuff only.
I have not put much thought into it; it is just an idea based on how
do_fault() handles it.

(In an ideal world hugetlb could be re-routed to the generic faulting path,
but we would need to somehow sort out the need to allocate folios,
reserves, etc., so that is still in the far future.)

Anyway, as I said, representing this difference in timespan between file
and anonymous folios in hugetlb is tricky, so I think the current state of
the locks, with [1] applied, is fine.

hugetlb_fault:
               - locks the folio mapped by the original pte
hugetlb_no_page:
               - locks the folio found in the pagecache
               - or allocates a new folio
                 - and inserts it into the pagecache and locks it (shared)
                 - or just locks it (anonymous)

So, hugetlb_wp() already gets the folio locked and does not have to
bother about anything else.

Of course, while I think that the locks can be kept as they are, I plan to
document them properly so the next poor soul that comes along doesn't
have to spend time trying to figure out why we do what we do.

Another upside of keeping the locks is that, in the future, we can
potentially rely less on the fault mutex.


[1] https://github.com/leberus/linux/commit/b8e10b854fc3e427fd69d61d065798c7f31ebd55

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
  2025-06-10 14:13               ` Oscar Salvador
@ 2025-06-10 15:57                 ` Peter Xu
  0 siblings, 0 replies; 18+ messages in thread
From: Peter Xu @ 2025-06-10 15:57 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Andrew Morton, Muchun Song, David Hildenbrand, James Houghton,
	Gavin Guo, linux-mm, linux-kernel

On Tue, Jun 10, 2025 at 04:13:45PM +0200, Oscar Salvador wrote:
> [1] https://github.com/leberus/linux/commit/b8e10b854fc3e427fd69d61d065798c7f31ebd55

Two trivial comments on the change:

(1) Maybe you still want to mention the trylock -> lock change for the
folio in the commit log. It looks like the trylock was only needed when we
had to lock two folios, to avoid a deadlock (even though I don't know
whether that could happen with any real reproducer).

(2) You can also drop the need_wait_lock variable now.

The change looks all good otherwise to me, thanks!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [RFC PATCH 0/3] Clean up locking in hugetlb faulting code
  2025-06-02 14:16 [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Oscar Salvador
                   ` (2 preceding siblings ...)
  2025-06-02 14:16 ` [RFC PATCH 3/3] mm, hugetlb: Drop unlikelys from hugetlb_fault Oscar Salvador
@ 2025-06-16  3:21 ` Gavin Guo
  3 siblings, 0 replies; 18+ messages in thread
From: Gavin Guo @ 2025-06-16  3:21 UTC (permalink / raw)
  To: Oscar Salvador, Andrew Morton
  Cc: Muchun Song, David Hildenbrand, James Houghton, Peter Xu,
	linux-mm, linux-kernel

Hi Oscar,

On 6/2/25 22:16, Oscar Salvador wrote:
> Hi all,
> 
> This RFC is the culmination of the discussion that happened in [1].
> TLDR: No one really knew what the locks were protecting us against, and
> whether we needed them at all.
> 
> Some reasearch showed that most of them were introduced in a time were
> truncation was not serialized with the mutex, as it is today, so we were
> relying on the lock for the page to not go away from the pagecache.
> More details can be find in patch#1.
> 
> This is for the locks, but I also started to look at the references
> we take in hugetlb_fault and hugetlb_wp as it seems to me we are taking
> more than actually needed, but that is once we manage to sort this out.
> 
> I ran hugetlb LTP tests and nothing screamed, and I also plan to run selftests
> later on.
> 
> @Galvin. Could you please run your syzkaller with this patchset applied and
> see whether you can trigger something?

Sorry for the late response. My capacity has been limited over the last two
weeks as I was attending an event, and I did not notice the discussion. It
seems there has already been a lot of discussion and good progress. I see
the discussion has now moved to the latest thread:
https://lore.kernel.org/linux-mm/20250612134701.377855-1-osalvador@suse.de/

Please let me know if the testing is still useful.

> 
> Special thanks to David and Peter Xu that were helping out with this mess.
> 
> [1] https://lore.kernel.org/linux-mm/aDeBUXCRLRZobHq0@localhost.localdomain/T/#md02880ebc2c679678b7f326c5e9e93992428e124
> 
> Oscar Salvador (3):
>    mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp
>    mm, hugetlb: Update comments in hugetlb_fault
>    mm, hugetlb: Drop unlikelys from hugetlb_fault
> 
>   include/linux/hugetlb.h |  12 +++++
>   mm/hugetlb.c            | 117 +++++++++++++++++-----------------------
>   2 files changed, 62 insertions(+), 67 deletions(-)




^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2025-06-16  3:21 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-02 14:16 [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Oscar Salvador
2025-06-02 14:16 ` [RFC PATCH 1/3] mm, hugetlb: Clean up locking in hugetlb_fault and hugetlb_wp Oscar Salvador
2025-06-02 15:14   ` Peter Xu
2025-06-02 20:47     ` Oscar Salvador
2025-06-02 21:30       ` Peter Xu
2025-06-03 13:50         ` Oscar Salvador
2025-06-03 14:57           ` Peter Xu
2025-06-03 15:08             ` David Hildenbrand
2025-06-03 15:46               ` Peter Xu
2025-06-03 17:19                 ` David Hildenbrand
2025-06-03 19:11                   ` Peter Xu
2025-06-03 18:31             ` Peter Xu
2025-06-10 14:13               ` Oscar Salvador
2025-06-10 15:57                 ` Peter Xu
2025-06-03 15:12           ` David Hildenbrand
2025-06-02 14:16 ` [RFC PATCH 2/3] mm, hugetlb: Update comments in hugetlb_fault Oscar Salvador
2025-06-02 14:16 ` [RFC PATCH 3/3] mm, hugetlb: Drop unlikelys from hugetlb_fault Oscar Salvador
2025-06-16  3:21 ` [RFC PATCH 0/3] Clean up locking in hugetlb faulting code Gavin Guo
