* [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)
@ 2025-12-05 21:35 David Hildenbrand (Red Hat)
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, linux-mm, David Hildenbrand (Red Hat), Will Deacon,
	Aneesh Kumar K.V, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Muchun Song, Oscar Salvador, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa,
	Nadav Amit

One functional fix, one performance regression fix, and two related
comment fixes.

I cleaned up my prototype I recently shared [1] for the performance fix,
deferring most of the cleanups I had in the prototype to a later point.
While doing that I identified the other things.

The goal of this patch set is to be backported to stable trees "fairly"
easily. At least patch #1 and #4.

Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing.
Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
Patch #4 is a fix for the reported performance regression due to excessive
IPI broadcasts during fork()+exit().

The last patch is all about TLB flushes, IPIs and mmu_gather.
Read: complicated

I added as many comments and as much description as I possibly could,
and I am hoping for review from Jann.

There are plenty of cleanups in the future to be had + one reasonable
optimization on x86. But that's all out of scope for this series.

Compile tested on plenty of architectures.

Runtime tested, with a focus on fixing the performance regression using
the original reproducer [2] on x86.

I'm still busy with more testing (making sure that my TLB flushing changes
are good), but sending this out already so people can test and review
while I am soon heading for LPC.

[1] https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/
[2] https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/

Cc: Will Deacon <will@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Harry Yoo <harry.yoo@oracle.com>
Cc: "Uschakow, Stanislav" <suschako@amazon.de>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>

David Hildenbrand (Red Hat) (4):
  mm/hugetlb: fix hugetlb_pmd_shared()
  mm/hugetlb: fix two comments related to huge_pmd_unshare()
  mm/rmap: fix two comments related to huge_pmd_unshare()
  mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables
    using mmu_gather

 include/asm-generic/tlb.h |  69 +++++++++++++++++++-
 include/linux/hugetlb.h   |  21 ++++---
 mm/hugetlb.c              | 129 ++++++++++++++++++++------------------
 mm/mmu_gather.c           |   6 ++
 mm/mprotect.c             |   2 +-
 mm/rmap.c                 |  45 +++++++------
 6 files changed, 178 insertions(+), 94 deletions(-)

-- 
2.52.0

* [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
@ 2025-12-05 21:35 ` David Hildenbrand (Red Hat)
  2025-12-06  2:18   ` Rik van Riel
                     ` (5 more replies)
  2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
                   ` (3 subsequent siblings)
  4 siblings, 6 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, linux-mm, David Hildenbrand (Red Hat), Will Deacon,
	Aneesh Kumar K.V, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Muchun Song, Oscar Salvador, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa,
	Nadav Amit, stable, Liu Shixin

We switched from (wrongly) using the page count to an independent
shared count. Now, shared page tables have a refcount of 1 (excluding
speculative references) and instead use ptdesc->pt_share_count to
identify sharing.

We didn't convert hugetlb_pmd_shared(), so right now, we would never
detect a shared PMD table as such, because sharing/unsharing no longer
touches the refcount of a PMD table.

Page migration, e.g. via mbind() or migrate_pages(), would allow for
migrating folios mapped into such shared PMD tables, even though the
folios are not exclusive. In smaps we would account them as "private"
although they are "shared", and we would be wrongly setting
PM_MMAP_EXCLUSIVE in the pagemap interface.

Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
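
To illustrate why the old check can no longer fire, here is a minimal
user-space sketch. It is not kernel code: struct pt_model and the
pt_model_*() helpers are hypothetical stand-ins, and the sketch assumes
that the share count counts additional sharers and that "shared" means
a non-zero share count.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct ptdesc. */
struct pt_model {
	int refcount;       /* stays at 1 for a page table, shared or not */
	int pt_share_count; /* assumed: number of additional sharers */
};

/* Sharing and unsharing only touch the share count, never the refcount. */
static void pt_model_share(struct pt_model *pt)
{
	pt->pt_share_count++;
}

static void pt_model_unshare(struct pt_model *pt)
{
	pt->pt_share_count--;
}

/* Old-style check: looks at the refcount, which sharing no longer changes. */
static bool pt_model_shared_by_refcount(const struct pt_model *pt)
{
	return pt->refcount > 1;
}

/* New-style check: looks at the independent share count. */
static bool pt_model_shared_by_share_count(const struct pt_model *pt)
{
	return pt->pt_share_count > 0;
}
```

In this model, after one pt_model_share() call the refcount-based check
still reports "not shared", while the share-count-based check reports
"shared".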

Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Cc: <stable@vger.kernel.org>
Cc: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 include/linux/hugetlb.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 019a1c5281e4e..03c8725efa289 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
 #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
 static inline bool hugetlb_pmd_shared(pte_t *pte)
 {
-	return page_count(virt_to_page(pte)) > 1;
+	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
 }
 #else
 static inline bool hugetlb_pmd_shared(pte_t *pte)
-- 
2.52.0



* [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
@ 2025-12-05 21:35 ` David Hildenbrand (Red Hat)
  2025-12-06  2:26   ` Rik van Riel
                     ` (2 more replies)
  2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
                   ` (2 subsequent siblings)
  4 siblings, 3 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, linux-mm, David Hildenbrand (Red Hat), Will Deacon,
	Aneesh Kumar K.V, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Muchun Song, Oscar Salvador, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa,
	Nadav Amit, Liu Shixin

Ever since we stopped using the page count to detect shared PMD
page tables, these comments are outdated.

The only reason we have to flush the TLB early is because once we drop
the i_mmap_rwsem, the previously shared page table could get freed (to
then get reallocated and used for other purposes). So we really have to
flush the TLB before that could happen.

So let's simplify the comments a bit.

The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
correctly after huge_pmd_unshare") was confusing: sure it is recorded
in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
anything. So let's drop that comment while at it as well.

We'll centralize these comments in a single helper as we rework the code
next.

Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Cc: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 mm/hugetlb.c | 24 ++++++++----------------
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 51273baec9e5d..3c77cdef12a32 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	tlb_end_vma(tlb, vma);
 
 	/*
-	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
-	 * could defer the flush until now, since by holding i_mmap_rwsem we
-	 * guaranteed that the last reference would not be dropped. But we must
-	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
-	 * dropped and the last reference to the shared PMDs page might be
-	 * dropped as well.
-	 *
-	 * In theory we could defer the freeing of the PMD pages as well, but
-	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
-	 * detect sharing, so we cannot defer the release of the page either.
-	 * Instead, do flush now.
+	 * There is nothing protecting a previously-shared page table that we
+	 * unshared through huge_pmd_unshare() from getting freed after we
+	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
+	 * succeeded, flush the range corresponding to the pud.
 	 */
 	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
@@ -6536,11 +6529,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 		cond_resched();
 	}
 	/*
-	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
-	 * may have cleared our pud entry and done put_page on the page table:
-	 * once we release i_mmap_rwsem, another task can do the final put_page
-	 * and that page table be reused and filled with junk.  If we actually
-	 * did unshare a page of pmds, flush the range corresponding to the pud.
+	 * There is nothing protecting a previously-shared page table that we
+	 * unshared through huge_pmd_unshare() from getting freed after we
+	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
+	 * succeeded, flush the range corresponding to the pud.
 	 */
 	if (shared_pmd)
 		flush_hugetlb_tlb_range(vma, range.start, range.end);
-- 
2.52.0



* [PATCH v1 3/4] mm/rmap: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
  2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
@ 2025-12-05 21:35 ` David Hildenbrand (Red Hat)
  2025-12-06  2:50   ` Rik van Riel
                     ` (2 more replies)
  2025-12-05 21:35 ` [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
  2025-12-06 19:53 ` [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) Laurence Oberman
  4 siblings, 3 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, linux-mm, David Hildenbrand (Red Hat), Will Deacon,
	Aneesh Kumar K.V, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Muchun Song, Oscar Salvador, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa,
	Nadav Amit, Liu Shixin

PMD page table unsharing no longer touches the refcount of a PMD page
table. Also, it is not about dropping the refcount of a "PMD page" but
the "PMD page table".

Let's just simplify by saying that the PMD page table was unmapped,
consequently also unmapping the folio that was mapped into this page
table.

This code should be deduplicated in the future.

Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
Cc: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 mm/rmap.c | 20 ++++----------------
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index f955f02d570ed..748f48727a162 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2016,14 +2016,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					flush_tlb_range(vma,
 						range.start, range.end);
 					/*
-					 * The ref count of the PMD page was
-					 * dropped which is part of the way map
-					 * counting is done for shared PMDs.
-					 * Return 'true' here.  When there is
-					 * no other sharing, huge_pmd_unshare
-					 * returns false and we will unmap the
-					 * actual page and drop map count
-					 * to zero.
+					 * The PMD table was unmapped,
+					 * consequently unmapping the folio.
 					 */
 					goto walk_done;
 				}
@@ -2416,14 +2410,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 						range.start, range.end);
 
 					/*
-					 * The ref count of the PMD page was
-					 * dropped which is part of the way map
-					 * counting is done for shared PMDs.
-					 * Return 'true' here.  When there is
-					 * no other sharing, huge_pmd_unshare
-					 * returns false and we will unmap the
-					 * actual page and drop map count
-					 * to zero.
+					 * The PMD table was unmapped,
+					 * consequently unmapping the folio.
 					 */
 					page_vma_mapped_walk_done(&pvmw);
 					break;
-- 
2.52.0



* [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
                   ` (2 preceding siblings ...)
  2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
@ 2025-12-05 21:35 ` David Hildenbrand (Red Hat)
  2025-12-07 12:15   ` Nadav Amit
  2025-12-10 15:06   ` Lorenzo Stoakes
  2025-12-06 19:53 ` [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) Laurence Oberman
  4 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-05 21:35 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-arch, linux-mm, David Hildenbrand (Red Hat), Will Deacon,
	Aneesh Kumar K.V, Andrew Morton, Nick Piggin, Peter Zijlstra,
	Arnd Bergmann, Muchun Song, Oscar Salvador, Liam R. Howlett,
	Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pedro Falcato,
	Rik van Riel, Harry Yoo, Laurence Oberman, Prakash Sangappa,
	Nadav Amit, stable

As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
where we perform so many IPI broadcasts when unsharing hugetlb PMD page
tables that it severely regresses some workloads.

In particular, when we fork()+exit(), or when we munmap() a large
area backed by many shared PMD tables, we perform one IPI broadcast per
unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during
    exit(), it is sufficient to send a single IPI broadcast (as long as
    we respect locking rules) instead of one per PMD table.

    Locking prevents any of these PMD tables from getting reused before
    we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is
    no need to send the IPI broadcast. The shared PMD tables cannot
    become exclusive (fully unshared) before an IPI is broadcast by
    the last sharer.

    Concurrent GUP-fast could walk into a PMD table just before we
    unshared it. It could then succeed in grabbing a page from the
    shared page table even after munmap() etc. succeeded (and suppressed
    an IPI). But there is no difference compared to GUP-fast just
    sleeping for a while after grabbing the page and re-enabling IRQs.

    Most importantly, GUP-fast will never walk into page tables that are
    no longer shared, because the last sharer will issue an IPI
    broadcast.

    (if ever required, checking whether the PUD changed in GUP-fast
     after grabbing the page like we do in the PTE case could handle
     this)
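
The effect of both optimizations can be sketched with a small
user-space model. This is only an illustration under stated
assumptions: struct gather_model, struct table_model and the
gather_model_*() helpers are hypothetical, the counters stand in for
real TLB flushes and IPI broadcasts, and share_count is assumed to
count additional sharers.

```c
#include <stdbool.h>

/* Hypothetical model of one shared PMD table. */
struct table_model {
	int share_count; /* assumed: number of additional sharers */
};

/* Hypothetical model of the two new mmu_gather bits plus counters. */
struct gather_model {
	bool unshared_tables;
	bool fully_unshared_tables;
	int tlb_flushes;    /* how many TLB flushes were issued */
	int ipi_broadcasts; /* how many IPI broadcasts were issued */
};

/* Record one unshare operation; nothing is flushed or sent yet. */
static void gather_model_unshare(struct gather_model *tlb,
				 struct table_model *pt)
{
	pt->share_count--;
	/* Only the last sharer turns the table exclusive again. */
	if (pt->share_count == 0)
		tlb->fully_unshared_tables = true;
	tlb->unshared_tables = true;
}

/* One flush, and at most one IPI broadcast, for the whole batch. */
static void gather_model_flush_unshared(struct gather_model *tlb)
{
	if (tlb->unshared_tables) {
		tlb->tlb_flushes++;
		tlb->unshared_tables = false;
	}
	if (tlb->fully_unshared_tables) {
		tlb->ipi_broadcasts++;
		tlb->fully_unshared_tables = false;
	}
}
```

In this model, unsharing many tables followed by a single
gather_model_flush_unshared() call results in one flush, and in one
IPI broadcast only if at least one table became exclusive.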

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
infrastructure so we can implement these optimizations and demystify the
code at least a bit. Extend the mmu_gather infrastructure to be able to
deal with our special hugetlb PMD table sharing implementation.

We'll consolidate the handling for (full) unsharing of PMD tables in
tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be
prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
require an explicit earlier call to tlb_flush_unshared_tables().

From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
that the expected lock protecting us from concurrent unsharing+reuse is
still held.

Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
tlb_flush_unshared_tables() was properly called earlier.
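
The ordering contract described above can be modelled roughly as
follows. This is a user-space sketch only: struct order_model and the
order_model_*() helpers are hypothetical, the lock flag stands in for
the i_mmap_rwsem held in write mode, and the warnings counter stands in
for the assertion and VM_WARN_ON_ONCE() checks.

```c
#include <stdbool.h>

/* Hypothetical model of the caller-visible state. */
struct order_model {
	bool lock_held;       /* stands in for i_mmap_rwsem (write mode) */
	bool pending_unshare; /* a table was unshared, flush still missing */
	int warnings;         /* counts would-be warning hits */
};

static void order_model_lock(struct order_model *c)
{
	c->lock_held = true;
}

static void order_model_unlock(struct order_model *c)
{
	c->lock_held = false;
}

/* Models huge_pmd_unshare(): only valid with the lock held. */
static void order_model_unshare(struct order_model *c)
{
	if (!c->lock_held)
		c->warnings++;
	c->pending_unshare = true;
}

/* Models huge_pmd_unshare_flush(): must run before the lock is dropped. */
static void order_model_unshare_flush(struct order_model *c)
{
	if (!c->lock_held)
		c->warnings++;
	c->pending_unshare = false;
}

/* Models the check in tlb_finish_mmu(): the flush must not be deferred. */
static void order_model_finish(struct order_model *c)
{
	if (c->pending_unshare)
		c->warnings++;
	c->pending_unshare = false;
}
```

The expected sequence in this model is lock, unshare (possibly many
times), unshare_flush, unlock, finish. Skipping the flush, or calling
it after the unlock, bumps the warnings counter.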

Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing:

There are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without
    CONFIG_MMU_GATHER_RCU_TABLE_FREE.

    Here, the assumption is that the previous TLB flush would send an
    IPI to all relevant CPUs. Careful: some architectures like x86 only
    send IPIs to all relevant CPUs when tlb->freed_tables is set.

    The relevant architectures should be selecting
    MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
    kernels and it might have been problematic before this patch.

    Also, the arch flushing behavior (independent of IPIs) is different
    when tlb->freed_tables is set. Do we have to enlighten them to also
    take care of tlb->unshared_tables? So far we didn't care, so
    hopefully we are fine. Of course, we could be setting
    tlb->freed_tables as well, but that might then unnecessarily flush
    too much, because the semantics of tlb->freed_tables are a bit
    fuzzy.

    This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with
    CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.

    Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
    we still issue IPIs during TLB flushes and don't actually need the
    second tlb_remove_table_sync_one().

    This optimization can be implemented on top of this, e.g., by checking in
    tlb_remove_table_sync_one() whether we really need IPIs. But as
    described in (1), it really must honor tlb->freed_tables then to
    send IPIs to all relevant CPUs.

Further note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
concern, as we are holding the i_mmap_lock the whole time, preventing
concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until
this is fixed.

Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Cc: <stable@vger.kernel.org>
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
---
 include/asm-generic/tlb.h |  69 +++++++++++++++++++++-
 include/linux/hugetlb.h   |  19 +++---
 mm/hugetlb.c              | 121 ++++++++++++++++++++++----------------
 mm/mmu_gather.c           |   6 ++
 mm/mprotect.c             |   2 +-
 mm/rmap.c                 |  25 +++++---
 6 files changed, 173 insertions(+), 69 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 1fff717cae510..324a21f53b644 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -364,6 +364,17 @@ struct mmu_gather {
 	unsigned int		vma_huge : 1;
 	unsigned int		vma_pfn  : 1;
 
+	/*
+	 * Did we unshare (unmap) any shared page tables?
+	 */
+	unsigned int		unshared_tables : 1;
+
+	/*
+	 * Did we unshare any page tables such that they are now exclusive
+	 * and could get reused+modified by the new owner?
+	 */
+	unsigned int		fully_unshared_tables : 1;
+
 	unsigned int		batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	tlb->cleared_pmds = 0;
 	tlb->cleared_puds = 0;
 	tlb->cleared_p4ds = 0;
+	tlb->unshared_tables = 0;
 	/*
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
@@ -484,7 +496,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	 * these bits.
 	 */
 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
-	      tlb->cleared_puds || tlb->cleared_p4ds))
+	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
 		return;
 
 	tlb_flush(tlb);
@@ -773,6 +785,61 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
+					  unsigned long addr)
+{
+	/*
+	 * The caller must make sure that concurrent unsharing + exclusive
+	 * reuse is impossible until tlb_flush_unshared_tables() was called.
+	 */
+	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
+	ptdesc_pmd_pts_dec(pt);
+
+	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
+	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
+
+	/*
+	 * If the page table is now exclusively owned, we fully unshared
+	 * a page table.
+	 */
+	if (!ptdesc_pmd_is_shared(pt))
+		tlb->fully_unshared_tables = true;
+	tlb->unshared_tables = true;
+}
+
+static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
+{
+	/*
+	 * As soon as the caller drops locks to allow for reuse of
+	 * previously-shared tables, these tables could get modified and
+	 * even reused outside of hugetlb context. So flush the TLB now.
+	 *
+	 * Note that we cannot defer the flush to a later point even if we are
+	 * not the last sharer of the page table.
+	 */
+	if (tlb->unshared_tables)
+		tlb_flush_mmu_tlbonly(tlb);
+
+	/*
+	 * Similarly, we must make sure that concurrent GUP-fast will not
+	 * walk previously-shared page tables that are getting modified+reused
+	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+	 *
+	 * We only perform this when we are the last sharer of a page table,
+	 * as the IPI will reach all CPUs: any GUP-fast.
+	 *
+	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
+	 * required IPIs already for us.
+	 */
+	if (tlb->fully_unshared_tables) {
+		tlb_remove_table_sync_one();
+		tlb->fully_unshared_tables = false;
+	}
+}
+#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 03c8725efa289..63b248c6bfd47 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep);
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep);
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end);
 
@@ -271,7 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
 int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
 void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
 void hugetlb_vma_lock_release(struct kref *kref);
-long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
@@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
 	return NULL;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm,
-					struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+static inline int huge_pmd_unshare(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
 	return 0;
 }
 
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
+		struct vm_area_struct *vma)
+{
+}
+
 static inline void adjust_range_if_pmd_sharing_possible(
 				struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
@@ -432,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
 {
 }
 
-static inline long hugetlb_change_protection(
+static inline long hugetlb_change_protection(struct mmu_gather *tlb,
 			struct vm_area_struct *vma, unsigned long address,
 			unsigned long end, pgprot_t newprot,
 			unsigned long cp_flags)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3c77cdef12a32..3db94693a06fc 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5096,8 +5096,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 	unsigned long last_addr_mask;
 	pte_t *src_pte, *dst_pte;
 	struct mmu_notifier_range range;
-	bool shared_pmd = false;
+	struct mmu_gather tlb;
 
+	tlb_gather_mmu(&tlb, vma->vm_mm);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
 				old_end);
 	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
@@ -5122,12 +5123,12 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
 			continue;
 
-		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
-			shared_pmd = true;
+		if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
 			old_addr |= last_addr_mask;
 			new_addr |= last_addr_mask;
 			continue;
 		}
+		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
 
 		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
 		if (!dst_pte)
@@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
 		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
 	}
 
-	if (shared_pmd)
-		flush_hugetlb_tlb_range(vma, range.start, range.end);
-	else
-		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
+	tlb_flush_mmu_tlbonly(&tlb);
+	huge_pmd_unshare_flush(&tlb, vma);
+
 	mmu_notifier_invalidate_range_end(&range);
 	i_mmap_unlock_write(mapping);
 	hugetlb_vma_unlock_write(vma);
+	tlb_finish_mmu(&tlb);
 
 	return len + old_addr - old_end;
 }
@@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	bool adjust_reservation;
 	unsigned long last_addr_mask;
-	bool force_flush = false;
 
 	WARN_ON(!is_vm_hugetlb_page(vma));
 	BUG_ON(start & ~huge_page_mask(h));
@@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		}
 
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
 			spin_unlock(ptl);
-			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
-			force_flush = true;
 			address |= last_addr_mask;
 			continue;
 		}
@@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	}
 	tlb_end_vma(tlb, vma);
 
-	/*
-	 * There is nothing protecting a previously-shared page table that we
-	 * unshared through huge_pmd_unshare() from getting freed after we
-	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
-	 * succeeded, flush the range corresponding to the pud.
-	 */
-	if (force_flush)
-		tlb_flush_mmu_tlbonly(tlb);
+	huge_pmd_unshare_flush(tlb, vma);
 }
 
 void __hugetlb_zap_begin(struct vm_area_struct *vma,
@@ -6399,7 +6390,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-long hugetlb_change_protection(struct vm_area_struct *vma,
+long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		unsigned long address, unsigned long end,
 		pgprot_t newprot, unsigned long cp_flags)
 {
@@ -6409,7 +6400,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	pte_t pte;
 	struct hstate *h = hstate_vma(vma);
 	long pages = 0, psize = huge_page_size(h);
-	bool shared_pmd = false;
 	struct mmu_notifier_range range;
 	unsigned long last_addr_mask;
 	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
@@ -6452,7 +6442,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			}
 		}
 		ptl = huge_pte_lock(h, mm, ptep);
-		if (huge_pmd_unshare(mm, vma, address, ptep)) {
+		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
 			/*
 			 * When uffd-wp is enabled on the vma, unshare
 			 * shouldn't happen at all.  Warn about it if it
@@ -6461,7 +6451,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
 			pages++;
 			spin_unlock(ptl);
-			shared_pmd = true;
 			address |= last_addr_mask;
 			continue;
 		}
@@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 				pte = huge_pte_clear_uffd_wp(pte);
 			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
 			pages++;
+			tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
 		}
 
 next:
 		spin_unlock(ptl);
 		cond_resched();
 	}
-	/*
-	 * There is nothing protecting a previously-shared page table that we
-	 * unshared through huge_pmd_unshare() from getting freed after we
-	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
-	 * succeeded, flush the range corresponding to the pud.
-	 */
-	if (shared_pmd)
-		flush_hugetlb_tlb_range(vma, range.start, range.end);
-	else
-		flush_hugetlb_tlb_range(vma, start, end);
+
+	tlb_flush_mmu_tlbonly(tlb);
+	huge_pmd_unshare_flush(tlb, vma);
 	/*
 	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
 	 * downgrading page table protection not changing it to point to a new
@@ -6904,18 +6887,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	return pte;
 }
 
-/*
- * unmap huge page backed by shared pte.
+/**
+ * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ * @addr: the address we are trying to unshare.
+ * @ptep: pointer into the (pmd) page table.
+ *
+ * Called with the page table lock held, the i_mmap_rwsem held in write mode
+ * and the hugetlb vma lock held in write mode.
  *
- * Called with page table lock held.
+ * Note: The caller must call huge_pmd_unshare_flush() before dropping the
+ * i_mmap_rwsem.
  *
- * returns: 1 successfully unmapped a shared pte page
- *	    0 the underlying pte page is not shared, or it is the last user
+ * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
+ *	    was not a shared PMD table.
  */
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-					unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
 {
 	unsigned long sz = huge_page_size(hstate_vma(vma));
+	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd = pgd_offset(mm, addr);
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
@@ -6927,18 +6919,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
 	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
 	hugetlb_vma_assert_locked(vma);
 	pud_clear(pud);
-	/*
-	 * Once our caller drops the rmap lock, some other process might be
-	 * using this page table as a normal, non-hugetlb page table.
-	 * Wait for pending gup_fast() in other threads to finish before letting
-	 * that happen.
-	 */
-	tlb_remove_table_sync_one();
-	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
+
+	tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
+
 	mm_dec_nr_pmds(mm);
 	return 1;
 }
 
+/*
+ * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
+ * @tlb: the current mmu_gather.
+ * @vma: the vma covering the pmd table.
+ *
+ * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
+ * unsharing with concurrent page table walkers (TLB, GUP-fast, etc.).
+ *
+ * This function must be called after a sequence of huge_pmd_unshare()
+ * calls while still holding the i_mmap_rwsem.
+ */
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+	/*
+	 * We must synchronize page table unsharing such that nobody will
+	 * try reusing a previously-shared page table while it might still
+	 * be in use by previous sharers (TLB, GUP_fast).
+	 */
+	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+
+	tlb_flush_unshared_tables(tlb);
+}
+
 #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
 
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 	return NULL;
 }
 
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-				unsigned long addr, pte_t *ptep)
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep)
 {
 	return 0;
 }
 
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
+{
+}
+
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)
 {
@@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	unsigned long sz = huge_page_size(h);
 	struct mm_struct *mm = vma->vm_mm;
 	struct mmu_notifier_range range;
+	struct mmu_gather tlb;
 	unsigned long address;
 	spinlock_t *ptl;
 	pte_t *ptep;
@@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	if (start >= end)
 		return;
 
+	tlb_gather_mmu(&tlb, mm);
 	flush_cache_range(vma, start, end);
 	/*
 	 * No need to call adjust_range_if_pmd_sharing_possible(), because
@@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 		if (!ptep)
 			continue;
 		ptl = huge_pte_lock(h, mm, ptep);
-		huge_pmd_unshare(mm, vma, address, ptep);
+		huge_pmd_unshare(&tlb, vma, address, ptep);
 		spin_unlock(ptl);
 	}
-	flush_hugetlb_tlb_range(vma, start, end);
+	huge_pmd_unshare_flush(&tlb, vma);
 	if (take_locks) {
 		i_mmap_unlock_write(vma->vm_file->f_mapping);
 		hugetlb_vma_unlock_write(vma);
@@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
 	 * Documentation/mm/mmu_notifier.rst.
 	 */
 	mmu_notifier_invalidate_range_end(&range);
+	tlb_finish_mmu(&tlb);
 }
 
 /*
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 247e3f9db6c7a..822a790127375 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -468,6 +468,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
  */
 void tlb_finish_mmu(struct mmu_gather *tlb)
 {
+	/*
+	 * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
+	 * due to complicated locking requirements with page table unsharing.
+	 */
+	VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
+
 	/*
 	 * If there are parallel threads are doing PTE changes on same range
 	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 283889e4f1cec..5c330e817129e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -652,7 +652,7 @@ long change_protection(struct mmu_gather *tlb,
 #endif
 
 	if (is_vm_hugetlb_page(vma))
-		pages = hugetlb_change_protection(vma, start, end, newprot,
+		pages = hugetlb_change_protection(tlb, vma, start, end, newprot,
 						  cp_flags);
 	else
 		pages = change_protection_range(tlb, vma, start, end, newprot,
diff --git a/mm/rmap.c b/mm/rmap.c
index 748f48727a162..d6799afe11147 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -76,7 +76,7 @@
 #include <linux/mm_inline.h>
 #include <linux/oom.h>
 
-#include <asm/tlbflush.h>
+#include <asm/tlb.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/migrate.h>
@@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * if unsuccessful.
 			 */
 			if (!anon) {
+				struct mmu_gather tlb;
+
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 				if (!hugetlb_vma_trylock_write(vma))
 					goto walk_abort;
-				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+
+				tlb_gather_mmu(&tlb, mm);
+				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
 					hugetlb_vma_unlock_write(vma);
-					flush_tlb_range(vma,
-						range.start, range.end);
+					huge_pmd_unshare_flush(&tlb, vma);
+					tlb_finish_mmu(&tlb);
 					/*
 					 * The PMD table was unmapped,
 					 * consequently unmapping the folio.
@@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 					goto walk_done;
 				}
 				hugetlb_vma_unlock_write(vma);
+				tlb_finish_mmu(&tlb);
 			}
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
 			if (pte_dirty(pteval))
@@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * fail if unsuccessful.
 			 */
 			if (!anon) {
+				struct mmu_gather tlb;
+
 				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
 				if (!hugetlb_vma_trylock_write(vma)) {
 					page_vma_mapped_walk_done(&pvmw);
 					ret = false;
 					break;
 				}
-				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
-					hugetlb_vma_unlock_write(vma);
-					flush_tlb_range(vma,
-						range.start, range.end);
 
+				tlb_gather_mmu(&tlb, mm);
+				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
+					hugetlb_vma_unlock_write(vma);
+					huge_pmd_unshare_flush(&tlb, vma);
+					tlb_finish_mmu(&tlb);
 					/*
 					 * The PMD table was unmapped,
 					 * consequently unmapping the folio.
@@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 					break;
 				}
 				hugetlb_vma_unlock_write(vma);
+				tlb_finish_mmu(&tlb);
 			}
 			/* Nuke the hugetlb page table entry */
 			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
-- 
2.52.0
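
A minimal userspace sketch of the batching idea in this patch, assuming
nothing beyond what the diff shows: the old huge_pmd_unshare() called
tlb_remove_table_sync_one() on every call, while the new one only records
the unshare in the mmu_gather and leaves synchronization to a single
huge_pmd_unshare_flush(). The names toy_gather, toy_unshare_eager(),
toy_unshare_deferred(), toy_unshare_flush() and toy_count_broadcasts() are
hypothetical stand-ins, not kernel APIs, and the real
tlb_flush_unshared_tables() may behave differently in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mmu_gather; not the kernel structure. */
struct toy_gather {
	bool unshared_tables;         /* at least one table was unshared */
	unsigned int sync_broadcasts; /* counts simulated IPI broadcasts */
};

/* Old scheme: synchronize right away on every unshare. */
static void toy_unshare_eager(struct toy_gather *tlb)
{
	tlb->sync_broadcasts++;
}

/* New scheme: only record that a synchronization is owed. */
static void toy_unshare_deferred(struct toy_gather *tlb)
{
	tlb->unshared_tables = true;
}

/* Final flush: one broadcast if anything is pending, none otherwise. */
static void toy_unshare_flush(struct toy_gather *tlb)
{
	if (!tlb->unshared_tables)
		return;
	tlb->sync_broadcasts++;
	tlb->unshared_tables = false;
}

/* Unshare nr_tables tables and return how many broadcasts were needed. */
unsigned int toy_count_broadcasts(size_t nr_tables, bool deferred)
{
	struct toy_gather tlb = { .unshared_tables = false, .sync_broadcasts = 0 };
	size_t i;

	for (i = 0; i < nr_tables; i++) {
		if (deferred)
			toy_unshare_deferred(&tlb);
		else
			toy_unshare_eager(&tlb);
	}
	toy_unshare_flush(&tlb);

	/* Nothing may stay pending once the flush has run. */
	assert(!tlb.unshared_tables);
	return tlb.sync_broadcasts;
}
```

In this model, the eager scheme costs one broadcast per unshared table,
whereas the deferred scheme costs at most one per sequence, which is the
shape of the regression fix described in the cover letter.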




* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
@ 2025-12-06  2:18   ` Rik van Riel
  2025-12-06  5:55   ` Lance Yang
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2025-12-06  2:18 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable,
	Liu Shixin

On Fri, 2025-12-05 at 22:35 +0100, David Hildenbrand (Red Hat) wrote:
> 
> Fix it by properly using ptdesc_pmd_is_shared() in
> hugetlb_pmd_shared().
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared
> count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> 
Reviewed-by: Rik van Riel <riel@surriel.com>


-- 
All Rights Reversed.



* Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
@ 2025-12-06  2:26   ` Rik van Riel
  2025-12-10 11:22   ` Lorenzo Stoakes
  2025-12-11  5:41   ` Oscar Salvador
  2 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2025-12-06  2:26 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, 2025-12-05 at 22:35 +0100, David Hildenbrand (Red Hat) wrote:
> 
> So let's simplify the comments a bit.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared
> count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Reviewed-by: Rik van Riel <riel@surriel.com>

-- 
All Rights Reversed.



* Re: [PATCH v1 3/4] mm/rmap: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
@ 2025-12-06  2:50   ` Rik van Riel
  2025-12-10 11:24   ` Lorenzo Stoakes
  2025-12-11  5:42   ` Oscar Salvador
  2 siblings, 0 replies; 27+ messages in thread
From: Rik van Riel @ 2025-12-06  2:50 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, 2025-12-05 at 22:35 +0100, David Hildenbrand (Red Hat) wrote:
> PMD page table unsharing no longer touches the refcount of a PMD page
> table. Also, it is not about dropping the refcount of a "PMD page"
> but
> the "PMD page table".
> 
> Let's just simplify by saying that the PMD page table was unmapped,
> consequently also unmapping the folio that was mapped into this page.
> 
> This code should be deduplicated in the future.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared
> count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> 
Reviewed-by: Rik van Riel <riel@surriel.com>


-- 
All Rights Reversed.



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
  2025-12-06  2:18   ` Rik van Riel
@ 2025-12-06  5:55   ` Lance Yang
  2025-12-06  6:24     ` Lance Yang
  2025-12-08  2:32   ` Lance Yang
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 27+ messages in thread
From: Lance Yang @ 2025-12-06  5:55 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, harry.yoo, jannh,
	linux-arch, linux-kernel, linux-mm, liushixin2, loberman,
	lorenzo.stoakes, muchun.song, nadav.amit, npiggin, osalvador,
	peterz, pfalcato, prakash.sangappa, riel, stable, vbabka, will,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>


On Fri,  5 Dec 2025 22:35:55 +0100, David Hildenbrand (Red Hat) wrote:
> We switched from (wrongly) using the page count to an independent
> shared count. Now, shared page tables have a refcount of 1 (excluding
> speculative references) and instead use ptdesc->pt_share_count to
> identify sharing.
> 
> We didn't convert hugetlb_pmd_shared(), so right now, we would never
> detect a shared PMD table as such, because sharing/unsharing no longer
> touches the refcount of a PMD table.
> 
> Page migration, like mbind() or migrate_pages() would allow for migrating
> folios mapped into such shared PMD tables, even though the folios are
> not exclusive. In smaps we would account them as "private" although they
> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> pagemap interface.
> 
> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---

Good catch! Feel free to add:

Reviewed-by: Lance yang <lance.yang@linux.dev>

>  include/linux/hugetlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 019a1c5281e4e..03c8725efa289 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
>  #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
>  {
> -	return page_count(virt_to_page(pte)) > 1;
> +	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
>  }
>  #else
>  static inline bool hugetlb_pmd_shared(pte_t *pte)



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-06  5:55   ` Lance Yang
@ 2025-12-06  6:24     ` Lance Yang
  0 siblings, 0 replies; 27+ messages in thread
From: Lance Yang @ 2025-12-06  6:24 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, harry.yoo, jannh,
	linux-arch, linux-kernel, linux-mm, liushixin2, loberman,
	lorenzo.stoakes, muchun.song, nadav.amit, npiggin, osalvador,
	peterz, pfalcato, prakash.sangappa, riel, stable, vbabka, will,
	Lance Yang



On 2025/12/6 13:55, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> 
> On Fri,  5 Dec 2025 22:35:55 +0100, David Hildenbrand (Red Hat) wrote:
>> We switched from (wrongly) using the page count to an independent
>> shared count. Now, shared page tables have a refcount of 1 (excluding
>> speculative references) and instead use ptdesc->pt_share_count to
>> identify sharing.
>>
>> We didn't convert hugetlb_pmd_shared(), so right now, we would never
>> detect a shared PMD table as such, because sharing/unsharing no longer
>> touches the refcount of a PMD table.
>>
>> Page migration, like mbind() or migrate_pages() would allow for migrating
>> folios mapped into such shared PMD tables, even though the folios are
>> not exclusive. In smaps we would account them as "private" although they
>> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
>> pagemap interface.
>>
>> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
>>
>> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
>> Cc: <stable@vger.kernel.org>
>> Cc: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> ---
> 
> Good catch! Feel free to add:
> 
> Reviewed-by: Lance yang <lance.yang@linux.dev>

Actually:

Reviewed-by: Lance Yang <lance.yang@linux.dev>



* Re: [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)
  2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
                   ` (3 preceding siblings ...)
  2025-12-05 21:35 ` [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
@ 2025-12-06 19:53 ` Laurence Oberman
  4 siblings, 0 replies; 27+ messages in thread
From: Laurence Oberman @ 2025-12-06 19:53 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat), linux-kernel
  Cc: linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Rik van Riel,
	Harry Yoo, Prakash Sangappa, Nadav Amit

On Fri, 2025-12-05 at 22:35 +0100, David Hildenbrand (Red Hat) wrote:
> One functional fix, one performance regression fix, and two related
> comment fixes.
> 
> I cleaned up my prototype I recently shared [1] for the performance
> fix,
> deferring most of the cleanups I had in the prototype to a later
> point.
> While doing that I identified the other things.
> 
> The goal of this patch set is to be backported to stable trees
> "fairly"
> easily. At least patch #1 and #4.
> 
> Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing
> Patch #2 + #3 are simple comment fixes that patch #4 interacts with.
> Patch #4 is a fix for the reported performance regression due to
> excessive
> IPI broadcasts during fork()+exit().
> 
> The last patch is all about TLB flushes, IPIs and mmu_gather.
> Read: complicated
> 
> I added as much comments + description that I possibly could, and I
> am
> hoping for review from Jann.
> 
> There are plenty of cleanups in the future to be had + one reasonable
> optimization on x86. But that's all out of scope for this series.
> 
> Compile tested on plenty of architectures.
> 
> Runtime tested, with a focus on fixing the performance regression
> using
> the original reproducer [2] on x86.
> 
> I'm still busy with more testing (making sure that my TLB flushing
> changes
> are good), but sending this out already so people can test and review
> while I am soon heading for LPC.
> 
> [1]
> https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/
> [2]
> https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/
> 
> Cc: Will Deacon <will@kernel.org>
> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Nick Piggin <npiggin@gmail.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Jann Horn <jannh@google.com>
> Cc: Pedro Falcato <pfalcato@suse.de>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Harry Yoo <harry.yoo@oracle.com>
> Cc: "Uschakow, Stanislav" <suschako@amazon.de>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> 
> David Hildenbrand (Red Hat) (4):
>   mm/hugetlb: fix hugetlb_pmd_shared()
>   mm/hugetlb: fix two comments related to huge_pmd_unshare()
>   mm/rmap: fix two comments related to huge_pmd_unshare()
>   mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables
>     using mmu_gather
> 
>  include/asm-generic/tlb.h |  69 +++++++++++++++++++-
>  include/linux/hugetlb.h   |  21 ++++---
>  mm/hugetlb.c              | 129 ++++++++++++++++++++----------------
> --
>  mm/mmu_gather.c           |   6 ++
>  mm/mprotect.c             |   2 +-
>  mm/rmap.c                 |  45 +++++++------
>  6 files changed, 178 insertions(+), 94 deletions(-)
> 

The series passed generic testing, with a focus on the CVE regression,
and looks good.

Tested-by: Laurence Oberman <loberman@redhat.com>





* Re: [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-05 21:35 ` [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
@ 2025-12-07 12:15   ` Nadav Amit
  2025-12-07 12:24     ` Nadav Amit
  2025-12-10 15:06   ` Lorenzo Stoakes
  1 sibling, 1 reply; 27+ messages in thread
From: Nadav Amit @ 2025-12-07 12:15 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Rik van Riel,
	Harry Yoo, Laurence Oberman, Prakash Sangappa, stable


> On 5 Dec 2025, at 23:35, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
> 
> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
> 	tlb->cleared_pmds = 0;
> 	tlb->cleared_puds = 0;
> 	tlb->cleared_p4ds = 0;
> +	tlb->unshared_tables = 0;
> 	/*
> 	 * Do not reset mmu_gather::vma_* fields here, we do not
> 	 * call into tlb_start_vma() again to set them if there is an

I understand you don’t want to initialize fully_unshared_tables here, but
tlb_gather_mmu() needs to happen somewhere. So you probably want it to
take place in tlb_gather_mmu(), no?




* Re: [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-07 12:15   ` Nadav Amit
@ 2025-12-07 12:24     ` Nadav Amit
  2025-12-07 12:39       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 27+ messages in thread
From: Nadav Amit @ 2025-12-07 12:24 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Rik van Riel,
	Harry Yoo, Laurence Oberman, Prakash Sangappa, stable



> On 7 Dec 2025, at 14:15, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
>> On 5 Dec 2025, at 23:35, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>> 
>> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>> 
> 
> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
> 	tlb->cleared_pmds = 0;
> 	tlb->cleared_puds = 0;
> 	tlb->cleared_p4ds = 0;
> +	tlb->unshared_tables = 0;
> 	/*
> 	 * Do not reset mmu_gather::vma_* fields here, we do not
> 	 * call into tlb_start_vma() again to set them if there is an
> 
> I understand you don’t want to initialize fully_unshared_tables here, but
> tlb_gather_mmu() needs to happen somewhere. So you probably want it to
> take place in tlb_gather_mmu(), no?

To clarify my messed up response: the code needs to initialize fully_unshared_tables
somewhere during tlb_gather_mmu() invocation.





* Re: [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-07 12:24     ` Nadav Amit
@ 2025-12-07 12:39       ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-07 12:39 UTC (permalink / raw)
  To: Nadav Amit
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Rik van Riel,
	Harry Yoo, Laurence Oberman, Prakash Sangappa, stable

On 12/7/25 13:24, Nadav Amit wrote:
> 
> 
>> On 7 Dec 2025, at 14:15, Nadav Amit <nadav.amit@gmail.com> wrote:
>>
>>
>>> On 5 Dec 2025, at 23:35, David Hildenbrand (Red Hat) <david@kernel.org> wrote:
>>>
>>> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>>>
>>
>> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>> 	tlb->cleared_pmds = 0;
>> 	tlb->cleared_puds = 0;
>> 	tlb->cleared_p4ds = 0;
>> +	tlb->unshared_tables = 0;
>> 	/*
>> 	 * Do not reset mmu_gather::vma_* fields here, we do not
>> 	 * call into tlb_start_vma() again to set them if there is an
>>
>> I understand you don’t want to initialize fully_unshared_tables here, but
>> tlb_gather_mmu() needs to happen somewhere. So you probably want it to
>> take place in tlb_gather_mmu(), no?
> 
> To clarify my messed up response: the code needs to initialize fully_unshared_tables
> somewhere during tlb_gather_mmu() invocation.

Good point, __tlb_gather_mmu() needs to initialize it explicitly!

Thanks a lot for the review, appreciated!

-- 
Cheers

David
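
A minimal userspace sketch of the initialization issue Nadav raised,
assuming only what the quoted hunk shows: __tlb_reset_range() clears
unshared_tables but not fully_unshared_tables, so an mmu_gather on the
stack needs the latter cleared once at setup time. The names
toy_gather_bits, toy_reset_range(), toy_gather_init() and the two check
helpers are hypothetical and only model the two flags.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model of the two flags discussed above. */
struct toy_gather_bits {
	unsigned int unshared_tables : 1;
	unsigned int fully_unshared_tables : 1;
};

/* Mirrors the quoted hunk: the range reset clears only unshared_tables. */
void toy_reset_range(struct toy_gather_bits *tlb)
{
	tlb->unshared_tables = 0;
}

/* Setup must clear fully_unshared_tables itself, then reset the range. */
void toy_gather_init(struct toy_gather_bits *tlb)
{
	tlb->fully_unshared_tables = 0;
	toy_reset_range(tlb);
}

/* Returns 1 if init leaves both flags clear on a garbage-filled struct. */
int toy_init_clears_garbage(void)
{
	struct toy_gather_bits tlb;

	memset(&tlb, 0xff, sizeof(tlb)); /* simulate stale stack contents */
	toy_gather_init(&tlb);
	return !tlb.unshared_tables && !tlb.fully_unshared_tables;
}

/* Returns 1 if a range reset alone leaves fully_unshared_tables set. */
int toy_reset_keeps_flag(void)
{
	struct toy_gather_bits tlb;

	memset(&tlb, 0xff, sizeof(tlb));
	toy_reset_range(&tlb);
	return !tlb.unshared_tables && tlb.fully_unshared_tables;
}
```

Without the explicit assignment in the init helper, the second flag would
keep whatever value the stack happened to hold, which in the real code
could trip the VM_WARN_ON_ONCE() added to tlb_finish_mmu().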



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
  2025-12-06  2:18   ` Rik van Riel
  2025-12-06  5:55   ` Lance Yang
@ 2025-12-08  2:32   ` Lance Yang
  2025-12-08 11:01     ` David Hildenbrand (Red Hat)
  2025-12-10 11:15     ` Lorenzo Stoakes
  2025-12-08  9:08   ` Harry Yoo
                     ` (2 subsequent siblings)
  5 siblings, 2 replies; 27+ messages in thread
From: Lance Yang @ 2025-12-08  2:32 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, harry.yoo, jannh,
	linux-arch, linux-kernel, linux-mm, liushixin2, loberman,
	lorenzo.stoakes, muchun.song, nadav.amit, npiggin, osalvador,
	peterz, pfalcato, prakash.sangappa, riel, stable, vbabka, will,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>


On Fri,  5 Dec 2025 22:35:55 +0100, David Hildenbrand (Red Hat) wrote:
> We switched from (wrongly) using the page count to an independent
> shared count. Now, shared page tables have a refcount of 1 (excluding
> speculative references) and instead use ptdesc->pt_share_count to
> identify sharing.
> 
> We didn't convert hugetlb_pmd_shared(), so right now, we would never
> detect a shared PMD table as such, because sharing/unsharing no longer
> touches the refcount of a PMD table.
> 
> Page migration, like mbind() or migrate_pages() would allow for migrating
> folios mapped into such shared PMD tables, even though the folios are
> not exclusive. In smaps we would account them as "private" although they
> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> pagemap interface.
> 
> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---

Tested on x86 with two independent processes sharing a 1GiB hugetlbfs file
(aligned to a 1GiB boundary).

Before the fix, even though PMD sharing worked (pt_share_count=1),
hugetlb_pmd_shared() returned false because page_count() was still 1,
causing smaps to report it as "Private" and pagemap to set
PM_MMAP_EXCLUSIVE incorrectly :(

After the fix, hugetlb_pmd_shared() correctly detects the sharing, smaps
reports it as "Shared", and PM_MMAP_EXCLUSIVE is cleared ;)

Tested-by: Lance Yang <lance.yang@linux.dev>

Cheers!

>  include/linux/hugetlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 019a1c5281e4e..03c8725efa289 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
>  #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
>  {
> -	return page_count(virt_to_page(pte)) > 1;
> +	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
>  }
>  #else
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
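
A minimal userspace sketch of the before/after behaviour described above,
assuming the share count is nonzero exactly when the table is shared (as
the pt_share_count=1 observation with two processes suggests). The struct
toy_ptdesc and the functions toy_share(), toy_pmd_shared_old() and
toy_pmd_shared_new() are hypothetical models, not the kernel's ptdesc
helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the PMD table's page table descriptor. */
struct toy_ptdesc {
	int refcount;       /* no longer touched by sharing/unsharing */
	int pt_share_count; /* number of additional sharers */
};

/* Sharing since 59d9094df3d7: only the share count changes. */
void toy_share(struct toy_ptdesc *pt)
{
	pt->pt_share_count++;
}

/* Old check: looks at the refcount, which stays at 1. */
bool toy_pmd_shared_old(const struct toy_ptdesc *pt)
{
	return pt->refcount > 1;
}

/* New check: looks at the dedicated share count. */
bool toy_pmd_shared_new(const struct toy_ptdesc *pt)
{
	return pt->pt_share_count > 0;
}

/* Returns 1 if only the new check sees a table shared by a second user. */
int toy_only_new_check_detects_sharing(void)
{
	struct toy_ptdesc pt = { .refcount = 1, .pt_share_count = 0 };

	toy_share(&pt);
	return !toy_pmd_shared_old(&pt) && toy_pmd_shared_new(&pt);
}
```

In this model, the old check keeps answering "not shared" after a second
user attaches, matching the "Private" accounting seen in smaps before the
fix.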



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
                     ` (2 preceding siblings ...)
  2025-12-08  2:32   ` Lance Yang
@ 2025-12-08  9:08   ` Harry Yoo
  2025-12-10 11:16   ` Lorenzo Stoakes
  2025-12-11  5:38   ` Oscar Salvador
  5 siblings, 0 replies; 27+ messages in thread
From: Harry Yoo @ 2025-12-08  9:08 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Lorenzo Stoakes,
	Vlastimil Babka, Jann Horn, Pedro Falcato, Rik van Riel,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable,
	Liu Shixin

On Fri, Dec 05, 2025 at 10:35:55PM +0100, David Hildenbrand (Red Hat) wrote:
> We switched from (wrongly) using the page count to an independent
> shared count. Now, shared page tables have a refcount of 1 (excluding
> speculative references) and instead use ptdesc->pt_share_count to
> identify sharing.
> 
> We didn't convert hugetlb_pmd_shared(), so right now, we would never
> detect a shared PMD table as such, because sharing/unsharing no longer
> touches the refcount of a PMD table.
> 
> Page migration, like mbind() or migrate_pages() would allow for migrating
> folios mapped into such shared PMD tables, even though the folios are
> not exclusive. In smaps we would account them as "private" although they
> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> pagemap interface.
> 
> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> ---

Oops, I didn't notice there was still a missing conversion!

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-08  2:32   ` Lance Yang
@ 2025-12-08 11:01     ` David Hildenbrand (Red Hat)
  2025-12-10 11:15     ` Lorenzo Stoakes
  1 sibling, 0 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-08 11:01 UTC (permalink / raw)
  To: Lance Yang
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, harry.yoo, jannh,
	linux-arch, linux-kernel, linux-mm, liushixin2, loberman,
	lorenzo.stoakes, muchun.song, nadav.amit, npiggin, osalvador,
	peterz, pfalcato, prakash.sangappa, riel, stable, vbabka, will,
	Lance Yang

On 12/8/25 03:32, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> 
> On Fri,  5 Dec 2025 22:35:55 +0100, David Hildenbrand (Red Hat) wrote:
>> We switched from (wrongly) using the page count to an independent
>> shared count. Now, shared page tables have a refcount of 1 (excluding
>> speculative references) and instead use ptdesc->pt_share_count to
>> identify sharing.
>>
>> We didn't convert hugetlb_pmd_shared(), so right now, we would never
>> detect a shared PMD table as such, because sharing/unsharing no longer
>> touches the refcount of a PMD table.
>>
>> Page migration, like mbind() or migrate_pages() would allow for migrating
>> folios mapped into such shared PMD tables, even though the folios are
>> not exclusive. In smaps we would account them as "private" although they
>> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
>> pagemap interface.
>>
>> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
>>
>> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
>> Cc: <stable@vger.kernel.org>
>> Cc: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> ---
> 
> Tested on x86 with two independent processes sharing a 1GiB hugetlbfs file
> (aligned a 1GiB boundary).
> 
> Before the fix, even though PMD sharing worked (pt_share_count=1),
> hugetlb_pmd_shared() returned false because page_count() was still 1,
> causing smaps to report it as "Private" and pagemap to set
> PM_MMAP_EXCLUSIVE incorrectly :(
> 
> After the fix, hugetlb_pmd_shared() correctly detects the sharing, smaps
> reports it as "Shared", and PM_MMAP_EXCLUSIVE is cleared ;)
> 
> Tested-by: Lance Yang <lance.yang@linux.dev>

Thanks a lot Lance for the testing and thanks to everybody for the review!

-- 
Cheers

David



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-08  2:32   ` Lance Yang
  2025-12-08 11:01     ` David Hildenbrand (Red Hat)
@ 2025-12-10 11:15     ` Lorenzo Stoakes
  1 sibling, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-12-10 11:15 UTC (permalink / raw)
  To: Lance Yang
  Cc: david, Liam.Howlett, akpm, aneesh.kumar, arnd, harry.yoo, jannh,
	linux-arch, linux-kernel, linux-mm, liushixin2, loberman,
	muchun.song, nadav.amit, npiggin, osalvador, peterz, pfalcato,
	prakash.sangappa, riel, stable, vbabka, will, Lance Yang

On Mon, Dec 08, 2025 at 10:32:31AM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
>
> On Fri,  5 Dec 2025 22:35:55 +0100, David Hildenbrand (Red Hat) wrote:
> > We switched from (wrongly) using the page count to an independent
> > shared count. Now, shared page tables have a refcount of 1 (excluding
> > speculative references) and instead use ptdesc->pt_share_count to
> > identify sharing.
> >
> > We didn't convert hugetlb_pmd_shared(), so right now, we would never
> > detect a shared PMD table as such, because sharing/unsharing no longer
> > touches the refcount of a PMD table.
> >
> > Page migration, like mbind() or migrate_pages() would allow for migrating
> > folios mapped into such shared PMD tables, even though the folios are
> > not exclusive. In smaps we would account them as "private" although they
> > are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> > pagemap interface.
> >
> > Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
> >
> > Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> > Cc: <stable@vger.kernel.org>
> > Cc: Liu Shixin <liushixin2@huawei.com>
> > Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > ---
>
> Tested on x86 with two independent processes sharing a 1GiB hugetlbfs file
> (aligned to a 1GiB boundary).
>
> Before the fix, even though PMD sharing worked (pt_share_count=1),
> hugetlb_pmd_shared() returned false because page_count() was still 1,
> causing smaps to report it as "Private" and pagemap to set
> PM_MMAP_EXCLUSIVE incorrectly :(
>
> After the fix, hugetlb_pmd_shared() correctly detects the sharing, smaps
> reports it as "Shared", and PM_MMAP_EXCLUSIVE is cleared ;)

Yikes yikes yikes...

I wonder what else might be broken in this stuff :/

>
> Tested-by: Lance Yang <lance.yang@linux.dev>
>
> Cheers!
>
> >  include/linux/hugetlb.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 019a1c5281e4e..03c8725efa289 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
> >  #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> >  static inline bool hugetlb_pmd_shared(pte_t *pte)
> >  {
> > -	return page_count(virt_to_page(pte)) > 1;
> > +	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
> >  }
> >  #else
> >  static inline bool hugetlb_pmd_shared(pte_t *pte)



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
                     ` (3 preceding siblings ...)
  2025-12-08  9:08   ` Harry Yoo
@ 2025-12-10 11:16   ` Lorenzo Stoakes
  2025-12-11  5:38   ` Oscar Salvador
  5 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-12-10 11:16 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable,
	Liu Shixin

On Fri, Dec 05, 2025 at 10:35:55PM +0100, David Hildenbrand (Red Hat) wrote:
> We switched from (wrongly) using the page count to an independent
> shared count. Now, shared page tables have a refcount of 1 (excluding
> speculative references) and instead use ptdesc->pt_share_count to
> identify sharing.
>
> We didn't convert hugetlb_pmd_shared(), so right now, we would never
> detect a shared PMD table as such, because sharing/unsharing no longer
> touches the refcount of a PMD table.
>
> Page migration, like mbind() or migrate_pages() would allow for migrating
> folios mapped into such shared PMD tables, even though the folios are
> not exclusive. In smaps we would account them as "private" although they
> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> pagemap interface.

Yikes this seems pretty serious!!

How did we not pick up on this before...

>
> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
>
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Esp. given Lance's testing... LGTM so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/hugetlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 019a1c5281e4e..03c8725efa289 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
>  #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
>  {
> -	return page_count(virt_to_page(pte)) > 1;
> +	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
>  }
>  #else
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
> --
> 2.52.0
>
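
For anyone following along, the failure mode can be shown with a tiny
userspace model. This is only a sketch, not kernel code: struct
model_ptdesc and the model_*() helpers are made-up names, and it assumes
(per the commit message above) that sharing only bumps pt_share_count
while the refcount stays at 1, and that ptdesc_pmd_is_shared() boils down
to "share count is non-zero".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PMD table's ptdesc; not the kernel struct. */
struct model_ptdesc {
	int refcount;		/* assumed to stay at 1 when sharing */
	int pt_share_count;	/* assumed: number of additional sharers */
};

/* Model of a second process starting to share the PMD table. */
static void model_share(struct model_ptdesc *pt)
{
	pt->pt_share_count++;
}

/* Model of one sharer going away again. */
static void model_unshare(struct model_ptdesc *pt)
{
	assert(pt->pt_share_count > 0);
	pt->pt_share_count--;
}

/* Old-style check: looks at the refcount, which sharing no longer touches. */
static bool model_pmd_shared_old(const struct model_ptdesc *pt)
{
	return pt->refcount > 1;
}

/* New-style check: looks at the independent share count. */
static bool model_pmd_shared_new(const struct model_ptdesc *pt)
{
	return pt->pt_share_count > 0;
}
```

With one extra sharer, the old-style check keeps reporting "not shared"
while the new-style check flips to "shared", which matches the smaps and
pagemap behaviour Lance observed before and after the fix.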



* Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
  2025-12-06  2:26   ` Rik van Riel
@ 2025-12-10 11:22   ` Lorenzo Stoakes
  2025-12-11  1:58     ` David Hildenbrand (Red Hat)
  2025-12-11  5:41   ` Oscar Salvador
  2 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-12-10 11:22 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
> Ever since we stopped using the page count to detect shared PMD
> page tables, these comments are outdated.
>
> The only reason we have to flush the TLB early is because once we drop
> the i_mmap_rwsem, the previously shared page table could get freed (to
> then get reallocated and used for other purposes). So we really have to
> flush the TLB before that could happen.
>
> So let's simplify the comments a bit.
>
> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
> part introduced in commit a4a118f2eead ("hugetlbfs: flush TLBs
> correctly after huge_pmd_unshare") was confusing: sure it is recorded
> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
> anything. So let's drop that comment while at it as well.
>
> We'll centralize these comments in a single helper as we rework the code
> next.
>
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/hugetlb.c | 24 ++++++++----------------
>  1 file changed, 8 insertions(+), 16 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 51273baec9e5d..3c77cdef12a32 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	tlb_end_vma(tlb, vma);
>
>  	/*
> -	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
> -	 * could defer the flush until now, since by holding i_mmap_rwsem we
> -	 * guaranteed that the last reference would not be dropped. But we must
> -	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
> -	 * dropped and the last reference to the shared PMDs page might be
> -	 * dropped as well.
> -	 *
> -	 * In theory we could defer the freeing of the PMD pages as well, but
> -	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
> -	 * detect sharing, so we cannot defer the release of the page either.

Was it this comment that led you to question the page_count issue? :)

> -	 * Instead, do flush now.
> +	 * There is nothing protecting a previously-shared page table that we
> +	 * unshared through huge_pmd_unshare() from getting freed after we
> +	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> +	 * succeeded, flush the range corresponding to the pud.
>  	 */
>  	if (force_flush)
>  		tlb_flush_mmu_tlbonly(tlb);
> @@ -6536,11 +6529,10 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  		cond_resched();
>  	}
>  	/*
> -	 * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
> -	 * may have cleared our pud entry and done put_page on the page table:
> -	 * once we release i_mmap_rwsem, another task can do the final put_page
> -	 * and that page table be reused and filled with junk.  If we actually
> -	 * did unshare a page of pmds, flush the range corresponding to the pud.
> +	 * There is nothing protecting a previously-shared page table that we
> +	 * unshared through huge_pmd_unshare() from getting freed after we
> +	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> +	 * succeeded, flush the range corresponding to the pud.
>  	 */
>  	if (shared_pmd)
>  		flush_hugetlb_tlb_range(vma, range.start, range.end);
> --
> 2.52.0
>



* Re: [PATCH v1 3/4] mm/rmap: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
  2025-12-06  2:50   ` Rik van Riel
@ 2025-12-10 11:24   ` Lorenzo Stoakes
  2025-12-11  5:42   ` Oscar Salvador
  2 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-12-10 11:24 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, Dec 05, 2025 at 10:35:57PM +0100, David Hildenbrand (Red Hat) wrote:
> PMD page table unsharing no longer touches the refcount of a PMD page
> table. Also, it is not about dropping the refcount of a "PMD page" but
> the "PMD page table".
>
> Let's just simplify by saying that the PMD page table was unmapped,
> consequently also unmapping the folio that was mapped into this page.
>
> This code should be deduplicated in the future.
>
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/rmap.c | 20 ++++----------------
>  1 file changed, 4 insertions(+), 16 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f955f02d570ed..748f48727a162 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2016,14 +2016,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  					flush_tlb_range(vma,
>  						range.start, range.end);
>  					/*
> -					 * The ref count of the PMD page was
> -					 * dropped which is part of the way map
> -					 * counting is done for shared PMDs.
> -					 * Return 'true' here.  When there is
> -					 * no other sharing, huge_pmd_unshare
> -					 * returns false and we will unmap the
> -					 * actual page and drop map count
> -					 * to zero.
> +					 * The PMD table was unmapped,
> +					 * consequently unmapping the folio.
>  					 */
>  					goto walk_done;
>  				}
> @@ -2416,14 +2410,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  						range.start, range.end);
>
>  					/*
> -					 * The ref count of the PMD page was
> -					 * dropped which is part of the way map
> -					 * counting is done for shared PMDs.
> -					 * Return 'true' here.  When there is
> -					 * no other sharing, huge_pmd_unshare
> -					 * returns false and we will unmap the
> -					 * actual page and drop map count
> -					 * to zero.
> +					 * The PMD table was unmapped,
> +					 * consequently unmapping the folio.
>  					 */
>  					page_vma_mapped_walk_done(&pvmw);
>  					break;
> --
> 2.52.0
>



* Re: [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-05 21:35 ` [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
  2025-12-07 12:15   ` Nadav Amit
@ 2025-12-10 15:06   ` Lorenzo Stoakes
  2025-12-11  2:27     ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-12-10 15:06 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable

On Fri, Dec 05, 2025 at 10:35:58PM +0100, David Hildenbrand (Red Hat) wrote:
> As reported, ever since commit 1013af4f585f ("mm/hugetlb: fix
> huge_pmd_unshare() vs GUP-fast race") we can end up in some situations
> where we perform so many IPI broadcasts when unsharing hugetlb PMD page
> tables that it severely regresses some workloads.
>
> In particular, when we fork()+exit(), or when we munmap() a large
> area backed by many shared PMD tables, we perform one IPI broadcast per
> unshared PMD table.
>
> There are two optimizations to be had:
>
> (1) When we process (unshare) multiple such PMD tables, such as during
>     exit(), it is sufficient to send a single IPI broadcast (as long as
>     we respect locking rules) instead of one per PMD table.

Yes :)

>
>     Locking prevents any of these PMD tables from getting reused before
>     we drop the lock.
>
> (2) When we are not the last sharer (> 2 users including us), there is
>     no need to send the IPI broadcast. The shared PMD tables cannot
>     become exclusive (fully unshared) before an IPI will be broadcasted
>     by the last sharer.

Smart, this makes sense.

>
>     Concurrent GUP-fast could walk into a PMD table just before we
>     unshared it. It could then succeed in grabbing a page from the
>     shared page table even after munmap() etc succeeded (and suppressed
>     an IPI). But there is no difference compared to GUP-fast just
>     sleeping for a while after grabbing the page and re-enabling IRQs.
>
>     Most importantly, GUP-fast will never walk into page tables that are
>     no longer shared, because the last sharer will issue an IPI
>     broadcast.
>
>     (if ever required, checking whether the PUD changed in GUP-fast
>      after grabbing the page like we do in the PTE case could handle
>      this)
>
> So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather
> infrastructure so we can implement these optimizations and demystify the
> code at least a bit. Extend the mmu_gather infrastructure to be able to
> deal with our special hugetlb PMD table sharing implementation.
>
> We'll consolidate the handling for (full) unsharing of PMD tables in
> tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track
> in "struct mmu_gather" whether we had (full) unsharing of PMD tables.
>
> Because locking is very special (concurrent unsharing+reuse must be
> prevented), we disallow deferring flushing to tlb_finish_mmu() and instead
> require an explicit earlier call to tlb_flush_unshared_tables().
>
> From hugetlb code, we call huge_pmd_unshare_flush() where we make sure
> that the expected lock protecting us from concurrent unsharing+reuse is
> still held.
>
> Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that
> tlb_flush_unshared_tables() was properly called earlier.
>
> Document it all properly.
>
> Notes about tlb_remove_table_sync_one() interaction with unsharing:
>
> There are two fairly tricky things:
>
> (1) tlb_remove_table_sync_one() is a NOP on architectures without
>     CONFIG_MMU_GATHER_RCU_TABLE_FREE.
>
>     Here, the assumption is that the previous TLB flush would send an
>     IPI to all relevant CPUs. Careful: some architectures like x86 only
>     send IPIs to all relevant CPUs when tlb->freed_tables is set.
>
>     The relevant architectures should be selecting
>     MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable
>     kernels and it might have been problematic before this patch.
>
>     Also, the arch flushing behavior (independent of IPIs) is different
>     when tlb->freed_tables is set. Do we have to enlighten them to also
>     take care of tlb->unshared_tables? So far we didn't care, so
>     hopefully we are fine. Of course, we could be setting
>     tlb->freed_tables as well, but that might then unnecessarily flush
>     too much, because the semantics of tlb->freed_tables are a bit
>     fuzzy.
>
>     This patch changes nothing in this regard.

Ugh at the 'special snowflaking' of hugetlb yet again...

>
> (2) tlb_remove_table_sync_one() is not a NOP on architectures with
>     CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
>
>     Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
>     we still issue IPIs during TLB flushes and don't actually need the
>     second tlb_remove_table_sync_one().

Hmm wasn't aware that x86 would still IPI even with
CONFIG_MMU_GATHER_RCU_TABLE_FREE??

But then we'd have to set tlb->freed_tables and as per your above point
maybe overkill...hm one for another time then I guess! :)

>
>     This optimization can be implemented on top of this, e.g., by checking in
>     tlb_remove_table_sync_one() whether we really need IPIs. But as
>     described in (1), it really must honor tlb->freed_tables then to
>     send IPIs to all relevant CPUs.
>
> Further note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a
> concern, as we are holding the i_mmap_lock the whole time, preventing
> concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed
> separately as a cleanup later.
>
> There are plenty more cleanups to be had, but they have to wait until
> this is fixed.

Yes!

>
> Fixes: 1013af4f585f ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
> Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
> Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
> Cc: <stable@vger.kernel.org>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Overall the approach looks reasonable, I raised various thoughts,
questions, concerns below.

I know you're at LPC right now so we'll get back to this either when you're
back or you get bored/can't sleep due to caffeine intake ;)

Cheers, Lorenzo
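
To make the two optimizations from the commit message concrete, here is a
small userspace model of the flag handling. It is a sketch under
assumptions, not the patch itself: struct model_gather, struct
model_ptdesc and the model_*() functions are invented for illustration,
the two counters stand in for the real TLB flush and for
tlb_remove_table_sync_one(), and clearing unshared_tables after the flush
models the __tlb_reset_range() done by tlb_flush_mmu_tlbonly().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a shared PMD table. */
struct model_ptdesc {
	int pt_share_count;	/* assumed: number of additional sharers */
};

/* Hypothetical stand-in for the relevant bits of struct mmu_gather. */
struct model_gather {
	bool unshared_tables;
	bool fully_unshared_tables;
	int tlb_flushes;	/* counts modelled TLB flushes */
	int ipi_broadcasts;	/* counts modelled IPI broadcasts */
};

/* Models tlb_unshare_pmd_ptdesc(): only records state, flushes nothing. */
static void model_unshare_pmd(struct model_gather *tlb,
			      struct model_ptdesc *pt)
{
	assert(pt->pt_share_count > 0);
	pt->pt_share_count--;

	/* The table is now exclusively owned by someone else. */
	if (pt->pt_share_count == 0)
		tlb->fully_unshared_tables = true;
	tlb->unshared_tables = true;
}

/* Models tlb_flush_unshared_tables(): one flush, at most one broadcast. */
static void model_flush_unshared(struct model_gather *tlb)
{
	if (tlb->unshared_tables) {
		tlb->tlb_flushes++;
		tlb->unshared_tables = false;
	}
	if (tlb->fully_unshared_tables) {
		tlb->ipi_broadcasts++;
		tlb->fully_unshared_tables = false;
	}
}
```

In this model, unsharing several tables and then flushing once gives one
flush and one broadcast in total (optimization 1), and unsharing a table
that still has other sharers gives a flush but no broadcast
(optimization 2). It also shows that fully_unshared_tables is never set
without unshared_tables being set in the same call.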

> ---
>  include/asm-generic/tlb.h |  69 +++++++++++++++++++++-
>  include/linux/hugetlb.h   |  19 +++---
>  mm/hugetlb.c              | 121 ++++++++++++++++++++++----------------
>  mm/mmu_gather.c           |   6 ++
>  mm/mprotect.c             |   2 +-
>  mm/rmap.c                 |  25 +++++---
>  6 files changed, 173 insertions(+), 69 deletions(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 1fff717cae510..324a21f53b644 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -364,6 +364,17 @@ struct mmu_gather {
>  	unsigned int		vma_huge : 1;
>  	unsigned int		vma_pfn  : 1;
>
> +	/*
> +	 * Did we unshare (unmap) any shared page tables?

Given mshare is incoming, maybe worth clarifying and being explicit about
hugetlb in both comment and name?

> +	 */
> +	unsigned int		unshared_tables : 1;
> +
> +	/*
> +	 * Did we unshare any page tables such that they are now exclusive
> +	 * and could get reused+modified by the new owner?
> +	 */
> +	unsigned int		fully_unshared_tables : 1;

Does fully_unshared_tables rely on unshared_tables also being set?

> +
>  	unsigned int		batch_count;
>
>  #ifndef CONFIG_MMU_GATHER_NO_GATHER
> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>  	tlb->cleared_pmds = 0;
>  	tlb->cleared_puds = 0;
>  	tlb->cleared_p4ds = 0;
> +	tlb->unshared_tables = 0;

As Nadav points out, should also initialise fully_unshared_tables.

>  	/*
>  	 * Do not reset mmu_gather::vma_* fields here, we do not
>  	 * call into tlb_start_vma() again to set them if there is an
> @@ -484,7 +496,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>  	 * these bits.
>  	 */
>  	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
> -	      tlb->cleared_puds || tlb->cleared_p4ds))
> +	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))

What about fully_unshared_tables? I guess though fully_unshared_tables implies
unshared_tables.

>  		return;
>
>  	tlb_flush(tlb);
> @@ -773,6 +785,61 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
>  }
>  #endif
>
> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
> +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
> +					  unsigned long addr)
> +{
> +	/*
> +	 * The caller must make sure that concurrent unsharing + exclusive
> +	 * reuse is impossible until tlb_flush_unshared_tables() was called.
> +	 */
> +	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
> +	ptdesc_pmd_pts_dec(pt);
> +
> +	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
> +	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);

OK I guess before we were always flushing for each page, but now we are
accumulating the flushes here.

> +
> +	/*
> +	 * If the page table is now exclusively owned, we fully unshared
> +	 * a page table.
> +	 */
> +	if (!ptdesc_pmd_is_shared(pt))
> +		tlb->fully_unshared_tables = true;
> +	tlb->unshared_tables = true;
> +}
> +
> +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> +{
> +	/*
> +	 * As soon as the caller drops locks to allow for reuse of
> +	 * previously-shared tables, these tables could get modified and
> +	 * even reused outside of hugetlb context. So flush the TLB now.

Hmm but you're doing this in both the case of unshare and fully unsharing, so is
this the right place to make this comment?

Surely here this is about flushing TLBs for the unsharer only as it no longer
uses it?

> +	 *
> +	 * Note that we cannot defer the flush to a later point even if we are
> +	 * not the last sharer of the page table.
> +	 */

Not hugely clear, some double negative here. Maybe worth saying something like:

'Even if we are not fully unsharing a PMD table, we must flush the TLB for the
unsharer who no longer has access to this memory'

Or something? Assuming this is accurate :)

> +	if (tlb->unshared_tables)
> +		tlb_flush_mmu_tlbonly(tlb);
> +
> +	/*
> +	 * Similarly, we must make sure that concurrent GUP-fast will not
> +	 * walk previously-shared page tables that are getting modified+reused
> +	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
> +	 *
> +	 * We only perform this when we are the last sharer of a page table,
> +	 * as the IPI will reach all CPUs: any GUP-fast.
> +	 *
> +	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> +	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> +	 * required IPIs already for us.
> +	 */

Nice, great comment!

> +	if (tlb->fully_unshared_tables) {
> +		tlb_remove_table_sync_one();
> +		tlb->fully_unshared_tables = false;
> +	}
> +}
> +#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
> +
>  #endif /* CONFIG_MMU */
>
>  #endif /* _ASM_GENERIC__TLB_H */
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03c8725efa289..63b248c6bfd47 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -240,8 +240,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_offset(struct mm_struct *mm,
>  		       unsigned long addr, unsigned long sz);
>  unsigned long hugetlb_mask_last_page(struct hstate *h);
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -				unsigned long addr, pte_t *ptep);
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep);
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end);
>
> @@ -271,7 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
>  int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
>  void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
>  void hugetlb_vma_lock_release(struct kref *kref);
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end, pgprot_t newprot,
>  		unsigned long cp_flags);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> @@ -300,13 +301,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
>  	return NULL;
>  }
>
> -static inline int huge_pmd_unshare(struct mm_struct *mm,
> -					struct vm_area_struct *vma,
> -					unsigned long addr, pte_t *ptep)
> +static inline int huge_pmd_unshare(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
> +		struct vm_area_struct *vma)
> +{
> +}
> +
>  static inline void adjust_range_if_pmd_sharing_possible(
>  				struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
> @@ -432,7 +437,7 @@ static inline void move_hugetlb_state(struct folio *old_folio,
>  {
>  }
>
> -static inline long hugetlb_change_protection(
> +static inline long hugetlb_change_protection(struct mmu_gather *tlb,
>  			struct vm_area_struct *vma, unsigned long address,
>  			unsigned long end, pgprot_t newprot,
>  			unsigned long cp_flags)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3c77cdef12a32..3db94693a06fc 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5096,8 +5096,9 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  	unsigned long last_addr_mask;
>  	pte_t *src_pte, *dst_pte;
>  	struct mmu_notifier_range range;
> -	bool shared_pmd = false;
> +	struct mmu_gather tlb;
>
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, old_addr,
>  				old_end);
>  	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> @@ -5122,12 +5123,12 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		if (huge_pte_none(huge_ptep_get(mm, old_addr, src_pte)))
>  			continue;
>
> -		if (huge_pmd_unshare(mm, vma, old_addr, src_pte)) {
> -			shared_pmd = true;
> +		if (huge_pmd_unshare(&tlb, vma, old_addr, src_pte)) {
>  			old_addr |= last_addr_mask;
>  			new_addr |= last_addr_mask;
>  			continue;
>  		}
> +		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);

OK I guess we need to add these to cases where we remove previous entries
because before we weren't accumulating TLB state except in
__unmap_hugepage_range()?

>
>  		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
>  		if (!dst_pte)
> @@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>  		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
>  	}
>
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
> +	tlb_flush_mmu_tlbonly(&tlb);

Maybe:

	if (!tlb.unshared_tables)
		tlb_flush_mmu_tlbonly(&tlb);

To avoid doing that twice? Not sure it's that important though, as it will be a
noop the second time.

> +	huge_pmd_unshare_flush(&tlb, vma);

OK I guess the semantics are

huge_pmd_unshare() for everything that needs unsharing, accumulating tlb state...

huge_pmd_unshare_flush() to, err, flush :) followed by tlb_finish_mmu() obv, and
with i_mmap lock held...

> +
>  	mmu_notifier_invalidate_range_end(&range);
>  	i_mmap_unlock_write(mapping);
>  	hugetlb_vma_unlock_write(vma);
> +	tlb_finish_mmu(&tlb);

Does it matter that the hugetlb VMA lock is gone when we invoke
tlb_finish_mmu()? I'm guessing not.

Kinda wish we could wrap these start/end states, it's fiddly to know that
you have to:

- call huge_pmd_unshare_flush()
- release i_mmap lock
- unlock vma hugetlb (ugh god don't even get me started on how that's implemented :)
- call tlb_finish_mmu()

Obv. this kind of thing can be part of future cleanups
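
The ordering above can be written down as a small userspace model that
counts violations. Again only a sketch: struct model_state and the
model_*() helpers are hypothetical names, the lock is reduced to a single
flag, and the warnings counter stands in for the lock assertion in
huge_pmd_unshare_flush() and the VM_WARN_ON_ONCE() in tlb_finish_mmu()
that the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for one unshare sequence. */
struct model_state {
	bool locked;		/* models i_mmap lock + hugetlb vma lock */
	bool pending_unshare;	/* unshared tables not yet flushed */
	int warnings;		/* number of ordering violations seen */
};

/* Start a sequence with the locks held and nothing pending. */
static void model_begin(struct model_state *s)
{
	s->locked = true;
	s->pending_unshare = false;
	s->warnings = 0;
}

/* Models huge_pmd_unshare(): only legal with the locks held. */
static void model_unshare(struct model_state *s)
{
	assert(s->locked);
	s->pending_unshare = true;
}

/* Models huge_pmd_unshare_flush(): must run before the locks are dropped. */
static void model_unshare_flush(struct model_state *s)
{
	if (!s->locked)
		s->warnings++;
	s->pending_unshare = false;
}

/* Models dropping the i_mmap lock and the hugetlb vma lock. */
static void model_unlock(struct model_state *s)
{
	s->locked = false;
}

/* Models tlb_finish_mmu(): warns if unsharing was never flushed. */
static void model_finish(struct model_state *s)
{
	if (s->pending_unshare)
		s->warnings++;
}
```

Running unshare, flush, unlock, finish in that order leaves the warning
count at zero; dropping the locks before the flush, or skipping the flush
entirely, each produces one warning. A future wrapper could enforce this
sequence in one place instead of at every call site.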

>
>  	return len + old_addr - old_end;
>  }
> @@ -5161,7 +5162,6 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	bool adjust_reservation;
>  	unsigned long last_addr_mask;
> -	bool force_flush = false;
>
>  	WARN_ON(!is_vm_hugetlb_page(vma));
>  	BUG_ON(start & ~huge_page_mask(h));
> @@ -5184,10 +5184,8 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		}
>
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
>  			spin_unlock(ptl);
> -			tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
> -			force_flush = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -5303,14 +5301,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  	}
>  	tlb_end_vma(tlb, vma);
>
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (force_flush)
> -		tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
>  }
>
>  void __hugetlb_zap_begin(struct vm_area_struct *vma,
> @@ -6399,7 +6390,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
>  }
>  #endif /* CONFIG_USERFAULTFD */
>
> -long hugetlb_change_protection(struct vm_area_struct *vma,
> +long hugetlb_change_protection(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		unsigned long address, unsigned long end,
>  		pgprot_t newprot, unsigned long cp_flags)
>  {
> @@ -6409,7 +6400,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  	pte_t pte;
>  	struct hstate *h = hstate_vma(vma);
>  	long pages = 0, psize = huge_page_size(h);
> -	bool shared_pmd = false;
>  	struct mmu_notifier_range range;
>  	unsigned long last_addr_mask;
>  	bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
> @@ -6452,7 +6442,7 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			}
>  		}
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		if (huge_pmd_unshare(mm, vma, address, ptep)) {
> +		if (huge_pmd_unshare(tlb, vma, address, ptep)) {
>  			/*
>  			 * When uffd-wp is enabled on the vma, unshare
>  			 * shouldn't happen at all.  Warn about it if it
> @@ -6461,7 +6451,6 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  			WARN_ON_ONCE(uffd_wp || uffd_wp_resolve);
>  			pages++;
>  			spin_unlock(ptl);
> -			shared_pmd = true;
>  			address |= last_addr_mask;
>  			continue;
>  		}
> @@ -6522,22 +6511,16 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
>  				pte = huge_pte_clear_uffd_wp(pte);
>  			huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
>  			pages++;
> +			tlb_remove_huge_tlb_entry(h, tlb, ptep, address);
>  		}
>
>  next:
>  		spin_unlock(ptl);
>  		cond_resched();
>  	}
> -	/*
> -	 * There is nothing protecting a previously-shared page table that we
> -	 * unshared through huge_pmd_unshare() from getting freed after we
> -	 * release i_mmap_rwsem, so flush the TLB now. If huge_pmd_unshare()
> -	 * succeeded, flush the range corresponding to the pud.
> -	 */
> -	if (shared_pmd)
> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
> -	else
> -		flush_hugetlb_tlb_range(vma, start, end);
> +
> +	tlb_flush_mmu_tlbonly(tlb);
> +	huge_pmd_unshare_flush(tlb, vma);
>  	/*
>  	 * No need to call mmu_notifier_arch_invalidate_secondary_tlbs() we are
>  	 * downgrading page table protection not changing it to point to a new
> @@ -6904,18 +6887,27 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return pte;
>  }
>
> -/*
> - * unmap huge page backed by shared pte.
> +/**
> + * huge_pmd_unshare - Unmap a pmd table if it is shared by multiple users
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + * @addr: the address we are trying to unshare.
> + * @ptep: pointer into the (pmd) page table.
> + *
> + * Called with the page table lock held, the i_mmap_rwsem held in write mode
> + * and the hugetlb vma lock held in write mode.
>   *
> - * Called with page table lock held.
> + * Note: The caller must call huge_pmd_unshare_flush() before dropping the
> + * i_mmap_rwsem.
>   *
> - * returns: 1 successfully unmapped a shared pte page
> - *	    0 the underlying pte page is not shared, or it is the last user
> + * Returns: 1 if it was a shared PMD table and it got unmapped, or 0 if it
> + *	    was not a shared PMD table.
>   */
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -					unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	unsigned long sz = huge_page_size(hstate_vma(vma));
> +	struct mm_struct *mm = vma->vm_mm;
>  	pgd_t *pgd = pgd_offset(mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);
> @@ -6927,18 +6919,36 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>  	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>  	hugetlb_vma_assert_locked(vma);
>  	pud_clear(pud);
> -	/*
> -	 * Once our caller drops the rmap lock, some other process might be
> -	 * using this page table as a normal, non-hugetlb page table.
> -	 * Wait for pending gup_fast() in other threads to finish before letting
> -	 * that happen.
> -	 */
> -	tlb_remove_table_sync_one();
> -	ptdesc_pmd_pts_dec(virt_to_ptdesc(ptep));
> +
> +	tlb_unshare_pmd_ptdesc(tlb, virt_to_ptdesc(ptep), addr);
> +
>  	mm_dec_nr_pmds(mm);
>  	return 1;
>  }
>
> +/*
> + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
> + * @tlb: the current mmu_gather.
> + * @vma: the vma covering the pmd table.
> + *
> + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
> + * unsharing with concurrent page table walkers (TLB, GUP-fast, etc.).
> + *
> + * This function must be called after a sequence of huge_pmd_unshare()
> + * calls while still holding the i_mmap_rwsem.
> + */
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +	/*
> +	 * We must synchronize page table unsharing such that nobody will
> +	 * try reusing a previously-shared page table while it might still
> +	 * be in use by previous sharers (TLB, GUP_fast).
> +	 */
> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
> +

Extreme nit: inconsistent newline here compared to elsewhere :)

Not even sure why I'm making this point tbh

> +	tlb_flush_unshared_tables(tlb);
> +}
> +
>  #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
>
>  pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> @@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>  	return NULL;
>  }
>
> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
> -				unsigned long addr, pte_t *ptep)
> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep)
>  {
>  	return 0;
>  }
>
> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
> +{
> +}
> +
>  void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>  				unsigned long *start, unsigned long *end)
>  {
> @@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	unsigned long sz = huge_page_size(h);
>  	struct mm_struct *mm = vma->vm_mm;
>  	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
>  	unsigned long address;
>  	spinlock_t *ptl;
>  	pte_t *ptep;
> @@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	if (start >= end)
>  		return;
>
> +	tlb_gather_mmu(&tlb, mm);
>  	flush_cache_range(vma, start, end);
>  	/*
>  	 * No need to call adjust_range_if_pmd_sharing_possible(), because
> @@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  		if (!ptep)
>  			continue;
>  		ptl = huge_pte_lock(h, mm, ptep);
> -		huge_pmd_unshare(mm, vma, address, ptep);
> +		huge_pmd_unshare(&tlb, vma, address, ptep);
>  		spin_unlock(ptl);
>  	}
> -	flush_hugetlb_tlb_range(vma, start, end);
> +	huge_pmd_unshare_flush(&tlb, vma);
>  	if (take_locks) {
>  		i_mmap_unlock_write(vma->vm_file->f_mapping);
>  		hugetlb_vma_unlock_write(vma);
> @@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>  	 * Documentation/mm/mmu_notifier.rst.
>  	 */
>  	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);

Hmm, does it matter that if !take_locks, the i_mmap lock and hugetlb vma
locks will still be held when tlb_finish_mmu() is invoked here? I'm
guessing it has no bearing but just to be sure :)

>  }
>
>  /*
> diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
> index 247e3f9db6c7a..822a790127375 100644
> --- a/mm/mmu_gather.c
> +++ b/mm/mmu_gather.c
> @@ -468,6 +468,12 @@ void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm)
>   */
>  void tlb_finish_mmu(struct mmu_gather *tlb)
>  {
> +	/*
> +	 * We expect an earlier huge_pmd_unshare_flush() call to sort this out,
> +	 * due to complicated locking requirements with page table unsharing.
> +	 */
> +	VM_WARN_ON_ONCE(tlb->fully_unshared_tables);
> +
>  	/*
>  	 * If there are parallel threads are doing PTE changes on same range
>  	 * under non-exclusive lock (e.g., mmap_lock read-side) but defer TLB
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 283889e4f1cec..5c330e817129e 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -652,7 +652,7 @@ long change_protection(struct mmu_gather *tlb,
>  #endif
>
>  	if (is_vm_hugetlb_page(vma))
> -		pages = hugetlb_change_protection(vma, start, end, newprot,
> +		pages = hugetlb_change_protection(tlb, vma, start, end, newprot,
>  						  cp_flags);
>  	else
>  		pages = change_protection_range(tlb, vma, start, end, newprot,
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 748f48727a162..d6799afe11147 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -76,7 +76,7 @@
>  #include <linux/mm_inline.h>
>  #include <linux/oom.h>
>
> -#include <asm/tlbflush.h>
> +#include <asm/tlb.h>
>
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/migrate.h>
> @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * if unsuccessful.
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma))
>  					goto walk_abort;
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> +
> +				tlb_gather_mmu(&tlb, mm);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>  					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -						range.start, range.end);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);

Not sure if it matters, but elsewhere you order the locks as:

- huge_pmd_unshare_flush()
- release i_mmap lock
- unlock vma hugetlb
- call tlb_finish_mmu()

But here it's:

- unlock vma hugetlb
- huge_pmd_unshare_flush()
- call tlb_finish_mmu()
- (later) release i_mmap lock

Does that matter in terms of lock inversions etc.?

>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2022,6 +2026,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  					goto walk_done;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
>  			if (pte_dirty(pteval))
> @@ -2398,17 +2403,20 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  			 * fail if unsuccessful.
>  			 */
>  			if (!anon) {
> +				struct mmu_gather tlb;
> +
>  				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>  				if (!hugetlb_vma_trylock_write(vma)) {
>  					page_vma_mapped_walk_done(&pvmw);
>  					ret = false;
>  					break;
>  				}
> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
> -					hugetlb_vma_unlock_write(vma);
> -					flush_tlb_range(vma,
> -						range.start, range.end);
>
> +				tlb_gather_mmu(&tlb, mm);
> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
> +					hugetlb_vma_unlock_write(vma);
> +					huge_pmd_unshare_flush(&tlb, vma);
> +					tlb_finish_mmu(&tlb);

Again this ordering is different from elsewhere, as per above. Not sure if an issue?

>  					/*
>  					 * The PMD table was unmapped,
>  					 * consequently unmapping the folio.
> @@ -2417,6 +2425,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  					break;
>  				}
>  				hugetlb_vma_unlock_write(vma);
> +				tlb_finish_mmu(&tlb);
>  			}
>  			/* Nuke the hugetlb page table entry */
>  			pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
> --
> 2.52.0
>



* Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
  2025-12-10 11:22   ` Lorenzo Stoakes
@ 2025-12-11  1:58     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-11  1:58 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On 12/10/25 12:22, Lorenzo Stoakes wrote:
> On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
>> Ever since we stopped using the page count to detect shared PMD
>> page tables, these comments are outdated.
>>
>> The only reason we have to flush the TLB early is because once we drop
>> the i_mmap_rwsem, the previously shared page table could get freed (to
>> then get reallocated and used for other purpose). So we really have to
>> flush the TLB before that could happen.
>>
>> So let's simplify the comments a bit.
>>
>> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
>> part introduced as in commit a4a118f2eead ("hugetlbfs: flush TLBs
>> correctly after huge_pmd_unshare") was confusing: sure it is recorded
>> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
>> anything. So let's drop that comment while at it as well.
>>
>> We'll centralize these comments in a single helper as we rework the code
>> next.
>>
>> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
>> Cc: Liu Shixin <liushixin2@huawei.com>
>> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
> 
> LGTM, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>   mm/hugetlb.c | 24 ++++++++----------------
>>   1 file changed, 8 insertions(+), 16 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 51273baec9e5d..3c77cdef12a32 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -5304,17 +5304,10 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>   	tlb_end_vma(tlb, vma);
>>
>>   	/*
>> -	 * If we unshared PMDs, the TLB flush was not recorded in mmu_gather. We
>> -	 * could defer the flush until now, since by holding i_mmap_rwsem we
>> -	 * guaranteed that the last reference would not be dropped. But we must
>> -	 * do the flushing before we return, as otherwise i_mmap_rwsem will be
>> -	 * dropped and the last reference to the shared PMDs page might be
>> -	 * dropped as well.
>> -	 *
>> -	 * In theory we could defer the freeing of the PMD pages as well, but
>> -	 * huge_pmd_unshare() relies on the exact page_count for the PMD page to
>> -	 * detect sharing, so we cannot defer the release of the page either.
> 
> Was it this comment that led you to question the page_count issue? :)

Heh, no, I know about the changed handling already. I stumbled over the 
page_count() remaining usage while working on some cleanups I previously 
had as part of this series :)

-- 
Cheers

David



* Re: [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
  2025-12-10 15:06   ` Lorenzo Stoakes
@ 2025-12-11  2:27     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-11  2:27 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Oscar Salvador, Liam R. Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable

>> (2) tlb_remove_table_sync_one() is not a NOP on architectures with
>>      CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
>>
>>      Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB)
>>      we still issue IPIs during TLB flushes and don't actually need the
>>      second tlb_remove_table_sync_one().
> 
> Hmm wasn't aware that x86 would still IPI even with
> CONFIG_MMU_GATHER_RCU_TABLE_FREE??
> 
> But then we'd have to set tlb->freed_tables and as per your above point
> maybe overkill...hm one for another time then I guess! :)

I have a prototype patch to handle that, Lance wants to look into 
polishing it up :)

[...]

>> +++ b/include/asm-generic/tlb.h
>> @@ -364,6 +364,17 @@ struct mmu_gather {
>>   	unsigned int		vma_huge : 1;
>>   	unsigned int		vma_pfn  : 1;
>>
>> +	/*
>> +	 * Did we unshare (unmap) any shared page tables?
> 
> Given mshare is incoming, maybe worth clarifying and being explicit about
> hugetlb in both comment and name?

I think we should instead call out mshare explicitly, once we know what
that would need here.

In general, I don't think we'll really use the "unshare" terminology a
lot with mshare, as it's not really something transparent. In the mshare
world it's simply an unmap in the mshare-owner MM. (mshare_detach
similarly is rather an unmap operation)

Long story short: I'll lean towards keeping this here as it is and not 
creating even longer names.

> 
>> +	 */
>> +	unsigned int		unshared_tables : 1;
>> +
>> +	/*
>> +	 * Did we unshare any page tables such that they are now exclusive
>> +	 * and could get reused+modified by the new owner?
>> +	 */
>> +	unsigned int		fully_unshared_tables : 1;
> 
> Does fully_unshared_tables rely on unshared_tables also being set?

See below.

> 
>> +
>>   	unsigned int		batch_count;
>>
>>   #ifndef CONFIG_MMU_GATHER_NO_GATHER
>> @@ -400,6 +411,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
>>   	tlb->cleared_pmds = 0;
>>   	tlb->cleared_puds = 0;
>>   	tlb->cleared_p4ds = 0;
>> +	tlb->unshared_tables = 0;
> 
> As Nadav points out, should also initialise fully_unshared_tables.

Right, but on an earlier init path, not on the range reset path here.

> 
>>   	/*
>>   	 * Do not reset mmu_gather::vma_* fields here, we do not
>>   	 * call into tlb_start_vma() again to set them if there is an
>> @@ -484,7 +496,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
>>   	 * these bits.
>>   	 */
>>   	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
>> -	      tlb->cleared_puds || tlb->cleared_p4ds))
>> +	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
> 
> What about fully_unshared_tables? I guess though unshared_tables implies
> fully_unshared_tables.

fully_unshared_tables is only for triggering IPIs and consequently not 
about flushing TLBs.

The TLB part is taken care of by unshared_tables, and we will always set 
unshared_tables when unsharing any page tables (incl. fully unshared ones).
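
To make the split between the two bits concrete, here is a minimal userspace sketch, not kernel code. struct pmd_table_model, struct gather_model and model_unshare() are hypothetical names; the flag handling follows tlb_unshare_pmd_ptdesc() as quoted in this mail, assuming a table counts as shared while its share count is above zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PMD table's share count (not the kernel type). */
struct pmd_table_model {
	int share_count;	/* assumed: > 0 means other users still share it */
};

/* Hypothetical stand-in for the two new mmu_gather bits. */
struct gather_model {
	bool unshared_tables;		/* a TLB flush is required */
	bool fully_unshared_tables;	/* an IPI broadcast is required, too */
};

/* Follows the flag handling of tlb_unshare_pmd_ptdesc() in the patch. */
static void model_unshare(struct gather_model *g, struct pmd_table_model *pt)
{
	assert(pt->share_count > 0);	/* must be shared to begin with */
	pt->share_count--;

	/* The table is now exclusively owned: we fully unshared it. */
	if (pt->share_count == 0)
		g->fully_unshared_tables = true;
	/* Set unconditionally, for partial and full unsharing alike. */
	g->unshared_tables = true;
}
```

With a share count of 2, the first model_unshare() call sets only unshared_tables, and the second call sets fully_unshared_tables as well.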


> 
>>   		return;
>>
>>   	tlb_flush(tlb);
>> @@ -773,6 +785,61 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
>>   }
>>   #endif
>>
>> +#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>> +static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
>> +					  unsigned long addr)
>> +{
>> +	/*
>> +	 * The caller must make sure that concurrent unsharing + exclusive
>> +	 * reuse is impossible until tlb_flush_unshared_tables() was called.
>> +	 */
>> +	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
>> +	ptdesc_pmd_pts_dec(pt);
>> +
>> +	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
>> +	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
> 
> OK I guess before we were always flushing for each page, but now we are
> accumulating the flushes here.

Yes.

> 
>> +
>> +	/*
>> +	 * If the page table is now exclusively owned, we fully unshared
>> +	 * a page table.
>> +	 */
>> +	if (!ptdesc_pmd_is_shared(pt))
>> +		tlb->fully_unshared_tables = true;
>> +	tlb->unshared_tables = true;
>> +}
>> +
>> +static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
>> +{
>> +	/*
>> +	 * As soon as the caller drops locks to allow for reuse of
>> +	 * previously-shared tables, these tables could get modified and
>> +	 * even reused outside of hugetlb context. So flush the TLB now.
> 
> Hmm but you're doing this in both the case of unshare and fully unsharing, so is
> this the right place to make this comment?

That's why I start the comment below with "Similarly", to make it clear 
that the comments build up on each other.

But I'm afraid I might not be getting your point fully here :/

> 
> Surely here this is about flushing TLBs for the unsharer only as it no longer
> uses it?
> 
>> +	 *
>> +	 * Note that we cannot defer the flush to a later point even if we are
>> +	 * not the last sharer of the page table.
>> +	 */
> 
> Not hugely clear, some double negative here. Maybe worth saying something like:
> 
> 'Even if we are not fully unsharing a PMD table, we must flush the TLB for the
> unsharer who no longer has access to this memory'
> 
> Or something? Assuming this is accurate :)

I'll adjust it to "Note that even if we are not fully unsharing a PMD
table, we must flush the TLB for the unsharer now.".

[...]

>> +		tlb_remove_huge_tlb_entry(h, &tlb, src_pte, old_addr);
> 
> OK I guess we need to add these to cases where we remove previous entries
> because before we weren't accumulating TLB state except in
> __unmap_hugepage_range()?

Exactly.

> 
>>
>>   		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
>>   		if (!dst_pte)
>> @@ -5136,13 +5137,13 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
>>   		move_huge_pte(vma, old_addr, new_addr, src_pte, dst_pte, sz);
>>   	}
>>
>> -	if (shared_pmd)
>> -		flush_hugetlb_tlb_range(vma, range.start, range.end);
>> -	else
>> -		flush_hugetlb_tlb_range(vma, old_end - len, old_end);
>> +	tlb_flush_mmu_tlbonly(&tlb);
> 
> Maybe:
> 
> 	if (!tlb->unshared_tables)
> 		tlb_flush_mmu_tlbonly(&tlb);
> 
> To avoid doing that twice? Not sure if so important as will be noop second time
> though.

No, see the existing code on the !shared path: we flush even if we
didn't unshare anything, and I am not changing these semantics.

The huge_pmd_unshare_flush() will skip the second tlb_flush_mmu_tlbonly().
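
Why the second flush is skipped can be shown with a small userspace sketch, not kernel code. struct flush_model and the model_*() helpers are hypothetical, and model_unshare_flush() only assumes what the flush half of huge_pmd_unshare_flush() looks like, since that part of the patch is not quoted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for struct mmu_gather. */
struct flush_model {
	bool cleared_pmds;	/* some entries were modified */
	bool unshared_tables;	/* some PMD tables were unshared */
	int nr_flushes;		/* how often the "TLB flush" actually ran */
};

/* Modeled after tlb_flush_mmu_tlbonly(): bail out if nothing is pending. */
static void model_flush_tlbonly(struct flush_model *m)
{
	if (!(m->cleared_pmds || m->unshared_tables))
		return;

	m->nr_flushes++;
	/* Like __tlb_reset_range(): forget what was just flushed. */
	m->cleared_pmds = false;
	m->unshared_tables = false;
}

/* Assumed shape of the flush half of huge_pmd_unshare_flush(). */
static void model_unshare_flush(struct flush_model *m)
{
	if (m->unshared_tables)
		model_flush_tlbonly(m);
}
```

Calling model_flush_tlbonly() and then model_unshare_flush() flushes once in this model, whether or not any tables were unshared, because the first call resets the pending state.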

> 
>> +	huge_pmd_unshare_flush(&tlb, vma);
> 
> OK I guess the semantics are
> 
> huge_pmd_unshare() for everything that needs unsharing, accumulating tlb state...
> 
> huge_pmd_unshare_flush() to, err, flush :) followed by tlb_finish_mmu() obv, and
> with i_mmap lock held...


The tlb_finish_mmu() can be deferred, as it's mostly for safety checks in
the hugetlb usage here. For pure unsharing, huge_pmd_unshare_flush()
will do any flushing early.

> 
>> +
>>   	mmu_notifier_invalidate_range_end(&range);
>>   	i_mmap_unlock_write(mapping);
>>   	hugetlb_vma_unlock_write(vma);
>> +	tlb_finish_mmu(&tlb);
> 
> Does it matter that the hugetlb VMA lock is gone when we invoke
> tlb_finish_mmu()? I'm guessing not.

It shouldn't, as it's mostly safety-checks only for our use case here.

> 
> Kinda wish we could wrap these start/end states, it's fiddly to know that
> you have to:
> 
> - call huge_pmd_unshare_flush()
> - release i_mmap lock
> - unlock vma hugetlb (ugh god don't even get me started on how that's implemented :)
> - call tlb_finish_mmu()
> 
> Obv. this kind of thing can be part of future cleanups

Yes, not messing with that here :)

[...]

>> +/*
>> + * huge_pmd_unshare_flush - Complete a sequence of huge_pmd_unshare() calls
>> + * @tlb: the current mmu_gather.
>> + * @vma: the vma covering the pmd table.
>> + *
>> + * Perform necessary TLB flushes or IPI broadcasts to synchronize PMD table
>> + * unsharing with concurrent page table walkers (TLB, GUP-fast, etc.).
>> + *
>> + * This function must be called after a sequence of huge_pmd_unshare()
>> + * calls while still holding the i_mmap_rwsem.
>> + */
>> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
>> +{
>> +	/*
>> +	 * We must synchronize page table unsharing such that nobody will
>> +	 * try reusing a previously-shared page table while it might still
>> +	 * be in use by previous sharers (TLB, GUP_fast).
>> +	 */
>> +	i_mmap_assert_write_locked(vma->vm_file->f_mapping);
>> +
> 
> Extreme nit: inconsistent newline here compared to elsewhere :)
> 
> Not even sure why I'm making this point tbh

Inconsistent with what exactly? I preferred separating the safety check
and the comment that explains why we check for the lock from the code
that carries out the actual logic.

> 
>> +	tlb_flush_unshared_tables(tlb);
>> +}
>> +
>>   #else /* !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
>>
>>   pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>> @@ -6947,12 +6957,16 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
>>   	return NULL;
>>   }
>>
>> -int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
>> -				unsigned long addr, pte_t *ptep)
>> +int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
>> +		unsigned long addr, pte_t *ptep)
>>   {
>>   	return 0;
>>   }
>>
>> +void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma)
>> +{
>> +}
>> +
>>   void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
>>   				unsigned long *start, unsigned long *end)
>>   {
>> @@ -7219,6 +7233,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>>   	unsigned long sz = huge_page_size(h);
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	struct mmu_notifier_range range;
>> +	struct mmu_gather tlb;
>>   	unsigned long address;
>>   	spinlock_t *ptl;
>>   	pte_t *ptep;
>> @@ -7229,6 +7244,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>>   	if (start >= end)
>>   		return;
>>
>> +	tlb_gather_mmu(&tlb, mm);
>>   	flush_cache_range(vma, start, end);
>>   	/*
>>   	 * No need to call adjust_range_if_pmd_sharing_possible(), because
>> @@ -7248,10 +7264,10 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>>   		if (!ptep)
>>   			continue;
>>   		ptl = huge_pte_lock(h, mm, ptep);
>> -		huge_pmd_unshare(mm, vma, address, ptep);
>> +		huge_pmd_unshare(&tlb, vma, address, ptep);
>>   		spin_unlock(ptl);
>>   	}
>> -	flush_hugetlb_tlb_range(vma, start, end);
>> +	huge_pmd_unshare_flush(&tlb, vma);
>>   	if (take_locks) {
>>   		i_mmap_unlock_write(vma->vm_file->f_mapping);
>>   		hugetlb_vma_unlock_write(vma);
>> @@ -7261,6 +7277,7 @@ static void hugetlb_unshare_pmds(struct vm_area_struct *vma,
>>   	 * Documentation/mm/mmu_notifier.rst.
>>   	 */
>>   	mmu_notifier_invalidate_range_end(&range);
>> +	tlb_finish_mmu(&tlb);
> 
> Hmm, does it matter that if !take_locks, the i_mmap lock and hugetlb vma
> locks will still be held when tlb_finish_mmu() is invoked here? I'm
> guessing it has no bearing but just to be sure :)

See above regarding safety checks.

[...]

>>   #define CREATE_TRACE_POINTS
>>   #include <trace/events/migrate.h>
>> @@ -2008,13 +2008,17 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>   			 * if unsuccessful.
>>   			 */
>>   			if (!anon) {
>> +				struct mmu_gather tlb;
>> +
>>   				VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
>>   				if (!hugetlb_vma_trylock_write(vma))
>>   					goto walk_abort;
>> -				if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
>> +
>> +				tlb_gather_mmu(&tlb, mm);
>> +				if (huge_pmd_unshare(&tlb, vma, address, pvmw.pte)) {
>>   					hugetlb_vma_unlock_write(vma);
>> -					flush_tlb_range(vma,
>> -						range.start, range.end);
>> +					huge_pmd_unshare_flush(&tlb, vma);
>> +					tlb_finish_mmu(&tlb);
> 
> Not sure if it matters, but elsewhere you order the locks as:
> 
> - huge_pmd_unshare_flush()
> - release i_mmap lock
> - unlock vma hugetlb
> - call tlb_finish_mmu()
> 
> But here it's:
> 
> - unlock vma hugetlb
> - huge_pmd_unshare_flush()
> - call tlb_finish_mmu()
> - (later) release i_mmap lock
> 
> Does that matter in terms of lock inversions etc.?
> 

I had a cleanup patch to change some of that, but I decided to keep it
as is for this series: flush the tlb and issue the IPI while we still
hold the page table lock.

(no idea what the hugetlb vma locking even does here, and how it 
interacts with the existing TLB flush -- not touching that confusing bit 
:) )

Thanks for the review!

-- 
Cheers

David



* Re: [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared()
  2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
                     ` (4 preceding siblings ...)
  2025-12-10 11:16   ` Lorenzo Stoakes
@ 2025-12-11  5:38   ` Oscar Salvador
  5 siblings, 0 replies; 27+ messages in thread
From: Oscar Salvador @ 2025-12-11  5:38 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, stable,
	Liu Shixin

On Fri, Dec 05, 2025 at 10:35:55PM +0100, David Hildenbrand (Red Hat) wrote:
> We switched from (wrongly) using the page count to an independent
> shared count. Now, shared page tables have a refcount of 1 (excluding
> speculative references) and instead use ptdesc->pt_share_count to
> identify sharing.
> 
> We didn't convert hugetlb_pmd_shared(), so right now, we would never
> detect a shared PMD table as such, because sharing/unsharing no longer
> touches the refcount of a PMD table.
> 
> Page migration, like mbind() or migrate_pages() would allow for migrating
> folios mapped into such shared PMD tables, even though the folios are
> not exclusive. In smaps we would account them as "private" although they
> are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the
> pagemap interface.
> 
> Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared().
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: <stable@vger.kernel.org>
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Good catch David,

Acked-by: Oscar Salvador <osalvador@suse.de>
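
For readers following along, the reason the old check can no longer detect sharing can be sketched in userspace C. This is an illustration only: struct ptdesc_model and the model_*() helpers are hypothetical, and the "share count above zero" rule is an assumption about what ptdesc_pmd_is_shared() tests.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the ptdesc of a PMD table. */
struct ptdesc_model {
	int refcount;		/* page refcount, stays at 1 */
	int pt_share_count;	/* independent share count */
};

/* Sharing only bumps the share count, never the refcount. */
static void model_share(struct ptdesc_model *pt)
{
	pt->pt_share_count++;
}

/* Old check: modeled after page_count(...) > 1. */
static bool model_shared_old(const struct ptdesc_model *pt)
{
	return pt->refcount > 1;
}

/* New check: assumed semantics of ptdesc_pmd_is_shared(). */
static bool model_shared_new(const struct ptdesc_model *pt)
{
	return pt->pt_share_count > 0;
}
```

After model_share(), the old check still reports "not shared" because the refcount never moved, while the new check reports the sharing.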

> ---
>  include/linux/hugetlb.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 019a1c5281e4e..03c8725efa289 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -1326,7 +1326,7 @@ static inline __init void hugetlb_cma_reserve(int order)
>  #ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
>  {
> -	return page_count(virt_to_page(pte)) > 1;
> +	return ptdesc_pmd_is_shared(virt_to_ptdesc(pte));
>  }
>  #else
>  static inline bool hugetlb_pmd_shared(pte_t *pte)
> -- 
> 2.52.0
> 
> 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
  2025-12-06  2:26   ` Rik van Riel
  2025-12-10 11:22   ` Lorenzo Stoakes
@ 2025-12-11  5:41   ` Oscar Salvador
  2 siblings, 0 replies; 27+ messages in thread
From: Oscar Salvador @ 2025-12-11  5:41 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, Dec 05, 2025 at 10:35:56PM +0100, David Hildenbrand (Red Hat) wrote:
> Ever since we stopped using the page count to detect shared PMD
> page tables, these comments are outdated.
> 
> The only reason we have to flush the TLB early is because once we drop
> the i_mmap_rwsem, the previously shared page table could get freed (to
> then get reallocated and used for other purpose). So we really have to
> flush the TLB before that could happen.
> 
> So let's simplify the comments a bit.
> 
> The "If we unshared PMDs, the TLB flush was not recorded in mmu_gather."
> part introduced as in commit a4a118f2eead ("hugetlbfs: flush TLBs
> correctly after huge_pmd_unshare") was confusing: sure it is recorded
> in the mmu_gather, otherwise tlb_flush_mmu_tlbonly() wouldn't do
> anything. So let's drop that comment while at it as well.
> 
> We'll centralize these comments in a single helper as we rework the code
> next.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Acked-by: Oscar Salvador <osalvador@suse.de>
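
The ordering rule described in the quoted commit message (flush the TLB before dropping the lock that keeps the previously shared page table alive) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `toy_unmap_ctx` and the `toy_*` helpers are hypothetical and do not correspond to kernel functions; `lock_held` merely stands in for holding i_mmap_rwsem.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; none of these names are kernel APIs. */
struct toy_unmap_ctx {
	bool lock_held;    /* models holding i_mmap_rwsem */
	bool unshared;     /* a shared PMD page table was unshared */
	bool tlb_flushed;  /* stale TLB entries have been flushed */
};

static inline void toy_lock(struct toy_unmap_ctx *c)
{
	c->lock_held = true;
}

/* Unsharing is only valid under the lock and leaves stale TLB entries. */
static inline void toy_unshare(struct toy_unmap_ctx *c)
{
	assert(c->lock_held);
	c->unshared = true;
	c->tlb_flushed = false;
}

static inline void toy_flush_tlb(struct toy_unmap_ctx *c)
{
	c->tlb_flushed = true;
}

/*
 * Drop the lock. Returns true if that was safe: either nothing was
 * unshared, or the TLB was flushed first. Once the lock is gone, the
 * table could be freed and reused, so a missing flush is a bug.
 */
static inline bool toy_unlock(struct toy_unmap_ctx *c)
{
	bool safe = !c->unshared || c->tlb_flushed;

	c->lock_held = false;
	return safe;
}
```

In this model, lock, unshare, flush, unlock reports a safe unlock, whereas skipping the flush makes toy_unlock() return false.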


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v1 3/4] mm/rmap: fix two comments related to huge_pmd_unshare()
  2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
  2025-12-06  2:50   ` Rik van Riel
  2025-12-10 11:24   ` Lorenzo Stoakes
@ 2025-12-11  5:42   ` Oscar Salvador
  2 siblings, 0 replies; 27+ messages in thread
From: Oscar Salvador @ 2025-12-11  5:42 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: linux-kernel, linux-arch, linux-mm, Will Deacon, Aneesh Kumar K.V,
	Andrew Morton, Nick Piggin, Peter Zijlstra, Arnd Bergmann,
	Muchun Song, Liam R. Howlett, Lorenzo Stoakes, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, Harry Yoo,
	Laurence Oberman, Prakash Sangappa, Nadav Amit, Liu Shixin

On Fri, Dec 05, 2025 at 10:35:57PM +0100, David Hildenbrand (Red Hat) wrote:
> PMD page table unsharing no longer touches the refcount of a PMD page
> table. Also, it is not about dropping the refcount of a "PMD page" but
> the "PMD page table".
> 
> Let's just simplify by saying that the PMD page table was unmapped,
> consequently also unmapping the folio that was mapped into this page.
> 
> This code should be deduplicated in the future.
> 
> Fixes: 59d9094df3d7 ("mm: hugetlb: independent PMD page table shared count")
> Cc: Liu Shixin <liushixin2@huawei.com>
> Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>

Acked-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs



end of thread, other threads:[~2025-12-11  5:42 UTC | newest]

Thread overview: 27+ messages
2025-12-05 21:35 [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) David Hildenbrand (Red Hat)
2025-12-05 21:35 ` [PATCH v1 1/4] mm/hugetlb: fix hugetlb_pmd_shared() David Hildenbrand (Red Hat)
2025-12-06  2:18   ` Rik van Riel
2025-12-06  5:55   ` Lance Yang
2025-12-06  6:24     ` Lance Yang
2025-12-08  2:32   ` Lance Yang
2025-12-08 11:01     ` David Hildenbrand (Red Hat)
2025-12-10 11:15     ` Lorenzo Stoakes
2025-12-08  9:08   ` Harry Yoo
2025-12-10 11:16   ` Lorenzo Stoakes
2025-12-11  5:38   ` Oscar Salvador
2025-12-05 21:35 ` [PATCH v1 2/4] mm/hugetlb: fix two comments related to huge_pmd_unshare() David Hildenbrand (Red Hat)
2025-12-06  2:26   ` Rik van Riel
2025-12-10 11:22   ` Lorenzo Stoakes
2025-12-11  1:58     ` David Hildenbrand (Red Hat)
2025-12-11  5:41   ` Oscar Salvador
2025-12-05 21:35 ` [PATCH v1 3/4] mm/rmap: " David Hildenbrand (Red Hat)
2025-12-06  2:50   ` Rik van Riel
2025-12-10 11:24   ` Lorenzo Stoakes
2025-12-11  5:42   ` Oscar Salvador
2025-12-05 21:35 ` [PATCH v1 4/4] mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather David Hildenbrand (Red Hat)
2025-12-07 12:15   ` Nadav Amit
2025-12-07 12:24     ` Nadav Amit
2025-12-07 12:39       ` David Hildenbrand (Red Hat)
2025-12-10 15:06   ` Lorenzo Stoakes
2025-12-11  2:27     ` David Hildenbrand (Red Hat)
2025-12-06 19:53 ` [PATCH v1 0/4] mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather) Laurence Oberman
