* [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
This is an RFC series, so not yet for merging. Please don't be scared by
the code changes: most of them are code movements only.
This series is based on the dax mprotect fix series here (while that one is
based on mm-unstable):
[PATCH v3 0/8] mm/mprotect: Fix dax puds
https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
Overview
========
This series doesn't provide any feature change. The only goal of this
series is to start decoupling two ideas: "THP" and "huge mapping". We
already started that with the PGTABLE_HAS_HUGE_LEAVES config option, and
this series extends the idea into the code.

The issue is that we have many functions that only compile with
CONFIG_THP=on even though they're about huge mappings, and a huge mapping
is a pretty common concept that can apply to many things besides THPs
nowadays. The major THP file is mm/huge_memory.c as of now.
The first example of such huge mapping users will be hugetlb. We have
lived with no problem until now simply because Linux almost duplicated all
the logic from the "THP" files into the hugetlb APIs. If we want to get
rid of the hugetlb-specific APIs and paths, this _might_ be the first
thing we want to do, because we want to be able to, e.g., zap a hugetlb
pmd entry even if !CONFIG_THP.
Then consider other things like dax / pfnmaps. Dax can depend on THP and
then naturally use the pmd/pud helpers; that's okay. However, is it a
must? Do we also want every new pmd/pud mapping in the future to depend
on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
If anyone agrees with me that "huge mapping" (i.e., PMD/PUD mappings that
are larger than PAGE_SIZE) is a more generic concept than THP, then I
think at some point we need to move the generic code out of the THP code
into a common code base.
This is what this series does as a start.
In general, this series tries to move many THP things (mostly residing in
huge_memory.c right now) into two new files: huge_mapping_{pmd|pud}.c.
While moving them out, I also split them by page table level into
separate files. Then, if an arch supports e.g. only PMD huge mappings, it
can avoid compiling the PUD helpers, with things like:
CONFIG_PGTABLE_HAS_PUD_LEAVES=n
obj-$(CONFIG_PGTABLE_HAS_PUD_LEAVES) += huge_mapping_pud.o
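
where the new Kconfig options themselves (added in patch 2) are simple
def_bools derived from the existing arch capabilities, roughly:

  config PGTABLE_HAS_PMD_LEAVES
          def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES

  config PGTABLE_HAS_PUD_LEAVES
          def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES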
Note that there are a few tree-wide changes in arch/, but not a lot. To
avoid disturbing too many people, I only copied the open lists of each
arch, not yet the arch maintainers.
Tests
=====
My normal 19-arch cross-compilation tests pass with it, and I smoke
tested it on x86_64 with a local config of mine.

Comments welcome, thanks.
Peter Xu (6):
mm/treewide: Remove pgd_devmap()
mm: PGTABLE_HAS_P[MU]D_LEAVES config options
mm/treewide: Make pgtable-generic.c THP agnostic
mm: Move huge mapping declarations from internal.h to huge_mm.h
mm/huge_mapping: Create huge_mapping_pxx.c
mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf()
arch/arm64/include/asm/pgtable.h | 11 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 7 +-
arch/powerpc/mm/book3s64/pgtable.c | 2 +-
arch/riscv/include/asm/pgtable.h | 4 +-
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/mm/pgtable.c | 4 +-
arch/sparc/mm/tlb.c | 2 +-
arch/x86/include/asm/pgtable.h | 5 -
arch/x86/mm/pgtable.c | 15 +-
include/linux/huge_mm.h | 332 ++++--
include/linux/mm.h | 18 +
include/linux/mm_types.h | 2 +-
include/linux/pgtable.h | 61 +-
include/trace/events/huge_mapping.h | 41 +
include/trace/events/thp.h | 28 -
mm/Kconfig | 6 +
mm/Makefile | 2 +
mm/gup.c | 2 -
mm/hmm.c | 4 +-
mm/huge_mapping_pmd.c | 976 +++++++++++++++
mm/huge_mapping_pud.c | 235 ++++
mm/huge_memory.c | 1125 +-----------------
mm/internal.h | 33 -
mm/mapping_dirty_helpers.c | 4 +-
mm/memory.c | 16 +-
mm/migrate_device.c | 2 +-
mm/mprotect.c | 4 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pgtable-generic.c | 37 +-
30 files changed, 1595 insertions(+), 1395 deletions(-)
create mode 100644 include/trace/events/huge_mapping.h
create mode 100644 mm/huge_mapping_pmd.c
create mode 100644 mm/huge_mapping_pud.c
--
2.45.0
* [PATCH RFC 1/6] mm/treewide: Remove pgd_devmap()
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
It's always 0 for all archs, and there's no sign of even supporting p4d
entries in the near future. Remove it until it's needed for real.
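
Every current implementation is the same trivial stub (as seen in the
hunks below), e.g.:

  static inline int pgd_devmap(pgd_t pgd)
  {
          return 0;
  }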
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/arm64/include/asm/pgtable.h | 5 -----
arch/powerpc/include/asm/book3s/64/pgtable.h | 5 -----
arch/x86/include/asm/pgtable.h | 5 -----
include/linux/pgtable.h | 4 ----
mm/gup.c | 2 --
5 files changed, 21 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8efbc128446..5d5d1b18b837 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1119,11 +1119,6 @@ static inline int pud_devmap(pud_t pud)
{
return 0;
}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5da92ba68a45..051b1b6d729c 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1431,11 +1431,6 @@ static inline int pud_devmap(pud_t pud)
{
return pte_devmap(pud_pte(pud));
}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 701593c53f3b..0d234f48ceeb 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -311,11 +311,6 @@ static inline int pud_devmap(pud_t pud)
return 0;
}
#endif
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2289e9f7aa1b..0a904300ac90 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1626,10 +1626,6 @@ static inline int pud_devmap(pud_t pud)
{
return 0;
}
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
diff --git a/mm/gup.c b/mm/gup.c
index 54d0dc3831fb..b023bcd38235 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3149,8 +3149,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
- BUILD_BUG_ON(pgd_devmap(orig));
-
page = pgd_page(orig);
refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
--
2.45.0
* [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
- PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
- PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
They help identify whether the current build may only want the PMD
helpers but not the PUD ones, as the sub-options also check the arch
support via HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].

Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
still an intermediate step. The best way would be an option that says
"arch XXX supports PMD/PUD mappings" and so on. However, let's leave that
for later, as that's the easy part. For now, these options give us a
stable way to detect per-arch huge mapping support.
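
As a usage sketch (illustrative only; the actual conversions come in the
following patches, this patch only adds the options), code that today is
guarded by CONFIG_TRANSPARENT_HUGEPAGE alone can instead do:

  #ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
  /* ... PMD huge mapping helpers, even when THP itself is not built ... */
  #endif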
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 10 +++++++---
mm/Kconfig | 6 ++++++
2 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 711632df7edf..37482c8445d1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
#define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
(!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
-#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
-#define HPAGE_PMD_SHIFT PMD_SHIFT
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
#define HPAGE_PUD_SHIFT PUD_SHIFT
#else
-#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
#endif
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
+#define HPAGE_PMD_SHIFT PMD_SHIFT
+#else
+#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
+#endif
+
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
#define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
diff --git a/mm/Kconfig b/mm/Kconfig
index 60796402850e..2dbdc088dee8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
config PGTABLE_HAS_HUGE_LEAVES
def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
+config PGTABLE_HAS_PMD_LEAVES
+ def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
+
+config PGTABLE_HAS_PUD_LEAVES
+ def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
+
#
# UP and nommu archs use km based percpu allocator
#
--
2.45.0
* [PATCH RFC 3/6] mm/treewide: Make pgtable-generic.c THP agnostic
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
Make the pmd/pud helpers rely on the new PGTABLE_HAS_*_LEAVES options,
rather than on THP alone, as THP is only one form of huge mapping.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
arch/arm64/include/asm/pgtable.h | 6 ++--
arch/powerpc/include/asm/book3s/64/pgtable.h | 2 +-
arch/powerpc/mm/book3s64/pgtable.c | 2 +-
arch/riscv/include/asm/pgtable.h | 4 +--
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/mm/pgtable.c | 4 +--
arch/sparc/mm/tlb.c | 2 +-
arch/x86/mm/pgtable.c | 15 ++++-----
include/linux/mm_types.h | 2 +-
include/linux/pgtable.h | 4 +--
mm/memory.c | 2 +-
mm/pgtable-generic.c | 32 ++++++++++----------
12 files changed, 40 insertions(+), 37 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5d5d1b18b837..b93c03256ada 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1105,7 +1105,7 @@ extern int __ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
#define __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
@@ -1114,7 +1114,9 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
pmd_pte(entry), dirty);
}
+#endif
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
static inline int pud_devmap(pud_t pud)
{
return 0;
@@ -1178,7 +1180,7 @@ static inline int __ptep_clear_flush_young(struct vm_area_struct *vma,
return young;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
#define __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 051b1b6d729c..84cf55e18334 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1119,7 +1119,7 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
return pte_access_permitted(pmd_pte(pmd), write);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 5a4a75369043..d6a5457627df 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -37,7 +37,7 @@ EXPORT_SYMBOL(__pmd_frag_nr);
unsigned long __pmd_frag_size_shift;
EXPORT_SYMBOL(__pmd_frag_size_shift);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
/*
* This is called when relaxing access to a hugepage. It's also called in the page
* fault path when we don't hit any of the major fault cases, ie, a minor
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ebfe8faafb79..8c28f15f601b 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -752,7 +752,7 @@ static inline bool pud_user_accessible_page(pud_t pud)
}
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
static inline int pmd_trans_huge(pmd_t pmd)
{
return pmd_leaf(pmd);
@@ -802,7 +802,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
#define pmdp_collapse_flush pmdp_collapse_flush
extern pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
/*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index fb6870384b97..398bbed20dee 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1710,7 +1710,7 @@ pmd_t pmdp_xchg_direct(struct mm_struct *, unsigned long, pmd_t *, pmd_t);
pmd_t pmdp_xchg_lazy(struct mm_struct *, unsigned long, pmd_t *, pmd_t);
pud_t pudp_xchg_direct(struct mm_struct *, unsigned long, pud_t *, pud_t);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
#define __HAVE_ARCH_PGTABLE_DEPOSIT
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 2c944bafb030..c4481068734e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -561,7 +561,7 @@ pud_t pudp_xchg_direct(struct mm_struct *mm, unsigned long addr,
}
EXPORT_SYMBOL(pudp_xchg_direct);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable)
{
@@ -600,7 +600,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
set_pte(ptep, __pte(_PAGE_INVALID));
return pgtable;
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
#ifdef CONFIG_PGSTE
void ptep_set_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 8648a50afe88..140813d07c9f 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -143,7 +143,7 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
tlb_batch_add_one(mm, vaddr, pte_exec(orig), hugepage_shift);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr,
pmd_t pmd)
{
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index fa77411bb266..7b10d4a0c0cd 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -511,7 +511,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
return changed;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
int pmdp_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
pmd_t entry, int dirty)
@@ -532,7 +532,9 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
return changed;
}
+#endif /* PGTABLE_HAS_PMD_LEAVES */
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
pud_t *pudp, pud_t entry, int dirty)
{
@@ -552,7 +554,7 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
return changed;
}
-#endif
+#endif /* PGTABLE_HAS_PUD_LEAVES */
int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep)
@@ -566,7 +568,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
return ret;
}
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+#if defined(CONFIG_PGTABLE_HAS_PMD_LEAVES) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
@@ -580,7 +582,7 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
}
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
int pudp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp)
{
@@ -613,7 +615,7 @@ int ptep_clear_flush_young(struct vm_area_struct *vma,
return ptep_test_and_clear_young(vma, address, ptep);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
int pmdp_clear_flush_young(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
{
@@ -641,8 +643,7 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
}
#endif
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
pud_t *pudp)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ef09c4eef6d3..44ef91ce720c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -942,7 +942,7 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_subscriptions *notifier_subscriptions;
#endif
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
+#if defined(CONFIG_PGTABLE_HAS_PMD_LEAVES) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
#ifdef CONFIG_NUMA_BALANCING
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 0a904300ac90..5a5aaee5fa1c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -362,7 +362,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
#endif
#ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
+#if defined(CONFIG_PGTABLE_HAS_PMD_LEAVES) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address,
pmd_t *pmdp)
@@ -383,7 +383,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
BUILD_BUG();
return 0;
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES || CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG */
#endif
#ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
diff --git a/mm/memory.c b/mm/memory.c
index 802d0d8a40f9..126ee0903c79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -666,7 +666,7 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
return NULL;
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd)
{
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711a..e9fc3f6774a6 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -103,7 +103,7 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
}
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
#ifndef __HAVE_ARCH_PMDP_SET_ACCESS_FLAGS
int pmdp_set_access_flags(struct vm_area_struct *vma,
@@ -145,20 +145,6 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
- pud_t *pudp)
-{
- pud_t pud;
-
- VM_BUG_ON(address & ~HPAGE_PUD_MASK);
- VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
- pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
- flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
- return pud;
-}
-#endif
#endif
#ifndef __HAVE_ARCH_PGTABLE_DEPOSIT
@@ -252,7 +238,21 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
call_rcu(&page->rcu_head, pte_free_now);
}
#endif /* pte_free_defer */
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
+
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
+ pud_t *pudp)
+{
+ pud_t pud;
+
+ VM_BUG_ON(address & ~HPAGE_PUD_MASK);
+ VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+ pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
+ flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
+ return pud;
+}
+#endif /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
#if defined(CONFIG_GUP_GET_PXX_LOW_HIGH) && \
(defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RCU))
--
2.45.0
* [PATCH RFC 4/6] mm: Move huge mapping declarations from internal.h to huge_mm.h
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
Most of the huge mapping relevant helpers are declared in huge_mm.h, not
internal.h. Move the remaining few from internal.h into huge_mm.h.

To move pmd_needs_soft_dirty_wp() over, we also need to move
vma_soft_dirty_enabled() into mm.h, as it will be needed in two headers
later (internal.h, huge_mm.h).
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 10 ++++++++++
include/linux/mm.h | 18 ++++++++++++++++++
mm/internal.h | 33 ---------------------------------
3 files changed, 28 insertions(+), 33 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 37482c8445d1..d8b642ad512d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,6 +8,11 @@
#include <linux/fs.h> /* only for vma_is_dax() */
#include <linux/kobject.h>
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, bool write);
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, bool write);
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
@@ -629,4 +634,9 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
#define split_folio(f) split_folio_to_order(f, 0)
+static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd);
+}
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5f1075d19600..fa10802d8faa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1117,6 +1117,24 @@ static inline unsigned int folio_order(struct folio *folio)
return folio->_flags_1 & 0xff;
}
+static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
+{
+ /*
+ * NOTE: we must check this before VM_SOFTDIRTY on soft-dirty
+ * enablements, because when without soft-dirty being compiled in,
+ * VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY)
+ * will be constantly true.
+ */
+ if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
+ return false;
+
+ /*
+ * Soft-dirty is kind of special: its tracking is enabled when the
+ * vma flags not set.
+ */
+ return !(vma->vm_flags & VM_SOFTDIRTY);
+}
+
#include <linux/huge_mm.h>
/*
diff --git a/mm/internal.h b/mm/internal.h
index b4d86436565b..e49941747749 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -917,8 +917,6 @@ bool need_mlock_drain(int cpu);
void mlock_drain_local(void);
void mlock_drain_remote(int cpu);
-extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
-
/**
* vma_address - Find the virtual address a page range is mapped at
* @vma: The vma which maps this object.
@@ -1229,14 +1227,6 @@ int migrate_device_coherent_page(struct page *page);
int __must_check try_grab_folio(struct folio *folio, int refs,
unsigned int flags);
-/*
- * mm/huge_memory.c
- */
-void touch_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, bool write);
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, bool write);
-
/*
* mm/mmap.c
*/
@@ -1342,29 +1332,6 @@ static __always_inline void vma_set_range(struct vm_area_struct *vma,
vma->vm_pgoff = pgoff;
}
-static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
-{
- /*
- * NOTE: we must check this before VM_SOFTDIRTY on soft-dirty
- * enablements, because when without soft-dirty being compiled in,
- * VM_SOFTDIRTY is defined as 0x0, then !(vm_flags & VM_SOFTDIRTY)
- * will be constantly true.
- */
- if (!IS_ENABLED(CONFIG_MEM_SOFT_DIRTY))
- return false;
-
- /*
- * Soft-dirty is kind of special: its tracking is enabled when the
- * vma flags not set.
- */
- return !(vma->vm_flags & VM_SOFTDIRTY);
-}
-
-static inline bool pmd_needs_soft_dirty_wp(struct vm_area_struct *vma, pmd_t pmd)
-{
- return vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd);
-}
-
static inline bool pte_needs_soft_dirty_wp(struct vm_area_struct *vma, pte_t pte)
{
return vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte);
--
2.45.0
* [PATCH RFC 5/6] mm/huge_mapping: Create huge_mapping_pxx.c
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
At some point, we need to decouple "huge mapping" from THP, for any
non-THP huge mappings in the future (hugetlb, pfnmap, etc.). This is the
first step towards it.

Or rather, we already started to do this when the PGTABLE_HAS_HUGE_LEAVES
option was introduced: that is the first place where Linux starts to
describe leaves rather than THPs when it comes to huge mappings. Before
that, almost any huge mapping had THP involved, like devmap. Hugetlb is
special only because we duplicated the whole world there, but there is now
a demand to decouple that too.
Linux has had huge_memory.c, which only compiles with THP enabled; I wish
it had been called thp.c from the start. In reality, it contains more
than THP processing: any huge mapping (even one not falling into the THP
category) can leverage many of these helpers, but unfortunately this file
is not compiled if !THP. These helpers are normally only about pgtable
operations, which may not care what type of huge folio (e.g. THP) sits
underneath, or perhaps whether there is any vmemmap backing it at all.

It's better to move them out of the THP world.
Create a new set of files, huge_mapping_p[mu]d.c. This patch starts to
move quite a few essential helpers from huge_memory.c into these new
files, so that they compile and work based on PGTABLE_HAS_PXX_LEAVES
rather than THP. Split them into two files by level, so that e.g. archs
that only support PMD huge mappings can avoid compiling the whole -pud
file, with the hope of reducing the size of the compiled and linked
object.

No functional change intended, only code movement. That said, there are
some "ifdef" machinery changes needed to keep all configurations building.
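
For reference, the mm/Makefile hookup (from the diff below) ends up as:

  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
  obj-$(CONFIG_PGTABLE_HAS_PMD_LEAVES) += huge_mapping_pmd.o
  obj-$(CONFIG_PGTABLE_HAS_PUD_LEAVES) += huge_mapping_pud.o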
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 318 +++++---
include/linux/pgtable.h | 23 +-
include/trace/events/huge_mapping.h | 41 +
include/trace/events/thp.h | 28 -
mm/Makefile | 2 +
mm/huge_mapping_pmd.c | 979 +++++++++++++++++++++++
mm/huge_mapping_pud.c | 235 ++++++
mm/huge_memory.c | 1125 +--------------------------
8 files changed, 1472 insertions(+), 1279 deletions(-)
create mode 100644 include/trace/events/huge_mapping.h
create mode 100644 mm/huge_mapping_pmd.c
create mode 100644 mm/huge_mapping_pud.c
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d8b642ad512d..aea2784df8ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -8,43 +8,214 @@
#include <linux/fs.h> /* only for vma_is_dax() */
#include <linux/kobject.h>
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
void touch_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, bool write);
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, bool write);
-pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
-vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
-int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
- struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
-void huge_pmd_set_accessed(struct vm_fault *vmf);
int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
struct vm_area_struct *vma);
+int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr);
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags);
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long address);
+spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud);
-#else
-static inline void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
+static inline spinlock_t *
+pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
{
+ if (pud_trans_huge(*pud) || pud_devmap(*pud))
+ return __pud_trans_huge_lock(pud, vma);
+ else
+ return NULL;
}
-#endif
+#define split_huge_pud(__vma, __pud, __address) \
+ do { \
+ pud_t *____pud = (__pud); \
+ if (pud_trans_huge(*____pud) || pud_devmap(*____pud)) \
+ __split_huge_pud(__vma, __pud, __address); \
+ } while (0)
+#else /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
+static inline void
+huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
+{
+}
+
+static inline int
+change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ return 0;
+}
+
+static inline spinlock_t *
+pud_trans_huge_lock(pud_t *pud,
+ struct vm_area_struct *vma)
+{
+ return NULL;
+}
+
+static inline void
+touch_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, bool write)
+{
+}
+
+static inline int
+copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+ struct vm_area_struct *vma)
+{
+ return 0;
+}
+
+static inline int
+zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr)
+{
+ return 0;
+}
+
+static inline void
+__split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address)
+{
+}
+
+#define split_huge_pud(__vma, __pud, __address) do {} while (0)
+#endif /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
+
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, bool write);
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
+void huge_pmd_set_accessed(struct vm_fault *vmf);
vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf);
-bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long addr, unsigned long next);
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr);
-int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr);
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr, pgprot_t newprot,
unsigned long cp_flags);
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, bool freeze, struct folio *folio);
+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio);
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+ bool freeze, struct folio *folio);
+spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
+bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t pmd);
+void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd);
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return !pmd_none(pmd) && !pmd_present(pmd);
+}
+
+/* mmap_lock must be held on entry */
+static inline spinlock_t *
+pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ return __pmd_trans_huge_lock(pmd, vma);
+ else
+ return NULL;
+}
+
+#define split_huge_pmd(__vma, __pmd, __address) \
+ do { \
+ pmd_t *____pmd = (__pmd); \
+ if (is_swap_pmd(*____pmd) || pmd_is_leaf(*____pmd)) \
+ __split_huge_pmd(__vma, __pmd, __address, \
+ false, NULL); \
+ } while (0)
+#else /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
+static inline spinlock_t *
+pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+ return NULL;
+}
+
+static inline int is_swap_pmd(pmd_t pmd)
+{
+ return 0;
+}
+static inline void
+__split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, bool freeze, struct folio *folio)
+{
+}
+#define split_huge_pmd(__vma, __pmd, __address) do {} while (0)
+
+static inline int
+copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+ return 0;
+}
+
+static inline int
+zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long addr)
+{
+ return 0;
+}
+
+static inline vm_fault_t
+do_huge_pmd_wp_page(struct vm_fault *vmf)
+{
+ return 0;
+}
+
+static inline void
+huge_pmd_set_accessed(struct vm_fault *vmf)
+{
+}
+
+static inline int
+change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ return 0;
+}
+
+static inline bool
+move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
+{
+ return false;
+}
+
+static inline void
+split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio)
+{
+}
+
+static inline void
+split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+ bool freeze, struct folio *folio)
+{
+}
+#endif /* CONFIG_PGTABLE_HAS_PMD_LEAVES */
+
+bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, unsigned long next);
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
@@ -130,6 +301,9 @@ extern unsigned long huge_anon_orders_always;
extern unsigned long huge_anon_orders_madvise;
extern unsigned long huge_anon_orders_inherit;
+void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd);
+
static inline bool hugepage_global_enabled(void)
{
return transparent_hugepage_flags &
@@ -332,44 +506,6 @@ static inline int split_huge_page(struct page *page)
}
void deferred_split_folio(struct folio *folio);
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long address, bool freeze, struct folio *folio);
-
-#define split_huge_pmd(__vma, __pmd, __address) \
- do { \
- pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
- __split_huge_pmd(__vma, __pmd, __address, \
- false, NULL); \
- } while (0)
-
-
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
- bool freeze, struct folio *folio);
-
-void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
- unsigned long address);
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pud_t *pudp, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags);
-#else
-static inline int
-change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pud_t *pudp, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags) { return 0; }
-#endif
-
-#define split_huge_pud(__vma, __pud, __address) \
- do { \
- pud_t *____pud = (__pud); \
- if (pud_trans_huge(*____pud) \
- || pud_devmap(*____pud)) \
- __split_huge_pud(__vma, __pud, __address); \
- } while (0)
-
int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
int advice);
int madvise_collapse(struct vm_area_struct *vma,
@@ -377,31 +513,6 @@ int madvise_collapse(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
unsigned long end, long adjust_next);
-spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
-spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
-
-static inline int is_swap_pmd(pmd_t pmd)
-{
- return !pmd_none(pmd) && !pmd_present(pmd);
-}
-
-/* mmap_lock must be held on entry */
-static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
-{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
- return __pmd_trans_huge_lock(pmd, vma);
- else
- return NULL;
-}
-static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
- struct vm_area_struct *vma)
-{
- if (pud_trans_huge(*pud) || pud_devmap(*pud))
- return __pud_trans_huge_lock(pud, vma);
- else
- return NULL;
-}
/**
* folio_test_pmd_mappable - Can we map this folio with a PMD?
@@ -416,6 +527,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf);
extern struct folio *huge_zero_folio;
extern unsigned long huge_zero_pfn;
@@ -445,13 +557,17 @@ static inline bool thp_migration_supported(void)
return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
}
-void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, bool freeze, struct folio *folio);
bool unmap_huge_pmd_locked(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmdp, struct folio *folio);
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline void
+__split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+}
+
static inline bool folio_test_pmd_mappable(struct folio *folio)
{
return false;
@@ -505,16 +621,6 @@ static inline int split_huge_page(struct page *page)
return 0;
}
static inline void deferred_split_folio(struct folio *folio) {}
-#define split_huge_pmd(__vma, __pmd, __address) \
- do { } while (0)
-
-static inline void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long address, bool freeze, struct folio *folio) {}
-static inline void split_huge_pmd_address(struct vm_area_struct *vma,
- unsigned long address, bool freeze, struct folio *folio) {}
-static inline void split_huge_pmd_locked(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmd,
- bool freeze, struct folio *folio) {}
static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp,
@@ -523,9 +629,6 @@ static inline bool unmap_huge_pmd_locked(struct vm_area_struct *vma,
return false;
}
-#define split_huge_pud(__vma, __pmd, __address) \
- do { } while (0)
-
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
@@ -545,20 +648,6 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
long adjust_next)
{
}
-static inline int is_swap_pmd(pmd_t pmd)
-{
- return 0;
-}
-static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
- struct vm_area_struct *vma)
-{
- return NULL;
-}
-static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
- struct vm_area_struct *vma)
-{
- return NULL;
-}
static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
@@ -606,15 +695,8 @@ static inline int next_order(unsigned long *orders, int prev)
return 0;
}
-static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
- unsigned long address)
-{
-}
-
-static inline int change_huge_pud(struct mmu_gather *tlb,
- struct vm_area_struct *vma, pud_t *pudp,
- unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags)
+static inline vm_fault_t
+do_huge_pmd_anonymous_page(struct vm_fault *vmf)
{
return 0;
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5a5aaee5fa1c..5e505373b113 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -628,8 +628,8 @@ static inline pud_t pudp_huge_get_and_clear(struct mm_struct *mm,
#endif /* __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR */
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#ifndef __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
+#if defined(CONFIG_PGTABLE_HAS_PMD_LEAVES) && \
+ !defined(__HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL)
static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp,
int full)
@@ -638,14 +638,14 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
}
#endif
-#ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
+#if defined(CONFIG_PGTABLE_HAS_PUD_LEAVES) && \
+ !defined(__HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL)
static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
int full)
{
return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
}
-#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
@@ -894,9 +894,9 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif
+
#ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
static inline void pudp_set_wrprotect(struct mm_struct *mm,
unsigned long address, pud_t *pudp)
{
@@ -910,8 +910,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
{
BUILD_BUG();
}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+#endif /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
#endif
#ifndef pmdp_collapse_flush
@@ -1735,7 +1734,6 @@ static inline int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
#ifndef __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* ARCHes with special requirements for evicting THP backing TLB entries can
* implement this. Otherwise also, it can help optimize normal TLB flush in
@@ -1745,10 +1743,15 @@ static inline int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
* invalidate the entire TLB which is not desirable.
* e.g. see arch/arc: flush_pmd_tlb_range
*/
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
#define flush_pmd_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
-#define flush_pud_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
#else
#define flush_pmd_tlb_range(vma, addr, end) BUILD_BUG()
+#endif
+
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+#define flush_pud_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
+#else
#define flush_pud_tlb_range(vma, addr, end) BUILD_BUG()
#endif
#endif
diff --git a/include/trace/events/huge_mapping.h b/include/trace/events/huge_mapping.h
new file mode 100644
index 000000000000..20036d090ce5
--- /dev/null
+++ b/include/trace/events/huge_mapping.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM huge_mapping
+
+#if !defined(_TRACE_HUGE_MAPPING_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_HUGE_MAPPING_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(migration_pmd,
+
+ TP_PROTO(unsigned long addr, unsigned long pmd),
+
+ TP_ARGS(addr, pmd),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, addr)
+ __field(unsigned long, pmd)
+ ),
+
+ TP_fast_assign(
+ __entry->addr = addr;
+ __entry->pmd = pmd;
+ ),
+ TP_printk("addr=%lx, pmd=%lx", __entry->addr, __entry->pmd)
+);
+
+DEFINE_EVENT(migration_pmd, set_migration_pmd,
+ TP_PROTO(unsigned long addr, unsigned long pmd),
+ TP_ARGS(addr, pmd)
+);
+
+DEFINE_EVENT(migration_pmd, remove_migration_pmd,
+ TP_PROTO(unsigned long addr, unsigned long pmd),
+ TP_ARGS(addr, pmd)
+);
+#endif /* _TRACE_HUGE_MAPPING_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h
index f50048af5fcc..395b574b1c79 100644
--- a/include/trace/events/thp.h
+++ b/include/trace/events/thp.h
@@ -66,34 +66,6 @@ DEFINE_EVENT(hugepage_update, hugepage_update_pud,
TP_PROTO(unsigned long addr, unsigned long pud, unsigned long clr, unsigned long set),
TP_ARGS(addr, pud, clr, set)
);
-
-DECLARE_EVENT_CLASS(migration_pmd,
-
- TP_PROTO(unsigned long addr, unsigned long pmd),
-
- TP_ARGS(addr, pmd),
-
- TP_STRUCT__entry(
- __field(unsigned long, addr)
- __field(unsigned long, pmd)
- ),
-
- TP_fast_assign(
- __entry->addr = addr;
- __entry->pmd = pmd;
- ),
- TP_printk("addr=%lx, pmd=%lx", __entry->addr, __entry->pmd)
-);
-
-DEFINE_EVENT(migration_pmd, set_migration_pmd,
- TP_PROTO(unsigned long addr, unsigned long pmd),
- TP_ARGS(addr, pmd)
-);
-
-DEFINE_EVENT(migration_pmd, remove_migration_pmd,
- TP_PROTO(unsigned long addr, unsigned long pmd),
- TP_ARGS(addr, pmd)
-);
#endif /* _TRACE_THP_H */
/* This part must be outside protection */
diff --git a/mm/Makefile b/mm/Makefile
index d2915f8c9dc0..3a846121b1f5 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -95,6 +95,8 @@ obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_NUMA) += memory-tiers.o
obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_PGTABLE_HAS_PMD_LEAVES) += huge_mapping_pmd.o
+obj-$(CONFIG_PGTABLE_HAS_PUD_LEAVES) += huge_mapping_pud.o
obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
diff --git a/mm/huge_mapping_pmd.c b/mm/huge_mapping_pmd.c
new file mode 100644
index 000000000000..7b85e2a564d6
--- /dev/null
+++ b/mm/huge_mapping_pmd.c
@@ -0,0 +1,979 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Red Hat, Inc.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/shrinker.h>
+#include <linux/mm_inline.h>
+#include <linux/swapops.h>
+#include <linux/backing-dev.h>
+#include <linux/dax.h>
+#include <linux/mm_types.h>
+#include <linux/khugepaged.h>
+#include <linux/freezer.h>
+#include <linux/pfn_t.h>
+#include <linux/mman.h>
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/debugfs.h>
+#include <linux/migrate.h>
+#include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/page_idle.h>
+#include <linux/shmem_fs.h>
+#include <linux/oom.h>
+#include <linux/numa.h>
+#include <linux/page_owner.h>
+#include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
+#include <linux/compat.h>
+#include <linux/pgalloc_tag.h>
+
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+#include "swap.h"
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/huge_mapping.h>
+
+/*
+ * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
+ *
+ * Note that if it returns page table lock pointer, this routine returns without
+ * unlocking page table lock. So callers must unlock it.
+ */
+spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
+{
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+ pmd_devmap(*pmd)))
+ return ptl;
+ spin_unlock(ptl);
+ return NULL;
+}
+
+pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+{
+ if (likely(vma->vm_flags & VM_WRITE))
+ pmd = pmd_mkwrite(pmd, vma);
+ return pmd;
+}
+
+void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, bool write)
+{
+ pmd_t _pmd;
+
+ _pmd = pmd_mkyoung(*pmd);
+ if (write)
+ _pmd = pmd_mkdirty(_pmd);
+ if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
+ pmd, _pmd, write))
+ update_mmu_cache_pmd(vma, addr, pmd);
+}
+
+int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+ struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
+{
+ spinlock_t *dst_ptl, *src_ptl;
+ struct page *src_page;
+ struct folio *src_folio;
+ pmd_t pmd;
+ pgtable_t pgtable = NULL;
+ int ret = -ENOMEM;
+
+ /* Skip if can be re-fill on fault */
+ if (!vma_is_anonymous(dst_vma))
+ return 0;
+
+ pgtable = pte_alloc_one(dst_mm);
+ if (unlikely(!pgtable))
+ goto out;
+
+ dst_ptl = pmd_lock(dst_mm, dst_pmd);
+ src_ptl = pmd_lockptr(src_mm, src_pmd);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+ ret = -EAGAIN;
+ pmd = *src_pmd;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (unlikely(is_swap_pmd(pmd))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+ VM_BUG_ON(!is_pmd_migration_entry(pmd));
+ if (!is_readable_migration_entry(entry)) {
+ entry = make_readable_migration_entry(
+ swp_offset(entry));
+ pmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*src_pmd))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ if (pmd_swp_uffd_wp(*src_pmd))
+ pmd = pmd_swp_mkuffd_wp(pmd);
+ set_pmd_at(src_mm, addr, src_pmd, pmd);
+ }
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ mm_inc_nr_ptes(dst_mm);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ if (!userfaultfd_wp(dst_vma))
+ pmd = pmd_swp_clear_uffd_wp(pmd);
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ ret = 0;
+ goto out_unlock;
+ }
+#endif
+
+ if (unlikely(!pmd_trans_huge(pmd))) {
+ pte_free(dst_mm, pgtable);
+ goto out_unlock;
+ }
+ /*
+ * When page table lock is held, the huge zero pmd should not be
+ * under splitting since we don't split the page itself, only pmd to
+ * a page table.
+ */
+ if (is_huge_zero_pmd(pmd)) {
+ /*
+ * mm_get_huge_zero_folio() will never allocate a new
+ * folio here, since we already have a zero page to
+ * copy. It just takes a reference.
+ */
+ mm_get_huge_zero_folio(dst_mm);
+ goto out_zero_page;
+ }
+
+ src_page = pmd_page(pmd);
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ src_folio = page_folio(src_page);
+
+ folio_get(src_folio);
+ if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, src_vma))) {
+ /* Page maybe pinned: split and retry the fault on PTEs. */
+ folio_put(src_folio);
+ pte_free(dst_mm, pgtable);
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ __split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
+ return -EAGAIN;
+ }
+ add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+out_zero_page:
+ mm_inc_nr_ptes(dst_mm);
+ pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ if (!userfaultfd_wp(dst_vma))
+ pmd = pmd_clear_uffd_wp(pmd);
+ pmd = pmd_mkold(pmd_wrprotect(pmd));
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+
+ ret = 0;
+out_unlock:
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+out:
+ return ret;
+}
+
+void huge_pmd_set_accessed(struct vm_fault *vmf)
+{
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+ vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
+ goto unlock;
+
+ touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
+
+unlock:
+ spin_unlock(vmf->ptl);
+}
+
+vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
+{
+ const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
+ struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
+ struct page *page;
+ unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+ pmd_t orig_pmd = vmf->orig_pmd;
+
+ vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+ VM_BUG_ON_VMA(!vma->anon_vma, vma);
+
+ if (is_huge_zero_pmd(orig_pmd))
+ goto fallback;
+
+ spin_lock(vmf->ptl);
+
+ if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+
+ page = pmd_page(orig_pmd);
+ folio = page_folio(page);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+
+ /* Early check when only holding the PT lock. */
+ if (PageAnonExclusive(page))
+ goto reuse;
+
+ if (!folio_trylock(folio)) {
+ folio_get(folio);
+ spin_unlock(vmf->ptl);
+ folio_lock(folio);
+ spin_lock(vmf->ptl);
+ if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
+ spin_unlock(vmf->ptl);
+ folio_unlock(folio);
+ folio_put(folio);
+ return 0;
+ }
+ folio_put(folio);
+ }
+
+ /* Recheck after temporarily dropping the PT lock. */
+ if (PageAnonExclusive(page)) {
+ folio_unlock(folio);
+ goto reuse;
+ }
+
+ /*
+ * See do_wp_page(): we can only reuse the folio exclusively if
+ * there are no additional references. Note that we always drain
+ * the LRU cache immediately after adding a THP.
+ */
+ if (folio_ref_count(folio) >
+ 1 + folio_test_swapcache(folio) * folio_nr_pages(folio))
+ goto unlock_fallback;
+ if (folio_test_swapcache(folio))
+ folio_free_swap(folio);
+ if (folio_ref_count(folio) == 1) {
+ pmd_t entry;
+
+ folio_move_anon_rmap(folio, vma);
+ SetPageAnonExclusive(page);
+ folio_unlock(folio);
+reuse:
+ if (unlikely(unshare)) {
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
+ update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+ spin_unlock(vmf->ptl);
+ return 0;
+ }
+
+unlock_fallback:
+ folio_unlock(folio);
+ spin_unlock(vmf->ptl);
+fallback:
+ __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
+ return VM_FAULT_FALLBACK;
+}
+
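+/*
+ * Return true if the pmd may be made writable immediately (e.g. on the
+ * mprotect() or NUMA hinting paths) instead of leaving that to a later
+ * write fault.  See can_change_pte_writable().
+ */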
+bool can_change_pmd_writable(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t pmd)
+{
+ struct page *page;
+
+ if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
+ return false;
+
+ /* Don't touch entries that are not even readable (NUMA hinting). */
+ if (pmd_protnone(pmd))
+ return false;
+
+ /* Do we need write faults for softdirty tracking? */
+ if (pmd_needs_soft_dirty_wp(vma, pmd))
+ return false;
+
+ /* Do we need write faults for uffd-wp tracking? */
+ if (userfaultfd_huge_pmd_wp(vma, pmd))
+ return false;
+
+ if (!(vma->vm_flags & VM_SHARED)) {
+ /* See can_change_pte_writable(). */
+ page = vm_normal_page_pmd(vma, addr, pmd);
+ return page && PageAnon(page) && PageAnonExclusive(page);
+ }
+
+ /* See can_change_pte_writable(). */
+ return pmd_dirty(pmd);
+}
+
+void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pte_free(mm, pgtable);
+ mm_dec_nr_ptes(mm);
+}
+
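+/*
+ * Zap a huge pmd on the unmap/zap path.  Returns 1 if a huge pmd was
+ * zapped, 0 if *pmd turned out not to be a huge entry under the lock, in
+ * which case the caller falls back to the pte level.
+ */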
+int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+{
+ pmd_t orig_pmd;
+ spinlock_t *ptl;
+
+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
+ ptl = __pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+ /*
+ * For architectures like ppc64 we look at deposited pgtable
+ * when calling pmdp_huge_get_and_clear. So do the
+ * pgtable_trans_huge_withdraw after finishing pmdp related
+ * operations.
+ */
+ orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
+ tlb->fullmm);
+ arch_check_zapped_pmd(vma, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ if (vma_is_special_huge(vma)) {
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
+ spin_unlock(ptl);
+ } else if (is_huge_zero_pmd(orig_pmd)) {
+ zap_deposited_table(tlb->mm, pmd);
+ spin_unlock(ptl);
+ } else {
+ struct folio *folio = NULL;
+ int flush_needed = 1;
+
+ if (pmd_present(orig_pmd)) {
+ struct page *page = pmd_page(orig_pmd);
+
+ folio = page_folio(page);
+ folio_remove_rmap_pmd(folio, page, vma);
+ WARN_ON_ONCE(folio_mapcount(folio) < 0);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ } else if (thp_migration_supported()) {
+ swp_entry_t entry;
+
+ VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+ entry = pmd_to_swp_entry(orig_pmd);
+ folio = pfn_swap_entry_folio(entry);
+ flush_needed = 0;
+ } else
+ WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+ if (folio_test_anon(folio)) {
+ zap_deposited_table(tlb->mm, pmd);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ } else {
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
+ add_mm_counter(tlb->mm, mm_counter_file(folio),
+ -HPAGE_PMD_NR);
+ }
+
+ spin_unlock(ptl);
+ if (flush_needed)
+ tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
+ }
+ return 1;
+}
+
+static pmd_t move_soft_dirty_pmd(pmd_t pmd)
+{
+#ifdef CONFIG_MEM_SOFT_DIRTY
+ if (unlikely(is_pmd_migration_entry(pmd)))
+ pmd = pmd_swp_mksoft_dirty(pmd);
+ else if (pmd_present(pmd))
+ pmd = pmd_mksoft_dirty(pmd);
+#endif
+ return pmd;
+}
+
+#ifndef pmd_move_must_withdraw
+static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
+ spinlock_t *old_pmd_ptl,
+ struct vm_area_struct *vma)
+{
+ /*
+ * With split pmd lock we also need to move preallocated
+ * PTE page table if new_pmd is on different PMD page table.
+ *
+ * We also don't deposit and withdraw tables for file pages.
+ */
+ return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
+}
+#endif
+
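+/*
+ * Move a huge pmd from @old_pmd to @new_pmd for mremap().  Returns true if
+ * the entry was moved, false if the source is not a huge pmd (or the
+ * destination is unexpectedly populated) and the caller must fall back.
+ */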
+bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
+{
+ spinlock_t *old_ptl, *new_ptl;
+ pmd_t pmd;
+ struct mm_struct *mm = vma->vm_mm;
+ bool force_flush = false;
+
+ /*
+ * The destination pmd shouldn't be established, free_pgtables()
+ * should have released it; but move_page_tables() might have already
+ * inserted a page table, if racing against shmem/file collapse.
+ */
+ if (!pmd_none(*new_pmd)) {
+ VM_BUG_ON(pmd_trans_huge(*new_pmd));
+ return false;
+ }
+
+ /*
+ * We don't have to worry about the ordering of src and dst
+ * ptlocks because exclusive mmap_lock prevents deadlock.
+ */
+ old_ptl = __pmd_trans_huge_lock(old_pmd, vma);
+ if (old_ptl) {
+ new_ptl = pmd_lockptr(mm, new_pmd);
+ if (new_ptl != old_ptl)
+ spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+ pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
+ if (pmd_present(pmd))
+ force_flush = true;
+ VM_BUG_ON(!pmd_none(*new_pmd));
+
+ if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
+ pgtable_t pgtable;
+ pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
+ pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
+ }
+ pmd = move_soft_dirty_pmd(pmd);
+ set_pmd_at(mm, new_addr, new_pmd, pmd);
+ if (force_flush)
+ flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
+ if (new_ptl != old_ptl)
+ spin_unlock(new_ptl);
+ spin_unlock(old_ptl);
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Returns
+ * - 0 if PMD could not be locked
+ * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ * or if prot_numa but THP migration is not supported
+ * - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pmd_t oldpmd, entry;
+ bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
+ bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
+ bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ int ret = 1;
+
+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
+ if (prot_numa && !thp_migration_supported())
+ return 1;
+
+ ptl = __pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+ if (is_swap_pmd(*pmd)) {
+ swp_entry_t entry = pmd_to_swp_entry(*pmd);
+ struct folio *folio = pfn_swap_entry_folio(entry);
+ pmd_t newpmd;
+
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+ if (is_writable_migration_entry(entry)) {
+ /*
+ * A protection check is difficult so
+ * just be safe and disable write
+ */
+ if (folio_test_anon(folio))
+ entry = make_readable_exclusive_migration_entry(swp_offset(entry));
+ else
+ entry = make_readable_migration_entry(swp_offset(entry));
+ newpmd = swp_entry_to_pmd(entry);
+ if (pmd_swp_soft_dirty(*pmd))
+ newpmd = pmd_swp_mksoft_dirty(newpmd);
+ } else {
+ newpmd = *pmd;
+ }
+
+ if (uffd_wp)
+ newpmd = pmd_swp_mkuffd_wp(newpmd);
+ else if (uffd_wp_resolve)
+ newpmd = pmd_swp_clear_uffd_wp(newpmd);
+ if (!pmd_same(*pmd, newpmd))
+ set_pmd_at(mm, addr, pmd, newpmd);
+ goto unlock;
+ }
+#endif
+
+ if (prot_numa) {
+ struct folio *folio;
+ bool toptier;
+ /*
+ * Avoid trapping faults against the zero page. The read-only
+ * data is likely to be read-cached on the local CPU and
+ * local/remote hits to the zero page are not interesting.
+ */
+ if (is_huge_zero_pmd(*pmd))
+ goto unlock;
+
+ if (pmd_protnone(*pmd))
+ goto unlock;
+
+ folio = pmd_folio(*pmd);
+ toptier = node_is_toptier(folio_nid(folio));
+ /*
+ * Skip scanning top tier node if normal numa
+ * balancing is disabled
+ */
+ if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
+ toptier)
+ goto unlock;
+
+ if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
+ !toptier)
+ folio_xchg_access_time(folio,
+ jiffies_to_msecs(jiffies));
+ }
+ /*
+ * In case prot_numa, we are under mmap_read_lock(mm). It's critical
+ * to not clear pmd intermittently to avoid race with MADV_DONTNEED
+ * which is also under mmap_read_lock(mm):
+ *
+ * CPU0: CPU1:
+ * change_huge_pmd(prot_numa=1)
+ * pmdp_huge_get_and_clear_notify()
+ * madvise_dontneed()
+ * zap_pmd_range()
+ * pmd_trans_huge(*pmd) == 0 (without ptl)
+ * // skip the pmd
+ * set_pmd_at();
+ * // pmd is re-established
+ *
+	 * The race makes MADV_DONTNEED miss the huge pmd and not clear it,
+	 * which may break userspace.
+ *
+ * pmdp_invalidate_ad() is required to make sure we don't miss
+ * dirty/young flags set by hardware.
+ */
+ oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
+
+ entry = pmd_modify(oldpmd, newprot);
+ if (uffd_wp)
+ entry = pmd_mkuffd_wp(entry);
+ else if (uffd_wp_resolve)
+ /*
+ * Leave the write bit to be handled by PF interrupt
+ * handler, then things like COW could be properly
+ * handled.
+ */
+ entry = pmd_clear_uffd_wp(entry);
+
+ /* See change_pte_range(). */
+ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
+ can_change_pmd_writable(vma, addr, entry))
+ entry = pmd_mkwrite(entry, vma);
+
+ ret = HPAGE_PMD_NR;
+ set_pmd_at(mm, addr, pmd, entry);
+
+ if (huge_pmd_needs_flush(oldpmd, entry))
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+unlock:
+ spin_unlock(ptl);
+ return ret;
+}
+
+static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long haddr, bool freeze)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct folio *folio;
+ struct page *page;
+ pgtable_t pgtable;
+ pmd_t old_pmd, _pmd;
+ bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
+ bool anon_exclusive = false, dirty = false;
+ unsigned long addr;
+ pte_t *pte;
+ int i;
+
+ VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
+ VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+ VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) &&
+ !pmd_devmap(*pmd));
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ count_vm_event(THP_SPLIT_PMD);
+#endif
+
+ if (!vma_is_anonymous(vma)) {
+ old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+ /*
+ * We are going to unmap this huge page. So
+ * just go ahead and zap it
+ */
+ if (arch_needs_pgtable_deposit())
+ zap_deposited_table(mm, pmd);
+ if (vma_is_special_huge(vma))
+ return;
+ if (unlikely(is_pmd_migration_entry(old_pmd))) {
+ swp_entry_t entry;
+
+ entry = pmd_to_swp_entry(old_pmd);
+ folio = pfn_swap_entry_folio(entry);
+ } else {
+ page = pmd_page(old_pmd);
+ folio = page_folio(page);
+ if (!folio_test_dirty(folio) && pmd_dirty(old_pmd))
+ folio_mark_dirty(folio);
+ if (!folio_test_referenced(folio) && pmd_young(old_pmd))
+ folio_set_referenced(folio);
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+ }
+ add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+ return;
+ }
+
+ if (is_huge_zero_pmd(*pmd)) {
+ /*
+ * FIXME: Do we want to invalidate secondary mmu by calling
+ * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
+ * inside __split_huge_pmd() ?
+ *
+		 * We are going from a write-protected huge zero page to
+		 * write-protected small zero pages, so it does not seem useful
+		 * to invalidate the secondary mmu at this time.
+ */
+ return __split_huge_zero_page_pmd(vma, haddr, pmd);
+ }
+
+ pmd_migration = is_pmd_migration_entry(*pmd);
+ if (unlikely(pmd_migration)) {
+ swp_entry_t entry;
+
+ old_pmd = *pmd;
+ entry = pmd_to_swp_entry(old_pmd);
+ page = pfn_swap_entry_to_page(entry);
+ write = is_writable_migration_entry(entry);
+ if (PageAnon(page))
+ anon_exclusive = is_readable_exclusive_migration_entry(entry);
+ young = is_migration_entry_young(entry);
+ dirty = is_migration_entry_dirty(entry);
+ soft_dirty = pmd_swp_soft_dirty(old_pmd);
+ uffd_wp = pmd_swp_uffd_wp(old_pmd);
+ } else {
+ /*
+ * Up to this point the pmd is present and huge and userland has
+ * the whole access to the hugepage during the split (which
+ * happens in place). If we overwrite the pmd with the not-huge
+ * version pointing to the pte here (which of course we could if
+ * all CPUs were bug free), userland could trigger a small page
+ * size TLB miss on the small sized TLB while the hugepage TLB
+		 * entry is still established in the huge TLB. Some CPUs don't
+		 * like that. See
+		 * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
+		 * 383 on page 105. Intel should be safe but also warns that
+		 * it's only safe if the permission and cache attributes of the
+		 * two entries loaded in the two TLBs are identical (which should
+ * be the case here). But it is generally safer to never allow
+ * small and huge TLB entries for the same virtual address to be
+ * loaded simultaneously. So instead of doing "pmd_populate();
+ * flush_pmd_tlb_range();" we first mark the current pmd
+ * notpresent (atomically because here the pmd_trans_huge must
+ * remain set at all times on the pmd until the split is
+ * complete for this pmd), then we flush the SMP TLB and finally
+ * we write the non-huge version of the pmd entry with
+ * pmd_populate.
+ */
+ old_pmd = pmdp_invalidate(vma, haddr, pmd);
+ page = pmd_page(old_pmd);
+ folio = page_folio(page);
+ if (pmd_dirty(old_pmd)) {
+ dirty = true;
+ folio_set_dirty(folio);
+ }
+ write = pmd_write(old_pmd);
+ young = pmd_young(old_pmd);
+ soft_dirty = pmd_soft_dirty(old_pmd);
+ uffd_wp = pmd_uffd_wp(old_pmd);
+
+ VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ /*
+ * Without "freeze", we'll simply split the PMD, propagating the
+ * PageAnonExclusive() flag for each PTE by setting it for
+ * each subpage -- no need to (temporarily) clear.
+ *
+ * With "freeze" we want to replace mapped pages by
+ * migration entries right away. This is only possible if we
+ * managed to clear PageAnonExclusive() -- see
+ * set_pmd_migration_entry().
+ *
+ * In case we cannot clear PageAnonExclusive(), split the PMD
+ * only and let try_to_migrate_one() fail later.
+ *
+ * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
+ */
+ anon_exclusive = PageAnonExclusive(page);
+ if (freeze && anon_exclusive &&
+ folio_try_share_anon_rmap_pmd(folio, page))
+ freeze = false;
+ if (!freeze) {
+ rmap_t rmap_flags = RMAP_NONE;
+
+ folio_ref_add(folio, HPAGE_PMD_NR - 1);
+ if (anon_exclusive)
+ rmap_flags |= RMAP_EXCLUSIVE;
+ folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+ vma, haddr, rmap_flags);
+ }
+ }
+
+ /*
+ * Withdraw the table only after we mark the pmd entry invalid.
+	 * This is critical for some architectures (Power).
+ */
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
+
+ /*
+ * Note that NUMA hinting access restrictions are not transferred to
+ * avoid any possibility of altering permissions across VMAs.
+ */
+ if (freeze || pmd_migration) {
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ pte_t entry;
+ swp_entry_t swp_entry;
+
+ if (write)
+ swp_entry = make_writable_migration_entry(
+ page_to_pfn(page + i));
+ else if (anon_exclusive)
+ swp_entry = make_readable_exclusive_migration_entry(
+ page_to_pfn(page + i));
+ else
+ swp_entry = make_readable_migration_entry(
+ page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
+ if (dirty)
+ swp_entry = make_migration_entry_dirty(swp_entry);
+ entry = swp_entry_to_pte(swp_entry);
+ if (soft_dirty)
+ entry = pte_swp_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_swp_mkuffd_wp(entry);
+
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+ set_pte_at(mm, addr, pte + i, entry);
+ }
+ } else {
+ pte_t entry;
+
+ entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
+ if (write)
+ entry = pte_mkwrite(entry, vma);
+ if (!young)
+ entry = pte_mkold(entry);
+ /* NOTE: this may set soft-dirty too on some archs */
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ if (soft_dirty)
+ entry = pte_mksoft_dirty(entry);
+ if (uffd_wp)
+ entry = pte_mkuffd_wp(entry);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++)
+ VM_WARN_ON(!pte_none(ptep_get(pte + i)));
+
+ set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
+ }
+ pte_unmap(pte);
+
+ if (!pmd_migration)
+ folio_remove_rmap_pmd(folio, page, vma);
+ if (freeze)
+ put_page(page);
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+}
+
+void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, bool freeze, struct folio *folio)
+{
+ VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
+ VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
+ VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
+ VM_BUG_ON(freeze && !folio);
+
+ /*
+ * When the caller requests to set up a migration entry, we
+ * require a folio to check the PMD against. Otherwise, there
+ * is a risk of replacing the wrong folio.
+ */
+ if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
+ is_pmd_migration_entry(*pmd)) {
+ if (folio && folio != pmd_folio(*pmd))
+ return;
+ __split_huge_pmd_locked(vma, pmd, address, freeze);
+ }
+}
+
+void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+ unsigned long address, bool freeze, struct folio *folio)
+{
+ spinlock_t *ptl;
+ struct mmu_notifier_range range;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+ address & HPAGE_PMD_MASK,
+ (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(&range);
+}
+
+void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
+ bool freeze, struct folio *folio)
+{
+ pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
+
+ if (!pmd)
+ return;
+
+ __split_huge_pmd(vma, pmd, address, freeze, folio);
+}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
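+/*
+ * Replace a present huge pmd mapping of @page with a pmd migration entry.
+ * Returns 0 on success, or -EBUSY (with the original pmd restored) when
+ * the anon-exclusive flag cannot be shared and this pmd cannot be
+ * migrated.
+ */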
+int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+ struct page *page)
+{
+ struct folio *folio = page_folio(page);
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ bool anon_exclusive;
+ pmd_t pmdval;
+ swp_entry_t entry;
+ pmd_t pmdswp;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return 0;
+
+ flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+
+ /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
+ anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
+ if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
+ set_pmd_at(mm, address, pvmw->pmd, pmdval);
+ return -EBUSY;
+ }
+
+ if (pmd_dirty(pmdval))
+ folio_mark_dirty(folio);
+ if (pmd_write(pmdval))
+ entry = make_writable_migration_entry(page_to_pfn(page));
+ else if (anon_exclusive)
+ entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
+ else
+ entry = make_readable_migration_entry(page_to_pfn(page));
+ if (pmd_young(pmdval))
+ entry = make_migration_entry_young(entry);
+ if (pmd_dirty(pmdval))
+ entry = make_migration_entry_dirty(entry);
+ pmdswp = swp_entry_to_pmd(entry);
+ if (pmd_soft_dirty(pmdval))
+ pmdswp = pmd_swp_mksoft_dirty(pmdswp);
+ if (pmd_uffd_wp(pmdval))
+ pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+ set_pmd_at(mm, address, pvmw->pmd, pmdswp);
+ folio_remove_rmap_pmd(folio, page, vma);
+ folio_put(folio);
+ trace_set_migration_pmd(address, pmd_val(pmdswp));
+
+ return 0;
+}
+
+void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
+{
+ struct folio *folio = page_folio(new);
+ struct vm_area_struct *vma = pvmw->vma;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = pvmw->address;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ pmd_t pmde;
+ swp_entry_t entry;
+
+ if (!(pvmw->pmd && !pvmw->pte))
+ return;
+
+ entry = pmd_to_swp_entry(*pvmw->pmd);
+ folio_get(folio);
+ pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
+ if (pmd_swp_soft_dirty(*pvmw->pmd))
+ pmde = pmd_mksoft_dirty(pmde);
+ if (is_writable_migration_entry(entry))
+ pmde = pmd_mkwrite(pmde, vma);
+ if (pmd_swp_uffd_wp(*pvmw->pmd))
+ pmde = pmd_mkuffd_wp(pmde);
+ if (!is_migration_entry_young(entry))
+ pmde = pmd_mkold(pmde);
+	/* NOTE: this may set soft-dirty too on some archs */
+ if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+ pmde = pmd_mkdirty(pmde);
+
+ if (folio_test_anon(folio)) {
+ rmap_t rmap_flags = RMAP_NONE;
+
+ if (!is_readable_migration_entry(entry))
+ rmap_flags |= RMAP_EXCLUSIVE;
+
+ folio_add_anon_rmap_pmd(folio, new, vma, haddr, rmap_flags);
+ } else {
+ folio_add_file_rmap_pmd(folio, new, vma);
+ }
+ VM_BUG_ON(pmd_write(pmde) && folio_test_anon(folio) && !PageAnonExclusive(new));
+ set_pmd_at(mm, haddr, pvmw->pmd, pmde);
+
+ /* No need to invalidate - it was non-present before */
+ update_mmu_cache_pmd(vma, address, pvmw->pmd);
+ trace_remove_migration_pmd(address, pmd_val(pmde));
+}
+#endif
diff --git a/mm/huge_mapping_pud.c b/mm/huge_mapping_pud.c
new file mode 100644
index 000000000000..c3a6bffe2871
--- /dev/null
+++ b/mm/huge_mapping_pud.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2024 Red Hat, Inc.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/coredump.h>
+#include <linux/sched/numa_balancing.h>
+#include <linux/highmem.h>
+#include <linux/hugetlb.h>
+#include <linux/mmu_notifier.h>
+#include <linux/rmap.h>
+#include <linux/swap.h>
+#include <linux/shrinker.h>
+#include <linux/mm_inline.h>
+#include <linux/swapops.h>
+#include <linux/backing-dev.h>
+#include <linux/dax.h>
+#include <linux/mm_types.h>
+#include <linux/khugepaged.h>
+#include <linux/freezer.h>
+#include <linux/pfn_t.h>
+#include <linux/mman.h>
+#include <linux/memremap.h>
+#include <linux/pagemap.h>
+#include <linux/debugfs.h>
+#include <linux/migrate.h>
+#include <linux/hashtable.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/page_idle.h>
+#include <linux/shmem_fs.h>
+#include <linux/oom.h>
+#include <linux/numa.h>
+#include <linux/page_owner.h>
+#include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
+#include <linux/compat.h>
+#include <linux/pgalloc_tag.h>
+
+#include <asm/tlb.h>
+#include <asm/pgalloc.h>
+#include "internal.h"
+#include "swap.h"
+
+/*
+ * Returns page table lock pointer if a given pud maps a thp, NULL otherwise.
+ *
+ * Note that if it returns the page table lock pointer, this routine returns
+ * without unlocking it, so callers must unlock the page table lock themselves.
+ */
+spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
+{
+ spinlock_t *ptl;
+
+ ptl = pud_lock(vma->vm_mm, pud);
+ if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+ return ptl;
+ spin_unlock(ptl);
+ return NULL;
+}
+
+void touch_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t *pud, bool write)
+{
+ pud_t _pud;
+
+ _pud = pud_mkyoung(*pud);
+ if (write)
+ _pud = pud_mkdirty(_pud);
+ if (pudp_set_access_flags(vma, addr & HPAGE_PUD_MASK,
+ pud, _pud, write))
+ update_mmu_cache_pud(vma, addr, pud);
+}
+
+int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
+ pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+ struct vm_area_struct *vma)
+{
+ spinlock_t *dst_ptl, *src_ptl;
+ pud_t pud;
+ int ret;
+
+ dst_ptl = pud_lock(dst_mm, dst_pud);
+ src_ptl = pud_lockptr(src_mm, src_pud);
+ spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+ ret = -EAGAIN;
+ pud = *src_pud;
+ if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+ goto out_unlock;
+
+ /*
+	 * When the page table lock is held, the huge zero pud should not be
+	 * under splitting, since we don't split the page itself, only the pud
+	 * into a page table.
+ */
+ if (is_huge_zero_pud(pud)) {
+ /* No huge zero pud yet */
+ }
+
+ /*
+ * TODO: once we support anonymous pages, use
+ * folio_try_dup_anon_rmap_*() and split if duplicating fails.
+ */
+ pudp_set_wrprotect(src_mm, addr, src_pud);
+ pud = pud_mkold(pud_wrprotect(pud));
+ set_pud_at(dst_mm, addr, dst_pud, pud);
+
+ ret = 0;
+out_unlock:
+ spin_unlock(src_ptl);
+ spin_unlock(dst_ptl);
+ return ret;
+}
+
+void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
+{
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+ vmf->ptl = pud_lock(vmf->vma->vm_mm, vmf->pud);
+ if (unlikely(!pud_same(*vmf->pud, orig_pud)))
+ goto unlock;
+
+ touch_pud(vmf->vma, vmf->address, vmf->pud, write);
+unlock:
+ spin_unlock(vmf->ptl);
+}
+
+/*
+ * Returns:
+ *
+ * - 0: if pud leaf changed from under us
+ * - 1: if pud can be skipped
+ * - HPAGE_PUD_NR: if pud was successfully processed
+ */
+int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pudp, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pud_t oldpud, entry;
+ spinlock_t *ptl;
+
+ tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
+
+ /* NUMA balancing doesn't apply to dax */
+ if (cp_flags & MM_CP_PROT_NUMA)
+ return 1;
+
+ /*
+	 * Userfault-wp on huge entries only works with anonymous memory, and
+	 * we don't have anonymous PUDs yet.
+ */
+ if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
+ return 1;
+
+ ptl = __pud_trans_huge_lock(pudp, vma);
+ if (!ptl)
+ return 0;
+
+ /*
+ * Can't clear PUD or it can race with concurrent zapping. See
+ * change_huge_pmd().
+ */
+ oldpud = pudp_invalidate(vma, addr, pudp);
+ entry = pud_modify(oldpud, newprot);
+ set_pud_at(mm, addr, pudp, entry);
+ tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
+
+ spin_unlock(ptl);
+ return HPAGE_PUD_NR;
+}
+
+int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pud_t *pud, unsigned long addr)
+{
+ spinlock_t *ptl;
+ pud_t orig_pud;
+
+ ptl = __pud_trans_huge_lock(pud, vma);
+ if (!ptl)
+ return 0;
+
+ orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
+ arch_check_zapped_pud(vma, orig_pud);
+ tlb_remove_pud_tlb_entry(tlb, pud, addr);
+ if (vma_is_special_huge(vma)) {
+ spin_unlock(ptl);
+ /* No zero page support yet */
+ } else {
+ /* No support for anonymous PUD pages yet */
+ BUG();
+ }
+ return 1;
+}
+
+static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long haddr)
+{
+ VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
+ VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
+ VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
+ VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
+
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+ count_vm_event(THP_SPLIT_PUD);
+#endif
+
+ pudp_huge_clear_flush(vma, haddr, pud);
+}
+
+void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long address)
+{
+ spinlock_t *ptl;
+ struct mmu_notifier_range range;
+
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
+ address & HPAGE_PUD_MASK,
+ (address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
+ mmu_notifier_invalidate_range_start(&range);
+ ptl = pud_lock(vma->vm_mm, pud);
+ if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+ goto out;
+ __split_huge_pud_locked(vma, pud, range.start);
+
+out:
+ spin_unlock(ptl);
+ mmu_notifier_invalidate_range_end(&range);
+}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 554dec14b768..11aee24ce21a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -838,13 +838,6 @@ static int __init setup_transparent_hugepage(char *str)
}
__setup("transparent_hugepage=", setup_transparent_hugepage);
-pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
-{
- if (likely(vma->vm_flags & VM_WRITE))
- pmd = pmd_mkwrite(pmd, vma);
- return pmd;
-}
-
#ifdef CONFIG_MEMCG
static inline
struct deferred_split *get_deferred_split_queue(struct folio *folio)
@@ -1313,19 +1306,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, bool write)
-{
- pmd_t _pmd;
-
- _pmd = pmd_mkyoung(*pmd);
- if (write)
- _pmd = pmd_mkdirty(_pmd);
- if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
- pmd, _pmd, write))
- update_mmu_cache_pmd(vma, addr, pmd);
-}
-
struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
{
@@ -1366,309 +1346,6 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
return page;
}
-int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
- struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
-{
- spinlock_t *dst_ptl, *src_ptl;
- struct page *src_page;
- struct folio *src_folio;
- pmd_t pmd;
- pgtable_t pgtable = NULL;
- int ret = -ENOMEM;
-
- /* Skip if can be re-fill on fault */
- if (!vma_is_anonymous(dst_vma))
- return 0;
-
- pgtable = pte_alloc_one(dst_mm);
- if (unlikely(!pgtable))
- goto out;
-
- dst_ptl = pmd_lock(dst_mm, dst_pmd);
- src_ptl = pmd_lockptr(src_mm, src_pmd);
- spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-
- ret = -EAGAIN;
- pmd = *src_pmd;
-
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- if (unlikely(is_swap_pmd(pmd))) {
- swp_entry_t entry = pmd_to_swp_entry(pmd);
-
- VM_BUG_ON(!is_pmd_migration_entry(pmd));
- if (!is_readable_migration_entry(entry)) {
- entry = make_readable_migration_entry(
- swp_offset(entry));
- pmd = swp_entry_to_pmd(entry);
- if (pmd_swp_soft_dirty(*src_pmd))
- pmd = pmd_swp_mksoft_dirty(pmd);
- if (pmd_swp_uffd_wp(*src_pmd))
- pmd = pmd_swp_mkuffd_wp(pmd);
- set_pmd_at(src_mm, addr, src_pmd, pmd);
- }
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
- mm_inc_nr_ptes(dst_mm);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- if (!userfaultfd_wp(dst_vma))
- pmd = pmd_swp_clear_uffd_wp(pmd);
- set_pmd_at(dst_mm, addr, dst_pmd, pmd);
- ret = 0;
- goto out_unlock;
- }
-#endif
-
- if (unlikely(!pmd_trans_huge(pmd))) {
- pte_free(dst_mm, pgtable);
- goto out_unlock;
- }
- /*
- * When page table lock is held, the huge zero pmd should not be
- * under splitting since we don't split the page itself, only pmd to
- * a page table.
- */
- if (is_huge_zero_pmd(pmd)) {
- /*
- * mm_get_huge_zero_folio() will never allocate a new
- * folio here, since we already have a zero page to
- * copy. It just takes a reference.
- */
- mm_get_huge_zero_folio(dst_mm);
- goto out_zero_page;
- }
-
- src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
- src_folio = page_folio(src_page);
-
- folio_get(src_folio);
- if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, src_vma))) {
- /* Page maybe pinned: split and retry the fault on PTEs. */
- folio_put(src_folio);
- pte_free(dst_mm, pgtable);
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
- __split_huge_pmd(src_vma, src_pmd, addr, false, NULL);
- return -EAGAIN;
- }
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-out_zero_page:
- mm_inc_nr_ptes(dst_mm);
- pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
- pmdp_set_wrprotect(src_mm, addr, src_pmd);
- if (!userfaultfd_wp(dst_vma))
- pmd = pmd_clear_uffd_wp(pmd);
- pmd = pmd_mkold(pmd_wrprotect(pmd));
- set_pmd_at(dst_mm, addr, dst_pmd, pmd);
-
- ret = 0;
-out_unlock:
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
-out:
- return ret;
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-void touch_pud(struct vm_area_struct *vma, unsigned long addr,
- pud_t *pud, bool write)
-{
- pud_t _pud;
-
- _pud = pud_mkyoung(*pud);
- if (write)
- _pud = pud_mkdirty(_pud);
- if (pudp_set_access_flags(vma, addr & HPAGE_PUD_MASK,
- pud, _pud, write))
- update_mmu_cache_pud(vma, addr, pud);
-}
-
-int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
- pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
- struct vm_area_struct *vma)
-{
- spinlock_t *dst_ptl, *src_ptl;
- pud_t pud;
- int ret;
-
- dst_ptl = pud_lock(dst_mm, dst_pud);
- src_ptl = pud_lockptr(src_mm, src_pud);
- spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-
- ret = -EAGAIN;
- pud = *src_pud;
- if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
- goto out_unlock;
-
- /*
- * When page table lock is held, the huge zero pud should not be
- * under splitting since we don't split the page itself, only pud to
- * a page table.
- */
- if (is_huge_zero_pud(pud)) {
- /* No huge zero pud yet */
- }
-
- /*
- * TODO: once we support anonymous pages, use
- * folio_try_dup_anon_rmap_*() and split if duplicating fails.
- */
- pudp_set_wrprotect(src_mm, addr, src_pud);
- pud = pud_mkold(pud_wrprotect(pud));
- set_pud_at(dst_mm, addr, dst_pud, pud);
-
- ret = 0;
-out_unlock:
- spin_unlock(src_ptl);
- spin_unlock(dst_ptl);
- return ret;
-}
-
-void huge_pud_set_accessed(struct vm_fault *vmf, pud_t orig_pud)
-{
- bool write = vmf->flags & FAULT_FLAG_WRITE;
-
- vmf->ptl = pud_lock(vmf->vma->vm_mm, vmf->pud);
- if (unlikely(!pud_same(*vmf->pud, orig_pud)))
- goto unlock;
-
- touch_pud(vmf->vma, vmf->address, vmf->pud, write);
-unlock:
- spin_unlock(vmf->ptl);
-}
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-
-void huge_pmd_set_accessed(struct vm_fault *vmf)
-{
- bool write = vmf->flags & FAULT_FLAG_WRITE;
-
- vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_same(*vmf->pmd, vmf->orig_pmd)))
- goto unlock;
-
- touch_pmd(vmf->vma, vmf->address, vmf->pmd, write);
-
-unlock:
- spin_unlock(vmf->ptl);
-}
-
-vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf)
-{
- const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
- struct vm_area_struct *vma = vmf->vma;
- struct folio *folio;
- struct page *page;
- unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
- pmd_t orig_pmd = vmf->orig_pmd;
-
- vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
- VM_BUG_ON_VMA(!vma->anon_vma, vma);
-
- if (is_huge_zero_pmd(orig_pmd))
- goto fallback;
-
- spin_lock(vmf->ptl);
-
- if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
- spin_unlock(vmf->ptl);
- return 0;
- }
-
- page = pmd_page(orig_pmd);
- folio = page_folio(page);
- VM_BUG_ON_PAGE(!PageHead(page), page);
-
- /* Early check when only holding the PT lock. */
- if (PageAnonExclusive(page))
- goto reuse;
-
- if (!folio_trylock(folio)) {
- folio_get(folio);
- spin_unlock(vmf->ptl);
- folio_lock(folio);
- spin_lock(vmf->ptl);
- if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
- spin_unlock(vmf->ptl);
- folio_unlock(folio);
- folio_put(folio);
- return 0;
- }
- folio_put(folio);
- }
-
- /* Recheck after temporarily dropping the PT lock. */
- if (PageAnonExclusive(page)) {
- folio_unlock(folio);
- goto reuse;
- }
-
- /*
- * See do_wp_page(): we can only reuse the folio exclusively if
- * there are no additional references. Note that we always drain
- * the LRU cache immediately after adding a THP.
- */
- if (folio_ref_count(folio) >
- 1 + folio_test_swapcache(folio) * folio_nr_pages(folio))
- goto unlock_fallback;
- if (folio_test_swapcache(folio))
- folio_free_swap(folio);
- if (folio_ref_count(folio) == 1) {
- pmd_t entry;
-
- folio_move_anon_rmap(folio, vma);
- SetPageAnonExclusive(page);
- folio_unlock(folio);
-reuse:
- if (unlikely(unshare)) {
- spin_unlock(vmf->ptl);
- return 0;
- }
- entry = pmd_mkyoung(orig_pmd);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
- update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
- spin_unlock(vmf->ptl);
- return 0;
- }
-
-unlock_fallback:
- folio_unlock(folio);
- spin_unlock(vmf->ptl);
-fallback:
- __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
- return VM_FAULT_FALLBACK;
-}
-
-static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd)
-{
- struct page *page;
-
- if (WARN_ON_ONCE(!(vma->vm_flags & VM_WRITE)))
- return false;
-
- /* Don't touch entries that are not even readable (NUMA hinting). */
- if (pmd_protnone(pmd))
- return false;
-
- /* Do we need write faults for softdirty tracking? */
- if (pmd_needs_soft_dirty_wp(vma, pmd))
- return false;
-
- /* Do we need write faults for uffd-wp tracking? */
- if (userfaultfd_huge_pmd_wp(vma, pmd))
- return false;
-
- if (!(vma->vm_flags & VM_SHARED)) {
- /* See can_change_pte_writable(). */
- page = vm_normal_page_pmd(vma, addr, pmd);
- return page && PageAnon(page) && PageAnonExclusive(page);
- }
-
- /* See can_change_pte_writable(). */
- return pmd_dirty(pmd);
-}
-
/* NUMA hinting page fault entry point for trans huge pmds */
vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
{
@@ -1830,342 +1507,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
return ret;
}
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
- pgtable_t pgtable;
-
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pte_free(mm, pgtable);
- mm_dec_nr_ptes(mm);
-}
-
-int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long addr)
-{
- pmd_t orig_pmd;
- spinlock_t *ptl;
-
- tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
-
- ptl = __pmd_trans_huge_lock(pmd, vma);
- if (!ptl)
- return 0;
- /*
- * For architectures like ppc64 we look at deposited pgtable
- * when calling pmdp_huge_get_and_clear. So do the
- * pgtable_trans_huge_withdraw after finishing pmdp related
- * operations.
- */
- orig_pmd = pmdp_huge_get_and_clear_full(vma, addr, pmd,
- tlb->fullmm);
- arch_check_zapped_pmd(vma, orig_pmd);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- if (vma_is_special_huge(vma)) {
- if (arch_needs_pgtable_deposit())
- zap_deposited_table(tlb->mm, pmd);
- spin_unlock(ptl);
- } else if (is_huge_zero_pmd(orig_pmd)) {
- zap_deposited_table(tlb->mm, pmd);
- spin_unlock(ptl);
- } else {
- struct folio *folio = NULL;
- int flush_needed = 1;
-
- if (pmd_present(orig_pmd)) {
- struct page *page = pmd_page(orig_pmd);
-
- folio = page_folio(page);
- folio_remove_rmap_pmd(folio, page, vma);
- WARN_ON_ONCE(folio_mapcount(folio) < 0);
- VM_BUG_ON_PAGE(!PageHead(page), page);
- } else if (thp_migration_supported()) {
- swp_entry_t entry;
-
- VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
- entry = pmd_to_swp_entry(orig_pmd);
- folio = pfn_swap_entry_folio(entry);
- flush_needed = 0;
- } else
- WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
-
- if (folio_test_anon(folio)) {
- zap_deposited_table(tlb->mm, pmd);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- } else {
- if (arch_needs_pgtable_deposit())
- zap_deposited_table(tlb->mm, pmd);
- add_mm_counter(tlb->mm, mm_counter_file(folio),
- -HPAGE_PMD_NR);
- }
-
- spin_unlock(ptl);
- if (flush_needed)
- tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
- }
- return 1;
-}
-
-#ifndef pmd_move_must_withdraw
-static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
- spinlock_t *old_pmd_ptl,
- struct vm_area_struct *vma)
-{
- /*
- * With split pmd lock we also need to move preallocated
- * PTE page table if new_pmd is on different PMD page table.
- *
- * We also don't deposit and withdraw tables for file pages.
- */
- return (new_pmd_ptl != old_pmd_ptl) && vma_is_anonymous(vma);
-}
-#endif
-
-static pmd_t move_soft_dirty_pmd(pmd_t pmd)
-{
-#ifdef CONFIG_MEM_SOFT_DIRTY
- if (unlikely(is_pmd_migration_entry(pmd)))
- pmd = pmd_swp_mksoft_dirty(pmd);
- else if (pmd_present(pmd))
- pmd = pmd_mksoft_dirty(pmd);
-#endif
- return pmd;
-}
-
-bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
- unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd)
-{
- spinlock_t *old_ptl, *new_ptl;
- pmd_t pmd;
- struct mm_struct *mm = vma->vm_mm;
- bool force_flush = false;
-
- /*
- * The destination pmd shouldn't be established, free_pgtables()
- * should have released it; but move_page_tables() might have already
- * inserted a page table, if racing against shmem/file collapse.
- */
- if (!pmd_none(*new_pmd)) {
- VM_BUG_ON(pmd_trans_huge(*new_pmd));
- return false;
- }
-
- /*
- * We don't have to worry about the ordering of src and dst
- * ptlocks because exclusive mmap_lock prevents deadlock.
- */
- old_ptl = __pmd_trans_huge_lock(old_pmd, vma);
- if (old_ptl) {
- new_ptl = pmd_lockptr(mm, new_pmd);
- if (new_ptl != old_ptl)
- spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
- pmd = pmdp_huge_get_and_clear(mm, old_addr, old_pmd);
- if (pmd_present(pmd))
- force_flush = true;
- VM_BUG_ON(!pmd_none(*new_pmd));
-
- if (pmd_move_must_withdraw(new_ptl, old_ptl, vma)) {
- pgtable_t pgtable;
- pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
- pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
- }
- pmd = move_soft_dirty_pmd(pmd);
- set_pmd_at(mm, new_addr, new_pmd, pmd);
- if (force_flush)
- flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
- if (new_ptl != old_ptl)
- spin_unlock(new_ptl);
- spin_unlock(old_ptl);
- return true;
- }
- return false;
-}
-
-/*
- * Returns
- * - 0 if PMD could not be locked
- * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
- * or if prot_numa but THP migration is not supported
- * - HPAGE_PMD_NR if protections changed and TLB flush necessary
- */
-int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags)
-{
- struct mm_struct *mm = vma->vm_mm;
- spinlock_t *ptl;
- pmd_t oldpmd, entry;
- bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
- bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
- bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
- int ret = 1;
-
- tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
-
- if (prot_numa && !thp_migration_supported())
- return 1;
-
- ptl = __pmd_trans_huge_lock(pmd, vma);
- if (!ptl)
- return 0;
-
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
- if (is_swap_pmd(*pmd)) {
- swp_entry_t entry = pmd_to_swp_entry(*pmd);
- struct folio *folio = pfn_swap_entry_folio(entry);
- pmd_t newpmd;
-
- VM_BUG_ON(!is_pmd_migration_entry(*pmd));
- if (is_writable_migration_entry(entry)) {
- /*
- * A protection check is difficult so
- * just be safe and disable write
- */
- if (folio_test_anon(folio))
- entry = make_readable_exclusive_migration_entry(swp_offset(entry));
- else
- entry = make_readable_migration_entry(swp_offset(entry));
- newpmd = swp_entry_to_pmd(entry);
- if (pmd_swp_soft_dirty(*pmd))
- newpmd = pmd_swp_mksoft_dirty(newpmd);
- } else {
- newpmd = *pmd;
- }
-
- if (uffd_wp)
- newpmd = pmd_swp_mkuffd_wp(newpmd);
- else if (uffd_wp_resolve)
- newpmd = pmd_swp_clear_uffd_wp(newpmd);
- if (!pmd_same(*pmd, newpmd))
- set_pmd_at(mm, addr, pmd, newpmd);
- goto unlock;
- }
-#endif
-
- if (prot_numa) {
- struct folio *folio;
- bool toptier;
- /*
- * Avoid trapping faults against the zero page. The read-only
- * data is likely to be read-cached on the local CPU and
- * local/remote hits to the zero page are not interesting.
- */
- if (is_huge_zero_pmd(*pmd))
- goto unlock;
-
- if (pmd_protnone(*pmd))
- goto unlock;
-
- folio = pmd_folio(*pmd);
- toptier = node_is_toptier(folio_nid(folio));
- /*
- * Skip scanning top tier node if normal numa
- * balancing is disabled
- */
- if (!(sysctl_numa_balancing_mode & NUMA_BALANCING_NORMAL) &&
- toptier)
- goto unlock;
-
- if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING &&
- !toptier)
- folio_xchg_access_time(folio,
- jiffies_to_msecs(jiffies));
- }
- /*
- * In case prot_numa, we are under mmap_read_lock(mm). It's critical
- * to not clear pmd intermittently to avoid race with MADV_DONTNEED
- * which is also under mmap_read_lock(mm):
- *
- * CPU0: CPU1:
- * change_huge_pmd(prot_numa=1)
- * pmdp_huge_get_and_clear_notify()
- * madvise_dontneed()
- * zap_pmd_range()
- * pmd_trans_huge(*pmd) == 0 (without ptl)
- * // skip the pmd
- * set_pmd_at();
- * // pmd is re-established
- *
- * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
- * which may break userspace.
- *
- * pmdp_invalidate_ad() is required to make sure we don't miss
- * dirty/young flags set by hardware.
- */
- oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
-
- entry = pmd_modify(oldpmd, newprot);
- if (uffd_wp)
- entry = pmd_mkuffd_wp(entry);
- else if (uffd_wp_resolve)
- /*
- * Leave the write bit to be handled by PF interrupt
- * handler, then things like COW could be properly
- * handled.
- */
- entry = pmd_clear_uffd_wp(entry);
-
- /* See change_pte_range(). */
- if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
- can_change_pmd_writable(vma, addr, entry))
- entry = pmd_mkwrite(entry, vma);
-
- ret = HPAGE_PMD_NR;
- set_pmd_at(mm, addr, pmd, entry);
-
- if (huge_pmd_needs_flush(oldpmd, entry))
- tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
-unlock:
- spin_unlock(ptl);
- return ret;
-}
-
-/*
- * Returns:
- *
- * - 0: if pud leaf changed from under us
- * - 1: if pud can be skipped
- * - HPAGE_PUD_NR: if pud was successfully processed
- */
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pud_t *pudp, unsigned long addr, pgprot_t newprot,
- unsigned long cp_flags)
-{
- struct mm_struct *mm = vma->vm_mm;
- pud_t oldpud, entry;
- spinlock_t *ptl;
-
- tlb_change_page_size(tlb, HPAGE_PUD_SIZE);
-
- /* NUMA balancing doesn't apply to dax */
- if (cp_flags & MM_CP_PROT_NUMA)
- return 1;
-
- /*
- * Huge entries on userfault-wp only works with anonymous, while we
- * don't have anonymous PUDs yet.
- */
- if (WARN_ON_ONCE(cp_flags & MM_CP_UFFD_WP_ALL))
- return 1;
-
- ptl = __pud_trans_huge_lock(pudp, vma);
- if (!ptl)
- return 0;
-
- /*
- * Can't clear PUD or it can race with concurrent zapping. See
- * change_huge_pmd().
- */
- oldpud = pudp_invalidate(vma, addr, pudp);
- entry = pud_modify(oldpud, newprot);
- set_pud_at(mm, addr, pudp, entry);
- tlb_flush_pud_range(tlb, addr, HPAGE_PUD_SIZE);
-
- spin_unlock(ptl);
- return HPAGE_PUD_NR;
-}
-#endif
-
#ifdef CONFIG_USERFAULTFD
/*
* The PT lock for src_pmd and dst_vma/src_vma (for reading) are locked by
@@ -2306,105 +1647,8 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
}
#endif /* CONFIG_USERFAULTFD */
-/*
- * Returns page table lock pointer if a given pmd maps a thp, NULL otherwise.
- *
- * Note that if it returns page table lock pointer, this routine returns without
- * unlocking page table lock. So callers must unlock it.
- */
-spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
-{
- spinlock_t *ptl;
- ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
- return ptl;
- spin_unlock(ptl);
- return NULL;
-}
-
-/*
- * Returns page table lock pointer if a given pud maps a thp, NULL otherwise.
- *
- * Note that if it returns page table lock pointer, this routine returns without
- * unlocking page table lock. So callers must unlock it.
- */
-spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
-{
- spinlock_t *ptl;
-
- ptl = pud_lock(vma->vm_mm, pud);
- if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
- return ptl;
- spin_unlock(ptl);
- return NULL;
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr)
-{
- spinlock_t *ptl;
- pud_t orig_pud;
-
- ptl = __pud_trans_huge_lock(pud, vma);
- if (!ptl)
- return 0;
-
- orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
- arch_check_zapped_pud(vma, orig_pud);
- tlb_remove_pud_tlb_entry(tlb, pud, addr);
- if (vma_is_special_huge(vma)) {
- spin_unlock(ptl);
- /* No zero page support yet */
- } else {
- /* No support for anonymous PUD pages yet */
- BUG();
- }
- return 1;
-}
-
-static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
- unsigned long haddr)
-{
- VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
- VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
- VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
- VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
-
- count_vm_event(THP_SPLIT_PUD);
-
- pudp_huge_clear_flush(vma, haddr, pud);
-}
-
-void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
- unsigned long address)
-{
- spinlock_t *ptl;
- struct mmu_notifier_range range;
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
- address & HPAGE_PUD_MASK,
- (address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
- mmu_notifier_invalidate_range_start(&range);
- ptl = pud_lock(vma->vm_mm, pud);
- if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
- goto out;
- __split_huge_pud_locked(vma, pud, range.start);
-
-out:
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(&range);
-}
-#else
-void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
- unsigned long address)
-{
-}
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-
-static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
- unsigned long haddr, pmd_t *pmd)
+void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
{
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
@@ -2444,274 +1688,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pmd_populate(mm, pmd, pgtable);
}
-static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long haddr, bool freeze)
-{
- struct mm_struct *mm = vma->vm_mm;
- struct folio *folio;
- struct page *page;
- pgtable_t pgtable;
- pmd_t old_pmd, _pmd;
- bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
- bool anon_exclusive = false, dirty = false;
- unsigned long addr;
- pte_t *pte;
- int i;
-
- VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
- VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
- VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
- && !pmd_devmap(*pmd));
-
- count_vm_event(THP_SPLIT_PMD);
-
- if (!vma_is_anonymous(vma)) {
- old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
- /*
- * We are going to unmap this huge page. So
- * just go ahead and zap it
- */
- if (arch_needs_pgtable_deposit())
- zap_deposited_table(mm, pmd);
- if (vma_is_special_huge(vma))
- return;
- if (unlikely(is_pmd_migration_entry(old_pmd))) {
- swp_entry_t entry;
-
- entry = pmd_to_swp_entry(old_pmd);
- folio = pfn_swap_entry_folio(entry);
- } else {
- page = pmd_page(old_pmd);
- folio = page_folio(page);
- if (!folio_test_dirty(folio) && pmd_dirty(old_pmd))
- folio_mark_dirty(folio);
- if (!folio_test_referenced(folio) && pmd_young(old_pmd))
- folio_set_referenced(folio);
- folio_remove_rmap_pmd(folio, page, vma);
- folio_put(folio);
- }
- add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
- return;
- }
-
- if (is_huge_zero_pmd(*pmd)) {
- /*
- * FIXME: Do we want to invalidate secondary mmu by calling
- * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
- * inside __split_huge_pmd() ?
- *
- * We are going from a zero huge page write protected to zero
- * small page also write protected so it does not seems useful
- * to invalidate secondary mmu at this time.
- */
- return __split_huge_zero_page_pmd(vma, haddr, pmd);
- }
-
- pmd_migration = is_pmd_migration_entry(*pmd);
- if (unlikely(pmd_migration)) {
- swp_entry_t entry;
-
- old_pmd = *pmd;
- entry = pmd_to_swp_entry(old_pmd);
- page = pfn_swap_entry_to_page(entry);
- write = is_writable_migration_entry(entry);
- if (PageAnon(page))
- anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = is_migration_entry_young(entry);
- dirty = is_migration_entry_dirty(entry);
- soft_dirty = pmd_swp_soft_dirty(old_pmd);
- uffd_wp = pmd_swp_uffd_wp(old_pmd);
- } else {
- /*
- * Up to this point the pmd is present and huge and userland has
- * the whole access to the hugepage during the split (which
- * happens in place). If we overwrite the pmd with the not-huge
- * version pointing to the pte here (which of course we could if
- * all CPUs were bug free), userland could trigger a small page
- * size TLB miss on the small sized TLB while the hugepage TLB
- * entry is still established in the huge TLB. Some CPU doesn't
- * like that. See
- * http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf, Erratum
- * 383 on page 105. Intel should be safe but is also warns that
- * it's only safe if the permission and cache attributes of the
- * two entries loaded in the two TLB is identical (which should
- * be the case here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual address to be
- * loaded simultaneously. So instead of doing "pmd_populate();
- * flush_pmd_tlb_range();" we first mark the current pmd
- * notpresent (atomically because here the pmd_trans_huge must
- * remain set at all times on the pmd until the split is
- * complete for this pmd), then we flush the SMP TLB and finally
- * we write the non-huge version of the pmd entry with
- * pmd_populate.
- */
- old_pmd = pmdp_invalidate(vma, haddr, pmd);
- page = pmd_page(old_pmd);
- folio = page_folio(page);
- if (pmd_dirty(old_pmd)) {
- dirty = true;
- folio_set_dirty(folio);
- }
- write = pmd_write(old_pmd);
- young = pmd_young(old_pmd);
- soft_dirty = pmd_soft_dirty(old_pmd);
- uffd_wp = pmd_uffd_wp(old_pmd);
-
- VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
- VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
-
- /*
- * Without "freeze", we'll simply split the PMD, propagating the
- * PageAnonExclusive() flag for each PTE by setting it for
- * each subpage -- no need to (temporarily) clear.
- *
- * With "freeze" we want to replace mapped pages by
- * migration entries right away. This is only possible if we
- * managed to clear PageAnonExclusive() -- see
- * set_pmd_migration_entry().
- *
- * In case we cannot clear PageAnonExclusive(), split the PMD
- * only and let try_to_migrate_one() fail later.
- *
- * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
- */
- anon_exclusive = PageAnonExclusive(page);
- if (freeze && anon_exclusive &&
- folio_try_share_anon_rmap_pmd(folio, page))
- freeze = false;
- if (!freeze) {
- rmap_t rmap_flags = RMAP_NONE;
-
- folio_ref_add(folio, HPAGE_PMD_NR - 1);
- if (anon_exclusive)
- rmap_flags |= RMAP_EXCLUSIVE;
- folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
- vma, haddr, rmap_flags);
- }
- }
-
- /*
- * Withdraw the table only after we mark the pmd entry invalid.
- * This's critical for some architectures (Power).
- */
- pgtable = pgtable_trans_huge_withdraw(mm, pmd);
- pmd_populate(mm, &_pmd, pgtable);
-
- pte = pte_offset_map(&_pmd, haddr);
- VM_BUG_ON(!pte);
-
- /*
- * Note that NUMA hinting access restrictions are not transferred to
- * avoid any possibility of altering permissions across VMAs.
- */
- if (freeze || pmd_migration) {
- for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry;
- swp_entry_t swp_entry;
-
- if (write)
- swp_entry = make_writable_migration_entry(
- page_to_pfn(page + i));
- else if (anon_exclusive)
- swp_entry = make_readable_exclusive_migration_entry(
- page_to_pfn(page + i));
- else
- swp_entry = make_readable_migration_entry(
- page_to_pfn(page + i));
- if (young)
- swp_entry = make_migration_entry_young(swp_entry);
- if (dirty)
- swp_entry = make_migration_entry_dirty(swp_entry);
- entry = swp_entry_to_pte(swp_entry);
- if (soft_dirty)
- entry = pte_swp_mksoft_dirty(entry);
- if (uffd_wp)
- entry = pte_swp_mkuffd_wp(entry);
-
- VM_WARN_ON(!pte_none(ptep_get(pte + i)));
- set_pte_at(mm, addr, pte + i, entry);
- }
- } else {
- pte_t entry;
-
- entry = mk_pte(page, READ_ONCE(vma->vm_page_prot));
- if (write)
- entry = pte_mkwrite(entry, vma);
- if (!young)
- entry = pte_mkold(entry);
- /* NOTE: this may set soft-dirty too on some archs */
- if (dirty)
- entry = pte_mkdirty(entry);
- if (soft_dirty)
- entry = pte_mksoft_dirty(entry);
- if (uffd_wp)
- entry = pte_mkuffd_wp(entry);
-
- for (i = 0; i < HPAGE_PMD_NR; i++)
- VM_WARN_ON(!pte_none(ptep_get(pte + i)));
-
- set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);
- }
- pte_unmap(pte);
-
- if (!pmd_migration)
- folio_remove_rmap_pmd(folio, page, vma);
- if (freeze)
- put_page(page);
-
- smp_wmb(); /* make pte visible before pmd */
- pmd_populate(mm, pmd, pgtable);
-}
-
-void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, bool freeze, struct folio *folio)
-{
- VM_WARN_ON_ONCE(folio && !folio_test_pmd_mappable(folio));
- VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
- VM_WARN_ON_ONCE(folio && !folio_test_locked(folio));
- VM_BUG_ON(freeze && !folio);
-
- /*
- * When the caller requests to set up a migration entry, we
- * require a folio to check the PMD against. Otherwise, there
- * is a risk of replacing the wrong folio.
- */
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
- if (folio && folio != pmd_folio(*pmd))
- return;
- __split_huge_pmd_locked(vma, pmd, address, freeze);
- }
-}
-
-void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long address, bool freeze, struct folio *folio)
-{
- spinlock_t *ptl;
- struct mmu_notifier_range range;
-
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
- address & HPAGE_PMD_MASK,
- (address & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
- mmu_notifier_invalidate_range_start(&range);
- ptl = pmd_lock(vma->vm_mm, pmd);
- split_huge_pmd_locked(vma, range.start, pmd, freeze, folio);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(&range);
-}
-
-void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
- bool freeze, struct folio *folio)
-{
- pmd_t *pmd = mm_find_pmd(vma->vm_mm, address);
-
- if (!pmd)
- return;
-
- __split_huge_pmd(vma, pmd, address, freeze, folio);
-}
-
static inline void split_huge_pmd_if_needed(struct vm_area_struct *vma, unsigned long address)
{
/*
@@ -3772,100 +2748,3 @@ static int __init split_huge_pages_debugfs(void)
late_initcall(split_huge_pages_debugfs);
#endif
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
- struct page *page)
-{
- struct folio *folio = page_folio(page);
- struct vm_area_struct *vma = pvmw->vma;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address = pvmw->address;
- bool anon_exclusive;
- pmd_t pmdval;
- swp_entry_t entry;
- pmd_t pmdswp;
-
- if (!(pvmw->pmd && !pvmw->pte))
- return 0;
-
- flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
- pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
-
- /* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
- anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
- if (anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page)) {
- set_pmd_at(mm, address, pvmw->pmd, pmdval);
- return -EBUSY;
- }
-
- if (pmd_dirty(pmdval))
- folio_mark_dirty(folio);
- if (pmd_write(pmdval))
- entry = make_writable_migration_entry(page_to_pfn(page));
- else if (anon_exclusive)
- entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
- else
- entry = make_readable_migration_entry(page_to_pfn(page));
- if (pmd_young(pmdval))
- entry = make_migration_entry_young(entry);
- if (pmd_dirty(pmdval))
- entry = make_migration_entry_dirty(entry);
- pmdswp = swp_entry_to_pmd(entry);
- if (pmd_soft_dirty(pmdval))
- pmdswp = pmd_swp_mksoft_dirty(pmdswp);
- if (pmd_uffd_wp(pmdval))
- pmdswp = pmd_swp_mkuffd_wp(pmdswp);
- set_pmd_at(mm, address, pvmw->pmd, pmdswp);
- folio_remove_rmap_pmd(folio, page, vma);
- folio_put(folio);
- trace_set_migration_pmd(address, pmd_val(pmdswp));
-
- return 0;
-}
-
-void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
-{
- struct folio *folio = page_folio(new);
- struct vm_area_struct *vma = pvmw->vma;
- struct mm_struct *mm = vma->vm_mm;
- unsigned long address = pvmw->address;
- unsigned long haddr = address & HPAGE_PMD_MASK;
- pmd_t pmde;
- swp_entry_t entry;
-
- if (!(pvmw->pmd && !pvmw->pte))
- return;
-
- entry = pmd_to_swp_entry(*pvmw->pmd);
- folio_get(folio);
- pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
- if (pmd_swp_soft_dirty(*pvmw->pmd))
- pmde = pmd_mksoft_dirty(pmde);
- if (is_writable_migration_entry(entry))
- pmde = pmd_mkwrite(pmde, vma);
- if (pmd_swp_uffd_wp(*pvmw->pmd))
- pmde = pmd_mkuffd_wp(pmde);
- if (!is_migration_entry_young(entry))
- pmde = pmd_mkold(pmde);
- /* NOTE: this may contain setting soft-dirty on some archs */
- if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
- pmde = pmd_mkdirty(pmde);
-
- if (folio_test_anon(folio)) {
- rmap_t rmap_flags = RMAP_NONE;
-
- if (!is_readable_migration_entry(entry))
- rmap_flags |= RMAP_EXCLUSIVE;
-
- folio_add_anon_rmap_pmd(folio, new, vma, haddr, rmap_flags);
- } else {
- folio_add_file_rmap_pmd(folio, new, vma);
- }
- VM_BUG_ON(pmd_write(pmde) && folio_test_anon(folio) && !PageAnonExclusive(new));
- set_pmd_at(mm, haddr, pvmw->pmd, pmde);
-
- /* No need to invalidate - it was non-present before */
- update_mmu_cache_pmd(vma, address, pvmw->pmd);
- trace_remove_migration_pmd(address, pmd_val(pmde));
-}
-#endif
--
2.45.0
* [PATCH RFC 6/6] mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf()
2024-07-17 22:02 [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings Peter Xu
` (4 preceding siblings ...)
2024-07-17 22:02 ` [PATCH RFC 5/6] mm/huge_mapping: Create huge_mapping_pxx.c Peter Xu
@ 2024-07-17 22:02 ` Peter Xu
2024-07-22 13:29 ` [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings David Hildenbrand
6 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-07-17 22:02 UTC (permalink / raw)
To: linux-kernel, linux-mm
Cc: Vlastimil Babka, peterx, David Hildenbrand, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
This patch converts all such checks into a single *_leaf() check under
common mm/, as "thp+devmap" should compose everything a *_leaf() covers
for now. I didn't touch arch code in other directories yet, as some
arches may need special attention, so those are left separate.
It should start to save some cycles on such checks and pave the way for
new leaf types. E.g., when a new type of leaf is introduced, it'll
naturally follow the same route that thp+devmap takes now.
One issue with the pxx_leaf() API is that it is defined by the arch but
doesn't take the kernel config into account. For example, the "if"
branch below cannot be automatically optimized out:
if (pmd_leaf()) { ... }
even if both THP and HUGETLB are disabled (which means pmd_leaf() can
never return true).
To give compilers a chance to optimize and omit code when possible,
introduce light wrappers called pxx_is_leaf(). These take the kernel
config into account and allow branches to be omitted when the compiler
knows they will always return false. This mimics what we used to have
with pxx_trans_huge() when !THP, and now also applies to the pxx_leaf()
API.
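For illustration, a caller could look like the sketch below (everything
except pmd_is_leaf() is a made-up name, not code from this series):

static void zap_one_pmd(struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr)
{
	if (pmd_is_leaf(*pmd)) {
		/*
		 * With CONFIG_PGTABLE_HAS_PMD_LEAVES=n, pmd_is_leaf() is
		 * the constant "false", so the compiler can drop this
		 * whole branch even when the arch's pmd_leaf() is not a
		 * compile-time constant.
		 */
		zap_huge_pmd_leaf(vma, pmd, addr);	/* hypothetical */
		return;
	}
	zap_pte_level(vma, pmd, addr);			/* hypothetical */
}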
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
include/linux/huge_mm.h | 6 +++---
include/linux/pgtable.h | 30 +++++++++++++++++++++++++++++-
mm/hmm.c | 4 ++--
mm/huge_mapping_pmd.c | 9 +++------
mm/huge_mapping_pud.c | 6 +++---
mm/mapping_dirty_helpers.c | 4 ++--
mm/memory.c | 14 ++++++--------
mm/migrate_device.c | 2 +-
mm/mprotect.c | 4 ++--
mm/mremap.c | 5 ++---
mm/page_vma_mapped.c | 5 ++---
mm/pgtable-generic.c | 7 +++----
12 files changed, 58 insertions(+), 38 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aea2784df8ef..a5b026d0731e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -27,7 +27,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma);
static inline spinlock_t *
pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
{
- if (pud_trans_huge(*pud) || pud_devmap(*pud))
+ if (pud_is_leaf(*pud))
return __pud_trans_huge_lock(pud, vma);
else
return NULL;
@@ -36,7 +36,7 @@ pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
#define split_huge_pud(__vma, __pud, __address) \
do { \
pud_t *____pud = (__pud); \
- if (pud_trans_huge(*____pud) || pud_devmap(*____pud)) \
+ if (pud_is_leaf(*____pud)) \
__split_huge_pud(__vma, __pud, __address); \
} while (0)
#else /* CONFIG_PGTABLE_HAS_PUD_LEAVES */
@@ -125,7 +125,7 @@ static inline int is_swap_pmd(pmd_t pmd)
static inline spinlock_t *
pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_is_leaf(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e505373b113..af7709a132aa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1641,7 +1641,7 @@ static inline int pud_trans_unstable(pud_t *pud)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
- if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+ if (pud_none(pudval) || pud_leaf(pudval))
return 1;
if (unlikely(pud_bad(pudval))) {
pud_clear_bad(pud);
@@ -1901,6 +1901,34 @@ typedef unsigned int pgtbl_mod_mask;
#define pmd_leaf(x) false
#endif
+/*
+ * Wrapper of pxx_leaf() helpers.
+ *
+ * Comparing to pxx_leaf() API, the only difference is: using these macros
+ * can help code generation, so unnecessary code can be omitted when the
+ * specific level of leaf is not possible due to kernel config. It is
+ * needed because normally pxx_leaf() can be defined in arch code without
+ * knowing the kernel config.
+ *
+ * Currently we only need pmd/pud versions, because the largest leaf Linux
+ * supports so far is pud.
+ *
+ * Defining here also means that in arch's pgtable headers these macros
+ * cannot be used, pxx_leaf()s need to be used instead, because this file
+ * will not be included in arch's pgtable headers.
+ */
+#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
+#define pmd_is_leaf(x) pmd_leaf(x)
+#else
+#define pmd_is_leaf(x) false
+#endif
+
+#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
+#define pud_is_leaf(x) pud_leaf(x)
+#else
+#define pud_is_leaf(x) false
+#endif
+
#ifndef pgd_leaf_size
#define pgd_leaf_size(x) (1ULL << PGDIR_SHIFT)
#endif
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229ae4a5a..8d985bbbfee9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -351,7 +351,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}
- if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+ if (pmd_is_leaf(pmd)) {
/*
* No need to take pmd_lock here, even if some other thread
* is splitting the huge pmd we will get that event through
@@ -362,7 +362,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
* values.
*/
pmd = pmdp_get_lockless(pmdp);
- if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+ if (!pmd_is_leaf(pmd))
goto again;
return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd);
diff --git a/mm/huge_mapping_pmd.c b/mm/huge_mapping_pmd.c
index 7b85e2a564d6..d30c60685f66 100644
--- a/mm/huge_mapping_pmd.c
+++ b/mm/huge_mapping_pmd.c
@@ -60,8 +60,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_is_leaf(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -627,8 +626,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd) &&
- !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_is_leaf(*pmd));
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
count_vm_event(THP_SPLIT_PMD);
@@ -845,8 +843,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
* require a folio to check the PMD against. Otherwise, there
* is a risk of replacing the wrong folio.
*/
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
+ if (pmd_is_leaf(*pmd) || is_pmd_migration_entry(*pmd)) {
if (folio && folio != pmd_folio(*pmd))
return;
__split_huge_pmd_locked(vma, pmd, address, freeze);
diff --git a/mm/huge_mapping_pud.c b/mm/huge_mapping_pud.c
index c3a6bffe2871..58871dd74df2 100644
--- a/mm/huge_mapping_pud.c
+++ b/mm/huge_mapping_pud.c
@@ -57,7 +57,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
spinlock_t *ptl;
ptl = pud_lock(vma->vm_mm, pud);
- if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+ if (likely(pud_is_leaf(*pud)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -90,7 +90,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pud = *src_pud;
- if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+ if (unlikely(!pud_leaf(pud)))
goto out_unlock;
/*
@@ -225,7 +225,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pud_lock(vma->vm_mm, pud);
- if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+ if (unlikely(!pud_is_leaf(*pud)))
goto out;
__split_huge_pud_locked(vma, pud, range.start);
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b3541a..a9ea767d2d73 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
pmd_t pmdval = pmdp_get_lockless(pmd);
/* Do not split a huge pmd, present or migrated */
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+ if (pmd_is_leaf(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
walk->action = ACTION_CONTINUE;
}
@@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
pud_t pudval = READ_ONCE(*pud);
/* Do not split a huge pud */
- if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
+ if (pud_is_leaf(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
walk->action = ACTION_CONTINUE;
}
diff --git a/mm/memory.c b/mm/memory.c
index 126ee0903c79..6dc92c514bb7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1235,8 +1235,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_is_leaf(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1272,7 +1271,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+ if (pud_is_leaf(*src_pud)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1710,7 +1709,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_is_leaf(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
@@ -1752,7 +1751,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+ if (pud_is_leaf(*pud)) {
if (next - addr != HPAGE_PUD_SIZE) {
mmap_assert_locked(tlb->mm);
split_huge_pud(vma, pud, addr);
@@ -5605,8 +5604,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pud_t orig_pud = *vmf.pud;
barrier();
- if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
-
+ if (pud_is_leaf(orig_pud)) {
/*
* TODO once we support anonymous PUDs: NUMA case and
* FAULT_FLAG_UNSHARE handling.
@@ -5646,7 +5644,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
- if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+ if (pmd_is_leaf(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 6d66dc1c6ffa..1fbeee9619c8 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -596,7 +596,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+ if (pmd_leaf(*pmdp))
goto abort;
if (pte_alloc(mm, pmdp))
goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 694f13b83864..ddfee216a02b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -381,7 +381,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
goto next;
_pmd = pmdp_get_lockless(pmd);
- if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
+ if (is_swap_pmd(_pmd) || pmd_is_leaf(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -452,7 +452,7 @@ static inline long change_pud_range(struct mmu_gather *tlb,
mmu_notifier_invalidate_range_start(&range);
}
- if (pud_leaf(pud)) {
+ if (pud_is_leaf(pud)) {
if ((next - addr != PUD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pud(vma, pudp, addr);
diff --git a/mm/mremap.c b/mm/mremap.c
index e7ae140fc640..f5c9884ea1f8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -587,7 +587,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
if (!new_pud)
break;
- if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+ if (pud_is_leaf(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
old_pud, new_pud, need_rmap_locks);
@@ -609,8 +609,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
again:
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_is_leaf(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index ae5cc42aa208..891bea8062d2 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -235,8 +235,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = pmdp_get_lockless(pvmw->pmd);
- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
- (pmd_present(pmde) && pmd_devmap(pmde))) {
+ if (pmd_is_leaf(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
if (!pmd_present(pmde)) {
@@ -251,7 +250,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
return true;
}
- if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) {
+ if (likely(pmd_is_leaf(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (!check_pmd(pmd_pfn(pmde), pvmw))
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e9fc3f6774a6..c7b7a803f4ad 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_leaf(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
@@ -247,7 +246,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
pud_t pud;
VM_BUG_ON(address & ~HPAGE_PUD_MASK);
- VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+ VM_BUG_ON(!pud_leaf(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
return pud;
@@ -293,7 +292,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
- if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ if (unlikely(pmd_leaf(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
--
2.45.0
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-17 22:02 [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings Peter Xu
` (5 preceding siblings ...)
2024-07-17 22:02 ` [PATCH RFC 6/6] mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf() Peter Xu
@ 2024-07-22 13:29 ` David Hildenbrand
2024-07-22 15:31 ` Peter Xu
6 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-07-22 13:29 UTC (permalink / raw)
To: Peter Xu, linux-kernel, linux-mm
Cc: Vlastimil Babka, Oscar Salvador, linux-s390, Andrew Morton,
Matthew Wilcox, Dan Williams, Michal Hocko, linux-riscv,
sparclinux, Alex Williamson, Jason Gunthorpe, x86,
Alistair Popple, linuxppc-dev, linux-arm-kernel, Ryan Roberts,
Hugh Dickins, Axel Rasmussen
On 18.07.24 00:02, Peter Xu wrote:
> This is an RFC series, so not yet for merging. Please don't be scared by
> the code changes: most of them are code movements only.
>
> This series is based on the dax mprotect fix series here (while that one is
> based on mm-unstable):
>
> [PATCH v3 0/8] mm/mprotect: Fix dax puds
> https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
>
> Overview
> ========
>
> This series doesn't provide any feature change. The only goal of this
> series is to start decoupling two ideas: "THP" and "huge mapping". We
> already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
> one extends that idea into the code.
>
> The issue is that we have so many functions that only compile with
> CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
> a pretty common concept, which can apply to many things besides THPs
> nowadays. The major THP file is mm/huge_memory.c as of now.
>
> The first example of such huge mapping users will be hugetlb. We lived
> until now with no problem simply because Linux almost duplicated all the
> logics there in the "THP" files into hugetlb APIs. If we want to get rid
> of hugetlb specific APIs and paths, this _might_ be the first thing we want
> to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
> if !CONFIG_THP.
>
> Then consider other things like dax / pfnmaps. Dax can depend on THP, then
> it'll naturally be able to use pmd/pud helpers, that's okay. However is it
> a must? Do we also want to have every new pmd/pud mappings in the future
> to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
>
> If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
> are larger than PAGE_SIZE) is a more generic concept than THP, then I think
> at some point we need to move the generic code out of THP code into a
> common code base.
>
> This is what this series does as a start.
Hi Peter!
From a quick glimpse, patches #1-#4 do make sense independently of patch #5.
I am not so sure about all of the code movement in patch #5. If large
folios are the future, then likely huge_memory.c should simply be the
home for all that logic.
Maybe the goal should rather be to compile huge_memory.c not only for
THP, but also for other use cases that require that logic, and fence off
all the THP-specific stuff using #ifdef?
Not sure, though. But a lot of this code movement/churn might be avoidable.
--
Cheers,
David / dhildenb
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-22 13:29 ` [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings David Hildenbrand
@ 2024-07-22 15:31 ` Peter Xu
2024-07-23 8:18 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-07-22 15:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Vlastimil Babka, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
> On 18.07.24 00:02, Peter Xu wrote:
> > This is an RFC series, so not yet for merging. Please don't be scared by
> > the code changes: most of them are code movements only.
> >
> > This series is based on the dax mprotect fix series here (while that one is
> > based on mm-unstable):
> >
> > [PATCH v3 0/8] mm/mprotect: Fix dax puds
> > https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
> >
> > Overview
> > ========
> >
> > This series doesn't provide any feature change. The only goal of this
> > series is to start decoupling two ideas: "THP" and "huge mapping". We
> > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
> > one extends that idea into the code.
> >
> > The issue is that we have so many functions that only compile with
> > CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
> > a pretty common concept, which can apply to many things besides THPs
> > nowadays. The major THP file is mm/huge_memory.c as of now.
> >
> > The first example of such huge mapping users will be hugetlb. We lived
> > until now with no problem simply because Linux almost duplicated all the
> > logics there in the "THP" files into hugetlb APIs. If we want to get rid
> > of hugetlb specific APIs and paths, this _might_ be the first thing we want
> > to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
> > if !CONFIG_THP.
> >
> > Then consider other things like dax / pfnmaps. Dax can depend on THP, then
> > it'll naturally be able to use pmd/pud helpers, that's okay. However is it
> > a must? Do we also want to have every new pmd/pud mappings in the future
> > to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
> >
> > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
> > are larger than PAGE_SIZE) is a more generic concept than THP, then I think
> > at some point we need to move the generic code out of THP code into a
> > common code base.
> >
> > This is what this series does as a start.
>
> Hi Peter!
>
> From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
>
> I am not so sure about all of the code movement in patch #5. If large folios
> are the future, then likely huge_memory.c should simply be the home for all
> that logic.
>
> Maybe the goal should better be to compile huge_memory.c not only for THP,
> but also for other use cases that require that logic, and fence off all THP
> specific stuff using #ifdef?
>
> Not sure, though. But a lot of this code movements/churn might be avoidable.
I'm fine using ifdefs in the current file, but IMHO it's a matter of
whether we want to keep huge_memory.c growing into an even larger file,
and keep all the large folio logic only in that file. Currently it's
~4000 LOCs.
Normally I don't see this as falling into the "code churn" category,
because it doesn't change the code itself but only moves things. I
personally also prefer to avoid code churn, but only in cases where
there would be tiny functional changes here and there without real
benefit.
It's pretty unavoidable to me when one file grows too large and we need
to split it, and in this case git doesn't have a good way to track such
movement.
Irrespective of this, just to mention that there's still the option for
me to make huge pfnmap depend on THP again, which shouldn't be a huge
deal (I don't have any use case that needs huge pfnmap with THP
disabled, anyway), so this series isn't an immediate concern to me for
that route. But for a hugetlb rework this might be something we need to
do, because we simply can't make CONFIG_HUGETLB rely on CONFIG_THP.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-22 15:31 ` Peter Xu
@ 2024-07-23 8:18 ` David Hildenbrand
2024-07-23 21:04 ` Peter Xu
0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2024-07-23 8:18 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, Vlastimil Babka, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
On 22.07.24 17:31, Peter Xu wrote:
> On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
>> On 18.07.24 00:02, Peter Xu wrote:
>>> This is an RFC series, so not yet for merging. Please don't be scared by
>>> the code changes: most of them are code movements only.
>>>
>>> This series is based on the dax mprotect fix series here (while that one is
>>> based on mm-unstable):
>>>
>>> [PATCH v3 0/8] mm/mprotect: Fix dax puds
>>> https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
>>>
>>> Overview
>>> ========
>>>
>>> This series doesn't provide any feature change. The only goal of this
>>> series is to start decoupling two ideas: "THP" and "huge mapping". We
>>> already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
>>> one extends that idea into the code.
>>>
>>> The issue is that we have so many functions that only compile with
>>> CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
>>> a pretty common concept, which can apply to many things besides THPs
>>> nowadays. The major THP file is mm/huge_memory.c as of now.
>>>
>>> The first example of such huge mapping users will be hugetlb. We lived
>>> until now with no problem simply because Linux almost duplicated all the
>>> logics there in the "THP" files into hugetlb APIs. If we want to get rid
>>> of hugetlb specific APIs and paths, this _might_ be the first thing we want
>>> to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
>>> if !CONFIG_THP.
>>>
>>> Then consider other things like dax / pfnmaps. Dax can depend on THP, then
>>> it'll naturally be able to use pmd/pud helpers, that's okay. However is it
>>> a must? Do we also want to have every new pmd/pud mappings in the future
>>> to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
>>>
>>> If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
>>> are larger than PAGE_SIZE) is a more generic concept than THP, then I think
>>> at some point we need to move the generic code out of THP code into a
>>> common code base.
>>>
>>> This is what this series does as a start.
>>
>> Hi Peter!
>>
>> From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
>>
>> I am not so sure about all of the code movement in patch #5. If large folios
>> are the future, then likely huge_memory.c should simply be the home for all
>> that logic.
>>
>> Maybe the goal should better be to compile huge_memory.c not only for THP,
>> but also for other use cases that require that logic, and fence off all THP
>> specific stuff using #ifdef?
>>
>> Not sure, though. But a lot of this code movements/churn might be avoidable.
>
> I'm fine using ifdefs in the current fine, but IMHO it's a matter of
> whether we want to keep huge_memory.c growing into even larger file, and
> keep all large folio logics only in that file. Currently it's ~4000 LOCs.
Depends on "how much", for sure. huge_memory.c is currently in 12th
place among the biggest files in mm/. So there might not be immediate
cause for action ... just yet :) [guess which file is at #2 :) ]
>
> Nornally I don't see this as much of a "code churn" category, because it
> doesn't changes the code itself but only move things. I personally also
> prefer without code churns, but only in the case where there'll be tiny
> little functional changes here and there without real benefit.
>
> It's pretty unavoidable to me when one file grows too large and we'll need
> to split, and in this case git doesn't have a good way to track such
> movement..
Yes, that's what I mean.
I've recently been thinking about whether we should pursue a different
direction:
Just as we recently relocated most of the follow_huge_* stuff into
gup.c, we should likely look into moving copy_huge_pmd,
change_huge_pmd, ... into the files where they logically belong.
In madvise.c, we've been doing that in some places already: For
madvise_cold_or_pageout_pte_range() we inline the code, but not for
madvise_free_huge_pmd().
pmd_trans_huge() would already compile to a NOP without
CONFIG_TRANSPARENT_HUGEPAGE, but to let that code avoid most of the
CONFIG_TRANSPARENT_HUGEPAGE #ifdefs, we'd need a couple more function
stubs to keep the compiler happy while still being able to compile that
code out when not required.
The idea would be that, e.g., pmd_leaf() would return "false" at
compile time if no relevant configuration (THP, HUGETLB, ...) is
active. So we could just use pmd_leaf() similarly to pmd_trans_huge()
in the relevant code and have the compiler optimize it all out without
putting it into separate files.
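A minimal sketch of what I mean, assuming the series'
CONFIG_PGTABLE_HAS_HUGE_LEAVES as the guard and with made-up helper
names:

#ifndef CONFIG_PGTABLE_HAS_HUGE_LEAVES
#define pmd_leaf(pmd)		false
/* The real free_huge_pmd() would live in huge_memory.c when enabled. */
static inline bool free_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
				 unsigned long addr)
{
	return false;	/* never reached: pmd_leaf() is constant false */
}
#endif

static void madvise_like_walker(struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr)
{
	if (pmd_leaf(*pmd)) {	/* folds to "if (false)" when impossible */
		free_huge_pmd(vma, pmd, addr);
		return;
	}
	/* ... the usual PTE-level handling ... */
}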
That means, large folios and PMD/PUD mappings will become "more common"
and better integrated, without the need to jump between files.
Just some thought about an alternative that would make sense to me.
>
> Irrelevant of this, just to mention I think there's still one option that I
> at least can make the huge pfnmap depends on THP again which shouldn't be a
> huge deal (I don't have any use case that needs huge pfnmap but disable
> THP, anyway..), so this series isn't an immediate concern to me for that
> route. But for a hugetlb rework this might be something we need to do,
> because we simplly can't make CONFIG_HUGETLB rely on CONFIG_THP..
Yes, likely. FSDAX went in a similar direction and called that FSDAX
thing a "THP", although it really doesn't have anything in common with
a THP besides being partially mappable -- IMHO.
--
Cheers,
David / dhildenb
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-23 8:18 ` David Hildenbrand
@ 2024-07-23 21:04 ` Peter Xu
2024-07-23 21:22 ` David Hildenbrand
2024-08-22 17:08 ` LEROY Christophe
0 siblings, 2 replies; 17+ messages in thread
From: Peter Xu @ 2024-07-23 21:04 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, Vlastimil Babka, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote:
> On 22.07.24 17:31, Peter Xu wrote:
> > On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
> > > On 18.07.24 00:02, Peter Xu wrote:
> > > > This is an RFC series, so not yet for merging. Please don't be scared by
> > > > the code changes: most of them are code movements only.
> > > >
> > > > This series is based on the dax mprotect fix series here (while that one is
> > > > based on mm-unstable):
> > > >
> > > > [PATCH v3 0/8] mm/mprotect: Fix dax puds
> > > > https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
> > > >
> > > > Overview
> > > > ========
> > > >
> > > > This series doesn't provide any feature change. The only goal of this
> > > > series is to start decoupling two ideas: "THP" and "huge mapping". We
> > > > already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
> > > > one extends that idea into the code.
> > > >
> > > > The issue is that we have so many functions that only compile with
> > > > CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
> > > > a pretty common concept, which can apply to many things besides THPs
> > > > nowadays. The major THP file is mm/huge_memory.c as of now.
> > > >
> > > > The first example of such huge mapping users will be hugetlb. We lived
> > > > until now with no problem simply because Linux almost duplicated all the
> > > > logics there in the "THP" files into hugetlb APIs. If we want to get rid
> > > > of hugetlb specific APIs and paths, this _might_ be the first thing we want
> > > > to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
> > > > if !CONFIG_THP.
> > > >
> > > > Then consider other things like dax / pfnmaps. Dax can depend on THP, then
> > > > it'll naturally be able to use pmd/pud helpers, that's okay. However is it
> > > > a must? Do we also want to have every new pmd/pud mappings in the future
> > > > to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
> > > >
> > > > If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
> > > > are larger than PAGE_SIZE) is a more generic concept than THP, then I think
> > > > at some point we need to move the generic code out of THP code into a
> > > > common code base.
> > > >
> > > > This is what this series does as a start.
> > >
> > > Hi Peter!
> > >
> > > From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
> > >
> > > I am not so sure about all of the code movement in patch #5. If large folios
> > > are the future, then likely huge_memory.c should simply be the home for all
> > > that logic.
> > >
> > > Maybe the goal should better be to compile huge_memory.c not only for THP,
> > > but also for other use cases that require that logic, and fence off all THP
> > > specific stuff using #ifdef?
> > >
> > > Not sure, though. But a lot of this code movements/churn might be avoidable.
> >
> > I'm fine using ifdefs in the current fine, but IMHO it's a matter of
> > whether we want to keep huge_memory.c growing into even larger file, and
> > keep all large folio logics only in that file. Currently it's ~4000 LOCs.
>
> Depends on "how much" for sure. huge_memory.c is currently on place 12 of
> the biggest files in mm/. So there might not be immediate cause for action
> ... just yet :) [guess which file is on #2 :) ]
7821, hugetlb.c
7602, vmscan.c
7275, slub.c
7072, page_alloc.c
6673, memory.c
5402, memcontrol.c
5239, shmem.c
5155, vmalloc.c
4419, filemap.c
4060, mmap.c
3882, huge_memory.c
IMHO a split is normally better than keeping everything in one file,
but yeah, I'd confess the THP file isn't that bad compared to the
others. And I'm definitely surprised it's even out of the top ten.
>
> >
> > Nornally I don't see this as much of a "code churn" category, because it
> > doesn't changes the code itself but only move things. I personally also
> > prefer without code churns, but only in the case where there'll be tiny
> > little functional changes here and there without real benefit.
> >
> > It's pretty unavoidable to me when one file grows too large and we'll need
> > to split, and in this case git doesn't have a good way to track such
> > movement..
>
> Yes, that's what I mean.
>
> I've been recently thinking if we should pursue a different direction:
>
> Just as we recently relocated most follow_huge_* stuff into gup.c, likely we
> should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd
> ... into the files where they logically belong to.
>
> In madvise.c, we've been doing that in some places already: For
> madvise_cold_or_pageout_pte_range() we inline the code, but not for
> madvise_free_huge_pmd().
>
> pmd_trans_huge() would already compile to a NOP without
> CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most
> CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make
> the compiler happy while still being able to compile that code out when not
> required.
Right, I had a patch that does exactly that, where it's called
pmd_is_leaf(), for example, but taking CONFIG_* into account.
I remember I had some issue with that, e.g. I used to see that
pmd_trans_huge() (when !THP) could optimize some paths while
pmd_is_leaf() didn't do the same job even with all the configs off.
But that's another story and I haven't dug deeper yet. Could be
something small but overlooked.
>
> The idea would be that e.g., pmd_leaf() would return "false" at compile time
> if no active configuration (THP, HUGETLB, ...) would be active. So we could
> just use pmd_leaf() similar to pmd_trans_huge() in relevant code and have
> the compiler optimize it all out without putting it into separate files.
>
> That means, large folios and PMD/PUD mappings will become "more common" and
> better integrated, without the need to jump between files.
>
> Just some thought about an alternative that would make sense to me.
Yeah, comments are always welcome, thanks.
So I suppose it may be easier for now to make the pfnmap branch depend
on THP. It looks to me like something like this may still take some
time to consolidate. When it's light enough, maybe it can become a few
initial patches on top of a hugetlb series that starts to use this.
Maybe that'll at least make the patches easier to review.
Thanks,
--
Peter Xu
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-23 21:04 ` Peter Xu
@ 2024-07-23 21:22 ` David Hildenbrand
2024-08-22 17:08 ` LEROY Christophe
1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2024-07-23 21:22 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel, linux-mm, Vlastimil Babka, Oscar Salvador,
linux-s390, Andrew Morton, Matthew Wilcox, Dan Williams,
Michal Hocko, linux-riscv, sparclinux, Alex Williamson,
Jason Gunthorpe, x86, Alistair Popple, linuxppc-dev,
linux-arm-kernel, Ryan Roberts, Hugh Dickins, Axel Rasmussen
On 23.07.24 23:04, Peter Xu wrote:
> On Tue, Jul 23, 2024 at 10:18:37AM +0200, David Hildenbrand wrote:
>> On 22.07.24 17:31, Peter Xu wrote:
>>> On Mon, Jul 22, 2024 at 03:29:43PM +0200, David Hildenbrand wrote:
>>>> On 18.07.24 00:02, Peter Xu wrote:
>>>>> This is an RFC series, so not yet for merging. Please don't be scared by
>>>>> the code changes: most of them are code movements only.
>>>>>
>>>>> This series is based on the dax mprotect fix series here (while that one is
>>>>> based on mm-unstable):
>>>>>
>>>>> [PATCH v3 0/8] mm/mprotect: Fix dax puds
>>>>> https://lore.kernel.org/r/20240715192142.3241557-1-peterx@redhat.com
>>>>>
>>>>> Overview
>>>>> ========
>>>>>
>>>>> This series doesn't provide any feature change. The only goal of this
>>>>> series is to start decoupling two ideas: "THP" and "huge mapping". We
>>>>> already started with having PGTABLE_HAS_HUGE_LEAVES config option, and this
>>>>> one extends that idea into the code.
>>>>>
>>>>> The issue is that we have so many functions that only compile with
>>>>> CONFIG_THP=on, even though they're about huge mappings, and huge mapping is
>>>>> a pretty common concept, which can apply to many things besides THPs
>>>>> nowadays. The major THP file is mm/huge_memory.c as of now.
>>>>>
>>>>> The first example of such huge mapping users will be hugetlb. We lived
>>>>> until now with no problem simply because Linux almost duplicated all the
>>>>> logics there in the "THP" files into hugetlb APIs. If we want to get rid
>>>>> of hugetlb specific APIs and paths, this _might_ be the first thing we want
>>>>> to do, because we want to be able to e.g., zapping a hugetlb pmd entry even
>>>>> if !CONFIG_THP.
>>>>>
>>>>> Then consider other things like dax / pfnmaps. Dax can depend on THP, then
>>>>> it'll naturally be able to use pmd/pud helpers, that's okay. However is it
>>>>> a must? Do we also want to have every new pmd/pud mappings in the future
>>>>> to depend on THP (like PFNMAP)? My answer is no, but I'm open to opinions.
>>>>>
>>>>> If anyone agrees with me that "huge mapping" (aka, PMD/PUD mappings that
>>>>> are larger than PAGE_SIZE) is a more generic concept than THP, then I think
>>>>> at some point we need to move the generic code out of THP code into a
>>>>> common code base.
>>>>>
>>>>> This is what this series does as a start.
>>>>
>>>> Hi Peter!
>>>>
>>>> From a quick glimpse, patch #1-#4 do make sense independent of patch #5.
>>>>
>>>> I am not so sure about all of the code movement in patch #5. If large folios
>>>> are the future, then likely huge_memory.c should simply be the home for all
>>>> that logic.
>>>>
>>>> Maybe the goal should better be to compile huge_memory.c not only for THP,
>>>> but also for other use cases that require that logic, and fence off all THP
>>>> specific stuff using #ifdef?
>>>>
>>>> Not sure, though. But a lot of this code movements/churn might be avoidable.
>>>
>>> I'm fine using ifdefs in the current fine, but IMHO it's a matter of
>>> whether we want to keep huge_memory.c growing into even larger file, and
>>> keep all large folio logics only in that file. Currently it's ~4000 LOCs.
>>
>> Depends on "how much" for sure. huge_memory.c is currently on place 12 of
>> the biggest files in mm/. So there might not be immediate cause for action
>> ... just yet :) [guess which file is on #2 :) ]
>
> 7821, hugetlb.c
> 7602, vmscan.c
> 7275, slub.c
> 7072, page_alloc.c
> 6673, memory.c
> 5402, memcontrol.c
> 5239, shmem.c
> 5155, vmalloc.c
> 4419, filemap.c
> 4060, mmap.c
> 3882, huge_memory.c
>
> IMHO a split is normally better than keeping everything in one file, but
> yeah I'd confess THP file isn't that bad comparing to others.. And I'm
> definitely surprised it's even out of top ten.
It's always interesting looking at the numbers here. For v6.10 we had:
8521 mm/memcontrol.c
7813 mm/hugetlb.c
7550 mm/vmscan.c
7266 mm/slub.c
7018 mm/page_alloc.c
6468 mm/memory.c
5154 mm/vmalloc.c
5002 mm/shmem.c
4419 mm/filemap.c
4019 mm/mmap.c
3954 mm/ksm.c
3740 mm/swapfile.c
3730 mm/huge_memory.c
3689 mm/gup.c
3542 mm/mempolicy.c
I suspect memcontrol.c shrunk because of the v1 split-off, leaving
hugetlb.c now at #1 :)
--
Cheers,
David / dhildenb
* Re: [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings
2024-07-23 21:04 ` Peter Xu
2024-07-23 21:22 ` David Hildenbrand
@ 2024-08-22 17:08 ` LEROY Christophe
1 sibling, 0 replies; 17+ messages in thread
From: LEROY Christophe @ 2024-08-22 17:08 UTC (permalink / raw)
To: Peter Xu, David Hildenbrand, Andrew Morton
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka,
Oscar Salvador, linux-s390@vger.kernel.org, Matthew Wilcox,
Dan Williams, Michal Hocko, linux-riscv@lists.infradead.org,
sparclinux@vger.kernel.org, Alex Williamson, Jason Gunthorpe,
x86@kernel.org, Alistair Popple, linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, Ryan Roberts, Hugh Dickins,
Axel Rasmussen
On 23/07/2024 23:04, Peter Xu wrote:
>>
>>>
>>> Nornally I don't see this as much of a "code churn" category, because it
>>> doesn't changes the code itself but only move things. I personally also
>>> prefer without code churns, but only in the case where there'll be tiny
>>> little functional changes here and there without real benefit.
>>>
>>> It's pretty unavoidable to me when one file grows too large and we'll need
>>> to split, and in this case git doesn't have a good way to track such
>>> movement..
>>
>> Yes, that's what I mean.
>>
>> I've been recently thinking if we should pursue a different direction:
>>
>> Just as we recently relocated most follow_huge_* stuff into gup.c, likely we
>> should rather look into moving copy_huge_pmd, change_huge_pmd, copy_huge_pmd
>> ... into the files where they logically belong to.
>>
>> In madvise.c, we've been doing that in some places already: For
>> madvise_cold_or_pageout_pte_range() we inline the code, but not for
>> madvise_free_huge_pmd().
>>
>> pmd_trans_huge() would already compile to a NOP without
>> CONFIG_TRANSPARENT_HUGEPAGE, but to make that code avoid most
>> CONFIG_TRANSPARENT_HUGEPAGE, we'd need a couple more function stubs to make
>> the compiler happy while still being able to compile that code out when not
>> required.
>
> Right, I had a patch does exactly that, where it's called pmd_is_leaf(),
> for example, but taking CONFIG_* into account.
>
> I remember I had some issue with that, e.g. I used to see pmd_trans_huge()
> (when !THP) can optimize some path but pmd_is_leaf() didn't do the same job
> even if all configs were off. But that's another story and I didn't yet
> dig deeper. Could be something small but overlooked.
When I prepared the series
https://patchwork.kernel.org/project/linux-mm/list/?series=871008, I
noticed that some architectures define pXd_leaf() for levels they don't
support; that's the reason Andrew had to drop v2 at the last minute.
And that may be why some of the code you expect to get folded off
remains.
Since then I've sent a v3 that fixes that. I don't know if Andrew is
planning to take it.
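As a made-up illustration (not from any real arch), something like this
defeats the fold-off, because the compiler cannot prove the test false
from the Kconfig alone:

/*
 * Hypothetical arch header: tests a hardware bit even though this
 * configuration can never create PUD-level leaves.
 */
#define pud_leaf(pud)	(pud_val(pud) & _PAGE_HUGE_EXAMPLE)

whereas the generic fallback "#define pud_leaf(x) false" lets the whole
branch disappear.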
Christophe
* Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
2024-07-17 22:02 ` [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options Peter Xu
@ 2024-08-22 17:22 ` LEROY Christophe
2024-08-22 19:16 ` Peter Xu
0 siblings, 1 reply; 17+ messages in thread
From: LEROY Christophe @ 2024-08-22 17:22 UTC (permalink / raw)
To: Peter Xu, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Vlastimil Babka, David Hildenbrand, Oscar Salvador,
linux-s390@vger.kernel.org, Andrew Morton, Matthew Wilcox,
Dan Williams, Michal Hocko, linux-riscv@lists.infradead.org,
sparclinux@vger.kernel.org, Alex Williamson, Jason Gunthorpe,
x86@kernel.org, Alistair Popple, linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, Ryan Roberts, Hugh Dickins,
Axel Rasmussen
On 18/07/2024 00:02, Peter Xu wrote:
> Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
>
> - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
> - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
>
> It will help to identify whether the current build may only want PMD
> helpers but not PUD ones, as these sub-options will also check against the
> arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
>
> Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
> still some intermediate step. The best way is to have an option say
> "whether arch XXX supports PMD/PUD mappings" and so on. However let's
> leave that for later as that's the easy part. So far, we use these options
> to stably detect per-arch huge mapping support.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> include/linux/huge_mm.h | 10 +++++++---
> mm/Kconfig | 6 ++++++
> 2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 711632df7edf..37482c8445d1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>
> -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> -#define HPAGE_PMD_SHIFT PMD_SHIFT
> +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
> #define HPAGE_PUD_SHIFT PUD_SHIFT
> #else
> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
> #endif
>
> +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
> +#define HPAGE_PMD_SHIFT PMD_SHIFT
> +#else
> +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> +#endif
> +
> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 60796402850e..2dbdc088dee8 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
> config PGTABLE_HAS_HUGE_LEAVES
> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>
> +config PGTABLE_HAS_PMD_LEAVES
> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
> +
> +config PGTABLE_HAS_PUD_LEAVES
> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
> +
What if an architecture has hugepages at PMD and/or PUD level and
doesn't support THP?
Christophe
> #
> # UP and nommu archs use km based percpu allocator
> #
* Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
2024-08-22 17:22 ` LEROY Christophe
@ 2024-08-22 19:16 ` Peter Xu
2024-08-23 6:19 ` LEROY Christophe
0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2024-08-22 19:16 UTC (permalink / raw)
To: LEROY Christophe
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka,
David Hildenbrand, Oscar Salvador, linux-s390@vger.kernel.org,
Andrew Morton, Matthew Wilcox, Dan Williams, Michal Hocko,
linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org,
Alex Williamson, Jason Gunthorpe, x86@kernel.org, Alistair Popple,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, Ryan Roberts, Hugh Dickins,
Axel Rasmussen
On Thu, Aug 22, 2024 at 05:22:03PM +0000, LEROY Christophe wrote:
>
>
> On 18/07/2024 00:02, Peter Xu wrote:
> > Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
> >
> > - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
> > - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
> >
> > It will help to identify whether the current build may only want PMD
> > helpers but not PUD ones, as these sub-options will also check against the
> > arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
> >
> > Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
> > still some intermediate step. The best way is to have an option say
> > "whether arch XXX supports PMD/PUD mappings" and so on. However let's
> > leave that for later as that's the easy part. So far, we use these options
> > to stably detect per-arch huge mapping support.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > include/linux/huge_mm.h | 10 +++++++---
> > mm/Kconfig | 6 ++++++
> > 2 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 711632df7edf..37482c8445d1 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
> > #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
> > (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
> >
> > -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> > -#define HPAGE_PMD_SHIFT PMD_SHIFT
> > +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
> > #define HPAGE_PUD_SHIFT PUD_SHIFT
> > #else
> > -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
> > #endif
> >
> > +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
> > +#define HPAGE_PMD_SHIFT PMD_SHIFT
> > +#else
> > +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > +#endif
> > +
> > #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
> > #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> > #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 60796402850e..2dbdc088dee8 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
> > config PGTABLE_HAS_HUGE_LEAVES
> > def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
> >
> > +config PGTABLE_HAS_PMD_LEAVES
> > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
> > +
> > +config PGTABLE_HAS_PUD_LEAVES
> > + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
> > +
>
> What if an architecture has hugepages at PMD and/or PUD level and
> doesn't support THP ?
What's the arch to be discussed here?

The whole purpose of this series so far is to make some pmd/pud helpers
that are currently only defined with CONFIG_THP=on available even when
it's off.  It means this series alone (or any future plan) shouldn't
affect any arch that always has CONFIG_THP=off.

But logically I think we do need some config option just to say "this
arch supports pmd mappings", even if CONFIG_THP=off.  When that's there,
we should perhaps add that option into this equation so that
PGTABLE_HAS_*_LEAVES will also be selected in that case.
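To illustrate the direction with a rough sketch only (zap_pmd_leaf() is a
made-up name, not something in the series), a PMD helper keyed off the new
option could then be declared like:

	#include <linux/build_bug.h>
	#include <linux/mm.h>

	#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
	/* Real version, built only when some form of PMD leaf can exist. */
	void zap_pmd_leaf(struct mm_struct *mm, pmd_t *pmdp, unsigned long addr);
	#else
	/* No PMD leaves possible: any surviving caller is a build bug. */
	static inline void zap_pmd_leaf(struct mm_struct *mm, pmd_t *pmdp,
					unsigned long addr)
	{
		BUILD_BUG();
	}
	#endif

That way a hugetlb-only or pfnmap build still gets the helper, while an
arch with no PMD leaf support at all compiles the callers away.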
--
Peter Xu
* Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
2024-08-22 19:16 ` Peter Xu
@ 2024-08-23 6:19 ` LEROY Christophe
2024-08-26 14:34 ` Peter Xu
0 siblings, 1 reply; 17+ messages in thread
From: LEROY Christophe @ 2024-08-23 6:19 UTC (permalink / raw)
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka,
David Hildenbrand, Oscar Salvador, linux-s390@vger.kernel.org,
Andrew Morton, Matthew Wilcox, Dan Williams, Michal Hocko,
linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org,
Alex Williamson, Jason Gunthorpe, x86@kernel.org, Alistair Popple,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, Ryan Roberts, Hugh Dickins,
Axel Rasmussen
On 22/08/2024 at 21:16, Peter Xu wrote:
> On Thu, Aug 22, 2024 at 05:22:03PM +0000, LEROY Christophe wrote:
>>
>>
>> On 18/07/2024 at 00:02, Peter Xu wrote:
>>> Introduce two more sub-options for PGTABLE_HAS_HUGE_LEAVES:
>>>
>>> - PGTABLE_HAS_PMD_LEAVES: set when there can be PMD mappings
>>> - PGTABLE_HAS_PUD_LEAVES: set when there can be PUD mappings
>>>
>>> It will help to identify whether the current build may only want PMD
>>> helpers but not PUD ones, as these sub-options will also check against the
>>> arch support over HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD].
>>>
>>> Note that having them depend on HAVE_ARCH_TRANSPARENT_HUGEPAGE[_PUD] is
>>> still some intermediate step. The best way is to have an option say
>>> "whether arch XXX supports PMD/PUD mappings" and so on. However let's
>>> leave that for later as that's the easy part. So far, we use these options
>>> to stably detect per-arch huge mapping support.
>>>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> include/linux/huge_mm.h | 10 +++++++---
>>> mm/Kconfig | 6 ++++++
>>> 2 files changed, 13 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 711632df7edf..37482c8445d1 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -96,14 +96,18 @@ extern struct kobj_attribute thpsize_shmem_enabled_attr;
>>> #define thp_vma_allowable_order(vma, vm_flags, tva_flags, order) \
>>> (!!thp_vma_allowable_orders(vma, vm_flags, tva_flags, BIT(order)))
>>>
>>> -#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
>>> -#define HPAGE_PMD_SHIFT PMD_SHIFT
>>> +#ifdef CONFIG_PGTABLE_HAS_PUD_LEAVES
>>> #define HPAGE_PUD_SHIFT PUD_SHIFT
>>> #else
>>> -#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>> #define HPAGE_PUD_SHIFT ({ BUILD_BUG(); 0; })
>>> #endif
>>>
>>> +#ifdef CONFIG_PGTABLE_HAS_PMD_LEAVES
>>> +#define HPAGE_PMD_SHIFT PMD_SHIFT
>>> +#else
>>> +#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
>>> +#endif
>>> +
>>> #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
>>> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>>> #define HPAGE_PMD_MASK (~(HPAGE_PMD_SIZE - 1))
>>> diff --git a/mm/Kconfig b/mm/Kconfig
>>> index 60796402850e..2dbdc088dee8 100644
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -860,6 +860,12 @@ endif # TRANSPARENT_HUGEPAGE
>>> config PGTABLE_HAS_HUGE_LEAVES
>>> def_bool TRANSPARENT_HUGEPAGE || HUGETLB_PAGE
>>>
>>> +config PGTABLE_HAS_PMD_LEAVES
>>> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE && PGTABLE_HAS_HUGE_LEAVES
>>> +
>>> +config PGTABLE_HAS_PUD_LEAVES
>>> + def_bool HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD && PGTABLE_HAS_HUGE_LEAVES
>>> +
>>
>> What if an architecture has hugepages at PMD and/or PUD level and
>> doesn't support THP?
>
> What's the arch to be discussed here?
It is LOONGARCH and MIPS; they provide pud_leaf(), which can return true
even when they have no PUD.
>
> The whole purpose of this series so far is to make some pmd/pud helpers
> that are currently only defined with CONFIG_THP=on available even when
> it's off.  It means this series alone (or any future plan) shouldn't
> affect any arch that always has CONFIG_THP=off.
>
> But logically I think we do need some config option just to say "this
> arch supports pmd mappings", even if CONFIG_THP=off.  When that's there,
> we should perhaps add that option into this equation so that
> PGTABLE_HAS_*_LEAVES will also be selected in that case.
>
Why is an option needed for that?  If pmd_leaf() always returns false, it
means the arch doesn't support pmd mappings, and if it is used properly,
all related code should fold away without a config option, shouldn't it?
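A rough sketch of the kind of folding I have in mind (do_pmd_thing() is an
invented name; only pmd_leaf() is real):

	#include <linux/pgtable.h>

	/* Declared somewhere; may have no definition on this architecture. */
	void do_pmd_thing(pmd_t *pmdp, unsigned long addr);

	static inline void example_walk(pmd_t *pmdp, unsigned long addr)
	{
		/*
		 * When pmd_leaf() is a compile-time constant false, the branch
		 * and the call are eliminated before link time, so the missing
		 * definition never matters and no Kconfig guard is needed.
		 */
		if (pmd_leaf(*pmdp))
			do_pmd_thing(pmdp, addr);
	}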
* Re: [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options
2024-08-23 6:19 ` LEROY Christophe
@ 2024-08-26 14:34 ` Peter Xu
0 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2024-08-26 14:34 UTC (permalink / raw)
To: LEROY Christophe
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Vlastimil Babka,
David Hildenbrand, Oscar Salvador, linux-s390@vger.kernel.org,
Andrew Morton, Matthew Wilcox, Dan Williams, Michal Hocko,
linux-riscv@lists.infradead.org, sparclinux@vger.kernel.org,
Alex Williamson, Jason Gunthorpe, x86@kernel.org, Alistair Popple,
linuxppc-dev@lists.ozlabs.org,
linux-arm-kernel@lists.infradead.org, Ryan Roberts, Hugh Dickins,
Axel Rasmussen
On Fri, Aug 23, 2024 at 06:19:52AM +0000, LEROY Christophe wrote:
> Why is an option needed for that?  If pmd_leaf() always returns false, it
> means the arch doesn't support pmd mappings, and if it is used properly,
> all related code should fold away without a config option, shouldn't it?
It's not always easy to leverage an "if" clause there, IIUC.  Take the case
where a driver wants to inject a pmd pfnmap; we may want something like:

	if (pmd_leaf_supported())
		inject_pmd_leaf(&pmd);

We don't have a pmd entry to reference at the point of the
pmd_leaf_supported() check, when that decision is made.
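Sketching that driver path slightly more fully (pmd_leaf_supported() and
inject_pmd_leaf_pfn() are hypothetical names; only the vm_fault types and
VM_FAULT_FALLBACK are existing kernel pieces):

	#include <linux/mm.h>

	static vm_fault_t sketch_huge_fault(struct vm_fault *vmf, unsigned long pfn)
	{
		/*
		 * No pmd entry is populated yet at this point, so the only
		 * question we can ask is whether PMD leaves are supported at
		 * all; if not, fall back and let the core map with PTEs.
		 */
		if (!pmd_leaf_supported())
			return VM_FAULT_FALLBACK;

		return inject_pmd_leaf_pfn(vmf, pfn);	/* install one 2M mapping */
	}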
Thanks,
--
Peter Xu
end of thread [newest: 2024-08-26 14:34 UTC]
Thread overview: 17+ messages
2024-07-17 22:02 [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings Peter Xu
2024-07-17 22:02 ` [PATCH RFC 1/6] mm/treewide: Remove pgd_devmap() Peter Xu
2024-07-17 22:02 ` [PATCH RFC 2/6] mm: PGTABLE_HAS_P[MU]D_LEAVES config options Peter Xu
2024-08-22 17:22 ` LEROY Christophe
2024-08-22 19:16 ` Peter Xu
2024-08-23 6:19 ` LEROY Christophe
2024-08-26 14:34 ` Peter Xu
2024-07-17 22:02 ` [PATCH RFC 3/6] mm/treewide: Make pgtable-generic.c THP agnostic Peter Xu
2024-07-17 22:02 ` [PATCH RFC 4/6] mm: Move huge mapping declarations from internal.h to huge_mm.h Peter Xu
2024-07-17 22:02 ` [PATCH RFC 5/6] mm/huge_mapping: Create huge_mapping_pxx.c Peter Xu
2024-07-17 22:02 ` [PATCH RFC 6/6] mm: Convert "*_trans_huge() || *_devmap()" to use *_leaf() Peter Xu
2024-07-22 13:29 ` [PATCH RFC 0/6] mm: THP-agnostic refactor on huge mappings David Hildenbrand
2024-07-22 15:31 ` Peter Xu
2024-07-23 8:18 ` David Hildenbrand
2024-07-23 21:04 ` Peter Xu
2024-07-23 21:22 ` David Hildenbrand
2024-08-22 17:08 ` LEROY Christophe