* [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range()
@ 2026-02-28 7:09 Yin Tirui
From: Yin Tirui @ 2026-02-28 7:09 UTC (permalink / raw)
To: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, david,
catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto,
peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt,
surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky,
apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams,
yu-cheng.yu, yangyicong, baolu.lu, jgross, conor.dooley,
Jonathan.Cameron, riel
Cc: wangkefeng.wang, chenjun102, yintirui
v3:
1. Architectural Type Safety (Matthew Wilcox):
Following the insightful architectural feedback from Matthew Wilcox in v2,
the approach to clearing huge page attributes has been completely redesigned.
Instead of spreading the `pte_clrhuge()` anti-pattern to ARM64 and RISC-V,
this series enforces strict type safety at the lowest level: `pfn_pte()`
must never natively return a PTE with huge page attributes set.
To achieve this without breaking the x86 core MM, the series is structured as:
- Fix historical type-casting abuses in x86 (vmemmap, vmalloc, CPA) where
`pfn_pte()` was wrongly used to generate huge PMDs/PUDs.
- Update `pfn_pte()` on x86 and ARM64 to inherently filter out huge page
attributes. (RISC-V leaf PMDs and PTEs share the exact same hardware
format without a specific "huge" bit, so it is naturally compliant).
- Completely eradicate `pte_clrhuge()` from the x86 tree and clean up
the type-casting mess in `arch/x86/mm/init_64.c`.
2. Page Table Deposit fix during clone() (syzbot):
Previously, `copy_huge_pmd()` was unaware of special PMDs created by pfnmap,
failing to deposit a page table for the child process during `clone()`.
This led to crashes during process teardown or PMD splitting. The logic is now
updated to properly allocate and deposit pgtables for `pmd_special()` entries.
v2: https://lore.kernel.org/linux-mm/20251016112704.179280-1-yintirui@huawei.com/#t
- remove "nohugepfnmap" boot option and "pfnmap_max_page_shift" variable.
- zap_deposited_table for non-special pmd.
- move set_pmd_at() inside pmd_lock.
- prevent PMD mapping creation when pgtable allocation fails.
- defer the refactor of pte_clrhuge() to a separate patch series. For now,
add a TODO to track this.
v1: https://lore.kernel.org/linux-mm/20250923133104.926672-1-yintirui@huawei.com/
Overview
========
This patch series adds huge page support for remap_pfn_range(),
automatically creating huge mappings when prerequisites are satisfied
(size, alignment, architecture support, etc.) and falling back to
normal page mappings otherwise.
This work builds on Peter Xu's previous efforts on huge pfnmap
support [0].
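The prerequisite check described above (size, alignment) can be sketched in userspace C. This is an illustrative model, not the kernel implementation: the function name `can_map_pmd()` and the standalone constants are assumptions for the sketch; only the 2 MiB PMD geometry on a 4K-page configuration is real.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_SIZE   (1UL << PMD_SHIFT)   /* 2 MiB */
#define PMD_MASK   (~(PMD_SIZE - 1))

/*
 * Sketch of the prerequisite check: a PMD-level pfnmap is only possible
 * when the remaining range is at least 2 MiB and both the virtual start
 * and the physical start (pfn << PAGE_SHIFT) are 2 MiB aligned.
 * Otherwise the mapping falls back to 4K PTEs.
 */
static bool can_map_pmd(unsigned long vaddr, unsigned long pfn,
                        unsigned long size)
{
	unsigned long paddr = pfn << PAGE_SHIFT;

	if (size < PMD_SIZE)
		return false;           /* range too small for a PMD leaf */
	if (vaddr & ~PMD_MASK)
		return false;           /* virtual start not 2 MiB aligned */
	if (paddr & ~PMD_MASK)
		return false;           /* physical start not 2 MiB aligned */
	return true;
}
```

Architecture support checks (e.g. whether the arch implements huge pfnmap at all) would sit in front of this in the real code.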
TODO
====
- Add PUD-level huge page support. Currently, only PMD-level huge
pages are supported.
Tests Done
==========
- Cross-build tests.
- Core MM Regression Tests
- Booted x86 kernel with `debug_pagealloc=on` to heavily stress the
large page splitting logic in direct mapping. No panics observed.
- Ran `make -C tools/testing/selftests/vm run_tests`. Both THP and
Hugetlbfs tests passed successfully, proving the `pfn_pte()` changes
do not interfere with native huge page generation.
- Functional Tests (with a custom device driver & PTDUMP):
- Verified that `remap_pfn_range()` successfully creates 2MB mappings
by observing `/sys/kernel/debug/page_tables/current_user`.
- Triggered PMD splits via 4K-granular `mprotect()` and partial `munmap()`,
verifying correct fallback to 512 PTEs without corrupting permissions
or causing kernel crashes.
- Triggered `fork()`/`clone()` on the mapped regions, validating the
syzbot fix and ensuring safe pgtable deposit/withdraw lifecycle.
- Performance tests with custom device driver implementing mmap()
with remap_pfn_range():
- lat_mem_rd benchmark modified to use mmap(device_fd) instead of
malloc() shows around 40% improvement in memory access latency with
huge page support compared to normal page mappings.
numactl -C 0 lat_mem_rd -t 4096M (stride=64)
Memory Size (MB)   Without Huge Mapping   With Huge Mapping   Improvement
----------------   --------------------   -----------------   -----------
          64.00             148.858 ns          100.780 ns         32.3%
         128.00             164.745 ns          103.537 ns         37.2%
         256.00             169.907 ns          103.179 ns         39.3%
         512.00             171.285 ns          103.072 ns         39.8%
        1024.00             173.054 ns          103.055 ns         40.4%
        2048.00             172.820 ns          103.091 ns         40.3%
        4096.00             172.877 ns          103.115 ns         40.4%
- Custom memory copy operations on mmap(device_fd) show around 18% performance
improvement with huge page support compared to normal page mappings.
numactl -C 0 memcpy_test (memory copy performance test)
Memory Size (MB)   Without Huge Mapping   With Huge Mapping   Improvement
----------------   --------------------   -----------------   -----------
        1024.00               95.76 ms            77.91 ms         18.6%
        2048.00              190.87 ms           155.64 ms         18.5%
        4096.00              380.84 ms           311.45 ms         18.2%
[0] https://lore.kernel.org/all/20240826204353.2228736-2-peterx@redhat.com/T/#u
Yin Tirui (4):
x86/mm: Use proper page table helpers for huge page generation
mm/pgtable: Make pfn_pte() filter out huge page attributes
x86/mm: Remove pte_clrhuge() and clean up init_64.c
mm: add PMD-level huge page support for remap_pfn_range()
arch/arm64/include/asm/pgtable.h | 4 +++-
arch/x86/include/asm/pgtable.h | 9 ++++---
arch/x86/mm/init_64.c | 10 ++++----
arch/x86/mm/pat/set_memory.c | 6 ++++-
arch/x86/mm/pgtable.c | 4 ++--
mm/huge_memory.c | 36 ++++++++++++++++++++++++++--
mm/memory.c | 40 ++++++++++++++++++++++++++++++++
7 files changed, 93 insertions(+), 16 deletions(-)
--
2.22.0
* [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation
From: Yin Tirui @ 2026-02-28 7:09 UTC (permalink / raw)

Historically, several core x86 mm subsystems (vmemmap, vmalloc, and CPA)
have abused `pfn_pte()` to generate PMD and PUD entries by passing
pgprot values containing the _PAGE_PSE flag, and then casting the
resulting pte_t to a pmd_t or pud_t.

This violates strict type safety and prevents us from enforcing the rule
that `pfn_pte()` should strictly generate pte without huge page attributes.

Fix these abuses by explicitly using the correct level-specific helpers
(`pfn_pmd()` and `pfn_pud()`) and their corresponding setters
(`set_pmd()`, `set_pud()`).

For the CPA (Change Page Attribute) code, which uses `pte_t` as a generic
container for page table entries across all levels in
__should_split_large_page(), pack the correctly generated PMD/PUD values
into the pte_t container.

This cleanup prepares the ground for making `pfn_pte()` strictly filter
out huge page attributes.
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
 arch/x86/mm/init_64.c        | 6 +++---
 arch/x86/mm/pat/set_memory.c | 6 +++++-
 arch/x86/mm/pgtable.c        | 4 ++--
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index df2261fa4f98..d65f3d05c66f 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1518,11 +1518,11 @@ static int __meminitdata node_start;
 void __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
 			       unsigned long addr, unsigned long next)
 {
-	pte_t entry;
+	pmd_t entry;
 
-	entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
+	entry = pfn_pmd(__pa(p) >> PAGE_SHIFT,
 			PAGE_KERNEL_LARGE);
-	set_pmd(pmd, __pmd(pte_val(entry)));
+	set_pmd(pmd, entry);
 
 	/* check to see if we have contiguous blocks */
 	if (p_end != p || node_start != node) {
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 40581a720fe8..87aa0e9a8f82 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1059,7 +1059,11 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 		return 1;
 
 	/* All checks passed. Update the large page mapping. */
-	new_pte = pfn_pte(old_pfn, new_prot);
+	if (level == PG_LEVEL_2M)
+		new_pte = __pte(pmd_val(pfn_pmd(old_pfn, new_prot)));
+	else
+		new_pte = __pte(pud_val(pfn_pud(old_pfn, new_prot)));
+
 	__set_pmd_pte(kpte, address, new_pte);
 	cpa->flags |= CPA_FLUSHTLB;
 	cpa_inc_lp_preserved(level);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 2e5ecfdce73c..61320fd44e16 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -644,7 +644,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
 	if (pud_present(*pud) && !pud_leaf(*pud))
 		return 0;
 
-	set_pte((pte_t *)pud, pfn_pte(
+	set_pud(pud, pfn_pud(
 		(u64)addr >> PAGE_SHIFT,
 		__pgprot(protval_4k_2_large(pgprot_val(prot)) | _PAGE_PSE)));
 
@@ -676,7 +676,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
 	if (pmd_present(*pmd) && !pmd_leaf(*pmd))
 		return 0;
 
-	set_pte((pte_t *)pmd, pfn_pte(
+	set_pmd(pmd, pfn_pmd(
 		(u64)addr >> PAGE_SHIFT,
 		__pgprot(protval_4k_2_large(pgprot_val(prot)) | _PAGE_PSE)));
 
-- 
2.22.0
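The CPA change above relies on a pack-into-container trick: generate the entry with the helper for its real level, then carry it in the generic `pte_t`. A userspace mock can show why the typed wrappers make the old `pfn_pte()` abuse a compile-time concern; the struct definitions and helper bodies below are illustrative stand-ins, not the kernel's (only the `__pte(pmd_val(...))` pattern and the `_PAGE_PSE` bit position are from the patch).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mock of the kernel's typed page-table entries. Distinct wrapper structs
 * mean a pmd_t cannot silently be passed where a pte_t is expected.
 */
typedef struct { uint64_t val; } pte_t;
typedef struct { uint64_t val; } pmd_t;

#define _PAGE_PSE	(1ULL << 7)	/* huge-page bit, valid in PMDs/PUDs */
#define PAGE_SHIFT	12

static pte_t __pte(uint64_t v)   { pte_t p = { v }; return p; }
static pmd_t __pmd(uint64_t v)   { pmd_t p = { v }; return p; }
static uint64_t pte_val(pte_t p) { return p.val; }
static uint64_t pmd_val(pmd_t p) { return p.val; }

/* Level-correct helper: a PMD may legitimately carry _PAGE_PSE. */
static pmd_t pfn_pmd(uint64_t pfn, uint64_t prot)
{
	return __pmd((pfn << PAGE_SHIFT) | prot);
}

/*
 * The CPA pattern: build the entry with the level-correct helper, then
 * explicitly pack its raw value into the generic pte_t container that
 * __should_split_large_page() passes around.
 */
static pte_t pack_pmd_into_pte(uint64_t pfn, uint64_t prot)
{
	return __pte(pmd_val(pfn_pmd(pfn, prot)));
}
```

The cast is now explicit and value-level rather than an implicit lie about what `pfn_pte()` returns.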
* Re: [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation
From: Jonathan Cameron @ 2026-03-06 9:29 UTC (permalink / raw)
To: Yin Tirui

On Sat, 28 Feb 2026 15:09:03 +0800
Yin Tirui <yintirui@huawei.com> wrote:

> Historically, several core x86 mm subsystems (vmemmap, vmalloc, and CPA)
> have abused `pfn_pte()` to generate PMD and PUD entries by passing
> pgprot values containing the _PAGE_PSE flag, and then casting the
> resulting pte_t to a pmd_t or pud_t.
>
> This violates strict type safety and prevents us from enforcing the rule
> that `pfn_pte()` should strictly generate pte without huge page attributes.
>
> Fix these abuses by explicitly using the correct level-specific helpers
> (`pfn_pmd()` and `pfn_pud()`) and their corresponding setters
> (`set_pmd()`, `set_pud()`).
>
> For the CPA (Change Page Attribute) code, which uses `pte_t` as a generic
> container for page table entries across all levels in
> __should_split_large_page(), pack the correctly generated PMD/PUD values
> into the pte_t container.
>
> This cleanup prepares the ground for making `pfn_pte()` strictly filter
> out huge page attributes.
>
> Signed-off-by: Yin Tirui <yintirui@huawei.com>

Hi. A tiny drive-by review comment below.
> ---
>  arch/x86/mm/init_64.c        | 6 +++---
>  arch/x86/mm/pat/set_memory.c | 6 +++++-
>  arch/x86/mm/pgtable.c        | 4 ++--
>  3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index df2261fa4f98..d65f3d05c66f 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1518,11 +1518,11 @@ static int __meminitdata node_start;
>  void __meminit vmemmap_set_pmd(pmd_t *pmd, void *p, int node,
>  			       unsigned long addr, unsigned long next)
>  {
> -	pte_t entry;
> +	pmd_t entry;
>
> -	entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
> +	entry = pfn_pmd(__pa(p) >> PAGE_SHIFT,
>  			PAGE_KERNEL_LARGE);

Whilst you are here, can we make that a one liner.

	entry = pfn_pmd(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL_LARGE);

Could even do

	pmd_t entry = pfn_pmd(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL_LARGE);

but that's more of a question of taste.

> -	set_pmd(pmd, __pmd(pte_val(entry)));
> +	set_pmd(pmd, entry);
>
>  	/* check to see if we have contiguous blocks */
* Re: [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation
From: Yin Tirui @ 2026-03-10 3:23 UTC (permalink / raw)
To: Jonathan Cameron

On 3/6/2026 5:29 PM, Jonathan Cameron wrote:
> Whilst you are here, can we make that a one liner.
>
> 	entry = pfn_pmd(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL_LARGE);
>
> Could even do
>
> 	pmd_t entry = pfn_pmd(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL_LARGE);
>
> but that's more of a question of taste.

Hi Jonathan,

Thanks for the suggestion, I will fold this one-liner cleanup into the
next respin.

-- 
Yin Tirui
* [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
From: Yin Tirui @ 2026-02-28 7:09 UTC (permalink / raw)

A fundamental principle of page table type safety is that `pte_t`
represents the lowest level page table entry and should never carry huge
page attributes.

Currently, passing a pgprot with huge page bits (e.g., extracted via
pmd_pgprot()) into pfn_pte() creates a malformed PTE that retains the
huge attribute, leading to the necessity of the ugly `pte_clrhuge()`
anti-pattern.

Enforce type safety by making `pfn_pte()` inherently filter out huge
page attributes:
- On x86: Strip the `_PAGE_PSE` bit.
- On ARM64: Mask out the block descriptor bits in `PTE_TYPE_MASK` and
  enforce the `PTE_TYPE_PAGE` format.
- On RISC-V: No changes required, as RISC-V leaf PMDs and PTEs share the
  exact same hardware format and do not use a distinct huge bit.
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
 arch/arm64/include/asm/pgtable.h | 4 +++-
 arch/x86/include/asm/pgtable.h   | 4 ++++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b3e58735c49b..f2a7a40106d2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -141,7 +141,9 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
 
 #define pte_pfn(pte)		(__pte_to_phys(pte) >> PAGE_SHIFT)
 #define pfn_pte(pfn,prot)	\
-	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
+	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | \
+	      ((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \
+	       (PTE_TYPE_PAGE & ~PTE_VALID)))
 
 #define pte_none(pte)		(!pte_val(pte))
 #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1662c5a8f445..a4dbd81d42bf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -738,6 +738,10 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
 static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
 {
 	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
+
+	/* Filter out _PAGE_PSE to ensure PTEs never carry the huge page bit */
+	pgprot = __pgprot(pgprot_val(pgprot) & ~_PAGE_PSE);
+
 	/* This bit combination is used to mark shadow stacks */
 	WARN_ON_ONCE((pgprot_val(pgprot) & (_PAGE_DIRTY | _PAGE_RW)) ==
 		     _PAGE_DIRTY);
 
-- 
2.22.0
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
From: Jürgen Groß @ 2026-03-04 7:52 UTC (permalink / raw)
To: Yin Tirui

On 28.02.26 08:09, Yin Tirui wrote:
> A fundamental principle of page table type safety is that `pte_t`
> represents the lowest level page table entry and should never carry
> huge page attributes.
>
> Currently, passing a pgprot with huge page bits (e.g., extracted via
> pmd_pgprot()) into pfn_pte() creates a malformed PTE that retains the
> huge attribute, leading to the necessity of the ugly `pte_clrhuge()`
> anti-pattern.
>
> Enforce type safety by making `pfn_pte()` inherently filter out huge
> page attributes:
> - On x86: Strip the `_PAGE_PSE` bit.
> - On ARM64: Mask out the block descriptor bits in `PTE_TYPE_MASK` and
>   enforce the `PTE_TYPE_PAGE` format.
> - On RISC-V: No changes required, as RISC-V leaf PMDs and PTEs share
>   the exact same hardware format and do not use a distinct huge bit.
> Signed-off-by: Yin Tirui <yintirui@huawei.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 4 +++-
>  arch/x86/include/asm/pgtable.h   | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b3e58735c49b..f2a7a40106d2 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -141,7 +141,9 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>
>  #define pte_pfn(pte)		(__pte_to_phys(pte) >> PAGE_SHIFT)
>  #define pfn_pte(pfn,prot)	\
> -	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))
> +	__pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | \
> +	      ((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \
> +	       (PTE_TYPE_PAGE & ~PTE_VALID)))
>
>  #define pte_none(pte)		(!pte_val(pte))
>  #define pte_page(pte)		(pfn_to_page(pte_pfn(pte)))
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 1662c5a8f445..a4dbd81d42bf 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -738,6 +738,10 @@ static inline pgprotval_t check_pgprot(pgprot_t pgprot)
>  static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
>  {
>  	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
> +
> +	/* Filter out _PAGE_PSE to ensure PTEs never carry the huge page bit */
> +	pgprot = __pgprot(pgprot_val(pgprot) & ~_PAGE_PSE);

Is it really a good idea to silently drop the bit?

Today it can either be used for a large page (which should be a pmd,
of course), or - much worse - you'd strip the _PAGE_PAT bit, which is
at the same position in PTEs.

So basically you are removing the ability to use some cache modes.

NACK!
Juergen
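The aliasing Jürgen describes can be demonstrated with a few lines of userspace C. The bit values below match the x86 layout (bit 7 is PSE in a PMD/PUD but the PAT selector in a 4K PTE); the `strip_pse()` helper is a stand-in for what the v3 `pfn_pte()` did, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* x86 page-table attribute bit positions. */
#define _PAGE_PWT	(1ULL << 3)
#define _PAGE_PCD	(1ULL << 4)
#define _PAGE_PSE	(1ULL << 7)	/* huge bit, meaningful in PMD/PUD  */
#define _PAGE_PAT	(1ULL << 7)	/* PAT selector in a 4K PTE: same bit */
#define _PAGE_PAT_LARGE	(1ULL << 12)	/* PAT selector in a PMD/PUD entry  */

/* What the v3 pfn_pte() effectively did to every incoming pgprot. */
static uint64_t strip_pse(uint64_t prot)
{
	return prot & ~_PAGE_PSE;
}
```

Because `_PAGE_PSE` and `_PAGE_PAT` share bit 7, stripping "the huge bit" from a 4K pgprot whose cache mode is selected via PAT silently changes that cache mode, which is exactly the NACK.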
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
From: Yin Tirui @ 2026-03-04 10:08 UTC (permalink / raw)
To: Jürgen Groß

On 3/4/2026 3:52 PM, Jürgen Groß wrote:
> On 28.02.26 08:09, Yin Tirui wrote:
>> A fundamental principle of page table type safety is that `pte_t`
>> represents the lowest level page table entry and should never carry
>> huge page attributes.
>>
>> Currently, passing a pgprot with huge page bits (e.g., extracted via
>> pmd_pgprot()) into pfn_pte() creates a malformed PTE that retains the
>> huge attribute, leading to the necessity of the ugly `pte_clrhuge()`
>> anti-pattern.
>>
>> Enforce type safety by making `pfn_pte()` inherently filter out huge
>> page attributes:
>> - On x86: Strip the `_PAGE_PSE` bit.
>> - On ARM64: Mask out the block descriptor bits in `PTE_TYPE_MASK` and
>>   enforce the `PTE_TYPE_PAGE` format.
>> - On RISC-V: No changes required, as RISC-V leaf PMDs and PTEs share
>>   the exact same hardware format and do not use a distinct huge bit.
>> static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
>> {
>> 	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
>> +
>> +	/* Filter out _PAGE_PSE to ensure PTEs never carry the huge page bit */
>> +	pgprot = __pgprot(pgprot_val(pgprot) & ~_PAGE_PSE);
>
> Is it really a good idea to silently drop the bit?
>
> Today it can either be used for a large page (which should be a pmd,
> of course), or - much worse - you'd strip the _PAGE_PAT bit, which is
> at the same position in PTEs.
>
> So basically you are removing the ability to use some cache modes.
>
> NACK!
>
> Juergen

Hi Jürgen,

You are absolutely right. I missed the fact that `_PAGE_PSE` aliases
with `_PAGE_PAT` on 4K PTEs.

The intention here was to follow previous feedback to enforce type
safety by filtering out huge page attributes directly inside
`pfn_pte()`. However, doing it this way obviously breaks the cache
modes on x86.

I agree with the NACK. I will drop this approach and rethink how to
handle the huge-to-normal pgprot conversion safely for v4.

-- 
Thanks,
Yin Tirui
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
From: Yin Tirui @ 2026-03-05 9:38 UTC (permalink / raw)
To: Matthew Wilcox, Jürgen Groß

On 3/4/2026 3:52 PM, Jürgen Groß wrote:
> On 28.02.26 08:09, Yin Tirui wrote:
>> static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
>> {
>> 	phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
>> +
>> +	/* Filter out _PAGE_PSE to ensure PTEs never carry the huge page bit */
>> +	pgprot = __pgprot(pgprot_val(pgprot) & ~_PAGE_PSE);
>
> Is it really a good idea to silently drop the bit?
>
> Today it can either be used for a large page (which should be a pmd,
> of course), or - much worse - you'd strip the _PAGE_PAT bit, which is
> at the same position in PTEs.
>
> So basically you are removing the ability to use some cache modes.
>
> NACK!
>
> Juergen

Hi Willy and Jürgen,

Following up on the x86 _PAGE_PSE and _PAGE_PAT aliasing issue.
To achieve the goal of keeping pfn_pte() pure and completely eradicating
the pte_clrhuge() anti-pattern, we need a way to ensure pfn_pte() never
receives a pgprot with the huge bit set.

@Jürgen:
Just to be absolutely certain: is there any safe way to filter out the
huge page attributes directly inside x86's pfn_pte() without breaking
PAT? Or does the hardware bit-aliasing make this strictly impossible at
the pfn_pte() level?

@Willy @Jürgen:
Assuming it is impossible to filter this safely inside pfn_pte() on x86,
we must translate the pgprot before passing it down. To maintain strict
type-safety and still drop pte_clrhuge(), I plan to introduce two
arch-neutral wrappers:

x86:

/* Translates large prot to 4K. Shifts PAT back to bit 7, inherently
 * clearing _PAGE_PSE */
#define pgprot_huge_to_pte(prot)	pgprot_large_2_4k(prot)

/* Translates 4K prot to large. Shifts PAT to bit 12, strictly sets
 * _PAGE_PSE */
#define pgprot_pte_to_huge(prot) \
	__pgprot(pgprot_val(pgprot_4k_2_large(prot)) | _PAGE_PSE)

arm64:

/*
 * Drops Block marker, enforces Page marker.
 * Strictly preserves the PTE_VALID bit to avoid validating PROT_NONE pages.
 */
#define pgprot_huge_to_pte(prot) \
	__pgprot((pgprot_val(prot) & ~(PMD_TYPE_MASK & ~PTE_VALID)) | \
		 (PTE_TYPE_PAGE & ~PTE_VALID))

/*
 * Drops Page marker, sets Block marker.
 * Strictly preserves the PTE_VALID bit.
 */
#define pgprot_pte_to_huge(prot) \
	__pgprot((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \
		 (PMD_TYPE_SECT & ~PTE_VALID))

Usage:

1. Creating a huge pfnmap (remap_try_huge_pmd)

	pgprot_t huge_prot = pgprot_pte_to_huge(prot);
	/* No need for pmd_mkhuge() */
	pmd_t entry = pmd_mkspecial(pfn_pmd(pfn, huge_prot));
	set_pmd_at(mm, addr, pmd, entry);

2. Splitting a huge pfnmap (__split_huge_pmd_locked)

	pgprot_t small_prot = pgprot_huge_to_pte(pmd_pgprot(old_pmd));
	/* No need for pte_clrhuge() */
	pte_t entry = pfn_pte(pmd_pfn(old_pmd), small_prot);
	set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR);

Willy, is there a better architectural approach to handle this and
satisfy the type-safety requirement given the x86 hardware constraints?

-- 
Thanks,
Yin Tirui
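The PAT-relocation trick behind the proposed x86 wrappers can be modeled in userspace. This is a hypothetical simulation: the function names and bodies below are stand-ins for the kernel's `pgprot_4k_2_large()`/`pgprot_large_2_4k()`; only the bit positions (PAT at bit 7 in a PTE, bit 12 in a PMD/PUD, PSE at bit 7) are real.

```c
#include <assert.h>
#include <stdint.h>

#define _PAGE_PSE	(1ULL << 7)	/* huge bit in PMD/PUD            */
#define _PAGE_PAT	(1ULL << 7)	/* PAT selector in a 4K PTE       */
#define _PAGE_PAT_LARGE	(1ULL << 12)	/* PAT selector in a PMD/PUD      */

/* 4K prot -> large prot: relocate PAT from bit 7 to bit 12. */
static uint64_t prot_4k_to_large(uint64_t prot)
{
	uint64_t large = prot & ~(_PAGE_PAT | _PAGE_PAT_LARGE);

	if (prot & _PAGE_PAT)
		large |= _PAGE_PAT_LARGE;
	return large;
}

/* Large prot -> 4K prot: relocate PAT back to bit 7, dropping PSE. */
static uint64_t prot_large_to_4k(uint64_t prot)
{
	/* _PAGE_PSE and _PAGE_PAT share bit 7, so one mask clears both. */
	uint64_t small = prot & ~(_PAGE_PSE | _PAGE_PAT_LARGE);

	if (prot & _PAGE_PAT_LARGE)
		small |= _PAGE_PAT;
	return small;
}
```

The round trip shows why the translation must happen before `pfn_pte()` rather than inside it: relocating the PAT bit needs to know which level the pgprot came from, information `pfn_pte()` alone does not have.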
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
From: Jürgen Groß @ 2026-03-05 10:05 UTC (permalink / raw)
To: Yin Tirui, Matthew Wilcox

On 05.03.26 10:38, Yin Tirui wrote:
> Hi Willy and Jürgen,
>
> Following up on the x86 _PAGE_PSE and _PAGE_PAT aliasing issue.
>
> To achieve the goal of keeping pfn_pte() pure and completely
> eradicating the pte_clrhuge() anti-pattern, we need a way to ensure
> pfn_pte() never receives a pgprot with the huge bit set.
>
> @Jürgen:
> Just to be absolutely certain: is there any safe way to filter out the
> huge page attributes directly inside x86's pfn_pte() without breaking
> PAT? Or does the hardware bit-aliasing make this strictly impossible
> at the pfn_pte() level?

There is no huge bit at the PTE level. It is existing only at the PMD
and the PUD level.

So: yes, it is absolutely impossible to filter it out, as the bit has a
different meaning in "real" PTEs (with "PTE" having the meaning: a
translation entry in a page referenced by a PMD entry not having the
PSE bit set).

> @Willy @Jürgen:
> Assuming it is impossible to filter this safely inside pfn_pte() on
> x86, we must translate the pgprot before passing it down. To maintain
> strict type-safety and still drop pte_clrhuge(), I plan to introduce
> two arch-neutral wrappers:
>
> x86:
> /* Translates large prot to 4K. Shifts PAT back to bit 7, inherently
>  * clearing _PAGE_PSE */
> #define pgprot_huge_to_pte(prot)	pgprot_large_2_4k(prot)
> /* Translates 4K prot to large. Shifts PAT to bit 12, strictly sets
>  * _PAGE_PSE */
> #define pgprot_pte_to_huge(prot) \
> 	__pgprot(pgprot_val(pgprot_4k_2_large(prot)) | _PAGE_PSE)

Seems to be okay.


Juergen
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes 2026-03-05 10:05 ` Jürgen Groß @ 2026-03-10 3:32 ` Yin Tirui 0 siblings, 0 replies; 16+ messages in thread From: Yin Tirui @ 2026-03-10 3:32 UTC (permalink / raw) To: Jürgen Groß, Matthew Wilcox Cc: linux-kernel, linux-mm, x86, linux-arm-kernel, david, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, baolu.lu, conor.dooley, Jonathan.Cameron, riel, Kefeng Wang, chenjun102 On 3/5/2026 6:05 PM, Jürgen Groß wrote: >> Hi Willy and Jürgen, >> >> Following up on the x86 _PAGE_PSE and _PAGE_PAT aliasing issue. >> >> To achieve the goal of keeping pfn_pte() pure and completely >> eradicating the pte_clrhuge() anti-pattern, we need a way to ensure >> pfn_pte() never receives a pgprot with the huge bit set. >> >> @Jürgen: >> Just to be absolutely certain: is there any safe way to filter out the >> huge page attributes directly inside x86's pfn_pte() without breaking >> PAT? Or does the hardware bit-aliasing make this strictly impossible >> at the pfn_pte() level? > > There is no huge bit at the PTE level. It is existing only at the PMD > and the > PUD level. > > So: yes, it is absolutely impossible to filter it out, as the bit has a > different meaning in "real" PTEs (with "PTE" having the meaning: a > translation > entry in a page referenced by a PMD entry not having the PSE bit set). > Hi Jürgen, Thank you for your confirmation. >> >> @Willy @Jürgen: >> Assuming it is impossible to filter this safely inside pfn_pte() on >> x86, we must translate the pgprot before passing it down. 
To maintain >> strict type-safety and still drop pte_clrhuge(), I plan to introduce >> two arch-neutral wrappers: >> >> x86: >> /* Translates large prot to 4K. Shifts PAT back to bit 7, inherently >> clearing _PAGE_PSE */ >> #define pgprot_huge_to_pte(prot) pgprot_large_2_4k(prot) >> /* Translates 4K prot to large. Shifts PAT to bit 12, strictly sets >> _PAGE_PSE */ >> #define pgprot_pte_to_huge(prot) >> __pgprot(pgprot_val(pgprot_4k_2_large(prot)) | _PAGE_PSE) > > Seems to be okay. While the wrapper approach handles the aliasing, Willy recently suggested taking it a step further by embedding this translation directly into `pfn_pmd()` and `pmd_pgprot()`. I am going to explore this embedded approach for the v4 respin. -- Yin Tirui ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes 2026-03-05 9:38 ` Yin Tirui 2026-03-05 10:05 ` Jürgen Groß @ 2026-03-06 4:25 ` Matthew Wilcox 2026-03-10 3:36 ` Yin Tirui 1 sibling, 1 reply; 16+ messages in thread From: Matthew Wilcox @ 2026-03-06 4:25 UTC (permalink / raw) To: Yin Tirui Cc: Jürgen Groß, linux-kernel, linux-mm, x86, linux-arm-kernel, david, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, baolu.lu, conor.dooley, Jonathan.Cameron, riel, Kefeng Wang, chenjun102 On Thu, Mar 05, 2026 at 05:38:46PM +0800, Yin Tirui wrote: > On 3/4/2026 3:52 PM, Jürgen Groß wrote: > > Today it can either be used for a large page (which should be a pmd, > > of course), or - much worse - you'd strip the _PAGE_PAT bit, which is > > at the same position in PTEs. > > > > So basically you are removing the ability to use some cache modes. > > > > NACK! > > > > > > Juergen > > Hi Willy and Jürgen, > > Following up on the x86 _PAGE_PSE and _PAGE_PAT aliasing issue. > > To achieve the goal of keeping pfn_pte() pure and completely eradicating the > pte_clrhuge() anti-pattern, we need a way to ensure pfn_pte() never receives > a pgprot with the huge bit set. > > @Jürgen: > Just to be absolutely certain: is there any safe way to filter out the huge > page attributes directly inside x86's pfn_pte() without breaking PAT? Or > does the hardware bit-aliasing make this strictly impossible at the > pfn_pte() level? > > @Willy @Jürgen: > Assuming it is impossible to filter this safely inside pfn_pte() on x86, we > must translate the pgprot before passing it down. 
To maintain strict > type-safety and still drop pte_clrhuge(), I plan to introduce two > arch-neutral wrappers: > > x86: > /* Translates large prot to 4K. Shifts PAT back to bit 7, inherently > clearing _PAGE_PSE */ > #define pgprot_huge_to_pte(prot) pgprot_large_2_4k(prot) > /* Translates 4K prot to large. Shifts PAT to bit 12, strictly sets > _PAGE_PSE */ > #define pgprot_pte_to_huge(prot) > __pgprot(pgprot_val(pgprot_4k_2_large(prot)) | _PAGE_PSE) I don't think we should have pgprot_large_2_4k(). Or rather, I think it should be embedded in pmd_pgprot() / pud_pgprot(). That is, we should have an 'ideal' pgprot which, on x86, perhaps matches that used by the 4k level. pfn_pmd() should be converting from the ideal pgprot to that actually used by PMDs (and setting _PAGE_PSE?) > arm64: > /* > * Drops Block marker, enforces Page marker. > * Strictly preserves the PTE_VALID bit to avoid validating PROT_NONE pages. > */ > #define pgprot_huge_to_pte(prot) \ > __pgprot((pgprot_val(prot) & ~(PMD_TYPE_MASK & ~PTE_VALID)) | \ > (PTE_TYPE_PAGE & ~PTE_VALID)) > /* > * Drops Page marker, sets Block marker. > * Strictly preserves the PTE_VALID bit. > */ > #define pgprot_pte_to_huge(prot) \ > __pgprot((pgprot_val(prot) & ~(PTE_TYPE_MASK & ~PTE_VALID)) | \ > (PMD_TYPE_SECT & ~PTE_VALID)) > > Usage: > 1. Creating a huge pfnmap (remap_try_huge_pmd) > pgprot_t huge_prot = pgprot_pte_to_huge(prot); > > /* No need for pmd_mkhuge() */ > pmd_t entry = pmd_mkspecial(pfn_pmd(pfn, huge_prot)); > set_pmd_at(mm, addr, pmd, entry); > > 2. Splitting a huge pfnmap (__split_huge_pmd_locked) > pgprot_t small_prot = pgprot_huge_to_pte(pmd_pgprot(old_pmd)); > > /* No need for pte_clrhuge() */ > pte_t entry = pfn_pte(pmd_pfn(old_pmd), small_prot); > set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR); > > > Willy, is there a better architectural approach to handle this and satisfy > the type-safety requirement given the x86 hardware constraints? 
> > -- > Thanks, > Yin Tirui > > ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes
  2026-03-06  4:25             ` Matthew Wilcox
@ 2026-03-10  3:36               ` Yin Tirui
  0 siblings, 0 replies; 16+ messages in thread
From: Yin Tirui @ 2026-03-10  3:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jürgen Groß, linux-kernel, linux-mm, x86, linux-arm-kernel,
	david, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa,
	luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang,
	vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure,
	kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu,
	dan.j.williams, yu-cheng.yu, baolu.lu, conor.dooley,
	Jonathan.Cameron, riel, Kefeng Wang, chenjun102

On 3/6/2026 12:25 PM, Matthew Wilcox wrote:
>
> I don't think we should have pgprot_large_2_4k().  Or rather, I think it
> should be embedded in pmd_pgprot() / pud_pgprot().  That is, we should
> have an 'ideal' pgprot which, on x86, perhaps matches that used by the
> 4k level.  pfn_pmd() should be converting from the ideal pgprot to
> that actually used by PMDs (and setting _PAGE_PSE?)
>

Hi Willy,

I will take this route and implement the embedded approach for the v4
respin.

-- 
Yin Tirui

^ permalink raw reply	[flat|nested] 16+ messages in thread
* [PATCH RFC v3 3/4] x86/mm: Remove pte_clrhuge() and clean up init_64.c 2026-02-28 7:09 [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui 2026-02-28 7:09 ` [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation Yin Tirui 2026-02-28 7:09 ` [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes Yin Tirui @ 2026-02-28 7:09 ` Yin Tirui 2026-02-28 7:09 ` [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui 3 siblings, 0 replies; 16+ messages in thread From: Yin Tirui @ 2026-02-28 7:09 UTC (permalink / raw) To: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, david, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, yangyicong, baolu.lu, jgross, conor.dooley, Jonathan.Cameron, riel Cc: wangkefeng.wang, chenjun102, yintirui With `pfn_pte()` now guaranteeing that it will natively filter out huge page attributes like `_PAGE_PSE`, the `pte_clrhuge()` helper has become obsolete. Remove `pte_clrhuge()` entirely. Concurrently, clean up the ugly type-casting anti-pattern in `arch/x86/mm/init_64.c` where `(pte_t *)` was forcibly cast from `pmd_t *` to call `pte_clrhuge()`. Now, we can simply extract the pgprot directly via `pmd_pgprot()` and safely pass it downstream, knowing that `pfn_pte()` will strip the huge bit automatically. 
Signed-off-by: Yin Tirui <yintirui@huawei.com>
---
 arch/x86/include/asm/pgtable.h | 5 -----
 arch/x86/mm/init_64.c          | 4 ++--
 2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a4dbd81d42bf..e8564d4ce318 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -483,11 +483,6 @@ static inline pte_t pte_mkhuge(pte_t pte)
 	return pte_set_flags(pte, _PAGE_PSE);
 }
 
-static inline pte_t pte_clrhuge(pte_t pte)
-{
-	return pte_clear_flags(pte, _PAGE_PSE);
-}
-
 static inline pte_t pte_mkglobal(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_GLOBAL);
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index d65f3d05c66f..a1ddcf793a8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -572,7 +572,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
 				paddr_last = paddr_next;
 				continue;
 			}
-			new_prot = pte_pgprot(pte_clrhuge(*(pte_t *)pmd));
+			new_prot = pmd_pgprot(*pmd);
 		}
 
 		if (page_size_mask & (1<<PG_LEVEL_2M)) {
@@ -658,7 +658,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
 				paddr_last = paddr_next;
 				continue;
 			}
-			prot = pte_pgprot(pte_clrhuge(*(pte_t *)pud));
+			prot = pud_pgprot(*pud);
 		}
 
 		if (page_size_mask & (1<<PG_LEVEL_1G)) {
-- 
2.22.0

^ permalink raw reply related	[flat|nested] 16+ messages in thread
* [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() 2026-02-28 7:09 [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui ` (2 preceding siblings ...) 2026-02-28 7:09 ` [PATCH RFC v3 3/4] x86/mm: Remove pte_clrhuge() and clean up init_64.c Yin Tirui @ 2026-02-28 7:09 ` Yin Tirui 2026-04-13 20:02 ` David Hildenbrand (Arm) 3 siblings, 1 reply; 16+ messages in thread From: Yin Tirui @ 2026-02-28 7:09 UTC (permalink / raw) To: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, david, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, yangyicong, baolu.lu, jgross, conor.dooley, Jonathan.Cameron, riel Cc: wangkefeng.wang, chenjun102, yintirui Add PMD-level huge page support to remap_pfn_range(), automatically creating huge mappings when prerequisites are satisfied (size, alignment, architecture support, etc.) and falling back to normal page mappings otherwise. Implement special huge PMD splitting by utilizing the pgtable deposit/ withdraw mechanism. When splitting is needed, the deposited pgtable is withdrawn and populated with individual PTEs created from the original huge mapping. 
Signed-off-by: Yin Tirui <yintirui@huawei.com> --- mm/huge_memory.c | 36 ++++++++++++++++++++++++++++++++++-- mm/memory.c | 40 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 74 insertions(+), 2 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index d4ca8cfd7f9d..e463d51005ee 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1857,6 +1857,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, pmd = pmdp_get_lockless(src_pmd); if (unlikely(pmd_present(pmd) && pmd_special(pmd) && !is_huge_zero_pmd(pmd))) { + pgtable = pte_alloc_one(dst_mm); + if (unlikely(!pgtable)) + goto out; dst_ptl = pmd_lock(dst_mm, dst_pmd); src_ptl = pmd_lockptr(src_mm, src_pmd); spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING); @@ -1870,6 +1873,12 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm, * able to wrongly write to the backend MMIO. */ VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd)); + + /* dax won't reach here, it will be intercepted at vma_needs_copy() */ + VM_WARN_ON_ONCE(vma_is_dax(src_vma)); + + mm_inc_nr_ptes(dst_mm); + pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); goto set_pmd; } @@ -2360,6 +2369,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, arch_check_zapped_pmd(vma, orig_pmd); tlb_remove_pmd_tlb_entry(tlb, pmd, addr); if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { + if (pmd_special(orig_pmd)) + zap_deposited_table(tlb->mm, pmd); if (arch_needs_pgtable_deposit()) zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); @@ -3005,14 +3016,35 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd, if (!vma_is_anonymous(vma)) { old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); + + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { + pte_t entry; + + if (!pmd_special(old_pmd)) { + zap_deposited_table(mm, pmd); + return; + } + pgtable = pgtable_trans_huge_withdraw(mm, pmd); + if (unlikely(!pgtable)) + return; + pmd_populate(mm, 
&_pmd, pgtable); + pte = pte_offset_map(&_pmd, haddr); + entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)); + set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR); + pte_unmap(pte); + + smp_wmb(); /* make pte visible before pmd */ + pmd_populate(mm, pmd, pgtable); + return; + } + /* * We are going to unmap this huge page. So * just go ahead and zap it */ if (arch_needs_pgtable_deposit()) zap_deposited_table(mm, pmd); - if (!vma_is_dax(vma) && vma_is_special_huge(vma)) - return; + if (unlikely(pmd_is_migration_entry(old_pmd))) { const softleaf_t old_entry = softleaf_from_pmd(old_pmd); diff --git a/mm/memory.c b/mm/memory.c index 07778814b4a8..affccf38cbcf 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2890,6 +2890,40 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, return err; } +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP +static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd, + unsigned long addr, unsigned long end, + unsigned long pfn, pgprot_t prot) +{ + pgtable_t pgtable; + spinlock_t *ptl; + + if ((end - addr) != PMD_SIZE) + return 0; + + if (!IS_ALIGNED(addr, PMD_SIZE)) + return 0; + + if (!IS_ALIGNED(pfn, HPAGE_PMD_NR)) + return 0; + + if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) + return 0; + + pgtable = pte_alloc_one(mm); + if (unlikely(!pgtable)) + return 0; + + mm_inc_nr_ptes(mm); + ptl = pmd_lock(mm, pmd); + set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot)))); + pgtable_trans_huge_deposit(mm, pmd, pgtable); + spin_unlock(ptl); + + return 1; +} +#endif + static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, unsigned long addr, unsigned long end, unsigned long pfn, pgprot_t prot) @@ -2905,6 +2939,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, VM_BUG_ON(pmd_trans_huge(*pmd)); do { next = pmd_addr_end(addr, end); +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP + if (remap_try_huge_pmd(mm, pmd, addr, next, + pfn + (addr >> PAGE_SHIFT), prot)) { + continue; + } +#endif err = 
remap_pte_range(mm, pmd, addr, next, pfn + (addr >> PAGE_SHIFT), prot); if (err) -- 2.22.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() 2026-02-28 7:09 ` [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui @ 2026-04-13 20:02 ` David Hildenbrand (Arm) 2026-04-19 11:41 ` [RESEND] " Yin Tirui 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Arm) @ 2026-04-13 20:02 UTC (permalink / raw) To: Yin Tirui, linux-kernel, linux-mm, x86, linux-arm-kernel, willy, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, yangyicong, baolu.lu, jgross, conor.dooley, Jonathan.Cameron, riel Cc: wangkefeng.wang, chenjun102 On 2/28/26 08:09, Yin Tirui wrote: > Add PMD-level huge page support to remap_pfn_range(), automatically > creating huge mappings when prerequisites are satisfied (size, alignment, > architecture support, etc.) and falling back to normal page mappings > otherwise. > > Implement special huge PMD splitting by utilizing the pgtable deposit/ > withdraw mechanism. When splitting is needed, the deposited pgtable is > withdrawn and populated with individual PTEs created from the original > huge mapping. > > Signed-off-by: Yin Tirui <yintirui@huawei.com> > --- [...] > > if (!vma_is_anonymous(vma)) { > old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); > + > + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { These magical vma checks are really bad. This all needs a cleanup (Lorenzo is doing some, hoping it will look better on top of that). > + pte_t entry; > + > + if (!pmd_special(old_pmd)) { If you are using pmd_special(), you are doing something wrong. Hint: vm_normal_page_pmd() is usually what you want. 
> + zap_deposited_table(mm, pmd); > + return; > + } > + pgtable = pgtable_trans_huge_withdraw(mm, pmd); > + if (unlikely(!pgtable)) > + return; > + pmd_populate(mm, &_pmd, pgtable); > + pte = pte_offset_map(&_pmd, haddr); > + entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)); > + set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR); > + pte_unmap(pte); > + > + smp_wmb(); /* make pte visible before pmd */ > + pmd_populate(mm, pmd, pgtable); > + return; > + } > + > /* > * We are going to unmap this huge page. So > * just go ahead and zap it > */ > if (arch_needs_pgtable_deposit()) > zap_deposited_table(mm, pmd); > - if (!vma_is_dax(vma) && vma_is_special_huge(vma)) > - return; > + > if (unlikely(pmd_is_migration_entry(old_pmd))) { > const softleaf_t old_entry = softleaf_from_pmd(old_pmd); > > diff --git a/mm/memory.c b/mm/memory.c > index 07778814b4a8..affccf38cbcf 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -2890,6 +2890,40 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, > return err; > } > > +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP Why exactly do we need arch support for that in form of a Kconfig. Usually, we guard pmd support by CONFIG_TRANSPARENT_HUGEPAGE. And then, we must check at runtime if PMD leaves are actually supported. Luiz is working on a cleanup series: https://lore.kernel.org/r/cover.1775679721.git.luizcap@redhat.com pgtable_has_pmd_leaves() is what you would want to check. > +static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd, > + unsigned long addr, unsigned long end, > + unsigned long pfn, pgprot_t prot) Use two-tab indent. (currently 3? :) ) Also, we tend to call these things now "pmd leaves". 
Call it "remap_try_pmd_leaf" or something even more expressive like "remap_try_install_pmd_leaf()" > +{ > + pgtable_t pgtable; > + spinlock_t *ptl; > + > + if ((end - addr) != PMD_SIZE) if (end - addr != PMD_SIZE) Should work > + return 0; > + > + if (!IS_ALIGNED(addr, PMD_SIZE)) > + return 0; > + You could likely combine both things into a if (!IS_ALIGNED(addr | end, PMD_SIZE)) > + if (!IS_ALIGNED(pfn, HPAGE_PMD_NR)) Another sign that you piggy-back on THP support ;) > + return 0; > + > + if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) > + return 0; Ripping out a page table?! That doesn't sound right :) Why is that required? We shouldn't be doing that here. Gah. Especially, without any pmd locks etc. > + > + pgtable = pte_alloc_one(mm); > + if (unlikely(!pgtable)) > + return 0; > + > + mm_inc_nr_ptes(mm); > + ptl = pmd_lock(mm, pmd); > + set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot)))); > + pgtable_trans_huge_deposit(mm, pmd, pgtable); > + spin_unlock(ptl); > + > + return 1; > +} > +#endif > + > static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, > unsigned long addr, unsigned long end, > unsigned long pfn, pgprot_t prot) > @@ -2905,6 +2939,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud, > VM_BUG_ON(pmd_trans_huge(*pmd)); > do { > next = pmd_addr_end(addr, end); > +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP > + if (remap_try_huge_pmd(mm, pmd, addr, next, > + pfn + (addr >> PAGE_SHIFT), prot)) { Please provide a stub instead so we don't end up with ifdef in this code. -- Cheers, David ^ permalink raw reply [flat|nested] 16+ messages in thread
* [RESEND] Re: [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() 2026-04-13 20:02 ` David Hildenbrand (Arm) @ 2026-04-19 11:41 ` Yin Tirui 0 siblings, 0 replies; 16+ messages in thread From: Yin Tirui @ 2026-04-19 11:41 UTC (permalink / raw) To: David Hildenbrand (Arm), lorenzo.stoakes Cc: linux-kernel, linux-mm, x86, linux-arm-kernel, willy, jgross, catalin.marinas, will, tglx, mingo, bp, dave.hansen, hpa, luto, peterz, akpm, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, lance.yang, vbabka, rppt, surenb, mhocko, anshuman.khandual, rmclure, kevin.brodsky, apopple, ajd, pasha.tatashin, bhe, thuth, coxu, dan.j.williams, yu-cheng.yu, baolu.lu, conor.dooley, Jonathan.Cameron, riel, wangkefeng.wang, chenjun102 (Resending to keep the thread intact, sorry for the noise) Hi David, Thanks a lot for the thorough review! On 4/14/26 04:02, David Hildenbrand (Arm) wrote: > On 2/28/26 08:09, Yin Tirui wrote: >> Add PMD-level huge page support to remap_pfn_range(), automatically >> creating huge mappings when prerequisites are satisfied (size, alignment, >> architecture support, etc.) and falling back to normal page mappings >> otherwise. >> >> Implement special huge PMD splitting by utilizing the pgtable deposit/ >> withdraw mechanism. When splitting is needed, the deposited pgtable is >> withdrawn and populated with individual PTEs created from the original >> huge mapping. >> >> Signed-off-by: Yin Tirui <yintirui@huawei.com> >> --- > > [...] > >> >> if (!vma_is_anonymous(vma)) { >> old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd); >> + >> + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) { > > These magical vma checks are really bad. This all needs a cleanup > (Lorenzo is doing some, hoping it will look better on top of that). > Agreed. I am following Lorenzo's recent cleanups closely. >> + pte_t entry; >> + >> + if (!pmd_special(old_pmd)) { > > If you are using pmd_special(), you are doing something wrong. 
> > Hint: vm_normal_page_pmd() is usually what you want. Spot on. While looking into applying vm_normal_folio_pmd() here to avoid the magical VMA checks, I realized that both __split_huge_pmd_locked() and copy_huge_pmd() currently suffer from the same !vma_is_anonymous(vma) top-level entanglement.I think these functions could benefit from a structural refactoring similar to what Lorenzo is currently doing in zap_huge_pmd(). My idea is to flatten both functions into a pmd_present()-driven decision tree: 1. Branch strictly on pmd_present(). 2. For present PMDs, use vm_normal_folio_pmd() as the single source of truth. 3. If !folio (and not a huge zero page), it cleanly identifies special mappings (like PFNMAPs) without relying on vma_is_special_huge(). We can handle the split/copy directly and return early. 4. Otherwise, proceed with the normal Anon/File THP logic, or handle non-present migration entries in the !pmd_present() branch. I have drafted two preparation patches demonstrating this approach and appended the diffs at the end of this email. Does this direction look reasonable to you? If so, I will iron out the implementation details and include these refactoring patches in my upcoming v4 series. > >> + zap_deposited_table(mm, pmd); >> + return; >> + } >> + pgtable = pgtable_trans_huge_withdraw(mm, pmd); >> + if (unlikely(!pgtable)) >> + return; >> + pmd_populate(mm, &_pmd, pgtable); >> + pte = pte_offset_map(&_pmd, haddr); >> + entry = pfn_pte(pmd_pfn(old_pmd), pmd_pgprot(old_pmd)); >> + set_ptes(mm, haddr, pte, entry, HPAGE_PMD_NR); >> + pte_unmap(pte); >> + >> + smp_wmb(); /* make pte visible before pmd */ >> + pmd_populate(mm, pmd, pgtable); >> + return; >> + } >> + >> /* >> * We are going to unmap this huge page. 
So >> * just go ahead and zap it >> */ >> if (arch_needs_pgtable_deposit()) >> zap_deposited_table(mm, pmd); >> - if (!vma_is_dax(vma) && vma_is_special_huge(vma)) >> - return; >> + >> if (unlikely(pmd_is_migration_entry(old_pmd))) { >> const softleaf_t old_entry = softleaf_from_pmd(old_pmd); >> >> diff --git a/mm/memory.c b/mm/memory.c >> index 07778814b4a8..affccf38cbcf 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -2890,6 +2890,40 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd, >> return err; >> } >> >> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP > > Why exactly do we need arch support for that in form of a Kconfig. > > Usually, we guard pmd support by CONFIG_TRANSPARENT_HUGEPAGE. > > And then, we must check at runtime if PMD leaves are actually supported. > > Luiz is working on a cleanup series: > > https://lore.kernel.org/r/cover.1775679721.git.luizcap@redhat.com > > pgtable_has_pmd_leaves() is what you would want to check. Makes sense. This Kconfig was inherited from Peter Xu's earlier proposal, but depending on CONFIG_TRANSPARENT_HUGEPAGE and pgtable_has_pmd_leaves() is indeed the correct standard. I will rebase on Luiz's series. > > >> +static int remap_try_huge_pmd(struct mm_struct *mm, pmd_t *pmd, >> + unsigned long addr, unsigned long end, >> + unsigned long pfn, pgprot_t prot) > > Use two-tab indent. (currently 3? 🙂 ) > > Also, we tend to call these things now "pmd leaves". Call it > "remap_try_pmd_leaf" or something even more expressive like > > "remap_try_install_pmd_leaf()" > Noted. Will fix the indentation and rename it. >> +{ >> + pgtable_t pgtable; >> + spinlock_t *ptl; >> + >> + if ((end - addr) != PMD_SIZE) > > if (end - addr != PMD_SIZE) > > Should work Noted. 
>> +		return 0;
>> +
>> +	if (!IS_ALIGNED(addr, PMD_SIZE))
>> +		return 0;
>> +
>
> You could likely combine both things into a
>
> 	if (!IS_ALIGNED(addr | end, PMD_SIZE))
>
>> +	if (!IS_ALIGNED(pfn, HPAGE_PMD_NR))
>
> Another sign that you piggy-back on THP support 😉

Indeed! 🙂

>> +		return 0;
>> +
>> +	if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr))
>> +		return 0;
>
> Ripping out a page table?! That doesn't sound right 🙂
>
> Why is that required? We shouldn't be doing that here. Gah.
>
> Especially, without any pmd locks etc.

...oops. That is indeed a silly one. Thanks for catching it. I will fix this to:

	if (!pmd_none(*pmd))
		return 0;

>> +
>> +	pgtable = pte_alloc_one(mm);
>> +	if (unlikely(!pgtable))
>> +		return 0;
>> +
>> +	mm_inc_nr_ptes(mm);
>> +	ptl = pmd_lock(mm, pmd);
>> +	set_pmd_at(mm, addr, pmd, pmd_mkspecial(pmd_mkhuge(pfn_pmd(pfn, prot))));
>> +	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>> +	spin_unlock(ptl);
>> +
>> +	return 1;
>> +}
>> +#endif
>> +
>>  static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  		unsigned long addr, unsigned long end,
>>  		unsigned long pfn, pgprot_t prot)
>> @@ -2905,6 +2939,12 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
>>  	VM_BUG_ON(pmd_trans_huge(*pmd));
>>  	do {
>>  		next = pmd_addr_end(addr, end);
>> +#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
>> +		if (remap_try_huge_pmd(mm, pmd, addr, next,
>> +				pfn + (addr >> PAGE_SHIFT), prot)) {
>
> Please provide a stub instead so we don't end up with ifdef in this code.

Will do.

Appendix:

1. copy_huge_pmd()

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42c983821c03..3f8b3f15c6ba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1912,35 +1912,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		  struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	spinlock_t *dst_ptl, *src_ptl;
-	struct page *src_page;
 	struct folio *src_folio;
 	pmd_t pmd;
 	pgtable_t pgtable = NULL;
 	int ret = -ENOMEM;
 
-	pmd = pmdp_get_lockless(src_pmd);
-	if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
-		     !is_huge_zero_pmd(pmd))) {
-		dst_ptl = pmd_lock(dst_mm, dst_pmd);
-		src_ptl = pmd_lockptr(src_mm, src_pmd);
-		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
-		/*
-		 * No need to recheck the pmd, it can't change with write
-		 * mmap lock held here.
-		 *
-		 * Meanwhile, making sure it's not a CoW VMA with writable
-		 * mapping, otherwise it means either the anon page wrongly
-		 * applied special bit, or we made the PRIVATE mapping be
-		 * able to wrongly write to the backend MMIO.
-		 */
-		VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
-		goto set_pmd;
-	}
-
-	/* Skip if can be re-fill on fault */
-	if (!vma_is_anonymous(dst_vma))
-		return 0;
-
 	pgtable = pte_alloc_one(dst_mm);
 	if (unlikely(!pgtable))
 		goto out;
@@ -1952,48 +1928,69 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 
 	ret = -EAGAIN;
 	pmd = *src_pmd;
-	if (unlikely(thp_migration_supported() &&
-		     pmd_is_valid_softleaf(pmd))) {
+	if (likely(pmd_present(pmd))) {
+		src_folio = vm_normal_folio_pmd(src_vma, addr, pmd);
+		if (unlikely(!src_folio)) {
+			/*
+			 * When page table lock is held, the huge zero pmd should not be
+			 * under splitting since we don't split the page itself, only pmd to
+			 * a page table.
+			 */
+			if (is_huge_zero_pmd(pmd)) {
+				/*
+				 * mm_get_huge_zero_folio() will never allocate a new
+				 * folio here, since we already have a zero page to
+				 * copy. It just takes a reference.
+				 */
+				mm_get_huge_zero_folio(dst_mm);
+				goto out_zero_page;
+			}
+
+			/*
+			 * Making sure it's not a CoW VMA with writable
+			 * mapping, otherwise it means either the anon page wrongly
+			 * applied special bit, or we made the PRIVATE mapping be
+			 * able to wrongly write to the backend MMIO.
+			 */
+			VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
+			pte_free(dst_mm, pgtable);
+			goto set_pmd;
+		}
+
+		if (!folio_test_anon(src_folio)) {
+			pte_free(dst_mm, pgtable);
+			ret = 0;
+			goto out_unlock;
+		}
+
+		folio_get(src_folio);
+		if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, &src_folio->page, dst_vma, src_vma))) {
+			/* Page maybe pinned: split and retry the fault on PTEs. */
+			folio_put(src_folio);
+			pte_free(dst_mm, pgtable);
+			spin_unlock(src_ptl);
+			spin_unlock(dst_ptl);
+			__split_huge_pmd(src_vma, src_pmd, addr, false);
+			return -EAGAIN;
+		}
+		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+
+	} else if (unlikely(thp_migration_supported() && pmd_is_valid_softleaf(pmd))) {
+		if (unlikely(!vma_is_anonymous(dst_vma))) {
+			pte_free(dst_mm, pgtable);
+			ret = 0;
+			goto out_unlock;
+		}
 		copy_huge_non_present_pmd(dst_mm, src_mm, dst_pmd, src_pmd, addr,
 					  dst_vma, src_vma, pmd, pgtable);
 		ret = 0;
 		goto out_unlock;
-	}
-	if (unlikely(!pmd_trans_huge(pmd))) {
+	} else {
 		pte_free(dst_mm, pgtable);
 		goto out_unlock;
 	}
-	/*
-	 * When page table lock is held, the huge zero pmd should not be
-	 * under splitting since we don't split the page itself, only pmd to
-	 * a page table.
-	 */
-	if (is_huge_zero_pmd(pmd)) {
-		/*
-		 * mm_get_huge_zero_folio() will never allocate a new
-		 * folio here, since we already have a zero page to
-		 * copy. It just takes a reference.
-		 */
-		mm_get_huge_zero_folio(dst_mm);
-		goto out_zero_page;
-	}
-	src_page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
-	src_folio = page_folio(src_page);
-
-	folio_get(src_folio);
-	if (unlikely(folio_try_dup_anon_rmap_pmd(src_folio, src_page, dst_vma, src_vma))) {
-		/* Page maybe pinned: split and retry the fault on PTEs. */
-		folio_put(src_folio);
-		pte_free(dst_mm, pgtable);
-		spin_unlock(src_ptl);
-		spin_unlock(dst_ptl);
-		__split_huge_pmd(src_vma, src_pmd, addr, false);
-		return -EAGAIN;
-	}
-	add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 out_zero_page:
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-- 
2.43.0

2. __split_huge_pmd_locked()

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3f8b3f15c6ba..c02c2843520f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3090,98 +3090,50 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	count_vm_event(THP_SPLIT_PMD);
 
-	if (!vma_is_anonymous(vma)) {
-		old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
-		/*
-		 * We are going to unmap this huge page. So
-		 * just go ahead and zap it
-		 */
-		if (arch_needs_pgtable_deposit())
-			zap_deposited_table(mm, pmd);
-		if (vma_is_special_huge(vma))
-			return;
-		if (unlikely(pmd_is_migration_entry(old_pmd))) {
-			const softleaf_t old_entry = softleaf_from_pmd(old_pmd);
+	if (pmd_present(*pmd)) {
+		folio = vm_normal_folio_pmd(vma, haddr, *pmd);
 
-			folio = softleaf_to_folio(old_entry);
-		} else if (is_huge_zero_pmd(old_pmd)) {
+		if (unlikely(!folio)) {
+			/* Huge Zero Page */
+			if (is_huge_zero_pmd(*pmd))
+				/*
+				 * FIXME: Do we want to invalidate secondary mmu by calling
+				 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
+				 * inside __split_huge_pmd() ?
+				 *
+				 * We are going from a zero huge page write protected to zero
+				 * small page also write protected so it does not seems useful
+				 * to invalidate secondary mmu at this time.
+				 */
+				return __split_huge_zero_page_pmd(vma, haddr, pmd);
+
+			/* Huge PFNMAP */
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
 			return;
-		} else {
+		}
+
+		/* File/Shmem THP */
+		if (unlikely(!folio_test_anon(folio))) {
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
+			if (vma_is_special_huge(vma))
+				return;
+
 			page = pmd_page(old_pmd);
-			folio = page_folio(page);
 			if (!folio_test_dirty(folio) && pmd_dirty(old_pmd))
 				folio_mark_dirty(folio);
 			if (!folio_test_referenced(folio) && pmd_young(old_pmd))
 				folio_set_referenced(folio);
 			folio_remove_rmap_pmd(folio, page, vma);
 			folio_put(folio);
+			add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+			return;
 		}
-		add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
-		return;
-	}
-
-	if (is_huge_zero_pmd(*pmd)) {
-		/*
-		 * FIXME: Do we want to invalidate secondary mmu by calling
-		 * mmu_notifier_arch_invalidate_secondary_tlbs() see comments below
-		 * inside __split_huge_pmd() ?
-		 *
-		 * We are going from a zero huge page write protected to zero
-		 * small page also write protected so it does not seems useful
-		 * to invalidate secondary mmu at this time.
-		 */
-		return __split_huge_zero_page_pmd(vma, haddr, pmd);
-	}
-
-	if (pmd_is_migration_entry(*pmd)) {
-		softleaf_t entry;
-
-		old_pmd = *pmd;
-		entry = softleaf_from_pmd(old_pmd);
-		page = softleaf_to_page(entry);
-		folio = page_folio(page);
-
-		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
-
-		write = softleaf_is_migration_write(entry);
-		if (PageAnon(page))
-			anon_exclusive = softleaf_is_migration_read_exclusive(entry);
-		young = softleaf_is_migration_young(entry);
-		dirty = softleaf_is_migration_dirty(entry);
-	} else if (pmd_is_device_private_entry(*pmd)) {
-		softleaf_t entry;
-
-		old_pmd = *pmd;
-		entry = softleaf_from_pmd(old_pmd);
-		page = softleaf_to_page(entry);
-		folio = page_folio(page);
-
-		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
-
-		write = softleaf_is_device_private_write(entry);
-		anon_exclusive = PageAnonExclusive(page);
-		/*
-		 * Device private THP should be treated the same as regular
-		 * folios w.r.t anon exclusive handling. See the comments for
-		 * folio handling and anon_exclusive below.
-		 */
-		if (freeze && anon_exclusive &&
-		    folio_try_share_anon_rmap_pmd(folio, page))
-			freeze = false;
-		if (!freeze) {
-			rmap_t rmap_flags = RMAP_NONE;
-
-			folio_ref_add(folio, HPAGE_PMD_NR - 1);
-			if (anon_exclusive)
-				rmap_flags |= RMAP_EXCLUSIVE;
-
-			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
-						 vma, haddr, rmap_flags);
-		}
-	} else {
+		/* Anon THP */
 		/*
 		 * Up to this point the pmd is present and huge and userland has
 		 * the whole access to the hugepage during the split (which
@@ -3207,7 +3159,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 */
 		old_pmd = pmdp_invalidate(vma, haddr, pmd);
 		page = pmd_page(old_pmd);
-		folio = page_folio(page);
 		if (pmd_dirty(old_pmd)) {
 			dirty = true;
 			folio_set_dirty(folio);
@@ -3218,8 +3169,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		uffd_wp = pmd_uffd_wp(old_pmd);
 
 		VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
-		VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
-
 		/*
 		 * Without "freeze", we'll simply split the PMD, propagating the
 		 * PageAnonExclusive() flag for each PTE by setting it for
@@ -3236,17 +3185,82 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		 * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
 		 */
 		anon_exclusive = PageAnonExclusive(page);
-		if (freeze && anon_exclusive &&
-		    folio_try_share_anon_rmap_pmd(folio, page))
+		if (freeze && anon_exclusive && folio_try_share_anon_rmap_pmd(folio, page))
 			freeze = false;
 		if (!freeze) {
 			rmap_t rmap_flags = RMAP_NONE;
 
-			folio_ref_add(folio, HPAGE_PMD_NR - 1);
 			if (anon_exclusive)
 				rmap_flags |= RMAP_EXCLUSIVE;
-			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
-						 vma, haddr, rmap_flags);
+			folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR, vma, haddr, rmap_flags);
+		}
+	} else { /* pmd not present */
+		folio = pmd_to_softleaf_folio(*pmd);
+		if (unlikely(!folio))
+			return;
+
+		/* Migration of File/Shmem THP */
+		if (unlikely(!folio_test_anon(folio))) {
+			old_pmd = pmdp_huge_clear_flush(vma, haddr, pmd);
+			if (arch_needs_pgtable_deposit())
+				zap_deposited_table(mm, pmd);
+			if (vma_is_special_huge(vma))
+				return;
+			add_mm_counter(mm, mm_counter_file(folio), -HPAGE_PMD_NR);
+			return;
+		}
+
+		/* Migration of Anon THP or Device Private */
+		if (pmd_is_migration_entry(*pmd)) {
+			softleaf_t entry;
+
+			old_pmd = *pmd;
+			entry = softleaf_from_pmd(old_pmd);
+			page = softleaf_to_page(entry);
+			folio = page_folio(page);
+
+			soft_dirty = pmd_swp_soft_dirty(old_pmd);
+			uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
+			write = softleaf_is_migration_write(entry);
+			if (PageAnon(page))
+				anon_exclusive = softleaf_is_migration_read_exclusive(entry);
+			young = softleaf_is_migration_young(entry);
+			dirty = softleaf_is_migration_dirty(entry);
+		} else if (pmd_is_device_private_entry(*pmd)) {
+			softleaf_t entry;
+
+			old_pmd = *pmd;
+			entry = softleaf_from_pmd(old_pmd);
+			page = softleaf_to_page(entry);
+
+			soft_dirty = pmd_swp_soft_dirty(old_pmd);
+			uffd_wp = pmd_swp_uffd_wp(old_pmd);
+
+			write = softleaf_is_device_private_write(entry);
+			anon_exclusive = PageAnonExclusive(page);
+
+			/*
+			 * Device private THP should be treated the same as regular
+			 * folios w.r.t anon exclusive handling. See the comments for
+			 * folio handling and anon_exclusive below.
+			 */
+			if (freeze && anon_exclusive &&
+			    folio_try_share_anon_rmap_pmd(folio, page))
+				freeze = false;
+			if (!freeze) {
+				rmap_t rmap_flags = RMAP_NONE;
+
+				folio_ref_add(folio, HPAGE_PMD_NR - 1);
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+
+				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+							 vma, haddr, rmap_flags);
+			}
+		} else {
+			VM_WARN_ONCE(1, "unknown situation.");
+			return;
 		}
 	}
-- 
2.43.0

-- 
Yin Tirui
End of thread, other threads: [~2026-04-19 11:41 UTC | newest]

Thread overview: 16+ messages

2026-02-28  7:09 [PATCH RFC v3 0/4] mm: add huge pfnmap support for remap_pfn_range() Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 1/4] x86/mm: Use proper page table helpers for huge page generation Yin Tirui
2026-03-06  9:29 ` Jonathan Cameron
2026-03-10  3:23 ` Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 2/4] mm/pgtable: Make pfn_pte() filter out huge page attributes Yin Tirui
2026-03-04  7:52 ` Jürgen Groß
2026-03-04 10:08 ` Yin Tirui
2026-03-05  9:38 ` Yin Tirui
2026-03-05 10:05 ` Jürgen Groß
2026-03-10  3:32 ` Yin Tirui
2026-03-06  4:25 ` Matthew Wilcox
2026-03-10  3:36 ` Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 3/4] x86/mm: Remove pte_clrhuge() and clean up init_64.c Yin Tirui
2026-02-28  7:09 ` [PATCH RFC v3 4/4] mm: add PMD-level huge page support for remap_pfn_range() Yin Tirui
2026-04-13 20:02 ` David Hildenbrand (Arm)
2026-04-19 11:41 ` [RESEND] " Yin Tirui