From: Yin Tirui <yintirui@huawei.com>
To: Andrew Morton <akpm@linux-foundation.org>,
Matthew Wilcox <willy@infradead.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>, Juergen Gross <jgross@suse.com>,
Jonathan Cameron <jic23@kernel.org>,
Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Peter Xu <peterx@redhat.com>,
Luiz Capitulino <luizcap@redhat.com>,
Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
"H . Peter Anvin" <hpa@zytor.com>,
Andy Lutomirski <luto@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
Nicholas Piggin <npiggin@gmail.com>,
Christophe Leroy <chleroy@kernel.org>,
"Liam R . Howlett" <liam@infradead.org>, Zi Yan <ziy@nvidia.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Nico Pache <npache@redhat.com>,
Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
Barry Song <baohua@kernel.org>, Lance Yang <lance.yang@linux.dev>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>,
Anshuman Khandual <anshuman.khandual@arm.com>,
Rohan McLure <rmclure@linux.ibm.com>,
Kevin Brodsky <kevin.brodsky@arm.com>,
Alistair Popple <apopple@nvidia.com>,
Andrew Donnellan <andrew+kernel@donnellan.id.au>,
Pasha Tatashin <pasha.tatashin@soleen.com>,
Baoquan He <bhe@redhat.com>, Thomas Huth <thuth@redhat.com>,
Coiby Xu <coxu@redhat.com>, Dan Williams <djbw@kernel.org>,
Yu-cheng Yu <yu-cheng.yu@intel.com>,
Lu Baolu <baolu.lu@linux.intel.com>,
Conor Dooley <conor.dooley@microchip.com>,
Rik van Riel <riel@surriel.com>, <wangkefeng.wang@huawei.com>,
<chenjun102@huawei.com>, <yintirui@huawei.com>,
<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<x86@kernel.org>, <linux-arm-kernel@lists.infradead.org>,
<linuxppc-dev@lists.ozlabs.org>, <linux-pm@vger.kernel.org>
Subject: [PATCH mm-unstable RFC v4 0/7] mm: add huge pfnmap support for remap_pfn_range()
Date: Tue, 26 May 2026 22:49:56 +0800
Message-ID: <20260526145003.88445-1-yintirui@huawei.com>
This series is based on mm-unstable and depends on:
1. pgtable_has_pmd_leaves(), introduced by Luiz's series:
   https://lore.kernel.org/linux-mm/cover.1777663129.git.luizcap@redhat.com/
2. "mm/huge_memory: update file PMD counter before folio_put()":
   https://lore.kernel.org/linux-mm/20260526101337.1984081-1-yintirui@huawei.com/T/#u
v4:
- Following Matthew Wilcox's feedback that huge-page attribute handling
should stay in architecture helpers:
https://lore.kernel.org/all/aapXRN4KjWtUUJ0g@casper.infradead.org/
Reworked the pgprot contract for architectures that enable
CONFIG_ARCH_SUPPORTS_PMD_PFNMAP: pfn_pmd()/pfn_pud() construct PMD/PUD
leaf entries from base-PTE pgprot_t, while pmd_pgprot()/pud_pgprot()
return base-PTE pgprot_t. Added the required x86, arm64 and powerpc
support; RISC-V already satisfies the required semantics.
- Refactored copy_huge_pmd() and __split_huge_pmd_locked() to first
classify PMDs by pmd_present() and then use vm_normal_folio_pmd() for
present PMDs. Also made move_huge_pmd() use has_deposited_pgtable().
- Introduced a restriction, following the discussion with Lorenzo and
David, that remap_pfn_range() does not create PMD-sized mappings for
VMAs that have a fault handler:
https://lore.kernel.org/linux-mm/6417587a-7e43-4615-9e2c-50a245842f59@kernel.org/
With this restriction, PMD PFNMAP entries in VMAs without fault handlers
are known to have been installed by remap_pfn_range(), which deposits a
page table when installing such mappings; PMD PFNMAP entries in VMAs
with fault handlers are created through fault-time insertion paths such
as vmf_insert_pfn_pmd().
v3: https://lore.kernel.org/all/20260228070906.1418911-1-yintirui@huawei.com/
1. Architectural Type Safety (Matthew Wilcox):
Following the insightful architectural feedback from Matthew Wilcox in v2,
the approach to clearing huge page attributes has been completely redesigned.
Instead of spreading the `pte_clrhuge()` anti-pattern to ARM64 and RISC-V,
this series enforces strict type safety at the lowest level: `pfn_pte()`
must never natively return a PTE with huge page attributes set.
To achieve this without breaking the x86 core MM, the series is structured as:
- Fix historical type-casting abuses in x86 (vmemmap, vmalloc, CPA) where
`pfn_pte()` was wrongly used to generate huge PMDs/PUDs.
- Update `pfn_pte()` on x86 and ARM64 to inherently filter out huge page
attributes. (RISC-V leaf PMDs and PTEs share the exact same hardware
format without a specific "huge" bit, so it is naturally compliant).
- Completely eradicate `pte_clrhuge()` from the x86 tree and clean up
the type-casting mess in `arch/x86/mm/init_64.c`.
2. Page Table Deposit fix during clone() (syzbot):
Previously, `copy_huge_pmd()` was unaware of special PMDs created by pfnmap,
failing to deposit a page table for the child process during `clone()`.
This led to crashes during process teardown or PMD splitting. The logic is now
updated to properly allocate and deposit pgtables for `pmd_special()` entries.
v2: https://lore.kernel.org/linux-mm/20251016112704.179280-1-yintirui@huawei.com/#t
- remove "nohugepfnmap" boot option and "pfnmap_max_page_shift" variable.
- zap_deposited_table for non-special pmd.
- move set_pmd_at() inside pmd_lock.
- prevent PMD mapping creation when pgtable allocation fails.
- defer the refactor of pte_clrhuge() to a separate patch series. For now,
add a TODO to track this.
v1: https://lore.kernel.org/linux-mm/20250923133104.926672-1-yintirui@huawei.com/
Overview
========
This patch series adds huge page support for remap_pfn_range(),
automatically creating huge mappings when prerequisites are satisfied
(size, alignment, architecture support, etc.) and falling back to
normal page mappings otherwise.
This work builds on Peter Xu's previous efforts on huge pfnmap
support [0].
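For illustration only, the size and alignment prerequisites can be sketched as a userspace C predicate. This is not code from the series: `sketch_can_map_pmd()`, the `SKETCH_*` constants and the 4K base page / 2MB PMD geometry are assumptions made for the example. Architecture support and page-table deposit allocation are not modelled. The fault-handler check mirrors the restriction described in the v4 notes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed geometry: 4K base pages and 2MB PMD leaves. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PMD_SIZE   (UINT64_C(1) << 21)
#define SKETCH_PMD_MASK   (SKETCH_PMD_SIZE - 1)

/*
 * Hypothetical helper: decide whether the range [addr, end) starting at
 * page frame 'pfn' could be covered by one PMD-sized mapping.
 */
static inline bool sketch_can_map_pmd(uint64_t addr, uint64_t end,
                                      uint64_t pfn, bool has_fault_handler)
{
    uint64_t phys = pfn << SKETCH_PAGE_SHIFT;

    assert(addr <= end);

    /* v4 restriction: no PMD mappings for VMAs with a fault handler. */
    if (has_fault_handler)
        return false;
    /* Size: at least one full PMD must remain in the range. */
    if (end - addr < SKETCH_PMD_SIZE)
        return false;
    /* Alignment: both the virtual and the physical address must be PMD-aligned. */
    if ((addr & SKETCH_PMD_MASK) || (phys & SKETCH_PMD_MASK))
        return false;
    return true;
}
```

Under these assumptions, a 2MB-aligned range backed by a 2MB-aligned PFN qualifies, while the same range shifted by one 4K page does not and would fall back to normal page mappings.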
TODO
====
- Add PUD-level huge page support. Currently, only PMD-level huge
pages are supported.
Tests Done
==========
- Cross-build tests.
- Core MM regression tests:
  - Booted an x86 kernel with `debug_pagealloc=on` to heavily stress the
    large page splitting logic in the direct mapping. No panics observed.
  - Ran `make -C tools/testing/selftests/mm run_tests`. Both THP and
    hugetlbfs tests passed, indicating that the `pfn_pte()` changes
    do not interfere with native huge page generation.
- Functional tests (with a custom device driver and PTDUMP):
  - Verified that `remap_pfn_range()` creates 2MB mappings
    by observing `/sys/kernel/debug/page_tables/current_user`.
  - Triggered PMD splits via 4K-granular `mprotect()` and partial `munmap()`,
    verifying correct fallback to 512 PTEs without corrupting permissions
    or causing kernel crashes.
  - Triggered `fork()`/`clone()` on the mapped regions, validating the
    syzbot fix and ensuring a safe pgtable deposit/withdraw lifecycle.
- Performance tests with a custom device driver implementing mmap()
  with remap_pfn_range():
  - lat_mem_rd benchmark modified to use mmap(device_fd) instead of
    malloc() shows a 32% to 40% improvement in memory access latency with
    huge page support compared to normal page mappings.

    numactl -C 0 lat_mem_rd -t 4096M (stride=64)

    Memory Size (MB)  Without Huge Mapping  With Huge Mapping  Improvement
    ----------------  --------------------  -----------------  -----------
               64.00            148.858 ns         100.780 ns        32.3%
              128.00            164.745 ns         103.537 ns        37.2%
              256.00            169.907 ns         103.179 ns        39.3%
              512.00            171.285 ns         103.072 ns        39.8%
             1024.00            173.054 ns         103.055 ns        40.4%
             2048.00            172.820 ns         103.091 ns        40.3%
             4096.00            172.877 ns         103.115 ns        40.4%

  - Custom memory copy operations on mmap(device_fd) show around 18%
    performance improvement with huge page support compared to normal
    page mappings.

    numactl -C 0 memcpy_test (memory copy performance test)

    Memory Size (MB)  Without Huge Mapping  With Huge Mapping  Improvement
    ----------------  --------------------  -----------------  -----------
             1024.00              95.76 ms           77.91 ms        18.6%
             2048.00             190.87 ms          155.64 ms        18.5%
             4096.00             380.84 ms          311.45 ms        18.2%
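The Improvement columns in both tables are consistent with a relative reduction, (without - with) / without. A minimal C sketch of that arithmetic follows. The helper name `sketch_improvement_pct()` is hypothetical and is not part of the benchmarks or the series.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative reduction, in percent, of a latency or
 * run time measured without and with huge mappings.
 */
static inline double sketch_improvement_pct(double without_huge,
                                            double with_huge)
{
    assert(without_huge > 0.0);
    return (without_huge - with_huge) / without_huge * 100.0;
}
```

For the 64 MB lat_mem_rd row, `sketch_improvement_pct(148.858, 100.780)` evaluates to about 32.3, matching the table.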
[0] https://lore.kernel.org/all/20240826204353.2228736-2-peterx@redhat.com/T/#u
Yin Tirui (7):
  x86/mm: use PTE-level pgprot for huge PFN helpers
  arm64/mm: use PTE-level pgprot for huge PFN helpers
  powerpc/mm: use PTE-level pgprot for huge PFN helpers
  mm/huge_memory: refactor copy_huge_pmd()
  mm/huge_memory: refactor __split_huge_pmd_locked()
  mm/huge_memory: make move_huge_pmd() use has_deposited_pgtable()
  mm: add PMD-level PFNMAP support for remap_pfn_range()

 arch/arm64/include/asm/pgtable.h             |  48 +-
 arch/arm64/mm/mmu.c                          |   4 +-
 arch/powerpc/include/asm/book3s/64/pgtable.h |   5 +-
 arch/powerpc/include/asm/pgtable.h           |  11 +-
 arch/powerpc/mm/book3s64/pgtable.c           |  11 +-
 arch/x86/include/asm/pgtable.h               |  68 ++-
 arch/x86/include/asm/pgtable_types.h         |  12 +-
 arch/x86/mm/init_32.c                        |   8 +-
 arch/x86/mm/init_64.c                        |  30 +-
 arch/x86/mm/pat/set_memory.c                 |  51 +--
 arch/x86/mm/pgtable.c                        |   8 +-
 arch/x86/power/hibernate_32.c                |   6 +-
 mm/huge_memory.c                             | 440 +++++++++++--------
 mm/internal.h                                |  21 +
 mm/memory.c                                  |  87 +++-
 15 files changed, 493 insertions(+), 317 deletions(-)
--
2.43.0