* [PATCH v3 00/11] mm: vm_normal_page*() improvements
@ 2025-08-11 11:26 David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
` (10 more replies)
0 siblings, 11 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
Based on mm/mm-new from today.
Clean up and unify vm_normal_page_*() handling, also marking the
huge zero folio as special in the PMD. Add and use vm_normal_page_pud() and
clean up the XEN vm_ops->find_special_page mechanism.
There are plans to use vm_normal_page_*() more widely soon.
Briefly tested on UML (making sure vm_normal_page() still works as expected
without pte_special() support) and on x86-64 with a bunch of tests.
Cross-compiled for a variety of weird archs.
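For orientation, a minimal caller-side sketch of the pattern these helpers
serve (the surrounding walker and names are illustrative; the PMD/PUD
variants follow the same contract):

#include <linux/mm.h>

/*
 * Map a present leaf entry back to its folio. A NULL return means the
 * mapping is "special": shared/huge zero folio, pte_special() mapping,
 * or a raw PFN mapping without a usable "struct page".
 */
static struct folio *leaf_to_folio(struct vm_area_struct *vma,
				   unsigned long addr, pte_t pte)
{
	struct page *page = vm_normal_page(vma, addr, pte);

	return page ? page_folio(page) : NULL;
}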
v2 -> v3:
* "mm/huge_memory: mark PMD mappings of the huge zero folio special"
-> Extend vm_normal_page_pmd() comment + patch description
-> Take care of copy_huge_pmd() checking for pmd_special().
* "powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel""
-> Added
* "mm/rmap: convert "enum rmap_level" to "enum pgtable_level""
-> Added
* "mm/memory: convert print_bad_pte() to print_bad_page_map()"
-> Consume level so we can keep the level indication through
pgtable_level_to_str().
-> Improve locking comments
* "mm/memory: factor out common code from vm_normal_page_*()"
-> Factor everything out into __vm_normal_page() and let it consume the
special bit + pfn (and the value+level for error reporting purposes)
-> Improve function docs
-> Improve patch description
v1 -> v2:
* "mm/memory: convert print_bad_pte() to print_bad_page_map()"
-> Don't use pgdp_get(), because it's broken on some arm configs
-> Extend patch description
-> Don't use pmd_val(pmdp_get()), because that doesn't work on some
m68k configs
* Added RBs
RFC -> v1:
* Dropped the highest_memmap_pfn removal stuff and instead added
"mm/memory: convert print_bad_pte() to print_bad_page_map()"
* Dropped "mm: compare pfns only if the entry is present when inserting
pfns/pages" for now, will probably clean that up separately.
* Dropped "mm: remove "horrible special case to handle copy-on-write
behaviour"", and "mm: drop addr parameter from vm_normal_*_pmd()" will
require more thought
* "mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()"
-> Extend patch description.
* "fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio"
-> Extend patch description.
* "mm/huge_memory: mark PMD mappings of the huge zero folio special"
-> Remove comment from vm_normal_page_pmd().
* "mm/memory: factor out common code from vm_normal_page_*()"
-> Adjust to print_bad_page_map()/highest_memmap_pfn changes.
-> Add proper kernel doc to all involved functions
* "mm: introduce and use vm_normal_page_pud()"
-> Adjust to print_bad_page_map() changes.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lance Yang <lance.yang@linux.dev>
David Hildenbrand (11):
mm/huge_memory: move more common code into insert_pmd()
mm/huge_memory: move more common code into insert_pud()
mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
mm/huge_memory: mark PMD mappings of the huge zero folio special
powerpc/ptdump: rename "struct pgtable_level" to "struct
ptdump_pglevel"
mm/rmap: convert "enum rmap_level" to "enum pgtable_level"
mm/memory: convert print_bad_pte() to print_bad_page_map()
mm/memory: factor out common code from vm_normal_page_*()
mm: introduce and use vm_normal_page_pud()
mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
arch/powerpc/mm/ptdump/8xx.c | 2 +-
arch/powerpc/mm/ptdump/book3s64.c | 2 +-
arch/powerpc/mm/ptdump/ptdump.h | 4 +-
arch/powerpc/mm/ptdump/shared.c | 2 +-
drivers/xen/Kconfig | 1 +
drivers/xen/gntdev.c | 5 +-
fs/dax.c | 47 +----
include/linux/mm.h | 20 +-
include/linux/pgtable.h | 27 +++
include/linux/rmap.h | 60 +++---
mm/Kconfig | 2 +
mm/huge_memory.c | 122 +++++------
mm/memory.c | 332 +++++++++++++++++++++---------
mm/pagewalk.c | 20 +-
mm/rmap.c | 56 ++---
tools/testing/vma/vma_internal.h | 18 +-
16 files changed, 421 insertions(+), 299 deletions(-)
base-commit: 53c448023185717d0ed56b5546dc2be405da92ff
--
2.50.1
* [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 4:52 ` Lance Yang
2025-08-11 11:26 ` [PATCH v3 02/11] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
` (9 subsequent siblings)
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Alistair Popple, Wei Yang
Let's clean it all up further.
No functional change intended.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
1 file changed, 24 insertions(+), 48 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d2..5314a89d676f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1379,15 +1379,25 @@ struct folio_or_pfn {
bool is_folio;
};
-static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
- bool write, pgtable_t pgtable)
+ bool write)
{
struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable = NULL;
+ spinlock_t *ptl;
pmd_t entry;
- lockdep_assert_held(pmd_lockptr(mm, pmd));
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+
+ ptl = pmd_lock(mm, pmd);
if (!pmd_none(*pmd)) {
const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
fop.pfn;
@@ -1395,15 +1405,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (pmd_pfn(*pmd) != pfn) {
WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
- return -EEXIST;
+ goto out_unlock;
}
entry = pmd_mkyoung(*pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
update_mmu_cache_pmd(vma, addr, pmd);
}
-
- return -EEXIST;
+ goto out_unlock;
}
if (fop.is_folio) {
@@ -1424,11 +1433,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (pgtable) {
pgtable_trans_huge_deposit(mm, pmd, pgtable);
mm_inc_nr_ptes(mm);
+ pgtable = NULL;
}
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
- return 0;
+
+out_unlock:
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(mm, pgtable);
+ return VM_FAULT_NOPAGE;
}
/**
@@ -1450,9 +1465,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
struct folio_or_pfn fop = {
.pfn = pfn,
};
- pgtable_t pgtable = NULL;
- spinlock_t *ptl;
- int error;
/*
* If we had pmd_special, we could avoid all these restrictions,
@@ -1464,25 +1476,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
-
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
pfnmap_setup_cachemode_pfn(pfn, &pgprot);
- ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
- pgtable);
- spin_unlock(ptl);
- if (error && pgtable)
- pte_free(vma->vm_mm, pgtable);
-
- return VM_FAULT_NOPAGE;
+ return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
@@ -1491,35 +1487,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
{
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address & PMD_MASK;
- struct mm_struct *mm = vma->vm_mm;
struct folio_or_pfn fop = {
.folio = folio,
.is_folio = true,
};
- spinlock_t *ptl;
- pgtable_t pgtable = NULL;
- int error;
-
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
return VM_FAULT_SIGBUS;
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
- ptl = pmd_lock(mm, vmf->pmd);
- error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
- write, pgtable);
- spin_unlock(ptl);
- if (error && pgtable)
- pte_free(mm, pgtable);
-
- return VM_FAULT_NOPAGE;
+ return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
--
2.50.1
* [PATCH v3 02/11] mm/huge_memory: move more common code into insert_pud()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 03/11] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
` (8 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Alistair Popple, Wei Yang
Let's clean it all up further.
No functional change intended.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5314a89d676f1..7933791b75f4d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1507,25 +1507,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
return pud;
}
-static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pud_t entry;
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
fop.pfn;
if (write) {
if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
- return;
+ goto out_unlock;
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
update_mmu_cache_pud(vma, addr, pud);
}
- return;
+ goto out_unlock;
}
if (fop.is_folio) {
@@ -1544,6 +1549,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
}
set_pud_at(mm, addr, pud, entry);
update_mmu_cache_pud(vma, addr, pud);
+out_unlock:
+ spin_unlock(ptl);
+ return VM_FAULT_NOPAGE;
}
/**
@@ -1565,7 +1573,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
struct folio_or_pfn fop = {
.pfn = pfn,
};
- spinlock_t *ptl;
/*
* If we had pud_special, we could avoid all these restrictions,
@@ -1577,16 +1584,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
-
pfnmap_setup_cachemode_pfn(pfn, &pgprot);
- ptl = pud_lock(vma->vm_mm, vmf->pud);
- insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
- spin_unlock(ptl);
-
- return VM_FAULT_NOPAGE;
+ return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
@@ -1603,25 +1603,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
{
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address & PUD_MASK;
- pud_t *pud = vmf->pud;
- struct mm_struct *mm = vma->vm_mm;
struct folio_or_pfn fop = {
.folio = folio,
.is_folio = true,
};
- spinlock_t *ptl;
-
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
return VM_FAULT_SIGBUS;
- ptl = pud_lock(mm, pud);
- insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
- spin_unlock(ptl);
-
- return VM_FAULT_NOPAGE;
+ return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
--
2.50.1
* [PATCH v3 03/11] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 02/11] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 04/11] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
` (7 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Wei Yang
Just like we do for vmf_insert_page_mkwrite() -> ... ->
insert_page_into_pte_locked() with the shared zeropage, support the
huge zero folio in vmf_insert_folio_pmd().
When (un)mapping the huge zero folio in page tables, we neither
adjust the refcount nor the mapcount, just like for the shared zeropage.
For now, the huge zero folio is not marked as special yet, although
vm_normal_page_pmd() really wants to treat it as special. We'll change
that next.
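As a usage sketch (variable names assumed from a fault handler; this mirrors
what the next patch does in fs/dax), the huge zero folio can now be handed to
vmf_insert_folio_pmd() like any other PMD-sized folio:

	struct folio *zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
	vm_fault_t ret;

	if (!zero_folio)
		return VM_FAULT_FALLBACK;	/* no huge zero folio available */

	/* Neither refcount nor mapcount of the huge zero folio is touched. */
	ret = vmf_insert_folio_pmd(vmf, zero_folio, /* write= */ false);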
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7933791b75f4d..ec89e0607424e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1418,9 +1418,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (fop.is_folio) {
entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
- folio_get(fop.folio);
- folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
- add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+ if (!is_huge_zero_folio(fop.folio)) {
+ folio_get(fop.folio);
+ folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
+ add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+ }
} else {
entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
entry = pmd_mkspecial(entry);
--
2.50.1
* [PATCH v3 04/11] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (2 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 03/11] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
` (6 subsequent siblings)
10 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Alistair Popple
Let's convert to vmf_insert_folio_pmd().
There is a theoretical change in behavior: in the unlikely case there is
already something mapped, we'll now still call trace_dax_pmd_load_hole()
and return VM_FAULT_NOPAGE.
Previously, we would have returned VM_FAULT_FALLBACK, and the caller
would have zapped the PMD to try a PTE fault.
However, that behavior was different from other PTE+PMD faults where there
would already be something mapped, and it's not even clear whether it could
be triggered.
Assuming the huge zero folio is already mapped, all good, no need to
fallback to PTEs.
Assuming there is already a leaf page table ... the behavior would be
just like when trying to insert a PMD mapping a folio through
dax_fault_iter()->vmf_insert_folio_pmd().
Assuming there is already something else mapped as PMD? It sounds like
a BUG, and the behavior would be just like when trying to insert a PMD
mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
So, it sounds reasonable not to handle huge zero folios differently
from inserting PMDs mapping folios when there is already something mapped.
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
fs/dax.c | 47 ++++++++++-------------------------------------
1 file changed, 10 insertions(+), 37 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4229513806bea..ae90706674a3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
const struct iomap_iter *iter, void **entry)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
- unsigned long pmd_addr = vmf->address & PMD_MASK;
- struct vm_area_struct *vma = vmf->vma;
struct inode *inode = mapping->host;
- pgtable_t pgtable = NULL;
struct folio *zero_folio;
- spinlock_t *ptl;
- pmd_t pmd_entry;
- unsigned long pfn;
+ vm_fault_t ret;
zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
- if (unlikely(!zero_folio))
- goto fallback;
-
- pfn = page_to_pfn(&zero_folio->page);
- *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE);
-
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
- ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
- if (!pmd_none(*(vmf->pmd))) {
- spin_unlock(ptl);
- goto fallback;
+ if (unlikely(!zero_folio)) {
+ trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
+ return VM_FAULT_FALLBACK;
}
- if (pgtable) {
- pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
- mm_inc_nr_ptes(vma->vm_mm);
- }
- pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
- set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
- spin_unlock(ptl);
- trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
- return VM_FAULT_NOPAGE;
+ *entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
+ DAX_PMD | DAX_ZERO_PAGE);
-fallback:
- if (pgtable)
- pte_free(vma->vm_mm, pgtable);
- trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
- return VM_FAULT_FALLBACK;
+ ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
+ if (ret == VM_FAULT_NOPAGE)
+ trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
+ return ret;
}
#else
static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
--
2.50.1
* [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (3 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 04/11] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 18:14 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel" David Hildenbrand
` (5 subsequent siblings)
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
The huge zero folio is refcounted (+mapcounted -- is that a word?)
differently from "normal" folios, similarly (but not identically) to the
ordinary shared zeropage.
For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to
still use them (e.g., GUP can still take a reference on them).
vm_normal_page_pmd() already filters out the huge zero folio, to
indicate it is special (returning NULL). However, so far we are not making
use of pmd_special() on architectures that support it
(CONFIG_ARCH_HAS_PTE_SPECIAL), like we would with the ordinary shared
zeropage.
Let's mark PMD mappings of the huge zero folio similarly as special, so we
can avoid the manual check for the huge zero folio with
CONFIG_ARCH_HAS_PTE_SPECIAL next, and only perform the check on
!CONFIG_ARCH_HAS_PTE_SPECIAL.
In copy_huge_pmd(), where we have a manual pmd_special() check to handle
PFNMAP, we have to manually rule out the huge zero folio. That code
needs a serious cleanup, but that's something for another day.
While at it, update the doc regarding the shared zero folios.
No functional change intended: vm_normal_page_pmd() still returns NULL
when it encounters the huge zero folio.
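A minimal sketch of the resulting logic (hypothetical helper; the real check
is open-coded in vm_normal_page_pmd() and friends):

#include <linux/mm.h>
#include <linux/huge_mm.h>

static inline bool pmd_maps_normal_candidate(pmd_t pmd)
{
	if (pmd_special(pmd))
		return false;	/* PFN maps and, now, the huge zero folio */
#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
	if (is_huge_zero_pmd(pmd))
		return false;	/* explicit check still needed here */
#endif
	return true;
}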
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 8 ++++++--
mm/memory.c | 15 ++++++++++-----
2 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec89e0607424e..58bac83e7fa31 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1309,6 +1309,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
{
pmd_t entry;
entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+ entry = pmd_mkspecial(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
mm_inc_nr_ptes(mm);
@@ -1418,7 +1419,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (fop.is_folio) {
entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
- if (!is_huge_zero_folio(fop.folio)) {
+ if (is_huge_zero_folio(fop.folio)) {
+ entry = pmd_mkspecial(entry);
+ } else {
folio_get(fop.folio);
folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
@@ -1643,7 +1646,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
int ret = -ENOMEM;
pmd = pmdp_get_lockless(src_pmd);
- if (unlikely(pmd_present(pmd) && pmd_special(pmd))) {
+ if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
+ !is_huge_zero_pmd(pmd))) {
dst_ptl = pmd_lock(dst_mm, dst_pmd);
src_ptl = pmd_lockptr(src_mm, src_pmd);
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b718471..626caedce35e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -555,7 +555,14 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted: neither the refcount nor the mapcount
+ * of these folios is adjusted when mapping them into user page tables.
+ * Selected page table walkers (such as GUP) can still identify mappings of the
+ * shared zero folios and work with the underlying "struct page".
*
* There are 2 broad cases. Firstly, an architecture may define a pte_special()
* pte bit, in which case this function is trivial. Secondly, an architecture
@@ -585,9 +592,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
*
* VM_MIXEDMAP mappings can likewise contain memory with or without "struct
* page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
*
* The disadvantage is that pages are refcounted (which can be slower and
* simply not an option for some PFNMAP users). The advantage is that we
@@ -667,7 +673,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
{
unsigned long pfn = pmd_pfn(pmd);
- /* Currently it's only used for huge pfnmaps */
if (unlikely(pmd_special(pmd)))
return NULL;
--
2.50.1
* [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (4 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 18:23 ` Lorenzo Stoakes
2025-08-26 16:28 ` Ritesh Harjani
2025-08-11 11:26 ` [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level" David Hildenbrand
` (4 subsequent siblings)
10 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
We want to make use of "pgtable_level" for an enum in core-mm. Other
architectures seem to call the equivalent structure either:
* "struct pg_level" when not exposed in a header (riscv, arm)
* "struct ptdump_pg_level" when exposed in a header (arm64)
So let's follow what arm64 does.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
arch/powerpc/mm/ptdump/8xx.c | 2 +-
arch/powerpc/mm/ptdump/book3s64.c | 2 +-
arch/powerpc/mm/ptdump/ptdump.h | 4 ++--
arch/powerpc/mm/ptdump/shared.c | 2 +-
4 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
index b5c79b11ea3c2..4ca9cf7a90c9e 100644
--- a/arch/powerpc/mm/ptdump/8xx.c
+++ b/arch/powerpc/mm/ptdump/8xx.c
@@ -69,7 +69,7 @@ static const struct flag_info flag_array[] = {
}
};
-struct pgtable_level pg_level[5] = {
+struct ptdump_pg_level pg_level[5] = {
{ /* pgd */
.flag = flag_array,
.num = ARRAY_SIZE(flag_array),
diff --git a/arch/powerpc/mm/ptdump/book3s64.c b/arch/powerpc/mm/ptdump/book3s64.c
index 5ad92d9dc5d10..6b2da9241d4c4 100644
--- a/arch/powerpc/mm/ptdump/book3s64.c
+++ b/arch/powerpc/mm/ptdump/book3s64.c
@@ -102,7 +102,7 @@ static const struct flag_info flag_array[] = {
}
};
-struct pgtable_level pg_level[5] = {
+struct ptdump_pg_level pg_level[5] = {
{ /* pgd */
.flag = flag_array,
.num = ARRAY_SIZE(flag_array),
diff --git a/arch/powerpc/mm/ptdump/ptdump.h b/arch/powerpc/mm/ptdump/ptdump.h
index 154efae96ae09..4232aa4b57eae 100644
--- a/arch/powerpc/mm/ptdump/ptdump.h
+++ b/arch/powerpc/mm/ptdump/ptdump.h
@@ -11,12 +11,12 @@ struct flag_info {
int shift;
};
-struct pgtable_level {
+struct ptdump_pg_level {
const struct flag_info *flag;
size_t num;
u64 mask;
};
-extern struct pgtable_level pg_level[5];
+extern struct ptdump_pg_level pg_level[5];
void pt_dump_size(struct seq_file *m, unsigned long delta);
diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c
index 39c30c62b7ea7..58998960eb9a4 100644
--- a/arch/powerpc/mm/ptdump/shared.c
+++ b/arch/powerpc/mm/ptdump/shared.c
@@ -67,7 +67,7 @@ static const struct flag_info flag_array[] = {
}
};
-struct pgtable_level pg_level[5] = {
+struct ptdump_pg_level pg_level[5] = {
{ /* pgd */
.flag = flag_array,
.num = ARRAY_SIZE(flag_array),
--
2.50.1
* [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level"
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (5 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel" David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 18:33 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
` (3 subsequent siblings)
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
Let's factor the level enum out into core-mm and convert all checks for
unsupported levels to BUILD_BUG(). The code is written such that
force-inlining will optimize out the level checks.
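The pattern looks roughly like this (hypothetical helper for illustration):
because the users are __always_inline and pass a compile-time-constant level,
the switch collapses to a single case, and BUILD_BUG() turns any unsupported
level into a build failure instead of a runtime warning:

#include <linux/build_bug.h>
#include <linux/pgtable.h>

static __always_inline unsigned int pgtable_level_entries(enum pgtable_level level)
{
	switch (level) {
	case PGTABLE_LEVEL_PTE:
		return PTRS_PER_PTE;
	case PGTABLE_LEVEL_PMD:
		return PTRS_PER_PMD;
	case PGTABLE_LEVEL_PUD:
		return PTRS_PER_PUD;
	default:
		/* Unreachable when inlined with a constant level. */
		BUILD_BUG();
	}
	return 0;
}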
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/pgtable.h | 8 ++++++
include/linux/rmap.h | 60 +++++++++++++++++++----------------------
mm/rmap.c | 56 +++++++++++++++++++++-----------------
3 files changed, 66 insertions(+), 58 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 4c035637eeb77..bff5c4241bf2e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1958,6 +1958,14 @@ static inline bool arch_has_pfn_modify_check(void)
/* Page-Table Modification Mask */
typedef unsigned int pgtbl_mod_mask;
+enum pgtable_level {
+ PGTABLE_LEVEL_PTE = 0,
+ PGTABLE_LEVEL_PMD,
+ PGTABLE_LEVEL_PUD,
+ PGTABLE_LEVEL_P4D,
+ PGTABLE_LEVEL_PGD,
+};
+
#endif /* !__ASSEMBLY__ */
#if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6cd020eea37a2..9d40d127bdb78 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -394,18 +394,8 @@ typedef int __bitwise rmap_t;
/* The anonymous (sub)page is exclusive to a single process. */
#define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0))
-/*
- * Internally, we're using an enum to specify the granularity. We make the
- * compiler emit specialized code for each granularity.
- */
-enum rmap_level {
- RMAP_LEVEL_PTE = 0,
- RMAP_LEVEL_PMD,
- RMAP_LEVEL_PUD,
-};
-
static inline void __folio_rmap_sanity_checks(const struct folio *folio,
- const struct page *page, int nr_pages, enum rmap_level level)
+ const struct page *page, int nr_pages, enum pgtable_level level)
{
/* hugetlb folios are handled separately. */
VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
@@ -427,18 +417,18 @@ static inline void __folio_rmap_sanity_checks(const struct folio *folio,
VM_WARN_ON_FOLIO(page_folio(page + nr_pages - 1) != folio, folio);
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
break;
- case RMAP_LEVEL_PMD:
+ case PGTABLE_LEVEL_PMD:
/*
* We don't support folios larger than a single PMD yet. So
- * when RMAP_LEVEL_PMD is set, we assume that we are creating
+ * when PGTABLE_LEVEL_PMD is set, we assume that we are creating
* a single "entire" mapping of the folio.
*/
VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
break;
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PUD:
/*
* Assume that we are creating a single "entire" mapping of the
* folio.
@@ -447,7 +437,7 @@ static inline void __folio_rmap_sanity_checks(const struct folio *folio,
VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
break;
default:
- VM_WARN_ON_ONCE(true);
+ BUILD_BUG();
}
/*
@@ -567,14 +557,14 @@ static inline void hugetlb_remove_rmap(struct folio *folio)
static __always_inline void __folio_dup_file_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
- enum rmap_level level)
+ enum pgtable_level level)
{
const int orig_nr_pages = nr_pages;
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
if (!folio_test_large(folio)) {
atomic_inc(&folio->_mapcount);
break;
@@ -587,11 +577,13 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
}
folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
break;
- case RMAP_LEVEL_PMD:
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PMD:
+ case PGTABLE_LEVEL_PUD:
atomic_inc(&folio->_entire_mapcount);
folio_inc_large_mapcount(folio, dst_vma);
break;
+ default:
+ BUILD_BUG();
}
}
@@ -609,13 +601,13 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
static inline void folio_dup_file_rmap_ptes(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *dst_vma)
{
- __folio_dup_file_rmap(folio, page, nr_pages, dst_vma, RMAP_LEVEL_PTE);
+ __folio_dup_file_rmap(folio, page, nr_pages, dst_vma, PGTABLE_LEVEL_PTE);
}
static __always_inline void folio_dup_file_rmap_pte(struct folio *folio,
struct page *page, struct vm_area_struct *dst_vma)
{
- __folio_dup_file_rmap(folio, page, 1, dst_vma, RMAP_LEVEL_PTE);
+ __folio_dup_file_rmap(folio, page, 1, dst_vma, PGTABLE_LEVEL_PTE);
}
/**
@@ -632,7 +624,7 @@ static inline void folio_dup_file_rmap_pmd(struct folio *folio,
struct page *page, struct vm_area_struct *dst_vma)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- __folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, RMAP_LEVEL_PTE);
+ __folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, PGTABLE_LEVEL_PTE);
#else
WARN_ON_ONCE(true);
#endif
@@ -640,7 +632,7 @@ static inline void folio_dup_file_rmap_pmd(struct folio *folio,
static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
- struct vm_area_struct *src_vma, enum rmap_level level)
+ struct vm_area_struct *src_vma, enum pgtable_level level)
{
const int orig_nr_pages = nr_pages;
bool maybe_pinned;
@@ -665,7 +657,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
* copying if the folio maybe pinned.
*/
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
if (unlikely(maybe_pinned)) {
for (i = 0; i < nr_pages; i++)
if (PageAnonExclusive(page + i))
@@ -687,8 +679,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
} while (page++, --nr_pages > 0);
folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
break;
- case RMAP_LEVEL_PMD:
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PMD:
+ case PGTABLE_LEVEL_PUD:
if (PageAnonExclusive(page)) {
if (unlikely(maybe_pinned))
return -EBUSY;
@@ -697,6 +689,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
atomic_inc(&folio->_entire_mapcount);
folio_inc_large_mapcount(folio, dst_vma);
break;
+ default:
+ BUILD_BUG();
}
return 0;
}
@@ -730,7 +724,7 @@ static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio,
struct vm_area_struct *src_vma)
{
return __folio_try_dup_anon_rmap(folio, page, nr_pages, dst_vma,
- src_vma, RMAP_LEVEL_PTE);
+ src_vma, PGTABLE_LEVEL_PTE);
}
static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
@@ -738,7 +732,7 @@ static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
struct vm_area_struct *src_vma)
{
return __folio_try_dup_anon_rmap(folio, page, 1, dst_vma, src_vma,
- RMAP_LEVEL_PTE);
+ PGTABLE_LEVEL_PTE);
}
/**
@@ -770,7 +764,7 @@ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
return __folio_try_dup_anon_rmap(folio, page, HPAGE_PMD_NR, dst_vma,
- src_vma, RMAP_LEVEL_PMD);
+ src_vma, PGTABLE_LEVEL_PMD);
#else
WARN_ON_ONCE(true);
return -EBUSY;
@@ -778,7 +772,7 @@ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
}
static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
- struct page *page, int nr_pages, enum rmap_level level)
+ struct page *page, int nr_pages, enum pgtable_level level)
{
VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
@@ -873,7 +867,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
struct page *page)
{
- return __folio_try_share_anon_rmap(folio, page, 1, RMAP_LEVEL_PTE);
+ return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
}
/**
@@ -904,7 +898,7 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
return __folio_try_share_anon_rmap(folio, page, HPAGE_PMD_NR,
- RMAP_LEVEL_PMD);
+ PGTABLE_LEVEL_PMD);
#else
WARN_ON_ONCE(true);
return -EBUSY;
diff --git a/mm/rmap.c b/mm/rmap.c
index 84a8d8b02ef77..0e9c4041f8687 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1265,7 +1265,7 @@ static void __folio_mod_stat(struct folio *folio, int nr, int nr_pmdmapped)
static __always_inline void __folio_add_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
- enum rmap_level level)
+ enum pgtable_level level)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
const int orig_nr_pages = nr_pages;
@@ -1274,7 +1274,7 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
if (!folio_test_large(folio)) {
nr = atomic_inc_and_test(&folio->_mapcount);
break;
@@ -1300,11 +1300,11 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
folio_add_large_mapcount(folio, orig_nr_pages, vma);
break;
- case RMAP_LEVEL_PMD:
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PMD:
+ case PGTABLE_LEVEL_PUD:
first = atomic_inc_and_test(&folio->_entire_mapcount);
if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
- if (level == RMAP_LEVEL_PMD && first)
+ if (level == PGTABLE_LEVEL_PMD && first)
nr_pmdmapped = folio_large_nr_pages(folio);
nr = folio_inc_return_large_mapcount(folio, vma);
if (nr == 1)
@@ -1323,7 +1323,7 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
* We only track PMD mappings of PMD-sized
* folios separately.
*/
- if (level == RMAP_LEVEL_PMD)
+ if (level == PGTABLE_LEVEL_PMD)
nr_pmdmapped = nr_pages;
nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
/* Raced ahead of a remove and another add? */
@@ -1336,6 +1336,8 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
}
folio_inc_large_mapcount(folio, vma);
break;
+ default:
+ BUILD_BUG();
}
__folio_mod_stat(folio, nr, nr_pmdmapped);
}
@@ -1427,7 +1429,7 @@ static void __page_check_anon_rmap(const struct folio *folio,
static __always_inline void __folio_add_anon_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
- unsigned long address, rmap_t flags, enum rmap_level level)
+ unsigned long address, rmap_t flags, enum pgtable_level level)
{
int i;
@@ -1440,20 +1442,22 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
if (flags & RMAP_EXCLUSIVE) {
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
for (i = 0; i < nr_pages; i++)
SetPageAnonExclusive(page + i);
break;
- case RMAP_LEVEL_PMD:
+ case PGTABLE_LEVEL_PMD:
SetPageAnonExclusive(page);
break;
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PUD:
/*
* Keep the compiler happy, we don't support anonymous
* PUD mappings.
*/
WARN_ON_ONCE(1);
break;
+ default:
+ BUILD_BUG();
}
}
@@ -1507,7 +1511,7 @@ void folio_add_anon_rmap_ptes(struct folio *folio, struct page *page,
rmap_t flags)
{
__folio_add_anon_rmap(folio, page, nr_pages, vma, address, flags,
- RMAP_LEVEL_PTE);
+ PGTABLE_LEVEL_PTE);
}
/**
@@ -1528,7 +1532,7 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
__folio_add_anon_rmap(folio, page, HPAGE_PMD_NR, vma, address, flags,
- RMAP_LEVEL_PMD);
+ PGTABLE_LEVEL_PMD);
#else
WARN_ON_ONCE(true);
#endif
@@ -1609,7 +1613,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
static __always_inline void __folio_add_file_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
- enum rmap_level level)
+ enum pgtable_level level)
{
VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
@@ -1634,7 +1638,7 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
void folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
int nr_pages, struct vm_area_struct *vma)
{
- __folio_add_file_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
+ __folio_add_file_rmap(folio, page, nr_pages, vma, PGTABLE_LEVEL_PTE);
}
/**
@@ -1651,7 +1655,7 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
struct vm_area_struct *vma)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- __folio_add_file_rmap(folio, page, HPAGE_PMD_NR, vma, RMAP_LEVEL_PMD);
+ __folio_add_file_rmap(folio, page, HPAGE_PMD_NR, vma, PGTABLE_LEVEL_PMD);
#else
WARN_ON_ONCE(true);
#endif
@@ -1672,7 +1676,7 @@ void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+ __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, PGTABLE_LEVEL_PUD);
#else
WARN_ON_ONCE(true);
#endif
@@ -1680,7 +1684,7 @@ void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
static __always_inline void __folio_remove_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
- enum rmap_level level)
+ enum pgtable_level level)
{
atomic_t *mapped = &folio->_nr_pages_mapped;
int last = 0, nr = 0, nr_pmdmapped = 0;
@@ -1689,7 +1693,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
__folio_rmap_sanity_checks(folio, page, nr_pages, level);
switch (level) {
- case RMAP_LEVEL_PTE:
+ case PGTABLE_LEVEL_PTE:
if (!folio_test_large(folio)) {
nr = atomic_add_negative(-1, &folio->_mapcount);
break;
@@ -1719,11 +1723,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
partially_mapped = nr && atomic_read(mapped);
break;
- case RMAP_LEVEL_PMD:
- case RMAP_LEVEL_PUD:
+ case PGTABLE_LEVEL_PMD:
+ case PGTABLE_LEVEL_PUD:
if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
last = atomic_add_negative(-1, &folio->_entire_mapcount);
- if (level == RMAP_LEVEL_PMD && last)
+ if (level == PGTABLE_LEVEL_PMD && last)
nr_pmdmapped = folio_large_nr_pages(folio);
nr = folio_dec_return_large_mapcount(folio, vma);
if (!nr) {
@@ -1743,7 +1747,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
if (likely(nr < ENTIRELY_MAPPED)) {
nr_pages = folio_large_nr_pages(folio);
- if (level == RMAP_LEVEL_PMD)
+ if (level == PGTABLE_LEVEL_PMD)
nr_pmdmapped = nr_pages;
nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
/* Raced ahead of another remove and an add? */
@@ -1757,6 +1761,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
partially_mapped = nr && nr < nr_pmdmapped;
break;
+ default:
+ BUILD_BUG();
}
/*
@@ -1796,7 +1802,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
void folio_remove_rmap_ptes(struct folio *folio, struct page *page,
int nr_pages, struct vm_area_struct *vma)
{
- __folio_remove_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
+ __folio_remove_rmap(folio, page, nr_pages, vma, PGTABLE_LEVEL_PTE);
}
/**
@@ -1813,7 +1819,7 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
struct vm_area_struct *vma)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- __folio_remove_rmap(folio, page, HPAGE_PMD_NR, vma, RMAP_LEVEL_PMD);
+ __folio_remove_rmap(folio, page, HPAGE_PMD_NR, vma, PGTABLE_LEVEL_PMD);
#else
WARN_ON_ONCE(true);
#endif
@@ -1834,7 +1840,7 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+ __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, PGTABLE_LEVEL_PUD);
#else
WARN_ON_ONCE(true);
#endif
--
2.50.1
* [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (6 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level" David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 18:48 ` Lorenzo Stoakes
2025-08-25 12:31 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
` (2 subsequent siblings)
10 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
print_bad_pte() looks like something that should actually be a WARN
or similar, but historically it apparently has proven useful for detecting
page table corruption even on production systems -- report the issue and
keep the system running to make it easier to actually detect what is going
wrong (e.g., multiple such messages might shed some light).
As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
to take care of print_bad_pte() as well.
Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
implementation and renaming the function to print_bad_page_map().
Provide print_bad_pte() as a simple wrapper.
Document the implicit locking requirements for the page table re-walk.
To make the function a bit more readable, factor out the ratelimit check
into is_bad_page_map_ratelimited() and place the printing of page
table content into __print_bad_page_map_pgtable(). We'll now dump
information from each level in a single line, and just stop the table
walk once we hit something that is not a present page table.
The report will now look something like (dumping pgd to pmd values):
[ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
[ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
[ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
Not using pgdp_get(), because that does not work properly on some arm
configs where pgd_t is an array. Note that, for simplicity, we are dumping
all levels even when levels are folded.
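The calling convention, as a fragment (the condition is a placeholder for
whatever consistency check a real caller performs; locks as described above):

	/* mmap/vma/rmap lock prevents teardown, the PTE table lock is held */
	pte_t ptent = ptep_get(ptep);
	struct page *page = vm_normal_page(vma, addr, ptent);

	if (unlikely(entry_is_inconsistent))	/* placeholder condition */
		print_bad_pte(vma, addr, ptent, page);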
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/pgtable.h | 19 ++++++++
mm/memory.c | 104 ++++++++++++++++++++++++++++++++--------
2 files changed, 103 insertions(+), 20 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index bff5c4241bf2e..33c84b38b7ec6 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1966,6 +1966,25 @@ enum pgtable_level {
PGTABLE_LEVEL_PGD,
};
+static inline const char *pgtable_level_to_str(enum pgtable_level level)
+{
+ switch (level) {
+ case PGTABLE_LEVEL_PTE:
+ return "pte";
+ case PGTABLE_LEVEL_PMD:
+ return "pmd";
+ case PGTABLE_LEVEL_PUD:
+ return "pud";
+ case PGTABLE_LEVEL_P4D:
+ return "p4d";
+ case PGTABLE_LEVEL_PGD:
+ return "pgd";
+ default:
+ VM_WARN_ON_ONCE(1);
+ return "unknown";
+ }
+}
+
#endif /* !__ASSEMBLY__ */
#if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
diff --git a/mm/memory.c b/mm/memory.c
index 626caedce35e0..dc0107354d37b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -491,22 +491,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
add_mm_counter(mm, i, rss[i]);
}
-/*
- * This function is called to print an error when a bad pte
- * is found. For example, we might have a PFN-mapped pte in
- * a region that doesn't allow it.
- *
- * The calling function must still handle the error.
- */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, struct page *page)
+static bool is_bad_page_map_ratelimited(void)
{
- pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
- p4d_t *p4d = p4d_offset(pgd, addr);
- pud_t *pud = pud_offset(p4d, addr);
- pmd_t *pmd = pmd_offset(pud, addr);
- struct address_space *mapping;
- pgoff_t index;
static unsigned long resume;
static unsigned long nr_shown;
static unsigned long nr_unshown;
@@ -518,7 +504,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
if (nr_shown == 60) {
if (time_before(jiffies, resume)) {
nr_unshown++;
- return;
+ return true;
}
if (nr_unshown) {
pr_alert("BUG: Bad page map: %lu messages suppressed\n",
@@ -529,15 +515,91 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}
if (nr_shown++ == 0)
resume = jiffies + 60 * HZ;
+ return false;
+}
+
+static void __print_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
+{
+ unsigned long long pgdv, p4dv, pudv, pmdv;
+ p4d_t p4d, *p4dp;
+ pud_t pud, *pudp;
+ pmd_t pmd, *pmdp;
+ pgd_t *pgdp;
+
+ /*
+ * Although this looks like a fully lockless pgtable walk, it is not:
+ * see locking requirements for print_bad_page_map().
+ */
+ pgdp = pgd_offset(mm, addr);
+ pgdv = pgd_val(*pgdp);
+
+ if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
+ pr_alert("pgd:%08llx\n", pgdv);
+ return;
+ }
+
+ p4dp = p4d_offset(pgdp, addr);
+ p4d = p4dp_get(p4dp);
+ p4dv = p4d_val(p4d);
+
+ if (!p4d_present(p4d) || p4d_leaf(p4d)) {
+ pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
+ return;
+ }
+
+ pudp = pud_offset(p4dp, addr);
+ pud = pudp_get(pudp);
+ pudv = pud_val(pud);
+
+ if (!pud_present(pud) || pud_leaf(pud)) {
+ pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
+ return;
+ }
+
+ pmdp = pmd_offset(pudp, addr);
+ pmd = pmdp_get(pmdp);
+ pmdv = pmd_val(pmd);
+
+ /*
+ * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
+ * because the table should already be mapped by the caller and
+ * doing another map would be bad. print_bad_page_map() should
+ * already take care of printing the PTE.
+ */
+ pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
+ p4dv, pudv, pmdv);
+}
+
+/*
+ * This function is called to print an error when a bad page table entry (e.g.,
+ * corrupted page table entry) is found. For example, we might have a
+ * PFN-mapped pte in a region that doesn't allow it.
+ *
+ * The calling function must still handle the error.
+ *
+ * This function must be called during a proper page table walk, as it will
+ * re-walk the page table to dump information: the caller MUST prevent page
+ * table teardown (by holding mmap, vma or rmap lock) and MUST hold the leaf
+ * page table lock.
+ */
+static void print_bad_page_map(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long long entry, struct page *page,
+ enum pgtable_level level)
+{
+ struct address_space *mapping;
+ pgoff_t index;
+
+ if (is_bad_page_map_ratelimited())
+ return;
mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
index = linear_page_index(vma, addr);
- pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
- current->comm,
- (long long)pte_val(pte), (long long)pmd_val(*pmd));
+ pr_alert("BUG: Bad page map in process %s %s:%08llx", current->comm,
+ pgtable_level_to_str(level), entry);
+ __print_bad_page_map_pgtable(vma->vm_mm, addr);
if (page)
- dump_page(page, "bad pte");
+ dump_page(page, "bad page map");
pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
@@ -549,6 +611,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}
+#define print_bad_pte(vma, addr, pte, page) \
+ print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
--
2.50.1
* [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (7 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 19:06 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
Let's reduce the code duplication and factor out the non-pte/pmd related
magic into __vm_normal_page().
To keep it simpler, check the pfn against both zero folios, which
shouldn't really make a difference.
It's questionable whether we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
scenario in the PMD case in practice, but it doesn't really matter, as
it's now all unified in __vm_normal_page().
Add kerneldoc for all involved functions.
Note that, as a side effect, we now:
* Support the find_special_page thingy also for PMDs
* Don't check for is_huge_zero_pfn() anymore if we have
CONFIG_ARCH_HAS_PTE_SPECIAL and the PMD is not special. The
VM_WARN_ON_ONCE would catch any abuse.
No functional change intended.
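For readers skimming the thread, the net effect is that each per-level helper
becomes a thin wrapper around the shared helper; condensed from the diff below
(no new code, just the resulting call pattern):

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
		pte_t pte)
{
	return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte),
				pte_val(pte), PGTABLE_LEVEL_PTE);
}

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
		pmd_t pmd)
{
	return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
				pmd_val(pmd), PGTABLE_LEVEL_PMD);
}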
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 186 ++++++++++++++++++++++++++++++----------------------
1 file changed, 109 insertions(+), 77 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index dc0107354d37b..78af3f243cee7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -614,8 +614,14 @@ static void print_bad_page_map(struct vm_area_struct *vma,
#define print_bad_pte(vma, addr, pte, page) \
print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
-/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+/**
+ * __vm_normal_page() - Get the "struct page" associated with a page table entry.
+ * @vma: The VMA mapping the page table entry.
+ * @addr: The address where the page table entry is mapped.
+ * @pfn: The PFN stored in the page table entry.
+ * @special: Whether the page table entry is marked "special".
+ * @level: The page table level for error reporting purposes only.
+ * @entry: The page table entry value for error reporting purposes only.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
@@ -628,10 +634,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* Selected page table walkers (such as GUP) can still identify mappings of the
* shared zero folios and work with the underlying "struct page".
*
- * There are 2 broad cases. Firstly, an architecture may define a pte_special()
- * pte bit, in which case this function is trivial. Secondly, an architecture
- * may not have a spare pte bit, which requires a more complicated scheme,
- * described below.
+ * There are 2 broad cases. Firstly, an architecture may define a "special"
+ * page table entry bit, such as pte_special(), in which case this function is
+ * trivial. Secondly, an architecture may not have a spare page table
+ * entry bit, which requires a more complicated scheme, described below.
*
* A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
* special mapping (even if there are underlying and valid "struct pages").
@@ -664,63 +670,94 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* don't have to follow the strict linearity rule of PFNMAP mappings in
* order to support COWable mappings.
*
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
*/
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte)
+static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, bool special,
+ unsigned long long entry, enum pgtable_level level)
{
- unsigned long pfn = pte_pfn(pte);
-
if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
- if (likely(!pte_special(pte)))
- goto check_pfn;
- if (vma->vm_ops && vma->vm_ops->find_special_page)
- return vma->vm_ops->find_special_page(vma, addr);
- if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
- return NULL;
- if (is_zero_pfn(pfn))
- return NULL;
-
- print_bad_pte(vma, addr, pte, NULL);
- return NULL;
- }
-
- /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
-
- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
- if (!pfn_valid(pfn))
+ if (unlikely(special)) {
+ if (vma->vm_ops && vma->vm_ops->find_special_page)
+ return vma->vm_ops->find_special_page(vma, addr);
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
- if (is_zero_pfn(pfn))
- return NULL;
- goto out;
- } else {
- unsigned long off;
- off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
+ if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
return NULL;
+
+ print_bad_page_map(vma, addr, entry, NULL, level);
+ return NULL;
}
- }
+ /*
+ * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table
+ * mappings (incl. shared zero folios) are marked accordingly.
+ */
+ } else {
+ if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ /* If it has a "struct page", it's "normal". */
+ if (!pfn_valid(pfn))
+ return NULL;
+ } else {
+ unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (is_zero_pfn(pfn))
- return NULL;
+ /* Only CoW'ed anon folios are "normal". */
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
+ }
+
+ if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
+ return NULL;
+ }
-check_pfn:
if (unlikely(pfn > highest_memmap_pfn)) {
- print_bad_pte(vma, addr, pte, NULL);
+ /* Corrupted page table entry. */
+ print_bad_page_map(vma, addr, entry, NULL, level);
return NULL;
}
-
/*
* NOTE! We still have PageReserved() pages in the page tables.
- * eg. VDSO mappings can cause them to exist.
+ * For example, VDSO mappings can cause them to exist.
*/
-out:
- VM_WARN_ON_ONCE(is_zero_pfn(pfn));
+ VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
return pfn_to_page(pfn);
}
+/**
+ * vm_normal_page() - Get the "struct page" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct page" associated with a PTE. See __vm_normal_page()
+ * for details on "normal" and "special" mappings.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
+struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte)
+{
+ return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte),
+ pte_val(pte), PGTABLE_LEVEL_PTE);
+}
+
+/**
+ * vm_normal_folio() - Get the "struct folio" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct folio" associated with a PTE. See __vm_normal_page()
+ * for details on "normal" and "special" mappings.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
@@ -732,42 +769,37 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
}
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+/**
+ * vm_normal_page_pmd() - Get the "struct page" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct page" associated with a PMD. See __vm_normal_page()
+ * for details on "normal" and "special" mappings.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd)
{
- unsigned long pfn = pmd_pfn(pmd);
-
- if (unlikely(pmd_special(pmd)))
- return NULL;
-
- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
- if (!pfn_valid(pfn))
- return NULL;
- goto out;
- } else {
- unsigned long off;
- off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
- }
- }
-
- if (is_huge_zero_pfn(pfn))
- return NULL;
- if (unlikely(pfn > highest_memmap_pfn))
- return NULL;
-
- /*
- * NOTE! We still have PageReserved() pages in the page tables.
- * eg. VDSO mappings can cause them to exist.
- */
-out:
- return pfn_to_page(pfn);
+ return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
+ pmd_val(pmd), PGTABLE_LEVEL_PMD);
}
+/**
+ * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct folio" associated with a PMD. See __vm_normal_page()
+ * for details on "normal" and "special" mappings.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd)
{
--
2.50.1
* [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (8 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 19:38 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Wei Yang
Let's introduce vm_normal_page_pud(), which ends up being fairly simple
because of our new common helpers and there not being a PUD-sized zero
folio.
Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
structuring the code like the other (pmd/pte) cases. Defer
introducing vm_normal_folio_pud() until really used.
Note that we can so far get PUDs with hugetlb, daxfs and PFNMAP entries.
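As a purely illustrative sketch (not part of this patch, since there is no
user yet), a future vm_normal_folio_pud() would presumably just mirror
vm_normal_folio_pmd():

/* Hypothetical sketch only -- deliberately not introduced by this patch. */
struct folio *vm_normal_folio_pud(struct vm_area_struct *vma,
		unsigned long addr, pud_t pud)
{
	struct page *page = vm_normal_page_pud(vma, addr, pud);

	if (page)
		return page_folio(page);
	return NULL;
}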
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 2 ++
mm/memory.c | 19 +++++++++++++++++++
mm/pagewalk.c | 20 ++++++++++----------
3 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b626d1bacef52..8ca7d2fa71343 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2360,6 +2360,8 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t pud);
void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index 78af3f243cee7..6f806bf3cc994 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -809,6 +809,25 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
return page_folio(page);
return NULL;
}
+
+/**
+ * vm_normal_page_pud() - Get the "struct page" associated with a PUD
+ * @vma: The VMA mapping the @pud.
+ * @addr: The address where the @pud is mapped.
+ * @pud: The PUD.
+ *
+ * Get the "struct page" associated with a PUD. See __vm_normal_page()
+ * for details on "normal" and "special" mappings.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
+struct page *vm_normal_page_pud(struct vm_area_struct *vma,
+ unsigned long addr, pud_t pud)
+{
+ return __vm_normal_page(vma, addr, pud_pfn(pud), pud_special(pud),
+ pud_val(pud), PGTABLE_LEVEL_PUD);
+}
#endif
/**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 648038247a8d2..c6753d370ff4e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
fw->pudp = pudp;
fw->pud = pud;
- /*
- * TODO: FW_MIGRATION support for PUD migration entries
- * once there are relevant users.
- */
- if (!pud_present(pud) || pud_special(pud)) {
+ if (pud_none(pud)) {
spin_unlock(ptl);
goto not_found;
- } else if (!pud_leaf(pud)) {
+ } else if (pud_present(pud) && !pud_leaf(pud)) {
spin_unlock(ptl);
goto pmd_table;
+ } else if (pud_present(pud)) {
+ page = vm_normal_page_pud(vma, addr, pud);
+ if (page)
+ goto found;
}
/*
- * TODO: vm_normal_page_pud() will be handy once we want to
- * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+ * TODO: FW_MIGRATION support for PUD migration entries
+ * once there are relevant users.
*/
- page = pud_page(pud);
- goto found;
+ spin_unlock(ptl);
+ goto not_found;
}
pmd_table:
--
2.50.1
* [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
` (9 preceding siblings ...)
2025-08-11 11:26 ` [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-08-11 11:26 ` David Hildenbrand
2025-08-12 19:43 ` Lorenzo Stoakes
10 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-11 11:26 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, David Vrabel, Wei Yang
... and hide it behind a kconfig option. There is really no need for
any !xen code to perform this check.
The naming is a bit off: we want to find the "normal" page when a PTE
was marked "special". So it's really not about "finding a special page".
Improve the documentation, and add a comment in the code where XEN ends
up performing the pte_mkspecial() through a hypercall. More details can
be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
special on x86 PV guests").
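Condensed from the hunks below, the gntdev conversion boils down to selecting
the new Kconfig symbol and providing the renamed callback; the elided parts
("...") are unchanged driver code, and no other driver should need this:

# drivers/xen/Kconfig
config XEN_GNTDEV
	...
	select FIND_NORMAL_PAGE

/* drivers/xen/gntdev.c */
static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
		unsigned long addr)
{
	struct gntdev_grant_map *map = vma->vm_private_data;
	...
}

static const struct vm_operations_struct gntdev_vmops = {
	...
	.find_normal_page = gntdev_vma_find_normal_page,
};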
Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
drivers/xen/Kconfig | 1 +
drivers/xen/gntdev.c | 5 +++--
include/linux/mm.h | 18 +++++++++++++-----
mm/Kconfig | 2 ++
mm/memory.c | 12 ++++++++++--
tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
6 files changed, 42 insertions(+), 14 deletions(-)
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 24f485827e039..f9a35ed266ecf 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -138,6 +138,7 @@ config XEN_GNTDEV
depends on XEN
default m
select MMU_NOTIFIER
+ select FIND_NORMAL_PAGE
help
Allows userspace processes to use grants.
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 1f21607656182..26f13b37c78e6 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -321,6 +321,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
BUG_ON(pgnr >= map->count);
pte_maddr = arbitrary_virt_to_machine(pte).maddr;
+ /* Note: this will perform a pte_mkspecial() through the hypercall. */
gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
map->grants[pgnr].ref,
map->grants[pgnr].domid);
@@ -528,7 +529,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
gntdev_put_map(priv, map);
}
-static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
+static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
unsigned long addr)
{
struct gntdev_grant_map *map = vma->vm_private_data;
@@ -539,7 +540,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
static const struct vm_operations_struct gntdev_vmops = {
.open = gntdev_vma_open,
.close = gntdev_vma_close,
- .find_special_page = gntdev_vma_find_special_page,
+ .find_normal_page = gntdev_vma_find_normal_page,
};
/* ------------------------------------------------------------------ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ca7d2fa71343..3868ca1a25f9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -657,13 +657,21 @@ struct vm_operations_struct {
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr, pgoff_t *ilx);
#endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
/*
- * Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
- * (using pte_page()) would not find the correct page.
+ * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+ * allows for returning a "normal" page from vm_normal_page() even
+ * though the PTE indicates that the "struct page" either does not exist
+ * or should not be touched: "special".
+ *
+ * Do not add new users: this really only works when a "normal" page
+ * was mapped, but then the PTE got changed to something weird (+
+ * marked special) that would not make pte_pfn() identify the originally
+ * inserted page.
*/
- struct page *(*find_special_page)(struct vm_area_struct *vma,
- unsigned long addr);
+ struct page *(*find_normal_page)(struct vm_area_struct *vma,
+ unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index e443fe8cd6cf2..59a04d0b2e272 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1381,6 +1381,8 @@ config PT_RECLAIM
Note: now only empty user PTE page table pages will be reclaimed.
+config FIND_NORMAL_PAGE
+ def_bool n
source "mm/damon/Kconfig"
diff --git a/mm/memory.c b/mm/memory.c
index 6f806bf3cc994..002c28795d8b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -639,6 +639,12 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* trivial. Secondly, an architecture may not have a spare page table
* entry bit, which requires a more complicated scheme, described below.
*
+ * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on
+ * page table entries that actually map "normal" pages: however, that page
+ * cannot be looked up through the PFN stored in the page table entry, but
+ * instead will be looked up through vm_ops->find_normal_page(). So far, this
+ * only applies to PTEs.
+ *
* A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
* special mapping (even if there are underlying and valid "struct pages").
* COWed pages of a VM_PFNMAP are always normal.
@@ -679,8 +685,10 @@ static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
{
if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
if (unlikely(special)) {
- if (vma->vm_ops && vma->vm_ops->find_special_page)
- return vma->vm_ops->find_special_page(vma, addr);
+#ifdef CONFIG_FIND_NORMAL_PAGE
+ if (vma->vm_ops && vma->vm_ops->find_normal_page)
+ return vma->vm_ops->find_normal_page(vma, addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 3639aa8dd2b06..cb1c2a8afe265 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -467,13 +467,21 @@ struct vm_operations_struct {
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr, pgoff_t *ilx);
#endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
/*
- * Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
- * (using pte_page()) would not find the correct page.
+ * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+ * allows for returning a "normal" page from vm_normal_page() even
+ * though the PTE indicates that the "struct page" either does not exist
+ * or should not be touched: "special".
+ *
+ * Do not add new users: this really only works when a "normal" page
+ * was mapped, but then the PTE got changed to something weird (+
+ * marked special) that would not make pte_pfn() identify the originally
+ * inserted page.
*/
- struct page *(*find_special_page)(struct vm_area_struct *vma,
- unsigned long addr);
+ struct page *(*find_normal_page)(struct vm_area_struct *vma,
+ unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
};
struct vm_unmapped_area_info {
--
2.50.1
* Re: [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd()
2025-08-11 11:26 ` [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-08-12 4:52 ` Lance Yang
0 siblings, 0 replies; 27+ messages in thread
From: Lance Yang @ 2025-08-12 4:52 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
Andrew Morton, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Alistair Popple, Wei Yang,
linux-kernel
On 2025/8/11 19:26, David Hildenbrand wrote:
> Let's clean it all further up.
>
> No functional change intended.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Nice. Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Thanks,
Lance
> ---
> mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
> 1 file changed, 24 insertions(+), 48 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2b4ea5a2ce7d2..5314a89d676f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1379,15 +1379,25 @@ struct folio_or_pfn {
> bool is_folio;
> };
>
> -static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> +static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
> - bool write, pgtable_t pgtable)
> + bool write)
> {
> struct mm_struct *mm = vma->vm_mm;
> + pgtable_t pgtable = NULL;
> + spinlock_t *ptl;
> pmd_t entry;
>
> - lockdep_assert_held(pmd_lockptr(mm, pmd));
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
>
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
> + ptl = pmd_lock(mm, pmd);
> if (!pmd_none(*pmd)) {
> const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
> fop.pfn;
> @@ -1395,15 +1405,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> if (write) {
> if (pmd_pfn(*pmd) != pfn) {
> WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
> - return -EEXIST;
> + goto out_unlock;
> }
> entry = pmd_mkyoung(*pmd);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
> update_mmu_cache_pmd(vma, addr, pmd);
> }
> -
> - return -EEXIST;
> + goto out_unlock;
> }
>
> if (fop.is_folio) {
> @@ -1424,11 +1433,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> if (pgtable) {
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> mm_inc_nr_ptes(mm);
> + pgtable = NULL;
> }
>
> set_pmd_at(mm, addr, pmd, entry);
> update_mmu_cache_pmd(vma, addr, pmd);
> - return 0;
> +
> +out_unlock:
> + spin_unlock(ptl);
> + if (pgtable)
> + pte_free(mm, pgtable);
> + return VM_FAULT_NOPAGE;
> }
>
> /**
> @@ -1450,9 +1465,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
> struct folio_or_pfn fop = {
> .pfn = pfn,
> };
> - pgtable_t pgtable = NULL;
> - spinlock_t *ptl;
> - int error;
>
> /*
> * If we had pmd_special, we could avoid all these restrictions,
> @@ -1464,25 +1476,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
> (VM_PFNMAP|VM_MIXEDMAP));
> BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
>
> - if (addr < vma->vm_start || addr >= vma->vm_end)
> - return VM_FAULT_SIGBUS;
> -
> - if (arch_needs_pgtable_deposit()) {
> - pgtable = pte_alloc_one(vma->vm_mm);
> - if (!pgtable)
> - return VM_FAULT_OOM;
> - }
> -
> pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>
> - ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> - error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
> - pgtable);
> - spin_unlock(ptl);
> - if (error && pgtable)
> - pte_free(vma->vm_mm, pgtable);
> -
> - return VM_FAULT_NOPAGE;
> + return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
> }
> EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>
> @@ -1491,35 +1487,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
> {
> struct vm_area_struct *vma = vmf->vma;
> unsigned long addr = vmf->address & PMD_MASK;
> - struct mm_struct *mm = vma->vm_mm;
> struct folio_or_pfn fop = {
> .folio = folio,
> .is_folio = true,
> };
> - spinlock_t *ptl;
> - pgtable_t pgtable = NULL;
> - int error;
> -
> - if (addr < vma->vm_start || addr >= vma->vm_end)
> - return VM_FAULT_SIGBUS;
>
> if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> return VM_FAULT_SIGBUS;
>
> - if (arch_needs_pgtable_deposit()) {
> - pgtable = pte_alloc_one(vma->vm_mm);
> - if (!pgtable)
> - return VM_FAULT_OOM;
> - }
> -
> - ptl = pmd_lock(mm, vmf->pmd);
> - error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
> - write, pgtable);
> - spin_unlock(ptl);
> - if (error && pgtable)
> - pte_free(mm, pgtable);
> -
> - return VM_FAULT_NOPAGE;
> + return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
> }
> EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
>
* Re: [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special
2025-08-11 11:26 ` [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-08-12 18:14 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 18:14 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 11, 2025 at 01:26:25PM +0200, David Hildenbrand wrote:
> The huge zero folio is refcounted (+mapcounted -- is that a word?)
> differently than "normal" folios, similarly (but different) to the ordinary
> shared zeropage.
>
> For this reason, we special-case these pages in
> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> still use them (e.g., GUP can still take a reference on them).
Hm, interestingly, in gup_fast_pmd_leaf() we explicitly check pmd_special(),
so surely marking the huge zero PMD special will change behaviour there?
But I guess this is actually _more_ correct as it's not really sensible to
grab the huge zero PMD page.
Then again, follow_huge_pmd() _will_, afaict.
I see the GUP fast change was introduced by commit ae3c99e650da ("mm/gup:
detect huge pfnmap entries in gup-fast") so was specifically intended for
pfnmap not the zero page.
>
> vm_normal_page_pmd() already filters out the huge zero folio, to
> indicate it a special (return NULL). However, so far we are not making
> use of pmd_special() on architectures that support it
> (CONFIG_ARCH_HAS_PTE_SPECIAL), like we would with the ordinary shared
> zeropage.
>
> Let's mark PMD mappings of the huge zero folio similarly as special, so we
> can avoid the manual check for the huge zero folio with
> CONFIG_ARCH_HAS_PTE_SPECIAL next, and only perform the check on
> !CONFIG_ARCH_HAS_PTE_SPECIAL.
>
> In copy_huge_pmd(), where we have a manual pmd_special() check to handle
> PFNMAP, we have to manually rule out the huge zero folio. That code
> needs a serious cleanup, but that's something for another day.
>
> While at it, update the doc regarding the shared zero folios.
>
> No functional change intended: vm_normal_page_pmd() still returns NULL
> when it encounters the huge zero folio.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
I R-b'd this before, and Wei did as well; did you drop the tags because of the changes?
Anyway, apart from query about GUP-fast above, this LGTM so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/huge_memory.c | 8 ++++++--
> mm/memory.c | 15 ++++++++++-----
> 2 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index ec89e0607424e..58bac83e7fa31 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1309,6 +1309,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> {
> pmd_t entry;
> entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> + entry = pmd_mkspecial(entry);
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> set_pmd_at(mm, haddr, pmd, entry);
> mm_inc_nr_ptes(mm);
> @@ -1418,7 +1419,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> if (fop.is_folio) {
> entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>
> - if (!is_huge_zero_folio(fop.folio)) {
> + if (is_huge_zero_folio(fop.folio)) {
> + entry = pmd_mkspecial(entry);
> + } else {
> folio_get(fop.folio);
> folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> @@ -1643,7 +1646,8 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> int ret = -ENOMEM;
>
> pmd = pmdp_get_lockless(src_pmd);
> - if (unlikely(pmd_present(pmd) && pmd_special(pmd))) {
> + if (unlikely(pmd_present(pmd) && pmd_special(pmd) &&
> + !is_huge_zero_pmd(pmd))) {
OK yeah, this is new; I see it from the cover letter + range-diff.
Yeah this is important actually wow, as otherwise the is_huge_zero_pmd()
branch will not be executed.
Good spot!
> dst_ptl = pmd_lock(dst_mm, dst_pmd);
> src_ptl = pmd_lockptr(src_mm, src_pmd);
> spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> diff --git a/mm/memory.c b/mm/memory.c
> index 0ba4f6b718471..626caedce35e0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -555,7 +555,14 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> *
> * "Special" mappings do not wish to be associated with a "struct page" (either
> * it doesn't exist, or it exists but they don't want to touch it). In this
> - * case, NULL is returned here. "Normal" mappings do have a struct page.
> + * case, NULL is returned here. "Normal" mappings do have a struct page and
> + * are ordinarily refcounted.
> + *
> + * Page mappings of the shared zero folios are always considered "special", as
> + * they are not ordinarily refcounted: neither the refcount nor the mapcount
> + * of these folios is adjusted when mapping them into user page tables.
> + * Selected page table walkers (such as GUP) can still identify mappings of the
> + * shared zero folios and work with the underlying "struct page".
Thanks for this.
> *
> * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> * pte bit, in which case this function is trivial. Secondly, an architecture
> @@ -585,9 +592,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> *
> * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
> * page" backing, however the difference is that _all_ pages with a struct
> - * page (that is, those where pfn_valid is true) are refcounted and considered
> - * normal pages by the VM. The only exception are zeropages, which are
> - * *never* refcounted.
> + * page (that is, those where pfn_valid is true, except the shared zero
> + * folios) are refcounted and considered normal pages by the VM.
> *
> * The disadvantage is that pages are refcounted (which can be slower and
> * simply not an option for some PFNMAP users). The advantage is that we
> @@ -667,7 +673,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> {
> unsigned long pfn = pmd_pfn(pmd);
>
> - /* Currently it's only used for huge pfnmaps */
> if (unlikely(pmd_special(pmd)))
> return NULL;
>
> --
> 2.50.1
>
* Re: [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-11 11:26 ` [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel" David Hildenbrand
@ 2025-08-12 18:23 ` Lorenzo Stoakes
2025-08-12 18:39 ` Christophe Leroy
2025-08-26 16:28 ` Ritesh Harjani
1 sibling, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 18:23 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 11, 2025 at 01:26:26PM +0200, David Hildenbrand wrote:
> We want to make use of "pgtable_level" for an enum in core-mm. Other
> architectures seem to call "struct pgtable_level" either:
> * "struct pg_level" when not exposed in a header (riscv, arm)
> * "struct ptdump_pg_level" when expose in a header (arm64)
>
> So let's follow what arm64 does.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
This LGTM, but I'm super confused what these are for, they don't seem to be
used anywhere? Maybe I'm missing some macro madness, but it seems like dead
code anyway?
Anyway:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> arch/powerpc/mm/ptdump/8xx.c | 2 +-
> arch/powerpc/mm/ptdump/book3s64.c | 2 +-
> arch/powerpc/mm/ptdump/ptdump.h | 4 ++--
> arch/powerpc/mm/ptdump/shared.c | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
> index b5c79b11ea3c2..4ca9cf7a90c9e 100644
> --- a/arch/powerpc/mm/ptdump/8xx.c
> +++ b/arch/powerpc/mm/ptdump/8xx.c
> @@ -69,7 +69,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> diff --git a/arch/powerpc/mm/ptdump/book3s64.c b/arch/powerpc/mm/ptdump/book3s64.c
> index 5ad92d9dc5d10..6b2da9241d4c4 100644
> --- a/arch/powerpc/mm/ptdump/book3s64.c
> +++ b/arch/powerpc/mm/ptdump/book3s64.c
> @@ -102,7 +102,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> diff --git a/arch/powerpc/mm/ptdump/ptdump.h b/arch/powerpc/mm/ptdump/ptdump.h
> index 154efae96ae09..4232aa4b57eae 100644
> --- a/arch/powerpc/mm/ptdump/ptdump.h
> +++ b/arch/powerpc/mm/ptdump/ptdump.h
> @@ -11,12 +11,12 @@ struct flag_info {
> int shift;
> };
>
> -struct pgtable_level {
> +struct ptdump_pg_level {
> const struct flag_info *flag;
> size_t num;
> u64 mask;
> };
>
> -extern struct pgtable_level pg_level[5];
> +extern struct ptdump_pg_level pg_level[5];
>
> void pt_dump_size(struct seq_file *m, unsigned long delta);
> diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c
> index 39c30c62b7ea7..58998960eb9a4 100644
> --- a/arch/powerpc/mm/ptdump/shared.c
> +++ b/arch/powerpc/mm/ptdump/shared.c
> @@ -67,7 +67,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> --
> 2.50.1
>
* Re: [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level"
2025-08-11 11:26 ` [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level" David Hildenbrand
@ 2025-08-12 18:33 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 18:33 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 11, 2025 at 01:26:27PM +0200, David Hildenbrand wrote:
> Let's factor it out, and convert all checks for unsupported levels to
> BUILD_BUG(). The code is written in a way such that force-inlining will
> optimize out the levels.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Nice cleanup! This LGTM, so:
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/pgtable.h | 8 ++++++
> include/linux/rmap.h | 60 +++++++++++++++++++----------------------
> mm/rmap.c | 56 +++++++++++++++++++++-----------------
> 3 files changed, 66 insertions(+), 58 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 4c035637eeb77..bff5c4241bf2e 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1958,6 +1958,14 @@ static inline bool arch_has_pfn_modify_check(void)
> /* Page-Table Modification Mask */
> typedef unsigned int pgtbl_mod_mask;
>
> +enum pgtable_level {
> + PGTABLE_LEVEL_PTE = 0,
> + PGTABLE_LEVEL_PMD,
> + PGTABLE_LEVEL_PUD,
> + PGTABLE_LEVEL_P4D,
> + PGTABLE_LEVEL_PGD,
> +};
> +
> #endif /* !__ASSEMBLY__ */
>
> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 6cd020eea37a2..9d40d127bdb78 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -394,18 +394,8 @@ typedef int __bitwise rmap_t;
> /* The anonymous (sub)page is exclusive to a single process. */
> #define RMAP_EXCLUSIVE ((__force rmap_t)BIT(0))
>
> -/*
> - * Internally, we're using an enum to specify the granularity. We make the
> - * compiler emit specialized code for each granularity.
> - */
> -enum rmap_level {
> - RMAP_LEVEL_PTE = 0,
> - RMAP_LEVEL_PMD,
> - RMAP_LEVEL_PUD,
> -};
> -
> static inline void __folio_rmap_sanity_checks(const struct folio *folio,
> - const struct page *page, int nr_pages, enum rmap_level level)
> + const struct page *page, int nr_pages, enum pgtable_level level)
> {
> /* hugetlb folios are handled separately. */
> VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
> @@ -427,18 +417,18 @@ static inline void __folio_rmap_sanity_checks(const struct folio *folio,
> VM_WARN_ON_FOLIO(page_folio(page + nr_pages - 1) != folio, folio);
>
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> break;
> - case RMAP_LEVEL_PMD:
> + case PGTABLE_LEVEL_PMD:
> /*
> * We don't support folios larger than a single PMD yet. So
> - * when RMAP_LEVEL_PMD is set, we assume that we are creating
> + * when PGTABLE_LEVEL_PMD is set, we assume that we are creating
> * a single "entire" mapping of the folio.
> */
> VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
> VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
> break;
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PUD:
> /*
> * Assume that we are creating a single "entire" mapping of the
> * folio.
> @@ -447,7 +437,7 @@ static inline void __folio_rmap_sanity_checks(const struct folio *folio,
> VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
> break;
> default:
> - VM_WARN_ON_ONCE(true);
> + BUILD_BUG();
> }
>
> /*
> @@ -567,14 +557,14 @@ static inline void hugetlb_remove_rmap(struct folio *folio)
>
> static __always_inline void __folio_dup_file_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
> - enum rmap_level level)
> + enum pgtable_level level)
> {
> const int orig_nr_pages = nr_pages;
>
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> if (!folio_test_large(folio)) {
> atomic_inc(&folio->_mapcount);
> break;
> @@ -587,11 +577,13 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
> }
> folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
> break;
> - case RMAP_LEVEL_PMD:
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PMD:
> + case PGTABLE_LEVEL_PUD:
> atomic_inc(&folio->_entire_mapcount);
> folio_inc_large_mapcount(folio, dst_vma);
> break;
> + default:
> + BUILD_BUG();
> }
> }
>
> @@ -609,13 +601,13 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
> static inline void folio_dup_file_rmap_ptes(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *dst_vma)
> {
> - __folio_dup_file_rmap(folio, page, nr_pages, dst_vma, RMAP_LEVEL_PTE);
> + __folio_dup_file_rmap(folio, page, nr_pages, dst_vma, PGTABLE_LEVEL_PTE);
> }
>
> static __always_inline void folio_dup_file_rmap_pte(struct folio *folio,
> struct page *page, struct vm_area_struct *dst_vma)
> {
> - __folio_dup_file_rmap(folio, page, 1, dst_vma, RMAP_LEVEL_PTE);
> + __folio_dup_file_rmap(folio, page, 1, dst_vma, PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -632,7 +624,7 @@ static inline void folio_dup_file_rmap_pmd(struct folio *folio,
> struct page *page, struct vm_area_struct *dst_vma)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - __folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, RMAP_LEVEL_PTE);
> + __folio_dup_file_rmap(folio, page, HPAGE_PMD_NR, dst_vma, PGTABLE_LEVEL_PTE);
> #else
> WARN_ON_ONCE(true);
> #endif
> @@ -640,7 +632,7 @@ static inline void folio_dup_file_rmap_pmd(struct folio *folio,
>
> static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *dst_vma,
> - struct vm_area_struct *src_vma, enum rmap_level level)
> + struct vm_area_struct *src_vma, enum pgtable_level level)
> {
> const int orig_nr_pages = nr_pages;
> bool maybe_pinned;
> @@ -665,7 +657,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
> * copying if the folio maybe pinned.
> */
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> if (unlikely(maybe_pinned)) {
> for (i = 0; i < nr_pages; i++)
> if (PageAnonExclusive(page + i))
> @@ -687,8 +679,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
> } while (page++, --nr_pages > 0);
> folio_add_large_mapcount(folio, orig_nr_pages, dst_vma);
> break;
> - case RMAP_LEVEL_PMD:
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PMD:
> + case PGTABLE_LEVEL_PUD:
> if (PageAnonExclusive(page)) {
> if (unlikely(maybe_pinned))
> return -EBUSY;
> @@ -697,6 +689,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
> atomic_inc(&folio->_entire_mapcount);
> folio_inc_large_mapcount(folio, dst_vma);
> break;
> + default:
> + BUILD_BUG();
> }
> return 0;
> }
> @@ -730,7 +724,7 @@ static inline int folio_try_dup_anon_rmap_ptes(struct folio *folio,
> struct vm_area_struct *src_vma)
> {
> return __folio_try_dup_anon_rmap(folio, page, nr_pages, dst_vma,
> - src_vma, RMAP_LEVEL_PTE);
> + src_vma, PGTABLE_LEVEL_PTE);
> }
>
> static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
> @@ -738,7 +732,7 @@ static __always_inline int folio_try_dup_anon_rmap_pte(struct folio *folio,
> struct vm_area_struct *src_vma)
> {
> return __folio_try_dup_anon_rmap(folio, page, 1, dst_vma, src_vma,
> - RMAP_LEVEL_PTE);
> + PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -770,7 +764,7 @@ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> return __folio_try_dup_anon_rmap(folio, page, HPAGE_PMD_NR, dst_vma,
> - src_vma, RMAP_LEVEL_PMD);
> + src_vma, PGTABLE_LEVEL_PMD);
> #else
> WARN_ON_ONCE(true);
> return -EBUSY;
> @@ -778,7 +772,7 @@ static inline int folio_try_dup_anon_rmap_pmd(struct folio *folio,
> }
>
> static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
> - struct page *page, int nr_pages, enum rmap_level level)
> + struct page *page, int nr_pages, enum pgtable_level level)
> {
> VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> VM_WARN_ON_FOLIO(!PageAnonExclusive(page), folio);
> @@ -873,7 +867,7 @@ static __always_inline int __folio_try_share_anon_rmap(struct folio *folio,
> static inline int folio_try_share_anon_rmap_pte(struct folio *folio,
> struct page *page)
> {
> - return __folio_try_share_anon_rmap(folio, page, 1, RMAP_LEVEL_PTE);
> + return __folio_try_share_anon_rmap(folio, page, 1, PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -904,7 +898,7 @@ static inline int folio_try_share_anon_rmap_pmd(struct folio *folio,
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> return __folio_try_share_anon_rmap(folio, page, HPAGE_PMD_NR,
> - RMAP_LEVEL_PMD);
> + PGTABLE_LEVEL_PMD);
> #else
> WARN_ON_ONCE(true);
> return -EBUSY;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 84a8d8b02ef77..0e9c4041f8687 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1265,7 +1265,7 @@ static void __folio_mod_stat(struct folio *folio, int nr, int nr_pmdmapped)
>
> static __always_inline void __folio_add_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *vma,
> - enum rmap_level level)
> + enum pgtable_level level)
> {
> atomic_t *mapped = &folio->_nr_pages_mapped;
> const int orig_nr_pages = nr_pages;
> @@ -1274,7 +1274,7 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> if (!folio_test_large(folio)) {
> nr = atomic_inc_and_test(&folio->_mapcount);
> break;
> @@ -1300,11 +1300,11 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
>
> folio_add_large_mapcount(folio, orig_nr_pages, vma);
> break;
> - case RMAP_LEVEL_PMD:
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PMD:
> + case PGTABLE_LEVEL_PUD:
> first = atomic_inc_and_test(&folio->_entire_mapcount);
> if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
> - if (level == RMAP_LEVEL_PMD && first)
> + if (level == PGTABLE_LEVEL_PMD && first)
> nr_pmdmapped = folio_large_nr_pages(folio);
> nr = folio_inc_return_large_mapcount(folio, vma);
> if (nr == 1)
> @@ -1323,7 +1323,7 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
> * We only track PMD mappings of PMD-sized
> * folios separately.
> */
> - if (level == RMAP_LEVEL_PMD)
> + if (level == PGTABLE_LEVEL_PMD)
> nr_pmdmapped = nr_pages;
> nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
> /* Raced ahead of a remove and another add? */
> @@ -1336,6 +1336,8 @@ static __always_inline void __folio_add_rmap(struct folio *folio,
> }
> folio_inc_large_mapcount(folio, vma);
> break;
> + default:
> + BUILD_BUG();
> }
> __folio_mod_stat(folio, nr, nr_pmdmapped);
> }
> @@ -1427,7 +1429,7 @@ static void __page_check_anon_rmap(const struct folio *folio,
>
> static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *vma,
> - unsigned long address, rmap_t flags, enum rmap_level level)
> + unsigned long address, rmap_t flags, enum pgtable_level level)
> {
> int i;
>
> @@ -1440,20 +1442,22 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>
> if (flags & RMAP_EXCLUSIVE) {
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> for (i = 0; i < nr_pages; i++)
> SetPageAnonExclusive(page + i);
> break;
> - case RMAP_LEVEL_PMD:
> + case PGTABLE_LEVEL_PMD:
> SetPageAnonExclusive(page);
> break;
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PUD:
> /*
> * Keep the compiler happy, we don't support anonymous
> * PUD mappings.
> */
> WARN_ON_ONCE(1);
> break;
> + default:
> + BUILD_BUG();
> }
> }
>
> @@ -1507,7 +1511,7 @@ void folio_add_anon_rmap_ptes(struct folio *folio, struct page *page,
> rmap_t flags)
> {
> __folio_add_anon_rmap(folio, page, nr_pages, vma, address, flags,
> - RMAP_LEVEL_PTE);
> + PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -1528,7 +1532,7 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> __folio_add_anon_rmap(folio, page, HPAGE_PMD_NR, vma, address, flags,
> - RMAP_LEVEL_PMD);
> + PGTABLE_LEVEL_PMD);
> #else
> WARN_ON_ONCE(true);
> #endif
> @@ -1609,7 +1613,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>
> static __always_inline void __folio_add_file_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *vma,
> - enum rmap_level level)
> + enum pgtable_level level)
> {
> VM_WARN_ON_FOLIO(folio_test_anon(folio), folio);
>
> @@ -1634,7 +1638,7 @@ static __always_inline void __folio_add_file_rmap(struct folio *folio,
> void folio_add_file_rmap_ptes(struct folio *folio, struct page *page,
> int nr_pages, struct vm_area_struct *vma)
> {
> - __folio_add_file_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
> + __folio_add_file_rmap(folio, page, nr_pages, vma, PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -1651,7 +1655,7 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
> struct vm_area_struct *vma)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - __folio_add_file_rmap(folio, page, HPAGE_PMD_NR, vma, RMAP_LEVEL_PMD);
> + __folio_add_file_rmap(folio, page, HPAGE_PMD_NR, vma, PGTABLE_LEVEL_PMD);
> #else
> WARN_ON_ONCE(true);
> #endif
> @@ -1672,7 +1676,7 @@ void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
> {
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> - __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
> + __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, PGTABLE_LEVEL_PUD);
> #else
> WARN_ON_ONCE(true);
> #endif
> @@ -1680,7 +1684,7 @@ void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
>
> static __always_inline void __folio_remove_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *vma,
> - enum rmap_level level)
> + enum pgtable_level level)
> {
> atomic_t *mapped = &folio->_nr_pages_mapped;
> int last = 0, nr = 0, nr_pmdmapped = 0;
> @@ -1689,7 +1693,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> __folio_rmap_sanity_checks(folio, page, nr_pages, level);
>
> switch (level) {
> - case RMAP_LEVEL_PTE:
> + case PGTABLE_LEVEL_PTE:
> if (!folio_test_large(folio)) {
> nr = atomic_add_negative(-1, &folio->_mapcount);
> break;
> @@ -1719,11 +1723,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>
> partially_mapped = nr && atomic_read(mapped);
> break;
> - case RMAP_LEVEL_PMD:
> - case RMAP_LEVEL_PUD:
> + case PGTABLE_LEVEL_PMD:
> + case PGTABLE_LEVEL_PUD:
> if (IS_ENABLED(CONFIG_NO_PAGE_MAPCOUNT)) {
> last = atomic_add_negative(-1, &folio->_entire_mapcount);
> - if (level == RMAP_LEVEL_PMD && last)
> + if (level == PGTABLE_LEVEL_PMD && last)
> nr_pmdmapped = folio_large_nr_pages(folio);
> nr = folio_dec_return_large_mapcount(folio, vma);
> if (!nr) {
> @@ -1743,7 +1747,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
> if (likely(nr < ENTIRELY_MAPPED)) {
> nr_pages = folio_large_nr_pages(folio);
> - if (level == RMAP_LEVEL_PMD)
> + if (level == PGTABLE_LEVEL_PMD)
> nr_pmdmapped = nr_pages;
> nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
> /* Raced ahead of another remove and an add? */
> @@ -1757,6 +1761,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>
> partially_mapped = nr && nr < nr_pmdmapped;
> break;
> + default:
> + BUILD_BUG();
> }
>
> /*
> @@ -1796,7 +1802,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> void folio_remove_rmap_ptes(struct folio *folio, struct page *page,
> int nr_pages, struct vm_area_struct *vma)
> {
> - __folio_remove_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE);
> + __folio_remove_rmap(folio, page, nr_pages, vma, PGTABLE_LEVEL_PTE);
> }
>
> /**
> @@ -1813,7 +1819,7 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
> struct vm_area_struct *vma)
> {
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - __folio_remove_rmap(folio, page, HPAGE_PMD_NR, vma, RMAP_LEVEL_PMD);
> + __folio_remove_rmap(folio, page, HPAGE_PMD_NR, vma, PGTABLE_LEVEL_PMD);
> #else
> WARN_ON_ONCE(true);
> #endif
> @@ -1834,7 +1840,7 @@ void folio_remove_rmap_pud(struct folio *folio, struct page *page,
> {
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> - __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
> + __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, PGTABLE_LEVEL_PUD);
> #else
> WARN_ON_ONCE(true);
> #endif
> --
> 2.50.1
>
* Re: [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-12 18:23 ` Lorenzo Stoakes
@ 2025-08-12 18:39 ` Christophe Leroy
2025-08-12 18:54 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Christophe Leroy @ 2025-08-12 18:39 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
Hi Lorenzo,
On 12/08/2025 at 20:23, Lorenzo Stoakes wrote:
> On Mon, Aug 11, 2025 at 01:26:26PM +0200, David Hildenbrand wrote:
>> We want to make use of "pgtable_level" for an enum in core-mm. Other
>> architectures seem to call "struct pgtable_level" either:
>> * "struct pg_level" when not exposed in a header (riscv, arm)
>> * "struct ptdump_pg_level" when exposed in a header (arm64)
>>
>> So let's follow what arm64 does.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> This LGTM, but I'm super confused what these are for, they don't seem to be
> used anywhere? Maybe I'm missing some macro madness, but it seems like dead
> code anyway?
pg_level[] are used several times in arch/powerpc/mm/ptdump/ptdump.c,
for instance here:
static void note_page_update_state(struct pg_state *st, unsigned long addr,
                                    int level, u64 val)
{
        u64 flag = level >= 0 ? val & pg_level[level].mask : 0;
        u64 pa = val & PTE_RPN_MASK;

        st->level = level;
        st->current_flags = flag;
        st->start_address = addr;
        st->start_pa = pa;

        while (addr >= st->marker[1].start_address) {
                st->marker++;
                pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
        }
}
>
> Anyway:
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
>> ---
>> arch/powerpc/mm/ptdump/8xx.c | 2 +-
>> arch/powerpc/mm/ptdump/book3s64.c | 2 +-
>> arch/powerpc/mm/ptdump/ptdump.h | 4 ++--
>> arch/powerpc/mm/ptdump/shared.c | 2 +-
>> 4 files changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
>> index b5c79b11ea3c2..4ca9cf7a90c9e 100644
>> --- a/arch/powerpc/mm/ptdump/8xx.c
>> +++ b/arch/powerpc/mm/ptdump/8xx.c
>> @@ -69,7 +69,7 @@ static const struct flag_info flag_array[] = {
>> }
>> };
>>
>> -struct pgtable_level pg_level[5] = {
>> +struct ptdump_pg_level pg_level[5] = {
>> { /* pgd */
>> .flag = flag_array,
>> .num = ARRAY_SIZE(flag_array),
>> diff --git a/arch/powerpc/mm/ptdump/book3s64.c b/arch/powerpc/mm/ptdump/book3s64.c
>> index 5ad92d9dc5d10..6b2da9241d4c4 100644
>> --- a/arch/powerpc/mm/ptdump/book3s64.c
>> +++ b/arch/powerpc/mm/ptdump/book3s64.c
>> @@ -102,7 +102,7 @@ static const struct flag_info flag_array[] = {
>> }
>> };
>>
>> -struct pgtable_level pg_level[5] = {
>> +struct ptdump_pg_level pg_level[5] = {
>> { /* pgd */
>> .flag = flag_array,
>> .num = ARRAY_SIZE(flag_array),
>> diff --git a/arch/powerpc/mm/ptdump/ptdump.h b/arch/powerpc/mm/ptdump/ptdump.h
>> index 154efae96ae09..4232aa4b57eae 100644
>> --- a/arch/powerpc/mm/ptdump/ptdump.h
>> +++ b/arch/powerpc/mm/ptdump/ptdump.h
>> @@ -11,12 +11,12 @@ struct flag_info {
>> int shift;
>> };
>>
>> -struct pgtable_level {
>> +struct ptdump_pg_level {
>> const struct flag_info *flag;
>> size_t num;
>> u64 mask;
>> };
>>
>> -extern struct pgtable_level pg_level[5];
>> +extern struct ptdump_pg_level pg_level[5];
>>
>> void pt_dump_size(struct seq_file *m, unsigned long delta);
>> diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c
>> index 39c30c62b7ea7..58998960eb9a4 100644
>> --- a/arch/powerpc/mm/ptdump/shared.c
>> +++ b/arch/powerpc/mm/ptdump/shared.c
>> @@ -67,7 +67,7 @@ static const struct flag_info flag_array[] = {
>> }
>> };
>>
>> -struct pgtable_level pg_level[5] = {
>> +struct ptdump_pg_level pg_level[5] = {
>> { /* pgd */
>> .flag = flag_array,
>> .num = ARRAY_SIZE(flag_array),
>> --
>> 2.50.1
>>
* Re: [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-08-11 11:26 ` [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
@ 2025-08-12 18:48 ` Lorenzo Stoakes
2025-08-25 12:31 ` David Hildenbrand
1 sibling, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 18:48 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 11, 2025 at 01:26:28PM +0200, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed some light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function to print_bad_page_map().
> Provide print_bad_pte() as a simple wrapper.
>
> Document the implicit locking requirements for the page table re-walk.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the printing of page
> table content into __print_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
> [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that we are dumping all levels
> even when levels are folded for simplicity.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
This LGTM, great explanations and thanks for the page table level stuff!
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/pgtable.h | 19 ++++++++
> mm/memory.c | 104 ++++++++++++++++++++++++++++++++--------
> 2 files changed, 103 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index bff5c4241bf2e..33c84b38b7ec6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1966,6 +1966,25 @@ enum pgtable_level {
> PGTABLE_LEVEL_PGD,
> };
>
> +static inline const char *pgtable_level_to_str(enum pgtable_level level)
> +{
> + switch (level) {
> + case PGTABLE_LEVEL_PTE:
> + return "pte";
> + case PGTABLE_LEVEL_PMD:
> + return "pmd";
> + case PGTABLE_LEVEL_PUD:
> + return "pud";
> + case PGTABLE_LEVEL_P4D:
> + return "p4d";
> + case PGTABLE_LEVEL_PGD:
> + return "pgd";
> + default:
> + VM_WARN_ON_ONCE(1);
> + return "unknown";
> + }
> +}
> +
> #endif /* !__ASSEMBLY__ */
>
> #if !defined(MAX_POSSIBLE_PHYSMEM_BITS) && !defined(CONFIG_64BIT)
> diff --git a/mm/memory.c b/mm/memory.c
> index 626caedce35e0..dc0107354d37b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -491,22 +491,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> add_mm_counter(mm, i, rss[i]);
> }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
> {
> - pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> - p4d_t *p4d = p4d_offset(pgd, addr);
> - pud_t *pud = pud_offset(p4d, addr);
> - pmd_t *pmd = pmd_offset(pud, addr);
> - struct address_space *mapping;
> - pgoff_t index;
> static unsigned long resume;
> static unsigned long nr_shown;
> static unsigned long nr_unshown;
> @@ -518,7 +504,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> if (nr_shown == 60) {
> if (time_before(jiffies, resume)) {
> nr_unshown++;
> - return;
> + return true;
> }
> if (nr_unshown) {
> pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -529,15 +515,91 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> }
> if (nr_shown++ == 0)
> resume = jiffies + 60 * HZ;
> + return false;
> +}
> +
> +static void __print_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> + unsigned long long pgdv, p4dv, pudv, pmdv;
> + p4d_t p4d, *p4dp;
> + pud_t pud, *pudp;
> + pmd_t pmd, *pmdp;
> + pgd_t *pgdp;
> +
> + /*
> + * Although this looks like a fully lockless pgtable walk, it is not:
> + * see locking requirements for print_bad_page_map().
> + */
Thanks
> + pgdp = pgd_offset(mm, addr);
> + pgdv = pgd_val(*pgdp);
> +
> + if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
> + pr_alert("pgd:%08llx\n", pgdv);
> + return;
> + }
> +
> + p4dp = p4d_offset(pgdp, addr);
> + p4d = p4dp_get(p4dp);
> + p4dv = p4d_val(p4d);
> +
> + if (!p4d_present(p4d) || p4d_leaf(p4d)) {
> + pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
> + return;
> + }
> +
> + pudp = pud_offset(p4dp, addr);
> + pud = pudp_get(pudp);
> + pudv = pud_val(pud);
> +
> + if (!pud_present(pud) || pud_leaf(pud)) {
> + pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
> + return;
> + }
> +
> + pmdp = pmd_offset(pudp, addr);
> + pmd = pmdp_get(pmdp);
> + pmdv = pmd_val(pmd);
> +
> + /*
> + * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
Sigh, 32-bit.
> + * because the table should already be mapped by the caller and
> + * doing another map would be bad. print_bad_page_map() should
> + * already take care of printing the PTE.
> + */
> + pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
> + p4dv, pudv, pmdv);
> +}
> +
> +/*
> + * This function is called to print an error when a bad page table entry (e.g.,
> + * corrupted page table entry) is found. For example, we might have a
> + * PFN-mapped pte in a region that doesn't allow it.
> + *
> + * The calling function must still handle the error.
> + *
> + * This function must be called during a proper page table walk, as it will
> + * re-walk the page table to dump information: the caller MUST prevent page
> + * table teardown (by holding mmap, vma or rmap lock) and MUST hold the leaf
> + * page table lock.
> + */
Thanks this is good!
> +static void print_bad_page_map(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long long entry, struct page *page,
> + enum pgtable_level level)
> +{
> + struct address_space *mapping;
> + pgoff_t index;
> +
> + if (is_bad_page_map_ratelimited())
> + return;
>
> mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> index = linear_page_index(vma, addr);
>
> - pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
> - current->comm,
> - (long long)pte_val(pte), (long long)pmd_val(*pmd));
> + pr_alert("BUG: Bad page map in process %s %s:%08llx", current->comm,
> + pgtable_level_to_str(level), entry);
> + __print_bad_page_map_pgtable(vma->vm_mm, addr);
> if (page)
> - dump_page(page, "bad pte");
> + dump_page(page, "bad page map");
> pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> @@ -549,6 +611,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> dump_stack();
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
> }
> +#define print_bad_pte(vma, addr, pte, page) \
> + print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
This is a nice abstraction.
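E.g. an existing call site keeps using the old name unchanged (a minimal
sketch along the lines of the highest_memmap_pfn check in vm_normal_page(),
not a hunk from this patch):

        if (unlikely(pfn > highest_memmap_pfn)) {
                print_bad_pte(vma, addr, pte, NULL);  /* PGTABLE_LEVEL_PTE under the hood */
                return NULL;
        }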
>
> /*
> * vm_normal_page -- This function gets the "struct page" associated with a pte.
> --
> 2.50.1
>
* Re: [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-12 18:39 ` Christophe Leroy
@ 2025-08-12 18:54 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 18:54 UTC (permalink / raw)
To: Christophe Leroy
Cc: David Hildenbrand, linux-kernel, linux-mm, xen-devel,
linux-fsdevel, nvdimm, linuxppc-dev, Andrew Morton,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Tue, Aug 12, 2025 at 08:39:36PM +0200, Christophe Leroy wrote:
> Hi Lorenzo,
>
> On 12/08/2025 at 20:23, Lorenzo Stoakes wrote:
> > On Mon, Aug 11, 2025 at 01:26:26PM +0200, David Hildenbrand wrote:
> > > We want to make use of "pgtable_level" for an enum in core-mm. Other
> > > architectures seem to call "struct pgtable_level" either:
> > > * "struct pg_level" when not exposed in a header (riscv, arm)
> > > * "struct ptdump_pg_level" when exposed in a header (arm64)
> > >
> > > So let's follow what arm64 does.
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > This LGTM, but I'm super confused what these are for, they don't seem to be
> > used anywhere? Maybe I'm missing some macro madness, but it seems like dead
> > code anyway?
>
> pg_level[] are used several times in arch/powerpc/mm/ptdump/ptdump.c, for
> instance here:
>
> static void note_page_update_state(struct pg_state *st, unsigned long addr,
>                                     int level, u64 val)
> {
>         u64 flag = level >= 0 ? val & pg_level[level].mask : 0;
>         u64 pa = val & PTE_RPN_MASK;
>
>         st->level = level;
>         st->current_flags = flag;
>         st->start_address = addr;
>         st->start_pa = pa;
>
>         while (addr >= st->marker[1].start_address) {
>                 st->marker++;
>                 pt_dump_seq_printf(st->seq, "---[ %s ]---\n", st->marker->name);
>         }
> }
>
Ahhhh ok, so you're _always_ referencing a field of the global value, which
means the _type_ itself is never named at the use sites - only fields of the
global are.
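I.e. roughly (a sketch pieced together from the hunks in this patch plus the
function you pasted, nothing new):

        extern struct ptdump_pg_level pg_level[5];   /* only place the tag is spelled out */
        ...
        u64 flag = val & pg_level[level].mask;       /* use sites never name the type */

so only the declaration/definition sites need touching for the rename.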
Thanks, that clears that up! :)
* Re: [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*()
2025-08-11 11:26 ` [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-08-12 19:06 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 19:06 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 11, 2025 at 01:26:29PM +0200, David Hildenbrand wrote:
> Let's reduce the code duplication and factor out the non-pte/pmd related
> magic into __vm_normal_page().
>
> To keep it simpler, check the pfn against both zero folios, which
> shouldn't really make a difference.
>
> It's a good question whether we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
> scenario in the PMD case in practice: but it doesn't really matter, as
> it's now all unified in __vm_normal_page().
>
> Add kerneldoc for all involved functions.
>
> Note that, as a side product, we now:
> * Support the find_special_page special thingy also for PMD
> * Don't check for is_huge_zero_pfn() anymore if we have
> CONFIG_ARCH_HAS_PTE_SPECIAL and the PMD is not special. The
> VM_WARN_ON_ONCE would catch any abuse
>
> No functional change intended.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Fantastic cleanup, thanks for refactoring with levels, this looks great!
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> mm/memory.c | 186 ++++++++++++++++++++++++++++++----------------------
> 1 file changed, 109 insertions(+), 77 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index dc0107354d37b..78af3f243cee7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -614,8 +614,14 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> #define print_bad_pte(vma, addr, pte, page) \
> print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
>
> -/*
> - * vm_normal_page -- This function gets the "struct page" associated with a pte.
> +/**
> + * __vm_normal_page() - Get the "struct page" associated with a page table entry.
> + * @vma: The VMA mapping the page table entry.
> + * @addr: The address where the page table entry is mapped.
> + * @pfn: The PFN stored in the page table entry.
> + * @special: Whether the page table entry is marked "special".
> + * @level: The page table level for error reporting purposes only.
> + * @entry: The page table entry value for error reporting purposes only.
> *
> * "Special" mappings do not wish to be associated with a "struct page" (either
> * it doesn't exist, or it exists but they don't want to touch it). In this
> @@ -628,10 +634,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> * Selected page table walkers (such as GUP) can still identify mappings of the
> * shared zero folios and work with the underlying "struct page".
> *
> - * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> - * pte bit, in which case this function is trivial. Secondly, an architecture
> - * may not have a spare pte bit, which requires a more complicated scheme,
> - * described below.
> + * There are 2 broad cases. Firstly, an architecture may define a "special"
> + * page table entry bit, such as pte_special(), in which case this function is
> + * trivial. Secondly, an architecture may not have a spare page table
> + * entry bit, which requires a more complicated scheme, described below.
OK cool, nice to have this here in one place!
> *
> * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
> * special mapping (even if there are underlying and valid "struct pages").
> @@ -664,63 +670,94 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> * don't have to follow the strict linearity rule of PFNMAP mappings in
> * order to support COWable mappings.
> *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> */
> -struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte)
> +static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
> + unsigned long addr, unsigned long pfn, bool special,
> + unsigned long long entry, enum pgtable_level level)
> {
> - unsigned long pfn = pte_pfn(pte);
> -
> if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> - if (likely(!pte_special(pte)))
> - goto check_pfn;
> - if (vma->vm_ops && vma->vm_ops->find_special_page)
> - return vma->vm_ops->find_special_page(vma, addr);
> - if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> - return NULL;
> - if (is_zero_pfn(pfn))
> - return NULL;
> -
> - print_bad_pte(vma, addr, pte, NULL);
> - return NULL;
> - }
> -
> - /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
> -
> - if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> - if (vma->vm_flags & VM_MIXEDMAP) {
> - if (!pfn_valid(pfn))
> + if (unlikely(special)) {
> + if (vma->vm_ops && vma->vm_ops->find_special_page)
> + return vma->vm_ops->find_special_page(vma, addr);
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> return NULL;
> - if (is_zero_pfn(pfn))
> - return NULL;
> - goto out;
> - } else {
> - unsigned long off;
> - off = (addr - vma->vm_start) >> PAGE_SHIFT;
> - if (pfn == vma->vm_pgoff + off)
> - return NULL;
> - if (!is_cow_mapping(vma->vm_flags))
> + if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
Yeah this works fine.
> return NULL;
> +
> + print_bad_page_map(vma, addr, entry, NULL, level);
OK nice, this is where the print_bad_page_map() with level comes in handy.
> + return NULL;
> }
> - }
> + /*
> + * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table
> + * mappings (incl. shared zero folios) are marked accordingly.
> + */
> + } else {
> + if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
> + if (vma->vm_flags & VM_MIXEDMAP) {
> + /* If it has a "struct page", it's "normal". */
> + if (!pfn_valid(pfn))
> + return NULL;
> + } else {
> + unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
>
> - if (is_zero_pfn(pfn))
> - return NULL;
> + /* Only CoW'ed anon folios are "normal". */
> + if (pfn == vma->vm_pgoff + off)
> + return NULL;
> + if (!is_cow_mapping(vma->vm_flags))
> + return NULL;
> + }
> + }
> +
> + if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
Yeah this is fine too! This is all working out rather neatly! :)
> + return NULL;
> + }
>
> -check_pfn:
> if (unlikely(pfn > highest_memmap_pfn)) {
> - print_bad_pte(vma, addr, pte, NULL);
> + /* Corrupted page table entry. */
> + print_bad_page_map(vma, addr, entry, NULL, level);
> return NULL;
> }
> -
> /*
> * NOTE! We still have PageReserved() pages in the page tables.
> - * eg. VDSO mappings can cause them to exist.
> + * For example, VDSO mappings can cause them to exist.
> */
> -out:
> - VM_WARN_ON_ONCE(is_zero_pfn(pfn));
> + VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
And ACK on this as well.
> return pfn_to_page(pfn);
> }
>
> +/**
> + * vm_normal_page() - Get the "struct page" associated with a PTE
> + * @vma: The VMA mapping the @pte.
> + * @addr: The address where the @pte is mapped.
> + * @pte: The PTE.
> + *
> + * Get the "struct page" associated with a PTE. See __vm_normal_page()
> + * for details on "normal" and "special" mappings.
Lovely.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> + */
> +struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> + pte_t pte)
> +{
> + return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte),
> + pte_val(pte), PGTABLE_LEVEL_PTE);
Nice and neat!
> +}
> +
> +/**
> + * vm_normal_folio() - Get the "struct folio" associated with a PTE
> + * @vma: The VMA mapping the @pte.
> + * @addr: The address where the @pte is mapped.
> + * @pte: The PTE.
> + *
> + * Get the "struct folio" associated with a PTE. See __vm_normal_page()
> + * for details on "normal" and "special" mappings.
> + *
> + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> + */
Great, thanks for adding this! I especially like '*special*' :P
> struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
> pte_t pte)
> {
> @@ -732,42 +769,37 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
> }
>
> #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> +/**
> + * vm_normal_page_pmd() - Get the "struct page" associated with a PMD
> + * @vma: The VMA mapping the @pmd.
> + * @addr: The address where the @pmd is mapped.
> + * @pmd: The PMD.
> + *
> + * Get the "struct page" associated with a PMD. See __vm_normal_page()
> + * for details on "normal" and "special" mappings.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> + */
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t pmd)
> {
> - unsigned long pfn = pmd_pfn(pmd);
> -
> - if (unlikely(pmd_special(pmd)))
> - return NULL;
> -
> - if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> - if (vma->vm_flags & VM_MIXEDMAP) {
> - if (!pfn_valid(pfn))
> - return NULL;
> - goto out;
> - } else {
> - unsigned long off;
> - off = (addr - vma->vm_start) >> PAGE_SHIFT;
> - if (pfn == vma->vm_pgoff + off)
> - return NULL;
> - if (!is_cow_mapping(vma->vm_flags))
> - return NULL;
> - }
> - }
> -
> - if (is_huge_zero_pfn(pfn))
> - return NULL;
> - if (unlikely(pfn > highest_memmap_pfn))
> - return NULL;
> -
> - /*
> - * NOTE! We still have PageReserved() pages in the page tables.
> - * eg. VDSO mappings can cause them to exist.
> - */
> -out:
> - return pfn_to_page(pfn);
> + return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
> + pmd_val(pmd), PGTABLE_LEVEL_PMD);
So much red... so much delight! :) this is great!
> }
>
> +/**
> + * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD
> + * @vma: The VMA mapping the @pmd.
> + * @addr: The address where the @pmd is mapped.
> + * @pmd: The PMD.
> + *
> + * Get the "struct folio" associated with a PMD. See __vm_normal_page()
> + * for details on "normal" and "special" mappings.
> + *
> + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> + */
This is great also!
> struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> unsigned long addr, pmd_t pmd)
> {
> --
> 2.50.1
>
* Re: [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud()
2025-08-11 11:26 ` [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-08-12 19:38 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 19:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang, Wei Yang
On Mon, Aug 11, 2025 at 01:26:30PM +0200, David Hildenbrand wrote:
> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> because of our new common helpers and there not being a PUD-sized zero
> folio.
>
> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> structuring the code like the other (pmd/pte) cases. Defer
> introducing vm_normal_folio_pud() until really used.
>
> Note that we can so far get PUDs with hugetlb, daxfs and PFNMAP entries.
I guess hugetlb will be handled in a separate way, daxfs will be... special, I
think? and PFNMAP definitely is.
>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Anyway this is nice, thanks! Nice to resolve the todo :)
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
> include/linux/mm.h | 2 ++
> mm/memory.c | 19 +++++++++++++++++++
> mm/pagewalk.c | 20 ++++++++++----------
> 3 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index b626d1bacef52..8ca7d2fa71343 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2360,6 +2360,8 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> unsigned long addr, pmd_t pmd);
> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t pmd);
> +struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
> + pud_t pud);
>
> void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
> unsigned long size);
> diff --git a/mm/memory.c b/mm/memory.c
> index 78af3f243cee7..6f806bf3cc994 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -809,6 +809,25 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
> return page_folio(page);
> return NULL;
> }
> +
> +/**
> + * vm_normal_page_pud() - Get the "struct page" associated with a PUD
> + * @vma: The VMA mapping the @pud.
> + * @addr: The address where the @pud is mapped.
> + * @pud: The PUD.
> + *
> + * Get the "struct page" associated with a PUD. See __vm_normal_page()
> + * for details on "normal" and "special" mappings.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + * NULL if this is a "special" mapping.
> + */
> +struct page *vm_normal_page_pud(struct vm_area_struct *vma,
> + unsigned long addr, pud_t pud)
> +{
> + return __vm_normal_page(vma, addr, pud_pfn(pud), pud_special(pud),
> + pud_val(pud), PGTABLE_LEVEL_PUD);
> +}
> #endif
>
> /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 648038247a8d2..c6753d370ff4e 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
> fw->pudp = pudp;
> fw->pud = pud;
>
> - /*
> - * TODO: FW_MIGRATION support for PUD migration entries
> - * once there are relevant users.
> - */
> - if (!pud_present(pud) || pud_special(pud)) {
> + if (pud_none(pud)) {
> spin_unlock(ptl);
> goto not_found;
> - } else if (!pud_leaf(pud)) {
> + } else if (pud_present(pud) && !pud_leaf(pud)) {
> spin_unlock(ptl);
> goto pmd_table;
> + } else if (pud_present(pud)) {
> + page = vm_normal_page_pud(vma, addr, pud);
> + if (page)
> + goto found;
> }
> /*
> - * TODO: vm_normal_page_pud() will be handy once we want to
> - * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
> + * TODO: FW_MIGRATION support for PUD migration entries
> + * once there are relevant users.
> */
> - page = pud_page(pud);
> - goto found;
> + spin_unlock(ptl);
> + goto not_found;
> }
>
> pmd_table:
> --
> 2.50.1
>
* Re: [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
2025-08-11 11:26 ` [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
@ 2025-08-12 19:43 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-12 19:43 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang, David Vrabel, Wei Yang
On Mon, Aug 11, 2025 at 01:26:31PM +0200, David Hildenbrand wrote:
> ... and hide it behind a kconfig option. There is really no need for
> any !xen code to perform this check.
>
> The naming is a bit off: we want to find the "normal" page when a PTE
> was marked "special". So it's really not "finding a special" page.
>
> Improve the documentation, and add a comment in the code where XEN ends
> up performing the pte_mkspecial() through a hypercall. More details can
> be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
> special on x86 PV guests").
>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Oh I already reviewed it. But anyway, may as well say - THANKS for this,
it's great again :)
> ---
> drivers/xen/Kconfig | 1 +
> drivers/xen/gntdev.c | 5 +++--
> include/linux/mm.h | 18 +++++++++++++-----
> mm/Kconfig | 2 ++
> mm/memory.c | 12 ++++++++++--
> tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
> 6 files changed, 42 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index 24f485827e039..f9a35ed266ecf 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -138,6 +138,7 @@ config XEN_GNTDEV
> depends on XEN
> default m
> select MMU_NOTIFIER
> + select FIND_NORMAL_PAGE
> help
> Allows userspace processes to use grants.
>
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 1f21607656182..26f13b37c78e6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -321,6 +321,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
> BUG_ON(pgnr >= map->count);
> pte_maddr = arbitrary_virt_to_machine(pte).maddr;
>
> + /* Note: this will perform a pte_mkspecial() through the hypercall. */
> gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
> map->grants[pgnr].ref,
> map->grants[pgnr].domid);
> @@ -528,7 +529,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
> gntdev_put_map(priv, map);
> }
>
> -static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
> +static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
> unsigned long addr)
> {
> struct gntdev_grant_map *map = vma->vm_private_data;
> @@ -539,7 +540,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
> static const struct vm_operations_struct gntdev_vmops = {
> .open = gntdev_vma_open,
> .close = gntdev_vma_close,
> - .find_special_page = gntdev_vma_find_special_page,
> + .find_normal_page = gntdev_vma_find_normal_page,
> };
>
> /* ------------------------------------------------------------------ */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8ca7d2fa71343..3868ca1a25f9c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -657,13 +657,21 @@ struct vm_operations_struct {
> struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
> unsigned long addr, pgoff_t *ilx);
> #endif
> +#ifdef CONFIG_FIND_NORMAL_PAGE
> /*
> - * Called by vm_normal_page() for special PTEs to find the
> - * page for @addr. This is useful if the default behavior
> - * (using pte_page()) would not find the correct page.
> + * Called by vm_normal_page() for special PTEs in @vma at @addr. This
> + * allows for returning a "normal" page from vm_normal_page() even
> + * though the PTE indicates that the "struct page" either does not exist
> + * or should not be touched: "special".
> + *
> + * Do not add new users: this really only works when a "normal" page
> + * was mapped, but then the PTE got changed to something weird (+
> + * marked special) that would not make pte_pfn() identify the originally
> + * inserted page.
> */
> - struct page *(*find_special_page)(struct vm_area_struct *vma,
> - unsigned long addr);
> + struct page *(*find_normal_page)(struct vm_area_struct *vma,
> + unsigned long addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
> };
>
> #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e443fe8cd6cf2..59a04d0b2e272 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1381,6 +1381,8 @@ config PT_RECLAIM
>
> Note: now only empty user PTE page table pages will be reclaimed.
>
> +config FIND_NORMAL_PAGE
> + def_bool n
>
> source "mm/damon/Kconfig"
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6f806bf3cc994..002c28795d8b7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -639,6 +639,12 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> * trivial. Secondly, an architecture may not have a spare page table
> * entry bit, which requires a more complicated scheme, described below.
> *
> + * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on
> + * page table entries that actually map "normal" pages: however, that page
> + * cannot be looked up through the PFN stored in the page table entry, but
> + * instead will be looked up through vm_ops->find_normal_page(). So far, this
> + * only applies to PTEs.
> + *
> * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
> * special mapping (even if there are underlying and valid "struct pages").
> * COWed pages of a VM_PFNMAP are always normal.
> @@ -679,8 +685,10 @@ static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
> {
> if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> if (unlikely(special)) {
> - if (vma->vm_ops && vma->vm_ops->find_special_page)
> - return vma->vm_ops->find_special_page(vma, addr);
> +#ifdef CONFIG_FIND_NORMAL_PAGE
> + if (vma->vm_ops && vma->vm_ops->find_normal_page)
> + return vma->vm_ops->find_normal_page(vma, addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
> if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> return NULL;
> if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 3639aa8dd2b06..cb1c2a8afe265 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -467,13 +467,21 @@ struct vm_operations_struct {
> struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
> unsigned long addr, pgoff_t *ilx);
> #endif
> +#ifdef CONFIG_FIND_NORMAL_PAGE
> /*
> - * Called by vm_normal_page() for special PTEs to find the
> - * page for @addr. This is useful if the default behavior
> - * (using pte_page()) would not find the correct page.
> + * Called by vm_normal_page() for special PTEs in @vma at @addr. This
> + * allows for returning a "normal" page from vm_normal_page() even
> + * though the PTE indicates that the "struct page" either does not exist
> + * or should not be touched: "special".
> + *
> + * Do not add new users: this really only works when a "normal" page
> + * was mapped, but then the PTE got changed to something weird (+
> + * marked special) that would not make pte_pfn() identify the originally
> + * inserted page.
> */
> - struct page *(*find_special_page)(struct vm_area_struct *vma,
> - unsigned long addr);
> + struct page *(*find_normal_page)(struct vm_area_struct *vma,
> + unsigned long addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
> };
>
> struct vm_unmapped_area_info {
> --
> 2.50.1
>
* Re: [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-08-11 11:26 ` [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-08-12 18:48 ` Lorenzo Stoakes
@ 2025-08-25 12:31 ` David Hildenbrand
2025-08-26 5:25 ` Lorenzo Stoakes
1 sibling, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2025-08-25 12:31 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
Andrew Morton, Madhavan Srinivasan, Michael Ellerman,
Nicholas Piggin, Christophe Leroy, Juergen Gross,
Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
On 11.08.25 13:26, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed some light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function to print_bad_page_map().
> Provide print_bad_pte() as a simple wrapper.
>
> Document the implicit locking requirements for the page table re-walk.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the printing of page
> table content into __print_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
> [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that we are dumping all levels
> even when levels are folded for simplicity.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/pgtable.h | 19 ++++++++
> mm/memory.c | 104 ++++++++++++++++++++++++++++++++--------
> 2 files changed, 103 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index bff5c4241bf2e..33c84b38b7ec6 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1966,6 +1966,25 @@ enum pgtable_level {
> PGTABLE_LEVEL_PGD,
> };
>
> +static inline const char *pgtable_level_to_str(enum pgtable_level level)
> +{
> + switch (level) {
> + case PGTABLE_LEVEL_PTE:
> + return "pte";
> + case PGTABLE_LEVEL_PMD:
> + return "pmd";
> + case PGTABLE_LEVEL_PUD:
> + return "pud";
> + case PGTABLE_LEVEL_P4D:
> + return "p4d";
> + case PGTABLE_LEVEL_PGD:
> + return "pgd";
> + default:
> + VM_WARN_ON_ONCE(1);
> + return "unknown";
> + }
> +}
One kernel config doesn't like the VM_WARN_ON_ONCE here, and I don't think we
really need it. @Andrew can you squash:
From 0b8f6cdfe2c9d96393e7da1772e82048e096a903 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Mon, 25 Aug 2025 14:25:59 +0200
Subject: [PATCH] fixup: mm/memory: convert print_bad_pte() to
print_bad_page_map()
Let's just drop the warning, it's highly unlikely that we ever run into
this, and if so, there is serious stuff going wrong elsewhere.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/pgtable.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9f0329d45b1e1..94249e671a7e8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1997,7 +1997,6 @@ static inline const char *pgtable_level_to_str(enum pgtable_level level)
case PGTABLE_LEVEL_PGD:
return "pgd";
default:
- VM_WARN_ON_ONCE(1);
return "unknown";
}
}
--
2.50.1
--
Cheers
David / dhildenb
* Re: [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-08-25 12:31 ` David Hildenbrand
@ 2025-08-26 5:25 ` Lorenzo Stoakes
2025-08-26 6:17 ` David Hildenbrand
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2025-08-26 5:25 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On Mon, Aug 25, 2025 at 02:31:00PM +0200, David Hildenbrand wrote:
> On 11.08.25 13:26, David Hildenbrand wrote:
> > print_bad_pte() looks like something that should actually be a WARN
> > or similar, but historically it apparently has proven to be useful to
> > detect corruption of page tables even on production systems -- report
> > the issue and keep the system running to make it easier to actually detect
> > what is going wrong (e.g., multiple such messages might shed some light).
> >
> > As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> > to take care of print_bad_pte() as well.
> >
> > Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> > implementation and renaming the function to print_bad_page_map().
> > Provide print_bad_pte() as a simple wrapper.
> >
> > Document the implicit locking requirements for the page table re-walk.
> >
> > To make the function a bit more readable, factor out the ratelimit check
> > into is_bad_page_map_ratelimited() and place the printing of page
> > table content into __print_bad_page_map_pgtable(). We'll now dump
> > information from each level in a single line, and just stop the table
> > walk once we hit something that is not a present page table.
> >
> > The report will now look something like (dumping pgd to pmd values):
> >
> > [ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
> > [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> >
> > Not using pgdp_get(), because that does not work properly on some arm
> > configs where pgd_t is an array. Note that we are dumping all levels
> > even when levels are folded for simplicity.
> >
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> > include/linux/pgtable.h | 19 ++++++++
> > mm/memory.c | 104 ++++++++++++++++++++++++++++++++--------
> > 2 files changed, 103 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index bff5c4241bf2e..33c84b38b7ec6 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1966,6 +1966,25 @@ enum pgtable_level {
> > PGTABLE_LEVEL_PGD,
> > };
> > +static inline const char *pgtable_level_to_str(enum pgtable_level level)
> > +{
> > + switch (level) {
> > + case PGTABLE_LEVEL_PTE:
> > + return "pte";
> > + case PGTABLE_LEVEL_PMD:
> > + return "pmd";
> > + case PGTABLE_LEVEL_PUD:
> > + return "pud";
> > + case PGTABLE_LEVEL_P4D:
> > + return "p4d";
> > + case PGTABLE_LEVEL_PGD:
> > + return "pgd";
> > + default:
> > + VM_WARN_ON_ONCE(1);
> > + return "unknown";
> > + }
> > +}
>
> One kernel config doesn't like the VM_WARN_ON_ONCE here, and I don't think we
> really need it. @Andrew can you squash:
Out of interest do you know why this is happening? xtensa right? Does
xtensa not like CONFIG_DEBUG_VM?
* Re: [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-08-26 5:25 ` Lorenzo Stoakes
@ 2025-08-26 6:17 ` David Hildenbrand
0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-26 6:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
linuxppc-dev, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Zi Yan,
Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
Jann Horn, Pedro Falcato, Hugh Dickins, Oscar Salvador,
Lance Yang
On 26.08.25 07:25, Lorenzo Stoakes wrote:
> On Mon, Aug 25, 2025 at 02:31:00PM +0200, David Hildenbrand wrote:
>> On 11.08.25 13:26, David Hildenbrand wrote:
>>> print_bad_pte() looks like something that should actually be a WARN
>>> or similar, but historically it apparently has proven to be useful to
>>> detect corruption of page tables even on production systems -- report
>>> the issue and keep the system running to make it easier to actually detect
>>> what is going wrong (e.g., multiple such messages might shed some light).
>>>
>>> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
>>> to take care of print_bad_pte() as well.
>>>
>>> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
>>> implementation and renaming the function to print_bad_page_map().
>>> Provide print_bad_pte() as a simple wrapper.
>>>
>>> Document the implicit locking requirements for the page table re-walk.
>>>
>>> To make the function a bit more readable, factor out the ratelimit check
>>> into is_bad_page_map_ratelimited() and place the printing of page
>>> table content into __print_bad_page_map_pgtable(). We'll now dump
>>> information from each level in a single line, and just stop the table
>>> walk once we hit something that is not a present page table.
>>>
>>> The report will now look something like (dumping pgd to pmd values):
>>>
>>> [ 77.943408] BUG: Bad page map in process XXX pte:80000001233f5867
>>> [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>>> [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>>
>>> Not using pgdp_get(), because that does not work properly on some arm
>>> configs where pgd_t is an array. Note that we are dumping all levels
>>> even when levels are folded for simplicity.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>> include/linux/pgtable.h | 19 ++++++++
>>> mm/memory.c | 104 ++++++++++++++++++++++++++++++++--------
>>> 2 files changed, 103 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>> index bff5c4241bf2e..33c84b38b7ec6 100644
>>> --- a/include/linux/pgtable.h
>>> +++ b/include/linux/pgtable.h
>>> @@ -1966,6 +1966,25 @@ enum pgtable_level {
>>> PGTABLE_LEVEL_PGD,
>>> };
>>> +static inline const char *pgtable_level_to_str(enum pgtable_level level)
>>> +{
>>> + switch (level) {
>>> + case PGTABLE_LEVEL_PTE:
>>> + return "pte";
>>> + case PGTABLE_LEVEL_PMD:
>>> + return "pmd";
>>> + case PGTABLE_LEVEL_PUD:
>>> + return "pud";
>>> + case PGTABLE_LEVEL_P4D:
>>> + return "p4d";
>>> + case PGTABLE_LEVEL_PGD:
>>> + return "pgd";
>>> + default:
>>> + VM_WARN_ON_ONCE(1);
>>> + return "unknown";
>>> + }
>>> +}
>>
>> One kernel config doesn't like the VM_WARN_ON_ONCE here, and I don't think we
>> really need it. @Andrew can you squash:
>
> Out of interest do you know why this is happening? xtensa right? Does
> xtensa not like CONFIG_DEBUG_VM?
We don't happen to include mmdebug.h in a xtensa configuration.
Briefly thought about using a BUILD_BUG_ON_INVALID(), but decided to
just drop it completely.
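That would have looked something like (just a sketch of the rejected
alternative, not what gets applied):

        default:
                /* lets the compiler type-check 'level' without emitting code */
                BUILD_BUG_ON_INVALID(level);
                return "unknown";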
--
Cheers
David / dhildenb
* Re: [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-11 11:26 ` [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel" David Hildenbrand
2025-08-12 18:23 ` Lorenzo Stoakes
@ 2025-08-26 16:28 ` Ritesh Harjani
2025-08-27 13:57 ` David Hildenbrand
1 sibling, 1 reply; 27+ messages in thread
From: Ritesh Harjani @ 2025-08-26 16:28 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
David Hildenbrand, Andrew Morton, Madhavan Srinivasan,
Michael Ellerman, Nicholas Piggin, Christophe Leroy,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
David Hildenbrand <david@redhat.com> writes:
> We want to make use of "pgtable_level" for an enum in core-mm. Other
> architectures seem to call "struct pgtable_level" either:
> * "struct pg_level" when not exposed in a header (riscv, arm)
> * "struct ptdump_pg_level" when exposed in a header (arm64)
>
> So let's follow what arm64 does.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> arch/powerpc/mm/ptdump/8xx.c | 2 +-
> arch/powerpc/mm/ptdump/book3s64.c | 2 +-
> arch/powerpc/mm/ptdump/ptdump.h | 4 ++--
> arch/powerpc/mm/ptdump/shared.c | 2 +-
> 4 files changed, 5 insertions(+), 5 deletions(-)
As mentioned in the commit msg, this is mostly a mechanical change to convert
"struct pgtable_level" to "struct ptdump_pg_level" for the aforementioned purpose.
The patch looks ok and compiles fine on my book3s64 and ppc32 platforms.
I think we should fix the subject line: s/ptdump_pglevel/ptdump_pg_level
Otherwise the changes look good to me. So please feel free to add -
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>
> diff --git a/arch/powerpc/mm/ptdump/8xx.c b/arch/powerpc/mm/ptdump/8xx.c
> index b5c79b11ea3c2..4ca9cf7a90c9e 100644
> --- a/arch/powerpc/mm/ptdump/8xx.c
> +++ b/arch/powerpc/mm/ptdump/8xx.c
> @@ -69,7 +69,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> diff --git a/arch/powerpc/mm/ptdump/book3s64.c b/arch/powerpc/mm/ptdump/book3s64.c
> index 5ad92d9dc5d10..6b2da9241d4c4 100644
> --- a/arch/powerpc/mm/ptdump/book3s64.c
> +++ b/arch/powerpc/mm/ptdump/book3s64.c
> @@ -102,7 +102,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> diff --git a/arch/powerpc/mm/ptdump/ptdump.h b/arch/powerpc/mm/ptdump/ptdump.h
> index 154efae96ae09..4232aa4b57eae 100644
> --- a/arch/powerpc/mm/ptdump/ptdump.h
> +++ b/arch/powerpc/mm/ptdump/ptdump.h
> @@ -11,12 +11,12 @@ struct flag_info {
> int shift;
> };
>
> -struct pgtable_level {
> +struct ptdump_pg_level {
> const struct flag_info *flag;
> size_t num;
> u64 mask;
> };
>
> -extern struct pgtable_level pg_level[5];
> +extern struct ptdump_pg_level pg_level[5];
>
> void pt_dump_size(struct seq_file *m, unsigned long delta);
> diff --git a/arch/powerpc/mm/ptdump/shared.c b/arch/powerpc/mm/ptdump/shared.c
> index 39c30c62b7ea7..58998960eb9a4 100644
> --- a/arch/powerpc/mm/ptdump/shared.c
> +++ b/arch/powerpc/mm/ptdump/shared.c
> @@ -67,7 +67,7 @@ static const struct flag_info flag_array[] = {
> }
> };
>
> -struct pgtable_level pg_level[5] = {
> +struct ptdump_pg_level pg_level[5] = {
> { /* pgd */
> .flag = flag_array,
> .num = ARRAY_SIZE(flag_array),
> --
> 2.50.1
* Re: [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel"
2025-08-26 16:28 ` Ritesh Harjani
@ 2025-08-27 13:57 ` David Hildenbrand
0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2025-08-27 13:57 UTC (permalink / raw)
To: Ritesh Harjani (IBM), linux-kernel, Andrew Morton
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, linuxppc-dev,
Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
Christophe Leroy, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
On 26.08.25 18:28, Ritesh Harjani (IBM) wrote:
> David Hildenbrand <david@redhat.com> writes:
>
>> We want to make use of "pgtable_level" for an enum in core-mm. Other
>> architectures seem to call "struct pgtable_level" either:
>> * "struct pg_level" when not exposed in a header (riscv, arm)
>> * "struct ptdump_pg_level" when exposed in a header (arm64)
>>
>> So let's follow what arm64 does.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>> arch/powerpc/mm/ptdump/8xx.c | 2 +-
>> arch/powerpc/mm/ptdump/book3s64.c | 2 +-
>> arch/powerpc/mm/ptdump/ptdump.h | 4 ++--
>> arch/powerpc/mm/ptdump/shared.c | 2 +-
>> 4 files changed, 5 insertions(+), 5 deletions(-)
>
>
> As mentioned in the commit msg, this is mostly a mechanical change to convert
> "struct pgtable_level" to "struct ptdump_pg_level" for the aforementioned purpose.
>
> The patch looks ok and compiles fine on my book3s64 and ppc32 platforms.
>
> I think we should fix the subject line: s/ptdump_pglevel/ptdump_pg_level
>
Ahh, yes thanks.
@Andrew, can you fix that up?
> Otherwise the changes look good to me. So please feel free to add -
> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Thanks!
--
Cheers
David / dhildenb
Thread overview: 27+ messages
2025-08-11 11:26 [PATCH v3 00/11] mm: vm_normal_page*() improvements David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 01/11] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-08-12 4:52 ` Lance Yang
2025-08-11 11:26 ` [PATCH v3 02/11] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 03/11] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 04/11] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 05/11] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-08-12 18:14 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 06/11] powerpc/ptdump: rename "struct pgtable_level" to "struct ptdump_pglevel" David Hildenbrand
2025-08-12 18:23 ` Lorenzo Stoakes
2025-08-12 18:39 ` Christophe Leroy
2025-08-12 18:54 ` Lorenzo Stoakes
2025-08-26 16:28 ` Ritesh Harjani
2025-08-27 13:57 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 07/11] mm/rmap: convert "enum rmap_level" to "enum pgtable_level" David Hildenbrand
2025-08-12 18:33 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 08/11] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-08-12 18:48 ` Lorenzo Stoakes
2025-08-25 12:31 ` David Hildenbrand
2025-08-26 5:25 ` Lorenzo Stoakes
2025-08-26 6:17 ` David Hildenbrand
2025-08-11 11:26 ` [PATCH v3 09/11] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-08-12 19:06 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 10/11] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-08-12 19:38 ` Lorenzo Stoakes
2025-08-11 11:26 ` [PATCH v3 11/11] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-08-12 19:43 ` Lorenzo Stoakes