* [PATCH v1 0/9] mm: vm_normal_page*() improvements
@ 2025-07-15 13:23 David Hildenbrand
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
This is the follow-up to [1]:
[PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
Based on mm/mm-new. I dropped the CoW PFNMAP changes for now; I'm still
working on a better way to sort all that out cleanly.
Clean up and unify the vm_normal_page_*() handling, also marking the
huge zero folio as special in the PMD. Add and use vm_normal_page_pud() and
clean up the XEN vm_ops->find_special_page mechanism.
There are plans to use vm_normal_page_*() more widely soon.
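For context, a minimal usage sketch (not part of this series; the walker
function below is hypothetical) of how callers typically consume
vm_normal_page():

static void walk_one_pte(struct vm_area_struct *vma, unsigned long addr,
                         pte_t pte)
{
        struct page *page;

        /* NULL means a "special" mapping: PFNMAP, shared zeropage, ... */
        page = vm_normal_page(vma, addr, pte);
        if (!page)
                return;

        /* A "normal" page is ordinarily refcounted and safe to touch. */
        folio_mark_accessed(page_folio(page));
}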
Briefly tested on UML (making sure vm_normal_page() still works as expected
without pte_special() support) and on x86-64 with a bunch of tests.
[1] https://lkml.kernel.org/r/20250617154345.2494405-1-david@redhat.com
RFC -> v1:
* Dropped the highest_memmap_pfn removal stuff and instead added
"mm/memory: convert print_bad_pte() to print_bad_page_map()"
* Dropped "mm: compare pfns only if the entry is present when inserting
pfns/pages" for now, will probably clean that up separately.
* Dropped "mm: remove "horrible special case to handle copy-on-write
behaviour"", and "mm: drop addr parameter from vm_normal_*_pmd()" will
require more thought
* "mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()"
-> Extend patch description.
* "fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio"
-> Extend patch description.
* "mm/huge_memory: mark PMD mappings of the huge zero folio special"
-> Remove comment from vm_normal_page_pmd().
* "mm/memory: factor out common code from vm_normal_page_*()"
-> Adjust to print_bad_page_map()/highest_memmap_pfn changes.
-> Add proper kernel doc to all involved functions.
* "mm: introduce and use vm_normal_page_pud()"
-> Adjust to print_bad_page_map() changes.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lance Yang <lance.yang@linux.dev>
David Hildenbrand (9):
mm/huge_memory: move more common code into insert_pmd()
mm/huge_memory: move more common code into insert_pud()
mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
mm/huge_memory: mark PMD mappings of the huge zero folio special
mm/memory: convert print_bad_pte() to print_bad_page_map()
mm/memory: factor out common code from vm_normal_page_*()
mm: introduce and use vm_normal_page_pud()
mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
drivers/xen/Kconfig | 1 +
drivers/xen/gntdev.c | 5 +-
fs/dax.c | 47 +----
include/linux/mm.h | 20 +-
mm/Kconfig | 2 +
mm/huge_memory.c | 119 ++++-------
mm/memory.c | 346 ++++++++++++++++++++++---------
mm/pagewalk.c | 20 +-
tools/testing/vma/vma_internal.h | 18 +-
9 files changed, 343 insertions(+), 235 deletions(-)
base-commit: 64d19a2cdb7b62bcea83d9309d83e06d7aff4722
--
2.50.1
* [PATCH v1 1/9] mm/huge_memory: move more common code into insert_pmd()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple
Let's clean it all up further.
No functional change intended.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
1 file changed, 24 insertions(+), 48 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 31b5c4e61a574..154cafec58dcf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1390,15 +1390,25 @@ struct folio_or_pfn {
bool is_folio;
};
-static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
- bool write, pgtable_t pgtable)
+ bool write)
{
struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable = NULL;
+ spinlock_t *ptl;
pmd_t entry;
- lockdep_assert_held(pmd_lockptr(mm, pmd));
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+
+ ptl = pmd_lock(mm, pmd);
if (!pmd_none(*pmd)) {
const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
fop.pfn;
@@ -1406,15 +1416,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (pmd_pfn(*pmd) != pfn) {
WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
- return -EEXIST;
+ goto out_unlock;
}
entry = pmd_mkyoung(*pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
update_mmu_cache_pmd(vma, addr, pmd);
}
-
- return -EEXIST;
+ goto out_unlock;
}
if (fop.is_folio) {
@@ -1435,11 +1444,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (pgtable) {
pgtable_trans_huge_deposit(mm, pmd, pgtable);
mm_inc_nr_ptes(mm);
+ pgtable = NULL;
}
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
- return 0;
+
+out_unlock:
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(mm, pgtable);
+ return VM_FAULT_NOPAGE;
}
/**
@@ -1461,9 +1476,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
struct folio_or_pfn fop = {
.pfn = pfn,
};
- pgtable_t pgtable = NULL;
- spinlock_t *ptl;
- int error;
/*
* If we had pmd_special, we could avoid all these restrictions,
@@ -1475,25 +1487,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
-
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
pfnmap_setup_cachemode_pfn(pfn, &pgprot);
- ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
- pgtable);
- spin_unlock(ptl);
- if (error && pgtable)
- pte_free(vma->vm_mm, pgtable);
-
- return VM_FAULT_NOPAGE;
+ return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
@@ -1502,35 +1498,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
{
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address & PMD_MASK;
- struct mm_struct *mm = vma->vm_mm;
struct folio_or_pfn fop = {
.folio = folio,
.is_folio = true,
};
- spinlock_t *ptl;
- pgtable_t pgtable = NULL;
- int error;
-
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
return VM_FAULT_SIGBUS;
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
- ptl = pmd_lock(mm, vmf->pmd);
- error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
- write, pgtable);
- spin_unlock(ptl);
- if (error && pgtable)
- pte_free(mm, pgtable);
-
- return VM_FAULT_NOPAGE;
+ return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
--
2.50.1
* [PATCH v1 2/9] mm/huge_memory: move more common code into insert_pud()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple
Let's clean it all up further.
No functional change intended.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 36 +++++++++++++-----------------------
1 file changed, 13 insertions(+), 23 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 154cafec58dcf..1c4a42413042a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1518,25 +1518,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
return pud;
}
-static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
{
struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
pud_t entry;
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
fop.pfn;
if (write) {
if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
- return;
+ goto out_unlock;
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
update_mmu_cache_pud(vma, addr, pud);
}
- return;
+ goto out_unlock;
}
if (fop.is_folio) {
@@ -1555,6 +1560,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
}
set_pud_at(mm, addr, pud, entry);
update_mmu_cache_pud(vma, addr, pud);
+out_unlock:
+ spin_unlock(ptl);
+ return VM_FAULT_NOPAGE;
}
/**
@@ -1576,7 +1584,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
struct folio_or_pfn fop = {
.pfn = pfn,
};
- spinlock_t *ptl;
/*
* If we had pud_special, we could avoid all these restrictions,
@@ -1588,16 +1595,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
-
pfnmap_setup_cachemode_pfn(pfn, &pgprot);
- ptl = pud_lock(vma->vm_mm, vmf->pud);
- insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
- spin_unlock(ptl);
-
- return VM_FAULT_NOPAGE;
+ return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
@@ -1614,25 +1614,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
{
struct vm_area_struct *vma = vmf->vma;
unsigned long addr = vmf->address & PUD_MASK;
- pud_t *pud = vmf->pud;
- struct mm_struct *mm = vma->vm_mm;
struct folio_or_pfn fop = {
.folio = folio,
.is_folio = true,
};
- spinlock_t *ptl;
-
- if (addr < vma->vm_start || addr >= vma->vm_end)
- return VM_FAULT_SIGBUS;
if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
return VM_FAULT_SIGBUS;
- ptl = pud_lock(mm, pud);
- insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
- spin_unlock(ptl);
-
- return VM_FAULT_NOPAGE;
+ return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
}
EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
--
2.50.1
* [PATCH v1 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
Just like we do for vmf_insert_page_mkwrite() -> ... ->
insert_page_into_pte_locked() with the shared zeropage, support the
huge zero folio in vmf_insert_folio_pmd().
When (un)mapping the huge zero folio in page tables, we neither
adjust the refcount nor the mapcount, just like for the shared zeropage.
The huge zero folio is not marked as special yet, though
vm_normal_page_pmd() really wants to treat it as special. We'll change
that next.
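As a hedged illustration only (not code from this series; the function
name is made up), a PMD fault handler could now hand the huge zero folio
to vmf_insert_folio_pmd() like any other folio:

static vm_fault_t map_huge_zero_folio(struct vm_fault *vmf)
{
        struct folio *zero_folio;

        /* Returns NULL if the huge zero folio cannot be provided. */
        zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
        if (!zero_folio)
                return VM_FAULT_FALLBACK;

        /* Read-only mapping; refcount/mapcount stay untouched. */
        return vmf_insert_folio_pmd(vmf, zero_folio, false);
}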
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c4a42413042a..9ec7f48efde09 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1429,9 +1429,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (fop.is_folio) {
entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
- folio_get(fop.folio);
- folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
- add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+ if (!is_huge_zero_folio(fop.folio)) {
+ folio_get(fop.folio);
+ folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
+ add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+ }
} else {
entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
entry = pmd_mkspecial(entry);
--
2.50.1
* [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
Let's convert to vmf_insert_folio_pmd().
There is a theoretical change in behavior: in the unlikely case there is
already something mapped, we'll now still call trace_dax_pmd_load_hole()
and return VM_FAULT_NOPAGE.
Previously, we would have returned VM_FAULT_FALLBACK, and the caller
would have zapped the PMD to try a PTE fault.
However, that behavior differed from other PTE+PMD faults where something
is already mapped, and it's not even clear whether it could be triggered.
If the huge zero folio is already mapped: all good, no need to
fall back to PTEs.
If there is already a leaf page table ... the behavior would be
just like when trying to insert a PMD mapping a folio through
dax_fault_iter()->vmf_insert_folio_pmd().
If there is already something else mapped as a PMD? That sounds like
a BUG, and the behavior would be just like when trying to insert a PMD
mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
So it sounds reasonable not to handle the huge zero folio differently
from inserting PMDs mapping folios when there already is something mapped.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
fs/dax.c | 47 ++++++++++-------------------------------------
1 file changed, 10 insertions(+), 37 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4229513806bea..ae90706674a3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
const struct iomap_iter *iter, void **entry)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
- unsigned long pmd_addr = vmf->address & PMD_MASK;
- struct vm_area_struct *vma = vmf->vma;
struct inode *inode = mapping->host;
- pgtable_t pgtable = NULL;
struct folio *zero_folio;
- spinlock_t *ptl;
- pmd_t pmd_entry;
- unsigned long pfn;
+ vm_fault_t ret;
zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
- if (unlikely(!zero_folio))
- goto fallback;
-
- pfn = page_to_pfn(&zero_folio->page);
- *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
- DAX_PMD | DAX_ZERO_PAGE);
-
- if (arch_needs_pgtable_deposit()) {
- pgtable = pte_alloc_one(vma->vm_mm);
- if (!pgtable)
- return VM_FAULT_OOM;
- }
-
- ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
- if (!pmd_none(*(vmf->pmd))) {
- spin_unlock(ptl);
- goto fallback;
+ if (unlikely(!zero_folio)) {
+ trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
+ return VM_FAULT_FALLBACK;
}
- if (pgtable) {
- pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
- mm_inc_nr_ptes(vma->vm_mm);
- }
- pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
- set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
- spin_unlock(ptl);
- trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
- return VM_FAULT_NOPAGE;
+ *entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
+ DAX_PMD | DAX_ZERO_PAGE);
-fallback:
- if (pgtable)
- pte_free(vma->vm_mm, pgtable);
- trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
- return VM_FAULT_FALLBACK;
+ ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
+ if (ret == VM_FAULT_NOPAGE)
+ trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
+ return ret;
}
#else
static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
--
2.50.1
* [PATCH v1 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
The huge zero folio is refcounted (and mapcounted) differently from
"normal" folios, similarly (but not identically) to the ordinary
shared zeropage.
For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to
still use them (e.g., GUP can still take a reference on them).
vm_normal_page_pmd() already filters out the huge zero folio. However,
so far we are not marking it as special like we do with the ordinary
shared zeropage. Let's mark it as special, so we can further refactor
vm_normal_page_pmd() and vm_normal_page().
While at it, update the doc regarding the shared zero folios.
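A minimal sketch (hypothetical walker, not from this series) of how such
a selected caller could still resolve the huge zero folio once the PMD is
marked special and vm_normal_page_pmd() returns NULL for it:

static struct folio *walker_pmd_folio(struct vm_area_struct *vma,
                                      unsigned long addr, pmd_t pmd)
{
        struct page *page;

        /* "Special": neither refcounted nor mapcounted as usual. */
        if (is_huge_zero_pmd(pmd))
                return huge_zero_folio;

        page = vm_normal_page_pmd(vma, addr, pmd);
        return page ? page_folio(page) : NULL;
}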
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/huge_memory.c | 5 ++++-
mm/memory.c | 14 +++++++++-----
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9ec7f48efde09..24aff14d22a1e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
{
pmd_t entry;
entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+ entry = pmd_mkspecial(entry);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
mm_inc_nr_ptes(mm);
@@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
if (fop.is_folio) {
entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
- if (!is_huge_zero_folio(fop.folio)) {
+ if (is_huge_zero_folio(fop.folio)) {
+ entry = pmd_mkspecial(entry);
+ } else {
folio_get(fop.folio);
folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
diff --git a/mm/memory.c b/mm/memory.c
index 3dd6c57e6511e..a4f62923b961c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -543,7 +543,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted. However, selected page table walkers
+ * (such as GUP) can still identify these mappings and work with the
+ * underlying "struct page".
*
* There are 2 broad cases. Firstly, an architecture may define a pte_special()
* pte bit, in which case this function is trivial. Secondly, an architecture
@@ -573,9 +579,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
*
* VM_MIXEDMAP mappings can likewise contain memory with or without "struct
* page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
*
* The disadvantage is that pages are refcounted (which can be slower and
* simply not an option for some PFNMAP users). The advantage is that we
@@ -655,7 +660,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
{
unsigned long pfn = pmd_pfn(pmd);
- /* Currently it's only used for huge pfnmaps */
if (unlikely(pmd_special(pmd)))
return NULL;
--
2.50.1
* [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
print_bad_pte() looks like something that should actually be a WARN
or similar, but historically it has apparently proven useful for
detecting page table corruption even on production systems -- report
the issue and keep the system running to make it easier to actually detect
what is going wrong (e.g., multiple such messages might shed some light).
As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
to take care of print_bad_pte() as well.
Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
implementation and renaming the function -- we'll rename it to what
we actually print: bad (page) mappings. Maybe it should be called
"print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
because the assumption is that we are dealing with some (previously)
present page table entry that got corrupted in weird ways.
Whether it is a PTE or something else will usually become obvious from the
page table dump or from the dumped stack. If ever required in the future,
we could pass the entry level type similar to "enum rmap_level". For now,
let's keep it simple.
To make the function a bit more readable, factor out the ratelimit check
into is_bad_page_map_ratelimited() and place the dumping of page
table content into __dump_bad_page_map_pgtable(). We'll now dump
information from each level in a single line, and just stop the table
walk once we hit something that is not a present page table.
Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
for vm_normal_page(), now that we have a function that can handle it.
The report will now look something like (dumping pgd to pmd values):
[ 77.943408] BUG: Bad page map in process XXX entry:80000001233f5867
[ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
[ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 94 insertions(+), 26 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index a4f62923b961c..00ee0df020503 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -479,22 +479,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
add_mm_counter(mm, i, rss[i]);
}
-/*
- * This function is called to print an error when a bad pte
- * is found. For example, we might have a PFN-mapped pte in
- * a region that doesn't allow it.
- *
- * The calling function must still handle the error.
- */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
- pte_t pte, struct page *page)
+static bool is_bad_page_map_ratelimited(void)
{
- pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
- p4d_t *p4d = p4d_offset(pgd, addr);
- pud_t *pud = pud_offset(p4d, addr);
- pmd_t *pmd = pmd_offset(pud, addr);
- struct address_space *mapping;
- pgoff_t index;
static unsigned long resume;
static unsigned long nr_shown;
static unsigned long nr_unshown;
@@ -506,7 +492,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
if (nr_shown == 60) {
if (time_before(jiffies, resume)) {
nr_unshown++;
- return;
+ return true;
}
if (nr_unshown) {
pr_alert("BUG: Bad page map: %lu messages suppressed\n",
@@ -517,15 +503,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}
if (nr_shown++ == 0)
resume = jiffies + 60 * HZ;
+ return false;
+}
+
+static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
+{
+ unsigned long long pgdv, p4dv, pudv, pmdv;
+ pgd_t pgd, *pgdp;
+ p4d_t p4d, *p4dp;
+ pud_t pud, *pudp;
+ pmd_t *pmdp;
+
+ /*
+ * This looks like a fully lockless walk, however, the caller is
+ * expected to hold the leaf page table lock in addition to other
+ * rmap/mm/vma locks. So this is just a re-walk to dump page table
+ * content while any concurrent modifications should be completely
+ * prevented.
+ */
+ pgdp = pgd_offset(mm, addr);
+ pgd = pgdp_get(pgdp);
+ pgdv = pgd_val(pgd);
+
+ if (!pgd_present(pgd) || pgd_leaf(pgd)) {
+ pr_alert("pgd:%08llx\n", pgdv);
+ return;
+ }
+
+ p4dp = p4d_offset(pgdp, addr);
+ p4d = p4dp_get(p4dp);
+ p4dv = p4d_val(p4d);
+
+ if (!p4d_present(p4d) || p4d_leaf(p4d)) {
+ pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
+ return;
+ }
+
+ pudp = pud_offset(p4dp, addr);
+ pud = pudp_get(pudp);
+ pudv = pud_val(pud);
+
+ if (!pud_present(pud) || pud_leaf(pud)) {
+ pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
+ return;
+ }
+
+ pmdp = pmd_offset(pudp, addr);
+ pmdv = pmd_val(pmdp_get(pmdp));
+
+ /*
+ * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
+ * because the table should already be mapped by the caller and
+ * doing another map would be bad. print_bad_page_map() should
+ * already take care of printing the PTE.
+ */
+ pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
+ p4dv, pudv, pmdv);
+}
+
+/*
+ * This function is called to print an error when a bad page table entry (e.g.,
+ * corrupted page table entry) is found. For example, we might have a
+ * PFN-mapped pte in a region that doesn't allow it.
+ *
+ * The calling function must still handle the error.
+ */
+static void print_bad_page_map(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long long entry, struct page *page)
+{
+ struct address_space *mapping;
+ pgoff_t index;
+
+ if (is_bad_page_map_ratelimited())
+ return;
mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
index = linear_page_index(vma, addr);
- pr_alert("BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
- current->comm,
- (long long)pte_val(pte), (long long)pmd_val(*pmd));
+ pr_alert("BUG: Bad page map in process %s entry:%08llx", current->comm, entry);
+ __dump_bad_page_map_pgtable(vma->vm_mm, addr);
if (page)
- dump_page(page, "bad pte");
+ dump_page(page, "bad page map");
pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
@@ -603,7 +661,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
if (is_zero_pfn(pfn))
return NULL;
- print_bad_pte(vma, addr, pte, NULL);
+ print_bad_page_map(vma, addr, pte_val(pte), NULL);
return NULL;
}
@@ -631,7 +689,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
check_pfn:
if (unlikely(pfn > highest_memmap_pfn)) {
- print_bad_pte(vma, addr, pte, NULL);
+ print_bad_page_map(vma, addr, pte_val(pte), NULL);
return NULL;
}
@@ -660,8 +718,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
{
unsigned long pfn = pmd_pfn(pmd);
- if (unlikely(pmd_special(pmd)))
+ if (unlikely(pmd_special(pmd))) {
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+ if (is_huge_zero_pfn(pfn))
+ return NULL;
+
+ print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
return NULL;
+ }
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
@@ -680,8 +745,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
if (is_huge_zero_pfn(pfn))
return NULL;
- if (unlikely(pfn > highest_memmap_pfn))
+ if (unlikely(pfn > highest_memmap_pfn)) {
+ print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
return NULL;
+ }
/*
* NOTE! We still have PageReserved() pages in the page tables.
@@ -1515,7 +1582,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
folio_remove_rmap_ptes(folio, page, nr, vma);
if (unlikely(folio_mapcount(folio) < 0))
- print_bad_pte(vma, addr, ptent, page);
+ print_bad_page_map(vma, addr, pte_val(ptent), page);
}
if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
*force_flush = true;
@@ -4513,7 +4580,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
} else if (is_pte_marker_entry(entry)) {
ret = handle_pte_marker(vmf);
} else {
- print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
+ print_bad_page_map(vma, vmf->address,
+ pte_val(vmf->orig_pte), NULL);
ret = VM_FAULT_SIGBUS;
}
goto out;
--
2.50.1
* [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
Let's reduce the code duplication and factor out the non-pte/pmd related
magic into vm_normal_page_pfn().
To keep it simpler, check the pfn against both zero folios. We could
optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
case, it's not a compelling micro-optimization.
With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
really.
It's a good question whether we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
scenario in the PMD case in practice, but it doesn't really matter, as
it's now all unified in vm_normal_page_pfn().
Add kerneldoc for all involved functions.
No functional change intended.
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/memory.c | 183 +++++++++++++++++++++++++++++++---------------------
1 file changed, 109 insertions(+), 74 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 00ee0df020503..d5f80419989b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -596,8 +596,13 @@ static void print_bad_page_map(struct vm_area_struct *vma,
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}
-/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+/**
+ * vm_normal_page_pfn() - Get the "struct page" associated with a PFN in a
+ * non-special page table entry.
+ * @vma: The VMA mapping the @pfn.
+ * @addr: The address where the @pfn is mapped.
+ * @pfn: The PFN.
+ * @entry: The page table entry value for error reporting purposes.
*
* "Special" mappings do not wish to be associated with a "struct page" (either
* it doesn't exist, or it exists but they don't want to touch it). In this
@@ -609,10 +614,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* (such as GUP) can still identify these mappings and work with the
* underlying "struct page".
*
- * There are 2 broad cases. Firstly, an architecture may define a pte_special()
- * pte bit, in which case this function is trivial. Secondly, an architecture
- * may not have a spare pte bit, which requires a more complicated scheme,
- * described below.
+ * There are 2 broad cases. Firstly, an architecture may define a "special"
+ * page table entry bit (e.g., pte_special()), in which case this function is
+ * trivial. Secondly, an architecture may not have a spare page table
+ * entry bit, which requires a more complicated scheme, described below.
*
* A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
* special mapping (even if there are underlying and valid "struct pages").
@@ -645,15 +650,72 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* don't have to follow the strict linearity rule of PFNMAP mappings in
* order to support COWable mappings.
*
+ * This function is not expected to be called for obviously special mappings:
+ * when the page table entry has the "special" bit set.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
+static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
+ unsigned long addr, unsigned long pfn, unsigned long long entry)
+{
+ /*
+ * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
+ * (incl. shared zero folios) are marked accordingly and are handled
+ * by the caller.
+ */
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
+ if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
+ if (vma->vm_flags & VM_MIXEDMAP) {
+ /* If it has a "struct page", it's "normal". */
+ if (!pfn_valid(pfn))
+ return NULL;
+ } else {
+ unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
+
+ /* Only CoW'ed anon folios are "normal". */
+ if (pfn == vma->vm_pgoff + off)
+ return NULL;
+ if (!is_cow_mapping(vma->vm_flags))
+ return NULL;
+ }
+ }
+
+ if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
+ return NULL;
+ }
+
+ /* Cheap check for corrupted page table entries. */
+ if (pfn > highest_memmap_pfn) {
+ print_bad_page_map(vma, addr, entry, NULL);
+ return NULL;
+ }
+ /*
+ * NOTE! We still have PageReserved() pages in the page tables.
+ * For example, VDSO mappings can cause them to exist.
+ */
+ VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
+ return pfn_to_page(pfn);
+}
+
+/**
+ * vm_normal_page() - Get the "struct page" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct page" associated with a PTE. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
*/
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
- if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
- if (likely(!pte_special(pte)))
- goto check_pfn;
+ if (unlikely(pte_special(pte))) {
if (vma->vm_ops && vma->vm_ops->find_special_page)
return vma->vm_ops->find_special_page(vma, addr);
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -664,44 +726,21 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
print_bad_page_map(vma, addr, pte_val(pte), NULL);
return NULL;
}
-
- /* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
-
- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
- if (!pfn_valid(pfn))
- return NULL;
- if (is_zero_pfn(pfn))
- return NULL;
- goto out;
- } else {
- unsigned long off;
- off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
- }
- }
-
- if (is_zero_pfn(pfn))
- return NULL;
-
-check_pfn:
- if (unlikely(pfn > highest_memmap_pfn)) {
- print_bad_page_map(vma, addr, pte_val(pte), NULL);
- return NULL;
- }
-
- /*
- * NOTE! We still have PageReserved() pages in the page tables.
- * eg. VDSO mappings can cause them to exist.
- */
-out:
- VM_WARN_ON_ONCE(is_zero_pfn(pfn));
- return pfn_to_page(pfn);
+ return vm_normal_page_pfn(vma, addr, pfn, pte_val(pte));
}
+/**
+ * vm_normal_folio() - Get the "struct folio" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct folio" associated with a PTE. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
@@ -713,6 +752,18 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
}
#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+/**
+ * vm_normal_page_pmd() - Get the "struct page" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct page" associated with a PMD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd)
{
@@ -727,37 +778,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
return NULL;
}
-
- if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
- if (vma->vm_flags & VM_MIXEDMAP) {
- if (!pfn_valid(pfn))
- return NULL;
- goto out;
- } else {
- unsigned long off;
- off = (addr - vma->vm_start) >> PAGE_SHIFT;
- if (pfn == vma->vm_pgoff + off)
- return NULL;
- if (!is_cow_mapping(vma->vm_flags))
- return NULL;
- }
- }
-
- if (is_huge_zero_pfn(pfn))
- return NULL;
- if (unlikely(pfn > highest_memmap_pfn)) {
- print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
- return NULL;
- }
-
- /*
- * NOTE! We still have PageReserved() pages in the page tables.
- * eg. VDSO mappings can cause them to exist.
- */
-out:
- return pfn_to_page(pfn);
+ return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
}
+/**
+ * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct folio" associated with a PMD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd)
{
--
2.50.1
* [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
Let's introduce vm_normal_page_pud(), which ends up being fairly simple
because of our new common helpers and there not being a PUD-sized zero
folio.
Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
structuring the code like the other (pmd/pte) cases. Defer
introducing vm_normal_folio_pud() until it is actually needed.
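Should vm_normal_folio_pud() become necessary later, it would presumably
just mirror vm_normal_folio_pmd(); a sketch of that possible wrapper:

struct folio *vm_normal_folio_pud(struct vm_area_struct *vma,
                                  unsigned long addr, pud_t pud)
{
        struct page *page = vm_normal_page_pud(vma, addr, pud);

        if (page)
                return page_folio(page);
        return NULL;
}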
Signed-off-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 2 ++
mm/memory.c | 27 +++++++++++++++++++++++++++
mm/pagewalk.c | 20 ++++++++++----------
3 files changed, 39 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 611f337cc36c9..6877c894fe526 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2347,6 +2347,8 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd);
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t pmd);
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
+ pud_t pud);
void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index d5f80419989b9..f1834a19a2f1e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -802,6 +802,33 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
return page_folio(page);
return NULL;
}
+
+/**
+ * vm_normal_page_pud() - Get the "struct page" associated with a PUD
+ * @vma: The VMA mapping the @pud.
+ * @addr: The address where the @pud is mapped.
+ * @pud: The PUD.
+ *
+ * Get the "struct page" associated with a PUD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ * NULL if this is a "special" mapping.
+ */
+struct page *vm_normal_page_pud(struct vm_area_struct *vma,
+ unsigned long addr, pud_t pud)
+{
+ unsigned long pfn = pud_pfn(pud);
+
+ if (unlikely(pud_special(pud))) {
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+ return NULL;
+
+ print_bad_page_map(vma, addr, pud_val(pud), NULL);
+ return NULL;
+ }
+ return vm_normal_page_pfn(vma, addr, pfn, pud_val(pud));
+}
#endif
/**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 648038247a8d2..c6753d370ff4e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
fw->pudp = pudp;
fw->pud = pud;
- /*
- * TODO: FW_MIGRATION support for PUD migration entries
- * once there are relevant users.
- */
- if (!pud_present(pud) || pud_special(pud)) {
+ if (pud_none(pud)) {
spin_unlock(ptl);
goto not_found;
- } else if (!pud_leaf(pud)) {
+ } else if (pud_present(pud) && !pud_leaf(pud)) {
spin_unlock(ptl);
goto pmd_table;
+ } else if (pud_present(pud)) {
+ page = vm_normal_page_pud(vma, addr, pud);
+ if (page)
+ goto found;
}
/*
- * TODO: vm_normal_page_pud() will be handy once we want to
- * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+ * TODO: FW_MIGRATION support for PUD migration entries
+ * once there are relevant users.
*/
- page = pud_page(pud);
- goto found;
+ spin_unlock(ptl);
+ goto not_found;
}
pmd_table:
--
2.50.1
* [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
From: David Hildenbrand @ 2025-07-15 13:23 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang, David Vrabel
... and hide it behind a kconfig option. There is really no need for
any !xen code to perform this check.
The naming is a bit off: we want to find the "normal" page when a PTE
was marked "special". So it's really not about finding a "special" page.
Improve the documentation, and add a comment in the code where XEN ends
up performing the pte_mkspecial() through a hypercall. More details can
be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
special on x86 PV guests").
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
drivers/xen/Kconfig | 1 +
drivers/xen/gntdev.c | 5 +++--
include/linux/mm.h | 18 +++++++++++++-----
mm/Kconfig | 2 ++
mm/memory.c | 12 ++++++++++--
tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
6 files changed, 42 insertions(+), 14 deletions(-)
diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 24f485827e039..f9a35ed266ecf 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -138,6 +138,7 @@ config XEN_GNTDEV
depends on XEN
default m
select MMU_NOTIFIER
+ select FIND_NORMAL_PAGE
help
Allows userspace processes to use grants.
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 61faea1f06630..d1bc0dae2cdf9 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -309,6 +309,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
BUG_ON(pgnr >= map->count);
pte_maddr = arbitrary_virt_to_machine(pte).maddr;
+ /* Note: this will perform a pte_mkspecial() through the hypercall. */
gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
map->grants[pgnr].ref,
map->grants[pgnr].domid);
@@ -516,7 +517,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
gntdev_put_map(priv, map);
}
-static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
+static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
unsigned long addr)
{
struct gntdev_grant_map *map = vma->vm_private_data;
@@ -527,7 +528,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
static const struct vm_operations_struct gntdev_vmops = {
.open = gntdev_vma_open,
.close = gntdev_vma_close,
- .find_special_page = gntdev_vma_find_special_page,
+ .find_normal_page = gntdev_vma_find_normal_page,
};
/* ------------------------------------------------------------------ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6877c894fe526..cc3322fce62f4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -646,13 +646,21 @@ struct vm_operations_struct {
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr, pgoff_t *ilx);
#endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
/*
- * Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
- * (using pte_page()) would not find the correct page.
+ * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+ * allows for returning a "normal" page from vm_normal_page() even
+ * though the PTE indicates that the "struct page" either does not exist
+ * or should not be touched: "special".
+ *
+ * Do not add new users: this really only works when a "normal" page
+ * was mapped, but then the PTE got changed to something weird (+
+ * marked special) that would not make pte_pfn() identify the originally
+ * inserted page.
*/
- struct page *(*find_special_page)(struct vm_area_struct *vma,
- unsigned long addr);
+ struct page *(*find_normal_page)(struct vm_area_struct *vma,
+ unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
};
#ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index 0287e8d94aea7..82c281b4f6937 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1397,6 +1397,8 @@ config PT_RECLAIM
Note: now only empty user PTE page table pages will be reclaimed.
+config FIND_NORMAL_PAGE
+ def_bool n
source "mm/damon/Kconfig"
diff --git a/mm/memory.c b/mm/memory.c
index f1834a19a2f1e..d09f2ff4a866e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,6 +619,12 @@ static void print_bad_page_map(struct vm_area_struct *vma,
* trivial. Secondly, an architecture may not have a spare page table
* entry bit, which requires a more complicated scheme, described below.
*
+ * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on
+ * page table entries that actually map "normal" pages: however, that page
+ * cannot be looked up through the PFN stored in the page table entry, but
+ * instead will be looked up through vm_ops->find_normal_page(). So far, this
+ * only applies to PTEs.
+ *
* A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
* special mapping (even if there are underlying and valid "struct pages").
* COWed pages of a VM_PFNMAP are always normal.
@@ -716,8 +722,10 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn = pte_pfn(pte);
if (unlikely(pte_special(pte))) {
- if (vma->vm_ops && vma->vm_ops->find_special_page)
- return vma->vm_ops->find_special_page(vma, addr);
+#ifdef CONFIG_FIND_NORMAL_PAGE
+ if (vma->vm_ops && vma->vm_ops->find_normal_page)
+ return vma->vm_ops->find_normal_page(vma, addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
return NULL;
if (is_zero_pfn(pfn))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 991022e9e0d3b..9eecfb1dcc13f 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -465,13 +465,21 @@ struct vm_operations_struct {
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
unsigned long addr, pgoff_t *ilx);
#endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
/*
- * Called by vm_normal_page() for special PTEs to find the
- * page for @addr. This is useful if the default behavior
- * (using pte_page()) would not find the correct page.
+ * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+ * allows for returning a "normal" page from vm_normal_page() even
+ * though the PTE indicates that the "struct page" either does not exist
+ * or should not be touched: "special".
+ *
+ * Do not add new users: this really only works when a "normal" page
+ * was mapped, but then the PTE got changed to something weird (+
+ * marked special) that would not make pte_pfn() identify the originally
+ * inserted page.
*/
- struct page *(*find_special_page)(struct vm_area_struct *vma,
- unsigned long addr);
+ struct page *(*find_normal_page)(struct vm_area_struct *vma,
+ unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
};
struct vm_unmapped_area_info {
--
2.50.1
* Re: [PATCH v1 0/9] mm: vm_normal_page*() improvements
From: Andrew Morton @ 2025-07-15 23:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
On Tue, 15 Jul 2025 15:23:41 +0200 David Hildenbrand <david@redhat.com> wrote:
> Based on mm/mm-new. I dropped the CoW PFNMAP changes for now, still
> working on a better way to sort all that out cleanly.
>
> Cleanup and unify vm_normal_page_*() handling, also marking the
> huge zerofolio as special in the PMD. Add+use vm_normal_page_pud() and
> cleanup that XEN vm_ops->find_special_page thingy.
>
> There are plans of using vm_normal_page_*() more widely soon.
>
> Briefly tested on UML (making sure vm_normal_page() still works as expected
> without pte_special() support) and on x86-64 with a bunch of tests.
When I was but a wee little bairn, my mother would always tell me
"never merge briefly tested patches when you're at -rc6". But three
weeks in -next should shake things out.
However the series rejects due to the is_huge_zero_pmd ->
is_huge_zero_pfn changes in Luiz's "mm: introduce snapshot_page() v3"
series, so could we please have a redo against present mm-new?
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*()
2025-07-15 13:23 ` [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-07-16 8:15 ` Oscar Salvador
0 siblings, 0 replies; 20+ messages in thread
From: Oscar Salvador @ 2025-07-16 8:15 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Lance Yang
On Tue, Jul 15, 2025 at 03:23:48PM +0200, David Hildenbrand wrote:
> Let's reduce the code duplication and factor out the non-pte/pmd related
> magic into vm_normal_page_pfn().
>
> To keep it simpler, check the pfn against both zero folios. We could
> optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
> case, it's not a compelling micro-optimization.
>
> With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
> really.
>
> It's a good question if we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
> scenario in the PMD case in practice: but doesn't really matter, as
> it's now all unified in vm_normal_page_pfn().
>
> Add kerneldoc for all involved functions.
>
> No functional change intended.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 20+ messages in thread
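As a rough illustration of the factoring described in the patch above, here is a sketch of the shared helper's shape (an approximation, not the patch code: the real helper also feeds print_bad_page_map() and differs in detail). It concentrates the !CONFIG_ARCH_HAS_PTE_SPECIAL checks that were previously duplicated for PTEs and PMDs, including the check against both zero folios mentioned in the description.

static struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
                unsigned long addr, unsigned long pfn)
{
        if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
                if (vma->vm_flags & VM_MIXEDMAP) {
                        /* MIXEDMAP: only PFNs with a memmap are "normal". */
                        if (!pfn_valid(pfn))
                                return NULL;
                } else {
                        unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;

                        /* Raw PFNMAP entries are special; only CoWed pages are normal. */
                        if (pfn == vma->vm_pgoff + off ||
                            !is_cow_mapping(vma->vm_flags))
                                return NULL;
                }
        }

        /* Check against both the order-0 and the huge zero folio. */
        if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
                return NULL;
        return pfn_to_page(pfn);
}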
* Re: [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud()
2025-07-15 13:23 ` [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-07-16 8:20 ` Oscar Salvador
0 siblings, 0 replies; 20+ messages in thread
From: Oscar Salvador @ 2025-07-16 8:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Lance Yang
On Tue, Jul 15, 2025 at 03:23:49PM +0200, David Hildenbrand wrote:
> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> because of our new common helpers and there not being a PUD-sized zero
> folio.
>
> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> structuring the code like the other (pmd/pte) cases. Defer
> introducing vm_normal_folio_pud() until really used.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 20+ messages in thread
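Along the same lines, a sketch of what the PUD variant boils down to once the common helper exists (assumed shape, not the patch code; it presumes a pud_special() helper analogous to pmd_special() and a vm_normal_page_pfn() helper as sketched earlier, and relies on there being no PUD-sized zero folio):

struct page *vm_normal_page_pud(struct vm_area_struct *vma,
                unsigned long addr, pud_t pud)
{
        /* Special PUDs (e.g., raw PFN mappings) have no "normal" page. */
        if (unlikely(pud_special(pud)))
                return NULL;
        /* No PUD-sized zero folio, so only the shared pfn checks remain. */
        return vm_normal_page_pfn(vma, addr, pud_pfn(pud));
}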
* Re: [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
2025-07-15 13:23 ` [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
@ 2025-07-16 8:22 ` Oscar Salvador
0 siblings, 0 replies; 20+ messages in thread
From: Oscar Salvador @ 2025-07-16 8:22 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Lance Yang, David Vrabel
On Tue, Jul 15, 2025 at 03:23:50PM +0200, David Hildenbrand wrote:
> ... and hide it behind a kconfig option. There is really no need for
> any !xen code to perform this check.
>
> The naming is a bit off: we want to find the "normal" page when a PTE
> was marked "special". So it's really not "finding a special" page.
>
> Improve the documentation, and add a comment in the code where XEN ends
> up performing the pte_mkspecial() through a hypercall. More details can
> be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
> special on x86 PV guests").
>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
2025-07-15 13:23 ` [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
@ 2025-07-16 8:40 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-07-16 8:40 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, Andrew Morton,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang, Russell King,
linux-kernel@vger.kernel.org
On 15.07.25 15:23, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
>
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [ 77.943408] BUG: Bad page map in process XXX entry:80000001233f5867
> [ 77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [ 77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 94 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a4f62923b961c..00ee0df020503 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -479,22 +479,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> add_mm_counter(mm, i, rss[i]);
> }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> - pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
> {
> - pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> - p4d_t *p4d = p4d_offset(pgd, addr);
> - pud_t *pud = pud_offset(p4d, addr);
> - pmd_t *pmd = pmd_offset(pud, addr);
> - struct address_space *mapping;
> - pgoff_t index;
> static unsigned long resume;
> static unsigned long nr_shown;
> static unsigned long nr_unshown;
> @@ -506,7 +492,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> if (nr_shown == 60) {
> if (time_before(jiffies, resume)) {
> nr_unshown++;
> - return;
> + return true;
> }
> if (nr_unshown) {
> pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -517,15 +503,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> }
> if (nr_shown++ == 0)
> resume = jiffies + 60 * HZ;
> + return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> + unsigned long long pgdv, p4dv, pudv, pmdv;
> + pgd_t pgd, *pgdp;
> + p4d_t p4d, *p4dp;
> + pud_t pud, *pudp;
> + pmd_t *pmdp;
> +
> + /*
> + * This looks like a fully lockless walk, however, the caller is
> + * expected to hold the leaf page table lock in addition to other
> + * rmap/mm/vma locks. So this is just a re-walk to dump page table
> + * content while any concurrent modifications should be completely
> + * prevented.
> + */
> + pgdp = pgd_offset(mm, addr);
> + pgd = pgdp_get(pgdp);
> + pgdv = pgd_val(pgd);
Apparently there is something weird here on arm-bcm2835_defconfig:
All errors (new ones prefixed by >>):
>> mm/memory.c:525:6: error: array type 'pgd_t' (aka 'unsigned int[2]')
is not assignable
525 | pgd = pgdp_get(pgdp);
| ~~~ ^
1 error generated.
... apparently because we have this questionable ...
arch/arm/include/asm/pgtable-2level-types.h:typedef pmdval_t pgd_t[2];
I mean, the whole concept of pgdp_get() doesn't make too much sense if
it wants to return an array.
I don't quite understand the "#undef STRICT_MM_TYPECHECKS #ifdef
STRICT_MM_TYPECHECKS" stuff.
Why do we want to make it easier on the compiler while doing something
fairly weird?
CCing arm maintainers: what's going on here? :)
An easy fix would be to not dump the pgd value, but having a
non-functional pgdp_get() really is weird.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
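For reference, the build failure comes from arm's 2-level, !STRICT_MM_TYPECHECKS definition quoted above, where pgd_t is an array type and therefore not assignable. One possible workaround, sketched below purely as an assumption (not the fix that was merged), is to avoid the pgd_t local entirely and read the raw value through pgd_val(), which works for both the struct-wrapped and the array typedef:

static unsigned long long dump_pgd_value(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgdp = pgd_offset(mm, addr);

        /* No pgd_t local: dereference and extract the raw value directly. */
        return (unsigned long long)pgd_val(*pgdp);
}

This bypasses pgdp_get()'s READ_ONCE-style access, which should only be tolerable here because, as the quoted comment notes, the caller already holds locks that prevent concurrent modification.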
* Re: [PATCH v1 0/9] mm: vm_normal_page*() improvements
2025-07-15 23:31 ` [PATCH v1 0/9] mm: vm_normal_page*() improvements Andrew Morton
@ 2025-07-16 8:47 ` David Hildenbrand
2025-07-16 22:27 ` Andrew Morton
0 siblings, 1 reply; 20+ messages in thread
From: David Hildenbrand @ 2025-07-16 8:47 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
On 16.07.25 01:31, Andrew Morton wrote:
> On Tue, 15 Jul 2025 15:23:41 +0200 David Hildenbrand <david@redhat.com> wrote:
>
>> Based on mm/mm-new. I dropped the CoW PFNMAP changes for now, still
>> working on a better way to sort all that out cleanly.
>>
>> Cleanup and unify vm_normal_page_*() handling, also marking the
>> huge zerofolio as special in the PMD. Add+use vm_normal_page_pud() and
>> cleanup that XEN vm_ops->find_special_page thingy.
>>
>> There are plans of using vm_normal_page_*() more widely soon.
>>
>> Briefly tested on UML (making sure vm_normal_page() still works as expected
>> without pte_special() support) and on x86-64 with a bunch of tests.
>
> When I was but a wee little bairn, my mother would always tell me
> "never merge briefly tested patches when you're at -rc6". But three
> weeks in -next should shake things out.
;) There is one arm oddity around pgdp_get(), reported by a bot on my
github branch, that still needs figuring out, so no need to rush.
Let's see how fast that can be resolved.
>
> However the series rejects due to the is_huge_zero_pmd ->
> is_huge_zero_pfn changes in Luiz's "mm: introduce snapshot_page() v3"
> series, so could we please have a redo against present mm-new?
I'm confused: mm-new *still* contains the patch from Luiz's series that
was originally part of the RFC here.
commit 791cb64cd7f8c2314c65d1dd5cb9e05e51c4cd70
Author: David Hildenbrand <david@redhat.com>
Date: Mon Jul 14 09:16:51 2025 -0400
mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd()
If you want to put this series here before Luiz', you'll have to move that
single patch as well.
But probably this series should be done on top of Luiz's work, because Luiz
fixes something.
[that patch was part of the RFC series, but Luiz picked it up for his work, so I dropped it
from this series and based it on top of current mm-new]
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v1 0/9] mm: vm_normal_page*() improvements
2025-07-16 8:47 ` David Hildenbrand
@ 2025-07-16 22:27 ` Andrew Morton
2025-07-17 7:35 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Andrew Morton @ 2025-07-16 22:27 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
On Wed, 16 Jul 2025 10:47:29 +0200 David Hildenbrand <david@redhat.com> wrote:
> >
> > However the series rejects due to the is_huge_zero_pmd ->
> > is_huge_zero_pfn changes in Luiz's "mm: introduce snapshot_page() v3"
> > series, so could we please have a redo against present mm-new?
>
> I'm confused: mm-new *still* contains the patch from Luiz series that
> was originally part of the RFC here.
>
> commit 791cb64cd7f8c2314c65d1dd5cb9e05e51c4cd70
> Author: David Hildenbrand <david@redhat.com>
> Date: Mon Jul 14 09:16:51 2025 -0400
>
> mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd()
>
> If you want to put this series here before Luiz', you'll have to move that
> single patch as well.
>
> But probably this series should be done on top of Luiz work, because Luiz
> fixes something.
I'm confused by your confusion. mm-new presently contains Luiz's latest
v3 series "mm: introduce snapshot_page()" which includes a copy of your
"mm/memory: introduce is_huge_zero_pfn() and use it in
vm_normal_page_pmd()".
> [that patch was part of the RFC series, but Luiz picked it up for his work, so I dropped it
> from this series and based it on top of current mm-new]
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v1 0/9] mm: vm_normal_page*() improvements
2025-07-16 22:27 ` Andrew Morton
@ 2025-07-17 7:35 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-07-17 7:35 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
Oscar Salvador, Lance Yang
On 17.07.25 00:27, Andrew Morton wrote:
> On Wed, 16 Jul 2025 10:47:29 +0200 David Hildenbrand <david@redhat.com> wrote:
>
>>>
>>> However the series rejects due to the is_huge_zero_pmd ->
>>> is_huge_zero_pfn changes in Luiz's "mm: introduce snapshot_page() v3"
>>> series, so could we please have a redo against present mm-new?
>>
>> I'm confused: mm-new *still* contains the patch from Luiz series that
>> was originally part of the RFC here.
>>
>> commit 791cb64cd7f8c2314c65d1dd5cb9e05e51c4cd70
>> Author: David Hildenbrand <david@redhat.com>
>> Date: Mon Jul 14 09:16:51 2025 -0400
>>
>> mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd()
>>
>> If you want to put this series here before Luiz', you'll have to move that
>> single patch as well.
>>
>> But probably this series should be done on top of Luiz work, because Luiz
>> fixes something.
>
> I'm confused at your confused. mm-new presently contains Luiz's latest
> v3 series "mm: introduce snapshot_page()" which includes a copy of your
> "mm/memory: introduce is_huge_zero_pfn() and use it in
> vm_normal_page_pmd()".
Let's recap: you said "the series rejects due to the is_huge_zero_pmd ->
is_huge_zero_pfn changes in Luiz's "mm: introduce snapshot_page() v3"
series"
$ git checkout mm/mm-new -b tmp
branch 'tmp' set up to track 'mm/mm-new'.
Switched to a new branch 'tmp'
$ b4 shazam 20250715132350.2448901-1-david@redhat.com
Grabbing thread from lore.kernel.org/all/20250715132350.2448901-1-david@redhat.com/t.mbox.gz
Checking for newer revisions
Grabbing search results from lore.kernel.org
Analyzing 17 messages in the thread
Looking for additional code-review trailers on lore.kernel.org
Analyzing 65 code-review messages
Checking attestation on all messages, may take a moment...
---
✓ [PATCH v1 1/9] mm/huge_memory: move more common code into insert_pmd()
✓ [PATCH v1 2/9] mm/huge_memory: move more common code into insert_pud()
✓ [PATCH v1 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
✓ [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
✓ [PATCH v1 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
✓ [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
✓ [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*()
+ Reviewed-by: Oscar Salvador <osalvador@suse.de> (✓ DKIM/suse.de)
✓ [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud()
+ Reviewed-by: Oscar Salvador <osalvador@suse.de> (✓ DKIM/suse.de)
✓ [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
+ Reviewed-by: Oscar Salvador <osalvador@suse.de> (✓ DKIM/suse.de)
---
✓ Signed: DKIM/redhat.com
---
Total patches: 9
---
Base: using specified base-commit 64d19a2cdb7b62bcea83d9309d83e06d7aff4722
Applying: mm/huge_memory: move more common code into insert_pmd()
Applying: mm/huge_memory: move more common code into insert_pud()
Applying: mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
Applying: fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
Applying: mm/huge_memory: mark PMD mappings of the huge zero folio special
Applying: mm/memory: convert print_bad_pte() to print_bad_page_map()
Applying: mm/memory: factor out common code from vm_normal_page_*()
Applying: mm: introduce and use vm_normal_page_pud()
Applying: mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
$ make mm/memory.o
...
CC mm/memory.o
I know that a tree from yesterday temporarily didn't have Luiz's patches, so
maybe that's what you ran into.
*anyhow*, I will resend to work around that arm pgdp_get() issue.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
2025-07-15 13:23 ` [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-07-17 8:38 ` Alistair Popple
2025-07-17 8:39 ` David Hildenbrand
0 siblings, 1 reply; 20+ messages in thread
From: Alistair Popple @ 2025-07-17 8:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
On Tue, Jul 15, 2025 at 03:23:45PM +0200, David Hildenbrand wrote:
> Let's convert to vmf_insert_folio_pmd().
>
> There is a theoretical change in behavior: in the unlikely case there is
> already something mapped, we'll now still call trace_dax_pmd_load_hole()
> and return VM_FAULT_NOPAGE.
>
> Previously, we would have returned VM_FAULT_FALLBACK, and the caller
> would have zapped the PMD to try a PTE fault.
>
> However, that behavior was different to other PTE+PMD faults, when there
> would already be something mapped, and it's not even clear if it could
> be triggered.
>
> Assuming the huge zero folio is already mapped, all good, no need to
> fallback to PTEs.
>
> Assuming there is already a leaf page table ... the behavior would be
> just like when trying to insert a PMD mapping a folio through
> dax_fault_iter()->vmf_insert_folio_pmd().
>
> Assuming there is already something else mapped as PMD? It sounds like
> a BUG, and the behavior would be just like when trying to insert a PMD
> mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
>
> So, it sounds reasonable to not handle huge zero folios differently
> to inserting PMDs mapping folios when there already is something mapped.
Yeah, this all sounds reasonable and I was never able to hit this path with the
RFC version of this series anyway. So I suspect it really is impossible to hit
and therefore any change is theoretical.
Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
> fs/dax.c | 47 ++++++++++-------------------------------------
> 1 file changed, 10 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 4229513806bea..ae90706674a3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> const struct iomap_iter *iter, void **entry)
> {
> struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> - unsigned long pmd_addr = vmf->address & PMD_MASK;
> - struct vm_area_struct *vma = vmf->vma;
> struct inode *inode = mapping->host;
> - pgtable_t pgtable = NULL;
> struct folio *zero_folio;
> - spinlock_t *ptl;
> - pmd_t pmd_entry;
> - unsigned long pfn;
> + vm_fault_t ret;
>
> zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
>
> - if (unlikely(!zero_folio))
> - goto fallback;
> -
> - pfn = page_to_pfn(&zero_folio->page);
> - *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
> - DAX_PMD | DAX_ZERO_PAGE);
> -
> - if (arch_needs_pgtable_deposit()) {
> - pgtable = pte_alloc_one(vma->vm_mm);
> - if (!pgtable)
> - return VM_FAULT_OOM;
> - }
> -
> - ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> - if (!pmd_none(*(vmf->pmd))) {
> - spin_unlock(ptl);
> - goto fallback;
> + if (unlikely(!zero_folio)) {
> + trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> + return VM_FAULT_FALLBACK;
> }
>
> - if (pgtable) {
> - pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> - mm_inc_nr_ptes(vma->vm_mm);
> - }
> - pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
> - set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> - spin_unlock(ptl);
> - trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> - return VM_FAULT_NOPAGE;
> + *entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
> + DAX_PMD | DAX_ZERO_PAGE);
>
> -fallback:
> - if (pgtable)
> - pte_free(vma->vm_mm, pgtable);
> - trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> - return VM_FAULT_FALLBACK;
> + ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
> + if (ret == VM_FAULT_NOPAGE)
> + trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> + return ret;
> }
> #else
> static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> --
> 2.50.1
>
>
^ permalink raw reply [flat|nested] 20+ messages in thread
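To make the fallback semantics above concrete, here is a minimal sketch of the caller-side pattern (hypothetical helper names, not the actual fs/dax.c control flow): a PMD fault that returns VM_FAULT_FALLBACK causes the fault to be retried at PTE granularity, whereas VM_FAULT_NOPAGE means the PMD mapping is in place and nothing further is needed.

static vm_fault_t handle_dax_fault(struct vm_fault *vmf, bool try_pmd)
{
        vm_fault_t ret;

        if (try_pmd) {
                ret = dax_pmd_fault(vmf);       /* hypothetical PMD-level handler */
                if (ret != VM_FAULT_FALLBACK)
                        return ret;
                /* Fall back: retry the same fault with a PTE mapping. */
        }
        return dax_pte_fault(vmf);              /* hypothetical PTE-level handler */
}

With the change above, an already-populated PMD no longer forces this fallback path for the huge zero folio; it is handled like any other vmf_insert_folio_pmd() caller hitting a populated entry.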
* Re: [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
2025-07-17 8:38 ` Alistair Popple
@ 2025-07-17 8:39 ` David Hildenbrand
0 siblings, 0 replies; 20+ messages in thread
From: David Hildenbrand @ 2025-07-17 8:39 UTC (permalink / raw)
To: Alistair Popple
Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
Andrew Morton, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
Alexander Viro, Christian Brauner, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
Hugh Dickins, Oscar Salvador, Lance Yang
On 17.07.25 10:38, Alistair Popple wrote:
> On Tue, Jul 15, 2025 at 03:23:45PM +0200, David Hildenbrand wrote:
>> Let's convert to vmf_insert_folio_pmd().
>>
>> There is a theoretical change in behavior: in the unlikely case there is
>> already something mapped, we'll now still call trace_dax_pmd_load_hole()
>> and return VM_FAULT_NOPAGE.
>>
>> Previously, we would have returned VM_FAULT_FALLBACK, and the caller
>> would have zapped the PMD to try a PTE fault.
>>
>> However, that behavior was different to other PTE+PMD faults, when there
>> would already be something mapped, and it's not even clear if it could
>> be triggered.
>>
>> Assuming the huge zero folio is already mapped, all good, no need to
>> fallback to PTEs.
>>
>> Assuming there is already a leaf page table ... the behavior would be
>> just like when trying to insert a PMD mapping a folio through
>> dax_fault_iter()->vmf_insert_folio_pmd().
>>
>> Assuming there is already something else mapped as PMD? It sounds like
>> a BUG, and the behavior would be just like when trying to insert a PMD
>> mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
>>
>> So, it sounds reasonable to not handle huge zero folios differently
>> to inserting PMDs mapping folios when there already is something mapped.
>
> Yeah, this all sounds reasonable and I was never able to hit this path with the
> RFC version of this series anyway. So I suspect it really is impossible to hit
> and therefore any change is theoretical.
Thanks for the review and test, Alistair!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 20+ messages in thread
Thread overview: 20+ messages (newest: 2025-07-17  8:39 UTC)
2025-07-15 13:23 [PATCH v1 0/9] mm: vm_normal_page*() improvements David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-07-17 8:38 ` Alistair Popple
2025-07-17 8:39 ` David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-07-16 8:40 ` David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-07-16 8:15 ` Oscar Salvador
2025-07-15 13:23 ` [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-07-16 8:20 ` Oscar Salvador
2025-07-15 13:23 ` [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-07-16 8:22 ` Oscar Salvador
2025-07-15 23:31 ` [PATCH v1 0/9] mm: vm_normal_page*() improvements Andrew Morton
2025-07-16 8:47 ` David Hildenbrand
2025-07-16 22:27 ` Andrew Morton
2025-07-17 7:35 ` David Hildenbrand