* [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
@ 2025-06-17 15:43 David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
                   ` (15 more replies)
  0 siblings, 16 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

RFC because it's based on mm-new where some things might still change
around the devmap removal stuff.

While removing support for CoW PFNMAPs is a noble goal, I am not even sure
if we can remove said support for, e.g., /dev/mem that easily.

In the end, CoW PFNMAPs are pretty simple: everything is "special" except
CoW'ed anon folios, which are "normal".

The only complication is how to identify such pages without pte_special();
with pte_special(), it's easy.
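
To illustrate: with pte_special(), the whole check essentially boils down
to the following sketch (sketch only; it ignores the
vm_ops->find_special_page() special case and the bad-pte reporting that
the real vm_normal_page() does):

	/*
	 * Sketch, assuming CONFIG_ARCH_HAS_PTE_SPECIAL: anything marked
	 * special (zeropage, PFN mappings, ...) is not "normal".
	 */
	static struct page *vm_normal_page_sketch(pte_t pte)
	{
		if (unlikely(pte_special(pte)))
			return NULL;
		return pfn_to_page(pte_pfn(pte));
	}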

Well, of course, one day all architectures might support pte_special() ...
either because we added support for pte_special() or removed support for
... these architectures from Linux.

No need to wait for that day. Let's do some cleanups around
vm_normal_page()/vm_normal_page_pmd() and handling of the huge zero folio,
and remove the "horrible special case to handle copy-on-write behaviour"
that does questionable things to the VMA (its vm_pgoff) in
remap_pfn_range(), simply by

... looking for anonymous folios in CoW PFNMAPs to identify the CoW'ed
"normal" pages? I know, sounds crazy ;)

Of course, we add sanity checks so that nobody dares install PFN mappings
of anonymous folios in these configs, which could otherwise trick us into
assuming that they are there due to CoW.

We only have to do that on architectures without pte_special() support;
for architectures that support pte_special(), nothing changes.
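
To illustrate the idea, a rough sketch of the lookup (the real helper,
added in patch #11, is vm_pfnmap_normal_page_pfn(); the name below is
just for illustration):

	/*
	 * Sketch only: in a CoW PFNMAP without pte_special(), a pfn maps
	 * a "normal" page exactly if it resolves to an anon folio --
	 * which can only have ended up here via CoW.
	 */
	static struct page *cow_pfnmap_normal_page(struct vm_area_struct *vma,
			unsigned long pfn)
	{
		struct folio *folio;
		struct page *page;

		if (!is_cow_mapping(vma->vm_flags))
			return NULL;
		if (!pfn_valid(pfn))
			return NULL;

		page = pfn_to_page(pfn);
		folio = page_folio(page);
		/* Anything but a (non-slab) anon folio is "special". */
		if (folio_test_slab(folio) || !folio_test_anon(folio))
			return NULL;
		return page;
	}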

In the process, add vm_normal_page_pud(), which is easy after the
cleanups, and improve our documentation. Well, and clarify that
"vm_ops->find_special_page" XEN thingy.

Briefly tested on UML (making sure vm_normal_page() still works as expected
without pte_special() support) and on x86-64 with a bunch of tests
(including ndctl, and cow.c which tests the huge zero folio).

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>

David Hildenbrand (14):
  mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  mm: drop highest_memmap_pfn
  mm: compare pfns only if the entry is present when inserting
    pfns/pages
  mm/huge_memory: move more common code into insert_pmd()
  mm/huge_memory: move more common code into insert_pud()
  mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  mm/huge_memory: mark PMD mappings of the huge zero folio special
  mm/memory: introduce is_huge_zero_pfn() and use it in
    vm_normal_page_pmd()
  mm/memory: factor out common code from vm_normal_page_*()
  mm: remove "horrible special case to handle copy-on-write behaviour"
  mm: drop addr parameter from vm_normal_*_pmd()
  mm: introduce and use vm_normal_page_pud()
  mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()

 drivers/xen/Kconfig              |   1 +
 drivers/xen/gntdev.c             |   5 +-
 fs/dax.c                         |  47 ++----
 fs/proc/task_mmu.c               |   6 +-
 include/linux/huge_mm.h          |  12 +-
 include/linux/mm.h               |  41 +++--
 mm/Kconfig                       |   2 +
 mm/huge_memory.c                 | 135 +++++++---------
 mm/internal.h                    |   2 -
 mm/memory.c                      | 261 ++++++++++++++++---------------
 mm/mm_init.c                     |   3 -
 mm/nommu.c                       |   1 -
 mm/pagewalk.c                    |  22 +--
 tools/testing/vma/vma_internal.h |  18 ++-
 14 files changed, 281 insertions(+), 275 deletions(-)


base-commit: 2f24ee10fe6d1959d674a4298d63b66f54508a68
-- 
2.49.0



* [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-20 12:50   ` Oscar Salvador
                     ` (2 more replies)
  2025-06-17 15:43 ` [PATCH RFC 02/14] mm: drop highest_memmap_pfn David Hildenbrand
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
readily available.

Nowadays, this is the last remaining highest_memmap_pfn user, and this
sanity check is not really triggering ... frequently.

Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
simplify and get rid of highest_memmap_pfn. Checking for
pfn_to_online_page() might be even better, but it would not handle
ZONE_DEVICE properly.

Do the same in vm_normal_page_pmd(), where we don't even report a
problem at all ...

What might be better is a runtime option, similar to page-table-check,
to enable such checks dynamically on demand. Something for the future.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 15 +++------------
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0163d127cece9..188b84ebf479a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -590,7 +590,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
 	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
 		if (likely(!pte_special(pte)))
-			goto check_pfn;
+			goto out;
 		if (vma->vm_ops && vma->vm_ops->find_special_page)
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -608,9 +608,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (vma->vm_flags & VM_MIXEDMAP) {
 			if (!pfn_valid(pfn))
 				return NULL;
-			if (is_zero_pfn(pfn))
-				return NULL;
-			goto out;
 		} else {
 			unsigned long off;
 			off = (addr - vma->vm_start) >> PAGE_SHIFT;
@@ -624,17 +621,12 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	if (is_zero_pfn(pfn))
 		return NULL;
 
-check_pfn:
-	if (unlikely(pfn > highest_memmap_pfn)) {
-		print_bad_pte(vma, addr, pte, NULL);
-		return NULL;
-	}
-
 	/*
 	 * NOTE! We still have PageReserved() pages in the page tables.
 	 * eg. VDSO mappings can cause them to exist.
 	 */
 out:
+	VM_WARN_ON_ONCE(!pfn_valid(pfn));
 	VM_WARN_ON_ONCE(is_zero_pfn(pfn));
 	return pfn_to_page(pfn);
 }
@@ -676,14 +668,13 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	if (is_huge_zero_pmd(pmd))
 		return NULL;
-	if (unlikely(pfn > highest_memmap_pfn))
-		return NULL;
 
 	/*
 	 * NOTE! We still have PageReserved() pages in the page tables.
 	 * eg. VDSO mappings can cause them to exist.
 	 */
 out:
+	VM_WARN_ON_ONCE(!pfn_valid(pfn));
 	return pfn_to_page(pfn);
 }
 
-- 
2.49.0



* [PATCH RFC 02/14] mm: drop highest_memmap_pfn
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-20 13:04   ` Oscar Salvador
  2025-06-20 18:11   ` Pedro Falcato
  2025-06-17 15:43 ` [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages David Hildenbrand
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Now unused, so let's drop it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/internal.h | 2 --
 mm/memory.c   | 2 --
 mm/mm_init.c  | 3 ---
 mm/nommu.c    | 1 -
 4 files changed, 8 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index f519eb7217c26..703871905fd6d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -503,8 +503,6 @@ static inline bool folio_needs_release(struct folio *folio)
 		(mapping && mapping_release_always(mapping));
 }
 
-extern unsigned long highest_memmap_pfn;
-
 /*
  * Maximum number of reclaim retries without progress before the OOM
  * killer is consider the only way forward.
diff --git a/mm/memory.c b/mm/memory.c
index 188b84ebf479a..a1b5575db52ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -146,8 +146,6 @@ __setup("norandmaps", disable_randmaps);
 unsigned long zero_pfn __read_mostly;
 EXPORT_SYMBOL(zero_pfn);
 
-unsigned long highest_memmap_pfn __read_mostly;
-
 /*
  * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
  */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 5c21b3af216b2..1dac66c209984 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -881,9 +881,6 @@ void __meminit memmap_init_range(unsigned long size, int nid, unsigned long zone
 	unsigned long pfn, end_pfn = start_pfn + size;
 	struct page *page;
 
-	if (highest_memmap_pfn < end_pfn - 1)
-		highest_memmap_pfn = end_pfn - 1;
-
 #ifdef CONFIG_ZONE_DEVICE
 	/*
 	 * Honor reservation requested by the driver for this ZONE_DEVICE
diff --git a/mm/nommu.c b/mm/nommu.c
index 38c22ea0a95c6..cd9ddbfe1af80 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -42,7 +42,6 @@
 #include <asm/mmu_context.h>
 #include "internal.h"
 
-unsigned long highest_memmap_pfn;
 int heap_stack_gap = 0;
 
 atomic_long_t mmap_pages_allocated;
-- 
2.49.0



* [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 02/14] mm: drop highest_memmap_pfn David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-20 13:27   ` Oscar Salvador
  2025-06-20 18:24   ` Pedro Falcato
  2025-06-17 15:43 ` [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
                   ` (12 subsequent siblings)
  15 siblings, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Calling pte_pfn() etc. on something that is not a present page table
entry is wrong. Let's check, in all relevant cases where we want to
upgrade write permissions when inserting pfns/pages, whether the entry
is actually present.
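
For example, a minimal sketch of the rule (hypothetical helper, just for
illustration):

	/*
	 * Only a present entry has a meaningful pfn; for swap/migration
	 * entries and the like, pte_pfn() returns garbage.
	 */
	static bool pte_maps_pfn(pte_t pte, unsigned long pfn)
	{
		return pte_present(pte) && pte_pfn(pte) == pfn;
	}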

It's not expected to have caused real harm in practice, so this is more a
cleanup than a fix for something that would likely trigger in some
weird circumstances.

At some point, we should likely unify the two pte handling paths,
similar to how we did it for pmds/puds.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 4 ++--
 mm/memory.c      | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8e0e3cfd9f223..e52360df87d15 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1392,7 +1392,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
 
-		if (write) {
+		if (write && pmd_present(*pmd)) {
 			if (pmd_pfn(*pmd) != pfn) {
 				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
 				return -EEXIST;
@@ -1541,7 +1541,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
 
-		if (write) {
+		if (write && pud_present(*pud)) {
 			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
 				return;
 			entry = pud_mkyoung(*pud);
diff --git a/mm/memory.c b/mm/memory.c
index a1b5575db52ac..9a1acd057ce59 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2137,7 +2137,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
 	pte_t pteval = ptep_get(pte);
 
 	if (!pte_none(pteval)) {
-		if (!mkwrite)
+		if (!mkwrite || !pte_present(pteval))
 			return -EBUSY;
 
 		/* see insert_pfn(). */
@@ -2434,7 +2434,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 		return VM_FAULT_OOM;
 	entry = ptep_get(pte);
 	if (!pte_none(entry)) {
-		if (mkwrite) {
+		if (mkwrite && pte_present(entry)) {
 			/*
 			 * For read faults on private mappings the PFN passed
 			 * in may not match the PFN we have mapped if the
-- 
2.49.0



* [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-20 14:12   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's clean this up further, by moving the VMA range check, the page
table deposit allocation and the PMD lock acquisition from the callers
into insert_pmd(), and by having insert_pmd() return a vm_fault_t
directly.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
 1 file changed, 24 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e52360df87d15..a85e0cd455109 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1379,15 +1379,25 @@ struct folio_or_pfn {
 	bool is_folio;
 };
 
-static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
-		bool write, pgtable_t pgtable)
+		bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable = NULL;
+	spinlock_t *ptl;
 	pmd_t entry;
 
-	lockdep_assert_held(pmd_lockptr(mm, pmd));
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
 
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable)
+			return VM_FAULT_OOM;
+	}
+
+	ptl = pmd_lock(mm, pmd);
 	if (!pmd_none(*pmd)) {
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
@@ -1395,15 +1405,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		if (write && pmd_present(*pmd)) {
 			if (pmd_pfn(*pmd) != pfn) {
 				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-				return -EEXIST;
+				goto out_unlock;
 			}
 			entry = pmd_mkyoung(*pmd);
 			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
 				update_mmu_cache_pmd(vma, addr, pmd);
 		}
-
-		return -EEXIST;
+		goto out_unlock;
 	}
 
 	if (fop.is_folio) {
@@ -1424,11 +1433,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (pgtable) {
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
+		pgtable = NULL;
 	}
 
 	set_pmd_at(mm, addr, pmd, entry);
 	update_mmu_cache_pmd(vma, addr, pmd);
-	return 0;
+
+out_unlock:
+	spin_unlock(ptl);
+	if (pgtable)
+		pte_free(mm, pgtable);
+	return VM_FAULT_NOPAGE;
 }
 
 /**
@@ -1450,9 +1465,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
 	struct folio_or_pfn fop = {
 		.pfn = pfn,
 	};
-	pgtable_t pgtable = NULL;
-	spinlock_t *ptl;
-	int error;
 
 	/*
 	 * If we had pmd_special, we could avoid all these restrictions,
@@ -1464,25 +1476,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
-	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
-			   pgtable);
-	spin_unlock(ptl);
-	if (error && pgtable)
-		pte_free(vma->vm_mm, pgtable);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
@@ -1491,35 +1487,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address & PMD_MASK;
-	struct mm_struct *mm = vma->vm_mm;
 	struct folio_or_pfn fop = {
 		.folio = folio,
 		.is_folio = true,
 	};
-	spinlock_t *ptl;
-	pgtable_t pgtable = NULL;
-	int error;
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
 
 	if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
 		return VM_FAULT_SIGBUS;
 
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
-	ptl = pmd_lock(mm, vmf->pmd);
-	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
-			   write, pgtable);
-	spin_unlock(ptl);
-	if (error && pgtable)
-		pte_free(mm, pgtable);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
 
-- 
2.49.0



* [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-20 14:15   ` Oscar Salvador
  2025-07-07  2:51   ` Alistair Popple
  2025-06-17 15:43 ` [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
                   ` (10 subsequent siblings)
  15 siblings, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's clean this up further, by moving the VMA range check and the PUD
lock acquisition from the callers into insert_pud(), and by having
insert_pud() return a vm_fault_t directly.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 36 +++++++++++++-----------------------
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a85e0cd455109..1ea23900b5adb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1507,25 +1507,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 	return pud;
 }
 
-static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
 	pud_t entry;
 
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	ptl = pud_lock(mm, pud);
 	if (!pud_none(*pud)) {
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
 
 		if (write && pud_present(*pud)) {
 			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
-				return;
+				goto out_unlock;
 			entry = pud_mkyoung(*pud);
 			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
 			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
 				update_mmu_cache_pud(vma, addr, pud);
 		}
-		return;
+		goto out_unlock;
 	}
 
 	if (fop.is_folio) {
@@ -1544,6 +1549,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
 	}
 	set_pud_at(mm, addr, pud, entry);
 	update_mmu_cache_pud(vma, addr, pud);
+out_unlock:
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
 }
 
 /**
@@ -1565,7 +1573,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
 	struct folio_or_pfn fop = {
 		.pfn = pfn,
 	};
-	spinlock_t *ptl;
 
 	/*
 	 * If we had pud_special, we could avoid all these restrictions,
@@ -1577,16 +1584,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
-	ptl = pud_lock(vma->vm_mm, vmf->pud);
-	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
-	spin_unlock(ptl);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
 
@@ -1603,25 +1603,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address & PUD_MASK;
-	pud_t *pud = vmf->pud;
-	struct mm_struct *mm = vma->vm_mm;
 	struct folio_or_pfn fop = {
 		.folio = folio,
 		.is_folio = true,
 	};
-	spinlock_t *ptl;
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
 
 	if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
 		return VM_FAULT_SIGBUS;
 
-	ptl = pud_lock(mm, pud);
-	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
-	spin_unlock(ptl);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-- 
2.49.0



* [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  8:15   ` Oscar Salvador
  2025-06-25  8:20   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Just like we do for vmf_insert_page_mkwrite() -> ... ->
insert_page_into_pte_locked(), support the huge zero folio.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1ea23900b5adb..92400f3baa9ff 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1418,9 +1418,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		folio_get(fop.folio);
-		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
-		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+		if (!is_huge_zero_folio(fop.folio)) {
+			folio_get(fop.folio);
+			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
+			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+		}
 	} else {
 		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
 		entry = pmd_mkspecial(entry);
-- 
2.49.0



* [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-24  1:16   ` Alistair Popple
  2025-06-17 15:43 ` [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's convert to vmf_insert_folio_pmd().

In the unlikely case there is already something mapped, we'll now still
call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE.

That should probably be fine; no need to add special cases for that.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/dax.c | 47 ++++++++++-------------------------------------
 1 file changed, 10 insertions(+), 37 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4229513806bea..ae90706674a3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		const struct iomap_iter *iter, void **entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	unsigned long pmd_addr = vmf->address & PMD_MASK;
-	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = mapping->host;
-	pgtable_t pgtable = NULL;
 	struct folio *zero_folio;
-	spinlock_t *ptl;
-	pmd_t pmd_entry;
-	unsigned long pfn;
+	vm_fault_t ret;
 
 	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
 
-	if (unlikely(!zero_folio))
-		goto fallback;
-
-	pfn = page_to_pfn(&zero_folio->page);
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
-				  DAX_PMD | DAX_ZERO_PAGE);
-
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
-	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
-	if (!pmd_none(*(vmf->pmd))) {
-		spin_unlock(ptl);
-		goto fallback;
+	if (unlikely(!zero_folio)) {
+		trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
+		return VM_FAULT_FALLBACK;
 	}
 
-	if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
-		mm_inc_nr_ptes(vma->vm_mm);
-	}
-	pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
-	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
-	spin_unlock(ptl);
-	trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
-	return VM_FAULT_NOPAGE;
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
+				  DAX_PMD | DAX_ZERO_PAGE);
 
-fallback:
-	if (pgtable)
-		pte_free(vma->vm_mm, pgtable);
-	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
-	return VM_FAULT_FALLBACK;
+	ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
+	if (ret == VM_FAULT_NOPAGE)
+		trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
+	return ret;
 }
 #else
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
-- 
2.49.0



* [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  8:32   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd() David Hildenbrand
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

The huge zero folio is refcounted (+mapcounted -- is that a word?)
differently from "normal" folios, similar to (but different from) the
ordinary shared zeropage.

For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to
still use them (e.g., GUP can still take a reference on them).

vm_normal_page_pmd() already filters out the huge zero folio. However,
so far we are not marking it as special like we do with the ordinary
shared zeropage. Let's mark it as special, so we can further refactor
vm_normal_page_pmd() and vm_normal_page().

While at it, update the doc regarding the shared zero folios.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      | 13 +++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92400f3baa9ff..8f03cd4e40397 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1309,6 +1309,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 {
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+	entry = pmd_mkspecial(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	mm_inc_nr_ptes(mm);
@@ -1418,7 +1419,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		if (!is_huge_zero_folio(fop.folio)) {
+		if (is_huge_zero_folio(fop.folio)) {
+			entry = pmd_mkspecial(entry);
+		} else {
 			folio_get(fop.folio);
 			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
 			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
diff --git a/mm/memory.c b/mm/memory.c
index 9a1acd057ce59..ef277dab69e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -541,7 +541,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted. However, selected page table walkers
+ * (such as GUP) can still identify these mappings and work with the
+ * underlying "struct page".
  *
  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
  * pte bit, in which case this function is trivial. Secondly, an architecture
@@ -571,9 +577,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
  *
  * The disadvantage is that pages are refcounted (which can be slower and
  * simply not an option for some PFNMAP users). The advantage is that we
-- 
2.49.0



* [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  8:37   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
                   ` (6 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's avoid working with the PMD when not required. If
vm_normal_page_pmd() were called on something that is not a present
pmd, it would already be a bug (the pfn would possibly be garbage).

While at it, let's support passing in any pfn covered by the huge zero
folio by masking off PFN bits -- which should be rather cheap.
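
For example, with the usual HPAGE_PMD_NR of 512 (e.g., on x86-64), the
mask just clears the low 9 PFN bits: if huge_zero_pfn is 0x1200, any
pfn in [0x1200, 0x13ff] satisfies

	(pfn & ~(HPAGE_PMD_NR - 1)) == huge_zero_pfn

given that the huge zero folio's pfn is naturally PMD-aligned.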

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/huge_mm.h | 12 +++++++++++-
 mm/memory.c             |  2 +-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 35e34e6a98a27..b260f9a1fd3f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -483,9 +483,14 @@ static inline bool is_huge_zero_folio(const struct folio *folio)
 	return READ_ONCE(huge_zero_folio) == folio;
 }
 
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+	return READ_ONCE(huge_zero_pfn) == (pfn & ~(HPAGE_PMD_NR - 1));
+}
+
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
-	return pmd_present(pmd) && READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
+	return pmd_present(pmd) && is_huge_zero_pfn(pmd_pfn(pmd));
 }
 
 struct folio *mm_get_huge_zero_folio(struct mm_struct *mm);
@@ -633,6 +638,11 @@ static inline bool is_huge_zero_folio(const struct folio *folio)
 	return false;
 }
 
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+	return false;
+}
+
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
 	return false;
diff --git a/mm/memory.c b/mm/memory.c
index ef277dab69e33..b6c069f4ad11f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -669,7 +669,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		}
 	}
 
-	if (is_huge_zero_pmd(pmd))
+	if (is_huge_zero_pfn(pfn))
 		return NULL;
 
 	/*
-- 
2.49.0



* [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (8 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  8:53   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour" David Hildenbrand
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's reduce the code duplication and factor out the non-pte/pmd related
magic into vm_normal_page_pfn().

To keep it simpler, check the pfn against both zero folios. We could
optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
case, it's not a compelling micro-optimization.

With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
really.

It's a good question whether we can even hit the
!CONFIG_ARCH_HAS_PTE_SPECIAL scenario in the PMD case in practice: but
it doesn't really matter, as it's now all unified in
vm_normal_page_pfn().

While at it, add a check that pmd_special() is really only set where we
would expect it.

No functional change intended.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 104 +++++++++++++++++++++++-----------------------------
 1 file changed, 46 insertions(+), 58 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6c069f4ad11f..3d3fa01cd217e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -536,6 +536,46 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
+/* Called only if the page table entry is not marked special. */
+static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn)
+{
+	/*
+	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
+	 * (incl. shared zero folios) are marked accordingly.
+	 */
+	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL))
+		goto normal_page;
+
+	if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
+		if (vma->vm_flags & VM_MIXEDMAP) {
+			/* If it has a "struct page", it's "normal". */
+			if (!pfn_valid(pfn))
+				return NULL;
+		} else {
+			unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
+
+			/* Only CoW'ed anon folios are "normal". */
+			if (pfn == vma->vm_pgoff + off)
+				return NULL;
+			if (!is_cow_mapping(vma->vm_flags))
+				return NULL;
+		}
+	}
+
+	if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
+		return NULL;
+
+normal_page:
+	/*
+	 * NOTE! We still have PageReserved() pages in the page tables.
+	 * For example, VDSO mappings can cause them to exist.
+	 */
+	VM_WARN_ON_ONCE(!pfn_valid(pfn));
+	VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
+	return pfn_to_page(pfn);
+}
+
 /*
  * vm_normal_page -- This function gets the "struct page" associated with a pte.
  *
@@ -591,9 +631,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pte_pfn(pte);
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
-		if (likely(!pte_special(pte)))
-			goto out;
+	if (unlikely(pte_special(pte))) {
 		if (vma->vm_ops && vma->vm_ops->find_special_page)
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -604,34 +642,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
-
-	/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
-
-	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-		if (vma->vm_flags & VM_MIXEDMAP) {
-			if (!pfn_valid(pfn))
-				return NULL;
-		} else {
-			unsigned long off;
-			off = (addr - vma->vm_start) >> PAGE_SHIFT;
-			if (pfn == vma->vm_pgoff + off)
-				return NULL;
-			if (!is_cow_mapping(vma->vm_flags))
-				return NULL;
-		}
-	}
-
-	if (is_zero_pfn(pfn))
-		return NULL;
-
-	/*
-	 * NOTE! We still have PageReserved() pages in the page tables.
-	 * eg. VDSO mappings can cause them to exist.
-	 */
-out:
-	VM_WARN_ON_ONCE(!pfn_valid(pfn));
-	VM_WARN_ON_ONCE(is_zero_pfn(pfn));
-	return pfn_to_page(pfn);
+	return vm_normal_page_pfn(vma, addr, pfn);
 }
 
 struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
@@ -650,35 +661,12 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	/* Currently it's only used for huge pfnmaps */
-	if (unlikely(pmd_special(pmd)))
+	if (unlikely(pmd_special(pmd))) {
+		VM_WARN_ON_ONCE(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) &&
+				!is_huge_zero_pfn(pfn));
 		return NULL;
-
-	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-		if (vma->vm_flags & VM_MIXEDMAP) {
-			if (!pfn_valid(pfn))
-				return NULL;
-			goto out;
-		} else {
-			unsigned long off;
-			off = (addr - vma->vm_start) >> PAGE_SHIFT;
-			if (pfn == vma->vm_pgoff + off)
-				return NULL;
-			if (!is_cow_mapping(vma->vm_flags))
-				return NULL;
-		}
 	}
-
-	if (is_huge_zero_pfn(pfn))
-		return NULL;
-
-	/*
-	 * NOTE! We still have PageReserved() pages in the page tables.
-	 * eg. VDSO mappings can cause them to exist.
-	 */
-out:
-	VM_WARN_ON_ONCE(!pfn_valid(pfn));
-	return pfn_to_page(pfn);
+	return vm_normal_page_pfn(vma, addr, pfn);
 }
 
 struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
-- 
2.49.0



* [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour"
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (9 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  8:47   ` David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 12/14] mm: drop addr parameter from vm_normal_*_pmd() David Hildenbrand
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's make the kernel a bit less horrible by removing the linearity
requirement in CoW PFNMAP mappings with !CONFIG_ARCH_HAS_PTE_SPECIAL.
In particular, stop messing with vma->vm_pgoff in weird ways.
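
(Recall the old trick: remap_pfn_range() would set vma->vm_pgoff = pfn,
so that every special mapping satisfied

	pfn == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)

and only CoW'ed anon pages would fail that linearity check.)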

Simply look up, in applicable (i.e., CoW PFNMAP) mappings, whether we
have an anon folio.

Nobody should ever try mapping anon folios using PFNs; that just screams
for other possible issues. To be sure, let's sanity-check when inserting
PFNs. Are these checks really required? Probably not, but they are a good
safety net, at least for now.

The runtime overhead should be limited: there is nothing to do for !CoW
mappings (common case), and archs that care about performance
(i.e., GUP-fast) should be supporting CONFIG_ARCH_HAS_PTE_SPECIAL
either way.

Likely the sanity checks added in mm/huge_memory.c are not required for
now, because that code is probably only wired up with
CONFIG_ARCH_HAS_PTE_SPECIAL, but this way is certainly cleaner and
more consistent -- and doesn't really cost us anything in the cases we
really care about.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  16 ++++++
 mm/huge_memory.c   |  16 +++++-
 mm/memory.c        | 118 +++++++++++++++++++++++++--------------------
 3 files changed, 96 insertions(+), 54 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307b..3f52871becd3f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2339,6 +2339,22 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct ucounts *);
 extern void user_shm_unlock(size_t, struct ucounts *);
 
+#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
+static inline struct page *vm_pfnmap_normal_page_pfn(struct vm_area_struct *vma,
+		unsigned long pfn)
+{
+	/*
+	 * We don't identify normal pages using PFNs. So if we reach
+	 * this point, it's just for sanity checks that don't apply with
+	 * pte_special() etc.
+	 */
+	return NULL;
+}
+#else
+struct page *vm_pfnmap_normal_page_pfn(struct vm_area_struct *vma,
+		unsigned long pfn);
+#endif
+
 struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8f03cd4e40397..67220c30e7818 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1479,7 +1479,13 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
-	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+
+	/*
+	 * Refuse this pfn if we could mistake it as a refcounted folio
+	 * in a CoW mapping later in vm_normal_page_pmd().
+	 */
+	if ((vma->vm_flags & VM_PFNMAP) && vm_pfnmap_normal_page_pfn(vma, pfn))
+		return VM_FAULT_SIGBUS;
 
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
@@ -1587,7 +1593,13 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
-	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+
+	/*
+	 * Refuse this pfn if we could mistake it as a refcounted folio
+	 * in a CoW mapping later in vm_normal_page_pud().
+	 */
+	if ((vma->vm_flags & VM_PFNMAP) && vm_pfnmap_normal_page_pfn(vma, pfn))
+		return VM_FAULT_SIGBUS;
 
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
diff --git a/mm/memory.c b/mm/memory.c
index 3d3fa01cd217e..ace9c59e97181 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -536,9 +536,35 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
+#ifndef CONFIG_ARCH_HAS_PTE_SPECIAL
+struct page *vm_pfnmap_normal_page_pfn(struct vm_area_struct *vma,
+		unsigned long pfn)
+{
+	struct folio *folio;
+	struct page *page;
+
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PFNMAP));
+
+	/*
+	 * If we have a CoW mapping and spot an anon folio, then it can
+	 * only be due to CoW: the page is "normal".
+	 */
+	if (likely(!is_cow_mapping(vma->vm_flags)))
+		return NULL;
+	if (likely(!pfn_valid(pfn)))
+		return NULL;
+
+	page = pfn_to_page(pfn);
+	folio = page_folio(page);
+	if (folio_test_slab(folio) || !folio_test_anon(folio))
+		return NULL;
+	return page;
+}
+#endif /* !CONFIG_ARCH_HAS_PTE_SPECIAL */
+
 /* Called only if the page table entry is not marked special. */
 static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long pfn)
+		unsigned long pfn)
 {
 	/*
 	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
@@ -553,13 +579,8 @@ static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
 			if (!pfn_valid(pfn))
 				return NULL;
 		} else {
-			unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
-
 			/* Only CoW'ed anon folios are "normal". */
-			if (pfn == vma->vm_pgoff + off)
-				return NULL;
-			if (!is_cow_mapping(vma->vm_flags))
-				return NULL;
+			return vm_pfnmap_normal_page_pfn(vma, pfn);
 		}
 	}
 
@@ -589,30 +610,19 @@ static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
  * (such as GUP) can still identify these mappings and work with the
  * underlying "struct page".
  *
- * There are 2 broad cases. Firstly, an architecture may define a pte_special()
- * pte bit, in which case this function is trivial. Secondly, an architecture
- * may not have a spare pte bit, which requires a more complicated scheme,
- * described below.
+ * An architecture may support pte_special() to distinguish "special"
+ * from "normal" mappings more efficiently, and even without the VMA at hand.
+ * For example, in order to support GUP-fast, whereby we don't have the VMA
+ * available when walking the page tables, support for pte_special() is
+ * crucial.
+ *
+ * If an architecture does not support pte_special(), this function is less
+ * trivial and more expensive in some cases.
  *
  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
  * special mapping (even if there are underlying and valid "struct pages").
  * COWed pages of a VM_PFNMAP are always normal.
  *
- * The way we recognize COWed pages within VM_PFNMAP mappings is through the
- * rules set up by "remap_pfn_range()": the vma will have the VM_PFNMAP bit
- * set, and the vm_pgoff will point to the first PFN mapped: thus every special
- * mapping will always honor the rule
- *
- *	pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT)
- *
- * And for normal mappings this is false.
- *
- * This restricts such mappings to be a linear translation from virtual address
- * to pfn. To get around this restriction, we allow arbitrary mappings so long
- * as the vma is not a COW mapping; in that case, we know that all ptes are
- * special (because none can have been COWed).
- *
- *
  * In order to support COW of arbitrary special mappings, we have VM_MIXEDMAP.
  *
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
@@ -621,10 +631,7 @@ static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
  * folios) are refcounted and considered normal pages by the VM.
  *
  * The disadvantage is that pages are refcounted (which can be slower and
- * simply not an option for some PFNMAP users). The advantage is that we
- * don't have to follow the strict linearity rule of PFNMAP mappings in
- * order to support COWable mappings.
- *
+ * simply not an option for some PFNMAP users).
  */
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			    pte_t pte)
@@ -642,7 +649,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
 	}
-	return vm_normal_page_pfn(vma, addr, pfn);
+	return vm_normal_page_pfn(vma, pfn);
 }
 
 struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
@@ -666,7 +673,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				!is_huge_zero_pfn(pfn));
 		return NULL;
 	}
-	return vm_normal_page_pfn(vma, addr, pfn);
+	return vm_normal_page_pfn(vma, pfn);
 }
 
 struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
@@ -2422,6 +2429,13 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 	pte_t *pte, entry;
 	spinlock_t *ptl;
 
+	/*
+	 * Refuse this pfn if we could mistake it as a refcounted folio
+	 * in a CoW mapping later in vm_normal_page().
+	 */
+	if ((vma->vm_flags & VM_PFNMAP) && vm_pfnmap_normal_page_pfn(vma, pfn))
+		return VM_FAULT_SIGBUS;
+
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
 		return VM_FAULT_OOM;
@@ -2511,7 +2525,6 @@ vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
-	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 	BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
 
 	if (addr < vma->vm_start || addr >= vma->vm_end)
@@ -2656,10 +2669,11 @@ vm_fault_t vmf_insert_mixed_mkwrite(struct vm_area_struct *vma,
  * mappings are removed. any references to nonexistent pages results
  * in null mappings (currently treated as "copy-on-access")
  */
-static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static int remap_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, *mapped_pte;
 	spinlock_t *ptl;
 	int err = 0;
@@ -2674,6 +2688,14 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			err = -EACCES;
 			break;
 		}
+		/*
+		 * Refuse this pfn if we could mistake it as a refcounted folio
+		 * in a CoW mapping later in vm_normal_page().
+		 */
+		if (vm_pfnmap_normal_page_pfn(vma, pfn)) {
+			err = -EINVAL;
+			break;
+		}
 		set_pte_at(mm, addr, pte, pte_mkspecial(pfn_pte(pfn, prot)));
 		pfn++;
 	} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -2682,10 +2704,11 @@ static int remap_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	return err;
 }
 
-static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
+static inline int remap_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pmd_t *pmd;
 	unsigned long next;
 	int err;
@@ -2697,7 +2720,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	VM_BUG_ON(pmd_trans_huge(*pmd));
 	do {
 		next = pmd_addr_end(addr, end);
-		err = remap_pte_range(mm, pmd, addr, next,
+		err = remap_pte_range(vma, pmd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
 			return err;
@@ -2705,10 +2728,11 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	return 0;
 }
 
-static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
+static inline int remap_pud_range(struct vm_area_struct *vma, p4d_t *p4d,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pud_t *pud;
 	unsigned long next;
 	int err;
@@ -2719,7 +2743,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		err = remap_pmd_range(mm, pud, addr, next,
+		err = remap_pmd_range(vma, pud, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
 			return err;
@@ -2727,10 +2751,11 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	return 0;
 }
 
-static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
+static inline int remap_p4d_range(struct vm_area_struct *vma, pgd_t *pgd,
 			unsigned long addr, unsigned long end,
 			unsigned long pfn, pgprot_t prot)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	p4d_t *p4d;
 	unsigned long next;
 	int err;
@@ -2741,7 +2766,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		err = remap_pud_range(mm, p4d, addr, next,
+		err = remap_pud_range(vma, p4d, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
 			return err;
@@ -2773,18 +2798,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
 	 *      Disable vma merging and expanding with mremap().
 	 *   VM_DONTDUMP
 	 *      Omit vma from core dump, even when VM_IO turned off.
-	 *
-	 * There's a horrible special case to handle copy-on-write
-	 * behaviour that some programs depend on. We mark the "original"
-	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
-	 * See vm_normal_page() for details.
 	 */
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (addr != vma->vm_start || end != vma->vm_end)
-			return -EINVAL;
-		vma->vm_pgoff = pfn;
-	}
-
 	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 
 	BUG_ON(addr >= end);
@@ -2793,7 +2807,7 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
 	flush_cache_range(vma, addr, end);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = remap_p4d_range(mm, pgd, addr, next,
+		err = remap_p4d_range(vma, pgd, addr, next,
 				pfn + (addr >> PAGE_SHIFT), prot);
 		if (err)
 			return err;
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 12/14] mm: drop addr parameter from vm_normal_*_pmd()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (10 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour" David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-17 15:43 ` [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud() David Hildenbrand
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

No longer required, let's drop it.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/proc/task_mmu.c | 6 +++---
 include/linux/mm.h | 6 ++----
 mm/huge_memory.c   | 4 ++--
 mm/memory.c        | 8 +++-----
 mm/pagewalk.c      | 2 +-
 5 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index c4ad3083bbfa0..36ef67cdf7a3b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -861,7 +861,7 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 	struct folio *folio;
 
 	if (pmd_present(*pmd)) {
-		page = vm_normal_page_pmd(vma, addr, *pmd);
+		page = vm_normal_page_pmd(vma, *pmd);
 		present = true;
 	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
@@ -2177,7 +2177,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 			categories |= PAGE_IS_WRITTEN;
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
-			page = vm_normal_page_pmd(vma, addr, pmd);
+			page = vm_normal_page_pmd(vma, pmd);
 			if (page && !PageAnon(page))
 				categories |= PAGE_IS_FILE;
 		}
@@ -2942,7 +2942,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 	if (!pmd_present(pmd))
 		return NULL;
 
-	page = vm_normal_page_pmd(vma, addr, pmd);
+	page = vm_normal_page_pmd(vma, pmd);
 	if (!page)
 		return NULL;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3f52871becd3f..ef709457c7076 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2359,10 +2359,8 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
-struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t pmd);
-struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
-				pmd_t pmd);
+struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, pmd_t pmd);
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, pmd_t pmd);
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 67220c30e7818..bf2aed8d92ec2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1994,7 +1994,7 @@ static inline bool can_change_pmd_writable(struct vm_area_struct *vma,
 
 	if (!(vma->vm_flags & VM_SHARED)) {
 		/* See can_change_pte_writable(). */
-		page = vm_normal_page_pmd(vma, addr, pmd);
+		page = vm_normal_page_pmd(vma, pmd);
 		return page && PageAnon(page) && PageAnonExclusive(page);
 	}
 
@@ -2033,7 +2033,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
 	    can_change_pmd_writable(vma, vmf->address, pmd))
 		writable = true;
 
-	folio = vm_normal_folio_pmd(vma, haddr, pmd);
+	folio = vm_normal_folio_pmd(vma, pmd);
 	if (!folio)
 		goto out_map;
 
diff --git a/mm/memory.c b/mm/memory.c
index ace9c59e97181..34f961024e8e6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -663,8 +663,7 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 }
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
-struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
-				pmd_t pmd)
+struct page *vm_normal_page_pmd(struct vm_area_struct *vma, pmd_t pmd)
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
@@ -676,10 +675,9 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return vm_normal_page_pfn(vma, pfn);
 }
 
-struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t pmd)
+struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, pmd_t pmd)
 {
-	struct page *page = vm_normal_page_pmd(vma, addr, pmd);
+	struct page *page = vm_normal_page_pmd(vma, pmd);
 
 	if (page)
 		return page_folio(page);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 648038247a8d2..0edb7240d090c 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -944,7 +944,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 			spin_unlock(ptl);
 			goto pte_table;
 		} else if (pmd_present(pmd)) {
-			page = vm_normal_page_pmd(vma, addr, pmd);
+			page = vm_normal_page_pmd(vma, pmd);
 			if (page) {
 				goto found;
 			} else if ((flags & FW_ZEROPAGE) &&
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (11 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 12/14] mm: drop addr parameter from vm_normal_*_pmd() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  9:22   ` Oscar Salvador
  2025-06-17 15:43 ` [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

Let's introduce vm_normal_page_pud(), which ends up being fairly simple
thanks to our new common helpers and the fact that there is no PUD-sized
zero folio.

Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
structuring the code like the other (pmd/pte) cases. Defer
introducing vm_normal_folio_pud() until really used.
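
For illustration only: if a user shows up later, vm_normal_folio_pud()
would presumably just mirror vm_normal_folio_pmd() from the previous
patch. A minimal sketch, not part of this series:

struct folio *vm_normal_folio_pud(struct vm_area_struct *vma, pud_t pud)
{
	struct page *page = vm_normal_page_pud(vma, pud);

	if (page)
		return page_folio(page);
	return NULL;
}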

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  1 +
 mm/memory.c        | 11 +++++++++++
 mm/pagewalk.c      | 20 ++++++++++----------
 3 files changed, 22 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef709457c7076..022e8ef2c78ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2361,6 +2361,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, pmd_t pmd);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, pmd_t pmd);
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, pud_t pud);
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index 34f961024e8e6..6c65f51248250 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -683,6 +683,17 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma, pmd_t pmd)
 		return page_folio(page);
 	return NULL;
 }
+
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, pud_t pud)
+{
+	unsigned long pfn = pud_pfn(pud);
+
+	if (unlikely(pud_special(pud))) {
+		VM_WARN_ON_ONCE(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
+		return NULL;
+	}
+	return vm_normal_page_pfn(vma, pfn);
+}
 #endif
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 0edb7240d090c..8bd95cf326872 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		fw->pudp = pudp;
 		fw->pud = pud;
 
-		/*
-		 * TODO: FW_MIGRATION support for PUD migration entries
-		 * once there are relevant users.
-		 */
-		if (!pud_present(pud) || pud_special(pud)) {
+		if (pud_none(pud)) {
 			spin_unlock(ptl);
 			goto not_found;
-		} else if (!pud_leaf(pud)) {
+		} else if (pud_present(pud) && !pud_leaf(pud)) {
 			spin_unlock(ptl);
 			goto pmd_table;
+		} else if (pud_present(pud)) {
+			page = vm_normal_page_pud(vma, pud);
+			if (page)
+				goto found;
 		}
 		/*
-		 * TODO: vm_normal_page_pud() will be handy once we want to
-		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+		 * TODO: FW_MIGRATION support for PUD migration entries
+		 * once there are relevant users.
 		 */
-		page = pud_page(pud);
-		goto found;
+		spin_unlock(ptl);
+		goto not_found;
 	}
 
 pmd_table:
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (12 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-06-17 15:43 ` David Hildenbrand
  2025-06-25  9:34   ` Oscar Salvador
  2025-06-17 16:18 ` [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
  2025-06-25  8:49 ` Lorenzo Stoakes
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 15:43 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, David Hildenbrand, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
	David Vrabel

... and hide it behind a kconfig option. There is really no need for
any !xen code to perform this check.

The naming is a bit off: we want to find the "normal" page when a PTE
was marked "special". So it's really not "finding a special" page.

Improve the documentation, and add a comment in the code where XEN ends
up performing the pte_mkspecial() through a hypercall. More details can
be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
special on x86 PV guests").
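
Schematically, an opt-in driver selects FIND_NORMAL_PAGE from its Kconfig
entry and wires up the renamed hook. A minimal sketch with hypothetical
names (my_find_normal_page, my_lookup_page); the real conversion, for
gntdev, is in the diff below:

static struct page *my_find_normal_page(struct vm_area_struct *vma,
					unsigned long addr)
{
	/* Return the "normal" page that this special PTE actually maps. */
	return my_lookup_page(vma->vm_private_data, addr);
}

static const struct vm_operations_struct my_vmops = {
	/* Only available when CONFIG_FIND_NORMAL_PAGE is selected. */
	.find_normal_page = my_find_normal_page,
};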

Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/xen/Kconfig              |  1 +
 drivers/xen/gntdev.c             |  5 +++--
 include/linux/mm.h               | 18 +++++++++++++-----
 mm/Kconfig                       |  2 ++
 mm/memory.c                      | 10 ++++++++--
 tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
 6 files changed, 40 insertions(+), 14 deletions(-)

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 24f485827e039..f9a35ed266ecf 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -138,6 +138,7 @@ config XEN_GNTDEV
 	depends on XEN
 	default m
 	select MMU_NOTIFIER
+	select FIND_NORMAL_PAGE
 	help
 	  Allows userspace processes to use grants.
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 61faea1f06630..d1bc0dae2cdf9 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -309,6 +309,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
 	BUG_ON(pgnr >= map->count);
 	pte_maddr = arbitrary_virt_to_machine(pte).maddr;
 
+	/* Note: this will perform a pte_mkspecial() through the hypercall. */
 	gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
 			  map->grants[pgnr].ref,
 			  map->grants[pgnr].domid);
@@ -516,7 +517,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
 	gntdev_put_map(priv, map);
 }
 
-static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
+static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
 						 unsigned long addr)
 {
 	struct gntdev_grant_map *map = vma->vm_private_data;
@@ -527,7 +528,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
 static const struct vm_operations_struct gntdev_vmops = {
 	.open = gntdev_vma_open,
 	.close = gntdev_vma_close,
-	.find_special_page = gntdev_vma_find_special_page,
+	.find_normal_page = gntdev_vma_find_normal_page,
 };
 
 /* ------------------------------------------------------------------ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 022e8ef2c78ef..b01475f3dca99 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -646,13 +646,21 @@ struct vm_operations_struct {
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr, pgoff_t *ilx);
 #endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
-	 * (using pte_page()) would not find the correct page.
+	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+	 * allows for returning a "normal" page from vm_normal_page() even
+	 * though the PTE indicates that the "struct page" either does not exist
+	 * or should not be touched: "special".
+	 *
+	 * Do not add new users: this really only works when a "normal" page
+	 * was mapped, but then the PTE got changed to something weird (+
+	 * marked special) that would not make pte_pfn() identify the originally
+	 * inserted page.
 	 */
-	struct page *(*find_special_page)(struct vm_area_struct *vma,
-					  unsigned long addr);
+	struct page *(*find_normal_page)(struct vm_area_struct *vma,
+					 unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index c6194d1f9d170..607a3f9672bdb 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1390,6 +1390,8 @@ config PT_RECLAIM
 
 	  Note: now only empty user PTE page table pages will be reclaimed.
 
+config FIND_NORMAL_PAGE
+	def_bool n
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/memory.c b/mm/memory.c
index 6c65f51248250..1eba95fcde096 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,6 +619,10 @@ static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
  * If an architecture does not support pte_special(), this function is less
  * trivial and more expensive in some cases.
  *
+ * With CONFIG_FIND_NORMAL_PAGE, we might have pte_special() set on PTEs that
+ * actually map "normal" pages: however, that page cannot be looked up through
+ * pte_pfn(), but instead will be looked up through vm_ops->find_normal_page().
+ *
  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
  * special mapping (even if there are underlying and valid "struct pages").
  * COWed pages of a VM_PFNMAP are always normal.
@@ -639,8 +643,10 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	unsigned long pfn = pte_pfn(pte);
 
 	if (unlikely(pte_special(pte))) {
-		if (vma->vm_ops && vma->vm_ops->find_special_page)
-			return vma->vm_ops->find_special_page(vma, addr);
+#ifdef CONFIG_FIND_NORMAL_PAGE
+		if (vma->vm_ops && vma->vm_ops->find_normal_page)
+			return vma->vm_ops->find_normal_page(vma, addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
 		if (is_zero_pfn(pfn))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 51dd122b8d501..c5bf041036dd7 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -470,13 +470,21 @@ struct vm_operations_struct {
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr, pgoff_t *ilx);
 #endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
-	 * (using pte_page()) would not find the correct page.
+	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+	 * allows for returning a "normal" page from vm_normal_page() even
+	 * though the PTE indicates that the "struct page" either does not exist
+	 * or should not be touched: "special".
+	 *
+	 * Do not add new users: this really only works when a "normal" page
+	 * was mapped, but then the PTE got changed to something weird (+
+	 * marked special) that would not make pte_pfn() identify the originally
+	 * inserted page.
 	 */
-	struct page *(*find_special_page)(struct vm_area_struct *vma,
-					  unsigned long addr);
+	struct page *(*find_normal_page)(struct vm_area_struct *vma,
+					 unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 };
 
 struct vm_unmapped_area_info {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (13 preceding siblings ...)
  2025-06-17 15:43 ` [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
@ 2025-06-17 16:18 ` David Hildenbrand
  2025-06-17 18:25   ` David Hildenbrand
  2025-06-25  8:49 ` Lorenzo Stoakes
  15 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 16:18 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, Andrew Morton, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
	Alistair Popple, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 17.06.25 17:43, David Hildenbrand wrote:
> RFC because it's based on mm-new where some things might still change
> around the devmap removal stuff.
> 
> While removing support for CoW PFNMAPs is a noble goal, I am not even sure
> if we can remove said support for e.g., /dev/mem that easily.
> 
> In the end, Cow PFNMAPs are pretty simple: everything is "special" except
> CoW'ed anon folios, that are "normal".
> 
> The only complication is: how to identify such pages without pte_special().
> Because with pte_special(), it's easy.
> 
> Well, of course, one day all architectures might support pte_special() ...
> either because we added support for pte_special() or removed support for
> ... these architectures from Linux.
> 
> No need to wait for that day. Let's do some cleanups around
> vm_normal_page()/vm_normal_page_pmd() and handling of the huge zero folio,
> and remove the "horrible special case to handle copy-on-write behaviour"
> that does questionable things in remap_pfn_range() with a VMA, simply by
> 
> ... looking for anonymous folios in CoW PFNMAPs to identify anonymous
> folios? I know, sounds crazy ;)

I'll mention one corner case that just occurred to me: assume someone 
maps, through /dev/mem, arbitrary memory that is actually used by the 
kernel for user space. If some of that memory then gets allocated as 
anonymous memory, it would probably be a problem.

Hmm, I'll have to think about that, and the interaction with 
CONFIG_STRICT_DEVMEM.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
  2025-06-17 16:18 ` [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
@ 2025-06-17 18:25   ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-17 18:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, Andrew Morton, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
	Alistair Popple, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 17.06.25 18:18, David Hildenbrand wrote:
> On 17.06.25 17:43, David Hildenbrand wrote:
>> RFC because it's based on mm-new where some things might still change
>> around the devmap removal stuff.
>>
>> While removing support for CoW PFNMAPs is a noble goal, I am not even sure
>> if we can remove said support for e.g., /dev/mem that easily.
>>
>> In the end, Cow PFNMAPs are pretty simple: everything is "special" except
>> CoW'ed anon folios, that are "normal".
>>
>> The only complication is: how to identify such pages without pte_special().
>> Because with pte_special(), it's easy.
>>
>> Well, of course, one day all architectures might support pte_special() ...
>> either because we added support for pte_special() or removed support for
>> ... these architectures from Linux.
>>
>> No need to wait for that day. Let's do some cleanups around
>> vm_normal_page()/vm_normal_page_pmd() and handling of the huge zero folio,
>> and remove the "horrible special case to handle copy-on-write behaviour"
>> that does questionable things in remap_pfn_range() with a VMA, simply by
>>
>> ... looking for anonymous folios in CoW PFNMAPs to identify anonymous
>> folios? I know, sounds crazy ;)
> 
> I'll mention one corner case that just occurred to me: assume someone
> maps, through /dev/mem, arbitrary memory that is actually used by the
> kernel for user space. If some of that memory then gets allocated as
> anonymous memory, it would probably be a problem.
> 
> Hmm, I'll have to think about that, and the interaction with
> CONFIG_STRICT_DEVMEM.

The /dev/mem mapping of arbitrary memory is indeed the hard case. To 
handle all that, patch #11 is too simplistic.

I have some idea how to make it work, but have to think about a couple 
of corner cases.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
@ 2025-06-20 12:50   ` Oscar Salvador
  2025-06-23 14:04     ` David Hildenbrand
  2025-06-25  7:55   ` Oscar Salvador
  2025-07-03 14:50   ` Lance Yang
  2 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-20 12:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> readily available.
> 
> Nowadays, this is the last remaining highest_memmap_pfn user, and this
> sanity check is not really triggering ... frequently.
> 
> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> simplify and get rid of highest_memmap_pfn. Checking for
> pfn_to_online_page() might be even better, but it would not handle
> ZONE_DEVICE properly.
> 
> Do the same in vm_normal_page_pmd(), where we don't even report a
> problem at all ...
> 
> What might be better in the future is having a runtime option like
> page-table-check to enable such checks dynamically on-demand. Something
> for the future.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

I'm confused, I'm missing something here.
Before this change we would return NULL if e.g. pfn > highest_memmap_pfn, but
now we just print the warning and call pfn_to_page() anyway.
AFAIK, pfn_to_page() doesn't return NULL?
 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 02/14] mm: drop highest_memmap_pfn
  2025-06-17 15:43 ` [PATCH RFC 02/14] mm: drop highest_memmap_pfn David Hildenbrand
@ 2025-06-20 13:04   ` Oscar Salvador
  2025-06-20 18:11   ` Pedro Falcato
  1 sibling, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-20 13:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:33PM +0200, David Hildenbrand wrote:
> Now unused, so let's drop it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages
  2025-06-17 15:43 ` [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages David Hildenbrand
@ 2025-06-20 13:27   ` Oscar Salvador
  2025-06-23 19:22     ` David Hildenbrand
  2025-06-20 18:24   ` Pedro Falcato
  1 sibling, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-20 13:27 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:34PM +0200, David Hildenbrand wrote:
> Doing a pte_pfn() etc. of something that is not a present page table
> entry is wrong. Let's check in all relevant cases where we want to
> upgrade write permissions when inserting pfns/pages whether the entry
> is actually present.

Maybe I would add that this is because the pte can carry other info,
like a marker, a swp_entry, etc.

> It's not expected to have caused real harm in practice, so this is more a
> cleanup than a fix for something that would likely trigger in some
> weird circumstances.
> 
> At some point, we should likely unify the two pte handling paths,
> similar to how we did it for pmds/puds.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Should we scream if someone passes us a non-present entry?


> ---
>  mm/huge_memory.c | 4 ++--
>  mm/memory.c      | 4 ++--
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8e0e3cfd9f223..e52360df87d15 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1392,7 +1392,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>  					  fop.pfn;
>  
> -		if (write) {
> +		if (write && pmd_present(*pmd)) {
>  			if (pmd_pfn(*pmd) != pfn) {
>  				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>  				return -EEXIST;
> @@ -1541,7 +1541,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
>  		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>  					  fop.pfn;
>  
> -		if (write) {
> +		if (write && pud_present(*pud)) {
>  			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
>  				return;
>  			entry = pud_mkyoung(*pud);
> diff --git a/mm/memory.c b/mm/memory.c
> index a1b5575db52ac..9a1acd057ce59 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2137,7 +2137,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>  	pte_t pteval = ptep_get(pte);
>  
>  	if (!pte_none(pteval)) {
> -		if (!mkwrite)
> +		if (!mkwrite || !pte_present(pteval))
>  			return -EBUSY;

Why EBUSY? Because it might be transitory?


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd()
  2025-06-17 15:43 ` [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-06-20 14:12   ` Oscar Salvador
  2025-07-07  2:48     ` Alistair Popple
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-20 14:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:35PM +0200, David Hildenbrand wrote:
> Let's clean it all further up.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

I was thinking maybe we want to pass 'struct vm_fault' directly to insert_pmd(),
and retrieve the fields in there, but since you have to retrieve some in
insert_pfn_pmd() ... maybe not.


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud()
  2025-06-17 15:43 ` [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
@ 2025-06-20 14:15   ` Oscar Salvador
  2025-07-07  2:51   ` Alistair Popple
  1 sibling, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-20 14:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:36PM +0200, David Hildenbrand wrote:
> Let's clean it all further up.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 02/14] mm: drop highest_memmap_pfn
  2025-06-17 15:43 ` [PATCH RFC 02/14] mm: drop highest_memmap_pfn David Hildenbrand
  2025-06-20 13:04   ` Oscar Salvador
@ 2025-06-20 18:11   ` Pedro Falcato
  1 sibling, 0 replies; 65+ messages in thread
From: Pedro Falcato @ 2025-06-20 18:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn

On Tue, Jun 17, 2025 at 05:43:33PM +0200, David Hildenbrand wrote:
> Now unused, so let's drop it.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages
  2025-06-17 15:43 ` [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages David Hildenbrand
  2025-06-20 13:27   ` Oscar Salvador
@ 2025-06-20 18:24   ` Pedro Falcato
  2025-06-23 19:19     ` David Hildenbrand
  1 sibling, 1 reply; 65+ messages in thread
From: Pedro Falcato @ 2025-06-20 18:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn

On Tue, Jun 17, 2025 at 05:43:34PM +0200, David Hildenbrand wrote:
> Doing a pte_pfn() etc. of something that is not a present page table
> entry is wrong. Let's check in all relevant cases where we want to
> upgrade write permissions when inserting pfns/pages whether the entry
> is actually present.
> 
> It's not expected to have caused real harm in practice, so this is more a
> cleanup than a fix for something that would likely trigger in some
> weird circumstances.

Couldn't we, e.g., have a swap entry's "pfn" accidentally match the one we're
inserting? Isn't that a correctness problem?

> 
> At some point, we should likely unify the two pte handling paths,
> similar to how we did it for pmds/puds.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Nice little cleanup, thanks.

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

-- 
Pedro

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-20 12:50   ` Oscar Salvador
@ 2025-06-23 14:04     ` David Hildenbrand
  2025-06-25  7:54       ` Oscar Salvador
  2025-07-03 12:34       ` Lance Yang
  0 siblings, 2 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-23 14:04 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 20.06.25 14:50, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>> readily available.
>>
>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
>> sanity check is not really triggering ... frequently.
>>
>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
>> simplify and get rid of highest_memmap_pfn. Checking for
>> pfn_to_online_page() might be even better, but it would not handle
>> ZONE_DEVICE properly.
>>
>> Do the same in vm_normal_page_pmd(), where we don't even report a
>> problem at all ...
>>
>> What might be better in the future is having a runtime option like
>> page-table-check to enable such checks dynamically on-demand. Something
>> for the future.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 

Hi Oscar,

> I'm confused, I'm missing something here.
> Before this change we would return NULL if e.g. pfn > highest_memmap_pfn, but
> now we just print the warning and call pfn_to_page() anyway.
> AFAIK, pfn_to_page() doesn't return NULL?

You're missing that vm_normal_page_pmd() was created as a copy from 
vm_normal_page() [history of the sanity check above], but as we don't 
have (and shouldn't have ...) print_bad_pmd(), we made the code look 
like this would be something that can just happen.

"
Do the same in vm_normal_page_pmd(), where we don't even report a
problem at all ...
"

So we turned something that should never happen into a runtime sanity 
check that never reports a problem ...
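
For readers without patch #1 at hand, the conversion under discussion is
roughly the following (paraphrased from the commit message, not a verbatim
hunk):

	/* before: report and bail out on a bogus pfn */
	if (unlikely(pfn > highest_memmap_pfn)) {
		print_bad_pte(vma, addr, pte, NULL);
		return NULL;
	}

	/* after: warn once, but fall through to pfn_to_page() */
	VM_WARN_ON_ONCE(!pfn_valid(pfn));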

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages
  2025-06-20 18:24   ` Pedro Falcato
@ 2025-06-23 19:19     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-23 19:19 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn

On 20.06.25 20:24, Pedro Falcato wrote:
> On Tue, Jun 17, 2025 at 05:43:34PM +0200, David Hildenbrand wrote:
>> Doing a pte_pfn() etc. of something that is not a present page table
>> entry is wrong. Let's check in all relevant cases where we want to
>> upgrade write permissions when inserting pfns/pages whether the entry
>> is actually present.
>>
>> It's not expected to have caused real harm in practice, so this is more a
>> cleanup than a fix for something that would likely trigger in some
>> weird circumstances.
> 
> Couldn't we, e.g., have a swap entry's "pfn" accidentally match the one we're
> inserting? Isn't that a correctness problem?

In theory yes, in practice I think this will not happen.

... especially because the WARN_ON_ONCE() would already trigger in many 
other cases before we would find one situation where it doesn't.

That's why I decided against a Fixes: tag and declared this more of a 
cleanup, because the stars really would have to align ... :)
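
To make that concrete, the guarded path after this patch looks roughly
like this (condensed from the hunk quoted above):

	pte_t pteval = ptep_get(pte);

	if (!pte_none(pteval)) {
		/*
		 * Non-present entries (swap/migration entries, markers)
		 * carry no meaningful pfn, so never compare pte_pfn()
		 * against one.
		 */
		if (!mkwrite || !pte_present(pteval))
			return -EBUSY;
		/* ... pfn comparison and mkwrite handling ... */
	}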

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages
  2025-06-20 13:27   ` Oscar Salvador
@ 2025-06-23 19:22     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-23 19:22 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 20.06.25 15:27, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:34PM +0200, David Hildenbrand wrote:
>> Doing a pte_pfn() etc. of something that is not a present page table
>> entry is wrong. Let's check in all relevant cases where we want to
>> upgrade write permissions when inserting pfns/pages whether the entry
>> is actually present.
> 
> Maybe I would add that this is because the pte can carry other info,
> like a marker, a swp_entry, etc.
> 
>> It's not expected to have caused real harm in practice, so this is more a
>> cleanup than a fix for something that would likely trigger in some
>> weird circumstances.
>>
>> At some point, we should likely unify the two pte handling paths,
>> similar to how we did it for pmds/puds.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
> Should we scream if someone passes us a non-present entry?
> 

Probably? Good point, let me think about that.

> 
>> ---
>>   mm/huge_memory.c | 4 ++--
>>   mm/memory.c      | 4 ++--
>>   2 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8e0e3cfd9f223..e52360df87d15 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1392,7 +1392,7 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>>   					  fop.pfn;
>>   
>> -		if (write) {
>> +		if (write && pmd_present(*pmd)) {
>>   			if (pmd_pfn(*pmd) != pfn) {
>>   				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
>>   				return -EEXIST;
>> @@ -1541,7 +1541,7 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
>>   		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>>   					  fop.pfn;
>>   
>> -		if (write) {
>> +		if (write && pud_present(*pud)) {
>>   			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
>>   				return;
>>   			entry = pud_mkyoung(*pud);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index a1b5575db52ac..9a1acd057ce59 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2137,7 +2137,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
>>   	pte_t pteval = ptep_get(pte);
>>   
>>   	if (!pte_none(pteval)) {
>> -		if (!mkwrite)
>> +		if (!mkwrite || !pte_present(pteval))
>>   			return -EBUSY;
> 
> Why EBUSY? Because it might be transitory?

I was confused myself about error handling, and why it differs for all 
cases ... adding to me todo list to investigate that (clean it up ...) :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-06-17 15:43 ` [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-06-24  1:16   ` Alistair Popple
  2025-06-25  9:03     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Alistair Popple @ 2025-06-24  1:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:38PM +0200, David Hildenbrand wrote:
> Let's convert to vmf_insert_folio_pmd().
> 
> In the unlikely case there is already something mapped, we'll now still
> call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE.
> 
> That should probably be fine, no need to add special cases for that.

I'm not sure about that. Consider dax_iomap_pmd_fault() -> dax_fault_iter() ->
dax_pmd_load_hole(): dax_iomap_pmd_fault() calls split_huge_pmd() in response
to VM_FAULT_FALLBACK, which will no longer happen here. What makes that ok?
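
For context, the fallback handling in question looks roughly like this at
the end of dax_iomap_pmd_fault() (paraphrased, not a verbatim quote):

	if (ret == VM_FAULT_FALLBACK) {
		split_huge_pmd(vma, vmf->pmd, vmf->address);
		count_vm_event(THP_FAULT_FALLBACK);
	}

With VM_FAULT_NOPAGE returned instead, this split is skipped.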

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  fs/dax.c | 47 ++++++++++-------------------------------------
>  1 file changed, 10 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 4229513806bea..ae90706674a3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  		const struct iomap_iter *iter, void **entry)
>  {
>  	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> -	unsigned long pmd_addr = vmf->address & PMD_MASK;
> -	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = mapping->host;
> -	pgtable_t pgtable = NULL;
>  	struct folio *zero_folio;
> -	spinlock_t *ptl;
> -	pmd_t pmd_entry;
> -	unsigned long pfn;
> +	vm_fault_t ret;
>  
>  	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
>  
> -	if (unlikely(!zero_folio))
> -		goto fallback;
> -
> -	pfn = page_to_pfn(&zero_folio->page);
> -	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
> -				  DAX_PMD | DAX_ZERO_PAGE);
> -
> -	if (arch_needs_pgtable_deposit()) {
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (!pgtable)
> -			return VM_FAULT_OOM;
> -	}
> -
> -	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> -	if (!pmd_none(*(vmf->pmd))) {
> -		spin_unlock(ptl);
> -		goto fallback;
> +	if (unlikely(!zero_folio)) {
> +		trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> +		return VM_FAULT_FALLBACK;
>  	}
>  
> -	if (pgtable) {
> -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> -		mm_inc_nr_ptes(vma->vm_mm);
> -	}
> -	pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
> -	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> -	spin_unlock(ptl);
> -	trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> -	return VM_FAULT_NOPAGE;
> +	*entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
> +				  DAX_PMD | DAX_ZERO_PAGE);
>  
> -fallback:
> -	if (pgtable)
> -		pte_free(vma->vm_mm, pgtable);
> -	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> -	return VM_FAULT_FALLBACK;
> +	ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
> +	if (ret == VM_FAULT_NOPAGE)
> +		trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> +	return ret;
>  }
>  #else
>  static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-23 14:04     ` David Hildenbrand
@ 2025-06-25  7:54       ` Oscar Salvador
  2025-07-03 12:34       ` Lance Yang
  1 sibling, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  7:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Mon, Jun 23, 2025 at 04:04:01PM +0200, David Hildenbrand wrote:
> Hi Oscar,

Hi David,

> 
> > I'm confused, I'm missing something here.
> > Before this change we would return NULL if e.g. pfn > highest_memmap_pfn, but
> > now we just print the warning and call pfn_to_page() anyway.
> > AFAIK, pfn_to_page() doesn't return NULL?
> 
> You're missing that vm_normal_page_pmd() was created as a copy from
> vm_normal_page() [history of the sanity check above], but as we don't have
> (and shouldn't have ...) print_bad_pmd(), we made the code look like this
> would be something that can just happen.

Ok I see.

> 
> "
> Do the same in vm_normal_page_pmd(), where we don't even report a
> problem at all ...
> "
> 
> So we turned something that should never happen into a runtime sanity
> check that never reports a problem ...

all clear now, thanks :-)



-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
  2025-06-20 12:50   ` Oscar Salvador
@ 2025-06-25  7:55   ` Oscar Salvador
  2025-07-03 14:50   ` Lance Yang
  2 siblings, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  7:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> readily available.
> 
> Nowadays, this is the last remaining highest_memmap_pfn user, and this
> sanity check is not really triggering ... frequently.
> 
> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> simplify and get rid of highest_memmap_pfn. Checking for
> pfn_to_online_page() might be even better, but it would not handle
> ZONE_DEVICE properly.
> 
> Do the same in vm_normal_page_pmd(), where we don't even report a
> problem at all ...
> 
> What might be better in the future is having a runtime option like
> page-table-check to enable such checks dynamically on-demand. Something
> for the future.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-06-17 15:43 ` [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-06-25  8:15   ` Oscar Salvador
  2025-06-25  8:17     ` Oscar Salvador
  2025-06-25  8:20   ` Oscar Salvador
  1 sibling, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:37PM +0200, David Hildenbrand wrote:
> Just like we do for vmf_insert_page_mkwrite() -> ... ->
> insert_page_into_pte_locked(), support the huge zero folio.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

insert_page_into_pte_locked() creates a special pte in case it finds the
zero folio, while insert_pmd() doesn't.
I know that we didn't want to create special mappings for normally
refcounted folios, but this seems inconsistent? I'm pretty sure there's a
reason, but could you elaborate on that?

> ---
>  mm/huge_memory.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1ea23900b5adb..92400f3baa9ff 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1418,9 +1418,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (fop.is_folio) {
>  		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>  
> -		folio_get(fop.folio);
> -		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> -		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +		if (!is_huge_zero_folio(fop.folio)) {
> +			folio_get(fop.folio);
> +			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> +			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +		}
>  	} else {
>  		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
>  		entry = pmd_mkspecial(entry);
> -- 
> 2.49.0
> 
> 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-06-25  8:15   ` Oscar Salvador
@ 2025-06-25  8:17     ` Oscar Salvador
  0 siblings, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Wed, Jun 25, 2025 at 10:15:22AM +0200, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:37PM +0200, David Hildenbrand wrote:
> > Just like we do for vmf_insert_page_mkwrite() -> ... ->
> > insert_page_into_pte_locked(), support the huge zero folio.
> > 
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> insert_page_into_pte_locked() creates a special pte in case it finds the
> zero folio, while insert_pmd() doesn't.
> I know that we didn't want to create special mappings for normally
> refcounted folios, but this seems inconsistent? I'm pretty sure there's a
> reason, but could you elaborate on that?

Heh, I should have checked further. I see that happening in patch #8.


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-06-17 15:43 ` [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
  2025-06-25  8:15   ` Oscar Salvador
@ 2025-06-25  8:20   ` Oscar Salvador
  2025-06-25  8:59     ` David Hildenbrand
  1 sibling, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:37PM +0200, David Hildenbrand wrote:
> Just like we do for vmf_insert_page_mkwrite() -> ... ->
> insert_page_into_pte_locked(), support the huge zero folio.

It might just be me because I don't have the full context cached, but
I would have appreciated more info here :-).

> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-06-17 15:43 ` [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-06-25  8:32   ` Oscar Salvador
  2025-07-14 12:41     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:39PM +0200, David Hildenbrand wrote:
> The huge zero folio is refcounted (+mapcounted -- is that a word?)
> differently than "normal" folios, similarly (but different) to the ordinary
> shared zeropage.
> 
> For this reason, we special-case these pages in
> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> still use them (e.g., GUP can still take a reference on them).
> 
> vm_normal_page_pmd() already filters out the huge zero folio. However,
> so far we are not marking it as special like we do with the ordinary
> shared zeropage. Let's mark it as special, so we can further refactor
> vm_normal_page_pmd() and vm_normal_page().
> 
> While at it, update the doc regarding the shared zero folios.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

While doing this, would it make sense to update vm_normal_page_pmd()
comments to reflect that pmd_special() will also catch the huge zero folio?
It only mentions huge pfnmaps.

It might not be worth doing since you remove that code later on, but
maybe if someone stares at this commit alone..
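For anyone staring at this commit alone, the gist as I read it is roughly
the following (simplified sketch of the patch):

	/* Sketch: map the huge zero folio like the shared zeropage,
	 * i.e., marked special, with no refcount/rmap accounting. */
	entry = folio_mk_pmd(huge_zero_folio, vma->vm_page_prot);
	entry = pmd_mkspecial(entry);
	set_pmd_at(mm, haddr, pmd, entry);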


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd()
  2025-06-17 15:43 ` [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd() David Hildenbrand
@ 2025-06-25  8:37   ` Oscar Salvador
  0 siblings, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:40PM +0200, David Hildenbrand wrote:
> Let's avoid working with the PMD when not required. If
> vm_normal_page_pmd() would be called on something that is not a present
> pmd, it would already be a bug (pfn possibly garbage).
> 
> While at it, let's support passing in any pfn covered by the huge zero
> folio by masking off PFN bits -- which should be rather cheap.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>
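
The masking is cheap indeed; as I understand it, the helper boils down to
something like this (sketch, not the exact patch):

	static inline bool is_huge_zero_pfn(unsigned long pfn)
	{
		/* Any pfn inside the huge zero folio matches once the
		 * low bits are masked off. */
		return READ_ONCE(huge_zero_pfn) == (pfn & ~(HPAGE_PMD_NR - 1));
	}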


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour"
  2025-06-17 15:43 ` [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour" David Hildenbrand
@ 2025-06-25  8:47   ` David Hildenbrand
  2025-06-25  9:02     ` Oscar Salvador
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:47 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-fsdevel, linux-mm, nvdimm, Andrew Morton, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
	Alistair Popple, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 17.06.25 17:43, David Hildenbrand wrote:
> Let's make the kernel a bit less horrible, by removing the
> linearity requirement in CoW PFNMAP mappings with
> !CONFIG_ARCH_HAS_PTE_SPECIAL. In particular, stop messing with
> vma->vm_pgoff in weird ways.
> 
> Simply lookup in applicable (i.e., CoW PFNMAP) mappings whether we
> have an anon folio.
> 
> Nobody should ever try mapping anon folios using PFNs, that just screams
> for other possible issues. To be sure, let's sanity-check when inserting
> PFNs. Are they really required? Probably not, but it's a good safety net
> at least for now.
> 
> The runtime overhead should be limited: there is nothing to do for !CoW
> mappings (common case), and archs that care about performance
> (i.e., GUP-fast) should be supporting CONFIG_ARCH_HAS_PTE_SPECIAL
> either way.
> 
> Likely the sanity checks added in mm/huge_memory.c are not required for
> now, because that code is probably only wired up with
> CONFIG_ARCH_HAS_PTE_SPECIAL, but this way is certainly cleaner and
> more consistent -- and doesn't really cost us anything in the cases we
> really care about.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

I'm still thinking about this patch here, and will likely send out the 
other patches first as a v1, and come back to this one later.

Really, someone mapping random memory using /dev/mem and then getting 
anonymous memory in there is the (nasty) corner case I ignored.

There are rather nasty ways of trying to detect if an anon folio really 
fits into a VMA, but I'd like to avoid that.

What I am thinking about right now is that we could, for these special 
architectures, simply disallow CoW faults on /dev/mem.

So we would still allow MAP_PRIVATE mappings (e.g., random app opening 
/dev/mem using MAP_PRIVATE but never actually writing to that memory), 
but the actual CoW faults would fail without pte_special().

Some more thinking to do ...
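
Something like the following is what I have in mind (hypothetical sketch,
untested; where exactly to hook this is still to be figured out):

	/* Sketch: without pte_special(), refuse CoW faults in PFNMAP
	 * mappings instead of guessing which folios are anonymous. */
	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
	    is_cow_mapping(vma->vm_flags) && (vma->vm_flags & VM_PFNMAP))
		return VM_FAULT_SIGBUS;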

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
  2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
                   ` (14 preceding siblings ...)
  2025-06-17 16:18 ` [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
@ 2025-06-25  8:49 ` Lorenzo Stoakes
  2025-06-25  8:55   ` David Hildenbrand
  15 siblings, 1 reply; 65+ messages in thread
From: Lorenzo Stoakes @ 2025-06-25  8:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

David, are you planning a v2 of this soon? If so I'll hold off review until
then, if not I can get stuck in when I have time?

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*()
  2025-06-17 15:43 ` [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-06-25  8:53   ` Oscar Salvador
  2025-06-25  8:57     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  8:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:41PM +0200, David Hildenbrand wrote:
> Let's reduce the code duplication and factor out the non-pte/pmd related
> magic into vm_normal_page_pfn().
> 
> To keep it simpler, check the pfn against both zero folios. We could
> optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
> case, it's not a compelling micro-optimization.
> 
> With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
> really.
> 
> It's a good question if we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
> scenario in the PMD case in practice: but doesn't really matter, as
> it's now all unified in vm_normal_page_pfn().
> 
> While at it, add a check that pmd_special() is really only set where we
> would expect it.
> 
> No functional change intended.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Comment below

>  struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
> @@ -650,35 +661,12 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>  
> -	/* Currently it's only used for huge pfnmaps */

Although the check kind of spells it out, we could keep this one and also
mention huge_zero_pfn, to make it more explicit.
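I.e., keep something along these lines (sketch):

	/* Currently used for huge pfnmaps and the huge zero folio;
	 * see vm_normal_page() for the rules on "special" mappings. */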
 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements
  2025-06-25  8:49 ` Lorenzo Stoakes
@ 2025-06-25  8:55   ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 25.06.25 10:49, Lorenzo Stoakes wrote:
> David, are you planning a v2 of this soon? If so I'll hold off review until
> then, if not I can get stuck in when I have time?

There will probably be a v1 called "mm: vm_normal_page*()" where I drop 
the problematic bit, and respin it as a new RFC once I have a better 
understanding.

So, no need to review right now.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*()
  2025-06-25  8:53   ` Oscar Salvador
@ 2025-06-25  8:57     ` David Hildenbrand
  2025-06-25  9:20       ` Oscar Salvador
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:57 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 25.06.25 10:53, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:41PM +0200, David Hildenbrand wrote:
>> Let's reduce the code duplication and factor out the non-pte/pmd related
>> magic into vm_normal_page_pfn().
>>
>> To keep it simpler, check the pfn against both zero folios. We could
>> optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
>> case, it's not a compelling micro-optimization.
>>
>> With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
>> really.
>>
>> It's a good question if we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
>> scenario in the PMD case in practice: but doesn't really matter, as
>> it's now all unified in vm_normal_page_pfn().
>>
>> While at it, add a check that pmd_special() is really only set where we
>> would expect it.
>>
>> No functional change intended.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
> Comment below
> 
>>   struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
>> @@ -650,35 +661,12 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   {
>>   	unsigned long pfn = pmd_pfn(pmd);
>>   
>> -	/* Currently it's only used for huge pfnmaps */
> 
> Although the check kind of spells it out, we could keep this one and also
> mention huge_zero_pfn, to make it more explicit.

I don't think that comment is required anymore -- we do exactly what 
vm_normal_page() does + documents.

What the current users are is not particularly important anymore.

Or why do you think it would still be important?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-06-25  8:20   ` Oscar Salvador
@ 2025-06-25  8:59     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  8:59 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 25.06.25 10:20, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:37PM +0200, David Hildenbrand wrote:
>> Just like we do for vmf_insert_page_mkwrite() -> ... ->
>> insert_page_into_pte_locked(), support the huge zero folio.
> 
> It might just be me because I don't have the full context cached, but
> I would have appreciated more info here :-).

I can add something similar to what we have in patch #8, stating that we 
will change it to be special as well soon.

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour"
  2025-06-25  8:47   ` David Hildenbrand
@ 2025-06-25  9:02     ` Oscar Salvador
  2025-06-25  9:04       ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  9:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Wed, Jun 25, 2025 at 10:47:49AM +0200, David Hildenbrand wrote:
> I'm still thinking about this patch here, and will likely send out the other
> patches first as a v1, and come back to this one later.

Patch#12 depends on this one, but Patch#13 should be ok to review
if I ignore the 'addr' parameter being dropped, right?


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-06-24  1:16   ` Alistair Popple
@ 2025-06-25  9:03     ` David Hildenbrand
  2025-07-04 13:22       ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  9:03 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 24.06.25 03:16, Alistair Popple wrote:
> On Tue, Jun 17, 2025 at 05:43:38PM +0200, David Hildenbrand wrote:
>> Let's convert to vmf_insert_folio_pmd().
>>
>> In the unlikely case there is already something mapped, we'll now still
>> call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE.
>>
>> That should probably be fine, no need to add special cases for that.
> 
> I'm not sure about that. Consider dax_iomap_pmd_fault() -> dax_fault_iter() ->
> dax_pmd_load_hole(). It calls split_huge_pmd() in response to VM_FAULT_FALLBACK
> which will no longer happen, what makes that ok?

My reasoning was that this is the exact same behavior other 
vmf_insert_folio_pmd() users here would result in.

But let me dig into the details.
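
For context, the caller flow you mention is roughly the following
(simplified sketch of fs/dax.c; the surrounding iteration is elided):

	ret = dax_fault_iter(...);
	if (ret == VM_FAULT_FALLBACK) {
		/* Give up on the PMD; core mm will retry with PTEs. */
		split_huge_pmd(vma, vmf->pmd, vmf->address);
		count_vm_event(THP_FAULT_FALLBACK);
	}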

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour"
  2025-06-25  9:02     ` Oscar Salvador
@ 2025-06-25  9:04       ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25  9:04 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 25.06.25 11:02, Oscar Salvador wrote:
> On Wed, Jun 25, 2025 at 10:47:49AM +0200, David Hildenbrand wrote:
>> I'm still thinking about this patch here, and will likely send out the other
>> patches first as a v1, and come back to this one later.
> 
> Patch#12 depends on this one, but Patch#13 should be ok to review
> if I ignore the 'addr' parameter being dropped, right?

Yes, only #12 and #13 will be gone. #14 and #15 will simply have the 
"addr" parameter in the _pud() variant as well.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*()
  2025-06-25  8:57     ` David Hildenbrand
@ 2025-06-25  9:20       ` Oscar Salvador
  2025-06-25 10:14         ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  9:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Wed, Jun 25, 2025 at 10:57:39AM +0200, David Hildenbrand wrote:
> I don't think that comment is required anymore -- we do exactly what
> vm_normal_page() does + documents.
> 
> What the current users are is not particularly important anymore.
> 
> Or why do you think it would still be important?

Maybe the current users are not important, but at least a comment pointing
to vm_normal_page(), like "See comment in vm_normal_page".
Here, and in vm_normal_page_pud().

Just so someone has it clear why we're only checking for X and Y when we
find a special pte/pmd/pud.

But no strong opinion here, I just think it might be helpful.


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud()
  2025-06-17 15:43 ` [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-06-25  9:22   ` Oscar Salvador
  0 siblings, 0 replies; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  9:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:44PM +0200, David Hildenbrand wrote:
> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> because of our new common helpers and there not being a PUD-sized zero
> folio.
> 
> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> structuring the code like the other (pmd/pte) cases. Defer
> introducing vm_normal_folio_pud() until really used.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Ignoring the parameter stuff, I'd make the same comment I made wrt.
vm_normal_page_pmd, but LGTM.
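
FWIW, with the common helpers in place I'd expect the new function to be
pleasantly boring, roughly this sketch (signature and helper names modulo
the parameter discussion):

	struct page *vm_normal_page_pud(struct vm_area_struct *vma,
			unsigned long addr, pud_t pud)
	{
		unsigned long pfn = pud_pfn(pud);

		if (unlikely(pud_special(pud)))
			return NULL;
		/* There is no PUD-sized zero folio to filter out here. */
		return vm_normal_page_pfn(vma, addr, pfn);
	}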

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-06-17 15:43 ` [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
@ 2025-06-25  9:34   ` Oscar Salvador
  2025-07-14 14:19     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Oscar Salvador @ 2025-06-25  9:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
	David Vrabel

On Tue, Jun 17, 2025 at 05:43:45PM +0200, David Hildenbrand wrote:
> ... and hide it behind a kconfig option. There is really no need for
> any !xen code to perform this check.
> 
> The naming is a bit off: we want to find the "normal" page when a PTE
> was marked "special". So it's really not "finding a special" page.
> 
> Improve the documentation, and add a comment in the code where XEN ends
> up performing the pte_mkspecial() through a hypercall. More details can
> be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
> special on x86 PV guests").

Looks good to me.
Since this seems to be a "mistake" from the past that we don't want to repeat,
I wonder whether we could seal FIND_NORMAL_PAGE somehow, and scream if
someone tries to enable it in !XEN environments.
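
E.g., something as blunt as this could do it (hypothetical sketch; option
and message names are illustrative only):

	/* Scream at build time if the option leaks into a !XEN config. */
	#if defined(CONFIG_FIND_NORMAL_PAGE) && !defined(CONFIG_XEN)
	#error "CONFIG_FIND_NORMAL_PAGE is only intended for Xen gntdev"
	#endif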



-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*()
  2025-06-25  9:20       ` Oscar Salvador
@ 2025-06-25 10:14         ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-06-25 10:14 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 25.06.25 11:20, Oscar Salvador wrote:
> On Wed, Jun 25, 2025 at 10:57:39AM +0200, David Hildenbrand wrote:
>> I don't think that comment is required anymore -- we do exactly what
>> vm_normal_page() does + documents.
>>
>> What the current users are is not particularly important anymore.
>>
>> Or why do you think it would still be important?
> 
> Maybe the current users are not important, but at least a comment pointing
> to vm_normal_page(), like "See comment in vm_normal_page".
> Here, and in vm_normal_page_pud().
> 
> Just so someone has it clear why we're only checking for X and Y when we
> find a special pte/pmd/pud.
> 
> But no strong opinion here, I just think it might be helpful.

I was already debating with myself whether to add full kerneldoc for 
these functions ... but yeah, to me the link to "vm_normal_page()" is 
obvious; we can just spell it out: "see vm_normal_page()".

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-23 14:04     ` David Hildenbrand
  2025-06-25  7:54       ` Oscar Salvador
@ 2025-07-03 12:34       ` Lance Yang
  2025-07-03 12:39         ` David Hildenbrand
  1 sibling, 1 reply; 65+ messages in thread
From: Lance Yang @ 2025-07-03 12:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato

On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.06.25 14:50, Oscar Salvador wrote:
> > On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
> >> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> >> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> >> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> >> readily available.
> >>
> >> Nowadays, this is the last remaining highest_memmap_pfn user, and this
> >> sanity check is not really triggering ... frequently.
> >>
> >> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> >> simplify and get rid of highest_memmap_pfn. Checking for
> >> pfn_to_online_page() might be even better, but it would not handle
> >> ZONE_DEVICE properly.
> >>
> >> Do the same in vm_normal_page_pmd(), where we don't even report a
> >> problem at all ...
> >>
> >> What might be better in the future is having a runtime option like
> >> page-table-check to enable such checks dynamically on-demand. Something
> >> for the future.
> >>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >
>
> Hi Oscar,
>
> > I'm confused, I'm missing something here.
> > Before this change we would return NULL if e.g: pfn > highest_memmap_pfn, but
> > now we just print the warning and call pfn_to_page() anyway.
> > AFAIK, pfn_to_page() doesn't return NULL?
>
> You're missing that vm_normal_page_pmd() was created as a copy from
> vm_normal_page() [history of the sanity check above], but as we don't
> have (and shouldn't have ...) print_bad_pmd(), we made the code look
> like this would be something that can just happen.
>
> "
> Do the same in vm_normal_page_pmd(), where we don't even report a
> problem at all ...
> "
>
> So we made something that should never happen a runtime sanity check
> without ever reporting a problem ...

IIUC, the reasoning is that because this case should never happen, we can
change the behavior from returning NULL to a "warn and continue" model?

Thanks,
Lance

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-03 12:34       ` Lance Yang
@ 2025-07-03 12:39         ` David Hildenbrand
  2025-07-03 14:44           ` Lance Yang
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-07-03 12:39 UTC (permalink / raw)
  To: Lance Yang
  Cc: Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato

On 03.07.25 14:34, Lance Yang wrote:
> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.06.25 14:50, Oscar Salvador wrote:
>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>>>> readily available.
>>>>
>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
>>>> sanity check is not really triggering ... frequently.
>>>>
>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
>>>> simplify and get rid of highest_memmap_pfn. Checking for
>>>> pfn_to_online_page() might be even better, but it would not handle
>>>> ZONE_DEVICE properly.
>>>>
>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>>> problem at all ...
>>>>
>>>> What might be better in the future is having a runtime option like
>>>> page-table-check to enable such checks dynamically on-demand. Something
>>>> for the future.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>
>> Hi Oscar,
>>
>>> I'm confused, I'm missing something here.
>>> Before this change we would return NULL if e.g: pfn > highest_memmap_pfn, but
>>> now we just print the warning and call pfn_to_page() anyway.
>>> AFAIK, pfn_to_page() doesn't return NULL?
>>
>> You're missing that vm_normal_page_pmd() was created as a copy from
>> vm_normal_page() [history of the sanity check above], but as we don't
>> have (and shouldn't have ...) print_bad_pmd(), we made the code look
>> like this would be something that can just happen.
>>
>> "
>> Do the same in vm_normal_page_pmd(), where we don't even report a
>> problem at all ...
>> "
>>
>> So we made something that should never happen a runtime sanity check
>> without ever reporting a problem ...
> 
> IIUC, the reasoning is that because this case should never happen, we can
> change the behavior from returning NULL to a "warn and continue" model?

Well, yes. Point is, that check should have never been copy-pasted that 
way, while dropping the actual warning :)

It's a sanity check for something that should never happen, turned into 
something that looks like it must be handled and would be valid to 
encounter.
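
To spell out the conversion in vm_normal_page_pmd() (sketch of the
before/after from the patch):

	/* Before: silently pretend there is no page at all. */
	if (unlikely(pfn > highest_memmap_pfn))
		return NULL;

	/* After: warn once -- this should simply never happen. */
	VM_WARN_ON_ONCE(!pfn_valid(pfn));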

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-03 12:39         ` David Hildenbrand
@ 2025-07-03 14:44           ` Lance Yang
  2025-07-04 12:40             ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Lance Yang @ 2025-07-03 14:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang



On 2025/7/3 20:39, David Hildenbrand wrote:
> On 03.07.25 14:34, Lance Yang wrote:
>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com> 
>> wrote:
>>>
>>> On 20.06.25 14:50, Oscar Salvador wrote:
>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>>>>> readily available.
>>>>>
>>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
>>>>> sanity check is not really triggering ... frequently.
>>>>>
>>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
>>>>> simplify and get rid of highest_memmap_pfn. Checking for
>>>>> pfn_to_online_page() might be even better, but it would not handle
>>>>> ZONE_DEVICE properly.
>>>>>
>>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>>>> problem at all ...
>>>>>
>>>>> What might be better in the future is having a runtime option like
>>>>> page-table-check to enable such checks dynamically on-demand. 
>>>>> Something
>>>>> for the future.
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>
>>>
>>> Hi Oscar,
>>>
>>>> I'm confused, I'm missing something here.
>>>> Before this change we would return NULL if e.g: pfn > 
>>>> highest_memmap_pfn, but
>>>> now we just print the warning and call pfn_to_page() anyway.
>>>> AFAIK, pfn_to_page() doesn't return NULL?
>>>
>>> You're missing that vm_normal_page_pmd() was created as a copy from
>>> vm_normal_page() [history of the sanity check above], but as we don't
>>> have (and shouldn't have ...) print_bad_pmd(), we made the code look
>>> like this would be something that can just happen.
>>>
>>> "
>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>> problem at all ...
>>> "
>>>
>>> So we made something that should never happen a runtime sanity check
>>> without ever reporting a problem ...
>>
>> IIUC, the reasoning is that because this case should never happen, we can
>> change the behavior from returning NULL to a "warn and continue" model?
> 
> Well, yes. Point is, that check should have never been copy-pasted that 
> way, while dropping the actual warning :)

Ah, I see your point now. Thanks for clarifying!

> 
> It's a sanity check for something that should never happen, turned into 
> something that looks like it must be handled and would be valid to 
> encounter.

Yeah. Makes sense to me ;)



^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
  2025-06-20 12:50   ` Oscar Salvador
  2025-06-25  7:55   ` Oscar Salvador
@ 2025-07-03 14:50   ` Lance Yang
  2 siblings, 0 replies; 65+ messages in thread
From: Lance Yang @ 2025-07-03 14:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 11:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> readily available.
>
> Nowadays, this is the last remaining highest_memmap_pfn user, and this
> sanity check is not really triggering ... frequently.
>
> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> simplify and get rid of highest_memmap_pfn. Checking for
> pfn_to_online_page() might be even better, but it would not handle
> ZONE_DEVICE properly.
>
> Do the same in vm_normal_page_pmd(), where we don't even report a
> problem at all ...
>
> What might be better in the future is having a runtime option like
> page-table-check to enable such checks dynamically on-demand. Something
> for the future.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM. Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>

Thanks,
Lance

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-03 14:44           ` Lance Yang
@ 2025-07-04 12:40             ` David Hildenbrand
  2025-07-07  6:31               ` Hugh Dickins
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-07-04 12:40 UTC (permalink / raw)
  To: Lance Yang
  Cc: Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang

On 03.07.25 16:44, Lance Yang wrote:
> 
> 
> On 2025/7/3 20:39, David Hildenbrand wrote:
>> On 03.07.25 14:34, Lance Yang wrote:
>>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com>
>>> wrote:
>>>>
>>>> On 20.06.25 14:50, Oscar Salvador wrote:
>>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>>>>>> readily available.
>>>>>>
>>>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
>>>>>> sanity check is not really triggering ... frequently.
>>>>>>
>>>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
>>>>>> simplify and get rid of highest_memmap_pfn. Checking for
>>>>>> pfn_to_online_page() might be even better, but it would not handle
>>>>>> ZONE_DEVICE properly.
>>>>>>
>>>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>>>>> problem at all ...
>>>>>>
>>>>>> What might be better in the future is having a runtime option like
>>>>>> page-table-check to enable such checks dynamically on-demand.
>>>>>> Something
>>>>>> for the future.
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>
>>>>
>>>> Hi Oscar,
>>>>
>>>>> I'm confused, I'm missing something here.
>>>>> Before this change we would return NULL if e.g: pfn >
>>>>> highest_memmap_pfn, but
>>>>> now we just print the warning and call pfn_to_page() anyway.
>>>>> AFAIK, pfn_to_page() doesn't return NULL?
>>>>
>>>> You're missing that vm_normal_page_pmd() was created as a copy from
>>>> vm_normal_page() [history of the sanity check above], but as we don't
>>>> have (and shouldn't have ...) print_bad_pmd(), we made the code look
>>>> like this would be something that can just happen.
>>>>
>>>> "
>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>>> problem at all ...
>>>> "
>>>>
>>>> So we made something that should never happen a runtime sanity check
>>>> without ever reporting a problem ...
>>>
>>> IIUC, the reasoning is that because this case should never happen, we can
>>> change the behavior from returning NULL to a "warn and continue" model?
>>
>> Well, yes. Point is, that check should have never been copy-pasted that
>> way, while dropping the actual warning :)
> 
> Ah, I see your point now. Thanks for clarifying!

I'll add some of that to the patch description. Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-06-25  9:03     ` David Hildenbrand
@ 2025-07-04 13:22       ` David Hildenbrand
  2025-07-07 11:50         ` Alistair Popple
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-07-04 13:22 UTC (permalink / raw)
  To: Alistair Popple
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On 25.06.25 11:03, David Hildenbrand wrote:
> On 24.06.25 03:16, Alistair Popple wrote:
>> On Tue, Jun 17, 2025 at 05:43:38PM +0200, David Hildenbrand wrote:
>>> Let's convert to vmf_insert_folio_pmd().
>>>
>>> In the unlikely case there is already something mapped, we'll now still
>>> call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE.
>>>
>>> That should probably be fine, no need to add special cases for that.
>>
>> I'm not sure about that. Consider dax_iomap_pmd_fault() -> dax_fault_iter() ->
>> dax_pmd_load_hole(). It calls split_huge_pmd() in response to VM_FAULT_FALLBACK
>> which will no longer happen, what makes that ok?
> 
> My reasoning was that this is the exact same behavior other
> vmf_insert_folio_pmd() users here would result in.
> 
> But let me dig into the details.

Okay, trying to figure out what to do here.

Assume dax_pmd_load_hole() is called and there is already something. We 
would have returned VM_FAULT_FALLBACK; now we would return VM_FAULT_NOPAGE.

That obviously only happens when we do not have a write fault (otherwise, 
the shared zeropage does not apply).

In dax_iomap_pmd_fault(), we would indeed split_huge_pmd(). In the DAX 
case (!anon vma), that would simply zap whatever is already mapped there.

I guess we would then return VM_FAULT_FALLBACK from huge_fault-> ... -> 
dax_iomap_fault() and core MM code would fallback to handle_pte_fault() 
etc. and ... load a single PTE mapping the shared zeropage.

BUT

why is this case handled differently than everything else?

E.g.,

(1) when we try inserting the shared zeropage through 
dax_load_hole()->vmf_insert_page_mkwrite() and there is already 
something ... we return VM_FAULT_NOPAGE.

(2) when we try inserting a PTE mapping an ordinary folio through 
dax_fault_iter()->vmf_insert_page_mkwrite() and there is already 
something ... we return VM_FAULT_NOPAGE.

(3) when we try inserting a PMD mapping an ordinary folio through 
dax_fault_iter()->vmf_insert_folio_pmd() and there is already something 
... we return VM_FAULT_NOPAGE.


So that makes me think ... the VM_FAULT_FALLBACK right now is probably 
... wrong? And probably cannot be triggered?

If there is already the huge zerofolio mapped, all good.

Anything else is really not expected I would assume?
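
FWIW, the common behavior all these cases share is the one from
insert_pmd() earlier in this series, roughly (simplified sketch; the
write-fault/mkwrite handling is elided):

	ptl = pmd_lock(mm, pmd);
	if (!pmd_none(*pmd)) {
		/* Something is already mapped: leave it alone and report
		 * VM_FAULT_NOPAGE instead of VM_FAULT_FALLBACK. */
		spin_unlock(ptl);
		return VM_FAULT_NOPAGE;
	}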

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd()
  2025-06-20 14:12   ` Oscar Salvador
@ 2025-07-07  2:48     ` Alistair Popple
  0 siblings, 0 replies; 65+ messages in thread
From: Alistair Popple @ 2025-07-07  2:48 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: David Hildenbrand, linux-kernel, linux-fsdevel, linux-mm, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On Fri, Jun 20, 2025 at 04:12:35PM +0200, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:35PM +0200, David Hildenbrand wrote:
> > Let's clean it all further up.
> > 
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
> I was thinking maybe we want to pass 'struct vm_fault' directly to insert_pmd(),
> and retrieve the fields in there, but since you have to retrieve some in
> insert_pfn_pmd().. maybe not.

Where practical I quite like having callers retrieve context struct fields
rather than passing the whole struct down. It makes it very obvious what
elements insert_pmd() cares about (in this case one of about fourteen fields).

Anyway looks good, thanks:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> -- 
> Oscar Salvador
> SUSE Labs

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud()
  2025-06-17 15:43 ` [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
  2025-06-20 14:15   ` Oscar Salvador
@ 2025-07-07  2:51   ` Alistair Popple
  1 sibling, 0 replies; 65+ messages in thread
From: Alistair Popple @ 2025-07-07  2:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On Tue, Jun 17, 2025 at 05:43:36PM +0200, David Hildenbrand wrote:
> Let's clean it all further up.

Looks good:

Reviewed-by: Alistair Popple <apopple@nvidia.com>

> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 36 +++++++++++++-----------------------
>  1 file changed, 13 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a85e0cd455109..1ea23900b5adb 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1507,25 +1507,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>  	return pud;
>  }
>  
> -static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
> +static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
>  		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> +	spinlock_t *ptl;
>  	pud_t entry;
>  
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return VM_FAULT_SIGBUS;
> +
> +	ptl = pud_lock(mm, pud);
>  	if (!pud_none(*pud)) {
>  		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>  					  fop.pfn;
>  
>  		if (write && pud_present(*pud)) {
>  			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
> -				return;
> +				goto out_unlock;
>  			entry = pud_mkyoung(*pud);
>  			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
>  			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
>  				update_mmu_cache_pud(vma, addr, pud);
>  		}
> -		return;
> +		goto out_unlock;
>  	}
>  
>  	if (fop.is_folio) {
> @@ -1544,6 +1549,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	set_pud_at(mm, addr, pud, entry);
>  	update_mmu_cache_pud(vma, addr, pud);
> +out_unlock:
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
>  }
>  
>  /**
> @@ -1565,7 +1573,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
>  	struct folio_or_pfn fop = {
>  		.pfn = pfn,
>  	};
> -	spinlock_t *ptl;
>  
>  	/*
>  	 * If we had pud_special, we could avoid all these restrictions,
> @@ -1577,16 +1584,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
>  						(VM_PFNMAP|VM_MIXEDMAP));
>  	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
>  
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
> -
>  	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>  
> -	ptl = pud_lock(vma->vm_mm, vmf->pud);
> -	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
> -	spin_unlock(ptl);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
>  
> @@ -1603,25 +1603,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	unsigned long addr = vmf->address & PUD_MASK;
> -	pud_t *pud = vmf->pud;
> -	struct mm_struct *mm = vma->vm_mm;
>  	struct folio_or_pfn fop = {
>  		.folio = folio,
>  		.is_folio = true,
>  	};
> -	spinlock_t *ptl;
> -
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
>  
>  	if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
>  		return VM_FAULT_SIGBUS;
>  
> -	ptl = pud_lock(mm, pud);
> -	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
> -	spin_unlock(ptl);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> -- 
> 2.49.0
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-04 12:40             ` David Hildenbrand
@ 2025-07-07  6:31               ` Hugh Dickins
  2025-07-07 13:19                 ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Hugh Dickins @ 2025-07-07  6:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lance Yang, Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm,
	nvdimm, Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang

On Fri, 4 Jul 2025, David Hildenbrand wrote:
> On 03.07.25 16:44, Lance Yang wrote:
> > On 2025/7/3 20:39, David Hildenbrand wrote:
> >> On 03.07.25 14:34, Lance Yang wrote:
> >>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com>
> >>> wrote:
> >>>>
> >>>> On 20.06.25 14:50, Oscar Salvador wrote:
> >>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
> >>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> >>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> >>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> >>>>>> readily available.
> >>>>>>
> >>>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
> >>>>>> sanity check is not really triggering ... frequently.
> >>>>>>
> >>>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> >>>>>> simplify and get rid of highest_memmap_pfn. Checking for
> >>>>>> pfn_to_online_page() might be even better, but it would not handle
> >>>>>> ZONE_DEVICE properly.
> >>>>>>
> >>>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
> >>>>>> problem at all ...
> >>>>>>
> >>>>>> What might be better in the future is having a runtime option like
> >>>>>> page-table-check to enable such checks dynamically on-demand.
> >>>>>> Something
> >>>>>> for the future.
> >>>>>>
> >>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>

The author of 22b31eec63e5 thinks this is not at all an improvement.
Of course the condition is not triggering frequently, of course it
should not happen: but it does happen, and it still seems more worthwhile
to catch it in production with a "Bad page map" than to let it run on
to whatever kind of crash it hits instead.

Hugh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-07-04 13:22       ` David Hildenbrand
@ 2025-07-07 11:50         ` Alistair Popple
  0 siblings, 0 replies; 65+ messages in thread
From: Alistair Popple @ 2025-07-07 11:50 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato

On Fri, Jul 04, 2025 at 03:22:28PM +0200, David Hildenbrand wrote:
> On 25.06.25 11:03, David Hildenbrand wrote:
> > On 24.06.25 03:16, Alistair Popple wrote:
> > > On Tue, Jun 17, 2025 at 05:43:38PM +0200, David Hildenbrand wrote:
> > > > Let's convert to vmf_insert_folio_pmd().
> > > > 
> > > > In the unlikely case there is already something mapped, we'll now still
> > > > call trace_dax_pmd_load_hole() and return VM_FAULT_NOPAGE.
> > > > 
> > > > That should probably be fine, no need to add special cases for that.
> > > 
> > > I'm not sure about that. Consider dax_iomap_pmd_fault() -> dax_fault_iter() ->
> > > dax_pmd_load_hole(). It calls split_huge_pmd() in response to VM_FAULT_FALLBACK
> > > which will no longer happen, what makes that ok?
> > 
> > My reasoning was that this is the exact same behavior other
> > vmf_insert_folio_pmd() users here would result in.
> > 
> > But let me dig into the details.
> 
> Okay, trying to figure out what to do here.
> 
> Assume dax_pmd_load_hole() is called and there is already something. We
> would have returned VM_FAULT_FALLBACK; now we would return VM_FAULT_NOPAGE.
> 
> That obviously only happens when we do not have a write fault (otherwise,
> the shared zeropage does not apply).
> 
> In dax_iomap_pmd_fault(), we would indeed split_huge_pmd(). In the DAX case
> (!anon vma), that would simply zap whatever is already mapped there.
> 
> I guess we would then return VM_FAULT_FALLBACK from huge_fault-> ... ->
> dax_iomap_fault() and core MM code would fallback to handle_pte_fault() etc.
> and ... load a single PTE mapping the shared zeropage.
> 
> BUT
> 
> why is this case handled differently than everything else?

Hmm. Good question, I will have a bit more of a think about it, but your
conclusion below is probably correct.

> E.g.,
> 
> (1) when we try inserting the shared zeropage through
> dax_load_hole()->vmf_insert_page_mkwrite() and there is already something
> ... we return VM_FAULT_NOPAGE.
> 
> (2) when we try inserting a PTE mapping an ordinary folio through
> dax_fault_iter()->vmf_insert_page_mkwrite() and there is already something
> ... we return VM_FAULT_NOPAGE.
> 
> (3) when we try inserting a PMD mapping an ordinary folio through
> dax_fault_iter()->vmf_insert_folio_pmd() and there is already something ...
> we return VM_FAULT_NOPAGE.
> 
> 
> So that makes me think ... the VM_FAULT_FALLBACK right now is probably ...
> wrong? And probably cannot be triggered?

I suspect that's true. At least I just did a full run of xfstests on an XFS DAX
filesystem and was unable to trigger this path, so it's certainly not easy to
trigger.

> If there is already the huge zerofolio mapped, all good.
> 
> Anything else is really not expected I would assume?
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-07  6:31               ` Hugh Dickins
@ 2025-07-07 13:19                 ` David Hildenbrand
  2025-07-08  2:52                   ` Hugh Dickins
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-07-07 13:19 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lance Yang, Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm,
	nvdimm, Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang

On 07.07.25 08:31, Hugh Dickins wrote:
> On Fri, 4 Jul 2025, David Hildenbrand wrote:
>> On 03.07.25 16:44, Lance Yang wrote:
>>> On 2025/7/3 20:39, David Hildenbrand wrote:
>>>> On 03.07.25 14:34, Lance Yang wrote:
>>>>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com>
>>>>> wrote:
>>>>>>
>>>>>> On 20.06.25 14:50, Oscar Salvador wrote:
>>>>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>>>>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>>>>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>>>>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>>>>>>>> readily available.
>>>>>>>>
>>>>>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and this
>>>>>>>> sanity check is not really triggering ... frequently.
>>>>>>>>
>>>>>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
>>>>>>>> simplify and get rid of highest_memmap_pfn. Checking for
>>>>>>>> pfn_to_online_page() might be even better, but it would not handle
>>>>>>>> ZONE_DEVICE properly.
>>>>>>>>
>>>>>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
>>>>>>>> problem at all ...
>>>>>>>>
>>>>>>>> What might be better in the future is having a runtime option like
>>>>>>>> page-table-check to enable such checks dynamically on-demand.
>>>>>>>> Something
>>>>>>>> for the future.
>>>>>>>>
>>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> The author of 22b31eec63e5 thinks this is not at all an improvement.
> Of course the condition is not triggering frequently, of course it
> should not happen: but it does happen, and it still seems worthwhile
> to catch it in production with a "Bad page map" rather than to let it run on
> to whatever kind of crash it hits instead.

Well, obviously I don't agree and was waiting to have this discussion :)

We catch corruption in a handful of PTE bits, and that's about it. You 
neither detect corruption of flags nor of PFN bits that result in 
another valid PFN.

Corruption of the "special" bit might be fun.

When I was able to trigger this during development once, the whole 
machine went down shortly after -- mostly because of use-after-free of 
something that is now a page table, which is just bad for both users of 
such a page!

E.g., quit that process and we will happily clear the PTE, corrupting 
data of the other user. Fun.

I'm sure I could find a way to unify the code while printing some 
comparable message, but this check as it stands is just not worth it 
IMHO: trying to handle something gracefully that shouldn't happen, when 
really we cannot handle it gracefully.

-- 
Cheers,

David / dhildenb
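
For concreteness, a minimal sketch of the conversion the patch proposes
(condensed; not the exact mm/memory.c diff):

#include <linux/mm.h>
#include <linux/pgtable.h>

static struct page *vm_normal_page_tail_sketch(pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	/*
	 * Instead of print_bad_pte() keyed off highest_memmap_pfn,
	 * warn once on an impossible PFN and bail.
	 */
	if (unlikely(!pfn_valid(pfn))) {
		VM_WARN_ON_ONCE(1);
		return NULL;
	}
	return pfn_to_page(pfn);
}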


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-07 13:19                 ` David Hildenbrand
@ 2025-07-08  2:52                   ` Hugh Dickins
  2025-07-11 15:30                     ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Hugh Dickins @ 2025-07-08  2:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Lance Yang, Oscar Salvador, linux-kernel,
	linux-fsdevel, linux-mm, nvdimm, Andrew Morton, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
	Alistair Popple, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Lance Yang

On Mon, 7 Jul 2025, David Hildenbrand wrote:
> On 07.07.25 08:31, Hugh Dickins wrote:
> > On Fri, 4 Jul 2025, David Hildenbrand wrote:
> >> On 03.07.25 16:44, Lance Yang wrote:
> >>> On 2025/7/3 20:39, David Hildenbrand wrote:
> >>>> On 03.07.25 14:34, Lance Yang wrote:
> >>>>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> On 20.06.25 14:50, Oscar Salvador wrote:
> >>>>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
> >>>>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
> >>>>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
> >>>>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
> >>>>>>>> readily available.

highest_memmap_pfn was introduced by that commit for this purpose.

> >>>>>>>>
> >>>>>>>> Nowadays, this is the last remaining highest_memmap_pfn user, and
> >>>>>>>> this
> >>>>>>>> sanity check is not really triggering ... frequently.
> >>>>>>>>
> >>>>>>>> Let's convert it to VM_WARN_ON_ONCE(!pfn_valid(pfn)), so we can
> >>>>>>>> simplify and get rid of highest_memmap_pfn. Checking for
> >>>>>>>> pfn_to_online_page() might be even better, but it would not handle
> >>>>>>>> ZONE_DEVICE properly.
> >>>>>>>>
> >>>>>>>> Do the same in vm_normal_page_pmd(), where we don't even report a
> >>>>>>>> problem at all ...
> >>>>>>>>
> >>>>>>>> What might be better in the future is having a runtime option like
> >>>>>>>> page-table-check to enable such checks dynamically on-demand.
> >>>>>>>> Something
> >>>>>>>> for the future.
> >>>>>>>>
> >>>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
> > 
> > The author of 22b31eec63e5 thinks this is not at all an improvement.
> > Of course the condition is not triggering frequently, of course it
> > should not happen: but it does happen, and it still seems worthwhile
> > to catch it in production with a "Bad page map" rather than to let it run on
> > to whatever kind of crash it hits instead.
> 
> Well, obviously I don't agree and was waiting for having this discussion :)
> 
> We catch corruption in a handful of PTE bits, and that's about it. You neither
> detect corruption of flags nor of PFN bits that result in another valid PFN.

Of course it's limited in what it can catch (and won't even get called
if the present bit was not set - a more complete patch might unify with
those various "Bad swap" messages). Of course. But it's still useful for
stopping pfn_to_page() veering off the end of the memmap[] (in some configs).
And it's still useful for printing out a series of "Bad page map" messages
when the page table is corrupted: from which a picture can sometimes be
built up (isolated instance may just be a bitflip; series of them can
sometimes show e.g. ascii text, occasionally helpful for debugging).

> 
> Corruption of the "special" bit might be fun.
> 
> When I was able to trigger this during development once, the whole machine
> went down shortly after -- mostly because of use-after-free of something that
> is now a page table, which is just bad for both users of such a page!
> 
> E.g., quit that process and we will happily clear the PTE, corrupting data of
> the other user. Fun.
> 
> I'm sure I could find a way to unify the code while printing some comparable
> message, but this check as it stands is just not worth it IMHO: trying to
> handle something gracefully that shouldn't happen, when really we cannot
> handle it gracefully.

So, you have experience of a time when it didn't help you. Okay. And we
have had experience of other times when it has helped, if only a little.
Like with other "Bad page"s: sometimes helpful, often not; but tending to
build up a big picture from repeated occurrences.

We continue to disagree. I can't argue more than append the 2.6.29
commit message, which seems to me as valid now as it was then.

From 22b31eec63e5f2e219a3ee15f456897272bc73e8 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hugh@veritas.com>
Date: Tue, 6 Jan 2009 14:40:09 -0800
Subject: [PATCH] badpage: vm_normal_page use print_bad_pte

print_bad_pte() is so far being called only when zap_pte_range() finds
negative page_mapcount, or there's a fault on a pte_file where it does not
belong.  That's weak coverage when we suspect pagetable corruption.

Originally, it was called when vm_normal_page() found an invalid pfn: but
pfn_valid is expensive on some architectures and configurations, so 2.6.24
put that under CONFIG_DEBUG_VM (which doesn't help in the field), then
2.6.26 replaced it by a VM_BUG_ON (likewise).

Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test
than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a
__read_mostly note of the highest_memmap_pfn, vm_normal_page() then check
pfn against that.  We could call this pfn_plausible() or pfn_sane(), but I
doubt we'll need it elsewhere: of course it's not reliable, but gives much
stronger pagetable validation on many boxes.

Also use print_bad_pte() when the pte_special bit is found outside a
VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
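
For reference, the mechanism that commit introduced, condensed into a
sketch (not the verbatim 2.6.29 code):

unsigned long highest_memmap_pfn __read_mostly;

/* memmap_init_zone() (bootup and memory hotplug) records the top PFN: */
	if (highest_memmap_pfn < end_pfn - 1)
		highest_memmap_pfn = end_pfn - 1;

/* vm_normal_page() then rejects implausible PFNs cheaply: */
	if (unlikely(pfn > highest_memmap_pfn)) {
		print_bad_pte(vma, addr, pte, NULL);
		return NULL;
	}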

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-08  2:52                   ` Hugh Dickins
@ 2025-07-11 15:30                     ` David Hildenbrand
  2025-07-11 18:49                       ` Hugh Dickins
  0 siblings, 1 reply; 65+ messages in thread
From: David Hildenbrand @ 2025-07-11 15:30 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lance Yang, Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm,
	nvdimm, Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang

On 08.07.25 04:52, Hugh Dickins wrote:
> On Mon, 7 Jul 2025, David Hildenbrand wrote:
>> On 07.07.25 08:31, Hugh Dickins wrote:
>>> On Fri, 4 Jul 2025, David Hildenbrand wrote:
>>>> On 03.07.25 16:44, Lance Yang wrote:
>>>>> On 2025/7/3 20:39, David Hildenbrand wrote:
>>>>>> On 03.07.25 14:34, Lance Yang wrote:
>>>>>>> On Mon, Jun 23, 2025 at 10:04 PM David Hildenbrand <david@redhat.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> On 20.06.25 14:50, Oscar Salvador wrote:
>>>>>>>>> On Tue, Jun 17, 2025 at 05:43:32PM +0200, David Hildenbrand wrote:
>>>>>>>>>> In 2009, we converted a VM_BUG_ON(!pfn_valid(pfn)) to the current
>>>>>>>>>> highest_memmap_pfn sanity check in commit 22b31eec63e5 ("badpage:
>>>>>>>>>> vm_normal_page use print_bad_pte"), because highest_memmap_pfn was
>>>>>>>>>> readily available.
> 
> highest_memmap_pfn was introduced by that commit for this purpose.
> 

Oh, somehow I thought it was around before that.

[...]

>> We catch corruption in a handful of PTE bits, and that's about it. You neither
>> detect corruption of flags nor of PFN bits that result in another valid PFN.
> 
> Of course it's limited in what it can catch (and won't even get called
> if the present bit was not set - a more complete patch might unify with
> those various "Bad swap" messages). Of course. But it's still useful for
> stopping pfn_to_page() veering off the end of the memmap[] (in some configs).

Right, probably in the configs we both don't care that much about 
nowadays :)

> And it's still useful for printing out a series of "Bad page map" messages
> when the page table is corrupted: from which a picture can sometimes be
> built up (isolated instance may just be a bitflip; series of them can
> sometimes show e.g. ascii text, occasionally helpful for debugging).

It's kind of a weird thing, because we do something very different 
compared to other areas where we detect that something serious is going 
wrong (e.g., WARN).

But another thread just sparked a discussion about whether we should 
WARN here, so I'll leave that to the other thread.

> 
>>
>> Corruption of the "special" bit might be fun.
>>
>> When I was able to trigger this during development once, the whole machine
>> went down shortly after -- mostly because of use-after-free of something that
>> is now a page table, which is just bad for both users of such a page!
>>
>> E.g., quit that process and we will happily clear the PTE, corrupting data of
>> the other user. Fun.
>>
>> I'm sure I could find a way to unify the code while printing some comparable
>> message, but this check as it stands is just not worth it IMHO: trying to
>> handle something gracefully that shouldn't happen, when really we cannot
>> handle it gracefully.
> 
> So, you have experience of a time when it didn't help you. Okay. And we
> have had experience of other times when it has helped, if only a little.
> Like with other "Bad page"s: sometimes helpful, often not; but tending to
> build up a big picture from repeated occurrences.

Okay. I was rather curious how often we would actually hit this one 
here: from my recollection, the mapcount underflows are much more 
frequent than the ones from vm_normal_page().

> 
> We continue to disagree. I can't argue more than append the 2.6.29
> commit message, which seems to me as valid now as it was then.

Not that I agree that performing these sanity checks in each and every 
config is something reasonable, but apparently you think that they are 
still useful, 16 years after they were introduced.

So, let me try finding a way to unify the code while keeping that error 
handling for now in place.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-11 15:30                     ` David Hildenbrand
@ 2025-07-11 18:49                       ` Hugh Dickins
  2025-07-11 18:57                         ` David Hildenbrand
  0 siblings, 1 reply; 65+ messages in thread
From: Hugh Dickins @ 2025-07-11 18:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Hugh Dickins, Lance Yang, Oscar Salvador, linux-kernel,
	linux-fsdevel, linux-mm, nvdimm, Andrew Morton, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Dan Williams,
	Alistair Popple, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Zi Yan, Baolin Wang, Lorenzo Stoakes,
	Liam R. Howlett, Nico Pache, Ryan Roberts, Dev Jain, Barry Song,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Jann Horn, Pedro Falcato, Lance Yang

On Fri, 11 Jul 2025, David Hildenbrand wrote:
> On 08.07.25 04:52, Hugh Dickins wrote:
> > 
> > Of course it's limited in what it can catch (and won't even get called
> > if the present bit was not set - a more complete patch might unify with
> > those various "Bad swap" messages). Of course. But it's still useful for
> > stopping pfn_to_page() veering off the end of the memmap[] (in some
> > configs).
> 
> Right, probably in the configs we both don't care that much about nowadays :)

I thought it was the other way round: it's useful for stopping
pfn_to_page() veering off the end of the memmap[] if it's a memory model
where pfn_to_page() is a simple linear conversion.

As with SPARSEMEM_VMEMMAP, which I thought was the favourite nowadays.

If you don't care about that one much (does hotplug prevent it?), then
you do care about the complex pfn_to_page()s, and we should have worried
more when "page++"s got unnecessarily converted to folio_page(folio, i)
a year or two back (I'm thinking of mm/rmap.c, maybe elsewhere).

Hugh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page()
  2025-07-11 18:49                       ` Hugh Dickins
@ 2025-07-11 18:57                         ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-07-11 18:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Lance Yang, Oscar Salvador, linux-kernel, linux-fsdevel, linux-mm,
	nvdimm, Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Alistair Popple,
	Matthew Wilcox, Jan Kara, Alexander Viro, Christian Brauner,
	Zi Yan, Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, Lance Yang

On 11.07.25 20:49, Hugh Dickins wrote:
> On Fri, 11 Jul 2025, David Hildenbrand wrote:
>> On 08.07.25 04:52, Hugh Dickins wrote:
>>>
>>> Of course it's limited in what it can catch (and won't even get called
>>> if the present bit was not set - a more complete patch might unify with
>>> those various "Bad swap" messages). Of course. But it's still useful for
>>> stopping pfn_to_page() veering off the end of the memmap[] (in some
>>> configs).
>>
>> Right, probably in the configs we both don't care that much about nowadays :)
> 
> I thought it was the other way round: it's useful for stopping
> pfn_to_page() veering off the end of the memmap[] if it's a memory model
> where pfn_to_page() is a simple linear conversion.
> 
> As with SPARSEMEM_VMEMMAP, which I thought was the favourite nowadays.

Yes, you're right, I had the !SPARSEMEM model in mind, but obviously, 
the same applies for the SPARSEMEM_VMEMMAP case as well.

Only the SPARSEMEM && !SPARSEMEM_VMEMMAP model is weird. IIRC, it will 
dereference NULL when given a non-existent PFN (__nr_to_section() 
returning NULL and __section_mem_map_addr() not being happy about that).

-- 
Cheers,

David / dhildenb
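
For the curious, the classic-SPARSEMEM pfn_to_page() path, condensed from
include/asm-generic/memory_model.h (slightly simplified):

/* CONFIG_SPARSEMEM && !CONFIG_SPARSEMEM_VMEMMAP: */
#define __pfn_to_page(pfn)					\
({	unsigned long __pfn = (pfn);				\
	struct mem_section *__sec = __pfn_to_section(__pfn);	\
	__section_mem_map_addr(__sec) + __pfn;			\
})

/*
 * For a non-existent PFN, __pfn_to_section() can return NULL (no
 * mem_section was ever allocated for that range), and
 * __section_mem_map_addr(NULL) then dereferences it.
 */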


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-06-25  8:32   ` Oscar Salvador
@ 2025-07-14 12:41     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-07-14 12:41 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato

On 25.06.25 10:32, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:39PM +0200, David Hildenbrand wrote:
>> The huge zero folio is refcounted (+mapcounted -- is that a word?)
>> differently than "normal" folios, similarly (but different) to the ordinary
>> shared zeropage.
>>
>> For this reason, we special-case these pages in
>> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>> still use them (e.g., GUP can still take a reference on them).
>>
>> vm_normal_page_pmd() already filters out the huge zero folio. However,
>> so far we are not marking it as special like we do with the ordinary
>> shared zeropage. Let's mark it as special, so we can further refactor
>> vm_normal_page_pmd() and vm_normal_page().
>>
>> While at it, update the doc regarding the shared zero folios.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> 
> While doing this, would it make sense to update vm_normal_page_pmd()
> comments to reflect that pmd_special() will also catch the huge zero folio?
> It only mentions huge pfnmaps.
> 
> It might not be worth doing since you remove that code later on, but
> maybe if someone stares at this commit alone ...

Yes, it should be removed in this patch, thanks!

-- 
Cheers,

David / dhildenb
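
A sketch of the marking being discussed, close to (but not exactly) the
mm/huge_memory.c helper that installs the huge zero folio:

static void set_huge_zero_folio_sketch(pgtable_t pgtable, struct mm_struct *mm,
		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
		struct folio *zero_folio)
{
	pmd_t entry = mk_pmd(&zero_folio->page, vma->vm_page_prot);

	entry = pmd_mkhuge(entry);
	entry = pmd_mkspecial(entry);	/* the new marking from this patch */
	pgtable_trans_huge_deposit(mm, pmd, pgtable);
	set_pmd_at(mm, haddr, pmd, entry);
	mm_inc_nr_ptes(mm);
}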


^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-06-25  9:34   ` Oscar Salvador
@ 2025-07-14 14:19     ` David Hildenbrand
  0 siblings, 0 replies; 65+ messages in thread
From: David Hildenbrand @ 2025-07-14 14:19 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-kernel, linux-fsdevel, linux-mm, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Alistair Popple, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Zi Yan, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Nico Pache, Ryan Roberts,
	Dev Jain, Barry Song, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Jann Horn, Pedro Falcato,
	David Vrabel

On 25.06.25 11:34, Oscar Salvador wrote:
> On Tue, Jun 17, 2025 at 05:43:45PM +0200, David Hildenbrand wrote:
>> ... and hide it behind a kconfig option. There is really no need for
>> any !xen code to perform this check.
>>
>> The naming is a bit off: we want to find the "normal" page when a PTE
>> was marked "special". So it's really not "finding a special" page.
>>
>> Improve the documentation, and add a comment in the code where XEN ends
>> up performing the pte_mkspecial() through a hypercall. More details can
>> be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
>> special on x86 PV guests").
> 
> Looks good to me.
> Since this seems a "mistake" from the past we don't want to repeat, I wonder
> whether we could seal FIND_NORMAL_PAGE somehow, and scream if someone
> tries to enable it on !XEN environments.

Hm, probably not that easy. I mean, as long as we cannot get rid of the 
XEN stuff, we don't particularly care about new users ... and it seems 
unlikely that we will ever get rid of that XEN stuff :(

-- 
Cheers,

David / dhildenb
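
A sketch of where the renamed hook ends up in the pte_special() path of
vm_normal_page() (simplified; the kconfig guard follows the patch's
naming):

	if (pte_special(pte)) {
#ifdef CONFIG_FIND_NORMAL_PAGE
		/*
		 * XEN gntdev marks grant PTEs special via a hypercall;
		 * the driver can still translate such a PTE back to the
		 * "normal" page backing it.
		 */
		if (vma->vm_ops && vma->vm_ops->find_normal_page)
			return vma->vm_ops->find_normal_page(vma, addr);
#endif
		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
			return NULL;
		if (is_zero_pfn(pfn))
			return NULL;
		print_bad_pte(vma, addr, pte, NULL);
		return NULL;
	}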


^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2025-07-14 14:20 UTC | newest]

Thread overview: 65+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-17 15:43 [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 01/14] mm/memory: drop highest_memmap_pfn sanity check in vm_normal_page() David Hildenbrand
2025-06-20 12:50   ` Oscar Salvador
2025-06-23 14:04     ` David Hildenbrand
2025-06-25  7:54       ` Oscar Salvador
2025-07-03 12:34       ` Lance Yang
2025-07-03 12:39         ` David Hildenbrand
2025-07-03 14:44           ` Lance Yang
2025-07-04 12:40             ` David Hildenbrand
2025-07-07  6:31               ` Hugh Dickins
2025-07-07 13:19                 ` David Hildenbrand
2025-07-08  2:52                   ` Hugh Dickins
2025-07-11 15:30                     ` David Hildenbrand
2025-07-11 18:49                       ` Hugh Dickins
2025-07-11 18:57                         ` David Hildenbrand
2025-06-25  7:55   ` Oscar Salvador
2025-07-03 14:50   ` Lance Yang
2025-06-17 15:43 ` [PATCH RFC 02/14] mm: drop highest_memmap_pfn David Hildenbrand
2025-06-20 13:04   ` Oscar Salvador
2025-06-20 18:11   ` Pedro Falcato
2025-06-17 15:43 ` [PATCH RFC 03/14] mm: compare pfns only if the entry is present when inserting pfns/pages David Hildenbrand
2025-06-20 13:27   ` Oscar Salvador
2025-06-23 19:22     ` David Hildenbrand
2025-06-20 18:24   ` Pedro Falcato
2025-06-23 19:19     ` David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 04/14] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-06-20 14:12   ` Oscar Salvador
2025-07-07  2:48     ` Alistair Popple
2025-06-17 15:43 ` [PATCH RFC 05/14] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-06-20 14:15   ` Oscar Salvador
2025-07-07  2:51   ` Alistair Popple
2025-06-17 15:43 ` [PATCH RFC 06/14] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-06-25  8:15   ` Oscar Salvador
2025-06-25  8:17     ` Oscar Salvador
2025-06-25  8:20   ` Oscar Salvador
2025-06-25  8:59     ` David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 07/14] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-06-24  1:16   ` Alistair Popple
2025-06-25  9:03     ` David Hildenbrand
2025-07-04 13:22       ` David Hildenbrand
2025-07-07 11:50         ` Alistair Popple
2025-06-17 15:43 ` [PATCH RFC 08/14] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-06-25  8:32   ` Oscar Salvador
2025-07-14 12:41     ` David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 09/14] mm/memory: introduce is_huge_zero_pfn() and use it in vm_normal_page_pmd() David Hildenbrand
2025-06-25  8:37   ` Oscar Salvador
2025-06-17 15:43 ` [PATCH RFC 10/14] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-06-25  8:53   ` Oscar Salvador
2025-06-25  8:57     ` David Hildenbrand
2025-06-25  9:20       ` Oscar Salvador
2025-06-25 10:14         ` David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 11/14] mm: remove "horrible special case to handle copy-on-write behaviour" David Hildenbrand
2025-06-25  8:47   ` David Hildenbrand
2025-06-25  9:02     ` Oscar Salvador
2025-06-25  9:04       ` David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 12/14] mm: drop addr parameter from vm_normal_*_pmd() David Hildenbrand
2025-06-17 15:43 ` [PATCH RFC 13/14] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-06-25  9:22   ` Oscar Salvador
2025-06-17 15:43 ` [PATCH RFC 14/14] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-06-25  9:34   ` Oscar Salvador
2025-07-14 14:19     ` David Hildenbrand
2025-06-17 16:18 ` [PATCH RFC 00/14] mm: vm_normal_page*() + CoW PFNMAP improvements David Hildenbrand
2025-06-17 18:25   ` David Hildenbrand
2025-06-25  8:49 ` Lorenzo Stoakes
2025-06-25  8:55   ` David Hildenbrand
