* [PATCH v2 0/9] mm: vm_normal_page*() improvements
@ 2025-07-17 11:52 David Hildenbrand
  2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
                   ` (8 more replies)
  0 siblings, 9 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

Based on today's mm/mm-new, which contains [2].

Clean up and unify the vm_normal_page_*() handling, also marking the huge
zero folio as special in the PMD. Add and use vm_normal_page_pud(), and
clean up the XEN vm_ops->find_special_page mechanism.

There are plans to use vm_normal_page_*() more widely soon.
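
For context, vm_normal_page() is the helper page table walkers use to
translate a present PTE into its "struct page", filtering out "special"
mappings. A minimal sketch of a typical caller (illustrative only; the
function and variable names here are made up):

	/* Sketch: resolve the folio mapped by a present PTE, if any. */
	static struct folio *sketch_pte_to_folio(struct vm_area_struct *vma,
						 unsigned long addr, pte_t *ptep)
	{
		/* Caller is expected to hold the PTE table lock. */
		pte_t ptent = ptep_get(ptep);
		struct page *page;

		if (!pte_present(ptent))
			return NULL;
		/* NULL means "special": PFNMAP, the shared zeropage, ... */
		page = vm_normal_page(vma, addr, ptent);
		if (!page)
			return NULL;
		return page_folio(page);
	}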

Briefly tested on UML (making sure vm_normal_page() still works as expected
without pte_special() support) and on x86-64 with a bunch of tests.
Cross-compiled for a variety of weird archs.

[1] https://lkml.kernel.org/r/20250617154345.2494405-1-david@redhat.com
[2] https://lkml.kernel.org/r/cover.1752499009.git.luizcap@redhat.com

v1 -> v2:
* "mm/memory: convert print_bad_pte() to print_bad_page_map()"
 -> Don't use pgdp_get(), because it's broken on some arm configs
 -> Extend patch description
 -> Don't use pmd_val(pmdp_get()), because that doesn't work on some
    m68k configs
* Added RBs

RFC -> v1:
* Dropped the highest_memmap_pfn removal stuff and instead added
  "mm/memory: convert print_bad_pte() to print_bad_page_map()"
* Dropped "mm: compare pfns only if the entry is present when inserting
  pfns/pages" for now, will probably clean that up separately.
* Dropped "mm: remove "horrible special case to handle copy-on-write
  behaviour"", and "mm: drop addr parameter from vm_normal_*_pmd()" will
  require more thought
* "mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()"
 -> Extend patch description.
* "fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio"
 -> Extend patch description.
* "mm/huge_memory: mark PMD mappings of the huge zero folio special"
 -> Remove comment from vm_normal_page_pmd().
* "mm/memory: factor out common code from vm_normal_page_*()"
 -> Adjust to print_bad_page_map()/highest_memmap_pfn changes.
 -> Add proper kernel doc to all involved functions
* "mm: introduce and use vm_normal_page_pud()"
 -> Adjust to print_bad_page_map() changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Pedro Falcato <pfalcato@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lance Yang <lance.yang@linux.dev>

David Hildenbrand (9):
  mm/huge_memory: move more common code into insert_pmd()
  mm/huge_memory: move more common code into insert_pud()
  mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  mm/huge_memory: mark PMD mappings of the huge zero folio special
  mm/memory: convert print_bad_pte() to print_bad_page_map()
  mm/memory: factor out common code from vm_normal_page_*()
  mm: introduce and use vm_normal_page_pud()
  mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()

 drivers/xen/Kconfig              |   1 +
 drivers/xen/gntdev.c             |   5 +-
 fs/dax.c                         |  47 +----
 include/linux/mm.h               |  20 +-
 mm/Kconfig                       |   2 +
 mm/huge_memory.c                 | 119 ++++-------
 mm/memory.c                      | 346 ++++++++++++++++++++++---------
 mm/pagewalk.c                    |  20 +-
 tools/testing/vma/vma_internal.h |  18 +-
 9 files changed, 343 insertions(+), 235 deletions(-)


base-commit: 760b462b3921c5dc8bfa151d2d27a944e4e96081
-- 
2.50.1




* [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 15:34   ` Lorenzo Stoakes
  2025-07-25  2:47   ` Wei Yang
  2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple

Let's clean it all further up.

No functional change intended.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
 1 file changed, 24 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fe17b0a157cda..1178760d2eda4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1390,15 +1390,25 @@ struct folio_or_pfn {
 	bool is_folio;
 };
 
-static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
-		bool write, pgtable_t pgtable)
+		bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable = NULL;
+	spinlock_t *ptl;
 	pmd_t entry;
 
-	lockdep_assert_held(pmd_lockptr(mm, pmd));
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
 
+	if (arch_needs_pgtable_deposit()) {
+		pgtable = pte_alloc_one(vma->vm_mm);
+		if (!pgtable)
+			return VM_FAULT_OOM;
+	}
+
+	ptl = pmd_lock(mm, pmd);
 	if (!pmd_none(*pmd)) {
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
@@ -1406,15 +1416,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 		if (write) {
 			if (pmd_pfn(*pmd) != pfn) {
 				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
-				return -EEXIST;
+				goto out_unlock;
 			}
 			entry = pmd_mkyoung(*pmd);
 			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
 			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
 				update_mmu_cache_pmd(vma, addr, pmd);
 		}
-
-		return -EEXIST;
+		goto out_unlock;
 	}
 
 	if (fop.is_folio) {
@@ -1435,11 +1444,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (pgtable) {
 		pgtable_trans_huge_deposit(mm, pmd, pgtable);
 		mm_inc_nr_ptes(mm);
+		pgtable = NULL;
 	}
 
 	set_pmd_at(mm, addr, pmd, entry);
 	update_mmu_cache_pmd(vma, addr, pmd);
-	return 0;
+
+out_unlock:
+	spin_unlock(ptl);
+	if (pgtable)
+		pte_free(mm, pgtable);
+	return VM_FAULT_NOPAGE;
 }
 
 /**
@@ -1461,9 +1476,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
 	struct folio_or_pfn fop = {
 		.pfn = pfn,
 	};
-	pgtable_t pgtable = NULL;
-	spinlock_t *ptl;
-	int error;
 
 	/*
 	 * If we had pmd_special, we could avoid all these restrictions,
@@ -1475,25 +1487,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
-	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
-			   pgtable);
-	spin_unlock(ptl);
-	if (error && pgtable)
-		pte_free(vma->vm_mm, pgtable);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
 
@@ -1502,35 +1498,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address & PMD_MASK;
-	struct mm_struct *mm = vma->vm_mm;
 	struct folio_or_pfn fop = {
 		.folio = folio,
 		.is_folio = true,
 	};
-	spinlock_t *ptl;
-	pgtable_t pgtable = NULL;
-	int error;
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
 
 	if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
 		return VM_FAULT_SIGBUS;
 
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
-	ptl = pmd_lock(mm, vmf->pmd);
-	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
-			   write, pgtable);
-	spin_unlock(ptl);
-	if (error && pgtable)
-		pte_free(mm, pgtable);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
 
-- 
2.50.1




* [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
  2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 15:42   ` Lorenzo Stoakes
  2025-07-25  2:56   ` Wei Yang
  2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple

Let's clean it all further up.

No functional change intended.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 36 +++++++++++++-----------------------
 1 file changed, 13 insertions(+), 23 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1178760d2eda4..849feacaf8064 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1518,25 +1518,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
 	return pud;
 }
 
-static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
+static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
 		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
 	pud_t entry;
 
+	if (addr < vma->vm_start || addr >= vma->vm_end)
+		return VM_FAULT_SIGBUS;
+
+	ptl = pud_lock(mm, pud);
 	if (!pud_none(*pud)) {
 		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
 					  fop.pfn;
 
 		if (write) {
 			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
-				return;
+				goto out_unlock;
 			entry = pud_mkyoung(*pud);
 			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
 			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
 				update_mmu_cache_pud(vma, addr, pud);
 		}
-		return;
+		goto out_unlock;
 	}
 
 	if (fop.is_folio) {
@@ -1555,6 +1560,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
 	}
 	set_pud_at(mm, addr, pud, entry);
 	update_mmu_cache_pud(vma, addr, pud);
+out_unlock:
+	spin_unlock(ptl);
+	return VM_FAULT_NOPAGE;
 }
 
 /**
@@ -1576,7 +1584,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
 	struct folio_or_pfn fop = {
 		.pfn = pfn,
 	};
-	spinlock_t *ptl;
 
 	/*
 	 * If we had pud_special, we could avoid all these restrictions,
@@ -1588,16 +1595,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
 
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
-
 	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
 
-	ptl = pud_lock(vma->vm_mm, vmf->pud);
-	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
-	spin_unlock(ptl);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
 
@@ -1614,25 +1614,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address & PUD_MASK;
-	pud_t *pud = vmf->pud;
-	struct mm_struct *mm = vma->vm_mm;
 	struct folio_or_pfn fop = {
 		.folio = folio,
 		.is_folio = true,
 	};
-	spinlock_t *ptl;
-
-	if (addr < vma->vm_start || addr >= vma->vm_end)
-		return VM_FAULT_SIGBUS;
 
 	if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
 		return VM_FAULT_SIGBUS;
 
-	ptl = pud_lock(mm, pud);
-	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
-	spin_unlock(ptl);
-
-	return VM_FAULT_NOPAGE;
+	return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
 }
 EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-- 
2.50.1




* [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
  2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
  2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 15:47   ` Lorenzo Stoakes
  2025-07-25  8:07   ` Wei Yang
  2025-07-17 11:52 ` [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

Just like we do for vmf_insert_page_mkwrite() -> ... ->
insert_page_into_pte_locked() with the shared zeropage, support the
huge zero folio in vmf_insert_folio_pmd().

When (un)mapping the huge zero folio in page tables, we neither
adjust the refcount nor the mapcount, just like for the shared zeropage.

For now, the huge zero folio is not marked as special yet, although
vm_normal_page_pmd() really wants to treat it as special. We'll change
that next.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 849feacaf8064..db08c37b87077 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1429,9 +1429,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		folio_get(fop.folio);
-		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
-		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+		if (!is_huge_zero_folio(fop.folio)) {
+			folio_get(fop.folio);
+			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
+			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
+		}
 	} else {
 		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
 		entry = pmd_mkspecial(entry);
-- 
2.50.1




* [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (2 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 18:09   ` Lorenzo Stoakes
  2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple

Let's convert to vmf_insert_folio_pmd().

There is a theoretical change in behavior: in the unlikely case there is
already something mapped, we'll now still call trace_dax_pmd_load_hole()
and return VM_FAULT_NOPAGE.

Previously, we would have returned VM_FAULT_FALLBACK, and the caller
would have zapped the PMD to try a PTE fault.

However, that behavior was different from other PTE+PMD faults when there
would already be something mapped, and it's not even clear whether it can
be triggered at all.

Assuming the huge zero folio is already mapped: all good, no need to fall
back to PTEs.

Assuming there is already a leaf page table ... the behavior would be
just like when trying to insert a PMD mapping a folio through
dax_fault_iter()->vmf_insert_folio_pmd().

Assuming there is already something else mapped as PMD? It sounds like
a BUG, and the behavior would be just like when trying to insert a PMD
mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().

So, it sounds reasonable to not handle huge zero folios differently
to inserting PMDs mapping folios when there already is something mapped.

Reviewed-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 fs/dax.c | 47 ++++++++++-------------------------------------
 1 file changed, 10 insertions(+), 37 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 4229513806bea..ae90706674a3f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
 		const struct iomap_iter *iter, void **entry)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
-	unsigned long pmd_addr = vmf->address & PMD_MASK;
-	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = mapping->host;
-	pgtable_t pgtable = NULL;
 	struct folio *zero_folio;
-	spinlock_t *ptl;
-	pmd_t pmd_entry;
-	unsigned long pfn;
+	vm_fault_t ret;
 
 	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
 
-	if (unlikely(!zero_folio))
-		goto fallback;
-
-	pfn = page_to_pfn(&zero_folio->page);
-	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
-				  DAX_PMD | DAX_ZERO_PAGE);
-
-	if (arch_needs_pgtable_deposit()) {
-		pgtable = pte_alloc_one(vma->vm_mm);
-		if (!pgtable)
-			return VM_FAULT_OOM;
-	}
-
-	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
-	if (!pmd_none(*(vmf->pmd))) {
-		spin_unlock(ptl);
-		goto fallback;
+	if (unlikely(!zero_folio)) {
+		trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
+		return VM_FAULT_FALLBACK;
 	}
 
-	if (pgtable) {
-		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
-		mm_inc_nr_ptes(vma->vm_mm);
-	}
-	pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
-	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
-	spin_unlock(ptl);
-	trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
-	return VM_FAULT_NOPAGE;
+	*entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
+				  DAX_PMD | DAX_ZERO_PAGE);
 
-fallback:
-	if (pgtable)
-		pte_free(vma->vm_mm, pgtable);
-	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
-	return VM_FAULT_FALLBACK;
+	ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
+	if (ret == VM_FAULT_NOPAGE)
+		trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
+	return ret;
 }
 #else
 static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
-- 
2.50.1




* [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (3 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 18:29   ` Lorenzo Stoakes
  2025-07-28  8:49   ` Wei Yang
  2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

The huge zero folio is refcounted (and mapcounted) differently from
"normal" folios, similar to (but different from) the ordinary shared
zeropage.

For this reason, we special-case these pages in
vm_normal_page*/vm_normal_folio*, and only allow selected callers to
still use them (e.g., GUP can still take a reference on them).

vm_normal_page_pmd() already filters out the huge zero folio. However,
so far we are not marking it as special like we do with the ordinary
shared zeropage. Let's mark it as special, so we can further refactor
vm_normal_page_pmd() and vm_normal_page().

While at it, update the doc regarding the shared zero folios.
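
To illustrate the intended effect (a sketch only; the function name below
is made up), a walker relying on vm_normal_page_pmd() keeps treating the
huge zero folio as "special", now simply keyed off the special bit:

	/* Sketch: PMD mappings of the huge zero folio yield NULL here. */
	static struct page *sketch_pmd_to_page(struct vm_area_struct *vma,
					       unsigned long addr, pmd_t pmd)
	{
		if (!pmd_present(pmd) || !pmd_trans_huge(pmd))
			return NULL;
		/* NULL for special mappings, including the huge zero folio. */
		return vm_normal_page_pmd(vma, addr, pmd);
	}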

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c |  5 ++++-
 mm/memory.c      | 14 +++++++++-----
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db08c37b87077..3f9a27812a590 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
 {
 	pmd_t entry;
 	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
+	entry = pmd_mkspecial(entry);
 	pgtable_trans_huge_deposit(mm, pmd, pgtable);
 	set_pmd_at(mm, haddr, pmd, entry);
 	mm_inc_nr_ptes(mm);
@@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
 	if (fop.is_folio) {
 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
 
-		if (!is_huge_zero_folio(fop.folio)) {
+		if (is_huge_zero_folio(fop.folio)) {
+			entry = pmd_mkspecial(entry);
+		} else {
 			folio_get(fop.folio);
 			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
 			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
diff --git a/mm/memory.c b/mm/memory.c
index 92fd18a5d8d1f..173eb6267e0ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
- * case, NULL is returned here. "Normal" mappings do have a struct page.
+ * case, NULL is returned here. "Normal" mappings do have a struct page and
+ * are ordinarily refcounted.
+ *
+ * Page mappings of the shared zero folios are always considered "special", as
+ * they are not ordinarily refcounted. However, selected page table walkers
+ * (such as GUP) can still identify these mappings and work with the
+ * underlying "struct page".
  *
  * There are 2 broad cases. Firstly, an architecture may define a pte_special()
  * pte bit, in which case this function is trivial. Secondly, an architecture
@@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  *
  * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
  * page" backing, however the difference is that _all_ pages with a struct
- * page (that is, those where pfn_valid is true) are refcounted and considered
- * normal pages by the VM. The only exception are zeropages, which are
- * *never* refcounted.
+ * page (that is, those where pfn_valid is true, except the shared zero
+ * folios) are refcounted and considered normal pages by the VM.
  *
  * The disadvantage is that pages are refcounted (which can be slower and
  * simply not an option for some PFNMAP users). The advantage is that we
@@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	/* Currently it's only used for huge pfnmaps */
 	if (unlikely(pmd_special(pmd)))
 		return NULL;
 
-- 
2.50.1




* [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (4 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 19:17   ` Lorenzo Stoakes
  2025-07-17 22:06   ` Demi Marie Obenour
  2025-07-17 11:52 ` [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

print_bad_pte() looks like something that should actually be a WARN
or similar, but historically it has proven useful for detecting page table
corruption even on production systems -- report the issue and keep the
system running, to make it easier to detect what is actually going wrong
(e.g., multiple such messages might shed some light).

As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
to take care of print_bad_pte() as well.

Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
implementation and renaming the function -- we'll rename it to what
we actually print: bad (page) mappings. Maybe it should be called
"print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
because the assumption is that we are dealing with some (previously)
present page table entry that got corrupted in weird ways.

Whether it is a PTE or something else will usually become obvious from the
page table dump or from the dumped stack. If ever required in the future,
we could pass the entry level type similar to "enum rmap_level". For now,
let's keep it simple.

To make the function a bit more readable, factor out the ratelimit check
into is_bad_page_map_ratelimited() and place the dumping of page
table content into __dump_bad_page_map_pgtable(). We'll now dump
information from each level in a single line, and just stop the table
walk once we hit something that is not a present page table.

Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
for vm_normal_page(), now that we have a function that can handle it.

The report will now look something like (dumping pgd to pmd values):

[   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
[   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
[   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067

Not using pgdp_get(), because that does not work properly on some arm
configs where pgd_t is an array. Note that we are dumping all levels
even when levels are folded for simplicity.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 94 insertions(+), 26 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 173eb6267e0ac..08d16ed7b4cc7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 			add_mm_counter(mm, i, rss[i]);
 }
 
-/*
- * This function is called to print an error when a bad pte
- * is found. For example, we might have a PFN-mapped pte in
- * a region that doesn't allow it.
- *
- * The calling function must still handle the error.
- */
-static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
-			  pte_t pte, struct page *page)
+static bool is_bad_page_map_ratelimited(void)
 {
-	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
-	p4d_t *p4d = p4d_offset(pgd, addr);
-	pud_t *pud = pud_offset(p4d, addr);
-	pmd_t *pmd = pmd_offset(pud, addr);
-	struct address_space *mapping;
-	pgoff_t index;
 	static unsigned long resume;
 	static unsigned long nr_shown;
 	static unsigned long nr_unshown;
@@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	if (nr_shown == 60) {
 		if (time_before(jiffies, resume)) {
 			nr_unshown++;
-			return;
+			return true;
 		}
 		if (nr_unshown) {
 			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
@@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 	}
 	if (nr_shown++ == 0)
 		resume = jiffies + 60 * HZ;
+	return false;
+}
+
+static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
+{
+	unsigned long long pgdv, p4dv, pudv, pmdv;
+	p4d_t p4d, *p4dp;
+	pud_t pud, *pudp;
+	pmd_t pmd, *pmdp;
+	pgd_t *pgdp;
+
+	/*
+	 * This looks like a fully lockless walk, however, the caller is
+	 * expected to hold the leaf page table lock in addition to other
+	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
+	 * content while any concurrent modifications should be completely
+	 * prevented.
+	 */
+	pgdp = pgd_offset(mm, addr);
+	pgdv = pgd_val(*pgdp);
+
+	if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
+		pr_alert("pgd:%08llx\n", pgdv);
+		return;
+	}
+
+	p4dp = p4d_offset(pgdp, addr);
+	p4d = p4dp_get(p4dp);
+	p4dv = p4d_val(p4d);
+
+	if (!p4d_present(p4d) || p4d_leaf(p4d)) {
+		pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
+		return;
+	}
+
+	pudp = pud_offset(p4dp, addr);
+	pud = pudp_get(pudp);
+	pudv = pud_val(pud);
+
+	if (!pud_present(pud) || pud_leaf(pud)) {
+		pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
+		return;
+	}
+
+	pmdp = pmd_offset(pudp, addr);
+	pmd = pmdp_get(pmdp);
+	pmdv = pmd_val(pmd);
+
+	/*
+	 * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
+	 * because the table should already be mapped by the caller and
+	 * doing another map would be bad. print_bad_page_map() should
+	 * already take care of printing the PTE.
+	 */
+	pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
+		 p4dv, pudv, pmdv);
+}
+
+/*
+ * This function is called to print an error when a bad page table entry (e.g.,
+ * corrupted page table entry) is found. For example, we might have a
+ * PFN-mapped pte in a region that doesn't allow it.
+ *
+ * The calling function must still handle the error.
+ */
+static void print_bad_page_map(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long long entry, struct page *page)
+{
+	struct address_space *mapping;
+	pgoff_t index;
+
+	if (is_bad_page_map_ratelimited())
+		return;
 
 	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
 	index = linear_page_index(vma, addr);
 
-	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
-		 current->comm,
-		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
+	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
+	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
 	if (page)
-		dump_page(page, "bad pte");
+		dump_page(page, "bad page map");
 	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
 		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
 	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
@@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (is_zero_pfn(pfn))
 			return NULL;
 
-		print_bad_pte(vma, addr, pte, NULL);
+		print_bad_page_map(vma, addr, pte_val(pte), NULL);
 		return NULL;
 	}
 
@@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 
 check_pfn:
 	if (unlikely(pfn > highest_memmap_pfn)) {
-		print_bad_pte(vma, addr, pte, NULL);
+		print_bad_page_map(vma, addr, pte_val(pte), NULL);
 		return NULL;
 	}
 
@@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 {
 	unsigned long pfn = pmd_pfn(pmd);
 
-	if (unlikely(pmd_special(pmd)))
+	if (unlikely(pmd_special(pmd))) {
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+		if (is_huge_zero_pfn(pfn))
+			return NULL;
+
+		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
 		return NULL;
+	}
 
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -674,8 +739,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 
 	if (is_huge_zero_pfn(pfn))
 		return NULL;
-	if (unlikely(pfn > highest_memmap_pfn))
+	if (unlikely(pfn > highest_memmap_pfn)) {
+		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
 		return NULL;
+	}
 
 	/*
 	 * NOTE! We still have PageReserved() pages in the page tables.
@@ -1509,7 +1576,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
 		folio_remove_rmap_ptes(folio, page, nr, vma);
 
 		if (unlikely(folio_mapcount(folio) < 0))
-			print_bad_pte(vma, addr, ptent, page);
+			print_bad_page_map(vma, addr, pte_val(ptent), page);
 	}
 	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
 		*force_flush = true;
@@ -4507,7 +4574,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		} else if (is_pte_marker_entry(entry)) {
 			ret = handle_pte_marker(vmf);
 		} else {
-			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
+			print_bad_page_map(vma, vmf->address,
+					   pte_val(vmf->orig_pte), NULL);
 			ret = VM_FAULT_SIGBUS;
 		}
 		goto out;
-- 
2.50.1




* [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (5 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 19:51   ` Lorenzo Stoakes
  2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
  2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
  8 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

Let's reduce the code duplication and factor out the non-pte/pmd related
magic into vm_normal_page_pfn().

To keep it simpler, check the pfn against both zero folios. We could
optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
case, it's not a compelling micro-optimization.

With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
really.

It's a good question whether we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
scenario in the PMD case in practice: but it doesn't really matter, as
it's now all unified in vm_normal_page_pfn().

Add kerneldoc for all involved functions.

No functional change intended.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 183 +++++++++++++++++++++++++++++++---------------------
 1 file changed, 109 insertions(+), 74 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 08d16ed7b4cc7..c43ae5e4d7644 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -590,8 +590,13 @@ static void print_bad_page_map(struct vm_area_struct *vma,
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 }
 
-/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+/**
+ * vm_normal_page_pfn() - Get the "struct page" associated with a PFN in a
+ *			  non-special page table entry.
+ * @vma: The VMA mapping the @pfn.
+ * @addr: The address where the @pfn is mapped.
+ * @pfn: The PFN.
+ * @entry: The page table entry value for error reporting purposes.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
@@ -603,10 +608,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
  * (such as GUP) can still identify these mappings and work with the
  * underlying "struct page".
  *
- * There are 2 broad cases. Firstly, an architecture may define a pte_special()
- * pte bit, in which case this function is trivial. Secondly, an architecture
- * may not have a spare pte bit, which requires a more complicated scheme,
- * described below.
+ * There are 2 broad cases. Firstly, an architecture may define a "special"
+ * page table entry bit (e.g., pte_special()), in which case this function is
+ * trivial. Secondly, an architecture may not have a spare page table
+ * entry bit, which requires a more complicated scheme, described below.
  *
  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
  * special mapping (even if there are underlying and valid "struct pages").
@@ -639,15 +644,72 @@ static void print_bad_page_map(struct vm_area_struct *vma,
  * don't have to follow the strict linearity rule of PFNMAP mappings in
  * order to support COWable mappings.
  *
+ * This function is not expected to be called for obviously special mappings:
+ * when the page table entry has the "special" bit set.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
+ */
+static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long long entry)
+{
+	/*
+	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
+	 * (incl. shared zero folios) are marked accordingly and are handled
+	 * by the caller.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
+		if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
+			if (vma->vm_flags & VM_MIXEDMAP) {
+				/* If it has a "struct page", it's "normal". */
+				if (!pfn_valid(pfn))
+					return NULL;
+			} else {
+				unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
+
+				/* Only CoW'ed anon folios are "normal". */
+				if (pfn == vma->vm_pgoff + off)
+					return NULL;
+				if (!is_cow_mapping(vma->vm_flags))
+					return NULL;
+			}
+		}
+
+		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
+			return NULL;
+	}
+
+	/* Cheap check for corrupted page table entries. */
+	if (pfn > highest_memmap_pfn) {
+		print_bad_page_map(vma, addr, entry, NULL);
+		return NULL;
+	}
+	/*
+	 * NOTE! We still have PageReserved() pages in the page tables.
+	 * For example, VDSO mappings can cause them to exist.
+	 */
+	VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
+	return pfn_to_page(pfn);
+}
+
+/**
+ * vm_normal_page() - Get the "struct page" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct page" associated with a PTE. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
  */
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			    pte_t pte)
 {
 	unsigned long pfn = pte_pfn(pte);
 
-	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
-		if (likely(!pte_special(pte)))
-			goto check_pfn;
+	if (unlikely(pte_special(pte))) {
 		if (vma->vm_ops && vma->vm_ops->find_special_page)
 			return vma->vm_ops->find_special_page(vma, addr);
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
@@ -658,44 +720,21 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		print_bad_page_map(vma, addr, pte_val(pte), NULL);
 		return NULL;
 	}
-
-	/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
-
-	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-		if (vma->vm_flags & VM_MIXEDMAP) {
-			if (!pfn_valid(pfn))
-				return NULL;
-			if (is_zero_pfn(pfn))
-				return NULL;
-			goto out;
-		} else {
-			unsigned long off;
-			off = (addr - vma->vm_start) >> PAGE_SHIFT;
-			if (pfn == vma->vm_pgoff + off)
-				return NULL;
-			if (!is_cow_mapping(vma->vm_flags))
-				return NULL;
-		}
-	}
-
-	if (is_zero_pfn(pfn))
-		return NULL;
-
-check_pfn:
-	if (unlikely(pfn > highest_memmap_pfn)) {
-		print_bad_page_map(vma, addr, pte_val(pte), NULL);
-		return NULL;
-	}
-
-	/*
-	 * NOTE! We still have PageReserved() pages in the page tables.
-	 * eg. VDSO mappings can cause them to exist.
-	 */
-out:
-	VM_WARN_ON_ONCE(is_zero_pfn(pfn));
-	return pfn_to_page(pfn);
+	return vm_normal_page_pfn(vma, addr, pfn, pte_val(pte));
 }
 
+/**
+ * vm_normal_folio() - Get the "struct folio" associated with a PTE
+ * @vma: The VMA mapping the @pte.
+ * @addr: The address where the @pte is mapped.
+ * @pte: The PTE.
+ *
+ * Get the "struct folio" associated with a PTE. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
+ */
 struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 			    pte_t pte)
 {
@@ -707,6 +746,18 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
 }
 
 #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+/**
+ * vm_normal_page_pmd() - Get the "struct page" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct page" associated with a PMD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
+ */
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd)
 {
@@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
 		return NULL;
 	}
-
-	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
-		if (vma->vm_flags & VM_MIXEDMAP) {
-			if (!pfn_valid(pfn))
-				return NULL;
-			goto out;
-		} else {
-			unsigned long off;
-			off = (addr - vma->vm_start) >> PAGE_SHIFT;
-			if (pfn == vma->vm_pgoff + off)
-				return NULL;
-			if (!is_cow_mapping(vma->vm_flags))
-				return NULL;
-		}
-	}
-
-	if (is_huge_zero_pfn(pfn))
-		return NULL;
-	if (unlikely(pfn > highest_memmap_pfn)) {
-		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
-		return NULL;
-	}
-
-	/*
-	 * NOTE! We still have PageReserved() pages in the page tables.
-	 * eg. VDSO mappings can cause them to exist.
-	 */
-out:
-	return pfn_to_page(pfn);
+	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
 }
 
+/**
+ * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD
+ * @vma: The VMA mapping the @pmd.
+ * @addr: The address where the @pmd is mapped.
+ * @pmd: The PMD.
+ *
+ * Get the "struct folio" associated with a PMD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
+ */
 struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t pmd)
 {
-- 
2.50.1




* [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (6 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 20:03   ` Lorenzo Stoakes
  2025-07-29  7:52   ` Wei Yang
  2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

Let's introduce vm_normal_page_pud(), which ends up being fairly simple
because of our new common helpers and there not being a PUD-sized zero
folio.

Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
structuring the code like the other (pmd/pte) cases. Defer
introducing vm_normal_folio_pud() until really used.

Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 27 +++++++++++++++++++++++++++
 mm/pagewalk.c      | 20 ++++++++++----------
 3 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index abc47f1f307fb..0eb991262fbbf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2349,6 +2349,8 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 				  unsigned long addr, pmd_t pmd);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
+struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
+		pud_t pud);
 
 void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
diff --git a/mm/memory.c b/mm/memory.c
index c43ae5e4d7644..00a0d7ae3ba4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -796,6 +796,33 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
 		return page_folio(page);
 	return NULL;
 }
+
+/**
+ * vm_normal_page_pud() - Get the "struct page" associated with a PUD
+ * @vma: The VMA mapping the @pud.
+ * @addr: The address where the @pud is mapped.
+ * @pud: The PUD.
+ *
+ * Get the "struct page" associated with a PUD. See vm_normal_page_pfn()
+ * for details.
+ *
+ * Return: Returns the "struct page" if this is a "normal" mapping. Returns
+ *	   NULL if this is a "special" mapping.
+ */
+struct page *vm_normal_page_pud(struct vm_area_struct *vma,
+		unsigned long addr, pud_t pud)
+{
+	unsigned long pfn = pud_pfn(pud);
+
+	if (unlikely(pud_special(pud))) {
+		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+			return NULL;
+
+		print_bad_page_map(vma, addr, pud_val(pud), NULL);
+		return NULL;
+	}
+	return vm_normal_page_pfn(vma, addr, pfn, pud_val(pud));
+}
 #endif
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 648038247a8d2..c6753d370ff4e 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 		fw->pudp = pudp;
 		fw->pud = pud;
 
-		/*
-		 * TODO: FW_MIGRATION support for PUD migration entries
-		 * once there are relevant users.
-		 */
-		if (!pud_present(pud) || pud_special(pud)) {
+		if (pud_none(pud)) {
 			spin_unlock(ptl);
 			goto not_found;
-		} else if (!pud_leaf(pud)) {
+		} else if (pud_present(pud) && !pud_leaf(pud)) {
 			spin_unlock(ptl);
 			goto pmd_table;
+		} else if (pud_present(pud)) {
+			page = vm_normal_page_pud(vma, addr, pud);
+			if (page)
+				goto found;
 		}
 		/*
-		 * TODO: vm_normal_page_pud() will be handy once we want to
-		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
+		 * TODO: FW_MIGRATION support for PUD migration entries
+		 * once there are relevant users.
 		 */
-		page = pud_page(pud);
-		goto found;
+		spin_unlock(ptl);
+		goto not_found;
 	}
 
 pmd_table:
-- 
2.50.1




* [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
                   ` (7 preceding siblings ...)
  2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-07-17 11:52 ` David Hildenbrand
  2025-07-17 20:07   ` Lorenzo Stoakes
  2025-07-29  7:53   ` Wei Yang
  8 siblings, 2 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 11:52 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, David Hildenbrand,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, David Vrabel

... and hide it behind a kconfig option. There is really no need for
any !xen code to perform this check.

The naming is a bit off: we want to find the "normal" page when a PTE
was marked "special". So it's really not "finding a special" page.

Improve the documentation, and add a comment in the code where XEN ends
up performing the pte_mkspecial() through a hypercall. More details can
be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
special on x86 PV guests").

Cc: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/xen/Kconfig              |  1 +
 drivers/xen/gntdev.c             |  5 +++--
 include/linux/mm.h               | 18 +++++++++++++-----
 mm/Kconfig                       |  2 ++
 mm/memory.c                      | 12 ++++++++++--
 tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
 6 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 24f485827e039..f9a35ed266ecf 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -138,6 +138,7 @@ config XEN_GNTDEV
 	depends on XEN
 	default m
 	select MMU_NOTIFIER
+	select FIND_NORMAL_PAGE
 	help
 	  Allows userspace processes to use grants.
 
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 61faea1f06630..d1bc0dae2cdf9 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -309,6 +309,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
 	BUG_ON(pgnr >= map->count);
 	pte_maddr = arbitrary_virt_to_machine(pte).maddr;
 
+	/* Note: this will perform a pte_mkspecial() through the hypercall. */
 	gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
 			  map->grants[pgnr].ref,
 			  map->grants[pgnr].domid);
@@ -516,7 +517,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
 	gntdev_put_map(priv, map);
 }
 
-static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
+static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
 						 unsigned long addr)
 {
 	struct gntdev_grant_map *map = vma->vm_private_data;
@@ -527,7 +528,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
 static const struct vm_operations_struct gntdev_vmops = {
 	.open = gntdev_vma_open,
 	.close = gntdev_vma_close,
-	.find_special_page = gntdev_vma_find_special_page,
+	.find_normal_page = gntdev_vma_find_normal_page,
 };
 
 /* ------------------------------------------------------------------ */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0eb991262fbbf..036800514aa90 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -648,13 +648,21 @@ struct vm_operations_struct {
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr, pgoff_t *ilx);
 #endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
-	 * (using pte_page()) would not find the correct page.
+	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+	 * allows for returning a "normal" page from vm_normal_page() even
+	 * though the PTE indicates that the "struct page" either does not exist
+	 * or should not be touched: "special".
+	 *
+	 * Do not add new users: this really only works when a "normal" page
+	 * was mapped, but then the PTE got changed to something weird (+
+	 * marked special) that would not make pte_pfn() identify the originally
+	 * inserted page.
 	 */
-	struct page *(*find_special_page)(struct vm_area_struct *vma,
-					  unsigned long addr);
+	struct page *(*find_normal_page)(struct vm_area_struct *vma,
+					 unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 };
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/mm/Kconfig b/mm/Kconfig
index 0287e8d94aea7..82c281b4f6937 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1397,6 +1397,8 @@ config PT_RECLAIM
 
 	  Note: now only empty user PTE page table pages will be reclaimed.
 
+config FIND_NORMAL_PAGE
+	def_bool n
 
 source "mm/damon/Kconfig"
 
diff --git a/mm/memory.c b/mm/memory.c
index 00a0d7ae3ba4a..52804ca343261 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -613,6 +613,12 @@ static void print_bad_page_map(struct vm_area_struct *vma,
  * trivial. Secondly, an architecture may not have a spare page table
  * entry bit, which requires a more complicated scheme, described below.
  *
+ * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on
+ * page table entries that actually map "normal" pages: however, that page
+ * cannot be looked up through the PFN stored in the page table entry, but
+ * instead will be looked up through vm_ops->find_normal_page(). So far, this
+ * only applies to PTEs.
+ *
  * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
  * special mapping (even if there are underlying and valid "struct pages").
  * COWed pages of a VM_PFNMAP are always normal.
@@ -710,8 +716,10 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	unsigned long pfn = pte_pfn(pte);
 
 	if (unlikely(pte_special(pte))) {
-		if (vma->vm_ops && vma->vm_ops->find_special_page)
-			return vma->vm_ops->find_special_page(vma, addr);
+#ifdef CONFIG_FIND_NORMAL_PAGE
+		if (vma->vm_ops && vma->vm_ops->find_normal_page)
+			return vma->vm_ops->find_normal_page(vma, addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			return NULL;
 		if (is_zero_pfn(pfn))
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 0fe52fd6782bf..8646af15a5fc0 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -467,13 +467,21 @@ struct vm_operations_struct {
 	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
 					unsigned long addr, pgoff_t *ilx);
 #endif
+#ifdef CONFIG_FIND_NORMAL_PAGE
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
-	 * (using pte_page()) would not find the correct page.
+	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
+	 * allows for returning a "normal" page from vm_normal_page() even
+	 * though the PTE indicates that the "struct page" either does not exist
+	 * or should not be touched: "special".
+	 *
+	 * Do not add new users: this really only works when a "normal" page
+	 * was mapped, but then the PTE got changed to something weird (+
+	 * marked special) that would not make pte_pfn() identify the originally
+	 * inserted page.
 	 */
-	struct page *(*find_special_page)(struct vm_area_struct *vma,
-					  unsigned long addr);
+	struct page *(*find_normal_page)(struct vm_area_struct *vma,
+					 unsigned long addr);
+#endif /* CONFIG_FIND_NORMAL_PAGE */
 };
 
 struct vm_unmapped_area_info {
-- 
2.50.1
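
(Illustrative only, not part of the patch: a minimal sketch of how another
driver could wire up the new hook, assuming its Kconfig entry selects
FIND_NORMAL_PAGE like XEN_GNTDEV above; all demo_* names are made up.)

#include <linux/mm.h>

/* Pages the driver inserted at mmap time, remembered in vm_private_data. */
struct demo_map {
	struct page **pages;
	unsigned long count;
};

/*
 * The PTEs were marked special when the pages were mapped, so
 * vm_normal_page() cannot rely on pte_pfn(); hand it the page we remembered.
 */
static struct page *demo_vma_find_normal_page(struct vm_area_struct *vma,
					      unsigned long addr)
{
	struct demo_map *map = vma->vm_private_data;
	unsigned long pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;

	if (!map || pgoff >= map->count)
		return NULL;
	return map->pages[pgoff];
}

static const struct vm_operations_struct demo_vmops = {
	.find_normal_page = demo_vma_find_normal_page,
};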



^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd()
  2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
@ 2025-07-17 15:34   ` Lorenzo Stoakes
  2025-07-25  2:47   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 15:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang, Alistair Popple

On Thu, Jul 17, 2025 at 01:52:04PM +0200, David Hildenbrand wrote:
> Let's clean it all further up.
>
> No functional change intended.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Very nice!

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 72 ++++++++++++++++--------------------------------
>  1 file changed, 24 insertions(+), 48 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fe17b0a157cda..1178760d2eda4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1390,15 +1390,25 @@ struct folio_or_pfn {
>  	bool is_folio;
>  };
>
> -static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> +static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		pmd_t *pmd, struct folio_or_pfn fop, pgprot_t prot,
> -		bool write, pgtable_t pgtable)
> +		bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> +	pgtable_t pgtable = NULL;
> +	spinlock_t *ptl;
>  	pmd_t entry;
>
> -	lockdep_assert_held(pmd_lockptr(mm, pmd));
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return VM_FAULT_SIGBUS;
>
> +	if (arch_needs_pgtable_deposit()) {
> +		pgtable = pte_alloc_one(vma->vm_mm);
> +		if (!pgtable)
> +			return VM_FAULT_OOM;
> +	}
> +
> +	ptl = pmd_lock(mm, pmd);
>  	if (!pmd_none(*pmd)) {
>  		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>  					  fop.pfn;
> @@ -1406,15 +1416,14 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		if (write) {
>  			if (pmd_pfn(*pmd) != pfn) {
>  				WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
> -				return -EEXIST;
> +				goto out_unlock;
>  			}
>  			entry = pmd_mkyoung(*pmd);
>  			entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
>  			if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
>  				update_mmu_cache_pmd(vma, addr, pmd);
>  		}
> -
> -		return -EEXIST;
> +		goto out_unlock;
>  	}
>
>  	if (fop.is_folio) {
> @@ -1435,11 +1444,17 @@ static int insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (pgtable) {
>  		pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  		mm_inc_nr_ptes(mm);
> +		pgtable = NULL;
>  	}
>
>  	set_pmd_at(mm, addr, pmd, entry);
>  	update_mmu_cache_pmd(vma, addr, pmd);
> -	return 0;
> +
> +out_unlock:
> +	spin_unlock(ptl);
> +	if (pgtable)
> +		pte_free(mm, pgtable);
> +	return VM_FAULT_NOPAGE;
>  }
>
>  /**
> @@ -1461,9 +1476,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
>  	struct folio_or_pfn fop = {
>  		.pfn = pfn,
>  	};
> -	pgtable_t pgtable = NULL;
> -	spinlock_t *ptl;
> -	int error;
>
>  	/*
>  	 * If we had pmd_special, we could avoid all these restrictions,
> @@ -1475,25 +1487,9 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, unsigned long pfn,
>  						(VM_PFNMAP|VM_MIXEDMAP));
>  	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
>
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
> -
> -	if (arch_needs_pgtable_deposit()) {
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (!pgtable)
> -			return VM_FAULT_OOM;
> -	}
> -
>  	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>
> -	ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> -	error = insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write,
> -			   pgtable);
> -	spin_unlock(ptl);
> -	if (error && pgtable)
> -		pte_free(vma->vm_mm, pgtable);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pmd(vma, addr, vmf->pmd, fop, pgprot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
>
> @@ -1502,35 +1498,15 @@ vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	unsigned long addr = vmf->address & PMD_MASK;
> -	struct mm_struct *mm = vma->vm_mm;
>  	struct folio_or_pfn fop = {
>  		.folio = folio,
>  		.is_folio = true,
>  	};
> -	spinlock_t *ptl;
> -	pgtable_t pgtable = NULL;
> -	int error;
> -
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
>
>  	if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
>  		return VM_FAULT_SIGBUS;
>
> -	if (arch_needs_pgtable_deposit()) {
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (!pgtable)
> -			return VM_FAULT_OOM;
> -	}
> -
> -	ptl = pmd_lock(mm, vmf->pmd);
> -	error = insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot,
> -			   write, pgtable);
> -	spin_unlock(ptl);
> -	if (error && pgtable)
> -		pte_free(mm, pgtable);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pmd(vma, addr, vmf->pmd, fop, vma->vm_page_prot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud()
  2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
@ 2025-07-17 15:42   ` Lorenzo Stoakes
  2025-07-25  2:56   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 15:42 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang, Alistair Popple

On Thu, Jul 17, 2025 at 01:52:05PM +0200, David Hildenbrand wrote:
> Let's clean it all further up.
>
> No functional change intended.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 36 +++++++++++++-----------------------
>  1 file changed, 13 insertions(+), 23 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1178760d2eda4..849feacaf8064 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1518,25 +1518,30 @@ static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
>  	return pud;
>  }
>
> -static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
> +static vm_fault_t insert_pud(struct vm_area_struct *vma, unsigned long addr,
>  		pud_t *pud, struct folio_or_pfn fop, pgprot_t prot, bool write)
>  {
>  	struct mm_struct *mm = vma->vm_mm;
> +	spinlock_t *ptl;
>  	pud_t entry;
>
> +	if (addr < vma->vm_start || addr >= vma->vm_end)
> +		return VM_FAULT_SIGBUS;
> +
> +	ptl = pud_lock(mm, pud);
>  	if (!pud_none(*pud)) {
>  		const unsigned long pfn = fop.is_folio ? folio_pfn(fop.folio) :
>  					  fop.pfn;
>
>  		if (write) {
>  			if (WARN_ON_ONCE(pud_pfn(*pud) != pfn))
> -				return;
> +				goto out_unlock;
>  			entry = pud_mkyoung(*pud);
>  			entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
>  			if (pudp_set_access_flags(vma, addr, pud, entry, 1))
>  				update_mmu_cache_pud(vma, addr, pud);
>  		}
> -		return;
> +		goto out_unlock;
>  	}
>
>  	if (fop.is_folio) {
> @@ -1555,6 +1560,9 @@ static void insert_pud(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	set_pud_at(mm, addr, pud, entry);
>  	update_mmu_cache_pud(vma, addr, pud);
> +out_unlock:
> +	spin_unlock(ptl);
> +	return VM_FAULT_NOPAGE;
>  }
>
>  /**
> @@ -1576,7 +1584,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
>  	struct folio_or_pfn fop = {
>  		.pfn = pfn,
>  	};
> -	spinlock_t *ptl;
>
>  	/*
>  	 * If we had pud_special, we could avoid all these restrictions,
> @@ -1588,16 +1595,9 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, unsigned long pfn,
>  						(VM_PFNMAP|VM_MIXEDMAP));
>  	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
>
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
> -
>  	pfnmap_setup_cachemode_pfn(pfn, &pgprot);
>
> -	ptl = pud_lock(vma->vm_mm, vmf->pud);
> -	insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
> -	spin_unlock(ptl);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pud(vma, addr, vmf->pud, fop, pgprot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
>
> @@ -1614,25 +1614,15 @@ vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio,
>  {
>  	struct vm_area_struct *vma = vmf->vma;
>  	unsigned long addr = vmf->address & PUD_MASK;
> -	pud_t *pud = vmf->pud;
> -	struct mm_struct *mm = vma->vm_mm;
>  	struct folio_or_pfn fop = {
>  		.folio = folio,
>  		.is_folio = true,
>  	};
> -	spinlock_t *ptl;
> -
> -	if (addr < vma->vm_start || addr >= vma->vm_end)
> -		return VM_FAULT_SIGBUS;
>
>  	if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
>  		return VM_FAULT_SIGBUS;
>
> -	ptl = pud_lock(mm, pud);
> -	insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
> -	spin_unlock(ptl);
> -
> -	return VM_FAULT_NOPAGE;
> +	return insert_pud(vma, addr, vmf->pud, fop, vma->vm_page_prot, write);
>  }
>  EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
>  #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
@ 2025-07-17 15:47   ` Lorenzo Stoakes
  2025-07-25  8:07   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 15:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:06PM +0200, David Hildenbrand wrote:
> Just like we do for vmf_insert_page_mkwrite() -> ... ->
> insert_page_into_pte_locked() with the shared zeropage, support the
> huge zero folio in vmf_insert_folio_pmd().
>
> When (un)mapping the huge zero folio in page tables, we neither
> adjust the refcount nor the mapcount, just like for the shared zeropage.
>
> For now, the huge zero folio is not marked as special yet, although
> vm_normal_page_pmd() really wants to treat it as special. We'll change
> that next.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 849feacaf8064..db08c37b87077 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1429,9 +1429,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (fop.is_folio) {
>  		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>
> -		folio_get(fop.folio);
> -		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> -		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +		if (!is_huge_zero_folio(fop.folio)) {
> +			folio_get(fop.folio);
> +			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> +			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> +		}
>  	} else {
>  		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
>  		entry = pmd_mkspecial(entry);
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio
  2025-07-17 11:52 ` [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
@ 2025-07-17 18:09   ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 18:09 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang, Alistair Popple

On Thu, Jul 17, 2025 at 01:52:07PM +0200, David Hildenbrand wrote:
> Let's convert to vmf_insert_folio_pmd().
>
> There is a theoretical change in behavior: in the unlikely case there is
> already something mapped, we'll now still call trace_dax_pmd_load_hole()
> and return VM_FAULT_NOPAGE.
>
> Previously, we would have returned VM_FAULT_FALLBACK, and the caller
> would have zapped the PMD to try a PTE fault.
>
> However, that behavior was different to other PTE+PMD faults, when there
> would already be something mapped, and it's not even clear if it could
> be triggered.
>
> Assuming the huge zero folio is already mapped, all good, no need to
> fallback to PTEs.
>
> Assuming there is already a leaf page table ... the behavior would be
> just like when trying to insert a PMD mapping a folio through
> dax_fault_iter()->vmf_insert_folio_pmd().
>
> Assuming there is already something else mapped as PMD? It sounds like
> a BUG, and the behavior would be just like when trying to insert a PMD
> mapping a folio through dax_fault_iter()->vmf_insert_folio_pmd().
>
> So, it sounds reasonable to not handle huge zero folios differently
> to inserting PMDs mapping folios when there already is something mapped.

Sounds reasonable.

>
> Reviewed-by: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I've carefully checked through this and am happy that this is right. This is a
really nice cleanup :)

So:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  fs/dax.c | 47 ++++++++++-------------------------------------
>  1 file changed, 10 insertions(+), 37 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 4229513806bea..ae90706674a3f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1375,51 +1375,24 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>  		const struct iomap_iter *iter, void **entry)
>  {
>  	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> -	unsigned long pmd_addr = vmf->address & PMD_MASK;
> -	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = mapping->host;
> -	pgtable_t pgtable = NULL;
>  	struct folio *zero_folio;
> -	spinlock_t *ptl;
> -	pmd_t pmd_entry;
> -	unsigned long pfn;
> +	vm_fault_t ret;
>
>  	zero_folio = mm_get_huge_zero_folio(vmf->vma->vm_mm);
>
> -	if (unlikely(!zero_folio))
> -		goto fallback;
> -
> -	pfn = page_to_pfn(&zero_folio->page);
> -	*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn,
> -				  DAX_PMD | DAX_ZERO_PAGE);
> -
> -	if (arch_needs_pgtable_deposit()) {
> -		pgtable = pte_alloc_one(vma->vm_mm);
> -		if (!pgtable)
> -			return VM_FAULT_OOM;
> -	}
> -
> -	ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> -	if (!pmd_none(*(vmf->pmd))) {
> -		spin_unlock(ptl);
> -		goto fallback;
> +	if (unlikely(!zero_folio)) {
> +		trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> +		return VM_FAULT_FALLBACK;
>  	}
>
> -	if (pgtable) {
> -		pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> -		mm_inc_nr_ptes(vma->vm_mm);
> -	}
> -	pmd_entry = folio_mk_pmd(zero_folio, vmf->vma->vm_page_prot);
> -	set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> -	spin_unlock(ptl);
> -	trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> -	return VM_FAULT_NOPAGE;
> +	*entry = dax_insert_entry(xas, vmf, iter, *entry, folio_pfn(zero_folio),
> +				  DAX_PMD | DAX_ZERO_PAGE);
>
> -fallback:
> -	if (pgtable)
> -		pte_free(vma->vm_mm, pgtable);
> -	trace_dax_pmd_load_hole_fallback(inode, vmf, zero_folio, *entry);
> -	return VM_FAULT_FALLBACK;
> +	ret = vmf_insert_folio_pmd(vmf, zero_folio, false);
> +	if (ret == VM_FAULT_NOPAGE)
> +		trace_dax_pmd_load_hole(inode, vmf, zero_folio, *entry);
> +	return ret;
>  }
>  #else
>  static vm_fault_t dax_pmd_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
@ 2025-07-17 18:29   ` Lorenzo Stoakes
  2025-07-17 20:31     ` David Hildenbrand
  2025-07-28  8:49   ` Wei Yang
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 18:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> The huge zero folio is refcounted (+mapcounted -- is that a word?)
> differently than "normal" folios, similarly (but different) to the ordinary
> shared zeropage.

Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
pages?

But for some reason the huge zero page wants to exist or not exist based on
usage for one. Which is stupid to me.

>
> For this reason, we special-case these pages in
> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> still use them (e.g., GUP can still take a reference on them).
>
> vm_normal_page_pmd() already filters out the huge zero folio. However,
> so far we are not marking it as special like we do with the ordinary
> shared zeropage. Let's mark it as special, so we can further refactor
> vm_normal_page_pmd() and vm_normal_page().
>
> While at it, update the doc regarding the shared zero folios.

Hmm I wonder how this will interact with the static PMD series at [0]?

I wonder if more use of that might result in some weirdness with refcounting
etc.?

Also, that series (though I reviewed against it) was moving stuff that
references the huge zero folio out of there, but it also generally allows
access to and mapping of this folio via largest_zero_folio(), so not only
via insert_pmd().

So we're going to end up with mappings of this that are not marked special
that are potentially going to have refcount/mapcount manipulation that
contradict what you're doing here perhaps?

[0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/

>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

I looked through places that use vm_normal_page_pmd() (other than the
function decl):

fs/proc/task_mmu.c - seems to handle NULL page correctly + still understands zero page
mm/pagewalk.c - correctly handles NULL page + huge zero page
mm/huge_memory.c - can_change_pmd_writable() correctly returns false.

And all seems to work wtih this change.

Overall, other than concerns above + nits below LGTM, we should treat all
the zero folios the same in this regard, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c |  5 ++++-
>  mm/memory.c      | 14 +++++++++-----
>  2 files changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index db08c37b87077..3f9a27812a590 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>  {
>  	pmd_t entry;
>  	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> +	entry = pmd_mkspecial(entry);
>  	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>  	set_pmd_at(mm, haddr, pmd, entry);
>  	mm_inc_nr_ptes(mm);
> @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>  	if (fop.is_folio) {
>  		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>
> -		if (!is_huge_zero_folio(fop.folio)) {
> +		if (is_huge_zero_folio(fop.folio)) {
> +			entry = pmd_mkspecial(entry);
> +		} else {
>  			folio_get(fop.folio);
>  			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>  			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> diff --git a/mm/memory.c b/mm/memory.c
> index 92fd18a5d8d1f..173eb6267e0ac 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
> - * case, NULL is returned here. "Normal" mappings do have a struct page.
> + * case, NULL is returned here. "Normal" mappings do have a struct page and
> + * are ordinarily refcounted.
> + *
> + * Page mappings of the shared zero folios are always considered "special", as
> + * they are not ordinarily refcounted. However, selected page table walkers
> + * (such as GUP) can still identify these mappings and work with the
> + * underlying "struct page".

I feel like we need more detail or something more explicit about what 'not
ordinary' refcounting constitutes. This is a bit vague.

>   *
>   * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>   * pte bit, in which case this function is trivial. Secondly, an architecture
> @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   *
>   * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>   * page" backing, however the difference is that _all_ pages with a struct
> - * page (that is, those where pfn_valid is true) are refcounted and considered
> - * normal pages by the VM. The only exception are zeropages, which are
> - * *never* refcounted.
> + * page (that is, those where pfn_valid is true, except the shared zero
> + * folios) are refcounted and considered normal pages by the VM.
>   *
>   * The disadvantage is that pages are refcounted (which can be slower and
>   * simply not an option for some PFNMAP users). The advantage is that we
> @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

You know I"m semi-ashamed to admit I didn't even know this function
exists. But yikes that we have a separate function like this just for PMDs.

>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>
> -	/* Currently it's only used for huge pfnmaps */
>  	if (unlikely(pmd_special(pmd)))
>  		return NULL;
>
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
@ 2025-07-17 19:17   ` Lorenzo Stoakes
  2025-07-17 20:03     ` David Hildenbrand
  2025-07-17 22:06   ` Demi Marie Obenour
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 19:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:09PM +0200, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
>
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
>
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
>
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
>
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
>
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
>
> The report will now look something like (dumping pgd to pmd values):
>
> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that we are dumping all levels
> even when levels are folded for simplicity.

Oh god. I reviewed this below. BUT OH GOD. What. Why???

>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 94 insertions(+), 26 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 173eb6267e0ac..08d16ed7b4cc7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  			add_mm_counter(mm, i, rss[i]);
>  }
>
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> -			  pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
>  {
> -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> -	p4d_t *p4d = p4d_offset(pgd, addr);
> -	pud_t *pud = pud_offset(p4d, addr);
> -	pmd_t *pmd = pmd_offset(pud, addr);
> -	struct address_space *mapping;
> -	pgoff_t index;
>  	static unsigned long resume;
>  	static unsigned long nr_shown;
>  	static unsigned long nr_unshown;
> @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  	if (nr_shown == 60) {
>  		if (time_before(jiffies, resume)) {
>  			nr_unshown++;
> -			return;
> +			return true;
>  		}
>  		if (nr_unshown) {
>  			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  	}
>  	if (nr_shown++ == 0)
>  		resume = jiffies + 60 * HZ;
> +	return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned long long pgdv, p4dv, pudv, pmdv;

> +	p4d_t p4d, *p4dp;
> +	pud_t pud, *pudp;
> +	pmd_t pmd, *pmdp;
> +	pgd_t *pgdp;
> +
> +	/*
> +	 * This looks like a fully lockless walk, however, the caller is
> +	 * expected to hold the leaf page table lock in addition to other
> +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> +	 * content while any concurrent modifications should be completely
> +	 * prevented.
> +	 */

Hmmm :)

Why aren't we trying to lock at leaf level?

We need to:

- Keep the VMA stable, which prevents unmap page table teardown and
  khugepaged collapse.
- (not relevant as we don't traverse the PTE table, but) take the RCU lock
  for PTE entries to avoid MADV_DONTNEED page table withdrawal.

Buuut if we're not locking at leaf level, we leave ourselves open to racing
faults, zaps, etc. etc.

So perhaps this is why you require such strict conditions...

But can you truly be sure these exist? And should we then assert them
here? For rmap though we'd need the folio/vma.

> +	pgdp = pgd_offset(mm, addr);
> +	pgdv = pgd_val(*pgdp);

Before I went and looked again at the commit msg I said:

	"Shoudln't we strictly speaking use pgdp_get()? I see you use this
	 helper for other levels."

But obviously yeah. You explained the insane reason why not.

> +
> +	if (!pgd_present(*pgdp) || pgd_leaf(*pgdp)) {
> +		pr_alert("pgd:%08llx\n", pgdv);
> +		return;
> +	}
> +
> +	p4dp = p4d_offset(pgdp, addr);
> +	p4d = p4dp_get(p4dp);
> +	p4dv = p4d_val(p4d);
> +
> +	if (!p4d_present(p4d) || p4d_leaf(p4d)) {
> +		pr_alert("pgd:%08llx p4d:%08llx\n", pgdv, p4dv);
> +		return;
> +	}
> +
> +	pudp = pud_offset(p4dp, addr);
> +	pud = pudp_get(pudp);
> +	pudv = pud_val(pud);
> +
> +	if (!pud_present(pud) || pud_leaf(pud)) {
> +		pr_alert("pgd:%08llx p4d:%08llx pud:%08llx\n", pgdv, p4dv, pudv);
> +		return;
> +	}
> +
> +	pmdp = pmd_offset(pudp, addr);
> +	pmd = pmdp_get(pmdp);
> +	pmdv = pmd_val(pmd);
> +
> +	/*
> +	 * Dumping the PTE would be nice, but it's tricky with CONFIG_HIGHPTE,
> +	 * because the table should already be mapped by the caller and
> +	 * doing another map would be bad. print_bad_page_map() should
> +	 * already take care of printing the PTE.
> +	 */

I hate 32-bit kernels.

> +	pr_alert("pgd:%08llx p4d:%08llx pud:%08llx pmd:%08llx\n", pgdv,
> +		 p4dv, pudv, pmdv);
> +}
> +
> +/*
> + * This function is called to print an error when a bad page table entry (e.g.,
> + * corrupted page table entry) is found. For example, we might have a
> + * PFN-mapped pte in a region that doesn't allow it.
> + *
> + * The calling function must still handle the error.
> + */

We have extremely strict locking conditions for the page table traversal... but
no mention of them here?

> +static void print_bad_page_map(struct vm_area_struct *vma,
> +		unsigned long addr, unsigned long long entry, struct page *page)
> +{
> +	struct address_space *mapping;
> +	pgoff_t index;
> +
> +	if (is_bad_page_map_ratelimited())
> +		return;
>
>  	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>  	index = linear_page_index(vma, addr);
>
> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> -		 current->comm,
> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);

Sort of wonder if this is even useful if you don't know what the 'entry'
is? But I guess the dump below will tell you.

Though maybe actually useful to see flags etc. in case some horrid
corruption happened and maybe dump isn't valid? But then the dump assumes
strict conditions to work so... can that happen?

> +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
>  	if (page)
> -		dump_page(page, "bad pte");
> +		dump_page(page, "bad page map");
>  	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
>  		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>  	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  		if (is_zero_pfn(pfn))
>  			return NULL;
>
> -		print_bad_pte(vma, addr, pte, NULL);
> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
>  		return NULL;
>  	}
>
> @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>
>  check_pfn:
>  	if (unlikely(pfn > highest_memmap_pfn)) {
> -		print_bad_pte(vma, addr, pte, NULL);
> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);

This is unrelated to your series, but I guess this is for cases where
you're e.g. iomapping or such? So it's not something in the memmap, but
a PFN that might reference IO memory or similar?

>  		return NULL;
>  	}
>
> @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  {
>  	unsigned long pfn = pmd_pfn(pmd);
>
> -	if (unlikely(pmd_special(pmd)))
> +	if (unlikely(pmd_special(pmd))) {
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +			return NULL;

I guess we'll bring this all together in a later patch with vm_normal_page(),
as it's getting a little duplicative :P

Makes me think that VM_SPECIAL is kind of badly named (other than the fact
that 'special' is nebulous and overloaded in general) in that it contains
stuff that is -VMA-special, but only VM_PFNMAP | VM_MIXEDMAP really indicates
specialness wrt the underlying folio.

Then we have VM_IO, which strictly must not have an associated page, right?
It's the odd one out, and I wonder if it's somehow redundant.

Anyway stuff to think about...
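
For reference, the current definition (quoted from include/linux/mm.h as of
this writing) -- only the last two flags matter to vm_normal_page():

#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)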

> +		if (is_huge_zero_pfn(pfn))
> +			return NULL;
> +
> +		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>  		return NULL;
> +	}
>
>  	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>  		if (vma->vm_flags & VM_MIXEDMAP) {
> @@ -674,8 +739,10 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>
>  	if (is_huge_zero_pfn(pfn))
>  		return NULL;
> -	if (unlikely(pfn > highest_memmap_pfn))
> +	if (unlikely(pfn > highest_memmap_pfn)) {
> +		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>  		return NULL;
> +	}
>
>  	/*
>  	 * NOTE! We still have PageReserved() pages in the page tables.
> @@ -1509,7 +1576,7 @@ static __always_inline void zap_present_folio_ptes(struct mmu_gather *tlb,
>  		folio_remove_rmap_ptes(folio, page, nr, vma);
>
>  		if (unlikely(folio_mapcount(folio) < 0))
> -			print_bad_pte(vma, addr, ptent, page);
> +			print_bad_page_map(vma, addr, pte_val(ptent), page);
>  	}
>  	if (unlikely(__tlb_remove_folio_pages(tlb, page, nr, delay_rmap))) {
>  		*force_flush = true;
> @@ -4507,7 +4574,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  		} else if (is_pte_marker_entry(entry)) {
>  			ret = handle_pte_marker(vmf);
>  		} else {
> -			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
> +			print_bad_page_map(vma, vmf->address,
> +					   pte_val(vmf->orig_pte), NULL);
>  			ret = VM_FAULT_SIGBUS;
>  		}
>  		goto out;
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 11:52 ` [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
@ 2025-07-17 19:51   ` Lorenzo Stoakes
  2025-07-17 19:55     ` Lorenzo Stoakes
  2025-07-17 20:12     ` David Hildenbrand
  0 siblings, 2 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 19:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:10PM +0200, David Hildenbrand wrote:
> Let's reduce the code duplication and factor out the non-pte/pmd related
> magic into vm_normal_page_pfn().
>
> To keep it simpler, check the pfn against both zero folios. We could
> optimize this, but as it's only for the !CONFIG_ARCH_HAS_PTE_SPECIAL
> case, it's not a compelling micro-optimization.
>
> With CONFIG_ARCH_HAS_PTE_SPECIAL we don't have to check anything else,
> really.
>
> It's a good question if we can even hit the !CONFIG_ARCH_HAS_PTE_SPECIAL
> scenario in the PMD case in practice: but doesn't really matter, as
> it's now all unified in vm_normal_page_pfn().
>
> Add kerneldoc for all involved functions.
>
> No functional change intended.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 183 +++++++++++++++++++++++++++++++---------------------
>  1 file changed, 109 insertions(+), 74 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 08d16ed7b4cc7..c43ae5e4d7644 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -590,8 +590,13 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>  	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
>  }
>
> -/*
> - * vm_normal_page -- This function gets the "struct page" associated with a pte.
> +/**
> + * vm_normal_page_pfn() - Get the "struct page" associated with a PFN in a
> + *			  non-special page table entry.

This is a bit nebulous/confusing; I mean, you'll get PTE entries with the PTE
special bit that'll have a PFN but no struct page/folio to look at, or one
that should not be touched.

So the _pfn() bit doesn't really properly describe what it does.

I wonder if it'd be better to just separate out the special handler, have
that return a boolean indicating special of either form, and then factor out
the other shared code separately from that?
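
A rough sketch of that split, purely illustrative (mirroring the
!CONFIG_ARCH_HAS_PTE_SPECIAL logic from the hunk below, not something the
series proposes):

static bool vm_normal_page_is_special_pfn(struct vm_area_struct *vma,
					  unsigned long addr, unsigned long pfn)
{
	if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
			/* Without a "struct page", it's "special". */
			if (!pfn_valid(pfn))
				return true;
		} else {
			unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;

			/* Only CoW'ed anon folios are "normal". */
			if (pfn == vma->vm_pgoff + off)
				return true;
			if (!is_cow_mapping(vma->vm_flags))
				return true;
		}
	}
	/* The shared zero folios are never "normal". */
	return is_zero_pfn(pfn) || is_huge_zero_pfn(pfn);
}

The pfn-to-page conversion and the highest_memmap_pfn sanity check would then
live in a separate shared helper.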

> + * @vma: The VMA mapping the @pfn.
> + * @addr: The address where the @pfn is mapped.
> + * @pfn: The PFN.
> + * @entry: The page table entry value for error reporting purposes.
>   *
>   * "Special" mappings do not wish to be associated with a "struct page" (either
>   * it doesn't exist, or it exists but they don't want to touch it). In this
> @@ -603,10 +608,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>   * (such as GUP) can still identify these mappings and work with the
>   * underlying "struct page".
>   *
> - * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> - * pte bit, in which case this function is trivial. Secondly, an architecture
> - * may not have a spare pte bit, which requires a more complicated scheme,
> - * described below.
> + * There are 2 broad cases. Firstly, an architecture may define a "special"
> + * page table entry bit (e.g., pte_special()), in which case this function is
> + * trivial. Secondly, an architecture may not have a spare page table
> + * entry bit, which requires a more complicated scheme, described below.

Strikes me this bit of the comment should be with vm_normal_page(), as this
implies the 2 broad cases are handled here, and that isn't the case.

>   *
>   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>   * special mapping (even if there are underlying and valid "struct pages").
> @@ -639,15 +644,72 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>   * don't have to follow the strict linearity rule of PFNMAP mappings in
>   * order to support COWable mappings.
>   *
> + * This function is not expected to be called for obviously special mappings:
> + * when the page table entry has the "special" bit set.

Hmm this is is a bit weird though, saying "obviously" special, because you're
handling "special" mappings here, but only for architectures that don't specify
the PTE special bit.

So it makes it quite nebulous what constitutes 'obviously' here; really you
mean pte_special().

> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
> + */
> +static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
> +		unsigned long addr, unsigned long pfn, unsigned long long entry)
> +{
> +	/*
> +	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
> +	 * (incl. shared zero folios) are marked accordingly and are handled
> +	 * by the caller.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> +		if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
> +			if (vma->vm_flags & VM_MIXEDMAP) {
> +				/* If it has a "struct page", it's "normal". */
> +				if (!pfn_valid(pfn))
> +					return NULL;
> +			} else {
> +				unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
> +
> +				/* Only CoW'ed anon folios are "normal". */
> +				if (pfn == vma->vm_pgoff + off)
> +					return NULL;
> +				if (!is_cow_mapping(vma->vm_flags))
> +					return NULL;
> +			}
> +		}
> +
> +		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))

This handles the zero/huge zero page for the non-pte_special() case
only. I wonder if we even need to bother having these marked special
generally, since you can just check the PFN every time anyway.

> +			return NULL;
> +	}
> +
> +	/* Cheap check for corrupted page table entries. */
> +	if (pfn > highest_memmap_pfn) {
> +		print_bad_page_map(vma, addr, entry, NULL);
> +		return NULL;
> +	}
> +	/*
> +	 * NOTE! We still have PageReserved() pages in the page tables.
> +	 * For example, VDSO mappings can cause them to exist.
> +	 */
> +	VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
> +	return pfn_to_page(pfn);
> +}
> +
> +/**
> + * vm_normal_page() - Get the "struct page" associated with a PTE
> + * @vma: The VMA mapping the @pte.
> + * @addr: The address where the @pte is mapped.
> + * @pte: The PTE.
> + *
> + * Get the "struct page" associated with a PTE. See vm_normal_page_pfn()
> + * for details.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
>   */
>  struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  			    pte_t pte)
>  {
>  	unsigned long pfn = pte_pfn(pte);
>
> -	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> -		if (likely(!pte_special(pte)))
> -			goto check_pfn;
> +	if (unlikely(pte_special(pte))) {
>  		if (vma->vm_ops && vma->vm_ops->find_special_page)
>  			return vma->vm_ops->find_special_page(vma, addr);
>  		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> @@ -658,44 +720,21 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  		print_bad_page_map(vma, addr, pte_val(pte), NULL);
>  		return NULL;
>  	}
> -
> -	/* !CONFIG_ARCH_HAS_PTE_SPECIAL case follows: */
> -
> -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> -		if (vma->vm_flags & VM_MIXEDMAP) {
> -			if (!pfn_valid(pfn))
> -				return NULL;
> -			if (is_zero_pfn(pfn))
> -				return NULL;
> -			goto out;
> -		} else {
> -			unsigned long off;
> -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> -			if (pfn == vma->vm_pgoff + off)
> -				return NULL;
> -			if (!is_cow_mapping(vma->vm_flags))
> -				return NULL;
> -		}
> -	}
> -
> -	if (is_zero_pfn(pfn))
> -		return NULL;
> -
> -check_pfn:
> -	if (unlikely(pfn > highest_memmap_pfn)) {
> -		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> -		return NULL;
> -	}
> -
> -	/*
> -	 * NOTE! We still have PageReserved() pages in the page tables.
> -	 * eg. VDSO mappings can cause them to exist.
> -	 */
> -out:
> -	VM_WARN_ON_ONCE(is_zero_pfn(pfn));
> -	return pfn_to_page(pfn);
> +	return vm_normal_page_pfn(vma, addr, pfn, pte_val(pte));
>  }
>
> +/**
> + * vm_normal_folio() - Get the "struct folio" associated with a PTE
> + * @vma: The VMA mapping the @pte.
> + * @addr: The address where the @pte is mapped.
> + * @pte: The PTE.
> + *
> + * Get the "struct folio" associated with a PTE. See vm_normal_page_pfn()
> + * for details.
> + *
> + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
> + */

Nice to add a comment, but again feels weird to have the whole explanation in
vm_normal_page_pfn() but then to invoke vm_normal_page()...

>  struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
>  			    pte_t pte)
>  {
> @@ -707,6 +746,18 @@ struct folio *vm_normal_folio(struct vm_area_struct *vma, unsigned long addr,
>  }
>
>  #ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
> +/**
> + * vm_normal_page_pmd() - Get the "struct page" associated with a PMD
> + * @vma: The VMA mapping the @pmd.
> + * @addr: The address where the @pmd is mapped.
> + * @pmd: The PMD.
> + *
> + * Get the "struct page" associated with a PMD. See vm_normal_page_pfn()
> + * for details.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
> + */
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  				pmd_t pmd)
>  {
> @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>  		return NULL;
>  	}
> -
> -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> -		if (vma->vm_flags & VM_MIXEDMAP) {
> -			if (!pfn_valid(pfn))
> -				return NULL;
> -			goto out;
> -		} else {
> -			unsigned long off;
> -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> -			if (pfn == vma->vm_pgoff + off)
> -				return NULL;
> -			if (!is_cow_mapping(vma->vm_flags))
> -				return NULL;
> -		}
> -	}
> -
> -	if (is_huge_zero_pfn(pfn))
> -		return NULL;
> -	if (unlikely(pfn > highest_memmap_pfn)) {
> -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> -		return NULL;
> -	}
> -
> -	/*
> -	 * NOTE! We still have PageReserved() pages in the page tables.
> -	 * eg. VDSO mappings can cause them to exist.
> -	 */
> -out:
> -	return pfn_to_page(pfn);
> +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));

Hmm this seems broken, because you're now making these special on arches with
pte_special() right? But then you're invoking the not-special function?

Also for non-pte_special() arches you're kind of implying they _maybe_ could be
special.


>  }
>
> +/**
> + * vm_normal_folio_pmd() - Get the "struct folio" associated with a PMD
> + * @vma: The VMA mapping the @pmd.
> + * @addr: The address where the @pmd is mapped.
> + * @pmd: The PMD.
> + *
> + * Get the "struct folio" associated with a PMD. See vm_normal_page_pfn()
> + * for details.
> + *
> + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
> + */
>  struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  				  unsigned long addr, pmd_t pmd)
>  {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 19:51   ` Lorenzo Stoakes
@ 2025-07-17 19:55     ` Lorenzo Stoakes
  2025-07-17 20:03       ` David Hildenbrand
  2025-07-17 20:12     ` David Hildenbrand
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 19:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 08:51:51PM +0100, Lorenzo Stoakes wrote:
> > @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >  		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> >  		return NULL;
> >  	}
> > -
> > -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> > -		if (vma->vm_flags & VM_MIXEDMAP) {
> > -			if (!pfn_valid(pfn))
> > -				return NULL;
> > -			goto out;
> > -		} else {
> > -			unsigned long off;
> > -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> > -			if (pfn == vma->vm_pgoff + off)
> > -				return NULL;
> > -			if (!is_cow_mapping(vma->vm_flags))
> > -				return NULL;
> > -		}
> > -	}
> > -
> > -	if (is_huge_zero_pfn(pfn))
> > -		return NULL;
> > -	if (unlikely(pfn > highest_memmap_pfn)) {
> > -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> > -		return NULL;
> > -	}
> > -
> > -	/*
> > -	 * NOTE! We still have PageReserved() pages in the page tables.
> > -	 * eg. VDSO mappings can cause them to exist.
> > -	 */
> > -out:
> > -	return pfn_to_page(pfn);
> > +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
>
> Hmm this seems broken, because you're now making these special on arches with
> pte_special() right? But then you're invoking the not-special function?
>
> Also for non-pte_special() arches you're kind of implying they _maybe_ could be
> special.

OK sorry the diff caught me out here, you explicitly handle the pmd_special()
case here, duplicatively (yuck).

Maybe you fix this up in a later patch :)

Anyway, again it'd be nice to somehow find a way to separate the special
check out from the rest.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
@ 2025-07-17 20:03   ` Lorenzo Stoakes
  2025-07-17 20:14     ` David Hildenbrand
  2025-07-29  7:52   ` Wei Yang
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 20:03 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> because of our new common helpers and there not being a PUD-sized zero
> folio.
>
> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> structuring the code like the other (pmd/pte) cases. Defer
> introducing vm_normal_folio_pud() until really used.

I mean fine :P but does anybody really use this?

>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Seems ok to me, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  include/linux/mm.h |  2 ++
>  mm/memory.c        | 27 +++++++++++++++++++++++++++
>  mm/pagewalk.c      | 20 ++++++++++----------
>  3 files changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index abc47f1f307fb..0eb991262fbbf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2349,6 +2349,8 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  				  unsigned long addr, pmd_t pmd);
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>  				pmd_t pmd);
> +struct page *vm_normal_page_pud(struct vm_area_struct *vma, unsigned long addr,
> +		pud_t pud);
>
>  void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>  		  unsigned long size);
> diff --git a/mm/memory.c b/mm/memory.c
> index c43ae5e4d7644..00a0d7ae3ba4a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -796,6 +796,33 @@ struct folio *vm_normal_folio_pmd(struct vm_area_struct *vma,
>  		return page_folio(page);
>  	return NULL;
>  }
> +
> +/**
> + * vm_normal_page_pud() - Get the "struct page" associated with a PUD
> + * @vma: The VMA mapping the @pud.
> + * @addr: The address where the @pud is mapped.
> + * @pud: The PUD.
> + *
> + * Get the "struct page" associated with a PUD. See vm_normal_page_pfn()
> + * for details.
> + *
> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> + *	   NULL if this is a "special" mapping.
> + */
> +struct page *vm_normal_page_pud(struct vm_area_struct *vma,
> +		unsigned long addr, pud_t pud)
> +{
> +	unsigned long pfn = pud_pfn(pud);
> +
> +	if (unlikely(pud_special(pud))) {
> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> +			return NULL;
> +
> +		print_bad_page_map(vma, addr, pud_val(pud), NULL);
> +		return NULL;
> +	}
> +	return vm_normal_page_pfn(vma, addr, pfn, pud_val(pud));
> +}
>  #endif
>
>  /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 648038247a8d2..c6753d370ff4e 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -902,23 +902,23 @@ struct folio *folio_walk_start(struct folio_walk *fw,
>  		fw->pudp = pudp;
>  		fw->pud = pud;
>
> -		/*
> -		 * TODO: FW_MIGRATION support for PUD migration entries
> -		 * once there are relevant users.
> -		 */
> -		if (!pud_present(pud) || pud_special(pud)) {
> +		if (pud_none(pud)) {
>  			spin_unlock(ptl);
>  			goto not_found;
> -		} else if (!pud_leaf(pud)) {
> +		} else if (pud_present(pud) && !pud_leaf(pud)) {
>  			spin_unlock(ptl);
>  			goto pmd_table;
> +		} else if (pud_present(pud)) {
> +			page = vm_normal_page_pud(vma, addr, pud);
> +			if (page)
> +				goto found;
>  		}
>  		/*
> -		 * TODO: vm_normal_page_pud() will be handy once we want to
> -		 * support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
> +		 * TODO: FW_MIGRATION support for PUD migration entries
> +		 * once there are relevant users.
>  		 */
> -		page = pud_page(pud);
> -		goto found;
> +		spin_unlock(ptl);
> +		goto not_found;
>  	}
>
>  pmd_table:
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 19:17   ` Lorenzo Stoakes
@ 2025-07-17 20:03     ` David Hildenbrand
  2025-07-18 10:15       ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 20:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

>> The report will now look something like (dumping pgd to pmd values):
>>
>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>
>> Not using pgdp_get(), because that does not work properly on some arm
>> configs where pgd_t is an array. Note that we are dumping all levels
>> even when levels are folded for simplicity.
> 
> Oh god. I reviewed this below. BUT OH GOD. What. Why???

Exactly.

I wish this patch wouldn't exist, but Hugh convinced me that apparently 
it can be useful in the real world.

> 
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
>>   1 file changed, 94 insertions(+), 26 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 173eb6267e0ac..08d16ed7b4cc7 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>>   			add_mm_counter(mm, i, rss[i]);
>>   }
>>
>> -/*
>> - * This function is called to print an error when a bad pte
>> - * is found. For example, we might have a PFN-mapped pte in
>> - * a region that doesn't allow it.
>> - *
>> - * The calling function must still handle the error.
>> - */
>> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>> -			  pte_t pte, struct page *page)
>> +static bool is_bad_page_map_ratelimited(void)
>>   {
>> -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
>> -	p4d_t *p4d = p4d_offset(pgd, addr);
>> -	pud_t *pud = pud_offset(p4d, addr);
>> -	pmd_t *pmd = pmd_offset(pud, addr);
>> -	struct address_space *mapping;
>> -	pgoff_t index;
>>   	static unsigned long resume;
>>   	static unsigned long nr_shown;
>>   	static unsigned long nr_unshown;
>> @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>   	if (nr_shown == 60) {
>>   		if (time_before(jiffies, resume)) {
>>   			nr_unshown++;
>> -			return;
>> +			return true;
>>   		}
>>   		if (nr_unshown) {
>>   			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
>> @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>   	}
>>   	if (nr_shown++ == 0)
>>   		resume = jiffies + 60 * HZ;
>> +	return false;
>> +}
>> +
>> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
>> +{
>> +	unsigned long long pgdv, p4dv, pudv, pmdv;
> 
>> +	p4d_t p4d, *p4dp;
>> +	pud_t pud, *pudp;
>> +	pmd_t pmd, *pmdp;
>> +	pgd_t *pgdp;
>> +
>> +	/*
>> +	 * This looks like a fully lockless walk, however, the caller is
>> +	 * expected to hold the leaf page table lock in addition to other
>> +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
>> +	 * content while any concurrent modifications should be completely
>> +	 * prevented.
>> +	 */
> 
> Hmmm :)
> 
> Why aren't we trying to lock at leaf level?

Ehm, did you read the:

"the caller is expected to hold the leaf page table lock"

:)


> 
> We need to:
> 
> - Keep VMA stable which prevents unmap page table teardown and khugepaged
>    collapse.
> - (not relevant as we don't traverse PTE table but) RCU lock for PTE
>    entries to avoid MADV_DONTNEED page table withdrawal.
> 
> Buuut if we're not locking at leaf level, we leave ourselves open to racing
> faults, zaps, etc. etc.

Yes. I can clarify in the comment of print_bad_page_map() that it is 
expected to be called with the PTL (unwritten rule, not changing that).
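
Something along these lines, for example (rough sketch of the wording only, 
not the final comment):

	/*
	 * This function is called to print an error when a bad page table
	 * entry (e.g., a corrupted page table entry) is found.
	 *
	 * The caller must hold the page table lock (PTL) of the leaf page
	 * table containing the bad entry, in addition to whatever mm/VMA
	 * locks were needed to walk to that entry, so that the page tables
	 * cannot change underneath us while re-walking and dumping them.
	 *
	 * The calling function must still handle the error.
	 */
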

> 
> So perhaps this why you require such strict conditions...
> 
> But can you truly be sure of these existing? And we should then assert them
> here no? For rmap though we'd need the folio/vma.

I hope you realize that this nasty piece of code is only called when our 
system is already running into something extremely unexpected and will 
probably be dead soon.

So I am not too interested in adding anything more here. If you run into 
this code, you're in big trouble already.

> 
>> +	pgdp = pgd_offset(mm, addr);
>> +	pgdv = pgd_val(*pgdp);
> 
> Before I went and looked again at the commit msg I said:
> 
> 	"Shoudln't we strictly speaking use pgdp_get()? I see you use this
> 	 helper for other levels."
> 
> But obviously yeah. You explained the insane reason why not.

Had to find out the hard way ... :)

[...]

>> +/*
>> + * This function is called to print an error when a bad page table entry (e.g.,
>> + * corrupted page table entry) is found. For example, we might have a
>> + * PFN-mapped pte in a region that doesn't allow it.
>> + *
>> + * The calling function must still handle the error.
>> + */
> 
> We have extremely strict locking conditions for the page table traversal... but
> no mention of them here?

Yeah, I can add that.

> 
>> +static void print_bad_page_map(struct vm_area_struct *vma,
>> +		unsigned long addr, unsigned long long entry, struct page *page)
>> +{
>> +	struct address_space *mapping;
>> +	pgoff_t index;
>> +
>> +	if (is_bad_page_map_ratelimited())
>> +		return;
>>
>>   	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>>   	index = linear_page_index(vma, addr);
>>
>> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
>> -		 current->comm,
>> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
>> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> 
> Sort of wonder if this is even useful if you don't know what the 'entry'
> is? But I guess the dump below will tell you.

You probably missed in the patch description:

"Whether it is a PTE or something else will usually become obvious from 
the page table dump or from the dumped stack. If ever required in the 
future, we could pass the entry level type similar to "enum rmap_level". 
For now, let's keep it simple."

> 
> Though maybe actually useful to see flags etc. in case some horrid
> corruption happened and maybe dump isn't valid? But then the dump assumes
> strict conditions to work so... can that happen?

Not sure what you mean, can you elaborate?

If your system is falling apart completely, I'm afraid this report here 
will not save you.

You'll probably get a flood of 60 such reports ... if your system doesn't 
collapse before that.

> 
>> +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
>>   	if (page)
>> -		dump_page(page, "bad pte");
>> +		dump_page(page, "bad page map");
>>   	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
>>   		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>>   	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
>> @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>   		if (is_zero_pfn(pfn))
>>   			return NULL;
>>
>> -		print_bad_pte(vma, addr, pte, NULL);
>> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
>>   		return NULL;
>>   	}
>>
>> @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>>
>>   check_pfn:
>>   	if (unlikely(pfn > highest_memmap_pfn)) {
>> -		print_bad_pte(vma, addr, pte, NULL);
>> +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> 
> This is unrelated to your series, but I guess this is for cases where
> you're e.g. iomapping or such? So it's not something in the memmap but it's
> a PFN that might reference io memory or such?

No, it's just an easy check for catching corruptions. In a later patch I 
add a comment.
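
Roughly along these lines (illustrative wording only, not the exact comment 
from the later patch):

	if (unlikely(pfn > highest_memmap_pfn)) {
		/*
		 * A PFN above highest_memmap_pfn has no "struct page" at
		 * all; no sane code could have installed such an entry, so
		 * the page table entry is most likely corrupted.
		 */
		print_bad_page_map(vma, addr, pte_val(pte), NULL);
		return NULL;
	}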

> 
>>   		return NULL;
>>   	}
>>
>> @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   {
>>   	unsigned long pfn = pmd_pfn(pmd);
>>
>> -	if (unlikely(pmd_special(pmd)))
>> +	if (unlikely(pmd_special(pmd))) {
>> +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>> +			return NULL;
> 
> I guess we'll bring this all together in a later patch with vm_normal_page(),
> as it's getting a little duplicative :P

Not that part, but the other part :)

> 
> Makes me think that VM_SPECIAL is kind of badly named (other than the fact
> that 'special' is nebulous and overloaded in general) in that it contains
> stuff that is VMA-special, but only VM_PFNMAP | VM_MIXEDMAP really indicates
> specialness wrt the underlying folio.

It is.

> 
> Then we have VM_IO, which strictly must not have an associated page right?

VM_IO just means read/write side effects; I think you could easily have 
such mappings with a memmap ... e.g., a memory section (128MiB) spanning 
both memory and MMIO regions.

Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 19:55     ` Lorenzo Stoakes
@ 2025-07-17 20:03       ` David Hildenbrand
  2025-07-18 12:43         ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 20:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 17.07.25 21:55, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 08:51:51PM +0100, Lorenzo Stoakes wrote:
>>> @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>>   		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>>>   		return NULL;
>>>   	}
>>> -
>>> -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>> -		if (vma->vm_flags & VM_MIXEDMAP) {
>>> -			if (!pfn_valid(pfn))
>>> -				return NULL;
>>> -			goto out;
>>> -		} else {
>>> -			unsigned long off;
>>> -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
>>> -			if (pfn == vma->vm_pgoff + off)
>>> -				return NULL;
>>> -			if (!is_cow_mapping(vma->vm_flags))
>>> -				return NULL;
>>> -		}
>>> -	}
>>> -
>>> -	if (is_huge_zero_pfn(pfn))
>>> -		return NULL;
>>> -	if (unlikely(pfn > highest_memmap_pfn)) {
>>> -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>>> -		return NULL;
>>> -	}
>>> -
>>> -	/*
>>> -	 * NOTE! We still have PageReserved() pages in the page tables.
>>> -	 * eg. VDSO mappings can cause them to exist.
>>> -	 */
>>> -out:
>>> -	return pfn_to_page(pfn);
>>> +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
>>
>> Hmm this seems broken, because you're now making these special on arches with
>> pte_special() right? But then you're invoking the not-special function?
>>
>> Also for non-pte_special() arches you're kind of implying they _maybe_ could be
>> special.
> 
> OK sorry the diff caught me out here, you explicitly handle the pmd_special()
> case here, duplicatively (yuck).
> 
> Maybe you fix this up in a later patch :)

I had that, but the conditions depend on the level, meaning unnecessary 
checks at the pte/pmd/pud level.

I had a variant where I would pass "bool special" into 
vm_normal_page_pfn(), but I didn't like it.

To optimize that out, I would have to provide a "level" argument, and I have 
not yet convinced myself that that is a good idea at this point.
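
For the record, that rejected direction would look roughly like the following 
(sketch only, with made-up names, and with the !CONFIG_ARCH_HAS_PTE_SPECIAL 
handling omitted):

	enum pgtable_level {
		PGTABLE_LEVEL_PTE,
		PGTABLE_LEVEL_PMD,
		PGTABLE_LEVEL_PUD,
	};

	static struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
			unsigned long addr, unsigned long pfn,
			unsigned long long entry, bool special,
			enum pgtable_level level)
	{
		if (unlikely(special)) {
			if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
				return NULL;
			/* The zero-folio checks differ per level ... */
			if (level == PGTABLE_LEVEL_PTE && is_zero_pfn(pfn))
				return NULL;
			if (level == PGTABLE_LEVEL_PMD && is_huge_zero_pfn(pfn))
				return NULL;
			print_bad_page_map(vma, addr, entry, NULL);
			return NULL;
		}
		if (unlikely(pfn > highest_memmap_pfn)) {
			print_bad_page_map(vma, addr, entry, NULL);
			return NULL;
		}
		return pfn_to_page(pfn);
	}

... with, e.g., the PTE caller becoming

	return vm_normal_page_pfn(vma, addr, pte_pfn(pte), pte_val(pte),
				  pte_special(pte), PGTABLE_LEVEL_PTE);

i.e., extra arguments plus level-dependent branching in the common helper, 
which did not feel like an improvement at this point.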

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
@ 2025-07-17 20:07   ` Lorenzo Stoakes
  2025-07-29  7:53   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-17 20:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang, David Vrabel

On Thu, Jul 17, 2025 at 01:52:12PM +0200, David Hildenbrand wrote:
> ... and hide it behind a kconfig option. There is really no need for
> any !xen code to perform this check.

Lovely :)

>
> The naming is a bit off: we want to find the "normal" page when a PTE
> was marked "special". So it's really not "finding a special" page.
>
> Improve the documentation, and add a comment in the code where XEN ends
> up performing the pte_mkspecial() through a hypercall. More details can
> be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
> special on x86 PV guests").
>
> Cc: David Vrabel <david.vrabel@citrix.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Yes, yes thank you thank you! This is long overdue. Glorious.

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  drivers/xen/Kconfig              |  1 +
>  drivers/xen/gntdev.c             |  5 +++--
>  include/linux/mm.h               | 18 +++++++++++++-----
>  mm/Kconfig                       |  2 ++
>  mm/memory.c                      | 12 ++++++++++--
>  tools/testing/vma/vma_internal.h | 18 +++++++++++++-----
>  6 files changed, 42 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
> index 24f485827e039..f9a35ed266ecf 100644
> --- a/drivers/xen/Kconfig
> +++ b/drivers/xen/Kconfig
> @@ -138,6 +138,7 @@ config XEN_GNTDEV
>  	depends on XEN
>  	default m
>  	select MMU_NOTIFIER
> +	select FIND_NORMAL_PAGE
>  	help
>  	  Allows userspace processes to use grants.
>
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index 61faea1f06630..d1bc0dae2cdf9 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -309,6 +309,7 @@ static int find_grant_ptes(pte_t *pte, unsigned long addr, void *data)
>  	BUG_ON(pgnr >= map->count);
>  	pte_maddr = arbitrary_virt_to_machine(pte).maddr;
>
> +	/* Note: this will perform a pte_mkspecial() through the hypercall. */
>  	gnttab_set_map_op(&map->map_ops[pgnr], pte_maddr, flags,
>  			  map->grants[pgnr].ref,
>  			  map->grants[pgnr].domid);
> @@ -516,7 +517,7 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
>  	gntdev_put_map(priv, map);
>  }
>
> -static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
> +static struct page *gntdev_vma_find_normal_page(struct vm_area_struct *vma,
>  						 unsigned long addr)
>  {
>  	struct gntdev_grant_map *map = vma->vm_private_data;
> @@ -527,7 +528,7 @@ static struct page *gntdev_vma_find_special_page(struct vm_area_struct *vma,
>  static const struct vm_operations_struct gntdev_vmops = {
>  	.open = gntdev_vma_open,
>  	.close = gntdev_vma_close,
> -	.find_special_page = gntdev_vma_find_special_page,
> +	.find_normal_page = gntdev_vma_find_normal_page,
>  };
>
>  /* ------------------------------------------------------------------ */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0eb991262fbbf..036800514aa90 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -648,13 +648,21 @@ struct vm_operations_struct {
>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>  					unsigned long addr, pgoff_t *ilx);
>  #endif
> +#ifdef CONFIG_FIND_NORMAL_PAGE
>  	/*
> -	 * Called by vm_normal_page() for special PTEs to find the
> -	 * page for @addr.  This is useful if the default behavior
> -	 * (using pte_page()) would not find the correct page.
> +	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
> +	 * allows for returning a "normal" page from vm_normal_page() even
> +	 * though the PTE indicates that the "struct page" either does not exist
> +	 * or should not be touched: "special".
> +	 *
> +	 * Do not add new users: this really only works when a "normal" page
> +	 * was mapped, but then the PTE got changed to something weird (+
> +	 * marked special) that would not make pte_pfn() identify the originally
> +	 * inserted page.

Yes great, glad to quarantine this.

>  	 */
> -	struct page *(*find_special_page)(struct vm_area_struct *vma,
> -					  unsigned long addr);
> +	struct page *(*find_normal_page)(struct vm_area_struct *vma,
> +					 unsigned long addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
>  };
>
>  #ifdef CONFIG_NUMA_BALANCING
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 0287e8d94aea7..82c281b4f6937 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1397,6 +1397,8 @@ config PT_RECLAIM
>
>  	  Note: now only empty user PTE page table pages will be reclaimed.
>
> +config FIND_NORMAL_PAGE
> +	def_bool n
>
>  source "mm/damon/Kconfig"
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 00a0d7ae3ba4a..52804ca343261 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -613,6 +613,12 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>   * trivial. Secondly, an architecture may not have a spare page table
>   * entry bit, which requires a more complicated scheme, described below.
>   *
> + * With CONFIG_FIND_NORMAL_PAGE, we might have the "special" bit set on
> + * page table entries that actually map "normal" pages: however, that page
> + * cannot be looked up through the PFN stored in the page table entry, but
> + * instead will be looked up through vm_ops->find_normal_page(). So far, this
> + * only applies to PTEs.
> + *
>   * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>   * special mapping (even if there are underlying and valid "struct pages").
>   * COWed pages of a VM_PFNMAP are always normal.
> @@ -710,8 +716,10 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>  	unsigned long pfn = pte_pfn(pte);
>
>  	if (unlikely(pte_special(pte))) {
> -		if (vma->vm_ops && vma->vm_ops->find_special_page)
> -			return vma->vm_ops->find_special_page(vma, addr);
> +#ifdef CONFIG_FIND_NORMAL_PAGE
> +		if (vma->vm_ops && vma->vm_ops->find_normal_page)
> +			return vma->vm_ops->find_normal_page(vma, addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
>  		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>  			return NULL;
>  		if (is_zero_pfn(pfn))
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 0fe52fd6782bf..8646af15a5fc0 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -467,13 +467,21 @@ struct vm_operations_struct {
>  	struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
>  					unsigned long addr, pgoff_t *ilx);
>  #endif
> +#ifdef CONFIG_FIND_NORMAL_PAGE
>  	/*
> -	 * Called by vm_normal_page() for special PTEs to find the
> -	 * page for @addr.  This is useful if the default behavior
> -	 * (using pte_page()) would not find the correct page.
> +	 * Called by vm_normal_page() for special PTEs in @vma at @addr. This
> +	 * allows for returning a "normal" page from vm_normal_page() even
> +	 * though the PTE indicates that the "struct page" either does not exist
> +	 * or should not be touched: "special".
> +	 *
> +	 * Do not add new users: this really only works when a "normal" page
> +	 * was mapped, but then the PTE got changed to something weird (+
> +	 * marked special) that would not make pte_pfn() identify the originally
> +	 * inserted page.

Also glorious.

>  	 */
> -	struct page *(*find_special_page)(struct vm_area_struct *vma,
> -					  unsigned long addr);
> +	struct page *(*find_normal_page)(struct vm_area_struct *vma,
> +					 unsigned long addr);
> +#endif /* CONFIG_FIND_NORMAL_PAGE */
>  };
>
>  struct vm_unmapped_area_info {
> --
> 2.50.1
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 19:51   ` Lorenzo Stoakes
  2025-07-17 19:55     ` Lorenzo Stoakes
@ 2025-07-17 20:12     ` David Hildenbrand
  2025-07-18 12:35       ` Lorenzo Stoakes
  1 sibling, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 20:12 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

>>
>> -/*
>> - * vm_normal_page -- This function gets the "struct page" associated with a pte.
>> +/**
>> + * vm_normal_page_pfn() - Get the "struct page" associated with a PFN in a
>> + *			  non-special page table entry.
> 
> This is a bit nebulous/confusing, I mean you'll get PTE entries with PTE special
> bit that'll have a PFN but just no struct page/folio to look at, or should not
> be touched.
> 
> So the _pfn() bit doesn't really properly describe what it does.
> 
> I wonder if it'd be better to just separate out the special handler, have
> that return a boolean indicating special of either form, and then factor
> out the other shared code separately from that?

Let me think about that; I played with various approaches and this was 
the best I could come up with before running in circles.

> 
>> + * @vma: The VMA mapping the @pfn.
>> + * @addr: The address where the @pfn is mapped.
>> + * @pfn: The PFN.
>> + * @entry: The page table entry value for error reporting purposes.
>>    *
>>    * "Special" mappings do not wish to be associated with a "struct page" (either
>>    * it doesn't exist, or it exists but they don't want to touch it). In this
>> @@ -603,10 +608,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>>    * (such as GUP) can still identify these mappings and work with the
>>    * underlying "struct page".
>>    *
>> - * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>> - * pte bit, in which case this function is trivial. Secondly, an architecture
>> - * may not have a spare pte bit, which requires a more complicated scheme,
>> - * described below.
>> + * There are 2 broad cases. Firstly, an architecture may define a "special"
>> + * page table entry bit (e.g., pte_special()), in which case this function is
>> + * trivial. Secondly, an architecture may not have a spare page table
>> + * entry bit, which requires a more complicated scheme, described below.
> 
> Strikes me this bit of the comment should be with vm_normal_page(). As this
> implies the 2 broad cases are handled here and this isn't the case.

Well, pragmatism. Splitting up the doc doesn't make sense. Having it at 
vm_normal_page() doesn't make sense.

I'm sure the educated reader will be able to make sense of it :P

But I'm happy to hear suggestions on how to do it differently :)

> 
>>    *
>>    * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
>>    * special mapping (even if there are underlying and valid "struct pages").
>> @@ -639,15 +644,72 @@ static void print_bad_page_map(struct vm_area_struct *vma,
>>    * don't have to follow the strict linearity rule of PFNMAP mappings in
>>    * order to support COWable mappings.
>>    *
>> + * This function is not expected to be called for obviously special mappings:
>> + * when the page table entry has the "special" bit set.
> 
> Hmm this is a bit weird though, saying "obviously" special, because you're
> handling "special" mappings here, but only for architectures that don't specify
> the PTE special bit.
> 
> So it makes it quite nebulous what constitutes 'obviously' here, really you mean
> pte_special().

Yes, I can clarify that.

> 
>> + *
>> + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
>> + *	   NULL if this is a "special" mapping.
>> + */
>> +static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
>> +		unsigned long addr, unsigned long pfn, unsigned long long entry)
>> +{
>> +	/*
>> +	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
>> +	 * (incl. shared zero folios) are marked accordingly and are handled
>> +	 * by the caller.
>> +	 */
>> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
>> +		if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
>> +			if (vma->vm_flags & VM_MIXEDMAP) {
>> +				/* If it has a "struct page", it's "normal". */
>> +				if (!pfn_valid(pfn))
>> +					return NULL;
>> +			} else {
>> +				unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
>> +
>> +				/* Only CoW'ed anon folios are "normal". */
>> +				if (pfn == vma->vm_pgoff + off)
>> +					return NULL;
>> +				if (!is_cow_mapping(vma->vm_flags))
>> +					return NULL;
>> +			}
>> +		}
>> +
>> +		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
> 
> This handles the zero page / huge zero page case for the non-pte_special()
> path only. I wonder if we even need to bother having these marked special
> generally since you can just check the PFN every time anyway.

Well, that would make (a) pte_special() a bit weird -- not set for some 
special pages -- and (b) require additional runtime checks for the case we 
all really care about -- pte_special().

So I don't think we should change that.
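
To spell out (b) with a heavily simplified sketch: with the zero folios 
marked special, the common case on CONFIG_ARCH_HAS_PTE_SPECIAL stays

	if (unlikely(pte_special(pte)))
		return NULL;	/* zero folios end up on this slow path */
	return pfn_to_page(pte_pfn(pte));

whereas, without marking them special, every single lookup would in addition 
have to do something like

	if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
		return NULL;

even though pte_special() is clear.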

[...]

>>
>> +/**
>> + * vm_normal_folio() - Get the "struct folio" associated with a PTE
>> + * @vma: The VMA mapping the @pte.
>> + * @addr: The address where the @pte is mapped.
>> + * @pte: The PTE.
>> + *
>> + * Get the "struct folio" associated with a PTE. See vm_normal_page_pfn()
>> + * for details.
>> + *
>> + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
>> + *	   NULL if this is a "special" mapping.
>> + */
> 
> Nice to add a comment, but again feels weird to have the whole explanation in
> vm_normal_page_pfn() but then to invoke vm_normal_page()..

You want people to do pointer chasing to find what they are looking for? :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-17 20:03   ` Lorenzo Stoakes
@ 2025-07-17 20:14     ` David Hildenbrand
  2025-07-18 10:47       ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 20:14 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 17.07.25 22:03, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
>> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
>> because of our new common helpers and there not being a PUD-sized zero
>> folio.
>>
>> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
>> structuring the code like the other (pmd/pte) cases. Defer
>> introducing vm_normal_folio_pud() until really used.
> 
> I mean fine :P but does anybody really use this?

This is a unified PFN walker (!hugetlb + hugetlb), so you can easily run 
into hugetlb PUDs, DAX PUDs and huge pfnmap (vfio) PUDs :)
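
E.g., with the mmap lock held, a caller does something like the following 
(usage sketch, assuming the existing folio_walk API with fw.level / 
FW_LEVEL_PUD):

	struct folio_walk fw;
	struct folio *folio;

	folio = folio_walk_start(&fw, vma, addr, 0);
	if (folio) {
		/* e.g., a hugetlb or DAX folio mapped by a PUD ends up here */
		if (fw.level == FW_LEVEL_PUD)
			pr_info("PUD-mapped folio\n");
		folio_walk_end(&fw, vma);
	}
	/* a special huge pfnmap (vfio) PUD simply results in NULL */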

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-17 18:29   ` Lorenzo Stoakes
@ 2025-07-17 20:31     ` David Hildenbrand
  2025-07-18 10:41       ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-17 20:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 17.07.25 20:29, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>> The huge zero folio is refcounted (+mapcounted -- is that a word?)
>> differently than "normal" folios, similarly (but different) to the ordinary
>> shared zeropage.
> 
> Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> pages?

I wish we could get rid of the weird refcounting of the huge zero folio 
and get rid of the shrinker. But as long as the shrinker exists, I'm 
afraid that weird per-process refcounting must stay.

> 
> But for some reason the huge zero page wants to exist or not exist based on
> usage for one. Which is stupid to me.

Yes, I will try at some point (once we have the static huge zero folio) 
to remove the shrinker part and make it always static. Well, at least 
for reasonable architectures.

> 
>>
>> For this reason, we special-case these pages in
>> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>> still use them (e.g., GUP can still take a reference on them).
>>
>> vm_normal_page_pmd() already filters out the huge zero folio. However,
>> so far we are not marking it as special like we do with the ordinary
>> shared zeropage. Let's mark it as special, so we can further refactor
>> vm_normal_page_pmd() and vm_normal_page().
>>
>> While at it, update the doc regarding the shared zero folios.
> 
> Hmm I wonder how this will interact with the static PMD series at [0]?

No, it shouldn't.

> 
> I wonder if more use of that might result in some weirdness with refcounting
> etc.?

I don't think so.

> 
> Also, that series was (though I reviewed against it) moving stuff that
> references the huge zero folio out of there, but also generally allows
> access and mapping of this folio via largest_zero_folio() so not only via
> insert_pmd().
> 
> So we're going to end up with mappings of this that are not marked special
> that are potentially going to have refcount/mapcount manipulation that
> contradict what you're doing here perhaps?

I don't think so. It's just like having the existing static (small) 
shared zeropage where the same rules about refcounting+mapcounting apply.

> 
> [0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/
> 
>>
>> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> I looked thorugh places that use vm_normal_page_pm() (other than decl of
> function):
> 
> fs/proc/task_mmu.c - seems to handle NULL page correctly + still undertsands zero page
> mm/pagewalk.c - correctly handles NULL page + huge zero page
> mm/huge_memory.c - can_change_pmd_writable() correctly returns false.
> 
> And all seems to work wtih this change.
> 
> Overall, other than concerns above + nits below LGTM, we should treat all
> the zero folios the same in this regard, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Thanks!

> 
>> ---
>>   mm/huge_memory.c |  5 ++++-
>>   mm/memory.c      | 14 +++++++++-----
>>   2 files changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index db08c37b87077..3f9a27812a590 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
>>   {
>>   	pmd_t entry;
>>   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
>> +	entry = pmd_mkspecial(entry);
>>   	pgtable_trans_huge_deposit(mm, pmd, pgtable);
>>   	set_pmd_at(mm, haddr, pmd, entry);
>>   	mm_inc_nr_ptes(mm);
>> @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
>>   	if (fop.is_folio) {
>>   		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
>>
>> -		if (!is_huge_zero_folio(fop.folio)) {
>> +		if (is_huge_zero_folio(fop.folio)) {
>> +			entry = pmd_mkspecial(entry);
>> +		} else {
>>   			folio_get(fop.folio);
>>   			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>>   			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 92fd18a5d8d1f..173eb6267e0ac 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * "Special" mappings do not wish to be associated with a "struct page" (either
>>    * it doesn't exist, or it exists but they don't want to touch it). In this
>> - * case, NULL is returned here. "Normal" mappings do have a struct page.
>> + * case, NULL is returned here. "Normal" mappings do have a struct page and
>> + * are ordinarily refcounted.
>> + *
>> + * Page mappings of the shared zero folios are always considered "special", as
>> + * they are not ordinarily refcounted. However, selected page table walkers
>> + * (such as GUP) can still identify these mappings and work with the
>> + * underlying "struct page".
> 
> I feel like we need more detail or something more explicit about what 'not
> ordinary' refcounting constitutes. This is a bit vague.

Hm, I am not sure this is the correct place to document that. But let me 
see if I can come up with something reasonable.

(like, the refcount and mapcount of these folios is never adjusted when 
mapping them into page tables)
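
Something like this, maybe (wording sketch only):

	 * Page mappings of the shared zero folios are always considered
	 * "special": the folio refcount and mapcount are never adjusted when
	 * the zero folios get mapped into or unmapped from page tables.
	 * Selected page table walkers (such as GUP) can still identify these
	 * mappings and work with the underlying "struct page".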

> 
>>    *
>>    * There are 2 broad cases. Firstly, an architecture may define a pte_special()
>>    * pte bit, in which case this function is trivial. Secondly, an architecture
>> @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>    *
>>    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
>>    * page" backing, however the difference is that _all_ pages with a struct
>> - * page (that is, those where pfn_valid is true) are refcounted and considered
>> - * normal pages by the VM. The only exception are zeropages, which are
>> - * *never* refcounted.
>> + * page (that is, those where pfn_valid is true, except the shared zero
>> + * folios) are refcounted and considered normal pages by the VM.
>>    *
>>    * The disadvantage is that pages are refcounted (which can be slower and
>>    * simply not an option for some PFNMAP users). The advantage is that we
>> @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> 
You know, I'm semi-ashamed to admit I didn't even know this function 
exists. But yikes that we have a separate function like this just for PMDs.

It's a bit new-ish :)


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
  2025-07-17 19:17   ` Lorenzo Stoakes
@ 2025-07-17 22:06   ` Demi Marie Obenour
  2025-07-18  7:44     ` David Hildenbrand
  1 sibling, 1 reply; 48+ messages in thread
From: Demi Marie Obenour @ 2025-07-17 22:06 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang


[-- Attachment #1.1.1: Type: text/plain, Size: 2539 bytes --]

On 7/17/25 07:52, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
> 
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
> 
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
> 
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
> 
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
> 
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
> 
> The report will now look something like (dumping pgd to pmd values):
> 
> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> 
> Not using pgdp_get(), because that does not work properly on some arm
> configs where pgd_t is an array. Note that we are dumping all levels
> even when levels are folded for simplicity.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Should this still use a WARN?  If the admin sets panic-on-warn they
have asked for "crash if anything goes wrong" and so that is what
they should get.  Otherwise the system will still stay up.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 22:06   ` Demi Marie Obenour
@ 2025-07-18  7:44     ` David Hildenbrand
  2025-07-18  7:59       ` Demi Marie Obenour
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-18  7:44 UTC (permalink / raw)
  To: Demi Marie Obenour, linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 18.07.25 00:06, Demi Marie Obenour wrote:
> On 7/17/25 07:52, David Hildenbrand wrote:
>> print_bad_pte() looks like something that should actually be a WARN
>> or similar, but historically it apparently has proven to be useful to
>> detect corruption of page tables even on production systems -- report
>> the issue and keep the system running to make it easier to actually detect
>> what is going wrong (e.g., multiple such messages might shed a light).
>>
>> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
>> to take care of print_bad_pte() as well.
>>
>> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
>> implementation and renaming the function -- we'll rename it to what
>> we actually print: bad (page) mappings. Maybe it should be called
>> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
>> because the assumption is that we are dealing with some (previously)
>> present page table entry that got corrupted in weird ways.
>>
>> Whether it is a PTE or something else will usually become obvious from the
>> page table dump or from the dumped stack. If ever required in the future,
>> we could pass the entry level type similar to "enum rmap_level". For now,
>> let's keep it simple.
>>
>> To make the function a bit more readable, factor out the ratelimit check
>> into is_bad_page_map_ratelimited() and place the dumping of page
>> table content into __dump_bad_page_map_pgtable(). We'll now dump
>> information from each level in a single line, and just stop the table
>> walk once we hit something that is not a present page table.
>>
>> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
>> for vm_normal_page(), now that we have a function that can handle it.
>>
>> The report will now look something like (dumping pgd to pmd values):
>>
>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>
>> Not using pgdp_get(), because that does not work properly on some arm
>> configs where pgd_t is an array. Note that we are dumping all levels
>> even when levels are folded for simplicity.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> Should this still use a WARN?  If the admin sets panic-on-warn they
> have asked for "crash if anything goes wrong" and so that is what
> they should get.  Otherwise the system will still stay up.

I assume your comment is in the context of the other proposal regarding 
panicking.

It's a good question whether we should WARN: likely we should convert 
the "BUG:" ... message into a WARN. On panic-on-warn you'd panic 
immediately without being able to observe any other such messages (and 
as discussed in the RFC, apparently that can be valuable for debugging, 
because a single such report is often insufficient).

But as panic-on-warn is "panic on the first sight of a problem", that 
sounds right.

That change should not be part of this patch, though.
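
(For later reference, that conversion could be as simple as the following, 
sketch only:

	-	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
	+	WARN(1, "Bad page map in process %s  entry:%08llx", current->comm, entry);

possibly dropping the separate stack dump, because WARN already provides one.)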

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-18  7:44     ` David Hildenbrand
@ 2025-07-18  7:59       ` Demi Marie Obenour
  2025-07-18  8:26         ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Demi Marie Obenour @ 2025-07-18  7:59 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang


[-- Attachment #1.1.1: Type: text/plain, Size: 3957 bytes --]

On 7/18/25 03:44, David Hildenbrand wrote:
> On 18.07.25 00:06, Demi Marie Obenour wrote:
>> On 7/17/25 07:52, David Hildenbrand wrote:
>>> print_bad_pte() looks like something that should actually be a WARN
>>> or similar, but historically it apparently has proven to be useful to
>>> detect corruption of page tables even on production systems -- report
>>> the issue and keep the system running to make it easier to actually detect
>>> what is going wrong (e.g., multiple such messages might shed a light).
>>>
>>> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
>>> to take care of print_bad_pte() as well.
>>>
>>> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
>>> implementation and renaming the function -- we'll rename it to what
>>> we actually print: bad (page) mappings. Maybe it should be called
>>> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
>>> because the assumption is that we are dealing with some (previously)
>>> present page table entry that got corrupted in weird ways.
>>>
>>> Whether it is a PTE or something else will usually become obvious from the
>>> page table dump or from the dumped stack. If ever required in the future,
>>> we could pass the entry level type similar to "enum rmap_level". For now,
>>> let's keep it simple.
>>>
>>> To make the function a bit more readable, factor out the ratelimit check
>>> into is_bad_page_map_ratelimited() and place the dumping of page
>>> table content into __dump_bad_page_map_pgtable(). We'll now dump
>>> information from each level in a single line, and just stop the table
>>> walk once we hit something that is not a present page table.
>>>
>>> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
>>> for vm_normal_page(), now that we have a function that can handle it.
>>>
>>> The report will now look something like (dumping pgd to pmd values):
>>>
>>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>>
>>> Not using pgdp_get(), because that does not work properly on some arm
>>> configs where pgd_t is an array. Note that we are dumping all levels
>>> even when levels are folded for simplicity.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> Should this still use a WARN?  If the admin sets panic-on-warn they
>> have asked for "crash if anything goes wrong" and so that is what
>> they should get.  Otherwise the system will still stay up.
> 
> I assume your comment is in the context of the other proposal regarding
> panicking.

Which one?  I'm only subscribed to xen-devel and might have missed it.

> It's a good question whether we should WARN: likely we should convert 
> the "BUG:" ... message into a WARN. On panic-on-warn you'd panic 
> immediately without being able to observe any other such messages (and 
> as discussed in the RFC, apparently that can be valuable for debugging, 
> because a single such report is often insufficient)
> 
> But as panic-on-warn is "panic on the first sight of a problem", that 
> sounds right.

There are other advantages to WARN() as well:

- Automated tools (like syzkaller) know how to deal with it (though they
  probably know how to deal with this too).

- It counts against kernel.warn_limit, so an administrator can choose
  to have the system panic if there are a huge number of warnings.
  (My completely uninformed guess is that "other such messages" would
  usually be less than 100, and that once one has gotten into the
  quadruple digits, the kernel is probably hopelessly confused.)

- It shows up in /sys/kernel/warn_count.

> That change should not be part of this patch, though.

Fair.
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-18  7:59       ` Demi Marie Obenour
@ 2025-07-18  8:26         ` David Hildenbrand
  0 siblings, 0 replies; 48+ messages in thread
From: David Hildenbrand @ 2025-07-18  8:26 UTC (permalink / raw)
  To: Demi Marie Obenour, linux-kernel
  Cc: linux-mm, xen-devel, linux-fsdevel, nvdimm, Andrew Morton,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	Dan Williams, Matthew Wilcox, Jan Kara, Alexander Viro,
	Christian Brauner, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 18.07.25 09:59, Demi Marie Obenour wrote:
> On 7/18/25 03:44, David Hildenbrand wrote:
>> On 18.07.25 00:06, Demi Marie Obenour wrote:
>>> On 7/17/25 07:52, David Hildenbrand wrote:
>>>> print_bad_pte() looks like something that should actually be a WARN
>>>> or similar, but historically it apparently has proven to be useful to
>>>> detect corruption of page tables even on production systems -- report
>>>> the issue and keep the system running to make it easier to actually detect
>>>> what is going wrong (e.g., multiple such messages might shed a light).
>>>>
>>>> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
>>>> to take care of print_bad_pte() as well.
>>>>
>>>> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
>>>> implementation and renaming the function -- we'll rename it to what
>>>> we actually print: bad (page) mappings. Maybe it should be called
>>>> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
>>>> because the assumption is that we are dealing with some (previously)
>>>> present page table entry that got corrupted in weird ways.
>>>>
>>>> Whether it is a PTE or something else will usually become obvious from the
>>>> page table dump or from the dumped stack. If ever required in the future,
>>>> we could pass the entry level type similar to "enum rmap_level". For now,
>>>> let's keep it simple.
>>>>
>>>> To make the function a bit more readable, factor out the ratelimit check
>>>> into is_bad_page_map_ratelimited() and place the dumping of page
>>>> table content into __dump_bad_page_map_pgtable(). We'll now dump
>>>> information from each level in a single line, and just stop the table
>>>> walk once we hit something that is not a present page table.
>>>>
>>>> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
>>>> for vm_normal_page(), now that we have a function that can handle it.
>>>>
>>>> The report will now look something like (dumping pgd to pmd values):
>>>>
>>>> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
>>>> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
>>>> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
>>>>
>>>> Not using pgdp_get(), because that does not work properly on some arm
>>>> configs where pgd_t is an array. Note that we are dumping all levels
>>>> even when levels are folded for simplicity.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> Should this still use a WARN?  If the admin sets panic-on-warn they
>>> have asked for "crash if anything goes wrong" and so that is what
>>> they should get.  Otherwise the system will still stay up.
>>
> > I assume your comment is in the context of the other proposal regarding
>> panicking.
> 
> Which one?  I'm only subscribed to xen-devel and might have missed it.

This one here:

https://lkml.kernel.org/r/4e1b7d2d-ed54-4e0a-a0a4-906b14d9cd41@p183

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-17 20:03     ` David Hildenbrand
@ 2025-07-18 10:15       ` Lorenzo Stoakes
  2025-07-18 11:04         ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 10:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> > > [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that we are dumping all levels
> > > even when levels are folded for simplicity.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently it
> can be useful in the real world.

Yeah... I mean, out of scope here, but that sounds dubious. On the other hand,
we use typedefs for page table values etc., so we do make this possible.

>
> >
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> > >   1 file changed, 94 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 173eb6267e0ac..08d16ed7b4cc7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > >   			add_mm_counter(mm, i, rss[i]);
> > >   }
> > >
> > > -/*
> > > - * This function is called to print an error when a bad pte
> > > - * is found. For example, we might have a PFN-mapped pte in
> > > - * a region that doesn't allow it.
> > > - *
> > > - * The calling function must still handle the error.
> > > - */
> > > -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > -			  pte_t pte, struct page *page)
> > > +static bool is_bad_page_map_ratelimited(void)
> > >   {
> > > -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > -	p4d_t *p4d = p4d_offset(pgd, addr);
> > > -	pud_t *pud = pud_offset(p4d, addr);
> > > -	pmd_t *pmd = pmd_offset(pud, addr);
> > > -	struct address_space *mapping;
> > > -	pgoff_t index;
> > >   	static unsigned long resume;
> > >   	static unsigned long nr_shown;
> > >   	static unsigned long nr_unshown;
> > > @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >   	if (nr_shown == 60) {
> > >   		if (time_before(jiffies, resume)) {
> > >   			nr_unshown++;
> > > -			return;
> > > +			return true;
> > >   		}
> > >   		if (nr_unshown) {
> > >   			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> > > @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >   	}
> > >   	if (nr_shown++ == 0)
> > >   		resume = jiffies + 60 * HZ;
> > > +	return false;
> > > +}
> > > +
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	unsigned long long pgdv, p4dv, pudv, pmdv;
> >
> > > +	p4d_t p4d, *p4dp;
> > > +	pud_t pud, *pudp;
> > > +	pmd_t pmd, *pmdp;
> > > +	pgd_t *pgdp;
> > > +
> > > +	/*
> > > +	 * This looks like a fully lockless walk, however, the caller is
> > > +	 * expected to hold the leaf page table lock in addition to other
> > > +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > +	 * content while any concurrent modifications should be completely
> > > +	 * prevented.
> > > +	 */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)

Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
but I guess the intent is that the caller _must_ hold this lock.

I know it's nitty and annoying (sorry!) but as asserting seems to not be a
possibility here, could we spell these out as a series of points like:

/*
 * The caller MUST hold the following locks:
 *
 * - Leaf page table lock
 * - Appropriate VMA lock to keep VMA stable
 */

I don't _actually_ think you need the rmap lock then, as none of the page tables
you access would be impacted by any rmap action afaict, with these locks held.
>
>
> >
> > We need to:
> >
> > - Keep VMA stable which prevents unmap page table teardown and khugepaged
> >    collapse.
> > - (not relevant as we don't traverse PTE table but) RCU lock for PTE
> >    entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to racing
> > faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map(), that it is
> expected to be called with the PTL (unwritten rule, not changing that).

Right, see above, just needs more clarity then.

>
> >
> > So perhaps this why you require such strict conditions...
> >
> > But can you truly be sure of these existing? And we should then assert them
> > here no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nastiness of a code is called in case our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not too interested in adding anything more here. If you run into this
> code you're in big trouble already.

Yes am aware :) my concern is NULL ptr deref or UAF, but with the locks
held as stated those won't occur.

But if it's not sensible to do it then we don't have to :) I am a reasonable
man, or like to think I am ;)

But I think we need clarity as per the above.

>
> >
> > > +	pgdp = pgd_offset(mm, addr);
> > > +	pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> > 	"Shoudln't we strictly speaking use pgdp_get()? I see you use this
> > 	 helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)

Pain.

>
> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table entry (e.g.,
> > > + * corrupted page table entry) is found. For example, we might have a
> > > + * PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table traversal... but
> > no mention of them here?
>
> Yeah, I can add that.

Thanks!

>
> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > +		unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +	pgoff_t index;
> > > +
> > > +	if (is_bad_page_map_ratelimited())
> > > +		return;
> > >
> > >   	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > >   	index = linear_page_index(vma, addr);
> > >
> > > -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> > > -		 current->comm,
> > > -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the 'entry'
> > is? But I guess the dump below will tell you.
>
> You probably missed in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future, we
> could pass the entry level type similar to "enum rmap_level". For now, let's
> keep it simple."

Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
is fine then.

>
> >
> > Though maybe actually useful to see flags etc. in case some horrid
> > corruption happened and maybe dump isn't valid? But then the dump assumes
> > strict conditions to work so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here will
> not save you.
>
> You'll probably get a flood of 60 ... if your system doesn't collapse before
> that.

I was musing wistfully about the value of outputting the entries. But I
guess _any_ information before a crash is useful. But your dump output will
be a lot more useful :)

>
> >
> > > +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > >   	if (page)
> > > -		dump_page(page, "bad pte");
> > > +		dump_page(page, "bad page map");
> > >   	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> > >   		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > >   	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >   		if (is_zero_pfn(pfn))
> > >   			return NULL;
> > >
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >
> > >   check_pfn:
> > >   	if (unlikely(pfn > highest_memmap_pfn)) {
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but it's
> > a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch I add
> a comment.

Ohhh right. I was overthinking this haha

>
> >
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >   {
> > >   	unsigned long pfn = pmd_pfn(pmd);
> > >
> > > -	if (unlikely(pmd_special(pmd)))
> > > +	if (unlikely(pmd_special(pmd))) {
> > > +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +			return NULL;
> >
> > I guess we'll bring this altogether in a later patch with vm_normal_page()
> > as getting a little duplicative :P
>
> Not that part, but the other part :)

Yes :)

>
> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than fact
> > 'special' is nebulous and overloaded in general) in that it contains stuff
> > that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
> > specialness wrt to underlying folio.
>
> It is.

Yeah I think we in mm periodically moan about this whole thing...

>
> >
> > Then we have VM_IO, which strictly must not have an associated page right?
>
> VM_IO just means read/write side-effects, I think you could have ones with
> a memmap easily ... e.g., memory section (128MiB) spanning both memory and
> MMIO regions.

Hmm, but why not have two separate VMAs? I guess I need to look more into
what this flag actually affects.

But in terms of VMA manipulation in any case it really is no different from
e.g. VM_PFNMAP in terms of what it affects. Though well. I say that. Except
of course this whole vm_normal_page() thing...!

But I've not seen VM_IO set alone anywhere. On the other hand, I haven't
really dug deep on this...

>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-17 20:31     ` David Hildenbrand
@ 2025-07-18 10:41       ` Lorenzo Stoakes
  2025-07-18 10:54         ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 10:41 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
> On 17.07.25 20:29, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> > > The huge zero folio is refcounted (+mapcounted -- is that a word?)
> > > differently than "normal" folios, similarly (but different) to the ordinary
> > > shared zeropage.
> >
> > Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> > pages?
>
> I wish we could get rid of the weird refcounting of the huge zero folio and
> get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
> weird per-process refcounting must stay.

Does this change of yours cause any issue with it? I mean now nothing can grab
this page using vm_normal_page_pmd(), so it won't be able to manipulate
refcounts.

Does this impact the shrink stuff at all? Haven't traced it all through.

>
> >
> > But for some reason the huge zero page wants to exist or not exist based on
> > usage for one. Which is stupid to me.
>
> Yes, I will try at some point (once we have the static huge zero folio) to
> remove the shrinker part and make it always static. Well, at least for
> reasonable architectures.

Yes, this seems the correct way to move forward.

And we need to get rid of (or neuter) the 'unreasonable' architectures...

>
> >
> > >
> > > For this reason, we special-case these pages in
> > > vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> > > still use them (e.g., GUP can still take a reference on them).
> > >
> > > vm_normal_page_pmd() already filters out the huge zero folio. However,
> > > so far we are not marking it as special like we do with the ordinary
> > > shared zeropage. Let's mark it as special, so we can further refactor
> > > vm_normal_page_pmd() and vm_normal_page().
> > >
> > > While at it, update the doc regarding the shared zero folios.
> >
> > Hmm I wonder how this will interact with the static PMD series at [0]?
>
> No, it shouldn't.

I'm always nervous about these kinds of things :)

I'm assuming the reference/map counting will still work properly with the static
page?

I think I raised that in that series :P

>
> >
> > I wonder if more use of that might result in some weirdness with refcounting
> > etc.?
>
> I don't think so.

Good :>)

>
> >
> > Also, that series was (though I reviewed against it) moving stuff that
> > references the huge zero folio out of there, but also generally allows
> > access and mapping of this folio via largest_zero_folio() so not only via
> > insert_pmd().
> >
> > So we're going to end up with mappings of this that are not marked special
> > that are potentially going to have refcount/mapcount manipulation that
> > contradict what you're doing here perhaps?
>
> I don't think so. It's just like having the existing static (small) shared
> zeropage where the same rules about refcounting+mapcounting apply.

It feels like all this is a mess... am I missing something that would make it
all make sense?

It's not sane to make zero pages disappear based on non-usage in 2025. Sorry if
that little bit of RAM hurts you, use a microcontroller... after which it doesn't
really mean anything to have ref/map counts...

>
> >
> > [0]: https://lore.kernel.org/all/20250707142319.319642-1-kernel@pankajraghav.com/
> >
> > >
> > > Reviewed-by: Oscar Salvador <osalvador@suse.de>
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> >
> > I looked thorugh places that use vm_normal_page_pm() (other than decl of
> > function):
> >
> > fs/proc/task_mmu.c - seems to handle NULL page correctly + still understands zero page
> > mm/pagewalk.c - correctly handles NULL page + huge zero page
> > mm/huge_memory.c - can_change_pmd_writable() correctly returns false.
> >
> > And all seems to work wtih this change.
> >
> > Overall, other than concerns above + nits below LGTM, we should treat all
> > the zero folios the same in this regard, so:
> >
> > Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>
> Thanks!

No problem! Thanks for the cleanups, these are great...

>
> >
> > > ---
> > >   mm/huge_memory.c |  5 ++++-
> > >   mm/memory.c      | 14 +++++++++-----
> > >   2 files changed, 13 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index db08c37b87077..3f9a27812a590 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1320,6 +1320,7 @@ static void set_huge_zero_folio(pgtable_t pgtable, struct mm_struct *mm,
> > >   {
> > >   	pmd_t entry;
> > >   	entry = folio_mk_pmd(zero_folio, vma->vm_page_prot);
> > > +	entry = pmd_mkspecial(entry);
> > >   	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > >   	set_pmd_at(mm, haddr, pmd, entry);
> > >   	mm_inc_nr_ptes(mm);
> > > @@ -1429,7 +1430,9 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >   	if (fop.is_folio) {
> > >   		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> > >
> > > -		if (!is_huge_zero_folio(fop.folio)) {
> > > +		if (is_huge_zero_folio(fop.folio)) {
> > > +			entry = pmd_mkspecial(entry);
> > > +		} else {
> > >   			folio_get(fop.folio);
> > >   			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
> > >   			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 92fd18a5d8d1f..173eb6267e0ac 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -537,7 +537,13 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * "Special" mappings do not wish to be associated with a "struct page" (either
> > >    * it doesn't exist, or it exists but they don't want to touch it). In this
> > > - * case, NULL is returned here. "Normal" mappings do have a struct page.
> > > + * case, NULL is returned here. "Normal" mappings do have a struct page and
> > > + * are ordinarily refcounted.
> > > + *
> > > + * Page mappings of the shared zero folios are always considered "special", as
> > > + * they are not ordinarily refcounted. However, selected page table walkers
> > > + * (such as GUP) can still identify these mappings and work with the
> > > + * underlying "struct page".
> >
> > I feel like we need more detail or something more explicit about what 'not
> > ordinary' refcounting constitutes. This is a bit vague.
>
> Hm, I am not sure this is the correct place to document that. But let me see
> if I can come up with something reasonable
>
> (like, the refcount and mapcount of these folios is never adjusted when
> mapping them into page tables)

I think having _something_ is good even if perhaps not ideally situated... :)

>
> >
> > >    *
> > >    * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> > >    * pte bit, in which case this function is trivial. Secondly, an architecture
> > > @@ -567,9 +573,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >    *
> > >    * VM_MIXEDMAP mappings can likewise contain memory with or without "struct
> > >    * page" backing, however the difference is that _all_ pages with a struct
> > > - * page (that is, those where pfn_valid is true) are refcounted and considered
> > > - * normal pages by the VM. The only exception are zeropages, which are
> > > - * *never* refcounted.
> > > + * page (that is, those where pfn_valid is true, except the shared zero
> > > + * folios) are refcounted and considered normal pages by the VM.
> > >    *
> > >    * The disadvantage is that pages are refcounted (which can be slower and
> > >    * simply not an option for some PFNMAP users). The advantage is that we
> > > @@ -649,7 +654,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> >
> > You know I"m semi-ashamed to admit I didn't even know this function
> > exists. But yikes that we have a separate function like this just for PMDs.
>
> It's a bit new-ish :)

OK I feel less bad about it then... -ish ;)

>
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-17 20:14     ` David Hildenbrand
@ 2025-07-18 10:47       ` Lorenzo Stoakes
  2025-07-18 11:06         ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 10:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 10:14:33PM +0200, David Hildenbrand wrote:
> On 17.07.25 22:03, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
> > > Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> > > because of our new common helpers and there not being a PUD-sized zero
> > > folio.
> > >
> > > Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> > > structuring the code like the other (pmd/pte) cases. Defer
> > > introducing vm_normal_folio_pud() until really used.
> >
> > I mean fine :P but does anybody really use this?
>
> This is a unified PFN walker (!hugetlb + hugetlb), so you can easily run
> into hugetlb PUDs, DAX PUDs and huge pfnmap (vfio) PUDs :)

Ahhh ok. I hate hugetlb so very very much.

Oscar is doing the Lord's work improving things but the trauma is still
there... :P

Also yeah DAX ahem.

I'm not familiar with huge pfnmap PUDs, could you give me a hint on this? :>)

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-18 10:41       ` Lorenzo Stoakes
@ 2025-07-18 10:54         ` David Hildenbrand
  2025-07-18 13:06           ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-18 10:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 18.07.25 12:41, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
>> On 17.07.25 20:29, Lorenzo Stoakes wrote:
>>> On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>>>> The huge zero folio is refcounted (+mapcounted -- is that a word?)
>>>> differently than "normal" folios, similarly (but different) to the ordinary
>>>> shared zeropage.
>>>
>>> Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
>>> pages?
>>
>> I wish we could get rid of the weird refcounting of the huge zero folio and
>> get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
>> weird per-process refcounting must stay.
> 
> Does this change of yours cause any issue with it? I mean now nothing can grab
> this page using vm_normal_page_pmd(), so it won't be able to manipulate
> refcounts.

Please look again at vm_normal_page_pmd(): we have a manual 
huge_zero_pfn() check in there! There is no change in behavior. :)

It's not obvious from the diff below. But huge zero folio was considered 
special before this change, just not marked accordingly.

>>
>>>
>>>>
>>>> For this reason, we special-case these pages in
>>>> vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>>>> still use them (e.g., GUP can still take a reference on them).
>>>>
>>>> vm_normal_page_pmd() already filters out the huge zero folio. However,
>>>> so far we are not marking it as special like we do with the ordinary
>>>> shared zeropage. Let's mark it as special, so we can further refactor
>>>> vm_normal_page_pmd() and vm_normal_page().
>>>>
>>>> While at it, update the doc regarding the shared zero folios.
>>>
>>> Hmm I wonder how this will interact with the static PMD series at [0]?
>>
>> No, it shouldn't.
> 
> I'm always nervous about these kinds of things :)
> 
> I'm assuming the reference/map counting will still work properly with the static
> page?

Let me stress again: no change in behavior besides setting the special 
flag in this patch. Return value of vm_normal_page_pmd() is not changed.

>>>
>>> Also, that series was (though I reviewed against it) moving stuff that
>>> references the huge zero folio out of there, but also generally allows
>>> access and mapping of this folio via largest_zero_folio() so not only via
>>> insert_pmd().
>>>
>>> So we're going to end up with mappings of this that are not marked special
>>> that are potentially going to have refcount/mapcount manipulation that
>>> contradict what you're doing here perhaps?
>>
>> I don't think so. It's just like having the existing static (small) shared
>> zeropage where the same rules about refcounting+mapcounting apply.
> 
> It feels like all this is a mess... am I missing something that would make it
> all make sense?

Let me clarify:

The small zeropage is never refcounted+mapcounted when mapped into page 
tables.

The huge zero folio is never refcounted+mapcounted when mapped into page 
tables EXCEPT that it is refcounted in a weird, different way when first mapped 
into a process.

The whole reason is the shrinker. I don't like it, but there was a 
reason it was added. Maybe that reason no longer exists, but that's 
nothing that this patch series is concerned with, really. :)
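
If it helps, the "weird" part is roughly this (simplified sketch from memory,
names approximate, not code from this series -- see mm/huge_memory.c for the
real thing):

/*
 * Mapping the huge zero folio into page tables never touches its
 * refcount/mapcount; only the *first* use per mm pins it once, and the
 * shrinker may free the folio once no mm pins it anymore.
 */
static bool mm_get_huge_zero_folio(struct mm_struct *mm)
{
	if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
		return true;	/* this mm already holds its single reference */

	if (!get_huge_zero_folio())	/* allocate if needed + take a reference */
		return false;

	/* Raced with another thread of this mm? Drop the extra reference. */
	if (test_and_set_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
		put_huge_zero_folio();
	return true;
}

The mm drops that single reference again at teardown, which is what eventually
lets the shrinker free the folio.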

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-18 10:15       ` Lorenzo Stoakes
@ 2025-07-18 11:04         ` David Hildenbrand
  2025-07-18 12:55           ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-18 11:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

> 
> Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
> but I guess the intent is that the caller _must_ hold this lock.
> 
> I know it's nitty and annoying (sorry!) but as asserting seems to not be a
> possibility here, could we spell these out as a series of points like:
> 
> /*
>   * The caller MUST hold the following locks:
>   *
>   * - Leaf page table lock
>   * - Appropriate VMA lock to keep VMA stable
>   */
> 
> I don't _actually_ think you need the rmap lock then, as none of the page tables
> you access would be impacted by any rmap action afaict, with these locks held.

I don't enjoy wrong comments ;)

This can be called from rmap code when doing a vm_normal_page() while 
holding the PTL.

Really, I think we are over-thinking a helper that is triggered in a 
specific context when the world is about to collide.

This is not your general-purpose API.

Maybe I should have never added a comment. Maybe I should just not have 
done this patch, because I really don't want to do more than the bare 
minimum to print_bad_page_map().

Because I deeply detest it, and no comments we will add will change that.

[...]

>>> But can you truly be sure of these existing? And we should then assert them
>>> here no? For rmap though we'd need the folio/vma.
>>
>> I hope you realize that this nastiness of a code is called in case our
>> system is already running into something extremely unexpected and will
>> probably be dead soon.
>>
>> So I am not too interested in adding anything more here. If you run into this
>> code you're in big trouble already.
> 
> Yes am aware :) my concern is NULL ptr deref or UAF, but with the locks
> held as stated those won't occur.
> 
> But f it's not sensible to do it then we don't have to :) I am a reasonable
> man, or like to think I am ;)
> 
> But I think we need clarity as per the above.
> 
>>
>>>
>>>> +	pgdp = pgd_offset(mm, addr);
>>>> +	pgdv = pgd_val(*pgdp);
>>>
>>> Before I went and looked again at the commit msg I said:
>>>
>>> 	"Shoudln't we strictly speaking use pgdp_get()? I see you use this
>>> 	 helper for other levels."
>>>
>>> But obviously yeah. You explained the insane reason why not.
>>
>> Had to find out the hard way ... :)
> 
> Pain.
> 
>>
>> [...]
>>
>>>> +/*
>>>> + * This function is called to print an error when a bad page table entry (e.g.,
>>>> + * corrupted page table entry) is found. For example, we might have a
>>>> + * PFN-mapped pte in a region that doesn't allow it.
>>>> + *
>>>> + * The calling function must still handle the error.
>>>> + */
>>>
>>> We have extremely strict locking conditions for the page table traversal... but
>>> no mention of them here?
>>
>> Yeah, I can add that.
> 
> Thanks!
> 
>>
>>>
>>>> +static void print_bad_page_map(struct vm_area_struct *vma,
>>>> +		unsigned long addr, unsigned long long entry, struct page *page)
>>>> +{
>>>> +	struct address_space *mapping;
>>>> +	pgoff_t index;
>>>> +
>>>> +	if (is_bad_page_map_ratelimited())
>>>> +		return;
>>>>
>>>>    	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
>>>>    	index = linear_page_index(vma, addr);
>>>>
>>>> -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
>>>> -		 current->comm,
>>>> -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
>>>> +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
>>>
>>> Sort of wonder if this is even useful if you don't know what the 'entry'
>>> is? But I guess the dump below will tell you.
>>
>> You probably missed in the patch description:
>>
>> "Whether it is a PTE or something else will usually become obvious from the
>> page table dump or from the dumped stack. If ever required in the future, we
>> could pass the entry level type similar to "enum rmap_level". For now, let's
>> keep it simple."
> 
> Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
> is fine then.

Let me play with indicating the page table level, but it's the kind of 
stuff I wouldn't want to do in this series here.

>>
>>>
>>> Then we have VM_IO, which strictly must not have an associated page right?
>>
>> VM_IO just means read/write side-effects, I think you could have ones with
>> a memmap easily ... e.g., memory section (128MiB) spanning both memory and
>> MMIO regions.
> 
> Hmm, but why not have two separate VMAs? I guess I need to look into more
> what this flag actually effects.

Oh, I meant that we might have a "struct page" for MMIO memory 
(pfn_valid() == true).

In a MIXEDMAP that will get refcounted. Not sure if there are users that 
use VM_IO in a MIXEDMAP, I would assume so but didn't check.

So VM_IO doesn't really interact with vm_normal_page(), really. It's all 
about PFNMAP and MIXEDMAP.
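
(Illustrative sketch only -- "mydrv" and the pgoff->pfn convention are made up:
a plain remap_pfn_range() user never even sets VM_IO itself, the helper does.)

static int mydrv_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long pfn = vma->vm_pgoff;	/* device memory, MMIO or not */

	/*
	 * remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP (plus
	 * VM_DONTEXPAND | VM_DONTDUMP), so vm_normal_page() only has to
	 * reason about PFNMAP/MIXEDMAP, never about VM_IO directly.
	 */
	return remap_pfn_range(vma, vma->vm_start, pfn,
			       vma->vm_end - vma->vm_start, vma->vm_page_prot);
}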

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-18 10:47       ` Lorenzo Stoakes
@ 2025-07-18 11:06         ` David Hildenbrand
  2025-07-18 12:44           ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-18 11:06 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 18.07.25 12:47, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 10:14:33PM +0200, David Hildenbrand wrote:
>> On 17.07.25 22:03, Lorenzo Stoakes wrote:
>>> On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
>>>> Let's introduce vm_normal_page_pud(), which ends up being fairly simple
>>>> because of our new common helpers and there not being a PUD-sized zero
>>>> folio.
>>>>
>>>> Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
>>>> structuring the code like the other (pmd/pte) cases. Defer
>>>> introducing vm_normal_folio_pud() until really used.
>>>
>>> I mean fine :P but does anybody really use this?
>>
>> This is a unified PFN walker (!hugetlb + hugetlb), so you can easily run
>> into hugetlb PUDs, DAX PUDs and huge pfnmap (vfio) PUDs :)
> 
> Ahhh ok. I hate hugetlb so very very much.
> 
> Oscar is doing the Lord's work improving things but the trauma is still
> there... :P
> 
> Also yeah DAX ahem.
> 
> I'm not familiar with huge pfnmap PUDs, could you give me a hint on this? :>)

vmf_insert_pfn_pmd(), called from  drivers/vfio/pci/vfio_pci_core.c

Essentially, we create huge PUDs when mapping device BARs to user space.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 20:12     ` David Hildenbrand
@ 2025-07-18 12:35       ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 12:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 10:12:37PM +0200, David Hildenbrand wrote:
> > >
> > > -/*
> > > - * vm_normal_page -- This function gets the "struct page" associated with a pte.
> > > +/**
> > > + * vm_normal_page_pfn() - Get the "struct page" associated with a PFN in a
> > > + *			  non-special page table entry.
> >
> > This is a bit nebulous/confusing, I mean you'll get PTE entries with PTE special
> > bit that'll have a PFN but just no struct page/folio to look at, or should not
> > be touched.
> >
> > So the _pfn() bit doesn't really properly describe what it does.
> >
> > I wonder if it'd be better to just separate out the special handler, have
> > that return a boolean indicating special of either form, and then separate
> > other shared code separately from that?
>
> Let me think about that; I played with various approaches and this was the
> best I was come up with before running in circles.

Thanks

>
> >
> > > + * @vma: The VMA mapping the @pfn.
> > > + * @addr: The address where the @pfn is mapped.
> > > + * @pfn: The PFN.
> > > + * @entry: The page table entry value for error reporting purposes.
> > >    *
> > >    * "Special" mappings do not wish to be associated with a "struct page" (either
> > >    * it doesn't exist, or it exists but they don't want to touch it). In this
> > > @@ -603,10 +608,10 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> > >    * (such as GUP) can still identify these mappings and work with the
> > >    * underlying "struct page".
> > >    *
> > > - * There are 2 broad cases. Firstly, an architecture may define a pte_special()
> > > - * pte bit, in which case this function is trivial. Secondly, an architecture
> > > - * may not have a spare pte bit, which requires a more complicated scheme,
> > > - * described below.
> > > + * There are 2 broad cases. Firstly, an architecture may define a "special"
> > > + * page table entry bit (e.g., pte_special()), in which case this function is
> > > + * trivial. Secondly, an architecture may not have a spare page table
> > > + * entry bit, which requires a more complicated scheme, described below.
> >
> > Strikes me this bit of the comment should be with vm_normal_page(). As this
> > implies the 2 broad cases are handled here and this isn't the case.
>
> Well, pragmatism. Splitting up the doc doesn't make sense. Having it at
> vm_normal_page() doesn't make sense.
>
> I'm sure the educated reader will be able to make sense of it :P
>
> But I'm happy to hear suggestions on how to do it differently :)

Right yeah.

I feel like having separate 'special' handling for each case as separate
functions, each with their own specific explanation would work.

But I don't want to hold up the series _too_ much on this, generally I just
find the _pfn thing confusing.

I mean the implementation is a total pain anyway...

I feel like we could even have separate special handling functions like

#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL

/*
 * < description of pte special special page >
 *
 * If returns true, then pagep set to NULL or, if a page can be found, that
 * page.
 *
 */
static bool is_special_page(struct vm_area_struct *vma, unsigned long addr,
		pte_t pte, struct page **pagep)
{
	unsigned long pfn = pte_pfn(pte);

	if (likely(!pte_special(pte))) {
		if (pfn <= highest_memmap_pfn)
			return false;

		goto bad;
	}

	if (vma->vm_ops && vma->vm_ops->find_special_page) {
		*pagep = vma->vm_ops->find_special_page(vma, addr);
		return true;
	} else if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
		goto special;
	}

	if (is_zero_pfn(pfn))
		goto special;

	/* If we reach here something's gone wrong. */

bad:
	print_bad_pte(vma, addr, pte, NULL);
special:
	*pagep = NULL;
	return true;
}
#else
/*
 * < description for not-pte special special page >
 */
static bool is_special_page(struct vm_area_struct *vma, unsigned long addr,
		pte_t pte, struct page **pagep)
{
	unsigned long pfn = pte_pfn(pte);

	if (is_zero_pfn(pfn))
		goto special;

	if (vma->vm_flags & VM_MIXEDMAP) {
		if (!pfn_valid(pfn) || is_zero_pfn(pfn))
			goto special;
	} else if (vma->vm_flags & VM_PFNMAP) {
		unsigned long off;

		off = (addr - vma->vm_start) >> PAGE_SHIFT;
		if (pfn == vma->vm_pgoff + off)
			goto special;
		/* Hell's bells we allow CoW !arch_has_pte_special of PFN pages! help! */
		if (!is_cow_mapping(vma->vm_flags))
			goto special;
	}

	if (pfn > highest_memmap_pfn) {
		print_bad_pte(vma, addr, pte, NULL);
		goto special;
	}

	return false;
special:
	*pagep = NULL;
	return true;
}

#endif

And then obviously invoke as makes sense... This is rough and untested,
just to give a sense :>)
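
e.g. the caller could then collapse to something like this (same rough/untested
caveat as the sketches above, vm_normal_page() just for illustration):

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
		pte_t pte)
{
	struct page *page;

	/* Covers the pte_special() and !ARCH_HAS_PTE_SPECIAL cases above. */
	if (is_special_page(vma, addr, pte, &page))
		return page;

	/*
	 * NOTE! We still have PageReserved() pages in the page tables.
	 * eg. VDSO mappings can cause them to exist.
	 */
	return pfn_to_page(pte_pfn(pte));
}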

>
> >
> > >    *
> > >    * A raw VM_PFNMAP mapping (ie. one that is not COWed) is always considered a
> > >    * special mapping (even if there are underlying and valid "struct pages").
> > > @@ -639,15 +644,72 @@ static void print_bad_page_map(struct vm_area_struct *vma,
> > >    * don't have to follow the strict linearity rule of PFNMAP mappings in
> > >    * order to support COWable mappings.
> > >    *
> > > + * This function is not expected to be called for obviously special mappings:
> > > + * when the page table entry has the "special" bit set.
> >
> > Hmm this is is a bit weird though, saying "obviously" special, because you're
> > handling "special" mappings here, but only for architectures that don't specify
> > the PTE special bit.
> >
> > So it makes it quite nebulous what constitutes 'obviously' here, really you mean
> > pte_special().
>
> Yes, I can clarify that.

Thanks!

>
> >
> > > + *
> > > + * Return: Returns the "struct page" if this is a "normal" mapping. Returns
> > > + *	   NULL if this is a "special" mapping.
> > > + */
> > > +static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
> > > +		unsigned long addr, unsigned long pfn, unsigned long long entry)
> > > +{
> > > +	/*
> > > +	 * With CONFIG_ARCH_HAS_PTE_SPECIAL, any special page table mappings
> > > +	 * (incl. shared zero folios) are marked accordingly and are handled
> > > +	 * by the caller.
> > > +	 */
> > > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
> > > +		if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
> > > +			if (vma->vm_flags & VM_MIXEDMAP) {
> > > +				/* If it has a "struct page", it's "normal". */
> > > +				if (!pfn_valid(pfn))
> > > +					return NULL;
> > > +			} else {
> > > +				unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;
> > > +
> > > +				/* Only CoW'ed anon folios are "normal". */
> > > +				if (pfn == vma->vm_pgoff + off)
> > > +					return NULL;
> > > +				if (!is_cow_mapping(vma->vm_flags))
> > > +					return NULL;
> > > +			}
> > > +		}
> > > +
> > > +		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
> >
> > This handles zero/zero huge page handling for non-pte_special() case
> > only. I wonder if we even need to bother having these marked special
> > generally since you can just check the PFN every time anyway.
>
> Well, that makes (a) pte_special() a bit weird -- not set for some special
> pages and (b) requires additional runtime checks for the case we all really
> care about -- pte_special().
>
> So I don't think we should change that.

OK, best to be consistent in setting for special pages.

>
> [...]
>
> > >
> > > +/**
> > > + * vm_normal_folio() - Get the "struct folio" associated with a PTE
> > > + * @vma: The VMA mapping the @pte.
> > > + * @addr: The address where the @pte is mapped.
> > > + * @pte: The PTE.
> > > + *
> > > + * Get the "struct folio" associated with a PTE. See vm_normal_page_pfn()
> > > + * for details.
> > > + *
> > > + * Return: Returns the "struct folio" if this is a "normal" mapping. Returns
> > > + *	   NULL if this is a "special" mapping.
> > > + */
> >
> > Nice to add a comment, but again feels weird to have the whole explanation in
> > vm_normal_page_pfn() but then to invoke vm_normal_page()..
>
> You want people to do pointer chasing to find what they are looking for? :)

Yes.

Only joking :P

Hopefully the ideas mentioned above clarify things... a bit maybe... :>)

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-17 20:03       ` David Hildenbrand
@ 2025-07-18 12:43         ` Lorenzo Stoakes
  2025-07-30 12:54           ` David Hildenbrand
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 12:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 10:03:44PM +0200, David Hildenbrand wrote:
> On 17.07.25 21:55, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 08:51:51PM +0100, Lorenzo Stoakes wrote:
> > > > @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > >   		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> > > >   		return NULL;
> > > >   	}
> > > > -
> > > > -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> > > > -		if (vma->vm_flags & VM_MIXEDMAP) {
> > > > -			if (!pfn_valid(pfn))
> > > > -				return NULL;
> > > > -			goto out;
> > > > -		} else {
> > > > -			unsigned long off;
> > > > -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> > > > -			if (pfn == vma->vm_pgoff + off)
> > > > -				return NULL;
> > > > -			if (!is_cow_mapping(vma->vm_flags))
> > > > -				return NULL;
> > > > -		}
> > > > -	}
> > > > -
> > > > -	if (is_huge_zero_pfn(pfn))
> > > > -		return NULL;
> > > > -	if (unlikely(pfn > highest_memmap_pfn)) {
> > > > -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> > > > -		return NULL;
> > > > -	}
> > > > -
> > > > -	/*
> > > > -	 * NOTE! We still have PageReserved() pages in the page tables.
> > > > -	 * eg. VDSO mappings can cause them to exist.
> > > > -	 */
> > > > -out:
> > > > -	return pfn_to_page(pfn);
> > > > +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
> > >
> > > Hmm this seems broken, because you're now making these special on arches with
> > > pte_special() right? But then you're invoking the not-special function?
> > >
> > > Also for non-pte_special() arches you're kind of implying they _maybe_ could be
> > > special.
> >
> > OK sorry the diff caught me out here, you explicitly handle the pmd_special()
> > case here, duplicatively (yuck).
> >
> > Maybe you fix this up in a later patch :)
>
> I had that, but the conditions depend on the level, meaning: unnecessary
> checks for pte/pmd/pud level.
>
> I had a variant where I would pass "bool special" into vm_normal_page_pfn(),
> but I didn't like it.
>
> To optimize out, I would have to provide a "level" argument, and did not
> convince myself yet that that is a good idea at this point.

Yeah fair enough. That probably isn't worth it or might end up making things
even more ugly.

We must keep things within the realms of good taste...

See other mail for a suggestion... I think this is just an awkward function
whatever way round.

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-18 11:06         ` David Hildenbrand
@ 2025-07-18 12:44           ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 12:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Fri, Jul 18, 2025 at 01:06:30PM +0200, David Hildenbrand wrote:
> On 18.07.25 12:47, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 10:14:33PM +0200, David Hildenbrand wrote:
> > > On 17.07.25 22:03, Lorenzo Stoakes wrote:
> > > > On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
> > > > > Let's introduce vm_normal_page_pud(), which ends up being fairly simple
> > > > > because of our new common helpers and there not being a PUD-sized zero
> > > > > folio.
> > > > >
> > > > > Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
> > > > > structuring the code like the other (pmd/pte) cases. Defer
> > > > > introducing vm_normal_folio_pud() until really used.
> > > >
> > > > I mean fine :P but does anybody really use this?
> > >
> > > This is a unified PFN walker (!hugetlb + hugetlb), so you can easily run
> > > into hugetlb PUDs, DAX PUDs and huge pfnmap (vfio) PUDs :)
> >
> > Ahhh ok. I hate hugetlb so very very much.
> >
> > Oscar is doing the Lord's work improving things but the trauma is still
> > there... :P
> >
> > Also yeah DAX ahem.
> >
> > I'm not familiar with huge pfnmap PUDs, could you give me a hint on this? :>)
>
> vmf_insert_pfn_pmd(), called from  drivers/vfio/pci/vfio_pci_core.c
>
> Essentially, we create huge PUDs when mapping device BARs to user space.

Ah makes sense. Thanks!


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
  2025-07-18 11:04         ` David Hildenbrand
@ 2025-07-18 12:55           ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 12:55 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Fri, Jul 18, 2025 at 01:04:30PM +0200, David Hildenbrand wrote:
> >
> > Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
> > but I guess the intent is that the caller _must_ hold this lock.
> >
> > I know it's nitty and annoying (sorry!) but as asserting seems to not be a
> > possibility here, could we spell these out as a series of points like:
> >
> > /*
> >   * The caller MUST hold the following locks:
> >   *
> >   * - Leaf page table lock
> >   * - Appropriate VMA lock to keep VMA stable
> >   */
> >
> > I don't _actually_ think you need the rmap lock then, as none of the page tables
> > you access would be impacted by any rmap action afaict, with these locks held.
>
> I don't enjoy wrong comments ;)
>
> This can be called from rmap code when doing a vm_normal_page() while
> holding the PTL.

OK so this is just underlining the confusion here (not your fault! I mean in
general with page table code, really).

Would this possibly work better then?:

/*
 * The caller MUST prevent page table teardown (by holding mmap, vma or rmap lock)
 * and MUST hold the leaf page table lock.
 */

>
> Really, I think we are over-thinking a helper that is triggered in specific
> context when the world is about to collide.

Indeed, but I think it's important to be clear as to what is required for this
to work (even though, as I understand it, we're already in trouble and if page
tables are corrupted or something like this we may even NULL ptr deref anyway so
it's best effort).

>
> This is not your general-purpose API.
>
> Maybe I should have never added a comment. Maybe I should just not have done
> this patch, because I really don't want to do more than the bare minimum to
> print_bad_page_map().

No no David no :P come back to the light sir...

This is a good patch - I don't mean to dissuade you, I just want to make things
clear in a confusing bit of the kernel, as we in mm as usual make life hard for
ourselves...

I think the locking comment _is_ important, as for far too long in mm we've had
_implicit_ understanding of where the locks should be at any given time, which
constantly blows up underneath us.

I'd rather keep things as clear as possible so even the intellectually
challenged such as myself are subject to less confusion.

Anyway, TL;DR: if you do something similar to the suggestion above it's all good.
Just needs clarification.

>
> Because I deeply detest it, and no comments we will add will change that.

So do I! *clinks glass* but yes, it's horrid. But there's value in improving the
quality of horrid code and refactoring to the least-worst version.

And we can attack it in a more serious way later.

[snip]

> > > > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > > > +		unsigned long addr, unsigned long long entry, struct page *page)
> > > > > +{
> > > > > +	struct address_space *mapping;
> > > > > +	pgoff_t index;
> > > > > +
> > > > > +	if (is_bad_page_map_ratelimited())
> > > > > +		return;
> > > > >
> > > > >    	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > > > >    	index = linear_page_index(vma, addr);
> > > > >
> > > > > -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> > > > > -		 current->comm,
> > > > > -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > > > +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> > > >
> > > > Sort of wonder if this is even useful if you don't know what the 'entry'
> > > > is? But I guess the dump below will tell you.
> > >
> > > You probably missed in the patch description:
> > >
> > > "Whether it is a PTE or something else will usually become obvious from the
> > > page table dump or from the dumped stack. If ever required in the future, we
> > > could pass the entry level type similar to "enum rmap_level". For now, let's
> > > keep it simple."
> >
> > Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
> > is fine then.
>
> Let me play with indicating the page table level, but it's the kind of stuff
> I wouldn't want to do in this series here.

Sure understood. I don't mean to hold this up with nits. Am happy to be
flexible, just thinking out loud generally.

>
> > >
> > > >
> > > > Then we have VM_IO, which strictly must not have an associated page right?
> > >
> > > VM_IO just means read/write side-effects, I think you could have ones with
> > > an memmap easily ... e.g., memory section (128MiB) spanning both memory and
> > > MMIO regions.
> >
> > Hmm, but why not have two separate VMAs? I guess I need to look into more
> > what this flag actually effects.
>
> Oh, I meant, that we might have a "struct page" for MMIO memory (pfn_valid()
> == true).
>
> In a MIXEDMAP that will get refcounted. Not sure if there are users that use
> VM_IO in a MIXEDMAP, I would assume so but didn't check.
>
> So VM_IO doesn't really interact with vm_normal_page(), really. It's all
> about PFNMAP and MIXEDMAP.

Thanks, yeah am thinking more broadly about VM_SPECIAL here. May go off and do
some work on that... VM_UNMERGEABLE might be better.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-18 10:54         ` David Hildenbrand
@ 2025-07-18 13:06           ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-18 13:06 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Fri, Jul 18, 2025 at 12:54:38PM +0200, David Hildenbrand wrote:
> On 18.07.25 12:41, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 10:31:28PM +0200, David Hildenbrand wrote:
> > > On 17.07.25 20:29, Lorenzo Stoakes wrote:
> > > > On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
> > > > > The huge zero folio is refcounted (+mapcounted -- is that a word?)
> > > > > differently than "normal" folios, similarly (but different) to the ordinary
> > > > > shared zeropage.
> > > >
> > > > Yeah, I sort of wonder if we shouldn't just _not_ do any of that with zero
> > > > pages?
> > >
> > > I wish we could get rid of the weird refcounting of the huge zero folio and
> > > get rid of the shrinker. But as long as the shrinker exists, I'm afraid that
> > > weird per-process refcounting must stay.
> >
> > Does this change of yours cause any issue with it? I mean now nothing can grab
> > this page using vm_normal_page_pmd(), so it won't be able to manipulate
> > refcounts.
>
> Please look again at vm_normal_page_pmd(): we have a manual huge_zero_pfn()
> check in there! There is no change in behavior. :)
>
> It's not obvious from the diff below. But huge zero folio was considered
> special before this change, just not marked accordingly.

Yeah I think the delta screwed me up here.

Previously:

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	...
	if (unlikely(pmd_special(pmd)))
		return NULL;
	...
	if (is_huge_zero_pmd(pmd))
		return NULL;
	...
}

Now:

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	...
	if (unlikely(pmd_special(pmd))) {
		...
		if (is_huge_zero_pfn(pfn))
			return NULL;
		...
	}
	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
}

And:

static inline struct page *vm_normal_page_pfn(struct vm_area_struct *vma,
		unsigned long addr, unsigned long pfn, unsigned long long entry)
{
	...
	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		...
		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
			return NULL;
	}
	...
}

So it is equivalent, thanks, satisfied with that now!

Sorry for being a pain, just keen to really be confident in this.

>
> > >
> > > >
> > > > >
> > > > > For this reason, we special-case these pages in
> > > > > vm_normal_page*/vm_normal_folio*, and only allow selected callers to
> > > > > still use them (e.g., GUP can still take a reference on them).
> > > > >
> > > > > vm_normal_page_pmd() already filters out the huge zero folio. However,
> > > > > so far we are not marking it as special like we do with the ordinary
> > > > > shared zeropage. Let's mark it as special, so we can further refactor
> > > > > vm_normal_page_pmd() and vm_normal_page().
> > > > >
> > > > > While at it, update the doc regarding the shared zero folios.
> > > >
> > > > Hmm I wonder how this will interact with the static PMD series at [0]?
> > >
> > > No, it shouldn't.
> >
> > I'm always nervous about these kinds of things :)
> >
> > I'm assuming the reference/map counting will still work properly with the static
> > page?
>
> Let me stress again: no change in behavior besides setting the special flag
> in this patch. Return value of vm_normal_page_pmd() is not changed.

OK yeah I think this is the issue here then - I had assumed that it did
_somehow_ modify behaviour.

See above - based on your responses I've satisfied myself it's all good, thank
you!

>
> > > >
> > > > Also, that series was (though I reviewed against it) moving stuff that
> > > > references the huge zero folio out of there, but also generally allows
> > > > access and mapping of this folio via largest_zero_folio() so not only via
> > > > insert_pmd().
> > > >
> > > > So we're going to end up with mappings of this that are not marked special
> > > > that are potentially going to have refcount/mapcount manipulation that
> > > > contradict what you're doing here perhaps?
> > >
> > > I don't think so. It's just like having the existing static (small) shared
> > > zeropage where the same rules about refcounting+mapcounting apply.
> >
> > It feels like all this is a mess... am I missing something that would make it
> > all make sense?
>
> Let me clarify:
>
> The small zeropage is never refcounted+mapcounted when mapped into page
> tables.
>
> The huge zero folio is never refcounted+mapcounted when mapped into page
> tables EXCEPT it is refcounted in a weird different when first mapped into a
> process.

Thanks that's great!

>
> The whole reason is the shrinker. I don't like it, but there was a reason it
> was added. Maybe that reason no longer exists, but that's nothing that this
> patch series is concerned with, really. :)

I hate that so much but yes out of scope.

>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd()
  2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
  2025-07-17 15:34   ` Lorenzo Stoakes
@ 2025-07-25  2:47   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-25  2:47 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple

On Thu, Jul 17, 2025 at 01:52:04PM +0200, David Hildenbrand wrote:
>Let's clean it all further up.
>
>No functional change intended.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Reviewed-by: Alistair Popple <apopple@nvidia.com>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud()
  2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
  2025-07-17 15:42   ` Lorenzo Stoakes
@ 2025-07-25  2:56   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-25  2:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, Alistair Popple

On Thu, Jul 17, 2025 at 01:52:05PM +0200, David Hildenbrand wrote:
>Let's clean it all further up.
>
>No functional change intended.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Reviewed-by: Alistair Popple <apopple@nvidia.com>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd()
  2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
  2025-07-17 15:47   ` Lorenzo Stoakes
@ 2025-07-25  8:07   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-25  8:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:06PM +0200, David Hildenbrand wrote:
>Just like we do for vmf_insert_page_mkwrite() -> ... ->
>insert_page_into_pte_locked() with the shared zeropage, support the
>huge zero folio in vmf_insert_folio_pmd().
>
>When (un)mapping the huge zero folio in page tables, we neither
>adjust the refcount nor the mapcount, just like for the shared zeropage.
>
>For now, the huge zero folio is not marked as special yet, although
>vm_normal_page_pmd() really wants to treat it as special. We'll change
>that next.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Signed-off-by: David Hildenbrand <david@redhat.com>
>---
> mm/huge_memory.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 849feacaf8064..db08c37b87077 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -1429,9 +1429,11 @@ static vm_fault_t insert_pmd(struct vm_area_struct *vma, unsigned long addr,
> 	if (fop.is_folio) {
> 		entry = folio_mk_pmd(fop.folio, vma->vm_page_prot);
> 
>-		folio_get(fop.folio);
>-		folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>-		add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>+		if (!is_huge_zero_folio(fop.folio)) {
>+			folio_get(fop.folio);
>+			folio_add_file_rmap_pmd(fop.folio, &fop.folio->page, vma);
>+			add_mm_counter(mm, mm_counter_file(fop.folio), HPAGE_PMD_NR);
>+		}

I think this is reasonable.

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

> 	} else {
> 		entry = pmd_mkhuge(pfn_pmd(fop.pfn, prot));
> 		entry = pmd_mkspecial(entry);
>-- 
>2.50.1
>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special
  2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
  2025-07-17 18:29   ` Lorenzo Stoakes
@ 2025-07-28  8:49   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-28  8:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:08PM +0200, David Hildenbrand wrote:
>The huge zero folio is refcounted (+mapcounted -- is that a word?)
>differently than "normal" folios, similarly (but different) to the ordinary
>shared zeropage.
>
>For this reason, we special-case these pages in
>vm_normal_page*/vm_normal_folio*, and only allow selected callers to
>still use them (e.g., GUP can still take a reference on them).
>
>vm_normal_page_pmd() already filters out the huge zero folio. However,
>so far we are not marking it as special like we do with the ordinary
>shared zeropage. Let's mark it as special, so we can further refactor
>vm_normal_page_pmd() and vm_normal_page().
>
>While at it, update the doc regarding the shared zero folios.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud()
  2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
  2025-07-17 20:03   ` Lorenzo Stoakes
@ 2025-07-29  7:52   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-29  7:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang

On Thu, Jul 17, 2025 at 01:52:11PM +0200, David Hildenbrand wrote:
>Let's introduce vm_normal_page_pud(), which ends up being fairly simple
>because of our new common helpers and there not being a PUD-sized zero
>folio.
>
>Use vm_normal_page_pud() in folio_walk_start() to resolve a TODO,
>structuring the code like the other (pmd/pte) cases. Defer
>introducing vm_normal_folio_pud() until really used.
>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page()
  2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
  2025-07-17 20:07   ` Lorenzo Stoakes
@ 2025-07-29  7:53   ` Wei Yang
  1 sibling, 0 replies; 48+ messages in thread
From: Wei Yang @ 2025-07-29  7:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Zi Yan, Baolin Wang, Nico Pache,
	Ryan Roberts, Dev Jain, Barry Song, Jann Horn, Pedro Falcato,
	Hugh Dickins, Oscar Salvador, Lance Yang, David Vrabel

On Thu, Jul 17, 2025 at 01:52:12PM +0200, David Hildenbrand wrote:
>... and hide it behind a kconfig option. There is really no need for
>any !xen code to perform this check.
>
>The naming is a bit off: we want to find the "normal" page when a PTE
>was marked "special". So it's really not "finding a special" page.
>
>Improve the documentation, and add a comment in the code where XEN ends
>up performing the pte_mkspecial() through a hypercall. More details can
>be found in commit 923b2919e2c3 ("xen/gntdev: mark userspace PTEs as
>special on x86 PV guests").
>
>Cc: David Vrabel <david.vrabel@citrix.com>
>Reviewed-by: Oscar Salvador <osalvador@suse.de>
>Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Wei Yang <richard.weiyang@gmail.com>

-- 
Wei Yang
Help you, Help me


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-18 12:43         ` Lorenzo Stoakes
@ 2025-07-30 12:54           ` David Hildenbrand
  2025-07-30 13:24             ` Lorenzo Stoakes
  0 siblings, 1 reply; 48+ messages in thread
From: David Hildenbrand @ 2025-07-30 12:54 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On 18.07.25 14:43, Lorenzo Stoakes wrote:
> On Thu, Jul 17, 2025 at 10:03:44PM +0200, David Hildenbrand wrote:
>> On 17.07.25 21:55, Lorenzo Stoakes wrote:
>>> On Thu, Jul 17, 2025 at 08:51:51PM +0100, Lorenzo Stoakes wrote:
>>>>> @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>>>>    		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>>>>>    		return NULL;
>>>>>    	}
>>>>> -
>>>>> -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>>>> -		if (vma->vm_flags & VM_MIXEDMAP) {
>>>>> -			if (!pfn_valid(pfn))
>>>>> -				return NULL;
>>>>> -			goto out;
>>>>> -		} else {
>>>>> -			unsigned long off;
>>>>> -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
>>>>> -			if (pfn == vma->vm_pgoff + off)
>>>>> -				return NULL;
>>>>> -			if (!is_cow_mapping(vma->vm_flags))
>>>>> -				return NULL;
>>>>> -		}
>>>>> -	}
>>>>> -
>>>>> -	if (is_huge_zero_pfn(pfn))
>>>>> -		return NULL;
>>>>> -	if (unlikely(pfn > highest_memmap_pfn)) {
>>>>> -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
>>>>> -		return NULL;
>>>>> -	}
>>>>> -
>>>>> -	/*
>>>>> -	 * NOTE! We still have PageReserved() pages in the page tables.
>>>>> -	 * eg. VDSO mappings can cause them to exist.
>>>>> -	 */
>>>>> -out:
>>>>> -	return pfn_to_page(pfn);
>>>>> +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
>>>>
>>>> Hmm this seems broken, because you're now making these special on arches with
>>>> pte_special() right? But then you're invoking the not-special function?
>>>>
>>>> Also for non-pte_special() arches you're kind of implying they _maybe_ could be
>>>> special.
>>>
>>> OK sorry the diff caught me out here, you explicitly handle the pmd_special()
>>> case here, duplicatively (yuck).
>>>
>>> Maybe you fix this up in a later patch :)
>>
>> I had that, but the conditions depend on the level, meaning: unnecessary
>> checks for pte/pmd/pud level.
>>
>> I had a variant where I would pass "bool special" into vm_normal_page_pfn(),
>> but I didn't like it.
>>
>> To optimize out, I would have to provide a "level" argument, and did not
>> convince myself yet that that is a good idea at this point.
> 
> Yeah fair enough. That probably isn't worth it or might end up making things
> even more ugly.

So, I decided to add the level arguments, but not use them to optimize the checks,
only to forward them to the new print_bad_pte().

So the new helper will be

/**
   * __vm_normal_page() - Get the "struct page" associated with a page table entry.
   * @vma: The VMA mapping the page table entry.
   * @addr: The address where the page table entry is mapped.
   * @pfn: The PFN stored in the page table entry.
   * @special: Whether the page table entry is marked "special".
   * @level: The page table level for error reporting purposes only.
   * @entry: The page table entry value for error reporting purposes only.
...
   */
static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
                unsigned long addr, unsigned long pfn, bool special,
                unsigned long long entry, enum pgtable_level level)
...


And vm_normal_page() will, for example, be

/**
  * vm_normal_page() - Get the "struct page" associated with a PTE
  * @vma: The VMA mapping the @pte.
  * @addr: The address where the @pte is mapped.
  * @pte: The PTE.
  *
  * Get the "struct page" associated with a PTE. See __vm_normal_page()
  * for details on "normal" and "special" mappings.
  *
  * Return: Returns the "struct page" if this is a "normal" mapping. Returns
  *        NULL if this is a "special" mapping.
  */
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
                             pte_t pte)
{
        return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte),
                                pte_val(pte), PGTABLE_LEVEL_PTE);
}
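
And, by analogy, the PMD variant would presumably reduce to something like
(sketch only, I might still tweak the details):

struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
				pmd_val(pmd), PGTABLE_LEVEL_PMD);
}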



-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*()
  2025-07-30 12:54           ` David Hildenbrand
@ 2025-07-30 13:24             ` Lorenzo Stoakes
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Stoakes @ 2025-07-30 13:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, xen-devel, linux-fsdevel, nvdimm,
	Andrew Morton, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Dan Williams, Matthew Wilcox, Jan Kara,
	Alexander Viro, Christian Brauner, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Zi Yan, Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain,
	Barry Song, Jann Horn, Pedro Falcato, Hugh Dickins,
	Oscar Salvador, Lance Yang

On Wed, Jul 30, 2025 at 02:54:46PM +0200, David Hildenbrand wrote:
> On 18.07.25 14:43, Lorenzo Stoakes wrote:
> > On Thu, Jul 17, 2025 at 10:03:44PM +0200, David Hildenbrand wrote:
> > > On 17.07.25 21:55, Lorenzo Stoakes wrote:
> > > > On Thu, Jul 17, 2025 at 08:51:51PM +0100, Lorenzo Stoakes wrote:
> > > > > > @@ -721,37 +772,21 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > > > > >    		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> > > > > >    		return NULL;
> > > > > >    	}
> > > > > > -
> > > > > > -	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
> > > > > > -		if (vma->vm_flags & VM_MIXEDMAP) {
> > > > > > -			if (!pfn_valid(pfn))
> > > > > > -				return NULL;
> > > > > > -			goto out;
> > > > > > -		} else {
> > > > > > -			unsigned long off;
> > > > > > -			off = (addr - vma->vm_start) >> PAGE_SHIFT;
> > > > > > -			if (pfn == vma->vm_pgoff + off)
> > > > > > -				return NULL;
> > > > > > -			if (!is_cow_mapping(vma->vm_flags))
> > > > > > -				return NULL;
> > > > > > -		}
> > > > > > -	}
> > > > > > -
> > > > > > -	if (is_huge_zero_pfn(pfn))
> > > > > > -		return NULL;
> > > > > > -	if (unlikely(pfn > highest_memmap_pfn)) {
> > > > > > -		print_bad_page_map(vma, addr, pmd_val(pmd), NULL);
> > > > > > -		return NULL;
> > > > > > -	}
> > > > > > -
> > > > > > -	/*
> > > > > > -	 * NOTE! We still have PageReserved() pages in the page tables.
> > > > > > -	 * eg. VDSO mappings can cause them to exist.
> > > > > > -	 */
> > > > > > -out:
> > > > > > -	return pfn_to_page(pfn);
> > > > > > +	return vm_normal_page_pfn(vma, addr, pfn, pmd_val(pmd));
> > > > >
> > > > > Hmm this seems broken, because you're now making these special on arches with
> > > > > pte_special() right? But then you're invoking the not-special function?
> > > > >
> > > > > Also for non-pte_special() arches you're kind of implying they _maybe_ could be
> > > > > special.
> > > >
> > > > OK sorry the diff caught me out here, you explicitly handle the pmd_special()
> > > > case here, duplicatively (yuck).
> > > >
> > > > Maybe you fix this up in a later patch :)
> > >
> > > I had that, but the conditions depend on the level, meaning: unnecessary
> > > checks for pte/pmd/pud level.
> > >
> > > I had a variant where I would pass "bool special" into vm_normal_page_pfn(),
> > > but I didn't like it.
> > >
> > > To optimize out, I would have to provide a "level" argument, and did not
> > > convince myself yet that that is a good idea at this point.
> >
> > Yeah fair enough. That probably isn't worth it or might end up making things
> > even more ugly.
>
> So, I decided to add the level arguments, but not use them to optimize the checks,
> only to forward it to the new print_bad_pte().
>
> So the new helper will be
>
> /**
>   * __vm_normal_page() - Get the "struct page" associated with a page table entry.
>   * @vma: The VMA mapping the page table entry.
>   * @addr: The address where the page table entry is mapped.
>   * @pfn: The PFN stored in the page table entry.
>   * @special: Whether the page table entry is marked "special".
>   * @level: The page table level for error reporting purposes only.
>   * @entry: The page table entry value for error reporting purposes only.
> ...
>   */
> static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
>                unsigned long addr, unsigned long pfn, bool special,
>                unsigned long long entry, enum pgtable_level level)
> ...
>
>
> And vm_nomal_page() will for example be
>
> /**
>  * vm_normal_page() - Get the "struct page" associated with a PTE
>  * @vma: The VMA mapping the @pte.
>  * @addr: The address where the @pte is mapped.
>  * @pte: The PTE.
>  *
>  * Get the "struct page" associated with a PTE. See __vm_normal_page()
>  * for details on "normal" and "special" mappings.
>  *
>  * Return: Returns the "struct page" if this is a "normal" mapping. Returns
>  *        NULL if this is a "special" mapping.
>  */
> struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>                             pte_t pte)
> {
>        return __vm_normal_page(vma, addr, pte_pfn(pte), pte_special(pte),
>                                pte_val(pte), PGTABLE_LEVEL_PTE);
> }
>

OK, that could actually work out well. Cool, thank you!


^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2025-07-30 13:25 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-17 11:52 [PATCH v2 0/9] mm: vm_normal_page*() improvements David Hildenbrand
2025-07-17 11:52 ` [PATCH v2 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-07-17 15:34   ` Lorenzo Stoakes
2025-07-25  2:47   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-07-17 15:42   ` Lorenzo Stoakes
2025-07-25  2:56   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-07-17 15:47   ` Lorenzo Stoakes
2025-07-25  8:07   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-07-17 18:09   ` Lorenzo Stoakes
2025-07-17 11:52 ` [PATCH v2 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-07-17 18:29   ` Lorenzo Stoakes
2025-07-17 20:31     ` David Hildenbrand
2025-07-18 10:41       ` Lorenzo Stoakes
2025-07-18 10:54         ` David Hildenbrand
2025-07-18 13:06           ` Lorenzo Stoakes
2025-07-28  8:49   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-07-17 19:17   ` Lorenzo Stoakes
2025-07-17 20:03     ` David Hildenbrand
2025-07-18 10:15       ` Lorenzo Stoakes
2025-07-18 11:04         ` David Hildenbrand
2025-07-18 12:55           ` Lorenzo Stoakes
2025-07-17 22:06   ` Demi Marie Obenour
2025-07-18  7:44     ` David Hildenbrand
2025-07-18  7:59       ` Demi Marie Obenour
2025-07-18  8:26         ` David Hildenbrand
2025-07-17 11:52 ` [PATCH v2 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-07-17 19:51   ` Lorenzo Stoakes
2025-07-17 19:55     ` Lorenzo Stoakes
2025-07-17 20:03       ` David Hildenbrand
2025-07-18 12:43         ` Lorenzo Stoakes
2025-07-30 12:54           ` David Hildenbrand
2025-07-30 13:24             ` Lorenzo Stoakes
2025-07-17 20:12     ` David Hildenbrand
2025-07-18 12:35       ` Lorenzo Stoakes
2025-07-17 11:52 ` [PATCH v2 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-07-17 20:03   ` Lorenzo Stoakes
2025-07-17 20:14     ` David Hildenbrand
2025-07-18 10:47       ` Lorenzo Stoakes
2025-07-18 11:06         ` David Hildenbrand
2025-07-18 12:44           ` Lorenzo Stoakes
2025-07-29  7:52   ` Wei Yang
2025-07-17 11:52 ` [PATCH v2 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-07-17 20:07   ` Lorenzo Stoakes
2025-07-29  7:53   ` Wei Yang
