linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
@ 2025-06-04 14:19 Lorenzo Stoakes
  2025-06-04 14:39 ` Suren Baghdasaryan
                   ` (5 more replies)
  0 siblings, 6 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-06-04 14:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Barry Song, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

The walk_page_range_novma() function is rather confusing - it supports two
modes, one used often, the other used only for debugging.

The first mode is the common case of traversal of kernel page tables, which
is what nearly all callers use this for.

Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that memory
which is not described by a VMA.

It is far from certain that such page tables should even exist, but perhaps
this is precisely why it is useful as a debugging mechanism.

As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.

Doing this also provide self-documentation as to the intent of the caller -
are they doing something rather unusual or are they simply doing a standard
kernel page table walk?

We therefore establish two separate functions - walk_page_range_debug() for
this single usage, and walk_kernel_page_table_range() for general kernel
page table walking.

We additionally make walk_page_range_debug() internal to mm.

Note that ptdump uses the precise same function for kernel walking as a
convenience, so we permit this but make it very explicit by having
walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v2:
* Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
* Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
* Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
  per David.

v1 resend:
* Actually cc'd lists...
* Fixed mistake in walk_page_range_novma() not handling kernel mappings and
  update commit message to referene.
* Added Mike's off-list Acked-by.
* Fixed up comments as per Mike.
* Add some historic flavour to the commit message as per Mike.
https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/

v1:
(accidentally sent off-list due to error in scripting)

 arch/loongarch/mm/pageattr.c |  2 +-
 arch/openrisc/kernel/dma.c   |  4 +-
 arch/riscv/mm/pageattr.c     |  8 +--
 include/linux/pagewalk.h     |  7 ++-
 mm/hugetlb_vmemmap.c         |  2 +-
 mm/internal.h                |  4 ++
 mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
 mm/ptdump.c                  |  3 +-
 8 files changed, 82 insertions(+), 46 deletions(-)

diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index 99165903908a..f5e910b68229 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
 		return 0;

 	mmap_write_lock(&init_mm);
-	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
+	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
 	mmap_write_unlock(&init_mm);

 	flush_tlb_kernel_range(start, end);
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 3a7b5baaa450..af932a4ad306 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
 	 * them and setting the cache-inhibit bit.
 	 */
 	mmap_write_lock(&init_mm);
-	error = walk_page_range_novma(&init_mm, va, va + size,
+	error = walk_kernel_page_table_range(va, va + size,
 			&set_nocache_walk_ops, NULL, NULL);
 	mmap_write_unlock(&init_mm);

@@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)

 	mmap_write_lock(&init_mm);
 	/* walk_page_range shouldn't be able to fail here */
-	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
+	WARN_ON(walk_kernel_page_table_range(va, va + size,
 			&clear_nocache_walk_ops, NULL, NULL));
 	mmap_write_unlock(&init_mm);
 }
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index d815448758a1..3f76db3d2769 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 			if (ret)
 				goto unlock;

-			ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+			ret = walk_kernel_page_table_range(lm_start, lm_end,
 						    &pageattr_ops, NULL, &masks);
 			if (ret)
 				goto unlock;
@@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 		if (ret)
 			goto unlock;

-		ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+		ret = walk_kernel_page_table_range(lm_start, lm_end,
 					    &pageattr_ops, NULL, &masks);
 		if (ret)
 			goto unlock;
 	}

-	ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
+	ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
 				     &masks);

 unlock:
@@ -335,7 +335,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 	 */
 	flush_tlb_all();
 #else
-	ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
+	ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
 				     &masks);

 	mmap_write_unlock(&init_mm);
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 9700a29f8afb..8ac2f6d6d2a3 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -129,10 +129,9 @@ struct mm_walk {
 int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
-int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
-			  unsigned long end, const struct mm_walk_ops *ops,
-			  pgd_t *pgd,
-			  void *private);
+int walk_kernel_page_table_range(unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		pgd_t *pgd, void *private);
 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, const struct mm_walk_ops *ops,
 			void *private);
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 27245e86df25..ba0fb1b6a5a8 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -166,7 +166,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 	VM_BUG_ON(!PAGE_ALIGNED(start | end));

 	mmap_read_lock(&init_mm);
-	ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
+	ret = walk_kernel_page_table_range(start, end, &vmemmap_remap_ops,
 				    NULL, walk);
 	mmap_read_unlock(&init_mm);
 	if (ret)
diff --git a/mm/internal.h b/mm/internal.h
index 6b8ed2017743..43788d0de6e3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1605,6 +1605,10 @@ static inline void accept_page(struct page *page)
 int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
+			  unsigned long end, const struct mm_walk_ops *ops,
+			  pgd_t *pgd,
+			  void *private);

 /* pt_reclaim.c */
 bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e478777c86e1..057a125c3bc0 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 	return walk_page_range_mm(mm, start, end, ops, private);
 }

+static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		pgd_t *pgd, void *private)
+{
+	struct mm_walk walk = {
+		.ops		= ops,
+		.mm		= mm,
+		.pgd		= pgd,
+		.private	= private,
+		.no_vma		= true
+	};
+
+	if (start >= end || !walk.mm)
+		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
+	return walk_pgd_range(start, end, &walk);
+}
+
 /**
- * walk_page_range_novma - walk a range of pagetables not backed by a vma
- * @mm:		mm_struct representing the target process of page table walk
+ * walk_kernel_page_table_range - walk a range of kernel pagetables.
  * @start:	start address of the virtual address range
  * @end:	end address of the virtual address range
  * @ops:	operation to call during the walk
@@ -596,56 +615,69 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
  * Similar to walk_page_range() but can walk any page tables even if they are
  * not backed by VMAs. Because 'unusual' entries may be walked this function
  * will also not lock the PTEs for the pte_entry() callback. This is useful for
- * walking the kernel pages tables or page tables for firmware.
+ * walking kernel pages tables or page tables for firmware.
  *
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
  * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
  * from being freed (e.g. memory hot-remove).
  */
-int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+int walk_kernel_page_table_range(unsigned long start, unsigned long end,
+		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
+{
+	struct mm_struct *mm = &init_mm;
+
+	/*
+	 * Kernel intermediate page tables are usually not freed, so the mmap
+	 * read lock is sufficient. But there are some exceptions.
+	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
+	 * to prevent the intermediate kernel pages tables belonging to the
+	 * specified address range from being freed. The caller should take
+	 * other actions to prevent this race.
+	 */
+	mmap_assert_locked(mm);
+
+	return __walk_page_range_novma(mm, start, end, ops, pgd, private);
+}
+
+/**
+ * walk_page_range_debug - walk a range of pagetables not backed by a vma
+ * @mm:		mm_struct representing the target process of page table walk
+ * @start:	start address of the virtual address range
+ * @end:	end address of the virtual address range
+ * @ops:	operation to call during the walk
+ * @pgd:	pgd to walk if different from mm->pgd
+ * @private:	private data for callbacks' usage
+ *
+ * Similar to walk_page_range() but can walk any page tables even if they are
+ * not backed by VMAs. Because 'unusual' entries may be walked this function
+ * will also not lock the PTEs for the pte_entry() callback.
+ *
+ * This is for debugging purposes ONLY.
+ */
+int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd,
 			  void *private)
 {
-	struct mm_walk walk = {
-		.ops		= ops,
-		.mm		= mm,
-		.pgd		= pgd,
-		.private	= private,
-		.no_vma		= true
-	};
-
-	if (start >= end || !walk.mm)
-		return -EINVAL;
-	if (!check_ops_valid(ops))
-		return -EINVAL;
+	/*
+	 * For convenience, we allow this function to also traverse kernel
+	 * mappings.
+	 */
+	if (mm == &init_mm)
+		return walk_kernel_page_table_range(start, end, ops, pgd, private);

 	/*
-	 * 1) For walking the user virtual address space:
-	 *
 	 * The mmap lock protects the page walker from changes to the page
 	 * tables during the walk.  However a read lock is insufficient to
 	 * protect those areas which don't have a VMA as munmap() detaches
 	 * the VMAs before downgrading to a read lock and actually tearing
 	 * down PTEs/page tables. In which case, the mmap write lock should
-	 * be hold.
-	 *
-	 * 2) For walking the kernel virtual address space:
-	 *
-	 * The kernel intermediate page tables usually do not be freed, so
-	 * the mmap map read lock is sufficient. But there are some exceptions.
-	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
-	 * to prevent the intermediate kernel pages tables belonging to the
-	 * specified address range from being freed. The caller should take
-	 * other actions to prevent this race.
+	 * be held.
 	 */
-	if (mm == &init_mm)
-		mmap_assert_locked(walk.mm);
-	else
-		mmap_assert_write_locked(walk.mm);
+	mmap_assert_write_locked(mm);

-	return walk_pgd_range(start, end, &walk);
+	return __walk_page_range_novma(mm, start, end, ops, pgd, private);
 }

 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
diff --git a/mm/ptdump.c b/mm/ptdump.c
index 9374f29cdc6f..61a352aa12ed 100644
--- a/mm/ptdump.c
+++ b/mm/ptdump.c
@@ -4,6 +4,7 @@
 #include <linux/debugfs.h>
 #include <linux/ptdump.h>
 #include <linux/kasan.h>
+#include "internal.h"

 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
 /*
@@ -177,7 +178,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)

 	mmap_write_lock(mm);
 	while (range->start != range->end) {
-		walk_page_range_novma(mm, range->start, range->end,
+		walk_page_range_debug(mm, range->start, range->end,
 				      &ptdump_ops, pgd, st);
 		range++;
 	}
--
2.49.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
@ 2025-06-04 14:39 ` Suren Baghdasaryan
  2025-06-05  4:34 ` Oscar Salvador
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Suren Baghdasaryan @ 2025-06-04 14:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Barry Song, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 4, 2025 at 7:21 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
>
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
>
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
>
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
>
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
>
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
>
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
>
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
>
> We additionally make walk_page_range_debug() internal to mm.
>
> Note that ptdump uses the precise same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Suren Baghdasaryan <surenb@google.com>

> ---
> v2:
> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>   per David.
>
> v1 resend:
> * Actually cc'd lists...
> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>   update commit message to referene.
> * Added Mike's off-list Acked-by.
> * Fixed up comments as per Mike.
> * Add some historic flavour to the commit message as per Mike.
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>
> v1:
> (accidentally sent off-list due to error in scripting)
>
>  arch/loongarch/mm/pageattr.c |  2 +-
>  arch/openrisc/kernel/dma.c   |  4 +-
>  arch/riscv/mm/pageattr.c     |  8 +--
>  include/linux/pagewalk.h     |  7 ++-
>  mm/hugetlb_vmemmap.c         |  2 +-
>  mm/internal.h                |  4 ++
>  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>  mm/ptdump.c                  |  3 +-
>  8 files changed, 82 insertions(+), 46 deletions(-)
>
> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> index 99165903908a..f5e910b68229 100644
> --- a/arch/loongarch/mm/pageattr.c
> +++ b/arch/loongarch/mm/pageattr.c
> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>                 return 0;
>
>         mmap_write_lock(&init_mm);
> -       ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> +       ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>         mmap_write_unlock(&init_mm);
>
>         flush_tlb_kernel_range(start, end);
> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index 3a7b5baaa450..af932a4ad306 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>          * them and setting the cache-inhibit bit.
>          */
>         mmap_write_lock(&init_mm);
> -       error = walk_page_range_novma(&init_mm, va, va + size,
> +       error = walk_kernel_page_table_range(va, va + size,
>                         &set_nocache_walk_ops, NULL, NULL);
>         mmap_write_unlock(&init_mm);
>
> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>
>         mmap_write_lock(&init_mm);
>         /* walk_page_range shouldn't be able to fail here */
> -       WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> +       WARN_ON(walk_kernel_page_table_range(va, va + size,
>                         &clear_nocache_walk_ops, NULL, NULL));
>         mmap_write_unlock(&init_mm);
>  }
> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> index d815448758a1..3f76db3d2769 100644
> --- a/arch/riscv/mm/pageattr.c
> +++ b/arch/riscv/mm/pageattr.c
> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>                         if (ret)
>                                 goto unlock;
>
> -                       ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> +                       ret = walk_kernel_page_table_range(lm_start, lm_end,
>                                                     &pageattr_ops, NULL, &masks);
>                         if (ret)
>                                 goto unlock;
> @@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>                 if (ret)
>                         goto unlock;
>
> -               ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> +               ret = walk_kernel_page_table_range(lm_start, lm_end,
>                                             &pageattr_ops, NULL, &masks);
>                 if (ret)
>                         goto unlock;
>         }
>
> -       ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
> +       ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
>                                      &masks);
>
>  unlock:
> @@ -335,7 +335,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>          */
>         flush_tlb_all();
>  #else
> -       ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
> +       ret =  walk_kernel_page_table_range(start, end, &pageattr_ops, NULL,
>                                      &masks);
>
>         mmap_write_unlock(&init_mm);
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 9700a29f8afb..8ac2f6d6d2a3 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -129,10 +129,9 @@ struct mm_walk {
>  int walk_page_range(struct mm_struct *mm, unsigned long start,
>                 unsigned long end, const struct mm_walk_ops *ops,
>                 void *private);
> -int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> -                         unsigned long end, const struct mm_walk_ops *ops,
> -                         pgd_t *pgd,
> -                         void *private);
> +int walk_kernel_page_table_range(unsigned long start,
> +               unsigned long end, const struct mm_walk_ops *ops,
> +               pgd_t *pgd, void *private);
>  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>                         unsigned long end, const struct mm_walk_ops *ops,
>                         void *private);
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 27245e86df25..ba0fb1b6a5a8 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -166,7 +166,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>         VM_BUG_ON(!PAGE_ALIGNED(start | end));
>
>         mmap_read_lock(&init_mm);
> -       ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
> +       ret = walk_kernel_page_table_range(start, end, &vmemmap_remap_ops,
>                                     NULL, walk);
>         mmap_read_unlock(&init_mm);
>         if (ret)
> diff --git a/mm/internal.h b/mm/internal.h
> index 6b8ed2017743..43788d0de6e3 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -1605,6 +1605,10 @@ static inline void accept_page(struct page *page)
>  int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>                 unsigned long end, const struct mm_walk_ops *ops,
>                 void *private);
> +int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> +                         unsigned long end, const struct mm_walk_ops *ops,
> +                         pgd_t *pgd,
> +                         void *private);
>
>  /* pt_reclaim.c */
>  bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e478777c86e1..057a125c3bc0 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>         return walk_page_range_mm(mm, start, end, ops, private);
>  }
>
> +static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> +               unsigned long end, const struct mm_walk_ops *ops,
> +               pgd_t *pgd, void *private)
> +{
> +       struct mm_walk walk = {
> +               .ops            = ops,
> +               .mm             = mm,
> +               .pgd            = pgd,
> +               .private        = private,
> +               .no_vma         = true
> +       };
> +
> +       if (start >= end || !walk.mm)
> +               return -EINVAL;
> +       if (!check_ops_valid(ops))
> +               return -EINVAL;
> +
> +       return walk_pgd_range(start, end, &walk);
> +}
> +
>  /**
> - * walk_page_range_novma - walk a range of pagetables not backed by a vma
> - * @mm:                mm_struct representing the target process of page table walk
> + * walk_kernel_page_table_range - walk a range of kernel pagetables.
>   * @start:     start address of the virtual address range
>   * @end:       end address of the virtual address range
>   * @ops:       operation to call during the walk
> @@ -596,56 +615,69 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   * Similar to walk_page_range() but can walk any page tables even if they are
>   * not backed by VMAs. Because 'unusual' entries may be walked this function
>   * will also not lock the PTEs for the pte_entry() callback. This is useful for
> - * walking the kernel pages tables or page tables for firmware.
> + * walking kernel pages tables or page tables for firmware.
>   *
>   * Note: Be careful to walk the kernel pages tables, the caller may be need to
>   * take other effective approaches (mmap lock may be insufficient) to prevent
>   * the intermediate kernel page tables belonging to the specified address range
>   * from being freed (e.g. memory hot-remove).
>   */
> -int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> +int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> +               const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
> +{
> +       struct mm_struct *mm = &init_mm;
> +
> +       /*
> +        * Kernel intermediate page tables are usually not freed, so the mmap
> +        * read lock is sufficient. But there are some exceptions.
> +        * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> +        * to prevent the intermediate kernel pages tables belonging to the
> +        * specified address range from being freed. The caller should take
> +        * other actions to prevent this race.
> +        */
> +       mmap_assert_locked(mm);
> +
> +       return __walk_page_range_novma(mm, start, end, ops, pgd, private);
> +}
> +
> +/**
> + * walk_page_range_debug - walk a range of pagetables not backed by a vma
> + * @mm:                mm_struct representing the target process of page table walk
> + * @start:     start address of the virtual address range
> + * @end:       end address of the virtual address range
> + * @ops:       operation to call during the walk
> + * @pgd:       pgd to walk if different from mm->pgd
> + * @private:   private data for callbacks' usage
> + *
> + * Similar to walk_page_range() but can walk any page tables even if they are
> + * not backed by VMAs. Because 'unusual' entries may be walked this function
> + * will also not lock the PTEs for the pte_entry() callback.
> + *
> + * This is for debugging purposes ONLY.
> + */
> +int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
>                           unsigned long end, const struct mm_walk_ops *ops,
>                           pgd_t *pgd,
>                           void *private)
>  {
> -       struct mm_walk walk = {
> -               .ops            = ops,
> -               .mm             = mm,
> -               .pgd            = pgd,
> -               .private        = private,
> -               .no_vma         = true
> -       };
> -
> -       if (start >= end || !walk.mm)
> -               return -EINVAL;
> -       if (!check_ops_valid(ops))
> -               return -EINVAL;
> +       /*
> +        * For convenience, we allow this function to also traverse kernel
> +        * mappings.
> +        */
> +       if (mm == &init_mm)
> +               return walk_kernel_page_table_range(start, end, ops, pgd, private);
>
>         /*
> -        * 1) For walking the user virtual address space:
> -        *
>          * The mmap lock protects the page walker from changes to the page
>          * tables during the walk.  However a read lock is insufficient to
>          * protect those areas which don't have a VMA as munmap() detaches
>          * the VMAs before downgrading to a read lock and actually tearing
>          * down PTEs/page tables. In which case, the mmap write lock should
> -        * be hold.
> -        *
> -        * 2) For walking the kernel virtual address space:
> -        *
> -        * The kernel intermediate page tables usually do not be freed, so
> -        * the mmap map read lock is sufficient. But there are some exceptions.
> -        * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> -        * to prevent the intermediate kernel pages tables belonging to the
> -        * specified address range from being freed. The caller should take
> -        * other actions to prevent this race.
> +        * be held.
>          */
> -       if (mm == &init_mm)
> -               mmap_assert_locked(walk.mm);
> -       else
> -               mmap_assert_write_locked(walk.mm);
> +       mmap_assert_write_locked(mm);
>
> -       return walk_pgd_range(start, end, &walk);
> +       return __walk_page_range_novma(mm, start, end, ops, pgd, private);
>  }
>
>  int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> diff --git a/mm/ptdump.c b/mm/ptdump.c
> index 9374f29cdc6f..61a352aa12ed 100644
> --- a/mm/ptdump.c
> +++ b/mm/ptdump.c
> @@ -4,6 +4,7 @@
>  #include <linux/debugfs.h>
>  #include <linux/ptdump.h>
>  #include <linux/kasan.h>
> +#include "internal.h"
>
>  #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
>  /*
> @@ -177,7 +178,7 @@ void ptdump_walk_pgd(struct ptdump_state *st, struct mm_struct *mm, pgd_t *pgd)
>
>         mmap_write_lock(mm);
>         while (range->start != range->end) {
> -               walk_page_range_novma(mm, range->start, range->end,
> +               walk_page_range_debug(mm, range->start, range->end,
>                                       &ptdump_ops, pgd, st);
>                 range++;
>         }
> --
> 2.49.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
  2025-06-04 14:39 ` Suren Baghdasaryan
@ 2025-06-05  4:34 ` Oscar Salvador
  2025-06-05  6:13 ` Qi Zheng
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Oscar Salvador @ 2025-06-05  4:34 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Barry Song, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Muchun Song, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 03:19:58PM +0100, Lorenzo Stoakes wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
> 
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
> 
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
> 
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
> 
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
> 
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
> 
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
> 
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
> 
> We additionally make walk_page_range_debug() internal to mm.
> 
> Note that ptdump uses the precise same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
  2025-06-04 14:39 ` Suren Baghdasaryan
  2025-06-05  4:34 ` Oscar Salvador
@ 2025-06-05  6:13 ` Qi Zheng
  2025-06-05  6:56 ` Vlastimil Babka
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 15+ messages in thread
From: Qi Zheng @ 2025-06-05  6:13 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Barry Song, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm,
	Andrew Morton



On 6/4/25 10:19 PM, Lorenzo Stoakes wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
> 
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
> 
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
> 
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
> 
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
> 
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
> 
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
> 
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
> 
> We additionally make walk_page_range_debug() internal to mm.
> 
> Note that ptdump uses the precise same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> v2:
> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>    per David.
> 
> v1 resend:
> * Actually cc'd lists...
> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>    update commit message to referene.
> * Added Mike's off-list Acked-by.
> * Fixed up comments as per Mike.
> * Add some historic flavour to the commit message as per Mike.
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
> 
> v1:
> (accidentally sent off-list due to error in scripting)
> 
>   arch/loongarch/mm/pageattr.c |  2 +-
>   arch/openrisc/kernel/dma.c   |  4 +-
>   arch/riscv/mm/pageattr.c     |  8 +--
>   include/linux/pagewalk.h     |  7 ++-
>   mm/hugetlb_vmemmap.c         |  2 +-
>   mm/internal.h                |  4 ++
>   mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>   mm/ptdump.c                  |  3 +-
>   8 files changed, 82 insertions(+), 46 deletions(-)
> 

Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-06-05  6:13 ` Qi Zheng
@ 2025-06-05  6:56 ` Vlastimil Babka
  2025-06-05  9:24   ` Lorenzo Stoakes
  2025-06-05  7:45 ` David Hildenbrand
  2025-06-05 19:19 ` Jann Horn
  5 siblings, 1 reply; 15+ messages in thread
From: Vlastimil Babka @ 2025-06-05  6:56 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Barry Song, David Hildenbrand, Liam R . Howlett, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Muchun Song, Oscar Salvador,
	Huacai Chen, WANG Xuerui, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, Jann Horn, loongarch, linux-kernel,
	linux-openrisc, linux-riscv, linux-mm

On 6/4/25 16:19, Lorenzo Stoakes wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
> 
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
> 
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
> 
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
> 
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
> 
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
> 
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
> 
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
> 
> We additionally make walk_page_range_debug() internal to mm.
> 
> Note that ptdump uses the precise same function for kernel walking as a

IMHO it's not clear at this point what "the precise same function" means.

> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.

  ^ walk_page_range_debug()

Maybe this could be reworded in the sense (AFAIU) that
walk_page_range_debug() can be used for both user space page table walking
or kernel depending on what mm is passed, so in the case of init_mm it
invokes walk_kernel_page_table_range() internally.

> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
> v2:
> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>   per David.
> 
> v1 resend:
> * Actually cc'd lists...
> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>   update commit message to referene.
> * Added Mike's off-list Acked-by.
> * Fixed up comments as per Mike.
> * Add some historic flavour to the commit message as per Mike.
> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
> 
> v1:
> (accidentally sent off-list due to error in scripting)
> 
>  arch/loongarch/mm/pageattr.c |  2 +-
>  arch/openrisc/kernel/dma.c   |  4 +-
>  arch/riscv/mm/pageattr.c     |  8 +--
>  include/linux/pagewalk.h     |  7 ++-
>  mm/hugetlb_vmemmap.c         |  2 +-
>  mm/internal.h                |  4 ++
>  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>  mm/ptdump.c                  |  3 +-
>  8 files changed, 82 insertions(+), 46 deletions(-)
> 
> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> index 99165903908a..f5e910b68229 100644
> --- a/arch/loongarch/mm/pageattr.c
> +++ b/arch/loongarch/mm/pageattr.c
> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>  		return 0;
> 
>  	mmap_write_lock(&init_mm);
> -	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> +	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>  	mmap_write_unlock(&init_mm);

You've removed init_mm from walk_page_range_novma() but I see most callers
do the locking of init_mm immediately around it. This suggests a version
handling that automatically? A bit complicated by the read/write
possibilities, so maybe not worth wrapping? Just a thought, as David says ;)

> 
>  	flush_tlb_kernel_range(start, end);
> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> index 3a7b5baaa450..af932a4ad306 100644
> --- a/arch/openrisc/kernel/dma.c
> +++ b/arch/openrisc/kernel/dma.c
> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>  	 * them and setting the cache-inhibit bit.
>  	 */
>  	mmap_write_lock(&init_mm);
> -	error = walk_page_range_novma(&init_mm, va, va + size,
> +	error = walk_kernel_page_table_range(va, va + size,
>  			&set_nocache_walk_ops, NULL, NULL);
>  	mmap_write_unlock(&init_mm);
> 
> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> 
>  	mmap_write_lock(&init_mm);
>  	/* walk_page_range shouldn't be able to fail here */
> -	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> +	WARN_ON(walk_kernel_page_table_range(va, va + size,
>  			&clear_nocache_walk_ops, NULL, NULL));
>  	mmap_write_unlock(&init_mm);
>  }
> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> index d815448758a1..3f76db3d2769 100644
> --- a/arch/riscv/mm/pageattr.c
> +++ b/arch/riscv/mm/pageattr.c
> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>  			if (ret)
>  				goto unlock;
> 
> -			ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> +			ret = walk_kernel_page_table_range(lm_start, lm_end,
>  						    &pageattr_ops, NULL, &masks);

Note this and other places break the second line's arguments alignment on
the opening bracket. Maybe it just shows it's a bit fragile style...



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-06-05  6:56 ` Vlastimil Babka
@ 2025-06-05  7:45 ` David Hildenbrand
  2025-06-05  9:33   ` Lorenzo Stoakes
  2025-06-05 19:19 ` Jann Horn
  5 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-05  7:45 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Barry Song, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Muchun Song, Oscar Salvador,
	Huacai Chen, WANG Xuerui, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, Jann Horn, loongarch, linux-kernel,
	linux-openrisc, linux-riscv, linux-mm

On 04.06.25 16:19, Lorenzo Stoakes wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
> 
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
> 
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
> 
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
> 
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
> 
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
> 
> Doing this also provide self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
> 
> We therefore establish two separate functions - walk_page_range_debug() for
> this single usage, and walk_kernel_page_table_range() for general kernel
> page table walking.
> 
> We additionally make walk_page_range_debug() internal to mm.
> 
> Note that ptdump uses the precise same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---


[...]

>   bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e478777c86e1..057a125c3bc0 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
>   	return walk_page_range_mm(mm, start, end, ops, private);
>   }
> 
> +static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> +		unsigned long end, const struct mm_walk_ops *ops,
> +		pgd_t *pgd, void *private)
> +{
> +	struct mm_walk walk = {
> +		.ops		= ops,
> +		.mm		= mm,
> +		.pgd		= pgd,
> +		.private	= private,
> +		.no_vma		= true
> +	};
> +
> +	if (start >= end || !walk.mm)
> +		return -EINVAL;
> +	if (!check_ops_valid(ops))
> +		return -EINVAL;

I'm wondering if that could be moved into walk_pgd_range().

> +
> +	return walk_pgd_range(start, end, &walk);
> +}
> +

I would inline that into both functions and finally get rid of that 
"novma" ... beauty of a function.

Well, we still have the "no_vma" parameter, but that's a different thing.

E.g.,, there is no need to check for walk.mm in the 
walk_kernel_page_table_range() case.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05  6:56 ` Vlastimil Babka
@ 2025-06-05  9:24   ` Lorenzo Stoakes
  2025-06-05  9:42     ` Muchun Song
  0 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-06-05  9:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Barry Song, David Hildenbrand, Liam R . Howlett,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
> On 6/4/25 16:19, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
> >
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > It is far from certain that such page tables should even exist, but perhaps
> > this is precisely why it is useful as a debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provide self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore establish two separate functions - walk_page_range_debug() for
> > this single usage, and walk_kernel_page_table_range() for general kernel
> > page table walking.
> >
> > We additionally make walk_page_range_debug() internal to mm.
> >
> > Note that ptdump uses the precise same function for kernel walking as a
>
> IMHO it's not clear at this point what "the precise same function" means.
>
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>
>   ^ walk_page_range_debug()

Oops will fix.

>
> Maybe this could be reworded in the sense (AFAIU) that
> walk_page_range_debug() can be used for both user space page table walking
> or kernel depending on what mm is passed, so in the case of init_mm it
> invokes walk_kernel_page_table_range() internally.

Sure.

>
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> > v2:
> > * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> > * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> > * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
> >   per David.
> >
> > v1 resend:
> > * Actually cc'd lists...
> > * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
> >   update commit message to referene.
> > * Added Mike's off-list Acked-by.
> > * Fixed up comments as per Mike.
> > * Add some historic flavour to the commit message as per Mike.
> > https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
> >
> > v1:
> > (accidentally sent off-list due to error in scripting)
> >
> >  arch/loongarch/mm/pageattr.c |  2 +-
> >  arch/openrisc/kernel/dma.c   |  4 +-
> >  arch/riscv/mm/pageattr.c     |  8 +--
> >  include/linux/pagewalk.h     |  7 ++-
> >  mm/hugetlb_vmemmap.c         |  2 +-
> >  mm/internal.h                |  4 ++
> >  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
> >  mm/ptdump.c                  |  3 +-
> >  8 files changed, 82 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> > index 99165903908a..f5e910b68229 100644
> > --- a/arch/loongarch/mm/pageattr.c
> > +++ b/arch/loongarch/mm/pageattr.c
> > @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
> >  		return 0;
> >
> >  	mmap_write_lock(&init_mm);
> > -	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> > +	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
> >  	mmap_write_unlock(&init_mm);
>
> You've removed init_mm from walk_page_range_novma() but I see most callers
> do the locking of init_mm immediately around it. This suggests a version
> handling that automatically? A bit complicated by the read/write
> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)

Most callers write lock interestingly, but then one read lock's, so we can't
just assume and would need to pass a boolean which would kind of suck.

Also other walkers assume the caller has the lock so it's consistent to
keep it this way.

>
> >
> >  	flush_tlb_kernel_range(start, end);
> > diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> > index 3a7b5baaa450..af932a4ad306 100644
> > --- a/arch/openrisc/kernel/dma.c
> > +++ b/arch/openrisc/kernel/dma.c
> > @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
> >  	 * them and setting the cache-inhibit bit.
> >  	 */
> >  	mmap_write_lock(&init_mm);
> > -	error = walk_page_range_novma(&init_mm, va, va + size,
> > +	error = walk_kernel_page_table_range(va, va + size,
> >  			&set_nocache_walk_ops, NULL, NULL);
> >  	mmap_write_unlock(&init_mm);
> >
> > @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> >
> >  	mmap_write_lock(&init_mm);
> >  	/* walk_page_range shouldn't be able to fail here */
> > -	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> > +	WARN_ON(walk_kernel_page_table_range(va, va + size,
> >  			&clear_nocache_walk_ops, NULL, NULL));
> >  	mmap_write_unlock(&init_mm);
> >  }
> > diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> > index d815448758a1..3f76db3d2769 100644
> > --- a/arch/riscv/mm/pageattr.c
> > +++ b/arch/riscv/mm/pageattr.c
> > @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >  			if (ret)
> >  				goto unlock;
> >
> > -			ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> > +			ret = walk_kernel_page_table_range(lm_start, lm_end,
> >  						    &pageattr_ops, NULL, &masks);
>
> Note this and other places break the second line's arguments alignment on
> the opening bracket. Maybe it just shows it's a bit fragile style...
>
>

Yeah I know :) I know you won't believe this coming from me, but I was
trying to minimise the churn :P

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05  7:45 ` David Hildenbrand
@ 2025-06-05  9:33   ` Lorenzo Stoakes
  0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-06-05  9:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Barry Song, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Thu, Jun 05, 2025 at 09:45:47AM +0200, David Hildenbrand wrote:
> On 04.06.25 16:19, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
> >
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > It is far from certain that such page tables should even exist, but perhaps
> > this is precisely why it is useful as a debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provide self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore establish two separate functions - walk_page_range_debug() for
> > this single usage, and walk_kernel_page_table_range() for general kernel
> > page table walking.
> >
> > We additionally make walk_page_range_debug() internal to mm.
> >
> > Note that ptdump uses the precise same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
>
>
> [...]
>
> >   bool try_get_and_clear_pmd(struct mm_struct *mm, pmd_t *pmd, pmd_t *pmdval);
> > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > index e478777c86e1..057a125c3bc0 100644
> > --- a/mm/pagewalk.c
> > +++ b/mm/pagewalk.c
> > @@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
> >   	return walk_page_range_mm(mm, start, end, ops, private);
> >   }
> >
> > +static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> > +		unsigned long end, const struct mm_walk_ops *ops,
> > +		pgd_t *pgd, void *private)
> > +{
> > +	struct mm_walk walk = {
> > +		.ops		= ops,
> > +		.mm		= mm,
> > +		.pgd		= pgd,
> > +		.private	= private,
> > +		.no_vma		= true
> > +	};
> > +
> > +	if (start >= end || !walk.mm)
> > +		return -EINVAL;
> > +	if (!check_ops_valid(ops))
> > +		return -EINVAL;
>
> I'm wondering if that could be moved into walk_pgd_range().

There's stuff that gets called before walk_pgd_range(), see __walk_page_range()
for instance which will invoke ops->pre_vma() for instance.

And of course since we invoke walk_pgd_range() direct in this beautiful case,
but not in others we can't put that there either :)

>
> > +
> > +	return walk_pgd_range(start, end, &walk);
> > +}
> > +
>
> I would inline that into both functions and finally get rid of that "novma"
> ... beauty of a function.

Sure can do, I separated that out to avoid duplication, but it's not
exactly a massive amount of code so probably not too dreadful to just open
code it.

Will do this on a respin.

>
> Well, we still have the "no_vma" parameter, but that's a different thing.

Yeah...

>
> E.g.,, there is no need to check for walk.mm in the
> walk_kernel_page_table_range() case.

I feel overall there's more refactoring that could be done, obviously
overall we want to make this code do a _lot_ more when somebody has enough
time to generalise the page table walking logic for more kernel stuff :)

But of course this is all a question of time, as always...

>
> --
> Cheers,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05  9:24   ` Lorenzo Stoakes
@ 2025-06-05  9:42     ` Muchun Song
  2025-06-05  9:56       ` Lorenzo Stoakes
  0 siblings, 1 reply; 15+ messages in thread
From: Muchun Song @ 2025-06-05  9:42 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Barry Song, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm



> On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> 
> On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
>> On 6/4/25 16:19, Lorenzo Stoakes wrote:
>>> The walk_page_range_novma() function is rather confusing - it supports two
>>> modes, one used often, the other used only for debugging.
>>> 
>>> The first mode is the common case of traversal of kernel page tables, which
>>> is what nearly all callers use this for.
>>> 
>>> Secondly it provides an unusual debugging interface that allows for the
>>> traversal of page tables in a userland range of memory even for that memory
>>> which is not described by a VMA.
>>> 
>>> It is far from certain that such page tables should even exist, but perhaps
>>> this is precisely why it is useful as a debugging mechanism.
>>> 
>>> As a result, this is utilised by ptdump only. Historically, things were
>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>> to use the kernel page table walking here.
>>> 
>>> Since we have some complicated and confusing locking rules for the novma
>>> case, it makes sense to separate the two usages into their own functions.
>>> 
>>> Doing this also provide self-documentation as to the intent of the caller -
>>> are they doing something rather unusual or are they simply doing a standard
>>> kernel page table walk?
>>> 
>>> We therefore establish two separate functions - walk_page_range_debug() for
>>> this single usage, and walk_kernel_page_table_range() for general kernel
>>> page table walking.
>>> 
>>> We additionally make walk_page_range_debug() internal to mm.
>>> 
>>> Note that ptdump uses the precise same function for kernel walking as a
>> 
>> IMHO it's not clear at this point what "the precise same function" means.
>> 
>>> convenience, so we permit this but make it very explicit by having
>>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>> 
>>  ^ walk_page_range_debug()
> 
> Oops will fix.
> 
>> 
>> Maybe this could be reworded in the sense (AFAIU) that
>> walk_page_range_debug() can be used for both user space page table walking
>> or kernel depending on what mm is passed, so in the case of init_mm it
>> invokes walk_kernel_page_table_range() internally.
> 
> Sure.
> 
>> 
>>> 
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> ---
>>> v2:
>>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
>>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
>>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>>>  per David.
>>> 
>>> v1 resend:
>>> * Actually cc'd lists...
>>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>>>  update commit message to referene.
>>> * Added Mike's off-list Acked-by.
>>> * Fixed up comments as per Mike.
>>> * Add some historic flavour to the commit message as per Mike.
>>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>>> 
>>> v1:
>>> (accidentally sent off-list due to error in scripting)
>>> 
>>> arch/loongarch/mm/pageattr.c |  2 +-
>>> arch/openrisc/kernel/dma.c   |  4 +-
>>> arch/riscv/mm/pageattr.c     |  8 +--
>>> include/linux/pagewalk.h     |  7 ++-
>>> mm/hugetlb_vmemmap.c         |  2 +-
>>> mm/internal.h                |  4 ++
>>> mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>>> mm/ptdump.c                  |  3 +-
>>> 8 files changed, 82 insertions(+), 46 deletions(-)
>>> 
>>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
>>> index 99165903908a..f5e910b68229 100644
>>> --- a/arch/loongarch/mm/pageattr.c
>>> +++ b/arch/loongarch/mm/pageattr.c
>>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>>> return 0;
>>> 
>>> mmap_write_lock(&init_mm);
>>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
>>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>>> mmap_write_unlock(&init_mm);
>> 
>> You've removed init_mm from walk_page_range_novma() but I see most callers
>> do the locking of init_mm immediately around it. This suggests a version
>> handling that automatically? A bit complicated by the read/write
>> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
> 
> Most callers write lock interestingly, but then one read lock's, so we can't
> just assume and would need to pass a boolean which would kind of suck.

Hi Lorenzo,

Actually, the write lock introduced in commit 8782fb61cc848 to fix the
race condition when walking user page tables can be replaced with a read
lock. As explained in commit b123d09304d86, it is safe to walk kernel
page tables while holding the mmap read lock. The function name
`walk_kernel_page_table_range` clearly indicates its purpose: walking
kernel page tables. Thus, using a read lock internally is appropriate
and safe. Please correct me, if I am wrong.

To further enhance robustness, it is better to add a WARN_ON check to
ensure that the address range passed to walk_kernel_page_table_range
is indeed within the kernel address space. This will help prevent any
accidental misuse and catch issues early.

Muchun,
Thanks.

> 
> Also other walkers assume the caller has the lock so it's consistent to
> keep it this way.
> 
>> 
>>> 
>>> flush_tlb_kernel_range(start, end);
>>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
>>> index 3a7b5baaa450..af932a4ad306 100644
>>> --- a/arch/openrisc/kernel/dma.c
>>> +++ b/arch/openrisc/kernel/dma.c
>>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>>>  * them and setting the cache-inhibit bit.
>>>  */
>>> mmap_write_lock(&init_mm);
>>> - error = walk_page_range_novma(&init_mm, va, va + size,
>>> + error = walk_kernel_page_table_range(va, va + size,
>>> &set_nocache_walk_ops, NULL, NULL);
>>> mmap_write_unlock(&init_mm);
>>> 
>>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>>> 
>>> mmap_write_lock(&init_mm);
>>> /* walk_page_range shouldn't be able to fail here */
>>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
>>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
>>> &clear_nocache_walk_ops, NULL, NULL));
>>> mmap_write_unlock(&init_mm);
>>> }
>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
>>> index d815448758a1..3f76db3d2769 100644
>>> --- a/arch/riscv/mm/pageattr.c
>>> +++ b/arch/riscv/mm/pageattr.c
>>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>>> if (ret)
>>> goto unlock;
>>> 
>>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
>>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
>>>     &pageattr_ops, NULL, &masks);
>> 
>> Note this and other places break the second line's arguments alignment on
>> the opening bracket. Maybe it just shows it's a bit fragile style...
>> 
>> 
> 
> Yeah I know :) I know you won't believe this coming from me, but I was
> trying to minimise the churn :P



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05  9:42     ` Muchun Song
@ 2025-06-05  9:56       ` Lorenzo Stoakes
  2025-06-05 12:11         ` Muchun Song
  0 siblings, 1 reply; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-06-05  9:56 UTC (permalink / raw)
  To: Muchun Song
  Cc: Vlastimil Babka, Andrew Morton, Barry Song, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Thu, Jun 05, 2025 at 05:42:16PM +0800, Muchun Song wrote:
>
>
> > On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
> >> On 6/4/25 16:19, Lorenzo Stoakes wrote:
> >>> The walk_page_range_novma() function is rather confusing - it supports two
> >>> modes, one used often, the other used only for debugging.
> >>>
> >>> The first mode is the common case of traversal of kernel page tables, which
> >>> is what nearly all callers use this for.
> >>>
> >>> Secondly it provides an unusual debugging interface that allows for the
> >>> traversal of page tables in a userland range of memory even for that memory
> >>> which is not described by a VMA.
> >>>
> >>> It is far from certain that such page tables should even exist, but perhaps
> >>> this is precisely why it is useful as a debugging mechanism.
> >>>
> >>> As a result, this is utilised by ptdump only. Historically, things were
> >>> reversed - ptdump was the only user, and other parts of the kernel evolved
> >>> to use the kernel page table walking here.
> >>>
> >>> Since we have some complicated and confusing locking rules for the novma
> >>> case, it makes sense to separate the two usages into their own functions.
> >>>
> >>> Doing this also provide self-documentation as to the intent of the caller -
> >>> are they doing something rather unusual or are they simply doing a standard
> >>> kernel page table walk?
> >>>
> >>> We therefore establish two separate functions - walk_page_range_debug() for
> >>> this single usage, and walk_kernel_page_table_range() for general kernel
> >>> page table walking.
> >>>
> >>> We additionally make walk_page_range_debug() internal to mm.
> >>>
> >>> Note that ptdump uses the precise same function for kernel walking as a
> >>
> >> IMHO it's not clear at this point what "the precise same function" means.
> >>
> >>> convenience, so we permit this but make it very explicit by having
> >>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
> >>
> >>  ^ walk_page_range_debug()
> >
> > Oops will fix.
> >
> >>
> >> Maybe this could be reworded in the sense (AFAIU) that
> >> walk_page_range_debug() can be used for both user space page table walking
> >> or kernel depending on what mm is passed, so in the case of init_mm it
> >> invokes walk_kernel_page_table_range() internally.
> >
> > Sure.
> >
> >>
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> >>> ---
> >>> v2:
> >>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> >>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> >>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
> >>>  per David.
> >>>
> >>> v1 resend:
> >>> * Actually cc'd lists...
> >>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
> >>>  update commit message to referene.
> >>> * Added Mike's off-list Acked-by.
> >>> * Fixed up comments as per Mike.
> >>> * Add some historic flavour to the commit message as per Mike.
> >>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
> >>>
> >>> v1:
> >>> (accidentally sent off-list due to error in scripting)
> >>>
> >>> arch/loongarch/mm/pageattr.c |  2 +-
> >>> arch/openrisc/kernel/dma.c   |  4 +-
> >>> arch/riscv/mm/pageattr.c     |  8 +--
> >>> include/linux/pagewalk.h     |  7 ++-
> >>> mm/hugetlb_vmemmap.c         |  2 +-
> >>> mm/internal.h                |  4 ++
> >>> mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
> >>> mm/ptdump.c                  |  3 +-
> >>> 8 files changed, 82 insertions(+), 46 deletions(-)
> >>>
> >>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> >>> index 99165903908a..f5e910b68229 100644
> >>> --- a/arch/loongarch/mm/pageattr.c
> >>> +++ b/arch/loongarch/mm/pageattr.c
> >>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
> >>> return 0;
> >>>
> >>> mmap_write_lock(&init_mm);
> >>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> >>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
> >>> mmap_write_unlock(&init_mm);
> >>
> >> You've removed init_mm from walk_page_range_novma() but I see most callers
> >> do the locking of init_mm immediately around it. This suggests a version
> >> handling that automatically? A bit complicated by the read/write
> >> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
> >
> > Most callers write lock interestingly, but then one read lock's, so we can't
> > just assume and would need to pass a boolean which would kind of suck.
>
> Hi Lorenzo,
>
> Actually, the write lock introduced in commit 8782fb61cc848 to fix the
> race condition when walking user page tables can be replaced with a read
> lock. As explained in commit b123d09304d86, it is safe to walk kernel
> page tables while holding the mmap read lock. The function name
> `walk_kernel_page_table_range` clearly indicates its purpose: walking
> kernel page tables. Thus, using a read lock internally is appropriate
> and safe. Please correct me, if I am wrong.
>
> To further enhance robustness, it is better to add a WARN_ON check to
> ensure that the address range passed to walk_kernel_page_table_range
> is indeed within the kernel address space. This will help prevent any
> accidental misuse and catch issues early.
>

Hi Muchun,

Thanks for the information. I'll chase up this locking stuff in a
follow-up, if that's ok? As I want to get this refactoring landed first
with the existing behaviour.

I did wonder about doing a WARN_ON(), but I am concerned that perhaps there
are some very odd architectures that place things in unusual locations... I
guess this would be a addr >= TASK_SIZE check?

On the 'weirdness' front, Jann's opinion is that the debug code shouldn't
even be walking page tables without VMAs in userspace anyway (and to me -
unless there's some weird arch-specific stuff out there - I am also
perplexed by this).

So we should in theory just drop the ability to do this kind of traversal
in general.

This is something I'll investigate...

There's also oddities like:

		if (walk->mm == &init_mm || addr >= TASK_SIZE)
			pte = pte_offset_kernel(pmd, addr);
		else
			pte = pte_offset_map(pmd, addr);

In the walk_pte_range() function, which I feel is wrong - you should pass
init_mm if you want to explore kernel page tables, so why are we doing this
here?

I wonder if it's for stuff like vsyscall which are ostensibly kernel
addresses but accessible in userspace... but then would we really want to
walk the page tables for this?

This is stuff I want dig into a little bit.

But overall, would you mind if I defer the WARN_ON() and other such stuff
to a follow up?

Just keen for us to firstly separate out the actual usages of these
functions very clearly, and thanks to Mike's great suggestion also making
the debug usage entirely mm-internal.

This should lay a foundation for introducing further sanity to the equation
here :)

I will properly look into the commits you mention in preparation for a
follow up though to be clear (your input is much appreciated! :)

> Muchun,
> Thanks.

Cheers, Lorenzo

>
> >
> > Also other walkers assume the caller has the lock so it's consistent to
> > keep it this way.
> >
> >>
> >>>
> >>> flush_tlb_kernel_range(start, end);
> >>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> >>> index 3a7b5baaa450..af932a4ad306 100644
> >>> --- a/arch/openrisc/kernel/dma.c
> >>> +++ b/arch/openrisc/kernel/dma.c
> >>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
> >>>  * them and setting the cache-inhibit bit.
> >>>  */
> >>> mmap_write_lock(&init_mm);
> >>> - error = walk_page_range_novma(&init_mm, va, va + size,
> >>> + error = walk_kernel_page_table_range(va, va + size,
> >>> &set_nocache_walk_ops, NULL, NULL);
> >>> mmap_write_unlock(&init_mm);
> >>>
> >>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> >>>
> >>> mmap_write_lock(&init_mm);
> >>> /* walk_page_range shouldn't be able to fail here */
> >>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> >>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
> >>> &clear_nocache_walk_ops, NULL, NULL));
> >>> mmap_write_unlock(&init_mm);
> >>> }
> >>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> >>> index d815448758a1..3f76db3d2769 100644
> >>> --- a/arch/riscv/mm/pageattr.c
> >>> +++ b/arch/riscv/mm/pageattr.c
> >>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >>> if (ret)
> >>> goto unlock;
> >>>
> >>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> >>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
> >>>     &pageattr_ops, NULL, &masks);
> >>
> >> Note this and other places break the second line's arguments alignment on
> >> the opening bracket. Maybe it just shows it's a bit fragile style...
> >>
> >>
> >
> > Yeah I know :) I know you won't believe this coming from me, but I was
> > trying to minimise the churn :P
>
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05  9:56       ` Lorenzo Stoakes
@ 2025-06-05 12:11         ` Muchun Song
  0 siblings, 0 replies; 15+ messages in thread
From: Muchun Song @ 2025-06-05 12:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Vlastimil Babka, Andrew Morton, Barry Song, David Hildenbrand,
	Liam R . Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm



> On Jun 5, 2025, at 17:56, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
> 
> On Thu, Jun 05, 2025 at 05:42:16PM +0800, Muchun Song wrote:
>> 
>> 
>>> On Jun 5, 2025, at 17:24, Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote:
>>> 
>>> On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
>>>> On 6/4/25 16:19, Lorenzo Stoakes wrote:
>>>>> The walk_page_range_novma() function is rather confusing - it supports two
>>>>> modes, one used often, the other used only for debugging.
>>>>> 
>>>>> The first mode is the common case of traversal of kernel page tables, which
>>>>> is what nearly all callers use this for.
>>>>> 
>>>>> Secondly it provides an unusual debugging interface that allows for the
>>>>> traversal of page tables in a userland range of memory even for that memory
>>>>> which is not described by a VMA.
>>>>> 
>>>>> It is far from certain that such page tables should even exist, but perhaps
>>>>> this is precisely why it is useful as a debugging mechanism.
>>>>> 
>>>>> As a result, this is utilised by ptdump only. Historically, things were
>>>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>>>> to use the kernel page table walking here.
>>>>> 
>>>>> Since we have some complicated and confusing locking rules for the novma
>>>>> case, it makes sense to separate the two usages into their own functions.
>>>>> 
>>>>> Doing this also provide self-documentation as to the intent of the caller -
>>>>> are they doing something rather unusual or are they simply doing a standard
>>>>> kernel page table walk?
>>>>> 
>>>>> We therefore establish two separate functions - walk_page_range_debug() for
>>>>> this single usage, and walk_kernel_page_table_range() for general kernel
>>>>> page table walking.
>>>>> 
>>>>> We additionally make walk_page_range_debug() internal to mm.
>>>>> 
>>>>> Note that ptdump uses the precise same function for kernel walking as a
>>>> 
>>>> IMHO it's not clear at this point what "the precise same function" means.
>>>> 
>>>>> convenience, so we permit this but make it very explicit by having
>>>>> walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>>>> 
>>>> ^ walk_page_range_debug()
>>> 
>>> Oops will fix.
>>> 
>>>> 
>>>> Maybe this could be reworded in the sense (AFAIU) that
>>>> walk_page_range_debug() can be used for both user space page table walking
>>>> or kernel depending on what mm is passed, so in the case of init_mm it
>>>> invokes walk_kernel_page_table_range() internally.
>>> 
>>> Sure.
>>> 
>>>> 
>>>>> 
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>>>> ---
>>>>> v2:
>>>>> * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
>>>>> * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
>>>>> * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
>>>>> per David.
>>>>> 
>>>>> v1 resend:
>>>>> * Actually cc'd lists...
>>>>> * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
>>>>> update commit message to referene.
>>>>> * Added Mike's off-list Acked-by.
>>>>> * Fixed up comments as per Mike.
>>>>> * Add some historic flavour to the commit message as per Mike.
>>>>> https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoakes@oracle.com/
>>>>> 
>>>>> v1:
>>>>> (accidentally sent off-list due to error in scripting)
>>>>> 
>>>>> arch/loongarch/mm/pageattr.c |  2 +-
>>>>> arch/openrisc/kernel/dma.c   |  4 +-
>>>>> arch/riscv/mm/pageattr.c     |  8 +--
>>>>> include/linux/pagewalk.h     |  7 ++-
>>>>> mm/hugetlb_vmemmap.c         |  2 +-
>>>>> mm/internal.h                |  4 ++
>>>>> mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
>>>>> mm/ptdump.c                  |  3 +-
>>>>> 8 files changed, 82 insertions(+), 46 deletions(-)
>>>>> 
>>>>> diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
>>>>> index 99165903908a..f5e910b68229 100644
>>>>> --- a/arch/loongarch/mm/pageattr.c
>>>>> +++ b/arch/loongarch/mm/pageattr.c
>>>>> @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
>>>>> return 0;
>>>>> 
>>>>> mmap_write_lock(&init_mm);
>>>>> - ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
>>>>> + ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
>>>>> mmap_write_unlock(&init_mm);
>>>> 
>>>> You've removed init_mm from walk_page_range_novma() but I see most callers
>>>> do the locking of init_mm immediately around it. This suggests a version
>>>> handling that automatically? A bit complicated by the read/write
>>>> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)
>>> 
>>> Most callers write lock interestingly, but then one read lock's, so we can't
>>> just assume and would need to pass a boolean which would kind of suck.
>> 
>> Hi Lorenzo,
>> 
>> Actually, the write lock introduced in commit 8782fb61cc848 to fix the
>> race condition when walking user page tables can be replaced with a read
>> lock. As explained in commit b123d09304d86, it is safe to walk kernel
>> page tables while holding the mmap read lock. The function name
>> `walk_kernel_page_table_range` clearly indicates its purpose: walking
>> kernel page tables. Thus, using a read lock internally is appropriate
>> and safe. Please correct me, if I am wrong.
>> 
>> To further enhance robustness, it is better to add a WARN_ON check to
>> ensure that the address range passed to walk_kernel_page_table_range
>> is indeed within the kernel address space. This will help prevent any
>> accidental misuse and catch issues early.
>> 
> 
> Hi Muchun,
> 
> Thanks for the information. I'll chase up this locking stuff in a
> follow-up, if that's ok? As I want to get this refactoring landed first
> with the existing behaviour.

OK. Make sense.

> 
> I did wonder about doing a WARN_ON(), but I am concerned that perhaps there
> are some very odd architectures that place things in unusual locations... I
> guess this would be a addr >= TASK_SIZE check?

I think so.

> 
> On the 'weirdness' front, Jann's opinion is that the debug code shouldn't
> even be walking page tables without VMAs in userspace anyway (and to me -
> unless there's some weird arch-specific stuff out there - I am also
> perplexed by this).
> 
> So we should in theory just drop the ability to do this kind of traversal
> in general.
> 
> This is something I'll investigate...
> 
> There's also oddities like:
> 
> if (walk->mm == &init_mm || addr >= TASK_SIZE)
> 	pte = pte_offset_kernel(pmd, addr);
> else
> 	pte = pte_offset_map(pmd, addr);
> 
> In the walk_pte_range() function, which I feel is wrong - you should pass
> init_mm if you want to explore kernel page tables, so why are we doing this
> here?

I guess it is because some users want to walk kernel address space but
not mapped in init_mm (like efi_mm, see ptdump_efi_show()).

> 
> I wonder if it's for stuff like vsyscall which are ostensibly kernel
> addresses but accessible in userspace... but then would we really want to
> walk the page tables for this?
> 
> This is stuff I want dig into a little bit.
> 
> But overall, would you mind if I defer the WARN_ON() and other such stuff
> to a follow up?

No problem. It is up to you.

> 
> Just keen for us to firstly separate out the actual usages of these
> functions very clearly, and thanks to Mike's great suggestion also making
> the debug usage entirely mm-internal.
> 
> This should lay a foundation for introducing further sanity to the equation
> here :)
> 
> I will properly look into the commits you mention in preparation for a
> follow up though to be clear (your input is much appreciated! :)

Thank you! I look forward to your follow-up and any further insights you
might have.

Thanks.

> 
>> Muchun,
>> Thanks.
> 
> Cheers, Lorenzo
> 
>> 
>>> 
>>> Also other walkers assume the caller has the lock so it's consistent to
>>> keep it this way.
>>> 
>>>> 
>>>>> 
>>>>> flush_tlb_kernel_range(start, end);
>>>>> diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
>>>>> index 3a7b5baaa450..af932a4ad306 100644
>>>>> --- a/arch/openrisc/kernel/dma.c
>>>>> +++ b/arch/openrisc/kernel/dma.c
>>>>> @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
>>>>> * them and setting the cache-inhibit bit.
>>>>> */
>>>>> mmap_write_lock(&init_mm);
>>>>> - error = walk_page_range_novma(&init_mm, va, va + size,
>>>>> + error = walk_kernel_page_table_range(va, va + size,
>>>>> &set_nocache_walk_ops, NULL, NULL);
>>>>> mmap_write_unlock(&init_mm);
>>>>> 
>>>>> @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
>>>>> 
>>>>> mmap_write_lock(&init_mm);
>>>>> /* walk_page_range shouldn't be able to fail here */
>>>>> - WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
>>>>> + WARN_ON(walk_kernel_page_table_range(va, va + size,
>>>>> &clear_nocache_walk_ops, NULL, NULL));
>>>>> mmap_write_unlock(&init_mm);
>>>>> }
>>>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
>>>>> index d815448758a1..3f76db3d2769 100644
>>>>> --- a/arch/riscv/mm/pageattr.c
>>>>> +++ b/arch/riscv/mm/pageattr.c
>>>>> @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>>>>> if (ret)
>>>>> goto unlock;
>>>>> 
>>>>> - ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
>>>>> + ret = walk_kernel_page_table_range(lm_start, lm_end,
>>>>>    &pageattr_ops, NULL, &masks);
>>>> 
>>>> Note this and other places break the second line's arguments alignment on
>>>> the opening bracket. Maybe it just shows it's a bit fragile style...
>>>> 
>>>> 
>>> 
>>> Yeah I know :) I know you won't believe this coming from me, but I was
>>> trying to minimise the churn :P



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-06-05  7:45 ` David Hildenbrand
@ 2025-06-05 19:19 ` Jann Horn
  2025-06-05 20:23   ` David Hildenbrand
  5 siblings, 1 reply; 15+ messages in thread
From: Jann Horn @ 2025-06-05 19:19 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Barry Song, David Hildenbrand, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Muchun Song, Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 4, 2025 at 4:21 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
>
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.
>
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
>
> It is far from certain that such page tables should even exist, but perhaps
> this is precisely why it is useful as a debugging mechanism.
>
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.

Just for the record, copy-pasting my comment on v1 that was
accidentally sent off-list:
```
Sort of a tangential comment: I wonder if it would make sense to give
ptdump a different page table walker that uses roughly the same safety
contract as gup_fast() - turn off IRQs and then walk the page tables
locklessly. We'd need basically no locking and no special cases
(regarding userspace mappings at least), at the cost of having to
write the walker code such that we periodically restart the walk from
scratch and not being able to inspect referenced pages. (That might
also be nicer for debugging, since it wouldn't block on locks...)
```

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05 19:19 ` Jann Horn
@ 2025-06-05 20:23   ` David Hildenbrand
  2025-06-06 10:59     ` Jann Horn
  0 siblings, 1 reply; 15+ messages in thread
From: David Hildenbrand @ 2025-06-05 20:23 UTC (permalink / raw)
  To: Jann Horn, Lorenzo Stoakes
  Cc: Andrew Morton, Barry Song, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On 05.06.25 21:19, Jann Horn wrote:
> On Wed, Jun 4, 2025 at 4:21 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>> The walk_page_range_novma() function is rather confusing - it supports two
>> modes, one used often, the other used only for debugging.
>>
>> The first mode is the common case of traversal of kernel page tables, which
>> is what nearly all callers use this for.
>>
>> Secondly it provides an unusual debugging interface that allows for the
>> traversal of page tables in a userland range of memory even for that memory
>> which is not described by a VMA.
>>
>> It is far from certain that such page tables should even exist, but perhaps
>> this is precisely why it is useful as a debugging mechanism.
>>
>> As a result, this is utilised by ptdump only. Historically, things were
>> reversed - ptdump was the only user, and other parts of the kernel evolved
>> to use the kernel page table walking here.
> 
> Just for the record, copy-pasting my comment on v1 that was
> accidentally sent off-list:
> ```
> Sort of a tangential comment: I wonder if it would make sense to give
> ptdump a different page table walker that uses roughly the same safety
> contract as gup_fast() - turn off IRQs and then walk the page tables
> locklessly. We'd need basically no locking and no special cases
> (regarding userspace mappings at least), at the cost of having to
> write the walker code such that we periodically restart the walk from
> scratch and not being able to inspect referenced pages. (That might
> also be nicer for debugging, since it wouldn't block on locks...)
> ```

I assume we don't have to dump more than pte values etc? So 
pte_special() and friends are not relevant to get it right.

GUP-fast depend on CONFIG_HAVE_GUP_FAST, not sure if that would be a 
concern for now.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-05 20:23   ` David Hildenbrand
@ 2025-06-06 10:59     ` Jann Horn
  2025-06-06 13:41       ` Lorenzo Stoakes
  0 siblings, 1 reply; 15+ messages in thread
From: Jann Horn @ 2025-06-06 10:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Barry Song, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Muchun Song, Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Thu, Jun 5, 2025 at 10:23 PM David Hildenbrand <david@redhat.com> wrote:
> On 05.06.25 21:19, Jann Horn wrote:
> > On Wed, Jun 4, 2025 at 4:21 PM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> >> The walk_page_range_novma() function is rather confusing - it supports two
> >> modes, one used often, the other used only for debugging.
> >>
> >> The first mode is the common case of traversal of kernel page tables, which
> >> is what nearly all callers use this for.
> >>
> >> Secondly it provides an unusual debugging interface that allows for the
> >> traversal of page tables in a userland range of memory even for that memory
> >> which is not described by a VMA.
> >>
> >> It is far from certain that such page tables should even exist, but perhaps
> >> this is precisely why it is useful as a debugging mechanism.
> >>
> >> As a result, this is utilised by ptdump only. Historically, things were
> >> reversed - ptdump was the only user, and other parts of the kernel evolved
> >> to use the kernel page table walking here.
> >
> > Just for the record, copy-pasting my comment on v1 that was
> > accidentally sent off-list:
> > ```
> > Sort of a tangential comment: I wonder if it would make sense to give
> > ptdump a different page table walker that uses roughly the same safety
> > contract as gup_fast() - turn off IRQs and then walk the page tables
> > locklessly. We'd need basically no locking and no special cases
> > (regarding userspace mappings at least), at the cost of having to
> > write the walker code such that we periodically restart the walk from
> > scratch and not being able to inspect referenced pages. (That might
> > also be nicer for debugging, since it wouldn't block on locks...)
> > ```
>
> I assume we don't have to dump more than pte values etc? So
> pte_special() and friends are not relevant to get it right.
>
> GUP-fast depend on CONFIG_HAVE_GUP_FAST, not sure if that would be a
> concern for now.

Ah, good point, that's annoying... maaaybe we should just gate this
entire feature on CONFIG_HAVE_GUP_FAST to make sure the userspace
mappings are designed to be walkable in this way? It's in debugfs,
which _theoretically_
(https://docs.kernel.org/filesystems/debugfs.html) means there are no
stability guarantees, and I think it is normally used on architectures
that define CONFIG_HAVE_GUP_FAST...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-06 10:59     ` Jann Horn
@ 2025-06-06 13:41       ` Lorenzo Stoakes
  0 siblings, 0 replies; 15+ messages in thread
From: Lorenzo Stoakes @ 2025-06-06 13:41 UTC (permalink / raw)
  To: Jann Horn
  Cc: David Hildenbrand, Andrew Morton, Barry Song, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Muchun Song, Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Fri, Jun 06, 2025 at 12:59:20PM +0200, Jann Horn wrote:
> On Thu, Jun 5, 2025 at 10:23 PM David Hildenbrand <david@redhat.com> wrote:
> > On 05.06.25 21:19, Jann Horn wrote:
> > > On Wed, Jun 4, 2025 at 4:21 PM Lorenzo Stoakes
> > > <lorenzo.stoakes@oracle.com> wrote:
> > >> The walk_page_range_novma() function is rather confusing - it supports two
> > >> modes, one used often, the other used only for debugging.
> > >>
> > >> The first mode is the common case of traversal of kernel page tables, which
> > >> is what nearly all callers use this for.
> > >>
> > >> Secondly it provides an unusual debugging interface that allows for the
> > >> traversal of page tables in a userland range of memory even for that memory
> > >> which is not described by a VMA.
> > >>
> > >> It is far from certain that such page tables should even exist, but perhaps
> > >> this is precisely why it is useful as a debugging mechanism.
> > >>
> > >> As a result, this is utilised by ptdump only. Historically, things were
> > >> reversed - ptdump was the only user, and other parts of the kernel evolved
> > >> to use the kernel page table walking here.
> > >
> > > Just for the record, copy-pasting my comment on v1 that was
> > > accidentally sent off-list:
> > > ```
> > > Sort of a tangential comment: I wonder if it would make sense to give
> > > ptdump a different page table walker that uses roughly the same safety
> > > contract as gup_fast() - turn off IRQs and then walk the page tables
> > > locklessly. We'd need basically no locking and no special cases
> > > (regarding userspace mappings at least), at the cost of having to
> > > write the walker code such that we periodically restart the walk from
> > > scratch and not being able to inspect referenced pages. (That might
> > > also be nicer for debugging, since it wouldn't block on locks...)
> > > ```
> >
> > I assume we don't have to dump more than pte values etc? So
> > pte_special() and friends are not relevant to get it right.
> >
> > GUP-fast depend on CONFIG_HAVE_GUP_FAST, not sure if that would be a
> > concern for now.
>
> Ah, good point, that's annoying... maaaybe we should just gate this
> entire feature on CONFIG_HAVE_GUP_FAST to make sure the userspace
> mappings are designed to be walkable in this way? It's in debugfs,
> which _theoretically_
> (https://docs.kernel.org/filesystems/debugfs.html) means there are no
> stability guarantees, and I think it is normally used on architectures
> that define CONFIG_HAVE_GUP_FAST...

Hm, it's a nice idea, but I wonder if it's worthwhile just for ptdump?

I really hate how we're just arbitrarily using init_mm.mmap_lock as a mutex
here though.

Could we GUP fast walkers here in general I wonder...? Or optionally maybe
for more general page table walking?

I mean of course gated on availability.

We sorely need a truly generalised page walker :) though of course it's a
matter of people having time :P

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-06-06 13:42 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04 14:19 [PATCH v2] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
2025-06-04 14:39 ` Suren Baghdasaryan
2025-06-05  4:34 ` Oscar Salvador
2025-06-05  6:13 ` Qi Zheng
2025-06-05  6:56 ` Vlastimil Babka
2025-06-05  9:24   ` Lorenzo Stoakes
2025-06-05  9:42     ` Muchun Song
2025-06-05  9:56       ` Lorenzo Stoakes
2025-06-05 12:11         ` Muchun Song
2025-06-05  7:45 ` David Hildenbrand
2025-06-05  9:33   ` Lorenzo Stoakes
2025-06-05 19:19 ` Jann Horn
2025-06-05 20:23   ` David Hildenbrand
2025-06-06 10:59     ` Jann Horn
2025-06-06 13:41       ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).