linux-openrisc.vger.kernel.org archive mirror
* [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
@ 2025-06-03 19:22 Lorenzo Stoakes
  2025-06-04  7:39 ` David Hildenbrand
  0 siblings, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-06-03 19:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Barry Song, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

The walk_page_range_novma() function is rather confusing - it supports two
modes, one used often, the other used only for debugging.

The first mode is the common case of traversal of kernel page tables, which
is what nearly all callers use this for.

Secondly it provides an unusual debugging interface that allows for the
traversal of page tables in a userland range of memory even for that memory
which is not described by a VMA.

This is highly unusual and it is far from certain that such page tables
should even exist, but perhaps this is precisely why it is useful as a
debugging mechanism.

As a result, this is utilised by ptdump only. Historically, things were
reversed - ptdump was the only user, and other parts of the kernel evolved
to use the kernel page table walking here.

Since we have some complicated and confusing locking rules for the novma
case, it makes sense to separate the two usages into their own functions.

Doing this also provides self-documentation as to the intent of the caller -
are they doing something rather unusual or are they simply doing a standard
kernel page table walk?

We therefore maintain walk_page_range_novma() for this single usage, and
document the function as such.
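
For illustration, a minimal sketch of the remaining debug-only usage
(the mm, the dump_user_range() wrapper, dump_ops walker and range here
are hypothetical, not lifted from ptdump):

#include <linux/mm.h>
#include <linux/pagewalk.h>

/*
 * Sketch only: walk a user range of @mm even where no VMA describes it.
 * The mmap write lock is required, as munmap() detaches VMAs before
 * downgrading to a read lock and actually tearing down page tables.
 */
static int dump_user_range(struct mm_struct *mm, unsigned long start,
			   unsigned long end,
			   const struct mm_walk_ops *dump_ops, void *private)
{
	int ret;

	mmap_write_lock(mm);
	ret = walk_page_range_novma(mm, start, end, dump_ops,
				    NULL /* pgd */, private);
	mmap_write_unlock(mm);

	return ret;
}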

Note that ptdump uses precisely the same function for kernel walking as a
convenience, so we permit this but make it very explicit by having
walk_page_range_novma() invoke walk_page_range_kernel() in this case.

We introduce walk_page_range_kernel() for the far more common case of
kernel page table traversal.

While it would result in less churn to keep the function signature the same
for the kernel version, it doesn't make sense to pass an mm_struct in the
kernel case (it's always &init_mm), so we must modify the signature
accordingly.
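
Correspondingly, with the same includes as the sketch above, a sketch of
the common case using the new helper and the signature introduced below
(the walk_kernel_range() wrapper, ops and range are again hypothetical;
the locking shown remains the caller's responsibility):

/*
 * Sketch only: walk a kernel virtual range via walk_page_range_kernel().
 * A read lock on init_mm is usually sufficient; callers that can race
 * with e.g. memory hot-remove need additional protection of their own.
 */
static int walk_kernel_range(unsigned long start, unsigned long end,
			     const struct mm_walk_ops *ops, void *private)
{
	int ret;

	mmap_read_lock(&init_mm);
	ret = walk_page_range_kernel(start, end, ops,
				     NULL /* pgd */, private);
	mmap_read_unlock(&init_mm);

	return ret;
}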

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
v1 resend:
* Actually cc'd lists...
* Fixed mistake in walk_page_range_novma() not handling kernel mappings and
  updated the commit message to reference it.
* Added Mike's off-list Acked-by.
* Fixed up comments as per Mike.
* Added some historic flavour to the commit message as per Mike.

v1:
(accidentally sent off-list due to error in scripting)

 arch/loongarch/mm/pageattr.c |  2 +-
 arch/openrisc/kernel/dma.c   |  4 +-
 arch/riscv/mm/pageattr.c     |  8 +--
 include/linux/pagewalk.h     |  3 ++
 mm/hugetlb_vmemmap.c         |  2 +-
 mm/pagewalk.c                | 96 ++++++++++++++++++++++++------------
 6 files changed, 75 insertions(+), 40 deletions(-)

diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
index 99165903908a..b701076605b3 100644
--- a/arch/loongarch/mm/pageattr.c
+++ b/arch/loongarch/mm/pageattr.c
@@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
 		return 0;

 	mmap_write_lock(&init_mm);
-	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
+	ret = walk_page_range_kernel(start, end, &pageattr_ops, NULL, &masks);
 	mmap_write_unlock(&init_mm);

 	flush_tlb_kernel_range(start, end);
diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index 3a7b5baaa450..bc9ce5cc2e26 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
 	 * them and setting the cache-inhibit bit.
 	 */
 	mmap_write_lock(&init_mm);
-	error = walk_page_range_novma(&init_mm, va, va + size,
+	error = walk_page_range_kernel(va, va + size,
 			&set_nocache_walk_ops, NULL, NULL);
 	mmap_write_unlock(&init_mm);

@@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)

 	mmap_write_lock(&init_mm);
 	/* walk_page_range shouldn't be able to fail here */
-	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
+	WARN_ON(walk_page_range_kernel(va, va + size,
 			&clear_nocache_walk_ops, NULL, NULL));
 	mmap_write_unlock(&init_mm);
 }
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index d815448758a1..80e0cd08a4e5 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 			if (ret)
 				goto unlock;

-			ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+			ret = walk_page_range_kernel(lm_start, lm_end,
 						    &pageattr_ops, NULL, &masks);
 			if (ret)
 				goto unlock;
@@ -317,13 +317,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 		if (ret)
 			goto unlock;

-		ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
+		ret = walk_page_range_kernel(lm_start, lm_end,
 					    &pageattr_ops, NULL, &masks);
 		if (ret)
 			goto unlock;
 	}

-	ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
+	ret =  walk_page_range_kernel(start, end, &pageattr_ops, NULL,
 				     &masks);

 unlock:
@@ -335,7 +335,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 	 */
 	flush_tlb_all();
 #else
-	ret =  walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL,
+	ret =  walk_page_range_kernel(start, end, &pageattr_ops, NULL,
 				     &masks);

 	mmap_write_unlock(&init_mm);
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 9700a29f8afb..1c2dc3ae0719 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -129,6 +129,9 @@ struct mm_walk {
 int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
 		void *private);
+int walk_page_range_kernel(unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		pgd_t *pgd, void *private);
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd,
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 27245e86df25..724033d9d4a1 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -166,7 +166,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
 	VM_BUG_ON(!PAGE_ALIGNED(start | end));

 	mmap_read_lock(&init_mm);
-	ret = walk_page_range_novma(&init_mm, start, end, &vmemmap_remap_ops,
+	ret = walk_page_range_kernel(start, end, &vmemmap_remap_ops,
 				    NULL, walk);
 	mmap_read_unlock(&init_mm);
 	if (ret)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e478777c86e1..d27347ffcd63 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -584,9 +584,28 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 	return walk_page_range_mm(mm, start, end, ops, private);
 }

+static int __walk_page_range_novma(struct mm_struct *mm, unsigned long start,
+		unsigned long end, const struct mm_walk_ops *ops,
+		pgd_t *pgd, void *private)
+{
+	struct mm_walk walk = {
+		.ops		= ops,
+		.mm		= mm,
+		.pgd		= pgd,
+		.private	= private,
+		.no_vma		= true
+	};
+
+	if (start >= end || !walk.mm)
+		return -EINVAL;
+	if (!check_ops_valid(ops))
+		return -EINVAL;
+
+	return walk_pgd_range(start, end, &walk);
+}
+
 /**
- * walk_page_range_novma - walk a range of pagetables not backed by a vma
- * @mm:		mm_struct representing the target process of page table walk
+ * walk_page_range_kernel - walk a range of kernel pagetables.
  * @start:	start address of the virtual address range
  * @end:	end address of the virtual address range
  * @ops:	operation to call during the walk
@@ -596,56 +615,69 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
  * Similar to walk_page_range() but can walk any page tables even if they are
  * not backed by VMAs. Because 'unusual' entries may be walked this function
  * will also not lock the PTEs for the pte_entry() callback. This is useful for
- * walking the kernel pages tables or page tables for firmware.
+ * walking kernel pages tables or page tables for firmware.
  *
  * Note: Be careful to walk the kernel pages tables, the caller may be need to
  * take other effective approaches (mmap lock may be insufficient) to prevent
  * the intermediate kernel page tables belonging to the specified address range
  * from being freed (e.g. memory hot-remove).
  */
+int walk_page_range_kernel(unsigned long start, unsigned long end,
+		const struct mm_walk_ops *ops, pgd_t *pgd, void *private)
+{
+	struct mm_struct *mm = &init_mm;
+
+	/*
+	 * Kernel intermediate page tables are usually not freed, so the mmap
+	 * read lock is sufficient. But there are some exceptions.
+	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
+	 * to prevent the intermediate kernel pages tables belonging to the
+	 * specified address range from being freed. The caller should take
+	 * other actions to prevent this race.
+	 */
+	mmap_assert_locked(mm);
+
+	return __walk_page_range_novma(mm, start, end, ops, pgd, private);
+}
+
+/**
+ * walk_page_range_novma - walk a range of pagetables not backed by a vma
+ * @mm:		mm_struct representing the target process of page table walk
+ * @start:	start address of the virtual address range
+ * @end:	end address of the virtual address range
+ * @ops:	operation to call during the walk
+ * @pgd:	pgd to walk if different from mm->pgd
+ * @private:	private data for callbacks' usage
+ *
+ * Similar to walk_page_range() but can walk any page tables even if they are
+ * not backed by VMAs. Because 'unusual' entries may be walked this function
+ * will also not lock the PTEs for the pte_entry() callback.
+ *
+ * This is for debugging purposes ONLY.
+ */
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd,
 			  void *private)
 {
-	struct mm_walk walk = {
-		.ops		= ops,
-		.mm		= mm,
-		.pgd		= pgd,
-		.private	= private,
-		.no_vma		= true
-	};
-
-	if (start >= end || !walk.mm)
-		return -EINVAL;
-	if (!check_ops_valid(ops))
-		return -EINVAL;
+	/*
+	 * For convenience, we allow this function to also traverse kernel
+	 * mappings.
+	 */
+	if (mm == &init_mm)
+		return walk_page_range_kernel(start, end, ops, pgd, private);

 	/*
-	 * 1) For walking the user virtual address space:
-	 *
 	 * The mmap lock protects the page walker from changes to the page
 	 * tables during the walk.  However a read lock is insufficient to
 	 * protect those areas which don't have a VMA as munmap() detaches
 	 * the VMAs before downgrading to a read lock and actually tearing
 	 * down PTEs/page tables. In which case, the mmap write lock should
-	 * be hold.
-	 *
-	 * 2) For walking the kernel virtual address space:
-	 *
-	 * The kernel intermediate page tables usually do not be freed, so
-	 * the mmap map read lock is sufficient. But there are some exceptions.
-	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
-	 * to prevent the intermediate kernel pages tables belonging to the
-	 * specified address range from being freed. The caller should take
-	 * other actions to prevent this race.
+	 * be held.
 	 */
-	if (mm == &init_mm)
-		mmap_assert_locked(walk.mm);
-	else
-		mmap_assert_write_locked(walk.mm);
+	mmap_assert_write_locked(mm);

-	return walk_pgd_range(start, end, &walk);
+	return __walk_page_range_novma(mm, start, end, ops, pgd, private);
 }

 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
--
2.49.0

* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-03 19:22 [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts Lorenzo Stoakes
@ 2025-06-04  7:39 ` David Hildenbrand
  2025-06-04  8:07   ` Mike Rapoport
  2025-06-04  9:20   ` Lorenzo Stoakes
  0 siblings, 2 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-04  7:39 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Barry Song, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Muchun Song, Oscar Salvador,
	Huacai Chen, WANG Xuerui, Jonas Bonn, Stefan Kristiansson,
	Stafford Horne, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexandre Ghiti, Jann Horn, loongarch, linux-kernel,
	linux-openrisc, linux-riscv, linux-mm

On 03.06.25 21:22, Lorenzo Stoakes wrote:
> The walk_page_range_novma() function is rather confusing - it supports two
> modes, one used often, the other used only for debugging.
> 
> The first mode is the common case of traversal of kernel page tables, which
> is what nearly all callers use this for.

... and what people should be using it for 🙂

> 
> Secondly it provides an unusual debugging interface that allows for the
> traversal of page tables in a userland range of memory even for that memory
> which is not described by a VMA.
> 
> This is highly unusual and it is far from certain that such page tables
> should even exist, but perhaps this is precisely why it is useful as a
> debugging mechanism.
> 
> As a result, this is utilised by ptdump only. Historically, things were
> reversed - ptdump was the only user, and other parts of the kernel evolved
> to use the kernel page table walking here.
> 
> Since we have some complicated and confusing locking rules for the novma
> case, it makes sense to separate the two usages into their own functions.
> 
> Doing this also provides self-documentation as to the intent of the caller -
> are they doing something rather unusual or are they simply doing a standard
> kernel page table walk?
> 
> We therefore maintain walk_page_range_novma() for this single usage, and
> document the function as such.

If we have to keep this dangerous interface, it should probably be

walk_page_range_debug() or walk_page_range_dump()

> 
> Note that ptdump uses precisely the same function for kernel walking as a
> convenience, so we permit this but make it very explicit by having
> walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> 
> We introduce walk_page_range_kernel() for the far more common case of
> kernel page table traversal.

I wonder if we should give it a completely different name scheme to
highlight that this is something completely different.

walk_kernel_page_table_range()

etc.


-- 
Cheers,

David / dhildenb


* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04  7:39 ` David Hildenbrand
@ 2025-06-04  8:07   ` Mike Rapoport
  2025-06-04  8:12     ` David Hildenbrand
  2025-06-04  9:09     ` Lorenzo Stoakes
  2025-06-04  9:20   ` Lorenzo Stoakes
  1 sibling, 2 replies; 8+ messages in thread
From: Mike Rapoport @ 2025-06-04  8:07 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Barry Song, Liam R . Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> > 
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
> 
> ... and what people should be using it for 🙂
> 
> > 
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> > 
> > This is highly unusual and it is far from certain that such page tables
> > should even exist, but perhaps this is precisely why it is useful as a
> > debugging mechanism.
> > 
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> > 
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> > 
> > Doing this also provides self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> > 
> > We therefore maintain walk_page_range_novma() for this single usage, and
> > document the function as such.
> 
> If we have to keep this dangerous interface, it should probably be
> 
> walk_page_range_debug() or walk_page_range_dump()

We can also move it from include/linux/pagewalk.h to mm/internal.h
 
> > 
> > Note that ptdump uses precisely the same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> > 
> > We introduce walk_page_range_kernel() for the far more common case of
> > kernel page table traversal.
> 
> I wonder if we should give it a completely different name scheme to
> highlight that this is something completely different.
> 
> walk_kernel_page_table_range()
> 
> etc.
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04  8:07   ` Mike Rapoport
@ 2025-06-04  8:12     ` David Hildenbrand
  2025-06-04  9:09     ` Lorenzo Stoakes
  1 sibling, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-04  8:12 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Lorenzo Stoakes, Andrew Morton, Barry Song, Liam R . Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On 04.06.25 10:07, Mike Rapoport wrote:
> On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
>> On 03.06.25 21:22, Lorenzo Stoakes wrote:
>>> The walk_page_range_novma() function is rather confusing - it supports two
>>> modes, one used often, the other used only for debugging.
>>>
>>> The first mode is the common case of traversal of kernel page tables, which
>>> is what nearly all callers use this for.
>>
>> ... and what people should be using it for 🙂
>>
>>>
>>> Secondly it provides an unusual debugging interface that allows for the
>>> traversal of page tables in a userland range of memory even for that memory
>>> which is not described by a VMA.
>>>
>>> This is highly unusual and it is far from certain that such page tables
>>> should even exist, but perhaps this is precisely why it is useful as a
>>> debugging mechanism.
>>>
>>> As a result, this is utilised by ptdump only. Historically, things were
>>> reversed - ptdump was the only user, and other parts of the kernel evolved
>>> to use the kernel page table walking here.
>>>
>>> Since we have some complicated and confusing locking rules for the novma
>>> case, it makes sense to separate the two usages into their own functions.
>>>
>>> Doing this also provides self-documentation as to the intent of the caller -
>>> are they doing something rather unusual or are they simply doing a standard
>>> kernel page table walk?
>>>
>>> We therefore maintain walk_page_range_novma() for this single usage, and
>>> document the function as such.
>>
>> If we have to keep this dangerous interface, it should probably be
>>
>> walk_page_range_debug() or walk_page_range_dump()
> 
> We can also move it from include/linux/pagewalk.h to mm/internal.h

Agreed.

-- 
Cheers,

David / dhildenb


* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04  8:07   ` Mike Rapoport
  2025-06-04  8:12     ` David Hildenbrand
@ 2025-06-04  9:09     ` Lorenzo Stoakes
  2025-06-04 12:26       ` Oscar Salvador
  1 sibling, 1 reply; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-06-04  9:09 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Andrew Morton, Barry Song, Liam R . Howlett,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 11:07:05AM +0300, Mike Rapoport wrote:
> On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> > On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > > The walk_page_range_novma() function is rather confusing - it supports two
> > > modes, one used often, the other used only for debugging.
> > >
> > > The first mode is the common case of traversal of kernel page tables, which
> > > is what nearly all callers use this for.
> >
> > ... and what people should be using it for 🙂
> >
> > >
> > > Secondly it provides an unusual debugging interface that allows for the
> > > traversal of page tables in a userland range of memory even for that memory
> > > which is not described by a VMA.
> > >
> > > This is highly unusual and it is far from certain that such page tables
> > > should even exist, but perhaps this is precisely why it is useful as a
> > > debugging mechanism.
> > >
> > > As a result, this is utilised by ptdump only. Historically, things were
> > > reversed - ptdump was the only user, and other parts of the kernel evolved
> > > to use the kernel page table walking here.
> > >
> > > Since we have some complicated and confusing locking rules for the novma
> > > case, it makes sense to separate the two usages into their own functions.
> > >
> > > Doing this also provides self-documentation as to the intent of the caller -
> > > are they doing something rather unusual or are they simply doing a standard
> > > kernel page table walk?
> > >
> > > We therefore maintain walk_page_range_novma() for this single usage, and
> > > document the function as such.
> >
> > If we have to keep this dangerous interface, it should probably be
> >
> > walk_page_range_debug() or walk_page_range_dump()
>
> We can also move it from include/linux/pagewalk.h to mm/internal.h

Yeah I was wondering about a rename actually, but then thought 'oh novma kinda
fits blah blah' but you're right, this is better.

Nice idea to move to mm/internal.h also :) I like this...

Will fixup on respin

>
> > >
> > > Note that ptdump uses precisely the same function for kernel walking as a
> > > convenience, so we permit this but make it very explicit by having
> > > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> > >
> > > We introduce walk_page_range_kernel() for the far more common case of
> > > kernel page table traversal.
> >
> > I wonder if we should give it a completely different name scheme to
> > highlight that this is something completely different.
> >
> > walk_kernel_page_table_range()
> >
> > etc.
> >
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> --
> Sincerely yours,
> Mike.

* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04  7:39 ` David Hildenbrand
  2025-06-04  8:07   ` Mike Rapoport
@ 2025-06-04  9:20   ` Lorenzo Stoakes
  1 sibling, 0 replies; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-06-04  9:20 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Barry Song, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Muchun Song,
	Oscar Salvador, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 09:39:30AM +0200, David Hildenbrand wrote:
> On 03.06.25 21:22, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
>
> ... and what people should be using it for 🙂

:)

Yeah the whole intent of this patch is to detach the 'crazy debug' bit from
the 'used by arches all over the place' stuff.

Being super clear as to what you're doing matters.

>
> >
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory even for that memory
> > which is not described by a VMA.
> >
> > This is highly unusual and it is far from certain that such page tables
> > should even exist, but perhaps this is precisely why it is useful as a
> > debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provides self-documentation as to the intent of the caller -
> > are they doing something rather unusual or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore maintain walk_page_range_novma() for this single usage, and
> > document the function as such.
>
> If we have to keep this dangerous interface, it should probably be
>
> walk_page_range_debug() or walk_page_range_dump()

Ugh it's too early, I thought Mike suggested this :P but he suggested the
mm/internal.h bit.

But anyway I agree with both, will fix in v2.

>
> >
> > Note that ptdump uses precisely the same function for kernel walking as a
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_page_range_kernel() in this case.
> >
> > We introduce walk_page_range_kernel() for the far more common case of
> > kernel page table traversal.
>
> I wonder if we should give it a completely different name scheme to
> highlight that this is something completely different.
>
> walk_kernel_page_table_range()

Yeah, I think this might be a good idea actually. This is doing something
'unusual' unlike all the other walk_kernel_xxx() handlers, so this should
highlight it even more clearly.

Will fixup in v2.

>
> etc.
>
>
> --
> Cheers,
>
> David / dhildenb
>

* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04  9:09     ` Lorenzo Stoakes
@ 2025-06-04 12:26       ` Oscar Salvador
  2025-06-04 12:31         ` Lorenzo Stoakes
  0 siblings, 1 reply; 8+ messages in thread
From: Oscar Salvador @ 2025-06-04 12:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Mike Rapoport, David Hildenbrand, Andrew Morton, Barry Song,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Muchun Song, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> Nice idea to move to mm/internal.h also :) I like this...
> 
> Will fixup on respin

Dumb question but IIUC, walk_page_range_novma() will only be used by
ptdump from now on, so why not stick it into mm/ptdump.c, which is where
the only user of it lives?

-- 
Oscar Salvador
SUSE Labs

* Re: [PATCH RESEND] mm/pagewalk: split walk_page_range_novma() into kernel/user parts
  2025-06-04 12:26       ` Oscar Salvador
@ 2025-06-04 12:31         ` Lorenzo Stoakes
  0 siblings, 0 replies; 8+ messages in thread
From: Lorenzo Stoakes @ 2025-06-04 12:31 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Rapoport, David Hildenbrand, Andrew Morton, Barry Song,
	Liam R . Howlett, Vlastimil Babka, Suren Baghdasaryan,
	Michal Hocko, Muchun Song, Huacai Chen, WANG Xuerui, Jonas Bonn,
	Stefan Kristiansson, Stafford Horne, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexandre Ghiti, Jann Horn, loongarch,
	linux-kernel, linux-openrisc, linux-riscv, linux-mm

On Wed, Jun 04, 2025 at 02:26:57PM +0200, Oscar Salvador wrote:
> On Wed, Jun 04, 2025 at 10:09:05AM +0100, Lorenzo Stoakes wrote:
> > Nice idea to move to mm/internal.h also :) I like this...
> >
> > Will fixup on respin
>
> Dumb question but IIUC, walk_page_range_novma() will only be used by
> ptdump from now on, so why not stick it into mm/ptdump.c, which is where
> the only user of it lives?

There's no such thing as a dumb question :) Even though I have maybe tested
that concept in the past personally ;)

I think mm/internal.h is fine, would be weird to have the declaration in
mm/ptdump.c but implemented elsewhere and obviously we need to keep the
page walking stuff together.

So I think it is best to keep it in internal.h

>
> --
> Oscar Salvador
> SUSE Labs
