linux-arm-kernel.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Enable huge-vmalloc permission change
@ 2025-05-30  9:04 Dev Jain
  2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Dev Jain @ 2025-05-30  9:04 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel, Dev Jain

This series paves the path to enable huge mappings in vmalloc space by
default on arm64. For this we must ensure that we can handle any permission
games on vmalloc space. Currently, __change_memory_common() uses
apply_to_page_range() which does not support changing permissions for
leaf mappings. We attempt to move away from this by using walk_page_range_novma(),
similar to what riscv does right now; however, it is the responsibility
of the caller to ensure that we do not pass a range, or split the range
covering a partial leaf mapping.

This series is tied with Yang Shi's attempt [1] at using huge mappings
in the linear mapping in case the system supports BBML2, in which case
we will be able to split the linear mapping if needed without break-before-make.
Thus, Yang's series, IIUC, will be one such user of my series; suppose we
are changing permissions on a range of the linear map backed by PMD-hugepages,
then the sequence of operations should look like the following:

split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
split_range(end & ~HPAGE_PMD_MASK, end);
__change_memory_common(start, end);

However, this series can be used independently of Yang's; since currently
permission games are being played only on pte mappings (due to apply_to_page_range
not supporting otherwise), this series provides the mechanism for enabling
huge mappings for various kernel mappings like linear map and vmalloc.

[1] https://lore.kernel.org/all/20250304222018.615808-1-yang@os.amperecomputing.com/

Dev Jain (3):
  mm: Allow pagewalk without locks
  arm64: pageattr: Use walk_page_range_novma() to change memory
    permissions
  mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64

 arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
 include/linux/pagewalk.h |  4 ++
 mm/pagewalk.c            | 18 +++++++--
 3 files changed, 94 insertions(+), 9 deletions(-)

-- 
2.30.2



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 1/3] mm: Allow pagewalk without locks
  2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
@ 2025-05-30  9:04 ` Dev Jain
  2025-05-30 10:33   ` Ryan Roberts
  2025-05-30 10:57   ` Lorenzo Stoakes
  2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Dev Jain @ 2025-05-30  9:04 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel, Dev Jain

It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
This being a non-sleepable context, we cannot take the init_mm mmap lock.
Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
locks.

[1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 include/linux/pagewalk.h |  2 ++
 mm/pagewalk.c            | 12 ++++++++----
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 9700a29f8afb..9bc8853ed3de 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -14,6 +14,8 @@ enum page_walk_lock {
 	PGWALK_WRLOCK = 1,
 	/* vma is expected to be already write-locked during the walk */
 	PGWALK_WRLOCK_VERIFY = 2,
+	/* no lock is needed */
+	PGWALK_NOLOCK = 3,
 };
 
 /**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e478777c86e1..9657cf4664b2 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
 	case PGWALK_RDLOCK:
 		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
 		break;
+	default:
+		break;
 	}
 #endif
 }
@@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 	 * specified address range from being freed. The caller should take
 	 * other actions to prevent this race.
 	 */
-	if (mm == &init_mm)
-		mmap_assert_locked(walk.mm);
-	else
-		mmap_assert_write_locked(walk.mm);
+	if (ops->walk_lock != PGWALK_NOLOCK) {
+		if (mm == &init_mm)
+			mmap_assert_locked(walk.mm);
+		else
+			mmap_assert_write_locked(walk.mm);
+	}
 
 	return walk_pgd_range(start, end, &walk);
 }
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
  2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
@ 2025-05-30  9:04 ` Dev Jain
  2025-05-30 12:53   ` Ryan Roberts
  2025-06-06  9:49   ` Lorenzo Stoakes
  2025-05-30  9:04 ` [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64 Dev Jain
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Dev Jain @ 2025-05-30  9:04 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel, Dev Jain

Move away from apply_to_page_range(), which does not honour leaf mappings,
to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
if a partial range is detected.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 64 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 39fd1f7ff02a..a5c829c64969 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -8,6 +8,7 @@
 #include <linux/mem_encrypt.h>
 #include <linux/sched.h>
 #include <linux/vmalloc.h>
+#include <linux/pagewalk.h>
 
 #include <asm/cacheflush.h>
 #include <asm/pgtable-prot.h>
@@ -20,6 +21,67 @@ struct page_change_data {
 	pgprot_t clear_mask;
 };
 
+static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
+{
+	struct page_change_data *masks = walk->private;
+	unsigned long new_val = val;
+
+	new_val &= ~(pgprot_val(masks->clear_mask));
+	new_val |= (pgprot_val(masks->set_mask));
+
+	return new_val;
+}
+
+static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pud_t val = pudp_get(pud);
+
+	if (pud_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
+			return -EINVAL;
+		val = __pud(set_pageattr_masks(pud_val(val), walk));
+		set_pud(pud, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pmd_t val = pmdp_get(pmd);
+
+	if (pmd_leaf(val)) {
+		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
+			return -EINVAL;
+		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
+		set_pmd(pmd, val);
+		walk->action = ACTION_CONTINUE;
+	}
+
+	return 0;
+}
+
+static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
+			      unsigned long next, struct mm_walk *walk)
+{
+	pte_t val = ptep_get(pte);
+
+	val = __pte(set_pageattr_masks(pte_val(val), walk));
+	set_pte(pte, val);
+
+	return 0;
+}
+
+static const struct mm_walk_ops pageattr_ops = {
+	.pud_entry	= pageattr_pud_entry,
+	.pmd_entry	= pageattr_pmd_entry,
+	.pte_entry	= pageattr_pte_entry,
+	.walk_lock	= PGWALK_NOLOCK,
+};
+
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
 bool can_set_direct_map(void)
@@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 	return 0;
 }
 
-/*
- * This function assumes that the range is mapped with PAGE_SIZE pages.
- */
 static int __change_memory_common(unsigned long start, unsigned long size,
 				pgprot_t set_mask, pgprot_t clear_mask)
 {
@@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
 	data.set_mask = set_mask;
 	data.clear_mask = clear_mask;
 
-	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
-					&data);
+	ret = walk_page_range_novma(&init_mm, start, start + size,
+				    &pageattr_ops, NULL, &data);
 
 	/*
 	 * If the memory is being made valid without changing any other bits
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
  2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
  2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
  2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
@ 2025-05-30  9:04 ` Dev Jain
  2025-05-30 11:14   ` Lorenzo Stoakes
  2025-05-30  9:24 ` [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
  2025-05-30 10:03 ` Ryan Roberts
  4 siblings, 1 reply; 29+ messages in thread
From: Dev Jain @ 2025-05-30  9:04 UTC (permalink / raw)
  To: akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel, Dev Jain

arm64 implements lazy_mmu_mode to allow deferral and batching of barriers
when updating kernel PTEs, which provides a nice performance boost. arm64
currently uses apply_to_page_range() to modify kernel PTE permissions,
which runs inside lazy_mmu_mode. So to prevent a performance regression,
let's add hooks to walk_page_range_novma() to allow continued use of
lazy_mmu_mode.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
Credits to Ryan for the patch description.

 arch/arm64/mm/pageattr.c | 12 ++++++++++++
 include/linux/pagewalk.h |  2 ++
 mm/pagewalk.c            |  6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index a5c829c64969..9163324b12a0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -75,11 +75,23 @@ static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
 	return 0;
 }
 
+static void pte_lazy_mmu_enter(void)
+{
+	arch_enter_lazy_mmu_mode();
+}
+
+static void pte_lazy_mmu_leave(void)
+{
+	arch_leave_lazy_mmu_mode();
+}
+
 static const struct mm_walk_ops pageattr_ops = {
 	.pud_entry	= pageattr_pud_entry,
 	.pmd_entry	= pageattr_pmd_entry,
 	.pte_entry	= pageattr_pte_entry,
 	.walk_lock	= PGWALK_NOLOCK,
+	.pre_pte_table	= pte_lazy_mmu_enter,
+	.post_pte_table	= pte_lazy_mmu_leave,
 };
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 9bc8853ed3de..2157d345974c 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -88,6 +88,8 @@ struct mm_walk_ops {
 	int (*pre_vma)(unsigned long start, unsigned long end,
 		       struct mm_walk *walk);
 	void (*post_vma)(struct mm_walk *walk);
+	void (*pre_pte_table)(void);
+	void (*post_pte_table)(void);
 	int (*install_pte)(unsigned long addr, unsigned long next,
 			   pte_t *ptep, struct mm_walk *walk);
 	enum page_walk_lock walk_lock;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9657cf4664b2..a441f5cbbc45 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -33,6 +33,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 	const struct mm_walk_ops *ops = walk->ops;
 	int err = 0;
 
+	if (walk->ops->pre_pte_table)
+		walk->ops->pre_pte_table();
+
 	for (;;) {
 		if (ops->install_pte && pte_none(ptep_get(pte))) {
 			pte_t new_pte;
@@ -56,6 +59,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
 		addr += PAGE_SIZE;
 		pte++;
 	}
+
+	if (walk->ops->post_pte_table)
+		walk->ops->post_pte_table();
 	return err;
 }
 
-- 
2.30.2



^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re:  [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
                   ` (2 preceding siblings ...)
  2025-05-30  9:04 ` [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64 Dev Jain
@ 2025-05-30  9:24 ` Dev Jain
  2025-05-30 10:03 ` Ryan Roberts
  4 siblings, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-05-30  9:24 UTC (permalink / raw)
  To: dev.jain
  Cc: Liam.Howlett, akpm, catalin.marinas, david, gshan,
	linux-arm-kernel, linux-kernel, linux-mm, lorenzo.stoakes, mhocko,
	rppt, steven.price, surenb, suzuki.poulose, vbabka, will,
	ryan.roberts, yang, anshuman.khandual

Brilliant, after running get_maintainers I forgot to Cc the
most imp people for this :( +Yang and Ryan, along with
Anshuman.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
                   ` (3 preceding siblings ...)
  2025-05-30  9:24 ` [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
@ 2025-05-30 10:03 ` Ryan Roberts
  2025-05-30 10:10   ` Dev Jain
  4 siblings, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 10:03 UTC (permalink / raw)
  To: Dev Jain, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel

On 30/05/2025 10:04, Dev Jain wrote:
> This series paves the path to enable huge mappings in vmalloc space by
> default on arm64. > For this we must ensure that we can handle any permission
> games on vmalloc space. 

And the linear map :)

> Currently, __change_memory_common() uses
> apply_to_page_range() which does not support changing permissions for
> leaf mappings. 

nit: A "leaf mapping" is the lowest level entry in the page tables for a given
address - i.e. it maps an address to some actual memory rather than to another
pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
"block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
support block mappings.

> We attempt to move away from this by using walk_page_range_novma(),
> similar to what riscv does right now; however, it is the responsibility
> of the caller to ensure that we do not pass a range, or split the range
> covering a partial leaf mapping.
> 
> This series is tied with Yang Shi's attempt [1] at using huge mappings
> in the linear mapping in case the system supports BBML2, in which case
> we will be able to split the linear mapping if needed without break-before-make.
> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
> are changing permissions on a range of the linear map backed by PMD-hugepages,
> then the sequence of operations should look like the following:
> 
> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
> split_range(end & ~HPAGE_PMD_MASK, end);

I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not right.
It's going to give you the offset within the 2M region. You just want:

split_range(start)
split_range(end)

right?

> __change_memory_common(start, end);
> 
> However, this series can be used independently of Yang's; since currently
> permission games are being played only on pte mappings (due to apply_to_page_range
> not supporting otherwise), this series provides the mechanism for enabling
> huge mappings for various kernel mappings like linear map and vmalloc.

In other words, you are saying that this series is a prerequisite for Yang's
series (and both are prerequisites for huge vmalloc by default). Your series
adds a new capability that Yang's series will rely on (the ability to change
permissions on block mappings).

Thanks,
Ryan

> 
> [1] https://lore.kernel.org/all/20250304222018.615808-1-yang@os.amperecomputing.com/
> 
> Dev Jain (3):
>   mm: Allow pagewalk without locks
>   arm64: pageattr: Use walk_page_range_novma() to change memory
>     permissions
>   mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
> 
>  arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>  include/linux/pagewalk.h |  4 ++
>  mm/pagewalk.c            | 18 +++++++--
>  3 files changed, 94 insertions(+), 9 deletions(-)
> 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30 10:03 ` Ryan Roberts
@ 2025-05-30 10:10   ` Dev Jain
  2025-05-30 10:37     ` Ryan Roberts
  0 siblings, 1 reply; 29+ messages in thread
From: Dev Jain @ 2025-05-30 10:10 UTC (permalink / raw)
  To: Ryan Roberts, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel


On 30/05/25 3:33 pm, Ryan Roberts wrote:
> On 30/05/2025 10:04, Dev Jain wrote:
>> This series paves the path to enable huge mappings in vmalloc space by
>> default on arm64. > For this we must ensure that we can handle any permission
>> games on vmalloc space.
> And the linear map :)
>
>> Currently, __change_memory_common() uses
>> apply_to_page_range() which does not support changing permissions for
>> leaf mappings.
> nit: A "leaf mapping" is the lowest level entry in the page tables for a given
> address - i.e. it maps an address to some actual memory rather than to another
> pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
> "block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
> mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
> support block mappings.

Sorry, again got confused by nomenclature : )

>
>> We attempt to move away from this by using walk_page_range_novma(),
>> similar to what riscv does right now; however, it is the responsibility
>> of the caller to ensure that we do not pass a range, or split the range
>> covering a partial leaf mapping.
>>
>> This series is tied with Yang Shi's attempt [1] at using huge mappings
>> in the linear mapping in case the system supports BBML2, in which case
>> we will be able to split the linear mapping if needed without break-before-make.
>> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
>> are changing permissions on a range of the linear map backed by PMD-hugepages,
>> then the sequence of operations should look like the following:
>>
>> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
>> split_range(end & ~HPAGE_PMD_MASK, end);
> I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not right.
> It's going to give you the offset within the 2M region. You just want:
>
> split_range(start)
> split_range(end)
>
> right?

Suppose start = 2M + 4K, end = 8M + 5K. Then my sequence will compute to
split_range(2M + 4K, 3M)
split_range(8M, 8M + 5K)
__change_memory_common(2M + 4K, 8M + 5K)

So now __change_memory_common() wouldn't have to deal with splitting the
starts and ends. Please correct me if I am wrong.

>
>> __change_memory_common(start, end);
>>
>> However, this series can be used independently of Yang's; since currently
>> permission games are being played only on pte mappings (due to apply_to_page_range
>> not supporting otherwise), this series provides the mechanism for enabling
>> huge mappings for various kernel mappings like linear map and vmalloc.
> In other words, you are saying that this series is a prerequisite for Yang's
> series (and both are prerequisites for huge vmalloc by default). Your series
> adds a new capability that Yang's series will rely on (the ability to change
> permissions on block mappings).

That's right.

>
> Thanks,
> Ryan
>
>> [1] https://lore.kernel.org/all/20250304222018.615808-1-yang@os.amperecomputing.com/
>>
>> Dev Jain (3):
>>    mm: Allow pagewalk without locks
>>    arm64: pageattr: Use walk_page_range_novma() to change memory
>>      permissions
>>    mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
>>
>>   arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>>   include/linux/pagewalk.h |  4 ++
>>   mm/pagewalk.c            | 18 +++++++--
>>   3 files changed, 94 insertions(+), 9 deletions(-)
>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
@ 2025-05-30 10:33   ` Ryan Roberts
  2025-05-30 21:33     ` Yang Shi
  2025-05-30 10:57   ` Lorenzo Stoakes
  1 sibling, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 10:33 UTC (permalink / raw)
  To: Dev Jain, akpm, david, catalin.marinas, will, yang
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel

On 30/05/2025 10:04, Dev Jain wrote:
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
> locks.

It looks like riscv solved this problem by moving from walk_page_range_novma()
to apply_to_existing_page_range() in Commit fb1cf0878328 ("riscv: rewrite
__kernel_map_pages() to fix sleeping in invalid context"). That won't work for
us because the whole point is that we want to support changing permissions on
block mappings.

Yang:

Not directly relavent to this patch, but I do worry about the potential need to
split the range here though once Yang's series comes in - we would need to
allocate memory for pgtables atomically in softirq context. KFENCE is intended
to be enabled in production IIRC, so we can't just not allow block mapping when
KFENCE is enabled and will likely need to think of a solution for this?


> 
> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pagewalk.h |  2 ++
>  mm/pagewalk.c            | 12 ++++++++----
>  2 files changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 9700a29f8afb..9bc8853ed3de 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -14,6 +14,8 @@ enum page_walk_lock {
>  	PGWALK_WRLOCK = 1,
>  	/* vma is expected to be already write-locked during the walk */
>  	PGWALK_WRLOCK_VERIFY = 2,
> +	/* no lock is needed */
> +	PGWALK_NOLOCK = 3,

I'd imagine you either want to explicitly forbid this option for the other
entrypoints (i.e. the non- _novma variants) or you need to be able to handle
this option being passed in to the other functions, which you currently don't
do. I'd vote for explcitly disallowing (document it and return error if passed in).

>  };
>  
>  /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e478777c86e1..9657cf4664b2 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>  	case PGWALK_RDLOCK:
>  		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>  		break;
> +	default:
> +		break;
>  	}
>  #endif
>  }
> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  	 * specified address range from being freed. The caller should take
>  	 * other actions to prevent this race.
>  	 */
> -	if (mm == &init_mm)
> -		mmap_assert_locked(walk.mm);

Given apply_to_page_range() doesn't do any locking for kernel pgtable walking, I
can be convinced that it's also not required for our case using this framework.
But why does this framework believe it is necessary? Should the comment above
this at least be updated?

Thanks,
Ryan

> -	else
> -		mmap_assert_write_locked(walk.mm);
> +	if (ops->walk_lock != PGWALK_NOLOCK) {
> +		if (mm == &init_mm)
> +			mmap_assert_locked(walk.mm);
> +		else
> +			mmap_assert_write_locked(walk.mm);
> +	}
>  
>  	return walk_pgd_range(start, end, &walk);
>  }



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30 10:10   ` Dev Jain
@ 2025-05-30 10:37     ` Ryan Roberts
  2025-05-30 10:42       ` Dev Jain
  0 siblings, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 10:37 UTC (permalink / raw)
  To: Dev Jain, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel

On 30/05/2025 11:10, Dev Jain wrote:
> 
> On 30/05/25 3:33 pm, Ryan Roberts wrote:
>> On 30/05/2025 10:04, Dev Jain wrote:
>>> This series paves the path to enable huge mappings in vmalloc space by
>>> default on arm64. > For this we must ensure that we can handle any permission
>>> games on vmalloc space.
>> And the linear map :)
>>
>>> Currently, __change_memory_common() uses
>>> apply_to_page_range() which does not support changing permissions for
>>> leaf mappings.
>> nit: A "leaf mapping" is the lowest level entry in the page tables for a given
>> address - i.e. it maps an address to some actual memory rather than to another
>> pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
>> "block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
>> mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
>> support block mappings.
> 
> Sorry, again got confused by nomenclature : )
> 
>>
>>> We attempt to move away from this by using walk_page_range_novma(),
>>> similar to what riscv does right now; however, it is the responsibility
>>> of the caller to ensure that we do not pass a range, or split the range
>>> covering a partial leaf mapping.
>>>
>>> This series is tied with Yang Shi's attempt [1] at using huge mappings
>>> in the linear mapping in case the system supports BBML2, in which case
>>> we will be able to split the linear mapping if needed without break-before-make.
>>> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
>>> are changing permissions on a range of the linear map backed by PMD-hugepages,
>>> then the sequence of operations should look like the following:
>>>
>>> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
>>> split_range(end & ~HPAGE_PMD_MASK, end);
>> I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not right.
>> It's going to give you the offset within the 2M region. You just want:
>>
>> split_range(start)
>> split_range(end)
>>
>> right?
> 
> Suppose start = 2M + 4K, end = 8M + 5K. Then my sequence will compute to

8M + 5K is not a valid split point. It has to be at least page aligned.

> split_range(2M + 4K, 3M)
> split_range(8M, 8M + 5K)

We just want to split at start and end. What are the 3M and 8M params supposed
to be? Anyway, this is off-topic for this series.

> __change_memory_common(2M + 4K, 8M + 5K)
> 
> So now __change_memory_common() wouldn't have to deal with splitting the
> starts and ends. Please correct me if I am wrong.
> 
>>
>>> __change_memory_common(start, end);
>>>
>>> However, this series can be used independently of Yang's; since currently
>>> permission games are being played only on pte mappings (due to
>>> apply_to_page_range
>>> not supporting otherwise), this series provides the mechanism for enabling
>>> huge mappings for various kernel mappings like linear map and vmalloc.
>> In other words, you are saying that this series is a prerequisite for Yang's
>> series (and both are prerequisites for huge vmalloc by default). Your series
>> adds a new capability that Yang's series will rely on (the ability to change
>> permissions on block mappings).
> 
> That's right.
> 
>>
>> Thanks,
>> Ryan
>>
>>> [1] https://lore.kernel.org/all/20250304222018.615808-1-
>>> yang@os.amperecomputing.com/
>>>
>>> Dev Jain (3):
>>>    mm: Allow pagewalk without locks
>>>    arm64: pageattr: Use walk_page_range_novma() to change memory
>>>      permissions
>>>    mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
>>>
>>>   arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>>>   include/linux/pagewalk.h |  4 ++
>>>   mm/pagewalk.c            | 18 +++++++--
>>>   3 files changed, 94 insertions(+), 9 deletions(-)
>>>



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30 10:37     ` Ryan Roberts
@ 2025-05-30 10:42       ` Dev Jain
  2025-05-30 10:51         ` Ryan Roberts
  0 siblings, 1 reply; 29+ messages in thread
From: Dev Jain @ 2025-05-30 10:42 UTC (permalink / raw)
  To: Ryan Roberts, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel


On 30/05/25 4:07 pm, Ryan Roberts wrote:
> On 30/05/2025 11:10, Dev Jain wrote:
>> On 30/05/25 3:33 pm, Ryan Roberts wrote:
>>> On 30/05/2025 10:04, Dev Jain wrote:
>>>> This series paves the path to enable huge mappings in vmalloc space by
>>>> default on arm64. > For this we must ensure that we can handle any permission
>>>> games on vmalloc space.
>>> And the linear map :)
>>>
>>>> Currently, __change_memory_common() uses
>>>> apply_to_page_range() which does not support changing permissions for
>>>> leaf mappings.
>>> nit: A "leaf mapping" is the lowest level entry in the page tables for a given
>>> address - i.e. it maps an address to some actual memory rather than to another
>>> pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
>>> "block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
>>> mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
>>> support block mappings.
>> Sorry, again got confused by nomenclature : )
>>
>>>> We attempt to move away from this by using walk_page_range_novma(),
>>>> similar to what riscv does right now; however, it is the responsibility
>>>> of the caller to ensure that we do not pass a range, or split the range
>>>> covering a partial leaf mapping.
>>>>
>>>> This series is tied with Yang Shi's attempt [1] at using huge mappings
>>>> in the linear mapping in case the system supports BBML2, in which case
>>>> we will be able to split the linear mapping if needed without break-before-make.
>>>> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
>>>> are changing permissions on a range of the linear map backed by PMD-hugepages,
>>>> then the sequence of operations should look like the following:
>>>>
>>>> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
>>>> split_range(end & ~HPAGE_PMD_MASK, end);
>>> I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not right.
>>> It's going to give you the offset within the 2M region. You just want:
>>>
>>> split_range(start)
>>> split_range(end)
>>>
>>> right?
>> Suppose start = 2M + 4K, end = 8M + 5K. Then my sequence will compute to
> 8M + 5K is not a valid split point. It has to be at least page aligned.

Sorry, so consider 8M + 4K.

>> split_range(2M + 4K, 3M)
>> split_range(8M, 8M + 5K)
> We just want to split at start and end. What are the 3M and 8M params supposed
> to be? Anyway, this is off-topic for this series.

I think we are both saying the same thing; yes we will split only the start and the end,
so if the address 2Mb + 4Kb is mapped as a PMD-hugepage, we need to split this PMD into
a PTE table, which will map 2Mb till 4Mb as base pages now.

>
>> __change_memory_common(2M + 4K, 8M + 5K)
>>
>> So now __change_memory_common() wouldn't have to deal with splitting the
>> starts and ends. Please correct me if I am wrong.
>>
>>>> __change_memory_common(start, end);
>>>>
>>>> However, this series can be used independently of Yang's; since currently
>>>> permission games are being played only on pte mappings (due to
>>>> apply_to_page_range
>>>> not supporting otherwise), this series provides the mechanism for enabling
>>>> huge mappings for various kernel mappings like linear map and vmalloc.
>>> In other words, you are saying that this series is a prerequisite for Yang's
>>> series (and both are prerequisites for huge vmalloc by default). Your series
>>> adds a new capability that Yang's series will rely on (the ability to change
>>> permissions on block mappings).
>> That's right.
>>
>>> Thanks,
>>> Ryan
>>>
>>>> [1] https://lore.kernel.org/all/20250304222018.615808-1-
>>>> yang@os.amperecomputing.com/
>>>>
>>>> Dev Jain (3):
>>>>     mm: Allow pagewalk without locks
>>>>     arm64: pageattr: Use walk_page_range_novma() to change memory
>>>>       permissions
>>>>     mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
>>>>
>>>>    arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>>>>    include/linux/pagewalk.h |  4 ++
>>>>    mm/pagewalk.c            | 18 +++++++--
>>>>    3 files changed, 94 insertions(+), 9 deletions(-)
>>>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30 10:42       ` Dev Jain
@ 2025-05-30 10:51         ` Ryan Roberts
  2025-05-30 11:11           ` Dev Jain
  0 siblings, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 10:51 UTC (permalink / raw)
  To: Dev Jain, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel

On 30/05/2025 11:42, Dev Jain wrote:
> 
> On 30/05/25 4:07 pm, Ryan Roberts wrote:
>> On 30/05/2025 11:10, Dev Jain wrote:
>>> On 30/05/25 3:33 pm, Ryan Roberts wrote:
>>>> On 30/05/2025 10:04, Dev Jain wrote:
>>>>> This series paves the path to enable huge mappings in vmalloc space by
>>>>> default on arm64. > For this we must ensure that we can handle any permission
>>>>> games on vmalloc space.
>>>> And the linear map :)
>>>>
>>>>> Currently, __change_memory_common() uses
>>>>> apply_to_page_range() which does not support changing permissions for
>>>>> leaf mappings.
>>>> nit: A "leaf mapping" is the lowest level entry in the page tables for a given
>>>> address - i.e. it maps an address to some actual memory rather than to another
>>>> pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
>>>> "block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
>>>> mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
>>>> support block mappings.
>>> Sorry, again got confused by nomenclature : )
>>>
>>>>> We attempt to move away from this by using walk_page_range_novma(),
>>>>> similar to what riscv does right now; however, it is the responsibility
>>>>> of the caller to ensure that we do not pass a range, or split the range
>>>>> covering a partial leaf mapping.
>>>>>
>>>>> This series is tied with Yang Shi's attempt [1] at using huge mappings
>>>>> in the linear mapping in case the system supports BBML2, in which case
>>>>> we will be able to split the linear mapping if needed without break-before-
>>>>> make.
>>>>> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
>>>>> are changing permissions on a range of the linear map backed by PMD-hugepages,
>>>>> then the sequence of operations should look like the following:
>>>>>
>>>>> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
>>>>> split_range(end & ~HPAGE_PMD_MASK, end);
>>>> I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not
>>>> right.
>>>> It's going to give you the offset within the 2M region. You just want:
>>>>
>>>> split_range(start)
>>>> split_range(end)
>>>>
>>>> right?
>>> Suppose start = 2M + 4K, end = 8M + 5K. Then my sequence will compute to
>> 8M + 5K is not a valid split point. It has to be at least page aligned.
> 
> Sorry, so consider 8M + 4K.
> 
>>> split_range(2M + 4K, 3M)
>>> split_range(8M, 8M + 5K)
>> We just want to split at start and end. What are the 3M and 8M params supposed
>> to be? Anyway, this is off-topic for this series.
> 
> I think we are both saying the same thing; yes we will split only the start and
> the end,
> so if the address 2Mb + 4Kb is mapped as a PMD-hugepage, we need to split this
> PMD into
> a PTE table, which will map 2Mb till 4Mb as base pages now.

Not quite; if you start with [2M, 4M) mapped by PMD, then want to ensure a
boundary at 2M+4K, I think you would end up with the first 64K (16 pages) mapped
by base page, then the then the next 1984K mapped by contpte blocks (31 contpte
blocks).

But yes, we agree :)

> 
>>
>>> __change_memory_common(2M + 4K, 8M + 5K)
>>>
>>> So now __change_memory_common() wouldn't have to deal with splitting the
>>> starts and ends. Please correct me if I am wrong.
>>>
>>>>> __change_memory_common(start, end);
>>>>>
>>>>> However, this series can be used independently of Yang's; since currently
>>>>> permission games are being played only on pte mappings (due to
>>>>> apply_to_page_range
>>>>> not supporting otherwise), this series provides the mechanism for enabling
>>>>> huge mappings for various kernel mappings like linear map and vmalloc.
>>>> In other words, you are saying that this series is a prerequisite for Yang's
>>>> series (and both are prerequisites for huge vmalloc by default). Your series
>>>> adds a new capability that Yang's series will rely on (the ability to change
>>>> permissions on block mappings).
>>> That's right.
>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> [1] https://lore.kernel.org/all/20250304222018.615808-1-
>>>>> yang@os.amperecomputing.com/
>>>>>
>>>>> Dev Jain (3):
>>>>>     mm: Allow pagewalk without locks
>>>>>     arm64: pageattr: Use walk_page_range_novma() to change memory
>>>>>       permissions
>>>>>     mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
>>>>>
>>>>>    arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>>>>>    include/linux/pagewalk.h |  4 ++
>>>>>    mm/pagewalk.c            | 18 +++++++--
>>>>>    3 files changed, 94 insertions(+), 9 deletions(-)
>>>>>



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
  2025-05-30 10:33   ` Ryan Roberts
@ 2025-05-30 10:57   ` Lorenzo Stoakes
  2025-06-06  9:21     ` Dev Jain
  1 sibling, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 10:57 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel, Jann Horn

+cc Jan for page table stuff.

On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
> locks.

Hm This is worrying.

You're unconditionally making it possible for dangerous usage here - to
walk page tables without a lock.

We need to assert this is only being used in a context where this makes
sense, e.g. a no VMA range under the right circumstances.

At the very least we need asserts that we are in a circumstance where this
is permitted.

For VMAs, you must keep the VMA stable, which requires a VMA read lock at
minimum.

See
https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
for details where these requirements are documented.

I also think we should update this documentation to cover off this non-VMA
task context stuff. I can perhaps do this so as not to egregiously add
workload to this series :)

Also, again this commit message is not enough for such a major change to
core mm stuff. I think you need to underline that - in non-task context -
you are safe to manipulate _kernel_ mappings, having precluded KFENCE as a
concern.

>
> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/

Basically expand upon this information.

Basically the commit message refers to your usage, but describes a patch
that makes it possible to do unlocked page table walks.

As I get into below, no pun intended, but this needs to be _locked down_
heavily.

- Only walk_page_range_novma() should allow it. All other functions should
  return -EINVAL if this is set.

- walk_page_range_novma() should assert we're in the appropriate context
  where this is feasible.

- Comments should be updated accordingly.

- We should assert (at least CONFIG_DEBUG_VM asserts) in every place that
  checks for a VMA that we are not in this lock mode, since this is
  disallowed.

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  include/linux/pagewalk.h |  2 ++
>  mm/pagewalk.c            | 12 ++++++++----
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 9700a29f8afb..9bc8853ed3de 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -14,6 +14,8 @@ enum page_walk_lock {
>  	PGWALK_WRLOCK = 1,
>  	/* vma is expected to be already write-locked during the walk */
>  	PGWALK_WRLOCK_VERIFY = 2,
> +	/* no lock is needed */
> +	PGWALK_NOLOCK = 3,

I'd prefer something very explicitly documenting that, at the very least, this
can only be used for non-VMA cases.

It's hard to think of a name here, but the comment should be explicit as to
under what circumstances this is allowed.

>  };
>
>  /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index e478777c86e1..9657cf4664b2 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>  	case PGWALK_RDLOCK:
>  		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>  		break;
> +	default:
> +		break;

Please no 'default' here, we want to be explicit and cover all cases.

And surely, since you're explicitly only allowing this for non-VMA ranges, this
should be a WARN_ON_ONCE() or something?

>  	}
>  #endif
>  }
> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>  	 * specified address range from being freed. The caller should take
>  	 * other actions to prevent this race.
>  	 */

All functions other than this should explicitly disallow this locking mode
with -EINVAL checks. I do not want to see this locking mode made available
in a broken context.

The full comment:

	/*
	 * 1) For walking the user virtual address space:
	 *
	 * The mmap lock protects the page walker from changes to the page
	 * tables during the walk.  However a read lock is insufficient to
	 * protect those areas which don't have a VMA as munmap() detaches
	 * the VMAs before downgrading to a read lock and actually tearing
	 * down PTEs/page tables. In which case, the mmap write lock should
	 * be hold.
	 *
	 * 2) For walking the kernel virtual address space:
	 *
	 * The kernel intermediate page tables usually do not be freed, so
	 * the mmap map read lock is sufficient. But there are some exceptions.
	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
	 * to prevent the intermediate kernel pages tables belonging to the
	 * specified address range from being freed. The caller should take
	 * other actions to prevent this race.
	 */

Are you walking kernel memory only? Point 1 above explicitly points out why
userland novma memory requires a lock.

For point 2 you need to indicate why you don't need to consider hotplugging,
etc.

But as Ryan points out elsewhere, you should be expanding this comment to
explain your case...

You should also assert you're in a context where this applies and error
out/WARN if not.

> -	if (mm == &init_mm)
> -		mmap_assert_locked(walk.mm);
> -	else
> -		mmap_assert_write_locked(walk.mm);
> +	if (ops->walk_lock != PGWALK_NOLOCK) {

I really don't like the idea that you're allowing no lock for userland mappings.

This should at the very least be:

if (mm == &init_mm)  {
	if (ops->walk_lock != PGWALK_NOLOCK)
		mmap_assert_locked(walk.mm);
} else {
	mmap_assert_write_locked(walk.mm);
}

> +		if (mm == &init_mm)
> +			mmap_assert_locked(walk.mm);
> +		else
> +			mmap_assert_write_locked(walk.mm);
> +	}
>
>  	return walk_pgd_range(start, end, &walk);
>  }
> --
> 2.30.2
>

We have to be _really_ careful with this stuff. It's very fiddly and
brokenness can be a security issue.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 0/3] Enable huge-vmalloc permission change
  2025-05-30 10:51         ` Ryan Roberts
@ 2025-05-30 11:11           ` Dev Jain
  0 siblings, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-05-30 11:11 UTC (permalink / raw)
  To: Ryan Roberts, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel


On 30/05/25 4:21 pm, Ryan Roberts wrote:
> On 30/05/2025 11:42, Dev Jain wrote:
>> On 30/05/25 4:07 pm, Ryan Roberts wrote:
>>> On 30/05/2025 11:10, Dev Jain wrote:
>>>> On 30/05/25 3:33 pm, Ryan Roberts wrote:
>>>>> On 30/05/2025 10:04, Dev Jain wrote:
>>>>>> This series paves the path to enable huge mappings in vmalloc space by
>>>>>> default on arm64. > For this we must ensure that we can handle any permission
>>>>>> games on vmalloc space.
>>>>> And the linear map :)
>>>>>
>>>>>> Currently, __change_memory_common() uses
>>>>>> apply_to_page_range() which does not support changing permissions for
>>>>>> leaf mappings.
>>>>> nit: A "leaf mapping" is the lowest level entry in the page tables for a given
>>>>> address - i.e. it maps an address to some actual memory rather than to another
>>>>> pgtable. It includes what the Arm ARM calls "page mappings" (PTE level) and
>>>>> "block mappings" (PMD/PUD/.. level). apply_to_page_range() does support page
>>>>> mappings, so saying it doesn't support leaf mappings is incorrect. It doesn't
>>>>> support block mappings.
>>>> Sorry, again got confused by nomenclature : )
>>>>
>>>>>> We attempt to move away from this by using walk_page_range_novma(),
>>>>>> similar to what riscv does right now; however, it is the responsibility
>>>>>> of the caller to ensure that we do not pass a range, or split the range
>>>>>> covering a partial leaf mapping.
>>>>>>
>>>>>> This series is tied with Yang Shi's attempt [1] at using huge mappings
>>>>>> in the linear mapping in case the system supports BBML2, in which case
>>>>>> we will be able to split the linear mapping if needed without break-before-
>>>>>> make.
>>>>>> Thus, Yang's series, IIUC, will be one such user of my series; suppose we
>>>>>> are changing permissions on a range of the linear map backed by PMD-hugepages,
>>>>>> then the sequence of operations should look like the following:
>>>>>>
>>>>>> split_range(start, (start + HPAGE_PMD_SIZE) & ~HPAGE_PMD_MASK);
>>>>>> split_range(end & ~HPAGE_PMD_MASK, end);
>>>>> I don't understand what the HPAGE_PMD_MASK twiddling is doing? That's not
>>>>> right.
>>>>> It's going to give you the offset within the 2M region. You just want:
>>>>>
>>>>> split_range(start)
>>>>> split_range(end)
>>>>>
>>>>> right?
>>>> Suppose start = 2M + 4K, end = 8M + 5K. Then my sequence will compute to
>>> 8M + 5K is not a valid split point. It has to be at least page aligned.
>> Sorry, so consider 8M + 4K.
>>
>>>> split_range(2M + 4K, 3M)
>>>> split_range(8M, 8M + 5K)
>>> We just want to split at start and end. What are the 3M and 8M params supposed
>>> to be? Anyway, this is off-topic for this series.
>> I think we are both saying the same thing; yes we will split only the start and
>> the end,
>> so if the address 2Mb + 4Kb is mapped as a PMD-hugepage, we need to split this
>> PMD into
>> a PTE table, which will map 2Mb till 4Mb as base pages now.
> Not quite; if you start with [2M, 4M) mapped by PMD, then want to ensure a
> boundary at 2M+4K, I think you would end up with the first 64K (16 pages) mapped
> by base page, then the then the next 1984K mapped by contpte blocks (31 contpte
> blocks).
>
> But yes, we agree :)

Yes, I wasn't considering the contpte case : )

>
>>>> __change_memory_common(2M + 4K, 8M + 5K)
>>>>
>>>> So now __change_memory_common() wouldn't have to deal with splitting the
>>>> starts and ends. Please correct me if I am wrong.
>>>>
>>>>>> __change_memory_common(start, end);
>>>>>>
>>>>>> However, this series can be used independently of Yang's; since currently
>>>>>> permission games are being played only on pte mappings (due to
>>>>>> apply_to_page_range
>>>>>> not supporting otherwise), this series provides the mechanism for enabling
>>>>>> huge mappings for various kernel mappings like linear map and vmalloc.
>>>>> In other words, you are saying that this series is a prerequisite for Yang's
>>>>> series (and both are prerequisites for huge vmalloc by default). Your series
>>>>> adds a new capability that Yang's series will rely on (the ability to change
>>>>> permissions on block mappings).
>>>> That's right.
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>> [1] https://lore.kernel.org/all/20250304222018.615808-1-
>>>>>> yang@os.amperecomputing.com/
>>>>>>
>>>>>> Dev Jain (3):
>>>>>>      mm: Allow pagewalk without locks
>>>>>>      arm64: pageattr: Use walk_page_range_novma() to change memory
>>>>>>        permissions
>>>>>>      mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
>>>>>>
>>>>>>     arch/arm64/mm/pageattr.c | 81 +++++++++++++++++++++++++++++++++++++---
>>>>>>     include/linux/pagewalk.h |  4 ++
>>>>>>     mm/pagewalk.c            | 18 +++++++--
>>>>>>     3 files changed, 94 insertions(+), 9 deletions(-)
>>>>>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
  2025-05-30  9:04 ` [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64 Dev Jain
@ 2025-05-30 11:14   ` Lorenzo Stoakes
  2025-05-30 12:12     ` Ryan Roberts
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 11:14 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel

On Fri, May 30, 2025 at 02:34:07PM +0530, Dev Jain wrote:
> arm64 implements lazy_mmu_mode to allow deferral and batching of barriers
> when updating kernel PTEs, which provides a nice performance boost. arm64
> currently uses apply_to_page_range() to modify kernel PTE permissions,
> which runs inside lazy_mmu_mode. So to prevent a performance regression,
> let's add hooks to walk_page_range_novma() to allow continued use of
> lazy_mmu_mode.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> Credits to Ryan for the patch description.
>
>  arch/arm64/mm/pageattr.c | 12 ++++++++++++
>  include/linux/pagewalk.h |  2 ++
>  mm/pagewalk.c            |  6 ++++++
>  3 files changed, 20 insertions(+)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index a5c829c64969..9163324b12a0 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -75,11 +75,23 @@ static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>  	return 0;
>  }
>
> +static void pte_lazy_mmu_enter(void)
> +{
> +	arch_enter_lazy_mmu_mode();
> +}

Hm am I missing something? I don't see this function or the leave version
defined in arch/arm64?

No do I see __HAVE_ARCH_ENTER_LAZY_MMU_MODE?

> +
> +static void pte_lazy_mmu_leave(void)
> +{
> +	arch_leave_lazy_mmu_mode();
> +}

Are you absolutely sure you will never need to hook this stuff on higher level
page tables?

If this relates to vmalloc, then we do have huge page mappings in vmalloc logic?

> +
>  static const struct mm_walk_ops pageattr_ops = {
>  	.pud_entry	= pageattr_pud_entry,
>  	.pmd_entry	= pageattr_pmd_entry,
>  	.pte_entry	= pageattr_pte_entry,
>  	.walk_lock	= PGWALK_NOLOCK,
> +	.pre_pte_table	= pte_lazy_mmu_enter,
> +	.post_pte_table	= pte_lazy_mmu_leave,

This is kind of horrid really, are we sure the lazy mmu mode is valid for
everything that occurs within the the loop? I suppose it's only simple logic for
the ops->pte_entry stuff.

But it feels like hacking something in for this specific case.

At the same time I don't want to get in the way of an optimisation. We could do
something in ops->pmd_entry, but then we'd not get to turn it off afterwards...

Same for any higher level page table hm.

Is this really the only way to get this? I guess it's not feasible having this
just switched on for the whole operation...

I just fear that we could end up populating these mm_walk_ops with every corner
case thing we think of.

>  };
>
>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 9bc8853ed3de..2157d345974c 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -88,6 +88,8 @@ struct mm_walk_ops {
>  	int (*pre_vma)(unsigned long start, unsigned long end,
>  		       struct mm_walk *walk);
>  	void (*post_vma)(struct mm_walk *walk);
> +	void (*pre_pte_table)(void);
> +	void (*post_pte_table)(void);
>  	int (*install_pte)(unsigned long addr, unsigned long next,
>  			   pte_t *ptep, struct mm_walk *walk);
>  	enum page_walk_lock walk_lock;
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index 9657cf4664b2..a441f5cbbc45 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -33,6 +33,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>  	const struct mm_walk_ops *ops = walk->ops;
>  	int err = 0;
>
> +	if (walk->ops->pre_pte_table)
> +		walk->ops->pre_pte_table();

NIT: you have 'ops' already, no need for walk-> :)

> +
>  	for (;;) {
>  		if (ops->install_pte && pte_none(ptep_get(pte))) {
>  			pte_t new_pte;
> @@ -56,6 +59,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>  		addr += PAGE_SIZE;
>  		pte++;
>  	}
> +
> +	if (walk->ops->post_pte_table)
> +		walk->ops->post_pte_table();

NIT: same as above.

>  	return err;
>  }
>
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
  2025-05-30 11:14   ` Lorenzo Stoakes
@ 2025-05-30 12:12     ` Ryan Roberts
  2025-05-30 12:18       ` Lorenzo Stoakes
  0 siblings, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 12:12 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel

On 30/05/2025 12:14, Lorenzo Stoakes wrote:
> On Fri, May 30, 2025 at 02:34:07PM +0530, Dev Jain wrote:
>> arm64 implements lazy_mmu_mode to allow deferral and batching of barriers
>> when updating kernel PTEs, which provides a nice performance boost. arm64
>> currently uses apply_to_page_range() to modify kernel PTE permissions,
>> which runs inside lazy_mmu_mode. So to prevent a performance regression,
>> let's add hooks to walk_page_range_novma() to allow continued use of
>> lazy_mmu_mode.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> Credits to Ryan for the patch description.
>>
>>  arch/arm64/mm/pageattr.c | 12 ++++++++++++
>>  include/linux/pagewalk.h |  2 ++
>>  mm/pagewalk.c            |  6 ++++++
>>  3 files changed, 20 insertions(+)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index a5c829c64969..9163324b12a0 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -75,11 +75,23 @@ static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>>  	return 0;
>>  }
>>
>> +static void pte_lazy_mmu_enter(void)
>> +{
>> +	arch_enter_lazy_mmu_mode();
>> +}
> 
> Hm am I missing something? I don't see this function or the leave version
> defined in arch/arm64?
> 
> No do I see __HAVE_ARCH_ENTER_LAZY_MMU_MODE?

arm64 is starting to use lazy_mmu_mode from v6.16 - the changes are in Linus's tree.

> 
>> +
>> +static void pte_lazy_mmu_leave(void)
>> +{
>> +	arch_leave_lazy_mmu_mode();
>> +}
> 
> Are you absolutely sure you will never need to hook this stuff on higher level
> page tables?
> 
> If this relates to vmalloc, then we do have huge page mappings in vmalloc logic?

The performance advantage (for arm64's usage at least) really only (currently)
beneficial in practice to PTE level since we can reduce barriers by 512x. And
apply_to_page_range() was only using lazy mmu for the pte level anyway.

But actually I think we can do better here...

> 
>> +
>>  static const struct mm_walk_ops pageattr_ops = {
>>  	.pud_entry	= pageattr_pud_entry,
>>  	.pmd_entry	= pageattr_pmd_entry,
>>  	.pte_entry	= pageattr_pte_entry,
>>  	.walk_lock	= PGWALK_NOLOCK,
>> +	.pre_pte_table	= pte_lazy_mmu_enter,
>> +	.post_pte_table	= pte_lazy_mmu_leave,
> 
> This is kind of horrid really, are we sure the lazy mmu mode is valid for
> everything that occurs within the the loop? I suppose it's only simple logic for
> the ops->pte_entry stuff.
> 
> But it feels like hacking something in for this specific case.
> 
> At the same time I don't want to get in the way of an optimisation. We could do
> something in ops->pmd_entry, but then we'd not get to turn it off afterwards...
> 
> Same for any higher level page table hm.
> 
> Is this really the only way to get this? I guess it's not feasible having this
> just switched on for the whole operation...

...I think you're right. The only reason we traditionally confine the lazy mmu
mode to a single page table is because we want to enclose it within the PTL. But
that requirement doesn't stand for kernel mappings. As long as the walker can
guarrantee that it doesn't allocate any memory (because with certain debug
settings that can cause lazy mmu nesting) or try to sleep then I think we can
just bracket the whole walk_page_range_novma() call.

So I think we can avoid these new callbacks and just do:

arch_enter_lazy_mmu_mode()
walk_page_range_novma()
arch_leave_lazy_mmu_mode()

That will even give us the benefit of optimizing at PMD/PUD levels.


> 
> I just fear that we could end up populating these mm_walk_ops with every corner
> case thing we think of.
> 
>>  };
>>
>>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 9bc8853ed3de..2157d345974c 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -88,6 +88,8 @@ struct mm_walk_ops {
>>  	int (*pre_vma)(unsigned long start, unsigned long end,
>>  		       struct mm_walk *walk);
>>  	void (*post_vma)(struct mm_walk *walk);
>> +	void (*pre_pte_table)(void);
>> +	void (*post_pte_table)(void);

nit: If we did end up with this approach, I wonder if it's better to generalize
and call it pre_table() and post_table(), passing in a level param? In any case,
you'll at least want to pass the walk structure.

>>  	int (*install_pte)(unsigned long addr, unsigned long next,
>>  			   pte_t *ptep, struct mm_walk *walk);
>>  	enum page_walk_lock walk_lock;
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index 9657cf4664b2..a441f5cbbc45 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -33,6 +33,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>>  	const struct mm_walk_ops *ops = walk->ops;
>>  	int err = 0;
>>
>> +	if (walk->ops->pre_pte_table)
>> +		walk->ops->pre_pte_table();
> 
> NIT: you have 'ops' already, no need for walk-> :)
> 
>> +
>>  	for (;;) {
>>  		if (ops->install_pte && pte_none(ptep_get(pte))) {
>>  			pte_t new_pte;
>> @@ -56,6 +59,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
>>  		addr += PAGE_SIZE;
>>  		pte++;
>>  	}
>> +
>> +	if (walk->ops->post_pte_table)
>> +		walk->ops->post_pte_table();
> 
> NIT: same as above.
> 
>>  	return err;
>>  }
>>
>> --
>> 2.30.2
>>
> 
> 



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64
  2025-05-30 12:12     ` Ryan Roberts
@ 2025-05-30 12:18       ` Lorenzo Stoakes
  0 siblings, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-05-30 12:18 UTC (permalink / raw)
  To: Ryan Roberts
  Cc: Dev Jain, akpm, david, catalin.marinas, will, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, linux-mm, linux-kernel,
	suzuki.poulose, steven.price, gshan, linux-arm-kernel

On Fri, May 30, 2025 at 01:12:07PM +0100, Ryan Roberts wrote:
> On 30/05/2025 12:14, Lorenzo Stoakes wrote:
> > On Fri, May 30, 2025 at 02:34:07PM +0530, Dev Jain wrote:
> >> arm64 implements lazy_mmu_mode to allow deferral and batching of barriers
> >> when updating kernel PTEs, which provides a nice performance boost. arm64
> >> currently uses apply_to_page_range() to modify kernel PTE permissions,
> >> which runs inside lazy_mmu_mode. So to prevent a performance regression,
> >> let's add hooks to walk_page_range_novma() to allow continued use of
> >> lazy_mmu_mode.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@arm.com>
> >> ---
> >> Credits to Ryan for the patch description.
> >>
> >>  arch/arm64/mm/pageattr.c | 12 ++++++++++++
> >>  include/linux/pagewalk.h |  2 ++
> >>  mm/pagewalk.c            |  6 ++++++
> >>  3 files changed, 20 insertions(+)
> >>
> >> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> >> index a5c829c64969..9163324b12a0 100644
> >> --- a/arch/arm64/mm/pageattr.c
> >> +++ b/arch/arm64/mm/pageattr.c
> >> @@ -75,11 +75,23 @@ static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> >>  	return 0;
> >>  }
> >>
> >> +static void pte_lazy_mmu_enter(void)
> >> +{
> >> +	arch_enter_lazy_mmu_mode();
> >> +}
> >
> > Hm am I missing something? I don't see this function or the leave version
> > defined in arch/arm64?
> >
> > No do I see __HAVE_ARCH_ENTER_LAZY_MMU_MODE?
>
> arm64 is starting to use lazy_mmu_mode from v6.16 - the changes are in Linus's tree.

OK I did a pull and see it. Must be very recently merged there!

Dev - I suggest, when you rely on an upstream change that's going to land during
the merge window, to hold off on sending that series until after the merge
window.

Because if this lands in mm-new it will, currently, not actually do anything :)
and presumably this is the target.

Obviously this would be more egregious if you were relying upon something that
would result in a compile failure or similar, but it makes life easier if we
really make sure your dependencies are fulfilled.

>
> >
> >> +
> >> +static void pte_lazy_mmu_leave(void)
> >> +{
> >> +	arch_leave_lazy_mmu_mode();
> >> +}
> >
> > Are you absolutely sure you will never need to hook this stuff on higher level
> > page tables?
> >
> > If this relates to vmalloc, then we do have huge page mappings in vmalloc logic?
>
> The performance advantage (for arm64's usage at least) really only (currently)
> beneficial in practice to PTE level since we can reduce barriers by 512x. And
> apply_to_page_range() was only using lazy mmu for the pte level anyway.

Right.

>
> But actually I think we can do better here...

Oh...

>
> >
> >> +
> >>  static const struct mm_walk_ops pageattr_ops = {
> >>  	.pud_entry	= pageattr_pud_entry,
> >>  	.pmd_entry	= pageattr_pmd_entry,
> >>  	.pte_entry	= pageattr_pte_entry,
> >>  	.walk_lock	= PGWALK_NOLOCK,
> >> +	.pre_pte_table	= pte_lazy_mmu_enter,
> >> +	.post_pte_table	= pte_lazy_mmu_leave,
> >
> > This is kind of horrid really, are we sure the lazy mmu mode is valid for
> > everything that occurs within the the loop? I suppose it's only simple logic for
> > the ops->pte_entry stuff.
> >
> > But it feels like hacking something in for this specific case.
> >
> > At the same time I don't want to get in the way of an optimisation. We could do
> > something in ops->pmd_entry, but then we'd not get to turn it off afterwards...
> >
> > Same for any higher level page table hm.
> >
> > Is this really the only way to get this? I guess it's not feasible having this
> > just switched on for the whole operation...
>
> ...I think you're right. The only reason we traditionally confine the lazy mmu
> mode to a single page table is because we want to enclose it within the PTL. But
> that requirement doesn't stand for kernel mappings. As long as the walker can
> guarrantee that it doesn't allocate any memory (because with certain debug
> settings that can cause lazy mmu nesting) or try to sleep then I think we can
> just bracket the whole walk_page_range_novma() call.
>
> So I think we can avoid these new callbacks and just do:
>
> arch_enter_lazy_mmu_mode()
> walk_page_range_novma()
> arch_leave_lazy_mmu_mode()
>
> That will even give us the benefit of optimizing at PMD/PUD levels.

Yeah, this would avoid the issues I am concerned about!

>
>
> >
> > I just fear that we could end up populating these mm_walk_ops with every corner
> > case thing we think of.
> >
> >>  };
> >>
> >>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
> >> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> >> index 9bc8853ed3de..2157d345974c 100644
> >> --- a/include/linux/pagewalk.h
> >> +++ b/include/linux/pagewalk.h
> >> @@ -88,6 +88,8 @@ struct mm_walk_ops {
> >>  	int (*pre_vma)(unsigned long start, unsigned long end,
> >>  		       struct mm_walk *walk);
> >>  	void (*post_vma)(struct mm_walk *walk);
> >> +	void (*pre_pte_table)(void);
> >> +	void (*post_pte_table)(void);
>
> nit: If we did end up with this approach, I wonder if it's better to generalize
> and call it pre_table() and post_table(), passing in a level param? In any case,
> you'll at least want to pass the walk structure.

I thought the same thing, but wondered if we could just avoid doing this
altogether, which would be my preference.

And from what you say it does seem feasible!

>
> >>  	int (*install_pte)(unsigned long addr, unsigned long next,
> >>  			   pte_t *ptep, struct mm_walk *walk);
> >>  	enum page_walk_lock walk_lock;
> >> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> >> index 9657cf4664b2..a441f5cbbc45 100644
> >> --- a/mm/pagewalk.c
> >> +++ b/mm/pagewalk.c
> >> @@ -33,6 +33,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >>  	const struct mm_walk_ops *ops = walk->ops;
> >>  	int err = 0;
> >>
> >> +	if (walk->ops->pre_pte_table)
> >> +		walk->ops->pre_pte_table();
> >
> > NIT: you have 'ops' already, no need for walk-> :)
> >
> >> +
> >>  	for (;;) {
> >>  		if (ops->install_pte && pte_none(ptep_get(pte))) {
> >>  			pte_t new_pte;
> >> @@ -56,6 +59,9 @@ static int walk_pte_range_inner(pte_t *pte, unsigned long addr,
> >>  		addr += PAGE_SIZE;
> >>  		pte++;
> >>  	}
> >> +
> >> +	if (walk->ops->post_pte_table)
> >> +		walk->ops->post_pte_table();
> >
> > NIT: same as above.
> >
> >>  	return err;
> >>  }
> >>
> >> --
> >> 2.30.2
> >>
> >
> >
>

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
@ 2025-05-30 12:53   ` Ryan Roberts
  2025-06-02  4:35     ` Dev Jain
  2025-06-06  9:49   ` Lorenzo Stoakes
  1 sibling, 1 reply; 29+ messages in thread
From: Ryan Roberts @ 2025-05-30 12:53 UTC (permalink / raw)
  To: Dev Jain, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel

On 30/05/2025 10:04, Dev Jain wrote:
> Move away from apply_to_page_range(), which does not honour leaf mappings,
> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
> if a partial range is detected.

Perhaps:

"""
apply_to_page_range(), which was previously used to change permissions for
kernel mapped memory, can only operate on page mappings. In future we want to
support block mappings for more efficient TLB usage. Reimplement pageattr.c to
instead use walk_page_range_novma() to visit and modify leaf mappings of all sizes.

We only require that the start and end of a given range lie on leaf mapping
boundaries. If this is not the case, emit a warning and return -EINVAL.
"""


> 
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 64 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 39fd1f7ff02a..a5c829c64969 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
>  #include <linux/mem_encrypt.h>
>  #include <linux/sched.h>
>  #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>  
>  #include <asm/cacheflush.h>
>  #include <asm/pgtable-prot.h>
> @@ -20,6 +21,67 @@ struct page_change_data {
>  	pgprot_t clear_mask;
>  };
>  
> +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)

Please don't use unsigned long for raw ptes; This will break with D128 pgtables.

Anshuman had a patch in flight to introduce ptdesc_t as a generic/any level raw
value. It would be preferable to incorporate that patch and use it. pteval_t
isn't really correct because this is for any level and that implies pte level only.

> +{
> +	struct page_change_data *masks = walk->private;
> +	unsigned long new_val = val;

why do you need new_val? Why not just update and return val?

> +
> +	new_val &= ~(pgprot_val(masks->clear_mask));
> +	new_val |= (pgprot_val(masks->set_mask));
> +
> +	return new_val;
> +}

One potential pitfall of having a generic function that can change the
permissions for ptes at all levels is that bit 1 is defined differently for
level 3 vs the other levels. I don't think there should be any issues here in
practice having had a quick look at all the masks that users currently pass in
though.

> +
> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pud_t val = pudp_get(pud);
> +
> +	if (pud_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> +			return -EINVAL;
> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
> +		set_pud(pud, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pmd_t val = pmdp_get(pmd);
> +
> +	if (pmd_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> +			return -EINVAL;
> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> +		set_pmd(pmd, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pte_t val = ptep_get(pte);

BUG: Use __ptep_get(), which is "below" the contpte management layer. ptep_get()
will look at the contiguous bit and potentially decide to accumulate all the a/d
bits in the block which is not relavent for kernel mappings.

> +
> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
> +	set_pte(pte, val);

BUG: Use __set_pte(). Same reasoning as above. But this is more harmful because
set_pte() will try to detect contpte blocks and may zero/flush the entries.
Which would be very bad for kernel mappings.

> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops pageattr_ops = {
> +	.pud_entry	= pageattr_pud_entry,
> +	.pmd_entry	= pageattr_pmd_entry,
> +	.pte_entry	= pageattr_pte_entry,

Is there a reason why you don't have pgd and p4d entries? I think there are
configs where the pgd may contain leaf mappings. Possibly 64K/42-bit, which will
have 2 levels and I think they will be pgd and pte. So I think you'd better
implement all levels to be correct.

> +	.walk_lock	= PGWALK_NOLOCK,
> +};
> +
>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>  
>  bool can_set_direct_map(void)
> @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>  	return 0;
>  }
>  
> -/*
> - * This function assumes that the range is mapped with PAGE_SIZE pages.
> - */
>  static int __change_memory_common(unsigned long start, unsigned long size,
>  				pgprot_t set_mask, pgprot_t clear_mask)
>  {
> @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>  	data.set_mask = set_mask;
>  	data.clear_mask = clear_mask;
>  
> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> -					&data);
> +	ret = walk_page_range_novma(&init_mm, start, start + size,
> +				    &pageattr_ops, NULL, &data);
>  
>  	/*
>  	 * If the memory is being made valid without changing any other bits

I notice that set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() don't use __change_memory_common() but instead
call apply_to_page_range() direct. (presumably because they don't want the
associated tlb flush). Is there a reason not to update these callers too?

Perhaps it would be cleaner to wrap in ___change_memory_common (3 leading
underscores) which does everything except the flush).

Thanks,
Ryan



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-05-30 10:33   ` Ryan Roberts
@ 2025-05-30 21:33     ` Yang Shi
  0 siblings, 0 replies; 29+ messages in thread
From: Yang Shi @ 2025-05-30 21:33 UTC (permalink / raw)
  To: Ryan Roberts, Dev Jain, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel



On 5/30/25 3:33 AM, Ryan Roberts wrote:
> On 30/05/2025 10:04, Dev Jain wrote:
>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
>> locks.
> It looks like riscv solved this problem by moving from walk_page_range_novma()
> to apply_to_existing_page_range() in Commit fb1cf0878328 ("riscv: rewrite
> __kernel_map_pages() to fix sleeping in invalid context"). That won't work for
> us because the whole point is that we want to support changing permissions on
> block mappings.
>
> Yang:
>
> Not directly relavent to this patch, but I do worry about the potential need to
> split the range here though once Yang's series comes in - we would need to
> allocate memory for pgtables atomically in softirq context. KFENCE is intended
> to be enabled in production IIRC, so we can't just not allow block mapping when
> KFENCE is enabled and will likely need to think of a solution for this?

IIRC kfence typically uses a dedicated pool by default (if sample 
interval is not 0 on arm64), the pool is separate from linear mapping 
and it is mapped at PTE level. This should be the preferred way for 
production environment.

But if the pool is not used, typically 0 sample interval, we have to 
have the whole linear mapping mapped at PTE level always. I don't see a 
simple solution for it.

Thanks,
Yang

>
>
>> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pagewalk.h |  2 ++
>>   mm/pagewalk.c            | 12 ++++++++----
>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 9700a29f8afb..9bc8853ed3de 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>   	PGWALK_WRLOCK = 1,
>>   	/* vma is expected to be already write-locked during the walk */
>>   	PGWALK_WRLOCK_VERIFY = 2,
>> +	/* no lock is needed */
>> +	PGWALK_NOLOCK = 3,
> I'd imagine you either want to explicitly forbid this option for the other
> entrypoints (i.e. the non- _novma variants) or you need to be able to handle
> this option being passed in to the other functions, which you currently don't
> do. I'd vote for explcitly disallowing (document it and return error if passed in).
>
>>   };
>>   
>>   /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index e478777c86e1..9657cf4664b2 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>   	case PGWALK_RDLOCK:
>>   		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>   		break;
>> +	default:
>> +		break;
>>   	}
>>   #endif
>>   }
>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>   	 * specified address range from being freed. The caller should take
>>   	 * other actions to prevent this race.
>>   	 */
>> -	if (mm == &init_mm)
>> -		mmap_assert_locked(walk.mm);
> Given apply_to_page_range() doesn't do any locking for kernel pgtable walking, I
> can be convinced that it's also not required for our case using this framework.
> But why does this framework believe it is necessary? Should the comment above
> this at least be updated?
>
> Thanks,
> Ryan
>
>> -	else
>> -		mmap_assert_write_locked(walk.mm);
>> +	if (ops->walk_lock != PGWALK_NOLOCK) {
>> +		if (mm == &init_mm)
>> +			mmap_assert_locked(walk.mm);
>> +		else
>> +			mmap_assert_write_locked(walk.mm);
>> +	}
>>   
>>   	return walk_pgd_range(start, end, &walk);
>>   }



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-05-30 12:53   ` Ryan Roberts
@ 2025-06-02  4:35     ` Dev Jain
  0 siblings, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-06-02  4:35 UTC (permalink / raw)
  To: Ryan Roberts, akpm, david, catalin.marinas, will
  Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
	linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
	linux-arm-kernel


On 30/05/25 6:23 pm, Ryan Roberts wrote:
> On 30/05/2025 10:04, Dev Jain wrote:
>> Move away from apply_to_page_range(), which does not honour leaf mappings,
>> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
>> if a partial range is detected.
> Perhaps:
>
> """
> apply_to_page_range(), which was previously used to change permissions for
> kernel mapped memory, can only operate on page mappings. In future we want to
> support block mappings for more efficient TLB usage. Reimplement pageattr.c to
> instead use walk_page_range_novma() to visit and modify leaf mappings of all sizes.
>
> We only require that the start and end of a given range lie on leaf mapping
> boundaries. If this is not the case, emit a warning and return -EINVAL.
> """

Thanks.

>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 64 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 39fd1f7ff02a..a5c829c64969 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -8,6 +8,7 @@
>>   #include <linux/mem_encrypt.h>
>>   #include <linux/sched.h>
>>   #include <linux/vmalloc.h>
>> +#include <linux/pagewalk.h>
>>   
>>   #include <asm/cacheflush.h>
>>   #include <asm/pgtable-prot.h>
>> @@ -20,6 +21,67 @@ struct page_change_data {
>>   	pgprot_t clear_mask;
>>   };
>>   
>> +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
> Please don't use unsigned long for raw ptes; This will break with D128 pgtables.
>
> Anshuman had a patch in flight to introduce ptdesc_t as a generic/any level raw
> value. It would be preferable to incorporate that patch and use it. pteval_t
> isn't really correct because this is for any level and that implies pte level only.

Okay.

>
>> +{
>> +	struct page_change_data *masks = walk->private;
>> +	unsigned long new_val = val;
> why do you need new_val? Why not just update and return val?

Yes, shameless copying from riscv and loongarch : )

>
>> +
>> +	new_val &= ~(pgprot_val(masks->clear_mask));
>> +	new_val |= (pgprot_val(masks->set_mask));
>> +
>> +	return new_val;
>> +}
> One potential pitfall of having a generic function that can change the
> permissions for ptes at all levels is that bit 1 is defined differently for
> level 3 vs the other levels. I don't think there should be any issues here in
> practice having had a quick look at all the masks that users currently pass in
> though.
>
>> +
>> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pud_t val = pudp_get(pud);
>> +
>> +	if (pud_leaf(val)) {
>> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
>> +			return -EINVAL;
>> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
>> +		set_pud(pud, val);
>> +		walk->action = ACTION_CONTINUE;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pmd_t val = pmdp_get(pmd);
>> +
>> +	if (pmd_leaf(val)) {
>> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
>> +			return -EINVAL;
>> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
>> +		set_pmd(pmd, val);
>> +		walk->action = ACTION_CONTINUE;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pte_t val = ptep_get(pte);
> BUG: Use __ptep_get(), which is "below" the contpte management layer. ptep_get()
> will look at the contiguous bit and potentially decide to accumulate all the a/d
> bits in the block which is not relavent for kernel mappings.

Thanks.

>
>> +
>> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
>> +	set_pte(pte, val);
> BUG: Use __set_pte(). Same reasoning as above. But this is more harmful because
> set_pte() will try to detect contpte blocks and may zero/flush the entries.
> Which would be very bad for kernel mappings.

Thanks.

>
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct mm_walk_ops pageattr_ops = {
>> +	.pud_entry	= pageattr_pud_entry,
>> +	.pmd_entry	= pageattr_pmd_entry,
>> +	.pte_entry	= pageattr_pte_entry,
> Is there a reason why you don't have pgd and p4d entries? I think there are
> configs where the pgd may contain leaf mappings. Possibly 64K/42-bit, which will
> have 2 levels and I think they will be pgd and pte. So I think you'd better
> implement all levels to be correct.

Okay.

>
>> +	.walk_lock	= PGWALK_NOLOCK,
>> +};
>> +
>>   bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>>   
>>   bool can_set_direct_map(void)
>> @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>>   	return 0;
>>   }
>>   
>> -/*
>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>> - */
>>   static int __change_memory_common(unsigned long start, unsigned long size,
>>   				pgprot_t set_mask, pgprot_t clear_mask)
>>   {
>> @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>>   	data.set_mask = set_mask;
>>   	data.clear_mask = clear_mask;
>>   
>> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>> -					&data);
>> +	ret = walk_page_range_novma(&init_mm, start, start + size,
>> +				    &pageattr_ops, NULL, &data);
>>   
>>   	/*
>>   	 * If the memory is being made valid without changing any other bits
> I notice that set_direct_map_invalid_noflush() and
> set_direct_map_default_noflush() don't use __change_memory_common() but instead
> call apply_to_page_range() direct. (presumably because they don't want the
> associated tlb flush). Is there a reason not to update these callers too?
>
> Perhaps it would be cleaner to wrap in ___change_memory_common (3 leading
> underscores) which does everything except the flush).

Makes sense, I think Yang's series will need this to handle block mappings
for linear map.

>
> Thanks,
> Ryan
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-05-30 10:57   ` Lorenzo Stoakes
@ 2025-06-06  9:21     ` Dev Jain
  2025-06-06  9:33       ` Dev Jain
  2025-06-06 10:02       ` Lorenzo Stoakes
  0 siblings, 2 replies; 29+ messages in thread
From: Dev Jain @ 2025-06-06  9:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel, Jann Horn, Yang Shi,
	Ryan Roberts


On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
> +cc Jan for page table stuff.
>
> On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
>> locks.
> Hm This is worrying.
>
> You're unconditionally making it possible for dangerous usage here - to
> walk page tables without a lock.
>
> We need to assert this is only being used in a context where this makes
> sense, e.g. a no VMA range under the right circumstances.
>
> At the very least we need asserts that we are in a circumstance where this
> is permitted.
>
> For VMAs, you must keep the VMA stable, which requires a VMA read lock at
> minimum.
>
> See
> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
> for details where these requirements are documented.
>
> I also think we should update this documentation to cover off this non-VMA
> task context stuff. I can perhaps do this so as not to egregiously add
> workload to this series :)
>
> Also, again this commit message is not enough for such a major change to
> core mm stuff. I think you need to underline that - in non-task context -
> you are safe to manipulate _kernel_ mappings, having precluded KFENCE as a
> concern.

Sorry for late reply, after your comments I had to really go and understand
kernel pagetable walking properly by reading your process_addrs documentation
and reading the code, so that I could prepare an answer and improve my
understanding, thanks for your review!

How does the below comment above PGWALK_NOLOCK look?

"Walk without any lock. Use of this is only meant for the
  case where there is no underlying VMA, and the user has
  exclusive control over the range, guaranteeing no concurrent
  access. For example, changing permissions of vmalloc objects."

and the patch description can be modified as
"
It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
This being a non-sleepable context, we cannot take the init_mm mmap lock.
Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
locks.
Currently, apply_to_page_range is being used by __change_memory_common()
to change permissions over a range of vmalloc space, without any locking.
Patch 2 in this series shifts to the usage of walk_page_range_novma(), hence
this patch is needed. We do not need any locks because the vmalloc object
has exclusive access to the range, i.e two vmalloc objects do not share
the same physical address.
"


>
>> [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
> Basically expand upon this information.
>
> Basically the commit message refers to your usage, but describes a patch
> that makes it possible to do unlocked page table walks.
>
> As I get into below, no pun intended, but this needs to be _locked down_
> heavily.
>
> - Only walk_page_range_novma() should allow it. All other functions should
>    return -EINVAL if this is set.

Sure.

>
> - walk_page_range_novma() should assert we're in the appropriate context
>    where this is feasible.

There should be two conditions: that the mm is init_mm, and the start address
belongs to the vmalloc (or module) space. I am a little nervous about the second. On searching
throughout the codebase, I could find only vmalloc and module addresses getting
modified through set_memory_* API, but I couldn't prove that all such usages
are being done on vmalloc/module addresses.

>
> - Comments should be updated accordingly.
>
> - We should assert (at least CONFIG_DEBUG_VM asserts) in every place that
>    checks for a VMA that we are not in this lock mode, since this is
>    disallowed.
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   include/linux/pagewalk.h |  2 ++
>>   mm/pagewalk.c            | 12 ++++++++----
>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 9700a29f8afb..9bc8853ed3de 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>   	PGWALK_WRLOCK = 1,
>>   	/* vma is expected to be already write-locked during the walk */
>>   	PGWALK_WRLOCK_VERIFY = 2,
>> +	/* no lock is needed */
>> +	PGWALK_NOLOCK = 3,
> I'd prefer something very explicitly documenting that, at the very least, this
> can only be used for non-VMA cases.
>
> It's hard to think of a name here, but the comment should be explicit as to
> under what circumstances this is allowed.
>
>>   };
>>
>>   /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index e478777c86e1..9657cf4664b2 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>   	case PGWALK_RDLOCK:
>>   		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>   		break;
>> +	default:
>> +		break;
> Please no 'default' here, we want to be explicit and cover all cases.

Sure.

>
> And surely, since you're explicitly only allowing this for non-VMA ranges, this
> should be a WARN_ON_ONCE() or something?

Sounds good, maybe a WARN_ON_ONCE(vma)?

>
>>   	}
>>   #endif
>>   }
>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
>>   	 * specified address range from being freed. The caller should take
>>   	 * other actions to prevent this race.
>>   	 */
> All functions other than this should explicitly disallow this locking mode
> with -EINVAL checks. I do not want to see this locking mode made available
> in a broken context.
>
> The full comment:
>
> 	/*
> 	 * 1) For walking the user virtual address space:
> 	 *
> 	 * The mmap lock protects the page walker from changes to the page
> 	 * tables during the walk.  However a read lock is insufficient to
> 	 * protect those areas which don't have a VMA as munmap() detaches
> 	 * the VMAs before downgrading to a read lock and actually tearing
> 	 * down PTEs/page tables. In which case, the mmap write lock should
> 	 * be hold.
> 	 *
> 	 * 2) For walking the kernel virtual address space:
> 	 *
> 	 * The kernel intermediate page tables usually do not be freed, so
> 	 * the mmap map read lock is sufficient. But there are some exceptions.
> 	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> 	 * to prevent the intermediate kernel pages tables belonging to the
> 	 * specified address range from being freed. The caller should take
> 	 * other actions to prevent this race.
> 	 */
>
> Are you walking kernel memory only? Point 1 above explicitly points out why
> userland novma memory requires a lock.
>
> For point 2 you need to indicate why you don't need to consider hotplugging,

Well, hotunplugging will first offline the physical memory, and since the
vmalloc object has the reference to the pages, there is no race.

> etc.
>
> But as Ryan points out elsewhere, you should be expanding this comment to
> explain your case...
>
> You should also assert you're in a context where this applies and error
> out/WARN if not.
>
>> -	if (mm == &init_mm)
>> -		mmap_assert_locked(walk.mm);
>> -	else
>> -		mmap_assert_write_locked(walk.mm);
>> +	if (ops->walk_lock != PGWALK_NOLOCK) {
> I really don't like the idea that you're allowing no lock for userland mappings.
>
> This should at the very least be:
>
> if (mm == &init_mm)  {
> 	if (ops->walk_lock != PGWALK_NOLOCK)
> 		mmap_assert_locked(walk.mm);
> } else {
> 	mmap_assert_write_locked(walk.mm);
> }

Sure.

>
>> +		if (mm == &init_mm)
>> +			mmap_assert_locked(walk.mm);
>> +		else
>> +			mmap_assert_write_locked(walk.mm);
>> +	}
>>
>>   	return walk_pgd_range(start, end, &walk);
>>   }
>> --
>> 2.30.2
>>
> We have to be _really_ careful with this stuff. It's very fiddly and
> brokenness can be a security issue.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-06-06  9:21     ` Dev Jain
@ 2025-06-06  9:33       ` Dev Jain
  2025-06-06 10:02       ` Lorenzo Stoakes
  1 sibling, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-06-06  9:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel, Jann Horn, Yang Shi,
	Ryan Roberts


On 06/06/25 2:51 pm, Dev Jain wrote:
>
> On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
>> +cc Jan for page table stuff.
>>
>> On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
>>> It is noted at [1] that KFENCE can manipulate kernel pgtable entries 
>>> during
>>> softirqs. It does this by calling set_memory_valid() -> 
>>> __change_memory_common().
>>> This being a non-sleepable context, we cannot take the init_mm mmap 
>>> lock.
>>> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage 
>>> without
>>> locks.
>> Hm This is worrying.
>>
>> You're unconditionally making it possible for dangerous usage here - to
>> walk page tables without a lock.
>>
>> We need to assert this is only being used in a context where this makes
>> sense, e.g. a no VMA range under the right circumstances.
>>
>> At the very least we need asserts that we are in a circumstance where 
>> this
>> is permitted.
>>
>> For VMAs, you must keep the VMA stable, which requires a VMA read 
>> lock at
>> minimum.
>>
>> See
>> https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables 
>>
>> for details where these requirements are documented.
>>
>> I also think we should update this documentation to cover off this 
>> non-VMA
>> task context stuff. I can perhaps do this so as not to egregiously add
>> workload to this series :)
>>
>> Also, again this commit message is not enough for such a major change to
>> core mm stuff. I think you need to underline that - in non-task 
>> context -
>> you are safe to manipulate _kernel_ mappings, having precluded KFENCE 
>> as a
>> concern.
>
> Sorry for late reply, after your comments I had to really go and 
> understand
> kernel pagetable walking properly by reading your process_addrs 
> documentation
> and reading the code, so that I could prepare an answer and improve my
> understanding, thanks for your review!
>
> How does the below comment above PGWALK_NOLOCK look?
>
> "Walk without any lock. Use of this is only meant for the
>  case where there is no underlying VMA, and the user has
>  exclusive control over the range, guaranteeing no concurrent
>  access. For example, changing permissions of vmalloc objects."
>
> and the patch description can be modified as
> "
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries 
> during
> softirqs. It does this by calling set_memory_valid() -> 
> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage 
> without
> locks.
> Currently, apply_to_page_range is being used by __change_memory_common()
> to change permissions over a range of vmalloc space, without any locking.
> Patch 2 in this series shifts to the usage of walk_page_range_novma(), 
> hence
> this patch is needed. We do not need any locks because the vmalloc object
> has exclusive access to the range, i.e two vmalloc objects do not share
> the same physical address.
> "
>
>
>>
>>> [1] 
>>> https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
>> Basically expand upon this information.
>>
>> Basically the commit message refers to your usage, but describes a patch
>> that makes it possible to do unlocked page table walks.
>>
>> As I get into below, no pun intended, but this needs to be _locked down_
>> heavily.
>>
>> - Only walk_page_range_novma() should allow it. All other functions 
>> should
>>    return -EINVAL if this is set.
>
> Sure.
>
>>
>> - walk_page_range_novma() should assert we're in the appropriate context
>>    where this is feasible.
>
> There should be two conditions: that the mm is init_mm, and the start 
> address
> belongs to the vmalloc (or module) space. I am a little nervous about 
> the second. On searching
> throughout the codebase, I could find only vmalloc and module 
> addresses getting
> modified through set_memory_* API, but I couldn't prove that all such 
> usages
> are being done on vmalloc/module addresses.

Sorry, please ignore the bit about the second point, I confused it with 
some other

issue in the past. Now set_direct_map_invalid_noflush() will also use 
__change_memory_common ->

walk_page_range_novma, and previously too it was doing it locklessly. So 
let us assert only for

mm == init_mm.


>
>>
>> - Comments should be updated accordingly.
>>
>> - We should assert (at least CONFIG_DEBUG_VM asserts) in every place 
>> that
>>    checks for a VMA that we are not in this lock mode, since this is
>>    disallowed.
>>
>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>> ---
>>>   include/linux/pagewalk.h |  2 ++
>>>   mm/pagewalk.c            | 12 ++++++++----
>>>   2 files changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>>> index 9700a29f8afb..9bc8853ed3de 100644
>>> --- a/include/linux/pagewalk.h
>>> +++ b/include/linux/pagewalk.h
>>> @@ -14,6 +14,8 @@ enum page_walk_lock {
>>>       PGWALK_WRLOCK = 1,
>>>       /* vma is expected to be already write-locked during the walk */
>>>       PGWALK_WRLOCK_VERIFY = 2,
>>> +    /* no lock is needed */
>>> +    PGWALK_NOLOCK = 3,
>> I'd prefer something very explicitly documenting that, at the very 
>> least, this
>> can only be used for non-VMA cases.
>>
>> It's hard to think of a name here, but the comment should be explicit 
>> as to
>> under what circumstances this is allowed.
>>
>>>   };
>>>
>>>   /**
>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>> index e478777c86e1..9657cf4664b2 100644
>>> --- a/mm/pagewalk.c
>>> +++ b/mm/pagewalk.c
>>> @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct 
>>> vm_area_struct *vma,
>>>       case PGWALK_RDLOCK:
>>>           /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>>           break;
>>> +    default:
>>> +        break;
>> Please no 'default' here, we want to be explicit and cover all cases.
>
> Sure.
>
>>
>> And surely, since you're explicitly only allowing this for non-VMA 
>> ranges, this
>> should be a WARN_ON_ONCE() or something?
>
> Sounds good, maybe a WARN_ON_ONCE(vma)?
>
>>
>>>       }
>>>   #endif
>>>   }
>>> @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct 
>>> *mm, unsigned long start,
>>>        * specified address range from being freed. The caller should 
>>> take
>>>        * other actions to prevent this race.
>>>        */
>> All functions other than this should explicitly disallow this locking 
>> mode
>> with -EINVAL checks. I do not want to see this locking mode made 
>> available
>> in a broken context.
>>
>> The full comment:
>>
>>     /*
>>      * 1) For walking the user virtual address space:
>>      *
>>      * The mmap lock protects the page walker from changes to the page
>>      * tables during the walk.  However a read lock is insufficient to
>>      * protect those areas which don't have a VMA as munmap() detaches
>>      * the VMAs before downgrading to a read lock and actually tearing
>>      * down PTEs/page tables. In which case, the mmap write lock should
>>      * be hold.
>>      *
>>      * 2) For walking the kernel virtual address space:
>>      *
>>      * The kernel intermediate page tables usually do not be freed, so
>>      * the mmap map read lock is sufficient. But there are some 
>> exceptions.
>>      * E.g. memory hot-remove. In which case, the mmap lock is 
>> insufficient
>>      * to prevent the intermediate kernel pages tables belonging to the
>>      * specified address range from being freed. The caller should take
>>      * other actions to prevent this race.
>>      */
>>
>> Are you walking kernel memory only? Point 1 above explicitly points 
>> out why
>> userland novma memory requires a lock.
>>
>> For point 2 you need to indicate why you don't need to consider 
>> hotplugging,
>
> Well, hotunplugging will first offline the physical memory, and since the
> vmalloc object has the reference to the pages, there is no race.
>
>> etc.
>>
>> But as Ryan points out elsewhere, you should be expanding this 
>> comment to
>> explain your case...
>>
>> You should also assert you're in a context where this applies and error
>> out/WARN if not.
>>
>>> -    if (mm == &init_mm)
>>> -        mmap_assert_locked(walk.mm);
>>> -    else
>>> -        mmap_assert_write_locked(walk.mm);
>>> +    if (ops->walk_lock != PGWALK_NOLOCK) {
>> I really don't like the idea that you're allowing no lock for 
>> userland mappings.
>>
>> This should at the very least be:
>>
>> if (mm == &init_mm)  {
>>     if (ops->walk_lock != PGWALK_NOLOCK)
>>         mmap_assert_locked(walk.mm);
>> } else {
>>     mmap_assert_write_locked(walk.mm);
>> }
>
> Sure.
>
>>
>>> +        if (mm == &init_mm)
>>> +            mmap_assert_locked(walk.mm);
>>> +        else
>>> +            mmap_assert_write_locked(walk.mm);
>>> +    }
>>>
>>>       return walk_pgd_range(start, end, &walk);
>>>   }
>>> -- 
>>> 2.30.2
>>>
>> We have to be _really_ careful with this stuff. It's very fiddly and
>> brokenness can be a security issue.
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
  2025-05-30 12:53   ` Ryan Roberts
@ 2025-06-06  9:49   ` Lorenzo Stoakes
  2025-06-06 10:39     ` Dev Jain
  2025-06-09  9:41     ` Dev Jain
  1 sibling, 2 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-06  9:49 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel

On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
> Move away from apply_to_page_range(), which does not honour leaf mappings,
> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
> if a partial range is detected.

Hm a follow up question here - why not just improve apply_to_page_range() to
honour leaf mappings?

What does honouring leaf mappings actually mean? You mean handling huge pages?

Would it be all that difficult to implement?

It seems like you're pushing a bunch of the 'applying' logic over from there to
a walker that isn't maybe best suited to it and having to introduce an iffy new
form of locking...

Can we go vice-versa? :)

Also obviously walk_page_range_novma() doesn't exist any more :P
walk_kernel_page_table_range() is the preferred solution.

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
>  1 file changed, 64 insertions(+), 5 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 39fd1f7ff02a..a5c829c64969 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
>  #include <linux/mem_encrypt.h>
>  #include <linux/sched.h>
>  #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>
>  #include <asm/cacheflush.h>
>  #include <asm/pgtable-prot.h>
> @@ -20,6 +21,67 @@ struct page_change_data {
>  	pgprot_t clear_mask;
>  };
>
> +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
> +{
> +	struct page_change_data *masks = walk->private;
> +	unsigned long new_val = val;
> +
> +	new_val &= ~(pgprot_val(masks->clear_mask));
> +	new_val |= (pgprot_val(masks->set_mask));
> +
> +	return new_val;
> +}
> +
> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pud_t val = pudp_get(pud);
> +
> +	if (pud_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> +			return -EINVAL;
> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
> +		set_pud(pud, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pmd_t val = pmdp_get(pmd);
> +
> +	if (pmd_leaf(val)) {
> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> +			return -EINVAL;
> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> +		set_pmd(pmd, val);
> +		walk->action = ACTION_CONTINUE;
> +	}
> +
> +	return 0;
> +}
> +
> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> +			      unsigned long next, struct mm_walk *walk)
> +{
> +	pte_t val = ptep_get(pte);
> +
> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
> +	set_pte(pte, val);
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops pageattr_ops = {
> +	.pud_entry	= pageattr_pud_entry,
> +	.pmd_entry	= pageattr_pmd_entry,
> +	.pte_entry	= pageattr_pte_entry,
> +	.walk_lock	= PGWALK_NOLOCK,
> +};
> +
>  bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>
>  bool can_set_direct_map(void)
> @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>  	return 0;
>  }
>
> -/*
> - * This function assumes that the range is mapped with PAGE_SIZE pages.
> - */
>  static int __change_memory_common(unsigned long start, unsigned long size,
>  				pgprot_t set_mask, pgprot_t clear_mask)
>  {
> @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>  	data.set_mask = set_mask;
>  	data.clear_mask = clear_mask;
>
> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> -					&data);
> +	ret = walk_page_range_novma(&init_mm, start, start + size,
> +				    &pageattr_ops, NULL, &data);
>
>  	/*
>  	 * If the memory is being made valid without changing any other bits
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/3] mm: Allow pagewalk without locks
  2025-06-06  9:21     ` Dev Jain
  2025-06-06  9:33       ` Dev Jain
@ 2025-06-06 10:02       ` Lorenzo Stoakes
  1 sibling, 0 replies; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-06 10:02 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel, Jann Horn, Yang Shi,
	Ryan Roberts

(One thing to note here is that I have refactored this walk_page_range_novma()
function so the kernel equivalent is now walk_kernel_page_table_range().)

Overall, I'm questioning doing this in the walker code at all.

The use of mmap locks for kernel mappings is essentially a convention among
callers, it's not something that is actually required for kernel page table
mappings.

And now we're introducing a new mode where we say 'ok that convention is
out the window, we won't assert anything'.

Not sure I want to add yet another way you can use the kernel walker code here,
not at least until I can unravel the mess of why we're even using these locks at
all.

Strikes me that init_mm.mmap_lock is just being used as a convenient mutual
exclusion for all other callers.

So as per my (new :P) comment to 2/3, maybe we should just fix apply_to_range()
no?

David - any thoughts here?

On Fri, Jun 06, 2025 at 02:51:48PM +0530, Dev Jain wrote:
>
> On 30/05/25 4:27 pm, Lorenzo Stoakes wrote:
> > +cc Jan for page table stuff.
> >
> > On Fri, May 30, 2025 at 02:34:05PM +0530, Dev Jain wrote:
> > > It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> > > softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> > > This being a non-sleepable context, we cannot take the init_mm mmap lock.
> > > Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
> > > locks.
> > Hm This is worrying.
> >
> > You're unconditionally making it possible for dangerous usage here - to
> > walk page tables without a lock.
> >
> > We need to assert this is only being used in a context where this makes
> > sense, e.g. a no VMA range under the right circumstances.
> >
> > At the very least we need asserts that we are in a circumstance where this
> > is permitted.
> >
> > For VMAs, you must keep the VMA stable, which requires a VMA read lock at
> > minimum.
> >
> > See
> > https://origin.kernel.org/doc/html/latest/mm/process_addrs.html#page-tables
> > for details where these requirements are documented.
> >
> > I also think we should update this documentation to cover off this non-VMA
> > task context stuff. I can perhaps do this so as not to egregiously add
> > workload to this series :)
> >
> > Also, again this commit message is not enough for such a major change to
> > core mm stuff. I think you need to underline that - in non-task context -
> > you are safe to manipulate _kernel_ mappings, having precluded KFENCE as a
> > concern.
>
> Sorry for late reply, after your comments I had to really go and understand
> kernel pagetable walking properly by reading your process_addrs documentation
> and reading the code, so that I could prepare an answer and improve my
> understanding, thanks for your review!

Of course, that's fine, it's all super confusing, and continues to be so... :)

>
> How does the below comment above PGWALK_NOLOCK look?
>
> "Walk without any lock. Use of this is only meant for the
>  case where there is no underlying VMA, and the user has
>  exclusive control over the range, guaranteeing no concurrent
>  access. For example, changing permissions of vmalloc objects."
>

OK so now I think I understand better... this seems to wholly be about unwinding
the convention in this walker code that an mmap lock be taken on init_mm because
you have a context where that doesn't work.

Yeah, even more inclined to say no to this now.

Or if we absolutely cannot do this in apply_to_range(), then we should
explicitly have a function for this, like:

/*
 * Does not assert any locks have been taken. You must absolutely be certain
 * that appropriate locks are held.
 */
int walk_kernel_page_table_range_unlocked(unsigned long start, unsigned long end,
		const struct mm_walk_ops *ops, pgd_t *pgd, void *private);

An alternative would be to have a new parameter that specifies locking to
walk_kerenl_page_table_range(), but I'd rather not have to churn up all the
callers yet again :)

> and the patch description can be modified as
> "
> It is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
> Therefore, add PGWALK_NOLOCK to enable walk_page_range_novma() usage without
> locks.
> Currently, apply_to_page_range is being used by __change_memory_common()
> to change permissions over a range of vmalloc space, without any locking.
> Patch 2 in this series shifts to the usage of walk_page_range_novma(), hence
> this patch is needed. We do not need any locks because the vmalloc object
> has exclusive access to the range, i.e two vmalloc objects do not share
> the same physical address.
> "

Thanks for expanding, but this sort of dives into the KFENCE thing without
explaining why or the context or what it relates to. It's like diving into the
ocean to look for an oyster but never mentioning the oyster and only talking
about your wet suit :P

>
>
> >
> > > [1] https://lore.kernel.org/linux-arm-kernel/89d0ad18-4772-4d8f-ae8a-7c48d26a927e@arm.com/
> > Basically expand upon this information.
> >
> > Basically the commit message refers to your usage, but describes a patch
> > that makes it possible to do unlocked page table walks.
> >
> > As I get into below, no pun intended, but this needs to be _locked down_
> > heavily.
> >
> > - Only walk_page_range_novma() should allow it. All other functions should
> >    return -EINVAL if this is set.
>
> Sure.
>
> >
> > - walk_page_range_novma() should assert we're in the appropriate context
> >    where this is feasible.
>
> There should be two conditions: that the mm is init_mm, and the start address
> belongs to the vmalloc (or module) space. I am a little nervous about the second. On searching
> throughout the codebase, I could find only vmalloc and module addresses getting
> modified through set_memory_* API, but I couldn't prove that all such usages
> are being done on vmalloc/module addresses.

Hmm, yeah that's concerning.

>
> >
> > - Comments should be updated accordingly.
> >
> > - We should assert (at least CONFIG_DEBUG_VM asserts) in every place that
> >    checks for a VMA that we are not in this lock mode, since this is
> >    disallowed.
> >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   include/linux/pagewalk.h |  2 ++
> > >   mm/pagewalk.c            | 12 ++++++++----
> > >   2 files changed, 10 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > > index 9700a29f8afb..9bc8853ed3de 100644
> > > --- a/include/linux/pagewalk.h
> > > +++ b/include/linux/pagewalk.h
> > > @@ -14,6 +14,8 @@ enum page_walk_lock {
> > >   	PGWALK_WRLOCK = 1,
> > >   	/* vma is expected to be already write-locked during the walk */
> > >   	PGWALK_WRLOCK_VERIFY = 2,
> > > +	/* no lock is needed */
> > > +	PGWALK_NOLOCK = 3,
> > I'd prefer something very explicitly documenting that, at the very least, this
> > can only be used for non-VMA cases.
> >
> > It's hard to think of a name here, but the comment should be explicit as to
> > under what circumstances this is allowed.
> >
> > >   };
> > >
> > >   /**
> > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > index e478777c86e1..9657cf4664b2 100644
> > > --- a/mm/pagewalk.c
> > > +++ b/mm/pagewalk.c
> > > @@ -440,6 +440,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > >   	case PGWALK_RDLOCK:
> > >   		/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
> > >   		break;
> > > +	default:
> > > +		break;
> > Please no 'default' here, we want to be explicit and cover all cases.
>
> Sure.
>
> >
> > And surely, since you're explicitly only allowing this for non-VMA ranges, this
> > should be a WARN_ON_ONCE() or something?
>
> Sounds good, maybe a WARN_ON_ONCE(vma)?
>
> >
> > >   	}
> > >   #endif
> > >   }
> > > @@ -640,10 +642,12 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
> > >   	 * specified address range from being freed. The caller should take
> > >   	 * other actions to prevent this race.
> > >   	 */
> > All functions other than this should explicitly disallow this locking mode
> > with -EINVAL checks. I do not want to see this locking mode made available
> > in a broken context.
> >
> > The full comment:
> >
> > 	/*
> > 	 * 1) For walking the user virtual address space:
> > 	 *
> > 	 * The mmap lock protects the page walker from changes to the page
> > 	 * tables during the walk.  However a read lock is insufficient to
> > 	 * protect those areas which don't have a VMA as munmap() detaches
> > 	 * the VMAs before downgrading to a read lock and actually tearing
> > 	 * down PTEs/page tables. In which case, the mmap write lock should
> > 	 * be hold.
> > 	 *
> > 	 * 2) For walking the kernel virtual address space:
> > 	 *
> > 	 * The kernel intermediate page tables usually do not be freed, so
> > 	 * the mmap map read lock is sufficient. But there are some exceptions.
> > 	 * E.g. memory hot-remove. In which case, the mmap lock is insufficient
> > 	 * to prevent the intermediate kernel pages tables belonging to the
> > 	 * specified address range from being freed. The caller should take
> > 	 * other actions to prevent this race.
> > 	 */
> >
> > Are you walking kernel memory only? Point 1 above explicitly points out why
> > userland novma memory requires a lock.
> >
> > For point 2 you need to indicate why you don't need to consider hotplugging,
>
> Well, hotunplugging will first offline the physical memory, and since the
> vmalloc object has the reference to the pages, there is no race.

Right, but fundamentally you are holding vmalloc locks no? So this is what
protects things? Or more broadly, vmalloc wholly controls its ranges.

>
> > etc.
> >
> > But as Ryan points out elsewhere, you should be expanding this comment to
> > explain your case...
> >
> > You should also assert you're in a context where this applies and error
> > out/WARN if not.
> >
> > > -	if (mm == &init_mm)
> > > -		mmap_assert_locked(walk.mm);
> > > -	else
> > > -		mmap_assert_write_locked(walk.mm);
> > > +	if (ops->walk_lock != PGWALK_NOLOCK) {
> > I really don't like the idea that you're allowing no lock for userland mappings.
> >
> > This should at the very least be:
> >
> > if (mm == &init_mm)  {
> > 	if (ops->walk_lock != PGWALK_NOLOCK)
> > 		mmap_assert_locked(walk.mm);
> > } else {
> > 	mmap_assert_write_locked(walk.mm);
> > }
>
> Sure.
>
> >
> > > +		if (mm == &init_mm)
> > > +			mmap_assert_locked(walk.mm);
> > > +		else
> > > +			mmap_assert_write_locked(walk.mm);
> > > +	}
> > >
> > >   	return walk_pgd_range(start, end, &walk);
> > >   }
> > > --
> > > 2.30.2
> > >
> > We have to be _really_ careful with this stuff. It's very fiddly and
> > brokenness can be a security issue.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-06  9:49   ` Lorenzo Stoakes
@ 2025-06-06 10:39     ` Dev Jain
  2025-06-06 10:56       ` Lorenzo Stoakes
  2025-06-09  9:41     ` Dev Jain
  1 sibling, 1 reply; 29+ messages in thread
From: Dev Jain @ 2025-06-06 10:39 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel


On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
> On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
>> Move away from apply_to_page_range(), which does not honour leaf mappings,
>> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
>> if a partial range is detected.
> Hm a follow up question here - why not just improve apply_to_page_range() to
> honour leaf mappings?
>
> What does honouring leaf mappings actually mean? You mean handling huge pages?

Sorry, I always confuse between block, page and leaf mappings :) I mean to say
block mappings, yes, huge pages.

>
> Would it be all that difficult to implement?

That is how I did it initially. But I think we get into the same problem
which you are describing w.r.t extending walk_page_range_novma - currently we
return EINVAL in case we encounter a block mapping in apply_to_page_range,
basically asserting that the callers always operate on page mappings. Removing this
assertion and generalizing apply_to_page_range kind of sounds the same as
removing the locking assertion and generalizing walk_page_range_novma...

>
> It seems like you're pushing a bunch of the 'applying' logic over from there to
> a walker that isn't maybe best suited to it and having to introduce an iffy new
> form of locking...

IMHO I think it is the reverse. Commit aee16b3cee2746880e40945a9b5bff4f309cfbc4
introduces apply_to_page_range, and commit e6473092bd9116583ce9ab8cf1b6570e1aa6fc83
introduces pagewalk. The commit messages say that the former is meant to be used
on page mappings, while the latter is generic. The latter implies that the former
was actually never meant to exist...

>
> Can we go vice-versa? :)
>
> Also obviously walk_page_range_novma() doesn't exist any more :P
> walk_kernel_page_table_range() is the preferred solution.
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 64 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 39fd1f7ff02a..a5c829c64969 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -8,6 +8,7 @@
>>   #include <linux/mem_encrypt.h>
>>   #include <linux/sched.h>
>>   #include <linux/vmalloc.h>
>> +#include <linux/pagewalk.h>
>>
>>   #include <asm/cacheflush.h>
>>   #include <asm/pgtable-prot.h>
>> @@ -20,6 +21,67 @@ struct page_change_data {
>>   	pgprot_t clear_mask;
>>   };
>>
>> +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
>> +{
>> +	struct page_change_data *masks = walk->private;
>> +	unsigned long new_val = val;
>> +
>> +	new_val &= ~(pgprot_val(masks->clear_mask));
>> +	new_val |= (pgprot_val(masks->set_mask));
>> +
>> +	return new_val;
>> +}
>> +
>> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pud_t val = pudp_get(pud);
>> +
>> +	if (pud_leaf(val)) {
>> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
>> +			return -EINVAL;
>> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
>> +		set_pud(pud, val);
>> +		walk->action = ACTION_CONTINUE;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pmd_t val = pmdp_get(pmd);
>> +
>> +	if (pmd_leaf(val)) {
>> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
>> +			return -EINVAL;
>> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
>> +		set_pmd(pmd, val);
>> +		walk->action = ACTION_CONTINUE;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>> +			      unsigned long next, struct mm_walk *walk)
>> +{
>> +	pte_t val = ptep_get(pte);
>> +
>> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
>> +	set_pte(pte, val);
>> +
>> +	return 0;
>> +}
>> +
>> +static const struct mm_walk_ops pageattr_ops = {
>> +	.pud_entry	= pageattr_pud_entry,
>> +	.pmd_entry	= pageattr_pmd_entry,
>> +	.pte_entry	= pageattr_pte_entry,
>> +	.walk_lock	= PGWALK_NOLOCK,
>> +};
>> +
>>   bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>>
>>   bool can_set_direct_map(void)
>> @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>>   	return 0;
>>   }
>>
>> -/*
>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>> - */
>>   static int __change_memory_common(unsigned long start, unsigned long size,
>>   				pgprot_t set_mask, pgprot_t clear_mask)
>>   {
>> @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>>   	data.set_mask = set_mask;
>>   	data.clear_mask = clear_mask;
>>
>> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>> -					&data);
>> +	ret = walk_page_range_novma(&init_mm, start, start + size,
>> +				    &pageattr_ops, NULL, &data);
>>
>>   	/*
>>   	 * If the memory is being made valid without changing any other bits
>> --
>> 2.30.2
>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-06 10:39     ` Dev Jain
@ 2025-06-06 10:56       ` Lorenzo Stoakes
  2025-06-06 11:08         ` Dev Jain
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-06 10:56 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel

On Fri, Jun 06, 2025 at 04:09:51PM +0530, Dev Jain wrote:
>
> On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
> > On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
> > > Move away from apply_to_page_range(), which does not honour leaf mappings,
> > > to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
> > > if a partial range is detected.
> > Hm a follow up question here - why not just improve apply_to_page_range() to
> > honour leaf mappings?
> >
> > What does honouring leaf mappings actually mean? You mean handling huge pages?
>
> Sorry, I always confuse between block, page and leaf mappings :) I mean to say
> block mappings, yes, huge pages.

Sometimes I think we like to give different names to things just to make life
confusing ;)

>
> >
> > Would it be all that difficult to implement?
>
> That is how I did it initially. But I think we get into the same problem
> which you are describing w.r.t extending walk_page_range_novma - currently we
> return EINVAL in case we encounter a block mapping in apply_to_page_range,
> basically asserting that the callers always operate on page mappings. Removing this
> assertion and generalizing apply_to_page_range kind of sounds the same as
> removing the locking assertion and generalizing walk_page_range_novma...

(Again keep in mind walk_page_range_novma no longer exists :)

Yeah it's problematic I guess in that you have a pte_fn_t and would have to get
into gross stuff like pretending a PMD entry is a PTE entry etc.

Ugh god why do we do this to ourselves.

>
> >
> > It seems like you're pushing a bunch of the 'applying' logic over from there to
> > a walker that isn't maybe best suited to it and having to introduce an iffy new
> > form of locking...
>
> IMHO I think it is the reverse. Commit aee16b3cee2746880e40945a9b5bff4f309cfbc4
> introduces apply_to_page_range, and commit e6473092bd9116583ce9ab8cf1b6570e1aa6fc83
> introduces pagewalk. The commit messages say that the former is meant to be used
> on page mappings, while the latter is generic. The latter implies that the former
> was actually never meant to exist...

What a mess.

Maybe the least-worst solution is to just add a new
walk_kernel_page_table_range_unlocked() function without an assert and in the
comment heavily underline that _you must have made sure this is safe_.

This needs revisting in general, I find the use of init_mm.mmap_lock pretty
gross.

>
> >
> > Can we go vice-versa? :)
> >
> > Also obviously walk_page_range_novma() doesn't exist any more :P
> > walk_kernel_page_table_range() is the preferred solution.
> >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
> > >   1 file changed, 64 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> > > index 39fd1f7ff02a..a5c829c64969 100644
> > > --- a/arch/arm64/mm/pageattr.c
> > > +++ b/arch/arm64/mm/pageattr.c
> > > @@ -8,6 +8,7 @@
> > >   #include <linux/mem_encrypt.h>
> > >   #include <linux/sched.h>
> > >   #include <linux/vmalloc.h>
> > > +#include <linux/pagewalk.h>
> > >
> > >   #include <asm/cacheflush.h>
> > >   #include <asm/pgtable-prot.h>
> > > @@ -20,6 +21,67 @@ struct page_change_data {
> > >   	pgprot_t clear_mask;
> > >   };
> > >
> > > +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
> > > +{
> > > +	struct page_change_data *masks = walk->private;
> > > +	unsigned long new_val = val;
> > > +
> > > +	new_val &= ~(pgprot_val(masks->clear_mask));
> > > +	new_val |= (pgprot_val(masks->set_mask));
> > > +
> > > +	return new_val;
> > > +}
> > > +
> > > +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> > > +			      unsigned long next, struct mm_walk *walk)
> > > +{
> > > +	pud_t val = pudp_get(pud);
> > > +
> > > +	if (pud_leaf(val)) {
> > > +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> > > +			return -EINVAL;
> > > +		val = __pud(set_pageattr_masks(pud_val(val), walk));
> > > +		set_pud(pud, val);
> > > +		walk->action = ACTION_CONTINUE;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > +			      unsigned long next, struct mm_walk *walk)
> > > +{
> > > +	pmd_t val = pmdp_get(pmd);
> > > +
> > > +	if (pmd_leaf(val)) {
> > > +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> > > +			return -EINVAL;
> > > +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> > > +		set_pmd(pmd, val);
> > > +		walk->action = ACTION_CONTINUE;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> > > +			      unsigned long next, struct mm_walk *walk)
> > > +{
> > > +	pte_t val = ptep_get(pte);
> > > +
> > > +	val = __pte(set_pageattr_masks(pte_val(val), walk));
> > > +	set_pte(pte, val);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +static const struct mm_walk_ops pageattr_ops = {
> > > +	.pud_entry	= pageattr_pud_entry,
> > > +	.pmd_entry	= pageattr_pmd_entry,
> > > +	.pte_entry	= pageattr_pte_entry,
> > > +	.walk_lock	= PGWALK_NOLOCK,
> > > +};
> > > +
> > >   bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
> > >
> > >   bool can_set_direct_map(void)
> > > @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
> > >   	return 0;
> > >   }
> > >
> > > -/*
> > > - * This function assumes that the range is mapped with PAGE_SIZE pages.
> > > - */
> > >   static int __change_memory_common(unsigned long start, unsigned long size,
> > >   				pgprot_t set_mask, pgprot_t clear_mask)
> > >   {
> > > @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> > >   	data.set_mask = set_mask;
> > >   	data.clear_mask = clear_mask;
> > >
> > > -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> > > -					&data);
> > > +	ret = walk_page_range_novma(&init_mm, start, start + size,
> > > +				    &pageattr_ops, NULL, &data);
> > >
> > >   	/*
> > >   	 * If the memory is being made valid without changing any other bits
> > > --
> > > 2.30.2
> > >


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-06 10:56       ` Lorenzo Stoakes
@ 2025-06-06 11:08         ` Dev Jain
  0 siblings, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-06-06 11:08 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel, Yang Shi, Ryan Roberts


On 06/06/25 4:26 pm, Lorenzo Stoakes wrote:
> On Fri, Jun 06, 2025 at 04:09:51PM +0530, Dev Jain wrote:
>> On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
>>> On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
>>>> Move away from apply_to_page_range(), which does not honour leaf mappings,
>>>> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
>>>> if a partial range is detected.
>>> Hm a follow up question here - why not just improve apply_to_page_range() to
>>> honour leaf mappings?
>>>
>>> What does honouring leaf mappings actually mean? You mean handling huge pages?
>> Sorry, I always confuse between block, page and leaf mappings :) I mean to say
>> block mappings, yes, huge pages.
> Sometimes I think we like to give different names to things just to make life
> confusing ;)
>
>>> Would it be all that difficult to implement?
>> That is how I did it initially. But I think we get into the same problem
>> which you are describing w.r.t extending walk_page_range_novma - currently we
>> return EINVAL in case we encounter a block mapping in apply_to_page_range,
>> basically asserting that the callers always operate on page mappings. Removing this
>> assertion and generalizing apply_to_page_range kind of sounds the same as
>> removing the locking assertion and generalizing walk_page_range_novma...
> (Again keep in mind walk_page_range_novma no longer exists :)

Ya I mean the pagewalk API.

>
> Yeah it's problematic I guess in that you have a pte_fn_t and would have to get
> into gross stuff like pretending a PMD entry is a PTE entry etc.

Yes, since the pagewalk API has level callbacks it makes life easier.

>
> Ugh god why do we do this to ourselves.

I know right :)

>
>>> It seems like you're pushing a bunch of the 'applying' logic over from there to
>>> a walker that isn't maybe best suited to it and having to introduce an iffy new
>>> form of locking...
>> IMHO I think it is the reverse. Commit aee16b3cee2746880e40945a9b5bff4f309cfbc4
>> introduces apply_to_page_range, and commit e6473092bd9116583ce9ab8cf1b6570e1aa6fc83
>> introduces pagewalk. The commit messages say that the former is meant to be used
>> on page mappings, while the latter is generic. The latter implies that the former
>> was actually never meant to exist...
> What a mess.
>
> Maybe the least-worst solution is to just add a new
> walk_kernel_page_table_range_unlocked() function without an assert and in the
> comment heavily underline that _you must have made sure this is safe_.
>
> This needs revisting in general, I find the use of init_mm.mmap_lock pretty
> gross.

There you said it! I have been reading kernel mapping code for some time and
the entire point of using the init_mm.mmap_lock has been mutual exclusion,
completely throwing out of the window what the mmap_lock actually means.
For example, we take init_mm write lock for ptdump_walk_pgd(), which does
not sound right to me since the only thing ptdump actually does is walk
the pagetables, yet we take the most restrictive lock.

>>> Can we go vice-versa? :)
>>>
>>> Also obviously walk_page_range_novma() doesn't exist any more :P
>>> walk_kernel_page_table_range() is the preferred solution.
>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    arch/arm64/mm/pageattr.c | 69 +++++++++++++++++++++++++++++++++++++---
>>>>    1 file changed, 64 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>>>> index 39fd1f7ff02a..a5c829c64969 100644
>>>> --- a/arch/arm64/mm/pageattr.c
>>>> +++ b/arch/arm64/mm/pageattr.c
>>>> @@ -8,6 +8,7 @@
>>>>    #include <linux/mem_encrypt.h>
>>>>    #include <linux/sched.h>
>>>>    #include <linux/vmalloc.h>
>>>> +#include <linux/pagewalk.h>
>>>>
>>>>    #include <asm/cacheflush.h>
>>>>    #include <asm/pgtable-prot.h>
>>>> @@ -20,6 +21,67 @@ struct page_change_data {
>>>>    	pgprot_t clear_mask;
>>>>    };
>>>>
>>>> +static pteval_t set_pageattr_masks(unsigned long val, struct mm_walk *walk)
>>>> +{
>>>> +	struct page_change_data *masks = walk->private;
>>>> +	unsigned long new_val = val;
>>>> +
>>>> +	new_val &= ~(pgprot_val(masks->clear_mask));
>>>> +	new_val |= (pgprot_val(masks->set_mask));
>>>> +
>>>> +	return new_val;
>>>> +}
>>>> +
>>>> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
>>>> +			      unsigned long next, struct mm_walk *walk)
>>>> +{
>>>> +	pud_t val = pudp_get(pud);
>>>> +
>>>> +	if (pud_leaf(val)) {
>>>> +		if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
>>>> +			return -EINVAL;
>>>> +		val = __pud(set_pageattr_masks(pud_val(val), walk));
>>>> +		set_pud(pud, val);
>>>> +		walk->action = ACTION_CONTINUE;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
>>>> +			      unsigned long next, struct mm_walk *walk)
>>>> +{
>>>> +	pmd_t val = pmdp_get(pmd);
>>>> +
>>>> +	if (pmd_leaf(val)) {
>>>> +		if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
>>>> +			return -EINVAL;
>>>> +		val = __pmd(set_pageattr_masks(pmd_val(val), walk));
>>>> +		set_pmd(pmd, val);
>>>> +		walk->action = ACTION_CONTINUE;
>>>> +	}
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>>>> +			      unsigned long next, struct mm_walk *walk)
>>>> +{
>>>> +	pte_t val = ptep_get(pte);
>>>> +
>>>> +	val = __pte(set_pageattr_masks(pte_val(val), walk));
>>>> +	set_pte(pte, val);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static const struct mm_walk_ops pageattr_ops = {
>>>> +	.pud_entry	= pageattr_pud_entry,
>>>> +	.pmd_entry	= pageattr_pmd_entry,
>>>> +	.pte_entry	= pageattr_pte_entry,
>>>> +	.walk_lock	= PGWALK_NOLOCK,
>>>> +};
>>>> +
>>>>    bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>>>>
>>>>    bool can_set_direct_map(void)
>>>> @@ -49,9 +111,6 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>>>>    	return 0;
>>>>    }
>>>>
>>>> -/*
>>>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>>>> - */
>>>>    static int __change_memory_common(unsigned long start, unsigned long size,
>>>>    				pgprot_t set_mask, pgprot_t clear_mask)
>>>>    {
>>>> @@ -61,8 +120,8 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>>>>    	data.set_mask = set_mask;
>>>>    	data.clear_mask = clear_mask;
>>>>
>>>> -	ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>>>> -					&data);
>>>> +	ret = walk_page_range_novma(&init_mm, start, start + size,
>>>> +				    &pageattr_ops, NULL, &data);
>>>>
>>>>    	/*
>>>>    	 * If the memory is being made valid without changing any other bits
>>>> --
>>>> 2.30.2
>>>>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-06  9:49   ` Lorenzo Stoakes
  2025-06-06 10:39     ` Dev Jain
@ 2025-06-09  9:41     ` Dev Jain
  2025-06-09 11:00       ` Lorenzo Stoakes
  1 sibling, 1 reply; 29+ messages in thread
From: Dev Jain @ 2025-06-09  9:41 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel


On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
> On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
>> Move away from apply_to_page_range(), which does not honour leaf mappings,
>> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
>> if a partial range is detected.
> Hm a follow up question here - why not just improve apply_to_page_range() to
> honour leaf mappings?
>
> What does honouring leaf mappings actually mean? You mean handling huge pages?
>
> Would it be all that difficult to implement?
>
> It seems like you're pushing a bunch of the 'applying' logic over from there to
> a walker that isn't maybe best suited to it and having to introduce an iffy new
> form of locking...
>
> Can we go vice-versa? :)
>
> Also obviously walk_page_range_novma() doesn't exist any more :P
> walk_kernel_page_table_range() is the preferred solution.

I cannot see this function in mm-unstable. Also, mm-unstable is broken - I get
some RCU message in dmesg, and after some 20-30 second delay the kernel boots.
So on which branch do I base my work on...

>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-09  9:41     ` Dev Jain
@ 2025-06-09 11:00       ` Lorenzo Stoakes
  2025-06-09 11:31         ` Dev Jain
  0 siblings, 1 reply; 29+ messages in thread
From: Lorenzo Stoakes @ 2025-06-09 11:00 UTC (permalink / raw)
  To: Dev Jain
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel

On Mon, Jun 09, 2025 at 03:11:59PM +0530, Dev Jain wrote:
>
> On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
> > On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
> > > Move away from apply_to_page_range(), which does not honour leaf mappings,
> > > to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
> > > if a partial range is detected.
> > Hm a follow up question here - why not just improve apply_to_page_range() to
> > honour leaf mappings?
> >
> > What does honouring leaf mappings actually mean? You mean handling huge pages?
> >
> > Would it be all that difficult to implement?
> >
> > It seems like you're pushing a bunch of the 'applying' logic over from there to
> > a walker that isn't maybe best suited to it and having to introduce an iffy new
> > form of locking...
> >
> > Can we go vice-versa? :)
> >
> > Also obviously walk_page_range_novma() doesn't exist any more :P
> > walk_kernel_page_table_range() is the preferred solution.
>
> I cannot see this function in mm-unstable. Also, mm-unstable is broken - I get
> some RCU message in dmesg, and after some 20-30 second delay the kernel boots.
> So on which branch do I base my work on...

mm-unstable shouldn't be broken as that's what should be in -next, concerning!
Worth investigating... But rebase! :)

Sorry maybe isn't clear as we changed this recently - you should base all
changes intended for the next release on mm-new.

As I understand it:

- mm-new is the _rawest_ form of the state of mm, Andrew described it as
  the 'crazy' branch that basiclaly has everything.

- mm-unstable is when things have been percolating for a while and are
  considered reasonably stable-ish kinda, but most importantly - ready for
  -next testing. And is what goes to -next.

- mm-stable is gathered shortly before the merge window starts and is all
  the stuff Andrew will send to Linus.

To pick up on the most recent changes therefore use mm-new. Using anything
else risks issues like this where your patch will conflict, etc.

Another point to note is, during the merge window, mm-new is where changes
due for the release after the one being currently merged are kept
(e.g. over the past 2 weeks of 6.16 merge window, this would be changes for
6.17).

Not all subsystems even take patches at all during the merge window, but mm
does.

So especially during merge window it's important to base you changes on
mm-new.

>
> >

Cheers, Lorenzo


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
  2025-06-09 11:00       ` Lorenzo Stoakes
@ 2025-06-09 11:31         ` Dev Jain
  0 siblings, 0 replies; 29+ messages in thread
From: Dev Jain @ 2025-06-09 11:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
	steven.price, gshan, linux-arm-kernel


On 09/06/25 4:30 pm, Lorenzo Stoakes wrote:
> On Mon, Jun 09, 2025 at 03:11:59PM +0530, Dev Jain wrote:
>> On 06/06/25 3:19 pm, Lorenzo Stoakes wrote:
>>> On Fri, May 30, 2025 at 02:34:06PM +0530, Dev Jain wrote:
>>>> Move away from apply_to_page_range(), which does not honour leaf mappings,
>>>> to walk_page_range_novma(). The callbacks emit a warning and return EINVAL
>>>> if a partial range is detected.
>>> Hm a follow up question here - why not just improve apply_to_page_range() to
>>> honour leaf mappings?
>>>
>>> What does honouring leaf mappings actually mean? You mean handling huge pages?
>>>
>>> Would it be all that difficult to implement?
>>>
>>> It seems like you're pushing a bunch of the 'applying' logic over from there to
>>> a walker that isn't maybe best suited to it and having to introduce an iffy new
>>> form of locking...
>>>
>>> Can we go vice-versa? :)
>>>
>>> Also obviously walk_page_range_novma() doesn't exist any more :P
>>> walk_kernel_page_table_range() is the preferred solution.
>> I cannot see this function in mm-unstable. Also, mm-unstable is broken - I get
>> some RCU message in dmesg, and after some 20-30 second delay the kernel boots.
>> So on which branch do I base my work on...
> mm-unstable shouldn't be broken as that's what should be in -next, concerning!
> Worth investigating... But rebase! :)
>
> Sorry maybe isn't clear as we changed this recently - you should base all
> changes intended for the next release on mm-new.
>
> As I understand it:
>
> - mm-new is the _rawest_ form of the state of mm, Andrew described it as
>    the 'crazy' branch that basiclaly has everything.
>
> - mm-unstable is when things have been percolating for a while and are
>    considered reasonably stable-ish kinda, but most importantly - ready for
>    -next testing. And is what goes to -next.
>
> - mm-stable is gathered shortly before the merge window starts and is all
>    the stuff Andrew will send to Linus.
>
> To pick up on the most recent changes therefore use mm-new. Using anything
> else risks issues like this where your patch will conflict, etc.
>
> Another point to note is, during the merge window, mm-new is where changes
> due for the release after the one being currently merged are kept
> (e.g. over the past 2 weeks of 6.16 merge window, this would be changes for
> 6.17).
>
> Not all subsystems even take patches at all during the merge window, but mm
> does.
>
> So especially during merge window it's important to base you changes on
> mm-new.

Thanks. I will base my changes on mm-new.

>
> Cheers, Lorenzo
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-06-09 11:35 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-30  9:04 [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
2025-05-30  9:04 ` [PATCH 1/3] mm: Allow pagewalk without locks Dev Jain
2025-05-30 10:33   ` Ryan Roberts
2025-05-30 21:33     ` Yang Shi
2025-05-30 10:57   ` Lorenzo Stoakes
2025-06-06  9:21     ` Dev Jain
2025-06-06  9:33       ` Dev Jain
2025-06-06 10:02       ` Lorenzo Stoakes
2025-05-30  9:04 ` [PATCH 2/3] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
2025-05-30 12:53   ` Ryan Roberts
2025-06-02  4:35     ` Dev Jain
2025-06-06  9:49   ` Lorenzo Stoakes
2025-06-06 10:39     ` Dev Jain
2025-06-06 10:56       ` Lorenzo Stoakes
2025-06-06 11:08         ` Dev Jain
2025-06-09  9:41     ` Dev Jain
2025-06-09 11:00       ` Lorenzo Stoakes
2025-06-09 11:31         ` Dev Jain
2025-05-30  9:04 ` [PATCH 3/3] mm/pagewalk: Add pre/post_pte_table callback for lazy MMU on arm64 Dev Jain
2025-05-30 11:14   ` Lorenzo Stoakes
2025-05-30 12:12     ` Ryan Roberts
2025-05-30 12:18       ` Lorenzo Stoakes
2025-05-30  9:24 ` [PATCH 0/3] Enable huge-vmalloc permission change Dev Jain
2025-05-30 10:03 ` Ryan Roberts
2025-05-30 10:10   ` Dev Jain
2025-05-30 10:37     ` Ryan Roberts
2025-05-30 10:42       ` Dev Jain
2025-05-30 10:51         ` Ryan Roberts
2025-05-30 11:11           ` Dev Jain

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).