* [PATCH v2 0/2] Enable permission change on arm64 kernel block mappings
@ 2025-06-10 11:43 Dev Jain
2025-06-10 11:44 ` [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking Dev Jain
2025-06-10 11:44 ` [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
0 siblings, 2 replies; 17+ messages in thread
From: Dev Jain @ 2025-06-10 11:43 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, ryan.roberts, anshuman.khandual, Dev Jain
This series paves the path to enable huge mappings in vmalloc space and
linear map space by default on arm64. For this we must ensure that we can
handle any permission games on the kernel (init_mm) pagetable. Currently,
__change_memory_common() uses apply_to_page_range(), which does not support
changing permissions for leaf (block) mappings. We move away from this by
using walk_page_range_novma(), similar to what riscv does right now; however,
it is the caller's responsibility to either not pass a range that partially
covers a leaf mapping, or to split such a mapping beforehand.
This series ties in with Yang Shi's attempt [1] at using huge mappings in
the linear map when the system supports BBML2, in which case the linear
mapping can be split if needed without break-before-make. Thus, Yang's
series, IIUC, will be one user of my series; suppose we are changing
permissions on a range of the linear map backed by PMD-hugepages, then the
sequence of operations should look like the following:
split_range(start);
split_range(end);
___change_memory_common(start, end);
However, this series can be used independently of Yang's; since permissions
are currently changed only on pte mappings (because apply_to_page_range does
not support anything else), this series provides the mechanism for enabling
huge mappings for various kernel mappings such as the linear map and vmalloc.
[1] https://lore.kernel.org/all/20250304222018.615808-1-yang@os.amperecomputing.com/
v1->v2:
- Squash patches 2 and 3
- Add a comment describing the caller's responsibility to ensure no
partial overlap with a block mapping
- Add comments and return -EINVAL at relevant places to document the usage
of PGWALK_NOLOCK (Lorenzo)
- Wrap walk_kernel_page_table_range() with lazy_mmu calls instead of
doing so only at the PTE level, fix a bug in the PTE callback, introduce
callbacks for all pagetable levels, use ptdesc_t instead of unsigned
long, introduce ___change_memory_common() and use it for the direct map
permission change functions (Ryan)
v1:
https://lore.kernel.org/all/20250530090407.19237-1-dev.jain@arm.com/
Dev Jain (2):
mm: Allow lockless kernel pagetable walking
arm64: pageattr: Use walk_page_range_novma() to change memory
permissions
arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
include/linux/pagewalk.h | 7 ++
mm/pagewalk.c | 23 ++++--
3 files changed, 151 insertions(+), 37 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 11:43 [PATCH v2 0/2] Enable permission change on arm64 kernel block mappings Dev Jain
@ 2025-06-10 11:44 ` Dev Jain
2025-06-10 12:07 ` Lorenzo Stoakes
2025-06-10 11:44 ` [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
1 sibling, 1 reply; 17+ messages in thread
From: Dev Jain @ 2025-06-10 11:44 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, ryan.roberts, anshuman.khandual, Dev Jain
arm64 currently changes permissions on vmalloc objects locklessly, via
apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
since a limitation of the former is to deny changing permissions for block
mappings. However, the API currently enforces the init_mm.mmap_lock to be
held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
this patch extends this generic API to be used locklessly, so as to retain
the existing behaviour for changing permissions. Apart from this reason,
it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
This being a non-sleepable context, we cannot take the init_mm mmap lock.
Since such extension can potentially be dangerous for other callers
consuming the pagewalk API, explicitly disallow lockless traversal for
userspace pagetables by returning EINVAL. Add comments to highlight the
conditions under which we can use the API locklessly - no underlying VMA,
and the user having exclusive control over the range, thus guaranteeing no
concurrent access.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/pagewalk.h | 7 +++++++
mm/pagewalk.c | 23 ++++++++++++++++++-----
2 files changed, 25 insertions(+), 5 deletions(-)
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 8ac2f6d6d2a3..5efd6541239b 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -14,6 +14,13 @@ enum page_walk_lock {
PGWALK_WRLOCK = 1,
/* vma is expected to be already write-locked during the walk */
PGWALK_WRLOCK_VERIFY = 2,
+ /*
+ * Walk without any lock. Use of this is only meant for the
+ * case where there is no underlying VMA, and the user has
+ * exclusive control over the range, guaranteeing no concurrent
+ * access. For example, changing permissions of vmalloc objects.
+ */
+ PGWALK_NOLOCK = 3,
};
/**
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index ff5299eca687..d55d933f84ec 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
return err;
}
-static inline void process_mm_walk_lock(struct mm_struct *mm,
+static inline bool process_mm_walk_lock(struct mm_struct *mm,
enum page_walk_lock walk_lock)
{
+ if (walk_lock == PGWALK_NOLOCK)
+ return 1;
+
if (walk_lock == PGWALK_RDLOCK)
mmap_assert_locked(mm);
else
mmap_assert_write_locked(mm);
+ return 0;
}
static inline void process_vma_walk_lock(struct vm_area_struct *vma,
@@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
case PGWALK_RDLOCK:
/* PGWALK_RDLOCK is handled by process_mm_walk_lock */
break;
+ case PGWALK_NOLOCK:
+ break;
}
#endif
}
@@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
if (!walk.mm)
return -EINVAL;
- process_mm_walk_lock(walk.mm, ops->walk_lock);
+ if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+ return -EINVAL;
vma = find_vma(walk.mm, start);
do {
@@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
* to prevent the intermediate kernel pages tables belonging to the
* specified address range from being freed. The caller should take
* other actions to prevent this race.
+ *
+ * If the caller can guarantee that it has exclusive access to the
+ * specified address range, only then it can use PGWALK_NOLOCK.
*/
- mmap_assert_locked(mm);
+ if (ops->walk_lock != PGWALK_NOLOCK)
+ mmap_assert_locked(mm);
return walk_pgd_range(start, end, &walk);
}
@@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
if (!check_ops_valid(ops))
return -EINVAL;
- process_mm_walk_lock(walk.mm, ops->walk_lock);
+ if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+ return -EINVAL;
process_vma_walk_lock(vma, ops->walk_lock);
return __walk_page_range(start, end, &walk);
}
@@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
if (!check_ops_valid(ops))
return -EINVAL;
- process_mm_walk_lock(walk.mm, ops->walk_lock);
+ if (process_mm_walk_lock(walk.mm, ops->walk_lock))
+ return -EINVAL;
process_vma_walk_lock(vma, ops->walk_lock);
return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
}
--
2.30.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
2025-06-10 11:43 [PATCH v2 0/2] Enable permission change on arm64 kernel block mappings Dev Jain
2025-06-10 11:44 ` [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking Dev Jain
@ 2025-06-10 11:44 ` Dev Jain
2025-06-10 13:14 ` David Hildenbrand
2025-06-10 14:41 ` Ryan Roberts
1 sibling, 2 replies; 17+ messages in thread
From: Dev Jain @ 2025-06-10 11:44 UTC (permalink / raw)
To: akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, ryan.roberts, anshuman.khandual, Dev Jain
Since apply_to_page_range does not support operations on block mappings,
use the generic pagewalk API to enable changing permissions for kernel
block mappings. This paves the path for enabling huge mappings by default
on kernel space mappings, thus leading to more efficient TLB usage.
We only require that the start and end of a given range lie on leaf mapping
boundaries. Return EINVAL in case a partial block mapping is detected; add
a corresponding comment in ___change_memory_common() to warn that
eliminating such a condition is the responsibility of the caller.
apply_to_page_range ultimately uses the lazy MMU hooks at the pte level
function (apply_to_pte_range) - we want to use this functionality after
this patch too. Ryan says:
"The only reason we traditionally confine the lazy mmu mode to a single
page table is because we want to enclose it within the PTL. But that
requirement doesn't stand for kernel mappings. As long as the walker can
guarantee that it doesn't allocate any memory (because with certain debug
settings that can cause lazy mmu nesting) or try to sleep then I think we
can just bracket the entire call."
Therefore, wrap the call to walk_kernel_page_table_range() with the
lazy MMU helpers.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
1 file changed, 126 insertions(+), 32 deletions(-)
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 04d4a8f676db..2c118c0922ef 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -8,6 +8,7 @@
#include <linux/mem_encrypt.h>
#include <linux/sched.h>
#include <linux/vmalloc.h>
+#include <linux/pagewalk.h>
#include <asm/cacheflush.h>
#include <asm/pgtable-prot.h>
@@ -20,6 +21,100 @@ struct page_change_data {
pgprot_t clear_mask;
};
+static pteval_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
+{
+ struct page_change_data *masks = walk->private;
+
+ val &= ~(pgprot_val(masks->clear_mask));
+ val |= (pgprot_val(masks->set_mask));
+
+ return val;
+}
+
+static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pgd_t val = pgdp_get(pgd);
+
+ if (pgd_leaf(val)) {
+ if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
+ return -EINVAL;
+ val = __pgd(set_pageattr_masks(pgd_val(val), walk));
+ set_pgd(pgd, val);
+ walk->action = ACTION_CONTINUE;
+ }
+
+ return 0;
+}
+
+static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ p4d_t val = p4dp_get(p4d);
+
+ if (p4d_leaf(val)) {
+ if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
+ return -EINVAL;
+ val = __p4d(set_pageattr_masks(p4d_val(val), walk));
+ set_p4d(p4d, val);
+ walk->action = ACTION_CONTINUE;
+ }
+
+ return 0;
+}
+
+static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pud_t val = pudp_get(pud);
+
+ if (pud_leaf(val)) {
+ if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
+ return -EINVAL;
+ val = __pud(set_pageattr_masks(pud_val(val), walk));
+ set_pud(pud, val);
+ walk->action = ACTION_CONTINUE;
+ }
+
+ return 0;
+}
+
+static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pmd_t val = pmdp_get(pmd);
+
+ if (pmd_leaf(val)) {
+ if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
+ return -EINVAL;
+ val = __pmd(set_pageattr_masks(pmd_val(val), walk));
+ set_pmd(pmd, val);
+ walk->action = ACTION_CONTINUE;
+ }
+
+ return 0;
+}
+
+static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pte_t val = __ptep_get(pte);
+
+ val = __pte(set_pageattr_masks(pte_val(val), walk));
+ __set_pte(pte, val);
+
+ return 0;
+}
+
+static const struct mm_walk_ops pageattr_ops = {
+ .pgd_entry = pageattr_pgd_entry,
+ .p4d_entry = pageattr_p4d_entry,
+ .pud_entry = pageattr_pud_entry,
+ .pmd_entry = pageattr_pmd_entry,
+ .pte_entry = pageattr_pte_entry,
+ .walk_lock = PGWALK_NOLOCK,
+};
+
bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
bool can_set_direct_map(void)
@@ -37,22 +132,7 @@ bool can_set_direct_map(void)
arm64_kfence_can_set_direct_map() || is_realm_world();
}
-static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
-{
- struct page_change_data *cdata = data;
- pte_t pte = __ptep_get(ptep);
-
- pte = clear_pte_bit(pte, cdata->clear_mask);
- pte = set_pte_bit(pte, cdata->set_mask);
-
- __set_pte(ptep, pte);
- return 0;
-}
-
-/*
- * This function assumes that the range is mapped with PAGE_SIZE pages.
- */
-static int __change_memory_common(unsigned long start, unsigned long size,
+static int ___change_memory_common(unsigned long start, unsigned long size,
pgprot_t set_mask, pgprot_t clear_mask)
{
struct page_change_data data;
@@ -61,9 +141,28 @@ static int __change_memory_common(unsigned long start, unsigned long size,
data.set_mask = set_mask;
data.clear_mask = clear_mask;
- ret = apply_to_page_range(&init_mm, start, size, change_page_range,
- &data);
+ arch_enter_lazy_mmu_mode();
+
+ /*
+ * The caller must ensure that the range we are operating on does not
+ * partially overlap a block mapping. Any such case should either not
+ * exist, or must be eliminated by splitting the mapping - which for
+ * kernel mappings can be done only on BBML2 systems.
+ *
+ */
+ ret = walk_kernel_page_table_range(start, start + size, &pageattr_ops,
+ NULL, &data);
+ arch_leave_lazy_mmu_mode();
+
+ return ret;
+}
+static int __change_memory_common(unsigned long start, unsigned long size,
+ pgprot_t set_mask, pgprot_t clear_mask)
+{
+ int ret;
+
+ ret = ___change_memory_common(start, size, set_mask, clear_mask);
/*
* If the memory is being made valid without changing any other bits
* then a TLBI isn't required as a non-valid entry cannot be cached in
@@ -71,6 +170,7 @@ static int __change_memory_common(unsigned long start, unsigned long size,
*/
if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
flush_tlb_kernel_range(start, start + size);
+
return ret;
}
@@ -174,32 +274,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
int set_direct_map_invalid_noflush(struct page *page)
{
- struct page_change_data data = {
- .set_mask = __pgprot(0),
- .clear_mask = __pgprot(PTE_VALID),
- };
+ pgprot_t clear_mask = __pgprot(PTE_VALID);
+ pgprot_t set_mask = __pgprot(0);
if (!can_set_direct_map())
return 0;
- return apply_to_page_range(&init_mm,
- (unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
+ set_mask, clear_mask);
}
int set_direct_map_default_noflush(struct page *page)
{
- struct page_change_data data = {
- .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
- .clear_mask = __pgprot(PTE_RDONLY),
- };
+ pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
+ pgprot_t clear_mask = __pgprot(PTE_RDONLY);
if (!can_set_direct_map())
return 0;
- return apply_to_page_range(&init_mm,
- (unsigned long)page_address(page),
- PAGE_SIZE, change_page_range, &data);
+ return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
+ set_mask, clear_mask);
}
static int __set_memory_enc_dec(unsigned long addr,
--
2.30.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 11:44 ` [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking Dev Jain
@ 2025-06-10 12:07 ` Lorenzo Stoakes
2025-06-10 12:40 ` Dev Jain
2025-06-10 13:24 ` David Hildenbrand
0 siblings, 2 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 12:07 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
OK so I think the best solution here is to just update check_ops_valid(), which
was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
enforce the install pte thing).
Let's do something like:
#define OPS_MAY_INSTALL_PTE (1<<0)
#define OPS_MAY_AVOID_LOCK (1<<1)
and update check_ops_valid() to take a flags or maybe 'capabilities' field.
Then check based on this e.g.:
if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
return false;
if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
return false;
return true;
Then update callers, most of whom can have '0' passed for this field, with
walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
That way we check it all in one place, it's consistent, we no longer 'implicitly'
don't check it for walk_page_range_mm() and all is neater.
We do end up calling this predicate twice for walk_page_range(), so we could
(should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
and have a separate walk_page_range_mm() that does this check.
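Roughly, something like this (untested, just to show the shape; the
'capabilities' plumbing and flag names are of course up for debate):

static bool check_ops_valid(const struct mm_walk_ops *ops, int capabilities)
{
	if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
		return false;

	if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
		return false;

	return true;
}

with e.g. walk_kernel_page_table_range() doing:

	if (!check_ops_valid(ops, OPS_MAY_AVOID_LOCK))
		return -EINVAL;

and most other callers simply passing 0.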
I think this will solve the interface issues I've raised below.
Thanks!
On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
> arm64 currently changes permissions on vmalloc objects locklessly, via
> apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
> since a limitation of the former is to deny changing permissions for block
> mappings. However, the API currently enforces the init_mm.mmap_lock to be
> held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
> this patch extends this generic API to be used locklessly, so as to retain
> the existing behaviour for changing permissions. Apart from this reason,
> it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>
> Since such extension can potentially be dangerous for other callers
> consuming the pagewalk API, explicitly disallow lockless traversal for
> userspace pagetables by returning EINVAL. Add comments to highlight the
> conditions under which we can use the API locklessly - no underlying VMA,
> and the user having exclusive control over the range, thus guaranteeing no
> concurrent access.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/pagewalk.h | 7 +++++++
> mm/pagewalk.c | 23 ++++++++++++++++++-----
> 2 files changed, 25 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> index 8ac2f6d6d2a3..5efd6541239b 100644
> --- a/include/linux/pagewalk.h
> +++ b/include/linux/pagewalk.h
> @@ -14,6 +14,13 @@ enum page_walk_lock {
> PGWALK_WRLOCK = 1,
> /* vma is expected to be already write-locked during the walk */
> PGWALK_WRLOCK_VERIFY = 2,
> + /*
> + * Walk without any lock. Use of this is only meant for the
> + * case where there is no underlying VMA, and the user has
> + * exclusive control over the range, guaranteeing no concurrent
> + * access. For example, changing permissions of vmalloc objects.
> + */
> + PGWALK_NOLOCK = 3,
Thanks for the comment! This is good.
> };
>
> /**
> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> index ff5299eca687..d55d933f84ec 100644
> --- a/mm/pagewalk.c
> +++ b/mm/pagewalk.c
> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> return err;
> }
>
> -static inline void process_mm_walk_lock(struct mm_struct *mm,
> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
> enum page_walk_lock walk_lock)
I don't like this signature at all, you don't describe what it does, and now it
returns... whether it was not locked? I think this might lead to confusion :)
> {
> + if (walk_lock == PGWALK_NOLOCK)
> + return 1;
It's 2025, return true please :)
> +
> if (walk_lock == PGWALK_RDLOCK)
> mmap_assert_locked(mm);
> else
> mmap_assert_write_locked(mm);
> + return 0;
It's 2025, return false please :)
> }
>
> static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> case PGWALK_RDLOCK:
> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
> break;
> + case PGWALK_NOLOCK:
> + break;
Under what circumstances would we be fine with this function being invoked with
no lock being specified?
Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
won't ever invoke this? Or did I miss a call of this function?
If not, we should make this a VM_WARN_ON_ONCE(1);
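i.e. something like (sketch):

	case PGWALK_NOLOCK:
		/* Only kernel walks may be lockless, and they never get here. */
		VM_WARN_ON_ONCE(1);
		break;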
> }
> #endif
> }
> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> if (!walk.mm)
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
This is just weird, you're treating the return like it were an error value (no
it's a boolean), the name of the function doesn't explain the 'verb' of what's
happening, it's just confusing.
Obviously I'm belabouring the point a bit, see suggestion at top :)
>
> vma = find_vma(walk.mm, start);
> do {
> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> * to prevent the intermediate kernel pages tables belonging to the
> * specified address range from being freed. The caller should take
> * other actions to prevent this race.
> + *
> + * If the caller can guarantee that it has exclusive access to the
> + * specified address range, only then it can use PGWALK_NOLOCK.
> */
> - mmap_assert_locked(mm);
> + if (ops->walk_lock != PGWALK_NOLOCK)
> + mmap_assert_locked(mm);
>
> return walk_pgd_range(start, end, &walk);
> }
> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> if (!check_ops_valid(ops))
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
> process_vma_walk_lock(vma, ops->walk_lock);
> return __walk_page_range(start, end, &walk);
> }
> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> if (!check_ops_valid(ops))
> return -EINVAL;
>
> - process_mm_walk_lock(walk.mm, ops->walk_lock);
> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> + return -EINVAL;
> process_vma_walk_lock(vma, ops->walk_lock);
> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> }
> --
> 2.30.2
>
Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 12:07 ` Lorenzo Stoakes
@ 2025-06-10 12:40 ` Dev Jain
2025-06-10 12:57 ` Lorenzo Stoakes
2025-06-10 13:24 ` David Hildenbrand
1 sibling, 1 reply; 17+ messages in thread
From: Dev Jain @ 2025-06-10 12:40 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
> OK so I think the best solution here is to just update check_ops_valid(), which
> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> enforce the install pte thing).
>
> Let's do something like:
>
> #define OPS_MAY_INSTALL_PTE (1<<0)
> #define OPS_MAY_AVOID_LOCK (1<<1)
>
> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>
> Then check based on this e.g.:
>
> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> return false;
>
> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> return false;
>
> return true;
>
> Then update callers, most of whom can have '0' passed for this field, with
> walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
> walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
>
> That way we check it all in one place, it's consistent, we no longer 'implicitly'
> don't check it for walk_page_range_mm() and all is neater.
>
> We do end up calling this predicate twice for walk_page_range(), so we could
> (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
> and have a separate walk_page_range_mm() that does this check.
>
> I think this will solve the interface issues I've raised below.
Makes sense, thank you for your suggestions.
>
> Thanks!
>
> On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
>> arm64 currently changes permissions on vmalloc objects locklessly, via
>> apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
>> since a limitation of the former is to deny changing permissions for block
>> mappings. However, the API currently enforces the init_mm.mmap_lock to be
>> held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
>> this patch extends this generic API to be used locklessly, so as to retain
>> the existing behaviour for changing permissions. Apart from this reason,
>> it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>>
>> Since such extension can potentially be dangerous for other callers
>> consuming the pagewalk API, explicitly disallow lockless traversal for
>> userspace pagetables by returning EINVAL. Add comments to highlight the
>> conditions under which we can use the API locklessly - no underlying VMA,
>> and the user having exclusive control over the range, thus guaranteeing no
>> concurrent access.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/pagewalk.h | 7 +++++++
>> mm/pagewalk.c | 23 ++++++++++++++++++-----
>> 2 files changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>> index 8ac2f6d6d2a3..5efd6541239b 100644
>> --- a/include/linux/pagewalk.h
>> +++ b/include/linux/pagewalk.h
>> @@ -14,6 +14,13 @@ enum page_walk_lock {
>> PGWALK_WRLOCK = 1,
>> /* vma is expected to be already write-locked during the walk */
>> PGWALK_WRLOCK_VERIFY = 2,
>> + /*
>> + * Walk without any lock. Use of this is only meant for the
>> + * case where there is no underlying VMA, and the user has
>> + * exclusive control over the range, guaranteeing no concurrent
>> + * access. For example, changing permissions of vmalloc objects.
>> + */
>> + PGWALK_NOLOCK = 3,
> Thanks for the comment! This is good.
>
>> };
>>
>> /**
>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>> index ff5299eca687..d55d933f84ec 100644
>> --- a/mm/pagewalk.c
>> +++ b/mm/pagewalk.c
>> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>> return err;
>> }
>>
>> -static inline void process_mm_walk_lock(struct mm_struct *mm,
>> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
>> enum page_walk_lock walk_lock)
> I don't like this signature at all, you don't describe what it does, and now it
> returns... whether it was not locked? I think this might lead to confusion :)
>
>
>> {
>> + if (walk_lock == PGWALK_NOLOCK)
>> + return 1;
> It's 2025, return true please :)
>
>> +
>> if (walk_lock == PGWALK_RDLOCK)
>> mmap_assert_locked(mm);
>> else
>> mmap_assert_write_locked(mm);
>> + return 0;
> It's 2025, return false please :)
>
>> }
>>
>> static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>> case PGWALK_RDLOCK:
>> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>> break;
>> + case PGWALK_NOLOCK:
>> + break;
> Under what circumstances would we be fine with this function being invoked with
> no lock being specified?
>
> Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
> won't ever invoke this? Or did I miss a call of this function?
>
> If not, we should make this a VM_WARN_ON_ONCE(1);
I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
Is there a similar problem with VM_WARN_ON_ONCE()?
[1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/
>
>> }
>> #endif
>> }
>> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>> if (!walk.mm)
>> return -EINVAL;
>>
>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> + return -EINVAL;
> This is just weird, you're treating the return like it were an error value (no
> it's a boolean), the name of the function doesn't explain the 'verb' of what's
> happening, it's just confusing.
>
> Obviously I'm belabouring the point a bit, see suggestion at top :)
>
>> vma = find_vma(walk.mm, start);
>> do {
>> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>> * to prevent the intermediate kernel pages tables belonging to the
>> * specified address range from being freed. The caller should take
>> * other actions to prevent this race.
>> + *
>> + * If the caller can guarantee that it has exclusive access to the
>> + * specified address range, only then it can use PGWALK_NOLOCK.
>> */
>> - mmap_assert_locked(mm);
>> + if (ops->walk_lock != PGWALK_NOLOCK)
>> + mmap_assert_locked(mm);
>>
>> return walk_pgd_range(start, end, &walk);
>> }
>> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>> if (!check_ops_valid(ops))
>> return -EINVAL;
>>
>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> + return -EINVAL;
>> process_vma_walk_lock(vma, ops->walk_lock);
>> return __walk_page_range(start, end, &walk);
>> }
>> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>> if (!check_ops_valid(ops))
>> return -EINVAL;
>>
>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>> + return -EINVAL;
>> process_vma_walk_lock(vma, ops->walk_lock);
>> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>> }
>> --
>> 2.30.2
>>
> Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 12:40 ` Dev Jain
@ 2025-06-10 12:57 ` Lorenzo Stoakes
2025-06-11 3:43 ` Dev Jain
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 12:57 UTC (permalink / raw)
To: Dev Jain
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On Tue, Jun 10, 2025 at 06:10:03PM +0530, Dev Jain wrote:
>
> On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
> > OK so I think the best solution here is to just update check_ops_valid(), which
> > was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> > enforce the install pte thing).
> >
> > Let's do something like:
> >
> > #define OPS_MAY_INSTALL_PTE (1<<0)
> > #define OPS_MAY_AVOID_LOCK (1<<1)
> >
> > and update check_ops_valid() to take a flags or maybe 'capabilities' field.
> >
> > Then check based on this e.g.:
> >
> > if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> > return false;
> >
> > if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> > return false;
> >
> > return true;
> >
> > Then update callers, most of whom can have '0' passed for this field, with
> > walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
> > walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
> >
> > That way we check it all in one place, it's consistent, we no longer 'implicitly'
> > don't check it for walk_page_range_mm() and all is neater.
> >
> > We do end up calling this predicate twice for walk_page_range(), so we could
> > (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
> > and have a separate walk_page_range_mm() that does this check.
> >
> > I think this will solve the interface issues I've raised below.
>
> Makes sense, thank you for your suggestions.
Thanks :)
Sorry to be a pain but I think this will fit more nicely.
>
> >
> > Thanks!
> >
> > On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
> > > arm64 currently changes permissions on vmalloc objects locklessly, via
> > > apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
> > > since a limitation of the former is to deny changing permissions for block
> > > mappings. However, the API currently enforces the init_mm.mmap_lock to be
> > > held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
> > > this patch extends this generic API to be used locklessly, so as to retain
> > > the existing behaviour for changing permissions. Apart from this reason,
> > > it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
> > > softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
> > > This being a non-sleepable context, we cannot take the init_mm mmap lock.
> > >
> > > Since such extension can potentially be dangerous for other callers
> > > consuming the pagewalk API, explicitly disallow lockless traversal for
> > > userspace pagetables by returning EINVAL. Add comments to highlight the
> > > conditions under which we can use the API locklessly - no underlying VMA,
> > > and the user having exclusive control over the range, thus guaranteeing no
> > > concurrent access.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > > include/linux/pagewalk.h | 7 +++++++
> > > mm/pagewalk.c | 23 ++++++++++++++++++-----
> > > 2 files changed, 25 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
> > > index 8ac2f6d6d2a3..5efd6541239b 100644
> > > --- a/include/linux/pagewalk.h
> > > +++ b/include/linux/pagewalk.h
> > > @@ -14,6 +14,13 @@ enum page_walk_lock {
> > > PGWALK_WRLOCK = 1,
> > > /* vma is expected to be already write-locked during the walk */
> > > PGWALK_WRLOCK_VERIFY = 2,
> > > + /*
> > > + * Walk without any lock. Use of this is only meant for the
> > > + * case where there is no underlying VMA, and the user has
> > > + * exclusive control over the range, guaranteeing no concurrent
> > > + * access. For example, changing permissions of vmalloc objects.
> > > + */
> > > + PGWALK_NOLOCK = 3,
> > Thanks for the comment! This is good.
> >
> > > };
> > >
> > > /**
> > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c
> > > index ff5299eca687..d55d933f84ec 100644
> > > --- a/mm/pagewalk.c
> > > +++ b/mm/pagewalk.c
> > > @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
> > > return err;
> > > }
> > >
> > > -static inline void process_mm_walk_lock(struct mm_struct *mm,
> > > +static inline bool process_mm_walk_lock(struct mm_struct *mm,
> > > enum page_walk_lock walk_lock)
> > I don't like this signature at all, you don't describe what it does, and now it
> > returns... whether it was not locked? I think this might lead to confusion :)
> >
> >
> > > {
> > > + if (walk_lock == PGWALK_NOLOCK)
> > > + return 1;
> > It's 2025, return true please :)
> >
> > > +
> > > if (walk_lock == PGWALK_RDLOCK)
> > > mmap_assert_locked(mm);
> > > else
> > > mmap_assert_write_locked(mm);
> > > + return 0;
> > It's 2025, return false please :)
> >
> > > }
> > >
> > > static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > > @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
> > > case PGWALK_RDLOCK:
> > > /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
> > > break;
> > > + case PGWALK_NOLOCK:
> > > + break;
> > Under what circumstances would we be fine with this function being invoked with
> > no lock being specified?
> >
> > Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
> > won't ever invoke this? Or did I miss a call of this function?
> >
> > If not, we should make this a VM_WARN_ON_ONCE(1);
>
> I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
> Is there a similar problem with VM_WARN_ON_ONCE()?
No, in fact the proposal is that we replace BUG_ON()'s with [VM_]WARN_ON_ONCE()'s.
So this is all good, BUG_ON() is basically never needed unless you are _certain_
there will be system instability that MUST BE STOPPED NOW.
Which is almost never going to be the case.
See
https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
where I managed to somehow provoke Linus into giving some (very interesting!) input ;)
But if you see the thread you can see further context on all this.
>
> [1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/
>
> >
> > > }
> > > #endif
> > > }
> > > @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
> > > if (!walk.mm)
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > This is just weird, you're treating the return like it were an error value (no
> > it's a boolean), the name of the function doesn't explain the 'verb' of what's
> > happening, it's just confusing.
> >
> > Obviously I'm belabouring the point a bit, see suggestion at top :)
> >
> > > vma = find_vma(walk.mm, start);
> > > do {
> > > @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
> > > * to prevent the intermediate kernel pages tables belonging to the
> > > * specified address range from being freed. The caller should take
> > > * other actions to prevent this race.
> > > + *
> > > + * If the caller can guarantee that it has exclusive access to the
> > > + * specified address range, only then it can use PGWALK_NOLOCK.
> > > */
> > > - mmap_assert_locked(mm);
> > > + if (ops->walk_lock != PGWALK_NOLOCK)
> > > + mmap_assert_locked(mm);
> > >
> > > return walk_pgd_range(start, end, &walk);
> > > }
> > > @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
> > > if (!check_ops_valid(ops))
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > > process_vma_walk_lock(vma, ops->walk_lock);
> > > return __walk_page_range(start, end, &walk);
> > > }
> > > @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
> > > if (!check_ops_valid(ops))
> > > return -EINVAL;
> > >
> > > - process_mm_walk_lock(walk.mm, ops->walk_lock);
> > > + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
> > > + return -EINVAL;
> > > process_vma_walk_lock(vma, ops->walk_lock);
> > > return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
> > > }
> > > --
> > > 2.30.2
> > >
> > Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
2025-06-10 11:44 ` [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
@ 2025-06-10 13:14 ` David Hildenbrand
2025-06-10 14:41 ` Ryan Roberts
1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-10 13:14 UTC (permalink / raw)
To: Dev Jain, akpm, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, ryan.roberts, anshuman.khandual
On 10.06.25 13:44, Dev Jain wrote:
Subject seems outdated: walk_kernel_page_table_range()
> Since apply_to_page_range does not support operations on block mappings,
> use the generic pagewalk API to enable changing permissions for kernel
> block mappings. This paves the path for enabling huge mappings by default
> on kernel space mappings, thus leading to more efficient TLB usage.
>
> We only require that the start and end of a given range lie on leaf mapping
> boundaries. Return EINVAL in case a partial block mapping is detected; add
> a corresponding comment in ___change_memory_common() to warn that
> eliminating such a condition is the responsibility of the caller.
>
> apply_to_page_range ultimately uses the lazy MMU hooks at the pte level
> function (apply_to_pte_range) - we want to use this functionality after
> this patch too. Ryan says:
> "The only reason we traditionally confine the lazy mmu mode to a single
> page table is because we want to enclose it within the PTL. But that
> requirement doesn't stand for kernel mappings. As long as the walker can
> guarantee that it doesn't allocate any memory (because with certain debug
> settings that can cause lazy mmu nesting) or try to sleep then I think we
> can just bracket the entire call."
> Therefore, wrap the call to walk_kernel_page_table_range() with the
> lazy MMU helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
> 1 file changed, 126 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 04d4a8f676db..2c118c0922ef 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
> #include <linux/mem_encrypt.h>
> #include <linux/sched.h>
> #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>
> #include <asm/cacheflush.h>
> #include <asm/pgtable-prot.h>
> @@ -20,6 +21,100 @@ struct page_change_data {
> pgprot_t clear_mask;
> };
>
The general idea looks sane to me.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 12:07 ` Lorenzo Stoakes
2025-06-10 12:40 ` Dev Jain
@ 2025-06-10 13:24 ` David Hildenbrand
2025-06-10 13:25 ` David Hildenbrand
2025-06-10 13:27 ` Lorenzo Stoakes
1 sibling, 2 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-10 13:24 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt, surenb,
mhocko, linux-mm, linux-kernel, suzuki.poulose, steven.price,
gshan, linux-arm-kernel, yang, ryan.roberts, anshuman.khandual
On 10.06.25 14:07, Lorenzo Stoakes wrote:
> OK so I think the best solution here is to just update check_ops_valid(), which
> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> enforce the install pte thing).
>
> Let's do something like:
>
> #define OPS_MAY_INSTALL_PTE (1<<0)
> #define OPS_MAY_AVOID_LOCK (1<<1)
>
> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>
> Then check based on this e.g.:
>
> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> return false;
>
> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> return false;
>
Hm. I mean, we really only want to allow this lockless check for
walk_kernel_page_table_range(), right?
Having a walk_kernel_page_table_range_lockless() might (or might not)
be better, to really only special-case this specific path.
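Something like this is what I have in mind (completely untested sketch; the
name and exact plumbing are to be decided, the point is just that the
lockless variant becomes a separate, clearly documented entry point):

/*
 * Walk kernel page tables without any locking. Only valid if the caller
 * has exclusive control over the range (e.g. changing permissions of
 * vmalloc objects), so that nothing can modify the tables concurrently.
 */
int walk_kernel_page_table_range_lockless(unsigned long start, unsigned long end,
					  const struct mm_walk_ops *ops,
					  void *private)
{
	struct mm_walk walk = {
		.ops		= ops,
		.mm		= &init_mm,
		.private	= private,
	};

	if (start >= end)
		return -EINVAL;
	if (!check_ops_valid(ops))
		return -EINVAL;

	/* No mmap_assert_locked() here, see the comment above. */
	return walk_pgd_range(start, end, &walk);
}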
So, I am wondering if we should further start splitting the
kernel-page-table walker up from the mm walker, at least on the "entry"
function for now.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:24 ` David Hildenbrand
@ 2025-06-10 13:25 ` David Hildenbrand
2025-06-10 13:27 ` Lorenzo Stoakes
1 sibling, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-10 13:25 UTC (permalink / raw)
To: Lorenzo Stoakes, Dev Jain
Cc: akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt, surenb,
mhocko, linux-mm, linux-kernel, suzuki.poulose, steven.price,
gshan, linux-arm-kernel, yang, ryan.roberts, anshuman.khandual
On 10.06.25 15:24, David Hildenbrand wrote:
> On 10.06.25 14:07, Lorenzo Stoakes wrote:
>> OK so I think the best solution here is to just update check_ops_valid(), which
>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>> enforce the install pte thing).
>>
>> Let's do something like:
>>
>> #define OPS_MAY_INSTALL_PTE (1<<0)
>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>
>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>
>> Then check based on this e.g.:
>>
>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>> return false;
>>
>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>> return false;
>>
>
> Hm. I mean, we really only want to allow this lockless check for
> walk_kernel_page_table_range(), right?
>
> Having a walk_kernel_page_table_range_lockless() might (or might not)
> be better, to really only special-case this specific path.
Note that I am also not quite happy about that function name, but I think
we should add a proper interface that documents clearly when it is even
okay to call that function ...
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:24 ` David Hildenbrand
2025-06-10 13:25 ` David Hildenbrand
@ 2025-06-10 13:27 ` Lorenzo Stoakes
2025-06-10 13:31 ` David Hildenbrand
2025-06-11 3:45 ` Dev Jain
1 sibling, 2 replies; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 13:27 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On Tue, Jun 10, 2025 at 03:24:16PM +0200, David Hildenbrand wrote:
> On 10.06.25 14:07, Lorenzo Stoakes wrote:
> > OK so I think the best solution here is to just update check_ops_valid(), which
> > was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> > enforce the install pte thing).
> >
> > Let's do something like:
> >
> > #define OPS_MAY_INSTALL_PTE (1<<0)
> > #define OPS_MAY_AVOID_LOCK (1<<1)
> >
> > and update check_ops_valid() to take a flags or maybe 'capabilities' field.
> >
> > Then check based on this e.g.:
> >
> > if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> > return false;
> >
> > if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> > return false;
> >
>
> Hm. I mean, we really only want to allow this lockless check for
> walk_kernel_page_table_range(), right?
>
> Having a walk_kernel_page_table_range_lockless() might (or might not) be
> better, to really only special-case this specific path.
Agree completely, Dev - let's definitely do this.
>
> So, I am wondering if we should further start splitting the
> kernel-page-table walker up from the mm walker, at least on the "entry"
> function for now.
How do you mean?
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:27 ` Lorenzo Stoakes
@ 2025-06-10 13:31 ` David Hildenbrand
2025-06-10 13:35 ` Lorenzo Stoakes
2025-06-11 3:45 ` Dev Jain
1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2025-06-10 13:31 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On 10.06.25 15:27, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 03:24:16PM +0200, David Hildenbrand wrote:
>> On 10.06.25 14:07, Lorenzo Stoakes wrote:
>>> OK so I think the best solution here is to just update check_ops_valid(), which
>>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>>> enforce the install pte thing).
>>>
>>> Let's do something like:
>>>
>>> #define OPS_MAY_INSTALL_PTE (1<<0)
>>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>>
>>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>>
>>> Then check based on this e.g.:
>>>
>>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>>> return false;
>>>
>>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>>> return false;
>>>
>>
>> Hm. I mean, we really only want to allow this lockless check for
>> walk_kernel_page_table_range(), right?
>>
>> Having a walk_kernel_page_table_range_lockless() might (or might not) be
>> better, to really only special-case this specific path.
>
> Agree completely, Dev - let's definitely do this.
>
>>
>> So, I am wondering if we should further start splitting the
>> kernel-page-table walker up from the mm walker, at least on the "entry"
>> function for now.
>
> How do you mean?
In particular, "struct mm_walk_ops"
does not quite make sense when not actually walking a "real" mm .
So maybe we should start having a separate structure where *vma,
install_pte, walk_lock, hugetlb* does not even exist.
It might be a bit of churn, though ... not sure if there could be an
easy translation layer for now.
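Very roughly (names made up, just to illustrate the direction):

struct kernel_pgtable_walk_ops {
	int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
			 unsigned long next, void *private);
	int (*p4d_entry)(p4d_t *p4d, unsigned long addr,
			 unsigned long next, void *private);
	int (*pud_entry)(pud_t *pud, unsigned long addr,
			 unsigned long next, void *private);
	int (*pmd_entry)(pmd_t *pmd, unsigned long addr,
			 unsigned long next, void *private);
	int (*pte_entry)(pte_t *pte, unsigned long addr,
			 unsigned long next, void *private);
	/* No vma / hugetlb / install_pte / walk_lock: none of that applies here. */
};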
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:31 ` David Hildenbrand
@ 2025-06-10 13:35 ` Lorenzo Stoakes
2025-06-10 13:44 ` David Hildenbrand
0 siblings, 1 reply; 17+ messages in thread
From: Lorenzo Stoakes @ 2025-06-10 13:35 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dev Jain, akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On Tue, Jun 10, 2025 at 03:31:56PM +0200, David Hildenbrand wrote:
> On 10.06.25 15:27, Lorenzo Stoakes wrote:
> > On Tue, Jun 10, 2025 at 03:24:16PM +0200, David Hildenbrand wrote:
> > > On 10.06.25 14:07, Lorenzo Stoakes wrote:
> > > > OK so I think the best solution here is to just update check_ops_valid(), which
> > > > was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
> > > > enforce the install pte thing).
> > > >
> > > > Let's do something like:
> > > >
> > > > #define OPS_MAY_INSTALL_PTE (1<<0)
> > > > #define OPS_MAY_AVOID_LOCK (1<<1)
> > > >
> > > > and update check_ops_valid() to take a flags or maybe 'capabilities' field.
> > > >
> > > > Then check based on this e.g.:
> > > >
> > > > if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
> > > > return false;
> > > >
> > > > if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
> > > > return false;
> > > >
> > >
> > > Hm. I mean, we really only want to allow this lockless check for
> > > walk_kernel_page_table_range(), right?
> > >
> > > Having a walk_kernel_page_table_range_lockless() might (or might not) be
> > > better, to really only special-case this specific path.
> >
> > Agree completely, Dev - let's definitely do this.
> >
> > >
> > > So, I am wondering if we should further start splitting the
> > > kernel-page-table walker up from the mm walker, at least on the "entry"
> > > function for now.
> >
> > How do you mean?
>
> In particular, "struct mm_walk_ops"
>
> does not quite make sense when not actually walking a "real" mm .
>
> So maybe we should start having a separate structure where *vma,
> install_pte, walk_lock, hugetlb* does not even exist.
>
> It might be a bit of churn, though ... not sure if there could be an easy
> translation layer for now.
But you know... I looove churn right? <3 <3 <3 :)))
That's a nice idea, but I think it's something that should be a follow up.
Quite honestly I hate a lot about this code. I did some refactoring before, and
I might do some again.
todo++; ;)
I can tie this together actually with Muchun's suggestions from
https://lore.kernel.org/all/1AA4A4B3-AEBE-484A-8EE2-35A15035E748@linux.dev/ in
my 'page walk improvement' todo sub-list...
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:35 ` Lorenzo Stoakes
@ 2025-06-10 13:44 ` David Hildenbrand
0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2025-06-10 13:44 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dev Jain, akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On 10.06.25 15:35, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 03:31:56PM +0200, David Hildenbrand wrote:
>> On 10.06.25 15:27, Lorenzo Stoakes wrote:
>>> On Tue, Jun 10, 2025 at 03:24:16PM +0200, David Hildenbrand wrote:
>>>> On 10.06.25 14:07, Lorenzo Stoakes wrote:
>>>>> OK so I think the best solution here is to just update check_ops_valid(), which
>>>>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>>>>> enforce the install pte thing).
>>>>>
>>>>> Let's do something like:
>>>>>
>>>>> #define OPS_MAY_INSTALL_PTE (1<<0)
>>>>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>>>>
>>>>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>>>>
>>>>> Then check based on this e.g.:
>>>>>
>>>>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>>>>> return false;
>>>>>
>>>>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>>>>> return false;
>>>>>
>>>>
>>>> Hm. I mean, we really only want to allow this lockless check for
>>>> walk_kernel_page_table_range(), right?
>>>>
>>>> Having a walk_kernel_page_table_range_lockless() might (or might not) be
>>>> better, to really only special-case this specific path.
>>>
>>> Agree completely, Dev - let's definitely do this.
>>>
>>>>
>>>> So, I am wondering if we should further start splitting the
>>>> kernel-page-table walker up from the mm walker, at least on the "entry"
>>>> function for now.
>>>
>>> How do you mean?
>>
>> In particular, "struct mm_walk_ops"
>>
>> does not quite make sense when not actually walking a "real" mm .
>>
>> So maybe we should start having a separate structure where *vma,
>> install_pte, walk_lock, hugetlb* does not even exist.
>>
>> It might be a bit of churn, though ... not sure if there could be an easy
>> translation layer for now.
>
> But you know... I looove churn right? <3 <3 <3 :)))
>
> That's a nice idea, but I think it's something that should be a follow up.
Yes, absolutely, just wanted to raise it :)
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
2025-06-10 11:44 ` [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
2025-06-10 13:14 ` David Hildenbrand
@ 2025-06-10 14:41 ` Ryan Roberts
2025-06-11 4:01 ` Dev Jain
1 sibling, 1 reply; 17+ messages in thread
From: Ryan Roberts @ 2025-06-10 14:41 UTC (permalink / raw)
To: Dev Jain, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, anshuman.khandual
On 10/06/2025 12:44, Dev Jain wrote:
> Since apply_to_page_range does not support operations on block mappings,
> use the generic pagewalk API to enable changing permissions for kernel
> block mappings. This paves the path for enabling huge mappings by default
> on kernel space mappings, thus leading to more efficient TLB usage.
>
> We only require that the start and end of a given range lie on leaf mapping
> boundaries. Return EINVAL in case a partial block mapping is detected; add
nit: return -EINVAL
> a corresponding comment in ___change_memory_common() to warn that
> eliminating such a condition is the responsibility of the caller.
>
> apply_to_page_range ultimately uses the lazy MMU hooks at the pte level
> function (apply_to_pte_range) - we want to use this functionality after
> this patch too. Ryan says:
> "The only reason we traditionally confine the lazy mmu mode to a single
> page table is because we want to enclose it within the PTL. But that
> requirement doesn't stand for kernel mappings. As long as the walker can
> guarantee that it doesn't allocate any memory (because with certain debug
> settings that can cause lazy mmu nesting) or try to sleep then I think we
> can just bracket the entire call."
It would be better to state the facts here rather than repeat what I previously
wrote on another thread :)
How about something like:
"""
apply_to_page_range() performs all pte level callbacks while in lazy mmu mode.
Since arm64 can optimize performance by batching barriers when modifying kernel
pgtables in lazy mmu mode, we would like to continue to benefit from this
optimisation. Unfortunately walk_kernel_page_table_range() does not use lazy mmu
mode. However, since the pagewalk framework is not allocating any memory, we can
safely bracket the whole operation inside lazy mmu mode ourselves.
"""
> Therefore, wrap the call to walk_kernel_page_table_range() with the
> lazy MMU helpers.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
> 1 file changed, 126 insertions(+), 32 deletions(-)
>
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 04d4a8f676db..2c118c0922ef 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -8,6 +8,7 @@
> #include <linux/mem_encrypt.h>
> #include <linux/sched.h>
> #include <linux/vmalloc.h>
> +#include <linux/pagewalk.h>
>
> #include <asm/cacheflush.h>
> #include <asm/pgtable-prot.h>
> @@ -20,6 +21,100 @@ struct page_change_data {
> pgprot_t clear_mask;
> };
>
> +static pteval_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
val is ptdesc_t on entry and pteval_t on return. Let's use ptdesc_t throughout
since it's intended to represent a "pgtable descriptor at any level".
> +{
> + struct page_change_data *masks = walk->private;
> +
> + val &= ~(pgprot_val(masks->clear_mask));
> + val |= (pgprot_val(masks->set_mask));
> +
> + return val;
> +}
> +
> +static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pgd_t val = pgdp_get(pgd);
> +
> + if (pgd_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
> + return -EINVAL;
> + val = __pgd(set_pageattr_masks(pgd_val(val), walk));
> + set_pgd(pgd, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + p4d_t val = p4dp_get(p4d);
> +
> + if (p4d_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
> + return -EINVAL;
> + val = __p4d(set_pageattr_masks(p4d_val(val), walk));
> + set_p4d(p4d, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pud_t val = pudp_get(pud);
> +
> + if (pud_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
> + return -EINVAL;
> + val = __pud(set_pageattr_masks(pud_val(val), walk));
> + set_pud(pud, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pmd_t val = pmdp_get(pmd);
> +
> + if (pmd_leaf(val)) {
> + if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
> + return -EINVAL;
> + val = __pmd(set_pageattr_masks(pmd_val(val), walk));
> + set_pmd(pmd, val);
> + walk->action = ACTION_CONTINUE;
> + }
> +
> + return 0;
> +}
> +
> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
> + unsigned long next, struct mm_walk *walk)
> +{
> + pte_t val = __ptep_get(pte);
> +
> + val = __pte(set_pageattr_masks(pte_val(val), walk));
> + __set_pte(pte, val);
> +
> + return 0;
> +}
> +
> +static const struct mm_walk_ops pageattr_ops = {
> + .pgd_entry = pageattr_pgd_entry,
> + .p4d_entry = pageattr_p4d_entry,
I may have been wrong when I told you that we would need to support pgd and p4d
for certain configs. Looking again at the code, I'm not sure.
Let's say we have 64K page size with 42 bit VA. That gives 2 levels of lookup.
We would normally think of that as a PGD table and a PTE table. But I think for
the purposes of this, pte_entry and pmd_entry will be called? I'm not really
sure - I don't have a great grasp on how pgtable folding works.
Trouble is, if pte_entry and pgd_entry get called (as I originally thought),
pgd_leaf() is always false on arm64, which would clearly be a bug...
I'm hoping someone else can pipe up to clarify. Or perhaps you can build the
config and do a test?
If it turns out that the "lower" callbacks will always be called, we should
probably remove pgd_entry and p4d_entry in the name of performance.
> + .pud_entry = pageattr_pud_entry,
> + .pmd_entry = pageattr_pmd_entry,
> + .pte_entry = pageattr_pte_entry,
> + .walk_lock = PGWALK_NOLOCK,
> +};
> +
> bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>
> bool can_set_direct_map(void)
> @@ -37,22 +132,7 @@ bool can_set_direct_map(void)
> arm64_kfence_can_set_direct_map() || is_realm_world();
> }
>
> -static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
> -{
> - struct page_change_data *cdata = data;
> - pte_t pte = __ptep_get(ptep);
> -
> - pte = clear_pte_bit(pte, cdata->clear_mask);
> - pte = set_pte_bit(pte, cdata->set_mask);
> -
> - __set_pte(ptep, pte);
> - return 0;
> -}
> -
> -/*
> - * This function assumes that the range is mapped with PAGE_SIZE pages.
> - */
> -static int __change_memory_common(unsigned long start, unsigned long size,
> +static int ___change_memory_common(unsigned long start, unsigned long size,
> pgprot_t set_mask, pgprot_t clear_mask)
> {
> struct page_change_data data;
> @@ -61,9 +141,28 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> data.set_mask = set_mask;
> data.clear_mask = clear_mask;
>
> - ret = apply_to_page_range(&init_mm, start, size, change_page_range,
> - &data);
> + arch_enter_lazy_mmu_mode();
> +
> + /*
> + * The caller must ensure that the range we are operating on does not
> + * partially overlap a block mapping. Any such case should either not
> + * exist, or must be eliminated by splitting the mapping - which for
> + * kernel mappings can be done only on BBML2 systems.
> + *
> + */
> + ret = walk_kernel_page_table_range(start, start + size, &pageattr_ops,
> + NULL, &data);
> + arch_leave_lazy_mmu_mode();
> +
> + return ret;
> +}
>
> +static int __change_memory_common(unsigned long start, unsigned long size,
> + pgprot_t set_mask, pgprot_t clear_mask)
> +{
> + int ret;
> +
> + ret = ___change_memory_common(start, size, set_mask, clear_mask);
> /*
> * If the memory is being made valid without changing any other bits
> * then a TLBI isn't required as a non-valid entry cannot be cached in
> @@ -71,6 +170,7 @@ static int __change_memory_common(unsigned long start, unsigned long size,
> */
> if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
> flush_tlb_kernel_range(start, start + size);
> +
> return ret;
> }
>
> @@ -174,32 +274,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>
> int set_direct_map_invalid_noflush(struct page *page)
> {
> - struct page_change_data data = {
> - .set_mask = __pgprot(0),
> - .clear_mask = __pgprot(PTE_VALID),
> - };
> + pgprot_t clear_mask = __pgprot(PTE_VALID);
> + pgprot_t set_mask = __pgprot(0);
>
> if (!can_set_direct_map())
> return 0;
>
> - return apply_to_page_range(&init_mm,
> - (unsigned long)page_address(page),
> - PAGE_SIZE, change_page_range, &data);
> + return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
> + set_mask, clear_mask);
> }
>
> int set_direct_map_default_noflush(struct page *page)
> {
> - struct page_change_data data = {
> - .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
> - .clear_mask = __pgprot(PTE_RDONLY),
> - };
> + pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
> + pgprot_t clear_mask = __pgprot(PTE_RDONLY);
>
> if (!can_set_direct_map())
> return 0;
>
> - return apply_to_page_range(&init_mm,
> - (unsigned long)page_address(page),
> - PAGE_SIZE, change_page_range, &data);
> + return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
nit: you're at 85 chars here. Given you are breaking anyway, why not put
PAGE_SIZE on the next line? Same for set_direct_map_invalid_noflush().
> + set_mask, clear_mask);
> }
>
> static int __set_memory_enc_dec(unsigned long addr,
This is looking good to me overall - nearly there!
Thanks,
Ryan
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 12:57 ` Lorenzo Stoakes
@ 2025-06-11 3:43 ` Dev Jain
0 siblings, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-06-11 3:43 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: akpm, david, catalin.marinas, will, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linux-mm, linux-kernel, suzuki.poulose,
steven.price, gshan, linux-arm-kernel, yang, ryan.roberts,
anshuman.khandual
On 10/06/25 6:27 pm, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 06:10:03PM +0530, Dev Jain wrote:
>> On 10/06/25 5:37 pm, Lorenzo Stoakes wrote:
>>> OK so I think the best solution here is to just update check_ops_valid(), which
>>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>>> enforce the install pte thing).
>>>
>>> Let's do something like:
>>>
>>> #define OPS_MAY_INSTALL_PTE (1<<0)
>>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>>
>>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>>
>>> Then check based on this e.g.:
>>>
>>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>>> return false;
>>>
>>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>>> return false;
>>>
>>> return true;
>>>
>>> Then update callers, most of whom can have '0' passed for this field, with
>>> walk_page_range_mm() setting OPS_MAY_INSTALL_PTE and
>>> walk_kernel_page_table_range() setting OPS_MAY_AVOID_LOCK.
>>>
>>> That way we check it all in one place, it's consistent, we no longer 'implicitly'
>>> don't check it for walk_page_range_mm() and all is neater.
>>>
>>> We do end up calling this predicate twice for walk_page_range(), so we could
>>> (should?) also make walk_page_range_mm() into a static __walk_page_range_mm()
>>> and have a separate walk_page_range_mm() that does this check.
>>>
>>> I think this will solve the interface issues I've raised below.
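Putting the quoted suggestion together, the check would presumably end up
looking roughly like this (a sketch; the flag names are taken from the text
above and the capabilities parameter is the new part):

#define OPS_MAY_INSTALL_PTE	(1 << 0)
#define OPS_MAY_AVOID_LOCK	(1 << 1)

static bool check_ops_valid(const struct mm_walk_ops *ops, unsigned int capabilities)
{
	/* Only walkers that may install PTEs can supply install_pte. */
	if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
		return false;

	/* Only walkers that may go lockless can use PGWALK_NOLOCK. */
	if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
		return false;

	return true;
}

with most callers passing 0, walk_page_range_mm() passing OPS_MAY_INSTALL_PTE
and walk_kernel_page_table_range() passing OPS_MAY_AVOID_LOCK.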
>> Makes sense, thank you for your suggestions.
> Thanks :)
>
> Sorry to be a pain but I think this will fit more nicely.
>
>>> Thanks!
>>>
>>> On Tue, Jun 10, 2025 at 05:14:00PM +0530, Dev Jain wrote:
>>>> arm64 currently changes permissions on vmalloc objects locklessly, via
>>>> apply_to_page_range. Patch 2 moves away from this to use the pagewalk API,
>>>> since a limitation of the former is to deny changing permissions for block
>>>> mappings. However, the API currently enforces the init_mm.mmap_lock to be
>>>> held. To avoid the unnecessary bottleneck of the mmap_lock for our usecase,
>>>> this patch extends this generic API to be used locklessly, so as to retain
>>>> the existing behaviour for changing permissions. Apart from this reason,
>>>> it is noted at [1] that KFENCE can manipulate kernel pgtable entries during
>>>> softirqs. It does this by calling set_memory_valid() -> __change_memory_common().
>>>> This being a non-sleepable context, we cannot take the init_mm mmap lock.
>>>>
>>>> Since such extension can potentially be dangerous for other callers
>>>> consuming the pagewalk API, explicitly disallow lockless traversal for
>>>> userspace pagetables by returning EINVAL. Add comments to highlight the
>>>> conditions under which we can use the API locklessly - no underlying VMA,
>>>> and the user having exclusive control over the range, thus guaranteeing no
>>>> concurrent access.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/pagewalk.h | 7 +++++++
>>>> mm/pagewalk.c | 23 ++++++++++++++++++-----
>>>> 2 files changed, 25 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
>>>> index 8ac2f6d6d2a3..5efd6541239b 100644
>>>> --- a/include/linux/pagewalk.h
>>>> +++ b/include/linux/pagewalk.h
>>>> @@ -14,6 +14,13 @@ enum page_walk_lock {
>>>> PGWALK_WRLOCK = 1,
>>>> /* vma is expected to be already write-locked during the walk */
>>>> PGWALK_WRLOCK_VERIFY = 2,
>>>> + /*
>>>> + * Walk without any lock. Use of this is only meant for the
>>>> + * case where there is no underlying VMA, and the user has
>>>> + * exclusive control over the range, guaranteeing no concurrent
>>>> + * access. For example, changing permissions of vmalloc objects.
>>>> + */
>>>> + PGWALK_NOLOCK = 3,
>>> Thanks for the comment! This is good.
>>>
>>>> };
>>>>
>>>> /**
>>>> diff --git a/mm/pagewalk.c b/mm/pagewalk.c
>>>> index ff5299eca687..d55d933f84ec 100644
>>>> --- a/mm/pagewalk.c
>>>> +++ b/mm/pagewalk.c
>>>> @@ -417,13 +417,17 @@ static int __walk_page_range(unsigned long start, unsigned long end,
>>>> return err;
>>>> }
>>>>
>>>> -static inline void process_mm_walk_lock(struct mm_struct *mm,
>>>> +static inline bool process_mm_walk_lock(struct mm_struct *mm,
>>>> enum page_walk_lock walk_lock)
>>> I don't like this signature at all, you don't describe what it does, and now it
>>> returns... whether it was not locked? I think this might lead to confusion :)
>>>
>>>
>>>> {
>>>> + if (walk_lock == PGWALK_NOLOCK)
>>>> + return 1;
>>> It's 2025, return true please :)
>>>
>>>> +
>>>> if (walk_lock == PGWALK_RDLOCK)
>>>> mmap_assert_locked(mm);
>>>> else
>>>> mmap_assert_write_locked(mm);
>>>> + return 0;
>>> It's 2025, return false please :)
>>>
>>>> }
>>>>
>>>> static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>>> @@ -440,6 +444,8 @@ static inline void process_vma_walk_lock(struct vm_area_struct *vma,
>>>> case PGWALK_RDLOCK:
>>>> /* PGWALK_RDLOCK is handled by process_mm_walk_lock */
>>>> break;
>>>> + case PGWALK_NOLOCK:
>>>> + break;
>>> Under what circumstances would we be fine with this function being invoked with
>>> no lock being specified?
>>>
>>> Isn't it the case that the one situation in which we can specify PGWALK_NOLOCK
>>> won't ever invoke this? Or did I miss a call of this function?
>>>
>>> If not, we should make this a VM_WARN_ON_ONCE(1);
>> I was made aware that there is a quest to remove BUG_ON()'s in the kernel, see [1].
>> Is there a similar problem with VM_WARN_ON_ONCE()?
> No, in fact the proposal is that we replace BUG_ON()'s with [VM_]WARN_ON_ONCE()'s.
>
> So this is all good, BUG_ON() is basically never needed unless you are _certain_
> there will be system instability that MUST BE STOPPED NOW.
>
> Which is almost never going to be the case.
>
> See
> https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> where I managed to somehow provoke Linus into giving some (very interesting!) input ;)
>
> But if you see the thread you can see further context on all this.
Thanks for the info.
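So the PGWALK_NOLOCK arm in process_vma_walk_lock() would then presumably
become something like (sketch):

	case PGWALK_NOLOCK:
		/* Lockless walks never go through VMAs, so this is a bug. */
		VM_WARN_ON_ONCE(1);
		break;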
>
>> [1]: https://lore.kernel.org/all/053ae9ec-1113-4ed8-9625-adf382070bc5@redhat.com/
>>
>>>> }
>>>> #endif
>>>> }
>>>> @@ -470,7 +476,8 @@ int walk_page_range_mm(struct mm_struct *mm, unsigned long start,
>>>> if (!walk.mm)
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>> This is just weird, you're treating the return like it were an error value (no
>>> it's a boolean), the name of the function doesn't expliain the 'verb' of what's
>>> happening, it's just confusing.
>>>
>>> Obviously I'm belabouring the point a bit, see suggestion at top :)
>>>
>>>> vma = find_vma(walk.mm, start);
>>>> do {
>>>> @@ -626,8 +633,12 @@ int walk_kernel_page_table_range(unsigned long start, unsigned long end,
>>>> * to prevent the intermediate kernel pages tables belonging to the
>>>> * specified address range from being freed. The caller should take
>>>> * other actions to prevent this race.
>>>> + *
>>>> + * If the caller can guarantee that it has exclusive access to the
>>>> + * specified address range, only then it can use PGWALK_NOLOCK.
>>>> */
>>>> - mmap_assert_locked(mm);
>>>> + if (ops->walk_lock != PGWALK_NOLOCK)
>>>> + mmap_assert_locked(mm);
>>>>
>>>> return walk_pgd_range(start, end, &walk);
>>>> }
>>>> @@ -699,7 +710,8 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
>>>> if (!check_ops_valid(ops))
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>>> process_vma_walk_lock(vma, ops->walk_lock);
>>>> return __walk_page_range(start, end, &walk);
>>>> }
>>>> @@ -719,7 +731,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
>>>> if (!check_ops_valid(ops))
>>>> return -EINVAL;
>>>>
>>>> - process_mm_walk_lock(walk.mm, ops->walk_lock);
>>>> + if (process_mm_walk_lock(walk.mm, ops->walk_lock))
>>>> + return -EINVAL;
>>>> process_vma_walk_lock(vma, ops->walk_lock);
>>>> return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
>>>> }
>>>> --
>>>> 2.30.2
>>>>
>>> Thanks!
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking
2025-06-10 13:27 ` Lorenzo Stoakes
2025-06-10 13:31 ` David Hildenbrand
@ 2025-06-11 3:45 ` Dev Jain
1 sibling, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-06-11 3:45 UTC (permalink / raw)
To: Lorenzo Stoakes, David Hildenbrand
Cc: akpm, catalin.marinas, will, Liam.Howlett, vbabka, rppt, surenb,
mhocko, linux-mm, linux-kernel, suzuki.poulose, steven.price,
gshan, linux-arm-kernel, yang, ryan.roberts, anshuman.khandual
On 10/06/25 6:57 pm, Lorenzo Stoakes wrote:
> On Tue, Jun 10, 2025 at 03:24:16PM +0200, David Hildenbrand wrote:
>> On 10.06.25 14:07, Lorenzo Stoakes wrote:
>>> OK so I think the best solution here is to just update check_ops_valid(), which
>>> was kind of sucky anyway (we check everywhere but walk_page_range_mm() to
>>> enforce the install pte thing).
>>>
>>> Let's do something like:
>>>
>>> #define OPS_MAY_INSTALL_PTE (1<<0)
>>> #define OPS_MAY_AVOID_LOCK (1<<1)
>>>
>>> and update check_ops_valid() to take a flags or maybe 'capabilities' field.
>>>
>>> Then check based on this e.g.:
>>>
>>> if (ops->install_pte && !(capabilities & OPS_MAY_INSTALL_PTE))
>>> return false;
>>>
>>> if (ops->walk_lock == PGWALK_NOLOCK && !(capabilities & OPS_MAY_AVOID_LOCK))
>>> return false;
>>>
>> Hm. I mean, we really only want to allow this lockless check for
>> walk_kernel_page_table_range(), right?
>>
>> Having a walk_kernel_page_table_range_lockless() might (or might not) be
>> better, to really only special-case this specific path.
> Agree completely, Dev - let's definitely do this.
Makes sense.
>
>> So, I am wondering if we should further start splitting the
>> kernel-page-table walker up from the mm walker, at least on the "entry"
>> function for now.
> How do you mean?
>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions
2025-06-10 14:41 ` Ryan Roberts
@ 2025-06-11 4:01 ` Dev Jain
0 siblings, 0 replies; 17+ messages in thread
From: Dev Jain @ 2025-06-11 4:01 UTC (permalink / raw)
To: Ryan Roberts, akpm, david, catalin.marinas, will
Cc: lorenzo.stoakes, Liam.Howlett, vbabka, rppt, surenb, mhocko,
linux-mm, linux-kernel, suzuki.poulose, steven.price, gshan,
linux-arm-kernel, yang, anshuman.khandual
On 10/06/25 8:11 pm, Ryan Roberts wrote:
> On 10/06/2025 12:44, Dev Jain wrote:
>> Since apply_to_page_range does not support operations on block mappings,
>> use the generic pagewalk API to enable changing permissions for kernel
>> block mappings. This paves the path for enabling huge mappings by default
>> on kernel space mappings, thus leading to more efficient TLB usage.
>>
>> We only require that the start and end of a given range lie on leaf mapping
>> boundaries. Return EINVAL in case a partial block mapping is detected; add
> nit: return -EINVAL
My intention here was to say "Return invalid argument error", so I omitted the
sign, but anyway I'll do that.
>
>> a corresponding comment in ___change_memory_common() to warn that
>> eliminating such a condition is the responsibility of the caller.
>>
>> apply_to_page_range ultimately uses the lazy MMU hooks at the pte level
>> function (apply_to_pte_range) - we want to use this functionality after
>> this patch too. Ryan says:
>> "The only reason we traditionally confine the lazy mmu mode to a single
>> page table is because we want to enclose it within the PTL. But that
>> requirement doesn't stand for kernel mappings. As long as the walker can
>> guarantee that it doesn't allocate any memory (because with certain debug
>> settings that can cause lazy mmu nesting) or try to sleep then I think we
>> can just bracket the entire call."
> It would be better to state the facts here rather than repeat what I previously
> wrote on another thread :)
>
> How about something like:
>
> """
> apply_to_page_range() performs all pte level callbacks while in lazy mmu mode.
> Since arm64 can optimize performance by batching barriers when modifying kernel
> pgtables in lazy mmu mode, we would like to continue to benefit from this
> optimisation. Unfortunately walk_kernel_page_table_range() does not use lazy mmu
> mode. However, since the pagewalk framework is not allocating any memory, we can
> safely bracket the whole operation inside lazy mmu mode ourselves.
> """
I didn't properly understand what you were saying, so I copied it verbatim : )
How does not allocating memory imply that we can bracket the whole call with
lazy mmu? As I understand it, the logic should be that we can bracket a call
if we can ensure that, within that call, the VAs corresponding to the entries
being updated are never accessed before we exit lazy MMU mode; therefore we
can delay the barriers until that point.
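To make sure we are talking about the same thing, this is the pattern in the
patch with those assumptions spelled out (just a restatement, not new code):

	arch_enter_lazy_mmu_mode();
	/*
	 * Nothing in here may sleep or allocate (allocation can nest lazy
	 * mmu under some debug configs), and the VAs whose entries are being
	 * rewritten must not be accessed, because the barriers are deferred
	 * until we leave lazy mmu mode below.
	 */
	ret = walk_kernel_page_table_range(start, start + size, &pageattr_ops,
					   NULL, &data);
	arch_leave_lazy_mmu_mode();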
>> Therefore, wrap the call to walk_kernel_page_table_range() with the
>> lazy MMU helpers.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> arch/arm64/mm/pageattr.c | 158 +++++++++++++++++++++++++++++++--------
>> 1 file changed, 126 insertions(+), 32 deletions(-)
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 04d4a8f676db..2c118c0922ef 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -8,6 +8,7 @@
>> #include <linux/mem_encrypt.h>
>> #include <linux/sched.h>
>> #include <linux/vmalloc.h>
>> +#include <linux/pagewalk.h>
>>
>> #include <asm/cacheflush.h>
>> #include <asm/pgtable-prot.h>
>> @@ -20,6 +21,100 @@ struct page_change_data {
>> pgprot_t clear_mask;
>> };
>>
>> +static pteval_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
> val is ptdesc_t on entry and pteval_t on return. Let's use ptdesc_t throughout
> since it's intended to represent a "pgtable descriptor at any level".
Sure.
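i.e. presumably just (sketch; body unchanged apart from the return type):

static ptdesc_t set_pageattr_masks(ptdesc_t val, struct mm_walk *walk)
{
	struct page_change_data *masks = walk->private;

	val &= ~(pgprot_val(masks->clear_mask));
	val |= (pgprot_val(masks->set_mask));

	return val;
}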
>
>> +{
>> + struct page_change_data *masks = walk->private;
>> +
>> + val &= ~(pgprot_val(masks->clear_mask));
>> + val |= (pgprot_val(masks->set_mask));
>> +
>> + return val;
>> +}
>> +
>> +static int pageattr_pgd_entry(pgd_t *pgd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pgd_t val = pgdp_get(pgd);
>> +
>> + if (pgd_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PGDIR_SIZE))
>> + return -EINVAL;
>> + val = __pgd(set_pageattr_masks(pgd_val(val), walk));
>> + set_pgd(pgd, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_p4d_entry(p4d_t *p4d, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + p4d_t val = p4dp_get(p4d);
>> +
>> + if (p4d_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != P4D_SIZE))
>> + return -EINVAL;
>> + val = __p4d(set_pageattr_masks(p4d_val(val), walk));
>> + set_p4d(p4d, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_pud_entry(pud_t *pud, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pud_t val = pudp_get(pud);
>> +
>> + if (pud_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PUD_SIZE))
>> + return -EINVAL;
>> + val = __pud(set_pageattr_masks(pud_val(val), walk));
>> + set_pud(pud, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_pmd_entry(pmd_t *pmd, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pmd_t val = pmdp_get(pmd);
>> +
>> + if (pmd_leaf(val)) {
>> + if (WARN_ON_ONCE((next - addr) != PMD_SIZE))
>> + return -EINVAL;
>> + val = __pmd(set_pageattr_masks(pmd_val(val), walk));
>> + set_pmd(pmd, val);
>> + walk->action = ACTION_CONTINUE;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static int pageattr_pte_entry(pte_t *pte, unsigned long addr,
>> + unsigned long next, struct mm_walk *walk)
>> +{
>> + pte_t val = __ptep_get(pte);
>> +
>> + val = __pte(set_pageattr_masks(pte_val(val), walk));
>> + __set_pte(pte, val);
>> +
>> + return 0;
>> +}
>> +
>> +static const struct mm_walk_ops pageattr_ops = {
>> + .pgd_entry = pageattr_pgd_entry,
>> + .p4d_entry = pageattr_p4d_entry,
> I may have been wrong when I told you that we would need to support pgd and p4d
> for certain configs. Looking again at the code, I'm not sure.
>
> Let's say we have 64K page size with 42 bit VA. That gives 2 levels of lookup.
> We would normally think of that as a PGD table and a PTE table. But I think for
> the purposes of this, pte_entry and pmd_entry will be called? I'm not really
> sure - I don't have a great grasp on how pgtable folding works.
As far as I remember it should be pmd and pte.
>
> Trouble is, if pte_entry and pgd_entry get called (as I originally thought),
> pgd_leaf() is always false on arm64, which would clearly be a bug...
>
> I'm hoping someone else can pipe up to clarify. Or perhaps you can build the
> config and do a test?
>
> If it turns out that the "lower" callbacks will always be called, we should
> probably remove pgd_entry and p4d_entry in the name of performance.
I shall confirm this, thanks.
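If the lower callbacks do always get called on the folded configs, the ops
could then shrink to something like (sketch):

static const struct mm_walk_ops pageattr_ops = {
	.pud_entry	= pageattr_pud_entry,
	.pmd_entry	= pageattr_pmd_entry,
	.pte_entry	= pageattr_pte_entry,
	.walk_lock	= PGWALK_NOLOCK,
};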
>
>> + .pud_entry = pageattr_pud_entry,
>> + .pmd_entry = pageattr_pmd_entry,
>> + .pte_entry = pageattr_pte_entry,
>> + .walk_lock = PGWALK_NOLOCK,
>> +};
>> +
>> bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
>>
>> bool can_set_direct_map(void)
>> @@ -37,22 +132,7 @@ bool can_set_direct_map(void)
>> arm64_kfence_can_set_direct_map() || is_realm_world();
>> }
>>
>> -static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
>> -{
>> - struct page_change_data *cdata = data;
>> - pte_t pte = __ptep_get(ptep);
>> -
>> - pte = clear_pte_bit(pte, cdata->clear_mask);
>> - pte = set_pte_bit(pte, cdata->set_mask);
>> -
>> - __set_pte(ptep, pte);
>> - return 0;
>> -}
>> -
>> -/*
>> - * This function assumes that the range is mapped with PAGE_SIZE pages.
>> - */
>> -static int __change_memory_common(unsigned long start, unsigned long size,
>> +static int ___change_memory_common(unsigned long start, unsigned long size,
>> pgprot_t set_mask, pgprot_t clear_mask)
>> {
>> struct page_change_data data;
>> @@ -61,9 +141,28 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>> data.set_mask = set_mask;
>> data.clear_mask = clear_mask;
>>
>> - ret = apply_to_page_range(&init_mm, start, size, change_page_range,
>> - &data);
>> + arch_enter_lazy_mmu_mode();
>> +
>> + /*
>> + * The caller must ensure that the range we are operating on does not
>> + * partially overlap a block mapping. Any such case should either not
>> + * exist, or must be eliminated by splitting the mapping - which for
>> + * kernel mappings can be done only on BBML2 systems.
>> + *
>> + */
>> + ret = walk_kernel_page_table_range(start, start + size, &pageattr_ops,
>> + NULL, &data);
>> + arch_leave_lazy_mmu_mode();
>> +
>> + return ret;
>> +}
>>
>> +static int __change_memory_common(unsigned long start, unsigned long size,
>> + pgprot_t set_mask, pgprot_t clear_mask)
>> +{
>> + int ret;
>> +
>> + ret = ___change_memory_common(start, size, set_mask, clear_mask);
>> /*
>> * If the memory is being made valid without changing any other bits
>> * then a TLBI isn't required as a non-valid entry cannot be cached in
>> @@ -71,6 +170,7 @@ static int __change_memory_common(unsigned long start, unsigned long size,
>> */
>> if (pgprot_val(set_mask) != PTE_VALID || pgprot_val(clear_mask))
>> flush_tlb_kernel_range(start, start + size);
>> +
>> return ret;
>> }
>>
>> @@ -174,32 +274,26 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
>>
>> int set_direct_map_invalid_noflush(struct page *page)
>> {
>> - struct page_change_data data = {
>> - .set_mask = __pgprot(0),
>> - .clear_mask = __pgprot(PTE_VALID),
>> - };
>> + pgprot_t clear_mask = __pgprot(PTE_VALID);
>> + pgprot_t set_mask = __pgprot(0);
>>
>> if (!can_set_direct_map())
>> return 0;
>>
>> - return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
>> + set_mask, clear_mask);
>> }
>>
>> int set_direct_map_default_noflush(struct page *page)
>> {
>> - struct page_change_data data = {
>> - .set_mask = __pgprot(PTE_VALID | PTE_WRITE),
>> - .clear_mask = __pgprot(PTE_RDONLY),
>> - };
>> + pgprot_t set_mask = __pgprot(PTE_VALID | PTE_WRITE);
>> + pgprot_t clear_mask = __pgprot(PTE_RDONLY);
>>
>> if (!can_set_direct_map())
>> return 0;
>>
>> - return apply_to_page_range(&init_mm,
>> - (unsigned long)page_address(page),
>> - PAGE_SIZE, change_page_range, &data);
>> + return ___change_memory_common((unsigned long)page_address(page), PAGE_SIZE,
> nit: you're at 85 chars here. Given you are breaking anyway, why not put
> PAGE_SIZE on the next line? Same for set_direct_map_invalid_noflush().
Makes sense.
>
>> + set_mask, clear_mask);
>> }
>>
>> static int __set_memory_enc_dec(unsigned long addr,
>
> This is looking good to me overall - nearly there!
>
> Thanks,
> Ryan
>
>
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2025-06-11 4:02 UTC | newest]
Thread overview: 17+ messages
2025-06-10 11:43 [PATCH v2 0/2] Enable permission change on arm64 kernel block mappings Dev Jain
2025-06-10 11:44 ` [PATCH v2 1/2] mm: Allow lockless kernel pagetable walking Dev Jain
2025-06-10 12:07 ` Lorenzo Stoakes
2025-06-10 12:40 ` Dev Jain
2025-06-10 12:57 ` Lorenzo Stoakes
2025-06-11 3:43 ` Dev Jain
2025-06-10 13:24 ` David Hildenbrand
2025-06-10 13:25 ` David Hildenbrand
2025-06-10 13:27 ` Lorenzo Stoakes
2025-06-10 13:31 ` David Hildenbrand
2025-06-10 13:35 ` Lorenzo Stoakes
2025-06-10 13:44 ` David Hildenbrand
2025-06-11 3:45 ` Dev Jain
2025-06-10 11:44 ` [PATCH v2 2/2] arm64: pageattr: Use walk_page_range_novma() to change memory permissions Dev Jain
2025-06-10 13:14 ` David Hildenbrand
2025-06-10 14:41 ` Ryan Roberts
2025-06-11 4:01 ` Dev Jain