linux-kernel.vger.kernel.org archive mirror
* [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
@ 2025-06-06 13:56 Ryan Roberts
  2025-06-10 12:00 ` Catalin Marinas
  0 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2025-06-06 13:56 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon; +Cc: Ryan Roberts, linux-arm-kernel, linux-kernel

Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
provided a quick fix to ensure that lazy_mmu_mode continues to work when
CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
nest.

The solution in that patch is the make the implementation tolerant to
nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
exit becomes a nop. But this sacrifices the optimization opportunity for
the remainder of the outer user.

So let's take a different approach and simply ensure the nesting never
happens in the first place. The nesting is caused when the page
allocator calls out to __kernel_map_pages() which then eventually calls
apply_to_page_range(), which calls arch_enter_lazy_mmu_mode(). So simply
notice if we are in lazy_mmu_mode in __kernel_map_pages() and
temporarily exit.

With that approach, we can effectively revert Commit 1ef3095b1405
("arm64/mm: Permit lazy_mmu_mode to be nested"), re-enabling the VM_WARN
if we ever detect nesting in future.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---

I wonder if you might be willing to take this for v6.16? I think it's a neater
solution than my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
lazy_mmu_mode to be nested") - which is already in Linus's master.

To be clear, the current solution is safe, I just think this is much neater.

Applies on today's master branch (e271ed52b344).

Thanks,
Ryan

 arch/arm64/include/asm/pgtable.h | 22 ++++++++++------------
 arch/arm64/mm/pageattr.c         | 23 +++++++++++++++++------
 2 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 88db8a0c0b37..9f387337ccc3 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -83,21 +83,11 @@ static inline void queue_pte_barriers(void)
 #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
 static inline void arch_enter_lazy_mmu_mode(void)
 {
-	/*
-	 * lazy_mmu_mode is not supposed to permit nesting. But in practice this
-	 * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
-	 * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
-	 * permissions on the linear map with apply_to_page_range(), which
-	 * re-enters lazy_mmu_mode. So we tolerate nesting in our
-	 * implementation. The first call to arch_leave_lazy_mmu_mode() will
-	 * flush and clear the flag such that the remainder of the work in the
-	 * outer nest behaves as if outside of lazy mmu mode. This is safe and
-	 * keeps tracking simple.
-	 */
-
 	if (in_interrupt())
 		return;

+	VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
+
 	set_thread_flag(TIF_LAZY_MMU);
 }

@@ -119,6 +109,14 @@ static inline void arch_leave_lazy_mmu_mode(void)
 	clear_thread_flag(TIF_LAZY_MMU);
 }

+static inline bool arch_in_lazy_mmu_mode(void)
+{
+	if (in_interrupt())
+		return false;
+
+	return test_thread_flag(TIF_LAZY_MMU);
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE

diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 04d4a8f676db..4da7a847d5f3 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -293,18 +293,29 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
 }

 #ifdef CONFIG_DEBUG_PAGEALLOC
-/*
- * This is - apart from the return value - doing the same
- * thing as the new set_direct_map_valid_noflush() function.
- *
- * Unify? Explain the conceptual differences?
- */
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
+	bool lazy_mmu;
+
 	if (!can_set_direct_map())
 		return;

+	/*
+	 * This is called during page alloc or free, and may be called while in
+	 * lazy mmu mode. Since set_memory_valid() may also enter lazy mmu mode,
+	 * this would cause nesting which is not supported; the inner call to
+	 * exit the mode would exit, meaning that the outer lazy mmu mode is no
+	 * longer benefiting from the optimization. So temporarily leave lazy
+	 * mmu mode for the duration of the call.
+	 */
+	lazy_mmu = arch_in_lazy_mmu_mode();
+	if (lazy_mmu)
+		arch_leave_lazy_mmu_mode();
+
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
+
+	if (lazy_mmu)
+		arch_enter_lazy_mmu_mode();
 }
 #endif /* CONFIG_DEBUG_PAGEALLOC */

--
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-06 13:56 [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests Ryan Roberts
@ 2025-06-10 12:00 ` Catalin Marinas
  2025-06-10 13:41   ` Ryan Roberts
  0 siblings, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2025-06-10 12:00 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Will Deacon, linux-arm-kernel, linux-kernel

On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> provided a quick fix to ensure that lazy_mmu_mode continues to work when
> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
> nest.
> 
> The solution in that patch is the make the implementation tolerant to

s/is the make/is to make/

> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
> exit becomes a nop. But this sacrifices the optimization opportunity for
> the remainder of the outer user.
[...]
> I wonder if you might be willing to take this for v6.16? I think its a neater
> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
> lazy_mmu_mode to be nested") - which is already in Linus's master.
> 
> To be clear, the current solution is safe, I just think this is much neater.

Maybe better, though I wouldn't say much neater. One concern I have is
about whether we'll get other such nesting in the future and we need to
fix them in generic code. Here we control __kernel_map_pages() but we
may not for other cases.

Is it the fault of the arch code that uses apply_to_page_range() via
__kernel_map_pages()? It feels like it shouldn't care about the lazy
mode as that's some detail of the apply_to_page_range() implementation.
Maybe this API should just allow nesting.

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 88db8a0c0b37..9f387337ccc3 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -83,21 +83,11 @@ static inline void queue_pte_barriers(void)
>  #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>  static inline void arch_enter_lazy_mmu_mode(void)
>  {
> -	/*
> -	 * lazy_mmu_mode is not supposed to permit nesting. But in practice this
> -	 * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
> -	 * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
> -	 * permissions on the linear map with apply_to_page_range(), which
> -	 * re-enters lazy_mmu_mode. So we tolerate nesting in our
> -	 * implementation. The first call to arch_leave_lazy_mmu_mode() will
> -	 * flush and clear the flag such that the remainder of the work in the
> -	 * outer nest behaves as if outside of lazy mmu mode. This is safe and
> -	 * keeps tracking simple.
> -	 */
> -
>  	if (in_interrupt())
>  		return;
> 
> +	VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));

This warning is good to have back.

> +
>  	set_thread_flag(TIF_LAZY_MMU);
>  }
> 
> @@ -119,6 +109,14 @@ static inline void arch_leave_lazy_mmu_mode(void)
>  	clear_thread_flag(TIF_LAZY_MMU);
>  }
> 
> +static inline bool arch_in_lazy_mmu_mode(void)
> +{
> +	if (in_interrupt())
> +		return false;
> +
> +	return test_thread_flag(TIF_LAZY_MMU);
> +}
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
> 
> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
> index 04d4a8f676db..4da7a847d5f3 100644
> --- a/arch/arm64/mm/pageattr.c
> +++ b/arch/arm64/mm/pageattr.c
> @@ -293,18 +293,29 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
>  }
> 
>  #ifdef CONFIG_DEBUG_PAGEALLOC
> -/*
> - * This is - apart from the return value - doing the same
> - * thing as the new set_direct_map_valid_noflush() function.
> - *
> - * Unify? Explain the conceptual differences?
> - */
>  void __kernel_map_pages(struct page *page, int numpages, int enable)
>  {
> +	bool lazy_mmu;
> +
>  	if (!can_set_direct_map())
>  		return;
> 
> +	/*
> +	 * This is called during page alloc or free, and maybe called while in
> +	 * lazy mmu mode. Since set_memory_valid() may also enter lazy mmu mode,
> +	 * this would cause nesting which is not supported; the inner call to
> +	 * exit the mode would exit, meaning that the outer lazy mmu mode is no
> +	 * longer benefiting from the optimization. So temporarily leave lazy
> +	 * mmu mode for the duration of the call.
> +	 */
> +	lazy_mmu = arch_in_lazy_mmu_mode();
> +	if (lazy_mmu)
> +		arch_leave_lazy_mmu_mode();
> +
>  	set_memory_valid((unsigned long)page_address(page), numpages, enable);
> +
> +	if (lazy_mmu)
> +		arch_enter_lazy_mmu_mode();
>  }
>  #endif /* CONFIG_DEBUG_PAGEALLOC */

So basically you are flattening the enter/leave_lazy_mmu_mode() regions.
Ideally this could have been done by the nesting
arch_enter_lazy_mmu_mode() automatically but that means this function
returning the current mode and arch_leave_lazy_mmu_mode() taking an
argument - more like the irq saving/restoring (even better renaming it
to arch_restore_lazy_mmu_mode()). I guess this won't go well with the mm
folk who don't seem open to changes in this area.

FWIW, this patch is correct.

-- 
Catalin


* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-10 12:00 ` Catalin Marinas
@ 2025-06-10 13:41   ` Ryan Roberts
  2025-06-10 15:07     ` Catalin Marinas
  0 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2025-06-10 13:41 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, linux-arm-kernel, linux-kernel

On 10/06/2025 13:00, Catalin Marinas wrote:
> On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
>> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
>> provided a quick fix to ensure that lazy_mmu_mode continues to work when
>> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
>> nest.
>>
>> The solution in that patch is the make the implementation tolerant to
> 
> s/is the make/is to make/
> 
>> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
>> exit becomes a nop. But this sacrifices the optimization opportunity for
>> the remainder of the outer user.
> [...]
>> I wonder if you might be willing to take this for v6.16? I think its a neater
>> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
>> lazy_mmu_mode to be nested") - which is already in Linus's master.
>>
>> To be clear, the current solution is safe, I just think this is much neater.
> 
> Maybe better, though I wouldn't say much neater. One concern I have is
> about whether we'll get other such nesting in the future and we need to
> fix them in generic code. Here we control __kernel_map_pages() but we
> may not for other cases.
> 
> Is it the fault of the arch code that uses apply_to_page_range() via
> __kernel_map_pages()? It feels like it shouldn't care about the lazy
> mode as that's some detail of the apply_to_page_range() implementation.
> Maybe this API should just allow nesting.

I don't think it is possible to properly support nesting:

enter_lazy_mmu
    for_each_pte {
        read/modify-write pte

        alloc_page
            enter_lazy_mmu
                make page valid
            exit_lazy_mmu

        write_to_page
    }
exit_lazy_mmu

This example only works because lazy_mmu doesn't support nesting. The "make page
valid" operation is completed by the time of the inner exit_lazy_mmu so that the
page can be accessed in write_to_page. If nesting was supported, the inner
exit_lazy_mmu would become a nop and write_to_page would explode.

So the conclusion I eventually came to (after being nudged by Mike Rapoport at
[1]) is that this _is_ arm64's fault for creating a loop via
apply_to_page_range(). So I'm trying to fix this by breaking the loop.

[1] https://lore.kernel.org/all/aDqz7H-oBo35FRXe@kernel.org/

> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 88db8a0c0b37..9f387337ccc3 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -83,21 +83,11 @@ static inline void queue_pte_barriers(void)
>>  #define  __HAVE_ARCH_ENTER_LAZY_MMU_MODE
>>  static inline void arch_enter_lazy_mmu_mode(void)
>>  {
>> -	/*
>> -	 * lazy_mmu_mode is not supposed to permit nesting. But in practice this
>> -	 * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation
>> -	 * inside a lazy_mmu_mode section (such as zap_pte_range()) will change
>> -	 * permissions on the linear map with apply_to_page_range(), which
>> -	 * re-enters lazy_mmu_mode. So we tolerate nesting in our
>> -	 * implementation. The first call to arch_leave_lazy_mmu_mode() will
>> -	 * flush and clear the flag such that the remainder of the work in the
>> -	 * outer nest behaves as if outside of lazy mmu mode. This is safe and
>> -	 * keeps tracking simple.
>> -	 */
>> -
>>  	if (in_interrupt())
>>  		return;
>>
>> +	VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));
> 
> This warning is good to have back.
> 
>> +
>>  	set_thread_flag(TIF_LAZY_MMU);
>>  }
>>
>> @@ -119,6 +109,14 @@ static inline void arch_leave_lazy_mmu_mode(void)
>>  	clear_thread_flag(TIF_LAZY_MMU);
>>  }
>>
>> +static inline bool arch_in_lazy_mmu_mode(void)
>> +{
>> +	if (in_interrupt())
>> +		return false;
>> +
>> +	return test_thread_flag(TIF_LAZY_MMU);
>> +}
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
>>
>> diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
>> index 04d4a8f676db..4da7a847d5f3 100644
>> --- a/arch/arm64/mm/pageattr.c
>> +++ b/arch/arm64/mm/pageattr.c
>> @@ -293,18 +293,29 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid)
>>  }
>>
>>  #ifdef CONFIG_DEBUG_PAGEALLOC
>> -/*
>> - * This is - apart from the return value - doing the same
>> - * thing as the new set_direct_map_valid_noflush() function.
>> - *
>> - * Unify? Explain the conceptual differences?
>> - */
>>  void __kernel_map_pages(struct page *page, int numpages, int enable)
>>  {
>> +	bool lazy_mmu;
>> +
>>  	if (!can_set_direct_map())
>>  		return;
>>
>> +	/*
>> +	 * This is called during page alloc or free, and maybe called while in
>> +	 * lazy mmu mode. Since set_memory_valid() may also enter lazy mmu mode,
>> +	 * this would cause nesting which is not supported; the inner call to
>> +	 * exit the mode would exit, meaning that the outer lazy mmu mode is no
>> +	 * longer benefiting from the optimization. So temporarily leave lazy
>> +	 * mmu mode for the duration of the call.
>> +	 */
>> +	lazy_mmu = arch_in_lazy_mmu_mode();
>> +	if (lazy_mmu)
>> +		arch_leave_lazy_mmu_mode();
>> +
>>  	set_memory_valid((unsigned long)page_address(page), numpages, enable);
>> +
>> +	if (lazy_mmu)
>> +		arch_enter_lazy_mmu_mode();
>>  }
>>  #endif /* CONFIG_DEBUG_PAGEALLOC */
> 
> So basically you are flattening the enter/leave_lazy_mmu_mode() regions.
> Ideally this could have been done by the nesting
> arch_enter_lazy_mmu_mode() automatically but that means this function
> returning the current mode and arch_leave_lazy_mmu_mode() taking an
> argument - more like the irq saving/restoring (even better renaming it
> to arch_restore_lazy_mmu_mode()). I guess this won't go well with the mm
> folk who don't seem willing to changes in this area.

We could alternatively use some per-cpu storage for a nest count, but that gets
ugly quite quickly I suspect. But regardless, I'm not convinced the semantics of
a properly nested lazy_mmu are safe.

Thanks,
Ryan

> 
> FWIW, this patch is correct.
> 



* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-10 13:41   ` Ryan Roberts
@ 2025-06-10 15:07     ` Catalin Marinas
  2025-06-10 16:37       ` Ryan Roberts
  0 siblings, 1 reply; 9+ messages in thread
From: Catalin Marinas @ 2025-06-10 15:07 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Will Deacon, linux-arm-kernel, linux-kernel

On Tue, Jun 10, 2025 at 02:41:01PM +0100, Ryan Roberts wrote:
> On 10/06/2025 13:00, Catalin Marinas wrote:
> > On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
> >> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> >> provided a quick fix to ensure that lazy_mmu_mode continues to work when
> >> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
> >> nest.
> >>
> >> The solution in that patch is the make the implementation tolerant to
> > 
> > s/is the make/is to make/
> > 
> >> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
> >> exit becomes a nop. But this sacrifices the optimization opportunity for
> >> the remainder of the outer user.
> > [...]
> >> I wonder if you might be willing to take this for v6.16? I think its a neater
> >> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
> >> lazy_mmu_mode to be nested") - which is already in Linus's master.
> >>
> >> To be clear, the current solution is safe, I just think this is much neater.
> > 
> > Maybe better, though I wouldn't say much neater. One concern I have is
> > about whether we'll get other such nesting in the future and we need to
> > fix them in generic code. Here we control __kernel_map_pages() but we
> > may not for other cases.
> > 
> > Is it the fault of the arch code that uses apply_to_page_range() via
> > __kernel_map_pages()? It feels like it shouldn't care about the lazy
> > mode as that's some detail of the apply_to_page_range() implementation.
> > Maybe this API should just allow nesting.
> 
> I don't think it is possible to properly support nesting:
> 
> enter_lazy_mmu
>     for_each_pte {
>         read/modify-write pte
> 
>         alloc_page
>             enter_lazy_mmu
>                 make page valid
>             exit_lazy_mmu
> 
>         write_to_page
>     }
> exit_lazy_mmu
> 
> This example only works because lazy_mmu doesn't support nesting. The "make page
> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
> page can be accessed in write_to_page. If nesting was supported, the inner
> exit_lazy_mmu would become a nop and write_to_page would explode.

What I meant is for enter/exit_lazy_mmu to handle a kind of de-nesting
themselves: enter_lazy_mmu would emit the barriers if already in lazy
mode, clear pending (well, it doesn't need to do this but it may be
easier to reason about in terms of flattening). exit_lazy_mmu also needs
to emit the barriers but leave the lazy mode on if already on when last
entered. This does need some API modifications to return the old mode on
enter and get an argument for exit. But the correctness wouldn't be
affected since exit_lazy_mmu still emits the barriers irrespective of
the nesting level.

> So the conclusion I eventually came to (after being nudged by Mike Rapoport at
> [1]) is that this _is_ arm64's fault for creating a loop via
> apply_to_page_range(). So I'm trying to fix this by breaking the loop.
> 
> [1] https://lore.kernel.org/all/aDqz7H-oBo35FRXe@kernel.org/

If the only such loop is the arm64 code, fine by me to fix it this way.
Your series above had 6 patches and I thought there's more to fix.

> We could alternatively use some per-cpu storage for a nest count, but that gets
> ugly quite quickly I suspect.

Yeah, the thread flag is much nicer.

> But regardless, I'm not convinced the semantics of a properly nested
> lazy_mmu are safe.

I think we can make them safe but there would be opposition from the mm
people and it may not be trivial on x86. So, I'm fine with this arm64
specific change:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-10 15:07     ` Catalin Marinas
@ 2025-06-10 16:37       ` Ryan Roberts
  2025-06-12 14:59         ` Will Deacon
  0 siblings, 1 reply; 9+ messages in thread
From: Ryan Roberts @ 2025-06-10 16:37 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Will Deacon, linux-arm-kernel, linux-kernel

On 10/06/2025 16:07, Catalin Marinas wrote:
> On Tue, Jun 10, 2025 at 02:41:01PM +0100, Ryan Roberts wrote:
>> On 10/06/2025 13:00, Catalin Marinas wrote:
>>> On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
>>>> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
>>>> provided a quick fix to ensure that lazy_mmu_mode continues to work when
>>>> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
>>>> nest.
>>>>
>>>> The solution in that patch is the make the implementation tolerant to
>>>
>>> s/is the make/is to make/
>>>
>>>> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
>>>> exit becomes a nop. But this sacrifices the optimization opportunity for
>>>> the remainder of the outer user.
>>> [...]
>>>> I wonder if you might be willing to take this for v6.16? I think its a neater
>>>> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
>>>> lazy_mmu_mode to be nested") - which is already in Linus's master.
>>>>
>>>> To be clear, the current solution is safe, I just think this is much neater.
>>>
>>> Maybe better, though I wouldn't say much neater. One concern I have is
>>> about whether we'll get other such nesting in the future and we need to
>>> fix them in generic code. Here we control __kernel_map_pages() but we
>>> may not for other cases.
>>>
>>> Is it the fault of the arch code that uses apply_to_page_range() via
>>> __kernel_map_pages()? It feels like it shouldn't care about the lazy
>>> mode as that's some detail of the apply_to_page_range() implementation.
>>> Maybe this API should just allow nesting.
>>
>> I don't think it is possible to properly support nesting:
>>
>> enter_lazy_mmu
>>     for_each_pte {
>>         read/modify-write pte
>>
>>         alloc_page
>>             enter_lazy_mmu
>>                 make page valid
>>             exit_lazy_mmu
>>
>>         write_to_page
>>     }
>> exit_lazy_mmu
>>
>> This example only works because lazy_mmu doesn't support nesting. The "make page
>> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
>> page can be accessed in write_to_page. If nesting was supported, the inner
>> exit_lazy_mmu would become a nop and write_to_page would explode.
> 
> What I meant is for enter/exit_lazy_mmu to handle a kind of de-nesting
> themselves: enter_lazy_mmu would emit the barriers if already in lazy
> mode, clear pending (well, it doesn't need to do this but it may be
> easier to reason about in terms of flattening). exit_lazy_mmu also needs
> to emit the barriers but leave the lazy mode on if already on when last
> entered. This does need some API modifications to return the old mode on
> enter and get an argument for exit. But the correctness wouldn't be
> affected since exit_lazy_mmu still emits the barriers irrespective of
> the nesting level.

Ahh I see what you mean now; exit always emits barriers but only the last exit
clears TIF_LAZY_MMU.

I think that's much cleaner, but we are changing the API which needs changes to
all the arches and my attempt at [1] didn't really gain much enthusiasm.

> 
>> So the conclusion I eventually came to (after being nudged by Mike Rapoport at
>> [1]) is that this _is_ arm64's fault for creating a loop via
>> apply_to_page_range(). So I'm trying to fix this by breaking the loop.
>>
>> [1] https://lore.kernel.org/all/aDqz7H-oBo35FRXe@kernel.org/
> 
> If the only such loop is the arm64 code, fine by me to fix it this way.
> Your series above had 6 patches and I thought there's more to fix.

My previous attempt was, in hindsight, hideous. It seems to be the case that
only arm64 suffers this problem so let's just fix it here.

> 
>> We could alternatively use some per-cpu storage for a nest count, but that gets
>> ugly quite quickly I suspect.
> 
> Yeah, the thread flag is much nicer.
> 
>> But regardless, I'm not convinced the semantics of a properly nested
>> lazy_mmu are safe.
> 
> I think we can make them safe but there would be opposition from the mm
> people and it may not be trivial on x86. So, I'm fine with this arm64
> specific change:
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Agreed, and thanks!



* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-10 16:37       ` Ryan Roberts
@ 2025-06-12 14:59         ` Will Deacon
  2025-06-12 16:00           ` Ryan Roberts
  0 siblings, 1 reply; 9+ messages in thread
From: Will Deacon @ 2025-06-12 14:59 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

On Tue, Jun 10, 2025 at 05:37:20PM +0100, Ryan Roberts wrote:
> On 10/06/2025 16:07, Catalin Marinas wrote:
> > On Tue, Jun 10, 2025 at 02:41:01PM +0100, Ryan Roberts wrote:
> >> On 10/06/2025 13:00, Catalin Marinas wrote:
> >>> On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
> >>>> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
> >>>> provided a quick fix to ensure that lazy_mmu_mode continues to work when
> >>>> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
> >>>> nest.
> >>>>
> >>>> The solution in that patch is the make the implementation tolerant to
> >>>
> >>> s/is the make/is to make/
> >>>
> >>>> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
> >>>> exit becomes a nop. But this sacrifices the optimization opportunity for
> >>>> the remainder of the outer user.
> >>> [...]
> >>>> I wonder if you might be willing to take this for v6.16? I think its a neater
> >>>> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
> >>>> lazy_mmu_mode to be nested") - which is already in Linus's master.
> >>>>
> >>>> To be clear, the current solution is safe, I just think this is much neater.
> >>>
> >>> Maybe better, though I wouldn't say much neater. One concern I have is
> >>> about whether we'll get other such nesting in the future and we need to
> >>> fix them in generic code. Here we control __kernel_map_pages() but we
> >>> may not for other cases.
> >>>
> >>> Is it the fault of the arch code that uses apply_to_page_range() via
> >>> __kernel_map_pages()? It feels like it shouldn't care about the lazy
> >>> mode as that's some detail of the apply_to_page_range() implementation.
> >>> Maybe this API should just allow nesting.
> >>
> >> I don't think it is possible to properly support nesting:
> >>
> >> enter_lazy_mmu
> >>     for_each_pte {
> >>         read/modify-write pte
> >>
> >>         alloc_page
> >>             enter_lazy_mmu
> >>                 make page valid
> >>             exit_lazy_mmu
> >>
> >>         write_to_page
> >>     }
> >> exit_lazy_mmu
> >>
> >> This example only works because lazy_mmu doesn't support nesting. The "make page
> >> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
> >> page can be accessed in write_to_page. If nesting was supported, the inner
> >> exit_lazy_mmu would become a nop and write_to_page would explode.
> > 
> > What I meant is for enter/exit_lazy_mmu to handle a kind of de-nesting
> > themselves: enter_lazy_mmu would emit the barriers if already in lazy
> > mode, clear pending (well, it doesn't need to do this but it may be
> > easier to reason about in terms of flattening). exit_lazy_mmu also needs
> > to emit the barriers but leave the lazy mode on if already on when last
> > entered. This does need some API modifications to return the old mode on
> > enter and get an argument for exit. But the correctness wouldn't be
> > affected since exit_lazy_mmu still emits the barriers irrespective of
> > the nesting level.
> 
> Ahh I see what you mean now; exit always emits barriers but only the last exit
> clears TIF_LAZY_MMU.
> 
> I think that's much cleaner, but we are changing the API which needs changes to
> all the arches and my attempt at [1] didn't really gain much enthusiasm.

To be honest, I don't think the proposal in this series is really
improving what we have. Either we support nested lazy mode or we don't;
having __kernel_map_pages() mess around with the lazy mmu state because
it somehow knows that set_memory_valid() is going to use it is fragile
and ugly.

So I'm inclined to leave the current code as-is, unless we can remove it
in favour of teaching the core code how to handle it instead.

Cheers,

Will


* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-12 14:59         ` Will Deacon
@ 2025-06-12 16:00           ` Ryan Roberts
  2025-06-12 16:21             ` Will Deacon
  2025-08-01 11:22             ` Kevin Brodsky
  0 siblings, 2 replies; 9+ messages in thread
From: Ryan Roberts @ 2025-06-12 16:00 UTC (permalink / raw)
  To: Will Deacon; +Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

On 12/06/2025 15:59, Will Deacon wrote:
> On Tue, Jun 10, 2025 at 05:37:20PM +0100, Ryan Roberts wrote:
>> On 10/06/2025 16:07, Catalin Marinas wrote:
>>> On Tue, Jun 10, 2025 at 02:41:01PM +0100, Ryan Roberts wrote:
>>>> On 10/06/2025 13:00, Catalin Marinas wrote:
>>>>> On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
>>>>>> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
>>>>>> provided a quick fix to ensure that lazy_mmu_mode continues to work when
>>>>>> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
>>>>>> nest.
>>>>>>
>>>>>> The solution in that patch is the make the implementation tolerant to
>>>>>
>>>>> s/is the make/is to make/
>>>>>
>>>>>> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
>>>>>> exit becomes a nop. But this sacrifices the optimization opportunity for
>>>>>> the remainder of the outer user.
>>>>> [...]
>>>>>> I wonder if you might be willing to take this for v6.16? I think its a neater
>>>>>> solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
>>>>>> lazy_mmu_mode to be nested") - which is already in Linus's master.
>>>>>>
>>>>>> To be clear, the current solution is safe, I just think this is much neater.
>>>>>
>>>>> Maybe better, though I wouldn't say much neater. One concern I have is
>>>>> about whether we'll get other such nesting in the future and we need to
>>>>> fix them in generic code. Here we control __kernel_map_pages() but we
>>>>> may not for other cases.
>>>>>
>>>>> Is it the fault of the arch code that uses apply_to_page_range() via
>>>>> __kernel_map_pages()? It feels like it shouldn't care about the lazy
>>>>> mode as that's some detail of the apply_to_page_range() implementation.
>>>>> Maybe this API should just allow nesting.
>>>>
>>>> I don't think it is possible to properly support nesting:
>>>>
>>>> enter_lazy_mmu
>>>>     for_each_pte {
>>>>         read/modify-write pte
>>>>
>>>>         alloc_page
>>>>             enter_lazy_mmu
>>>>                 make page valid
>>>>             exit_lazy_mmu
>>>>
>>>>         write_to_page
>>>>     }
>>>> exit_lazy_mmu
>>>>
>>>> This example only works because lazy_mmu doesn't support nesting. The "make page
>>>> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
>>>> page can be accessed in write_to_page. If nesting was supported, the inner
>>>> exit_lazy_mmu would become a nop and write_to_page would explode.
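The failure mode in that example can be modeled in a few lines (a toy sketch, not arm64 code; `enter_lazy_mmu` and friends are stand-ins for the real helpers, and the "queued update" stands in for deferred PTE maintenance):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the example above: a PTE update made in lazy mode is
 * queued and only takes effect when an exit flushes it. If nesting
 * were "supported" by turning the inner exit into a nop, the
 * "make page valid" update would still be pending when write_to_page
 * runs, and the access would fault. */

static bool page_valid;          /* PTE state visible to hardware */
static bool pending_make_valid;  /* the queued lazy update */
static int depth;                /* lazy_mmu nesting depth */

static void enter_lazy_mmu(void)
{
	depth++;
}

static void make_page_valid(void)
{
	pending_make_valid = true;   /* deferred, not yet visible */
}

/* Hypothetical nesting support: only the outermost exit flushes. */
static void exit_lazy_mmu(void)
{
	if (--depth == 0 && pending_make_valid) {
		page_valid = true;
		pending_make_valid = false;
	}
}
```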
>>>
>>> What I meant is for enter/exit_lazy_mmu to handle a kind of de-nesting
>>> themselves: enter_lazy_mmu would emit the barriers if already in lazy
>>> mode, clear pending (well, it doesn't need to do this but it may be
>>> easier to reason about in terms of flattening). exit_lazy_mmu also needs
>>> to emit the barriers but leave the lazy mode on if already on when last
>>> entered. This does need some API modifications to return the old mode on
>>> enter and get an argument for exit. But the correctness wouldn't be
>>> affected since exit_lazy_mmu still emits the barriers irrespective of
>>> the nesting level.
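The flattening scheme described here can be sketched as follows (an illustrative model, not the actual arm64 implementation; the `lazy_mmu_enter`/`lazy_mmu_exit` names and the barrier bookkeeping are assumptions based on the description above):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the "de-nesting" API: enter() returns whether lazy mode
 * was already active, and exit() takes that value back. Barriers are
 * emitted on every exit, so an inner exit still makes pending PTE
 * updates visible, but only the outermost exit clears the lazy flag. */

static bool lazy_mmu_active;   /* stands in for TIF_LAZY_MMU */
static int barrier_count;      /* counts emitted barrier sequences */

static void emit_barriers(void)
{
	barrier_count++;       /* real code would issue dsb/isb etc. */
}

static bool lazy_mmu_enter(void)
{
	bool was_active = lazy_mmu_active;

	if (was_active)
		emit_barriers();       /* flush anything pending so far */
	lazy_mmu_active = true;
	return was_active;
}

static void lazy_mmu_exit(bool was_active)
{
	emit_barriers();               /* always flush pending updates */
	lazy_mmu_active = was_active;  /* only outermost exit clears */
}
```

Correctness is preserved exactly as described: the inner exit flushes, so subsequent accesses see the updates, while the outer user stays in lazy mode.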
>>
>> Ahh I see what you mean now; exit always emits barriers but only the last exit
>> clears TIF_LAZY_MMU.
>>
>> I think that's much cleaner, but we are changing the API which needs changes to
>> all the arches and my attempt at [1] didn't really gain much enthusiasm.
> 
> To be honest, I don't think the proposal in this series is really
> improving what we have. Either we support nested lazy mode or we don't;
> having __kernel_map_pages() mess around with the lazy mmu state because
> it somehow knows that set_memory_valid() is going to use it is fragile
> and ugly.
> 
> So I'm inclined to leave the current code as-is, unless we can remove it
> in favour of teaching the core code how to handle it instead.

Yeah fair enough. I'm not going to have time to do the proper nesting support
thing. But I'll see if I can find someone internally that I might be able to
convince. If not, we'll just leave as is.

> 
> Cheers,
> 
> Will


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-12 16:00           ` Ryan Roberts
@ 2025-06-12 16:21             ` Will Deacon
  2025-08-01 11:22             ` Kevin Brodsky
  1 sibling, 0 replies; 9+ messages in thread
From: Will Deacon @ 2025-06-12 16:21 UTC (permalink / raw)
  To: Ryan Roberts; +Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

On Thu, Jun 12, 2025 at 05:00:51PM +0100, Ryan Roberts wrote:
> On 12/06/2025 15:59, Will Deacon wrote:
> > To be honest, I don't think the proposal in this series is really
> > improving what we have. Either we support nested lazy mode or we don't;
> > having __kernel_map_pages() mess around with the lazy mmu state because
> > it somehow knows that set_memory_valid() is going to use it is fragile
> > and ugly.
> > 
> > So I'm inclined to leave the current code as-is, unless we can remove it
> > in favour of teaching the core code how to handle it instead.
> 
> Yeah fair enough. I'm not going to have time to do the proper nesting support
> thing. But I'll see if I can find someone internally that I might be able to
> convince. If not, we'll just leave as is.

Sounds fine by me, thanks Ryan.

Will


* Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
  2025-06-12 16:00           ` Ryan Roberts
  2025-06-12 16:21             ` Will Deacon
@ 2025-08-01 11:22             ` Kevin Brodsky
  1 sibling, 0 replies; 9+ messages in thread
From: Kevin Brodsky @ 2025-08-01 11:22 UTC (permalink / raw)
  To: Ryan Roberts, Will Deacon; +Cc: Catalin Marinas, linux-arm-kernel, linux-kernel

On 12/06/2025 18:00, Ryan Roberts wrote:
> On 12/06/2025 15:59, Will Deacon wrote:
>> On Tue, Jun 10, 2025 at 05:37:20PM +0100, Ryan Roberts wrote:
>>> On 10/06/2025 16:07, Catalin Marinas wrote:
>>>> On Tue, Jun 10, 2025 at 02:41:01PM +0100, Ryan Roberts wrote:
>>>>> On 10/06/2025 13:00, Catalin Marinas wrote:
>>>>>> On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:
>>>>>>> Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested")
>>>>>>> provided a quick fix to ensure that lazy_mmu_mode continues to work when
>>>>>>> CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to
>>>>>>> nest.
>>>>>>>
>>>>>>> The solution in that patch is the make the implementation tolerant to
>>>>>> s/is the make/is to make/
>>>>>>
>>>>>>> nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer
>>>>>>> exit becomes a nop. But this sacrifices the optimization opportunity for
>>>>>>> the remainder of the outer user.
>>>>>> [...]
>>>>>>> I wonder if you might be willing to take this for v6.16? I think it's a neater
>>>>>>> solution than my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit
>>>>>>> lazy_mmu_mode to be nested") - which is already in Linus's master.
>>>>>>>
>>>>>>> To be clear, the current solution is safe, I just think this is much neater.
>>>>>> Maybe better, though I wouldn't say much neater. One concern I have is
>>>>>> about whether we'll get other such nesting in the future and we need to
>>>>>> fix them in generic code. Here we control __kernel_map_pages() but we
>>>>>> may not for other cases.
>>>>>>
>>>>>> Is it the fault of the arch code that uses apply_to_page_range() via
>>>>>> __kernel_map_pages()? It feels like it shouldn't care about the lazy
>>>>>> mode as that's some detail of the apply_to_page_range() implementation.
>>>>>> Maybe this API should just allow nesting.
>>>>> I don't think it is possible to properly support nesting:
>>>>>
>>>>> enter_lazy_mmu
>>>>>     for_each_pte {
>>>>>         read/modify-write pte
>>>>>
>>>>>         alloc_page
>>>>>             enter_lazy_mmu
>>>>>                 make page valid
>>>>>             exit_lazy_mmu
>>>>>
>>>>>         write_to_page
>>>>>     }
>>>>> exit_lazy_mmu
>>>>>
>>>>> This example only works because lazy_mmu doesn't support nesting. The "make page
>>>>> valid" operation is completed by the time of the inner exit_lazy_mmu so that the
>>>>> page can be accessed in write_to_page. If nesting was supported, the inner
>>>>> exit_lazy_mmu would become a nop and write_to_page would explode.
>>>> What I meant is for enter/exit_lazy_mmu to handle a kind of de-nesting
>>>> themselves: enter_lazy_mmu would emit the barriers if already in lazy
>>>> mode, clear pending (well, it doesn't need to do this but it may be
>>>> easier to reason about in terms of flattening). exit_lazy_mmu also needs
>>>> to emit the barriers but leave the lazy mode on if already on when last
>>>> entered. This does need some API modifications to return the old mode on
>>>> enter and get an argument for exit. But the correctness wouldn't be
>>>> affected since exit_lazy_mmu still emits the barriers irrespective of
>>>> the nesting level.
>>> Ahh I see what you mean now; exit always emits barriers but only the last exit
>>> clears TIF_LAZY_MMU.
>>>
>>> I think that's much cleaner, but we are changing the API which needs changes to
>>> all the arches and my attempt at [1] didn't really gain much enthusiasm.
>> To be honest, I don't think the proposal in this series is really
>> improving what we have. Either we support nested lazy mode or we don't;
>> having __kernel_map_pages() mess around with the lazy mmu state because
>> it somehow knows that set_memory_valid() is going to use it is fragile
>> and ugly.
>>
>> So I'm inclined to leave the current code as-is, unless we can remove it
>> in favour of teaching the core code how to handle it instead.
> Yeah fair enough. I'm not going to have time to do the proper nesting support
> thing. But I'll see if I can find someone internally that I might be able to
> convince. If not, we'll just leave as is.

I've started working on supporting nesting as Catalin suggested above -
modifying the API so that enter() returns whether the call is nested,
and leave() takes the value returned by enter(). I've got it working
without too much trouble, there's a fair amount of churn at the call
sites but a trivial Coccinelle script can handle most of them.

This new approach will also help with protecting page tables with
kpkeys: lazy_mmu is very useful to write to POR_EL1 less often [1], but
this currently assumes that nesting doesn't occur. In fact the new API
should allow the old value of POR_EL1 to be returned by enter(), meaning
we wouldn't need to store it in a thread-local variable.
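That last point can be sketched concretely (purely illustrative; the function names are hypothetical and the plain variable stands in for the real POR_EL1 register accessors):

```c
#include <assert.h>

/* Sketch of how a return-value API helps the kpkeys use case: enter()
 * hands back the previous POR_EL1 value, so leave() restores it from
 * the caller's stack rather than a thread-local save slot, and nested
 * enter/leave pairs restore the right value automatically. */

static unsigned long por_el1;    /* stands in for the real register */

static unsigned long lazy_mmu_enter_kpkeys(unsigned long new_por)
{
	unsigned long old = por_el1;

	por_el1 = new_por;   /* grant page-table write permission once */
	return old;
}

static void lazy_mmu_leave_kpkeys(unsigned long old_por)
{
	por_el1 = old_por;   /* restore the caller's permissions */
}
```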

- Kevin

[1]
https://lore.kernel.org/linux-mm/20250411091631.954228-19-kevin.brodsky@arm.com/



end of thread, other threads:[~2025-08-01 11:22 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-06 13:56 [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests Ryan Roberts
2025-06-10 12:00 ` Catalin Marinas
2025-06-10 13:41   ` Ryan Roberts
2025-06-10 15:07     ` Catalin Marinas
2025-06-10 16:37       ` Ryan Roberts
2025-06-12 14:59         ` Will Deacon
2025-06-12 16:00           ` Ryan Roberts
2025-06-12 16:21             ` Will Deacon
2025-08-01 11:22             ` Kevin Brodsky
