* [PATCH v1 1/4] kasan: Avoid sleepable page allocation from atomic context
From: Alexander Gordeev @ 2025-04-07 15:11 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390
apply_to_page_range() enters lazy MMU mode and then invokes the
kasan_populate_vmalloc_pte() callback on each page table walk
iteration. Lazy MMU mode may only be entered under protection of
the page table lock. However, the callback can sleep when trying
to allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to
GFP_ATOMIC to avoid scheduling out while in atomic context.
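For reference, a condensed sketch of the calling context (simplified,
not a verbatim copy of apply_to_pte_range() in mm/memory.c):
	arch_enter_lazy_mmu_mode();	/* atomic context from here on */
	do {
		/* fn == kasan_populate_vmalloc_pte() in the KASAN vmalloc case */
		err = fn(pte++, addr, data);
		if (err)
			break;
	} while (addr += PAGE_SIZE, addr != end);
	arch_leave_lazy_mmu_mode();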
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/kasan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!pte_none(ptep_get(ptep))))
return 0;
- page = __get_free_page(GFP_KERNEL);
+ page = __get_free_page(GFP_ATOMIC);
if (!page)
return -ENOMEM;
--
2.45.2
* Re: [PATCH v1 1/4] kasan: Avoid sleepable page allocation from atomic context
From: Nicholas Piggin @ 2025-04-11 6:40 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> apply_to_page_range() enters lazy MMU mode and then invokes the
> kasan_populate_vmalloc_pte() callback on each page table walk
> iteration. Lazy MMU mode may only be entered under protection of
> the page table lock. However, the callback can sleep when trying
> to allocate a single page.
>
> Change __get_free_page() allocation mode from GFP_KERNEL to
> GFP_ATOMIC to avoid scheduling out while in atomic context.
It's a bit unfortunate to make this use atomic allocs for
archs that don't need it.
Could you make it depend on __HAVE_ARCH_ENTER_LAZY_MMU_MODE
or is that overkill?
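For illustration only, a sketch of what such a dependency could look
like (untested, not part of the posted patch):
#ifdef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
	/* lazy MMU mode makes the callback run in atomic context */
	page = __get_free_page(GFP_ATOMIC);
#else
	page = __get_free_page(GFP_KERNEL);
#endif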
I wanted to remove ppc64's per-CPU page array and replace it
with an on-stack or dynamically allocated array in the thread... but
the cost/benefit of working on the ppc64 hash MMU code is not
high :(
The fix itself at least looks right to me for ppc64's requirement,
so for that,
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
> mm/kasan/shadow.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 88d1c9dcb507..edfa77959474 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> if (likely(!pte_none(ptep_get(ptep))))
> return 0;
>
> - page = __get_free_page(GFP_KERNEL);
> + page = __get_free_page(GFP_ATOMIC);
> if (!page)
> return -ENOMEM;
>
* Re: [PATCH v1 1/4] kasan: Avoid sleepable page allocation from atomic context
From: Nicholas Piggin @ 2025-04-11 6:54 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> apply_to_page_range() enters lazy MMU mode and then invokes the
> kasan_populate_vmalloc_pte() callback on each page table walk
> iteration. Lazy MMU mode may only be entered under protection of
> the page table lock. However, the callback can sleep when trying
> to allocate a single page.
>
> Change __get_free_page() allocation mode from GFP_KERNEL to
> GFP_ATOMIC to avoid scheduling out while in atomic context.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
> mm/kasan/shadow.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 88d1c9dcb507..edfa77959474 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> if (likely(!pte_none(ptep_get(ptep))))
> return 0;
>
> - page = __get_free_page(GFP_KERNEL);
> + page = __get_free_page(GFP_ATOMIC);
> if (!page)
> return -ENOMEM;
>
Oh, of course you can't keep it GFP_KERNEL after the
patch that takes the ptl even for archs that don't use
lazy MMU mode.
Thanks,
Nick
* [PATCH v1 2/4] mm: Cleanup apply_to_pte_range() routine
From: Alexander Gordeev @ 2025-04-07 15:11 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390
Reverse 'create' vs 'mm == &init_mm' conditions and move
page table mask modification out of the atomic context.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/memory.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2d8c265fc7d6..f0201c8ec1ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_fn_t fn, void *data, bool create,
pgtbl_mod_mask *mask)
{
+ int err = create ? -ENOMEM : -EINVAL;
pte_t *pte, *mapped_pte;
- int err = 0;
spinlock_t *ptl;
- if (create) {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_alloc_kernel_track(pmd, addr, mask) :
- pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ if (mm == &init_mm) {
+ if (create)
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ else
+ pte = pte_offset_kernel(pmd, addr);
if (!pte)
- return -ENOMEM;
+ return err;
} else {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_offset_kernel(pmd, addr) :
- pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (create)
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ else
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- return -EINVAL;
+ return err;
+ mapped_pte = pte;
}
+ err = 0;
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
}
} while (addr += PAGE_SIZE, addr != end);
}
- *mask |= PGTBL_PTE_MODIFIED;
arch_leave_lazy_mmu_mode();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
+
+ *mask |= PGTBL_PTE_MODIFIED;
+
return err;
}
--
2.45.2
* Re: [PATCH v1 2/4] mm: Cleanup apply_to_pte_range() routine
From: Nicholas Piggin @ 2025-04-11 6:46 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> Reverse 'create' vs 'mm == &init_mm' conditions and move
> page table mask modification out of the atomic context.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
> mm/memory.c | 28 +++++++++++++++++-----------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2d8c265fc7d6..f0201c8ec1ce 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> pte_fn_t fn, void *data, bool create,
> pgtbl_mod_mask *mask)
> {
> + int err = create ? -ENOMEM : -EINVAL;
Could you make this a new variable instead of reusing
existing err? 'const int pte_err' or something?
> pte_t *pte, *mapped_pte;
> - int err = 0;
> spinlock_t *ptl;
>
> - if (create) {
> - mapped_pte = pte = (mm == &init_mm) ?
> - pte_alloc_kernel_track(pmd, addr, mask) :
> - pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + if (mm == &init_mm) {
> + if (create)
> + pte = pte_alloc_kernel_track(pmd, addr, mask);
> + else
> + pte = pte_offset_kernel(pmd, addr);
> if (!pte)
> - return -ENOMEM;
> + return err;
> } else {
> - mapped_pte = pte = (mm == &init_mm) ?
> - pte_offset_kernel(pmd, addr) :
> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (create)
> + pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + else
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> if (!pte)
> - return -EINVAL;
> + return err;
> + mapped_pte = pte;
> }
>
> + err = 0;
> arch_enter_lazy_mmu_mode();
>
> if (fn) {
> @@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> }
> } while (addr += PAGE_SIZE, addr != end);
> }
> - *mask |= PGTBL_PTE_MODIFIED;
>
> arch_leave_lazy_mmu_mode();
>
> if (mm != &init_mm)
> pte_unmap_unlock(mapped_pte, ptl);
> +
> + *mask |= PGTBL_PTE_MODIFIED;
This is done just because we might as well? Less work in critical
section?
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> +
> return err;
> }
>
* Re: [PATCH v1 2/4] mm: Cleanup apply_to_pte_range() routine
From: Alexander Gordeev @ 2025-04-14 14:17 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Andrew Morton, Andrey Ryabinin, Hugh Dickins, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390
On Fri, Apr 11, 2025 at 04:46:58PM +1000, Nicholas Piggin wrote:
> On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> > Reverse 'create' vs 'mm == &init_mm' conditions and move
> > page table mask modification out of the atomic context.
> >
> > Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> > ---
> > mm/memory.c | 28 +++++++++++++++++-----------
> > 1 file changed, 17 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2d8c265fc7d6..f0201c8ec1ce 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > pte_fn_t fn, void *data, bool create,
> > pgtbl_mod_mask *mask)
> > {
> > + int err = create ? -ENOMEM : -EINVAL;
>
> Could you make this a new variable instead of reusing
> existing err? 'const int pte_err' or something?
Will do, when/if I repost.
...
> > @@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> > }
> > } while (addr += PAGE_SIZE, addr != end);
> > }
> > - *mask |= PGTBL_PTE_MODIFIED;
> >
> > arch_leave_lazy_mmu_mode();
> >
> > if (mm != &init_mm)
> > pte_unmap_unlock(mapped_pte, ptl);
> > +
> > + *mask |= PGTBL_PTE_MODIFIED;
>
> This is done just because we might as well? Less work in critical
> section?
Yes.
Thanks!
* [PATCH v1 3/4] mm: Protect kernel pgtables in apply_to_pte_range()
From: Alexander Gordeev @ 2025-04-07 15:11 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390
The lazy MMU mode can only be entered and left under the protection
of the page table locks for all page tables which may be modified.
Yet, when it comes to kernel mappings, apply_to_pte_range() does not
take any locks. That does not conform to the
arch_enter|leave_lazy_mmu_mode() semantics and could potentially lead
to rescheduling a process while in lazy MMU mode or racing on kernel
page table updates.
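In sketch form, the locking discipline those semantics require for
kernel page tables looks like this (simplified illustration):
	spin_lock(&init_mm.page_table_lock);
	arch_enter_lazy_mmu_mode();
	/* ... modify kernel PTEs ... */
	arch_leave_lazy_mmu_mode();
	spin_unlock(&init_mm.page_table_lock);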
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/kasan/shadow.c | 7 ++-----
mm/memory.c | 5 ++++-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index edfa77959474..6531a7aa8562 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
if (likely(pte_none(ptep_get(ptep)))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
}
- spin_unlock(&init_mm.page_table_lock);
+
if (page)
free_page(page);
+
return 0;
}
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
- spin_lock(&init_mm.page_table_lock);
-
if (likely(!pte_none(ptep_get(ptep)))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index f0201c8ec1ce..1f3727104e99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte = pte_offset_kernel(pmd, addr);
if (!pte)
return err;
+ spin_lock(&init_mm.page_table_lock);
} else {
if (create)
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
- if (mm != &init_mm)
+ if (mm == &init_mm)
+ spin_unlock(&init_mm.page_table_lock);
+ else
pte_unmap_unlock(mapped_pte, ptl);
*mask |= PGTBL_PTE_MODIFIED;
--
2.45.2
* Re: [PATCH v1 3/4] mm: Protect kernel pgtables in apply_to_pte_range()
From: Nicholas Piggin @ 2025-04-11 6:53 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> The lazy MMU mode can only be entered and left under the protection
> of the page table locks for all page tables which may be modified.
> Yet, when it comes to kernel mappings, apply_to_pte_range() does not
> take any locks. That does not conform to the
> arch_enter|leave_lazy_mmu_mode() semantics and could potentially lead
> to rescheduling a process while in lazy MMU mode or racing on kernel
> page table updates.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
> mm/kasan/shadow.c | 7 ++-----
> mm/memory.c | 5 ++++-
> 2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index edfa77959474..6531a7aa8562 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> __memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
> pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
>
> - spin_lock(&init_mm.page_table_lock);
> if (likely(pte_none(ptep_get(ptep)))) {
> set_pte_at(&init_mm, addr, ptep, pte);
> page = 0;
> }
> - spin_unlock(&init_mm.page_table_lock);
> +
> if (page)
> free_page(page);
> +
> return 0;
> }
>
kasan_populate_vmalloc_pte() is really the only thing that
takes the ptl in the apply_to_page_range fn()... Looks like
you may be right. I wonder why they do and nobody else? Just
luck?
Seems okay.
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> @@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>
> page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
>
> - spin_lock(&init_mm.page_table_lock);
> -
> if (likely(!pte_none(ptep_get(ptep)))) {
> pte_clear(&init_mm, addr, ptep);
> free_page(page);
> }
> - spin_unlock(&init_mm.page_table_lock);
>
> return 0;
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index f0201c8ec1ce..1f3727104e99 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> pte = pte_offset_kernel(pmd, addr);
> if (!pte)
> return err;
> + spin_lock(&init_mm.page_table_lock);
> } else {
> if (create)
> pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
> @@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>
> arch_leave_lazy_mmu_mode();
>
> - if (mm != &init_mm)
> + if (mm == &init_mm)
> + spin_unlock(&init_mm.page_table_lock);
> + else
> pte_unmap_unlock(mapped_pte, ptl);
>
> *mask |= PGTBL_PTE_MODIFIED;
* [PATCH v1 4/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context
From: Alexander Gordeev @ 2025-04-07 15:11 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390
The lazy MMU batching may only be entered and left under the
protection of the page table locks for all page tables which may
be modified. Yet, there were cases where arch_enter_lazy_mmu_mode()
was called without the locks taken, e.g. commit b9ef323ea168
("powerpc/64s: Disable preemption in hash lazy mmu mode").
Make the default arch_enter|leave|flush_lazy_mmu_mode() callbacks
complain at least when preemption is enabled, in order to detect
wrong contexts.
Most platforms do not implement the callbacks, so to avoid a
performance impact allow the complaint only when the
CONFIG_DEBUG_VM option is enabled.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
include/linux/pgtable.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e2b705c14945..959590bb66da 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -232,9 +232,18 @@ static inline int pmd_dirty(pmd_t pmd)
* and the mode cannot be used in interrupt context.
*/
#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode() do {} while (0)
-#define arch_leave_lazy_mmu_mode() do {} while (0)
-#define arch_flush_lazy_mmu_mode() do {} while (0)
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
#endif
#ifndef pte_batch_hint
--
2.45.2
* Re: [PATCH v1 4/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context
From: Nicholas Piggin @ 2025-04-11 7:16 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> The lazy MMU batching may only be entered and left under the
> protection of the page table locks for all page tables which may
> be modified. Yet, there were cases where arch_enter_lazy_mmu_mode()
> was called without the locks taken, e.g. commit b9ef323ea168
> ("powerpc/64s: Disable preemption in hash lazy mmu mode").
>
> Make the default arch_enter|leave|flush_lazy_mmu_mode() callbacks
> complain at least when preemption is enabled, in order to detect
> wrong contexts.
>
> Most platforms do not implement the callbacks, so to avoid a
> performance impact allow the complaint only when the
> CONFIG_DEBUG_VM option is enabled.
>
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
This is a good debugging feature independent of how the fix
is done. I would just warn once, since it's not a bug for
the arch and could fire frequently if it fires at all.
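As a sketch, the warn-once variant could use the _ONCE flavour of the
check (hypothetical, not what was posted):
static inline void arch_enter_lazy_mmu_mode(void)
{
	VM_WARN_ON_ONCE(preemptible());
}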
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> include/linux/pgtable.h | 15 ++++++++++++---
> 1 file changed, 12 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index e2b705c14945..959590bb66da 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -232,9 +232,18 @@ static inline int pmd_dirty(pmd_t pmd)
> * and the mode cannot be used in interrupt context.
> */
> #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
> -#define arch_enter_lazy_mmu_mode() do {} while (0)
> -#define arch_leave_lazy_mmu_mode() do {} while (0)
> -#define arch_flush_lazy_mmu_mode() do {} while (0)
> +static inline void arch_enter_lazy_mmu_mode(void)
> +{
> + VM_WARN_ON(preemptible());
> +}
> +static inline void arch_leave_lazy_mmu_mode(void)
> +{
> + VM_WARN_ON(preemptible());
> +}
> +static inline void arch_flush_lazy_mmu_mode(void)
> +{
> + VM_WARN_ON(preemptible());
> +}
> #endif
>
> #ifndef pte_batch_hint
* Re: [PATCH v1 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode
From: Andrew Morton @ 2025-04-07 18:35 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Andrey Ryabinin, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390
On Mon, 7 Apr 2025 17:11:26 +0200 Alexander Gordeev <agordeev@linux.ibm.com> wrote:
> This series is an attempt to fix the violation of lazy MMU mode context
> requirement as described for arch_enter_lazy_mmu_mode():
>
> This mode can only be entered and left under the protection of
> the page table locks for all page tables which may be modified.
>
> On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
> arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
>
> ...
>
Could you please reorganize this into two series? One series which
should be fast-tracked into 6.15-rcX and one series for 6.16-rc1?
And in the first series, please suggest whether its patches should be
backported into -stable and see if we can come up with suitable Fixes:
targets?
Thanks.
* Re: [PATCH v1 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode
From: Nicholas Piggin @ 2025-04-11 7:12 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge,
linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390
On Tue Apr 8, 2025 at 1:11 AM AEST, Alexander Gordeev wrote:
> Hi All,
>
> This series is an attempt to fix the violation of lazy MMU mode context
> requirement as described for arch_enter_lazy_mmu_mode():
>
> This mode can only be entered and left under the protection of
> the page table locks for all page tables which may be modified.
>
> On s390 if I make arch_enter_lazy_mmu_mode() -> preempt_enable() and
> arch_leave_lazy_mmu_mode() -> preempt_disable() I am getting this:
>
> [ 553.332108] preempt_count: 1, expected: 0
> [ 553.332117] no locks held by multipathd/2116.
> [ 553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
> [ 553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
> [ 553.332146] Call Trace:
> [ 553.332152] [<00000000158de23a>] dump_stack_lvl+0xfa/0x150
> [ 553.332167] [<0000000013e10d12>] __might_resched+0x57a/0x5e8
> [ 553.332178] [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0
> [ 553.332189] [<00000000144d5cdc>] __get_free_pages+0x2c/0x88
> [ 553.332198] [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110
> [ 553.332207] [<000000001447625c>] apply_to_pte_range+0x164/0x3c8
> [ 553.332218] [<000000001448125a>] apply_to_pmd_range+0xda/0x318
> [ 553.332226] [<000000001448181c>] __apply_to_page_range+0x384/0x768
> [ 553.332233] [<0000000014481c28>] apply_to_page_range+0x28/0x38
> [ 553.332241] [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98
> [ 553.332249] [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90
> [ 553.332257] [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260
> [ 553.332265] [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360
> [ 553.332274] [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378
> [ 553.332284] [<0000000013d62726>] dup_task_struct+0x66/0x430
> [ 553.332293] [<0000000013d63962>] copy_process+0x432/0x4b80
> [ 553.332302] [<0000000013d68300>] kernel_clone+0xf0/0x7d0
> [ 553.332311] [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8
> [ 553.332400] [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118
> [ 553.332410] [<0000000013c9d34c>] do_syscall+0x22c/0x328
> [ 553.332419] [<00000000158e7366>] __do_syscall+0xce/0xf0
> [ 553.332428] [<0000000015913260>] system_call+0x70/0x98
>
> This exposes a KASAN issue fixed with patch 1 and apply_to_pte_range()
> issue fixed with patches 2-3. Patch 4 is a debug improvement on top,
> that could have helped to notice the issue.
>
> Commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu
> mode") looks like a powerpc-only fix, yet not entirely conforming to the
> above provided requirement (the page tables themselves are still not
> protected). If I am not mistaken, xen and sparc are alike.
Huh. powerpc actually has some crazy code in __switch_to() that is
supposed to handle preemption while in lazy mmu mode. So we probably
don't even need to disable preemption, just use the raw per-cpu
accessors (or keep disabling preemption and remove the now dead code
from context switch).
IIRC all this got built up over a long time, with some TLB flush
rules changing at the same time; we could probably stay in lazy mmu
mode for a longer time until it was discovered we really needed to
flush before dropping the PTL.
ppc64 and sparc I think don't even need lazy mmu mode for kasan (TLBs
do not require flushing) and will function just fine if not in lazy
mode (they just flush one TLB at a time), not sure about xen. We could
actually go the other way and require that archs operate properly when
not in lazy mode (at least for kernel page tables) and avoid it for
apply_to_page_range()?
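In sketch form, one reading of that alternative is to enter lazy MMU
mode in apply_to_pte_range() only for user page tables (hypothetical,
not a posted patch):
	if (mm != &init_mm)
		arch_enter_lazy_mmu_mode();
	/* ... walk PTEs and call fn() ... */
	if (mm != &init_mm)
		arch_leave_lazy_mmu_mode();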
Thanks,
Nick
* Re: [PATCH v1 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode
From: Alexander Gordeev @ 2025-04-11 12:04 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Andrew Morton, Andrey Ryabinin, Hugh Dickins, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390
On Fri, Apr 11, 2025 at 05:12:28PM +1000, Nicholas Piggin wrote:
...
> Huh. powerpc actually has some crazy code in __switch_to() that is
> supposed to handle preemption while in lazy mmu mode. So we probably
> don't even need to disable preemption, just use the raw per-cpu
> accessors (or keep disabling preemption and remove the now dead code
> from context switch).
Well, I tried to do the latter ;)
https://lore.kernel.org/linuxppc-dev/3b4e3e28172f09165b19ee7cac67a860d7cc1c6e.1742915600.git.agordeev@linux.ibm.com/
Not sure how it is aligned with the current state (see below).
> IIRC all this got built up over a long time with some TLB flush
> rules changing at the same time, we could probably stay in lazy mmu
> mode for a longer time until it was discovered we really need to
> flush before dropping the PTL.
>
> ppc64 and sparc I think don't even need lazy mmu mode for kasan (TLBs
> do not require flushing) and will function just fine if not in lazy
> mode (they just flush one TLB at a time), not sure about xen. We could
> actually go the other way and require that archs operate properly when
> not in lazy mode (at least for kernel page tables) and avoid it for
> apply_to_page_range()?
Ryan Roberts hopefully brought some order to the topic:
https://lore.kernel.org/linux-mm/20250303141542.3371656-1-ryan.roberts@arm.com/
> Thanks,
> Nick