* [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode
@ 2025-03-28  9:13 Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 1/4] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge

Hi All!

On s390, if I make arch_enter_lazy_mmu_mode() do preempt_disable() and
arch_leave_lazy_mmu_mode() do preempt_enable(), I get this:

    [  553.332108] preempt_count: 1, expected: 0
    [  553.332117] no locks held by multipathd/2116.
    [  553.332128] CPU: 24 PID: 2116 Comm: multipathd Kdump: loaded Tainted:
    [  553.332139] Hardware name: IBM 3931 A01 701 (LPAR)
    [  553.332146] Call Trace:
    [  553.332152]  [<00000000158de23a>] dump_stack_lvl+0xfa/0x150 
    [  553.332167]  [<0000000013e10d12>] __might_resched+0x57a/0x5e8 
    [  553.332178]  [<00000000144eb6c2>] __alloc_pages+0x2ba/0x7c0 
    [  553.332189]  [<00000000144d5cdc>] __get_free_pages+0x2c/0x88 
    [  553.332198]  [<00000000145663f6>] kasan_populate_vmalloc_pte+0x4e/0x110 
    [  553.332207]  [<000000001447625c>] apply_to_pte_range+0x164/0x3c8 
    [  553.332218]  [<000000001448125a>] apply_to_pmd_range+0xda/0x318 
    [  553.332226]  [<000000001448181c>] __apply_to_page_range+0x384/0x768 
    [  553.332233]  [<0000000014481c28>] apply_to_page_range+0x28/0x38 
    [  553.332241]  [<00000000145665da>] kasan_populate_vmalloc+0x82/0x98 
    [  553.332249]  [<00000000144c88d0>] alloc_vmap_area+0x590/0x1c90 
    [  553.332257]  [<00000000144ca108>] __get_vm_area_node.constprop.0+0x138/0x260 
    [  553.332265]  [<00000000144d17fc>] __vmalloc_node_range+0x134/0x360 
    [  553.332274]  [<0000000013d5dbf2>] alloc_thread_stack_node+0x112/0x378 
    [  553.332284]  [<0000000013d62726>] dup_task_struct+0x66/0x430 
    [  553.332293]  [<0000000013d63962>] copy_process+0x432/0x4b80 
    [  553.332302]  [<0000000013d68300>] kernel_clone+0xf0/0x7d0 
    [  553.332311]  [<0000000013d68bd6>] __do_sys_clone+0xae/0xc8 
    [  553.332400]  [<0000000013d68dee>] __s390x_sys_clone+0xd6/0x118 
    [  553.332410]  [<0000000013c9d34c>] do_syscall+0x22c/0x328 
    [  553.332419]  [<00000000158e7366>] __do_syscall+0xce/0xf0 
    [  553.332428]  [<0000000015913260>] system_call+0x70/0x98 

I guess commit b9ef323ea168 ("powerpc/64s: Disable preemption in hash
lazy mmu mode") fixed a similar issue on PPC, albeit not completely:

    apply_to_page_range on kernel pages does not disable preemption, which         
    is a requirement for hash's lazy mmu mode, which keeps track of the            
    TLBs to flush with a per-cpu array.                                            

This series is an attempt to fix the violation of lazy MMU mode context
as described for arch_enter_lazy_mmu_mode():

    This mode can only be entered and left under the protection of  
    the page table locks for all page tables which may be modified.

If I am not mistaken, xen and sparc are also prone to the described
problem, as they use this_cpu_ptr() rather than get_cpu_var().

Take init_mm.page_table_lock for kernel tables to avoid all of that.
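
To illustrate the per-CPU concern above, here is a minimal userspace
sketch (not kernel code). It assumes a hypothetical per-CPU batch that
is looked up the way this_cpu_ptr() would, with nothing preventing
migration. All names ending in _model, and the batch layout itself,
are made up for illustration and do not match any architecture's real
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS_MODEL 2
#define BATCH_MAX 8

/* Hypothetical per-CPU batch of pending TLB flushes (not a kernel type). */
struct tlb_batch_model {
	bool active;
	int nr;
	unsigned long addr[BATCH_MAX];
};

static struct tlb_batch_model batches[NR_CPUS_MODEL];
static int current_cpu;	/* stand-in for the CPU the task runs on */

/* Models a this_cpu_ptr()-style lookup: no protection against migration. */
static struct tlb_batch_model *this_batch(void)
{
	return &batches[current_cpu];
}

static void lazy_enter_model(void)
{
	this_batch()->active = true;
}

static void lazy_queue_model(unsigned long addr)
{
	struct tlb_batch_model *b = this_batch();

	/* Entries are only batched while this CPU's batch is active. */
	if (b->active && b->nr < BATCH_MAX)
		b->addr[b->nr++] = addr;
}

static void lazy_leave_model(void)
{
	struct tlb_batch_model *b = this_batch();

	b->nr = 0;		/* "flush" whatever this CPU has queued */
	b->active = false;
}

/*
 * Returns the number of entries left unflushed on CPU 0 after one
 * enter/queue/queue/leave sequence, optionally migrating in the middle.
 */
static int stale_entries_after(bool migrate)
{
	memset(batches, 0, sizeof(batches));
	current_cpu = 0;

	lazy_enter_model();
	lazy_queue_model(0x1000);
	if (migrate)
		current_cpu = 1;	/* task preempted and moved to CPU 1 */
	lazy_queue_model(0x2000);
	lazy_leave_model();

	return batches[0].nr;
}
```

In this model, stale_entries_after(false) leaves nothing behind, while
stale_entries_after(true) leaves one entry queued on CPU 0 that the
leave call on CPU 1 never flushes. Holding a spinlock across the whole
sequence rules out the migration step.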

Thanks!

Alexander Gordeev (4):
  kasan: Avoid sleepable page allocation from atomic context
  mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context
  mm: Cleanup apply_to_pte_range() routine
  mm: Protect kernel pgtables in apply_to_pte_range()

 include/linux/pgtable.h | 15 ++++++++++++---
 mm/kasan/shadow.c       |  9 +++------
 mm/memory.c             | 33 +++++++++++++++++++++------------
 3 files changed, 36 insertions(+), 21 deletions(-)

-- 
2.45.2




* [PATCH 1/4] kasan: Avoid sleepable page allocation from atomic context
  2025-03-28  9:13 [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
@ 2025-03-28  9:13 ` Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 2/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context Alexander Gordeev
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge

apply_to_page_range() enters lazy MMU mode and then invokes the
kasan_populate_vmalloc_pte() callback on each page table walk
iteration. Lazy MMU mode may only be entered under protection
of the page table lock. However, the callback can sleep when
trying to allocate a single page.

Change __get_free_page() allocation mode from GFP_KERNEL to
GFP_ATOMIC to avoid scheduling out while in atomic context.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
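
For illustration only, a userspace sketch of the check that fires in
the trace from the cover letter. It assumes a simplified model in which
a sleepable allocation is reported whenever a preemption counter is
non-zero; the _model names are hypothetical stand-ins, not the real
allocator or __might_resched().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the two allocation modes discussed above. */
enum gfp_model { GFP_ATOMIC_MODEL, GFP_KERNEL_MODEL };

static int preempt_count_model;		/* non-zero: atomic context */
static int sleep_in_atomic_reports;	/* models the "might sleep" splat */

static void *get_free_page_model(enum gfp_model gfp)
{
	/* A sleepable allocation in atomic context is what gets reported. */
	if (gfp == GFP_KERNEL_MODEL && preempt_count_model > 0)
		sleep_in_atomic_reports++;
	return malloc(4096);
}

/* Models one callback invocation made while lazy MMU mode is active. */
static int populate_one_pte_model(enum gfp_model gfp)
{
	void *page;

	sleep_in_atomic_reports = 0;
	preempt_count_model = 1;	/* lazy MMU mode entered, preemption off */
	page = get_free_page_model(gfp);
	preempt_count_model = 0;	/* lazy MMU mode left */
	free(page);

	return sleep_in_atomic_reports;
}
```

The sketch reports once for the GFP_KERNEL-like mode and not at all for
the GFP_ATOMIC-like one. The trade-off, which the model does not show,
is that an atomic allocation cannot wait for reclaim and so may fail
more readily; the callback already returns -ENOMEM in that case.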
 mm/kasan/shadow.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	if (likely(!pte_none(ptep_get(ptep))))
 		return 0;
 
-	page = __get_free_page(GFP_KERNEL);
+	page = __get_free_page(GFP_ATOMIC);
 	if (!page)
 		return -ENOMEM;
 
-- 
2.45.2




* [PATCH 2/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context
  2025-03-28  9:13 [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 1/4] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
@ 2025-03-28  9:13 ` Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 3/4] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 4/4] mm: Protect kernel pgtables in apply_to_pte_range() Alexander Gordeev
  3 siblings, 0 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge

Lazy MMU batching may only be entered and left under the
protection of the page table locks for all page tables which may
be modified. Yet, there were cases where arch_enter_lazy_mmu_mode()
was called without the locks taken, e.g. as addressed by commit
b9ef323ea168 ("powerpc/64s: Disable preemption in hash lazy mmu mode").

Make the default arch_enter|leave|flush_lazy_mmu_mode() callbacks
complain if preemption is enabled, so that at least such wrong
contexts are detected.

Most platforms do not implement the callbacks, so to avoid a
performance impact, allow the complaint only when the
CONFIG_DEBUG_VM option is enabled.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
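
A rough userspace sketch of the debug-only check pattern, for
illustration. DEBUG_VM_MODEL, VM_WARN_ON_MODEL() and preemptible_model()
are hypothetical stand-ins for CONFIG_DEBUG_VM, VM_WARN_ON() and
preemptible(); the real macros do more than bump a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for CONFIG_DEBUG_VM; build with -DDEBUG_VM_MODEL=0 to disable. */
#ifndef DEBUG_VM_MODEL
#define DEBUG_VM_MODEL 1
#endif

static int preempt_count_model;	/* non-zero while a spinlock is "held" */
static int warnings_model;	/* number of complaints issued */

static bool preemptible_model(void)
{
	return preempt_count_model == 0;
}

#if DEBUG_VM_MODEL
#define VM_WARN_ON_MODEL(cond) do { if (cond) warnings_model++; } while (0)
#else
/* Condition is type-checked but never evaluated, so it costs nothing. */
#define VM_WARN_ON_MODEL(cond) do { (void)sizeof(cond); } while (0)
#endif

static void enter_lazy_model(void)
{
	VM_WARN_ON_MODEL(preemptible_model());
}

static void leave_lazy_model(void)
{
	VM_WARN_ON_MODEL(preemptible_model());
}

/* Returns how many complaints one enter/leave pair produces. */
static int warnings_for(bool lock_held)
{
	warnings_model = 0;
	preempt_count_model = lock_held ? 1 : 0;
	enter_lazy_model();
	leave_lazy_model();
	preempt_count_model = 0;

	return warnings_model;
}
```

With the debug switch on, an enter/leave pair without the lock produces
two complaints and a pair under the lock produces none. With the switch
off, both helpers compile to nothing, which is the point of gating the
check.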
 include/linux/pgtable.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372..6669f977e368 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,9 +228,18 @@ static inline int pmd_dirty(pmd_t pmd)
  * it must synchronize the delayed page table writes properly on other CPUs.
  */
 #ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode()	do {} while (0)
-#define arch_leave_lazy_mmu_mode()	do {} while (0)
-#define arch_flush_lazy_mmu_mode()	do {} while (0)
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+	VM_WARN_ON(preemptible());
+}
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+	VM_WARN_ON(preemptible());
+}
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+	VM_WARN_ON(preemptible());
+}
 #endif
 
 #ifndef pte_batch_hint
-- 
2.45.2




* [PATCH 3/4] mm: Cleanup apply_to_pte_range() routine
  2025-03-28  9:13 [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 1/4] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 2/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context Alexander Gordeev
@ 2025-03-28  9:13 ` Alexander Gordeev
  2025-03-28  9:13 ` [PATCH 4/4] mm: Protect kernel pgtables in apply_to_pte_range() Alexander Gordeev
  3 siblings, 0 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge

Reverse 'create' vs 'mm == &init_mm' conditions and move
page table mask modification out of the atomic context.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
 mm/memory.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index fb7b8dc75167..00f253404db5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2884,24 +2884,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				     pte_fn_t fn, void *data, bool create,
 				     pgtbl_mod_mask *mask)
 {
+	int err = create ? -ENOMEM : -EINVAL;
 	pte_t *pte, *mapped_pte;
-	int err = 0;
 	spinlock_t *ptl;
 
-	if (create) {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_alloc_kernel_track(pmd, addr, mask) :
-			pte_alloc_map_lock(mm, pmd, addr, &ptl);
+	if (mm == &init_mm) {
+		if (create)
+			pte = pte_alloc_kernel_track(pmd, addr, mask);
+		else
+			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
-			return -ENOMEM;
+			return err;
 	} else {
-		mapped_pte = pte = (mm == &init_mm) ?
-			pte_offset_kernel(pmd, addr) :
-			pte_offset_map_lock(mm, pmd, addr, &ptl);
+		if (create)
+			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+		else
+			pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 		if (!pte)
-			return -EINVAL;
+			return err;
+		mapped_pte = pte;
 	}
 
+	err = 0;
 	arch_enter_lazy_mmu_mode();
 
 	if (fn) {
@@ -2913,12 +2917,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			}
 		} while (addr += PAGE_SIZE, addr != end);
 	}
-	*mask |= PGTBL_PTE_MODIFIED;
 
 	arch_leave_lazy_mmu_mode();
 
 	if (mm != &init_mm)
 		pte_unmap_unlock(mapped_pte, ptl);
+
+	*mask |= PGTBL_PTE_MODIFIED;
+
 	return err;
 }
 
-- 
2.45.2




* [PATCH 4/4] mm: Protect kernel pgtables in apply_to_pte_range()
  2025-03-28  9:13 [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
                   ` (2 preceding siblings ...)
  2025-03-28  9:13 ` [PATCH 3/4] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
@ 2025-03-28  9:13 ` Alexander Gordeev
  3 siblings, 0 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28  9:13 UTC (permalink / raw)
  To: Andrey Ryabinin, Andrew Morton
  Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
	linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
	Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge

The lazy MMU mode can only be entered and left under the protection
of the page table locks for all page tables which may be modified.
Yet, when it comes to kernel mappings, apply_to_pte_range() does not
take any locks. That does not conform to the
arch_enter|leave_lazy_mmu_mode() semantics and could potentially lead
to rescheduling a process while in lazy MMU mode or to racing on
kernel page table updates.

Take init_mm.page_table_lock in apply_to_pte_range() for kernel
mappings and drop the now-redundant locking from the KASAN callbacks.

Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
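
To show why the KASAN callbacks have to drop their own locking once the
walker takes the lock, here is a small userspace sketch. It assumes a
toy non-recursive lock that counts re-acquisition attempts instead of
spinning forever; lock_model, walk_range_model() and both callbacks are
hypothetical and only mirror the shape of the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: counts what a real spinlock would deadlock on. */
struct lock_model {
	bool held;
	int acquisitions;
	int self_deadlocks;
};

static struct lock_model ptl_model;

static bool lock_model_acquire(struct lock_model *l)
{
	if (l->held) {
		l->self_deadlocks++;	/* a real spinlock would hang here */
		return false;
	}
	l->held = true;
	l->acquisitions++;
	return true;
}

static void lock_model_release(struct lock_model *l)
{
	l->held = false;
}

typedef void (*pte_fn_model)(unsigned long addr);

/* Old-style callback: takes the lock itself around the PTE update. */
static void cb_takes_lock(unsigned long addr)
{
	(void)addr;
	if (lock_model_acquire(&ptl_model))
		lock_model_release(&ptl_model);
}

/* New-style callback: relies on the walker holding the lock. */
static void cb_lockless(unsigned long addr)
{
	(void)addr;
}

/* Walker that holds the lock across the whole range, as in this patch. */
static void walk_range_model(pte_fn_model fn, int nr_ptes)
{
	lock_model_acquire(&ptl_model);
	for (int i = 0; i < nr_ptes; i++)
		fn(0x1000UL * (unsigned long)i);
	lock_model_release(&ptl_model);
}

/* Returns how many times a callback tried to re-take the held lock. */
static int self_deadlocks_with(pte_fn_model fn, int nr_ptes)
{
	ptl_model = (struct lock_model){ 0 };
	walk_range_model(fn, nr_ptes);

	return ptl_model.self_deadlocks;
}
```

In this model the lockless callback runs under a single acquisition per
range, whereas the old-style callback would attempt to re-take the held
lock once per PTE.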
 mm/kasan/shadow.c | 7 ++-----
 mm/memory.c       | 5 ++++-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index edfa77959474..6531a7aa8562 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 	__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
 	pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
 
-	spin_lock(&init_mm.page_table_lock);
 	if (likely(pte_none(ptep_get(ptep)))) {
 		set_pte_at(&init_mm, addr, ptep, pte);
 		page = 0;
 	}
-	spin_unlock(&init_mm.page_table_lock);
+
 	if (page)
 		free_page(page);
+
 	return 0;
 }
 
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 
 	page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
 
-	spin_lock(&init_mm.page_table_lock);
-
 	if (likely(!pte_none(ptep_get(ptep)))) {
 		pte_clear(&init_mm, addr, ptep);
 		free_page(page);
 	}
-	spin_unlock(&init_mm.page_table_lock);
 
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index 00f253404db5..c000377cad0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2895,6 +2895,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			pte = pte_offset_kernel(pmd, addr);
 		if (!pte)
 			return err;
+		spin_lock(&init_mm.page_table_lock);
 	} else {
 		if (create)
 			pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2920,7 +2921,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 
 	arch_leave_lazy_mmu_mode();
 
-	if (mm != &init_mm)
+	if (mm == &init_mm)
+		spin_unlock(&init_mm.page_table_lock);
+	else
 		pte_unmap_unlock(mapped_pte, ptl);
 
 	*mask |= PGTBL_PTE_MODIFIED;
-- 
2.45.2


