* [PATCH 1/4] kasan: Avoid sleepable page allocation from atomic context
2025-03-28 9:13 [PATCH 0/4] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
@ 2025-03-28 9:13 ` Alexander Gordeev
2025-03-28 9:13 ` [PATCH 2/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context Alexander Gordeev
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Alexander Gordeev @ 2025-03-28 9:13 UTC
To: Andrey Ryabinin, Andrew Morton
Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge
apply_to_page_range() enters lazy MMU mode and then invokes the
kasan_populate_vmalloc_pte() callback on each page table walk
iteration. Lazy MMU mode may only be entered under the
protection of the page table lock. However, the callback can
sleep when trying to allocate a single page.

Change the __get_free_page() allocation mode from GFP_KERNEL to
GFP_ATOMIC to avoid scheduling out while in atomic context.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
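For illustration only (not part of the patch): a minimal userspace
sketch of why a sleepable allocation is a problem inside the callback.
All model_* names are hypothetical stand-ins, model_atomic_depth merely
imitates "page table lock held", and the kernel's own debugging checks
are not modelled.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's GFP_KERNEL / GFP_ATOMIC. */
enum model_gfp { MODEL_GFP_KERNEL, MODEL_GFP_ATOMIC };

static int model_atomic_depth;	/* > 0 imitates "page table lock held" */
static int model_sleep_bugs;	/* counts sleepable requests made while atomic */

static void model_enter_atomic(void) { model_atomic_depth++; }
static void model_leave_atomic(void) { model_atomic_depth--; }

/*
 * Imitates __get_free_page(): a request that is allowed to sleep is
 * recorded as a bug when it is issued from atomic context.
 */
static void *model_get_free_page(enum model_gfp gfp)
{
	if (gfp == MODEL_GFP_KERNEL && model_atomic_depth > 0)
		model_sleep_bugs++;
	return malloc(4096);
}

/* Imitates a per-PTE callback that needs one page. */
static int model_populate_cb(enum model_gfp gfp)
{
	void *page = model_get_free_page(gfp);

	if (!page)
		return -ENOMEM;
	free(page);
	return 0;
}
```

Calling model_populate_cb(MODEL_GFP_KERNEL) between model_enter_atomic()
and model_leave_atomic() bumps model_sleep_bugs, while the
MODEL_GFP_ATOMIC variant does not. The trade-off is that an atomic
allocation cannot wait for memory to be reclaimed and so is more likely
to fail; the callback in the patch already returns -ENOMEM in that case.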
mm/kasan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!pte_none(ptep_get(ptep))))
return 0;
- page = __get_free_page(GFP_KERNEL);
+ page = __get_free_page(GFP_ATOMIC);
if (!page)
return -ENOMEM;
--
2.45.2
* [PATCH 2/4] mm: Allow detection of wrong arch_enter_lazy_mmu_mode() context
From: Alexander Gordeev @ 2025-03-28 9:13 UTC
To: Andrey Ryabinin, Andrew Morton
Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge
Lazy MMU batching may only be entered and left under the
protection of the page table locks for all page tables which may
be modified. Yet, there were cases where arch_enter_lazy_mmu_mode()
was called without the locks taken, e.g. commit b9ef323ea168
("powerpc/64s: Disable preemption in hash lazy mmu mode").

Make the default arch_enter|leave|flush_lazy_mmu_mode() callbacks
warn when preemption is enabled, so that at least this kind of
wrong context is detected.

Most platforms do not implement the callbacks, so to avoid a
performance impact, issue the warning only when the CONFIG_DEBUG_VM
option is enabled.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
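For illustration only (not part of the patch): a minimal userspace
sketch of a check that costs nothing unless a debug option is set.
MODEL_DEBUG_VM, MODEL_VM_WARN_ON() and the other model_* names are
hypothetical stand-ins for CONFIG_DEBUG_VM, VM_WARN_ON() and
preemptible(); the real preemptible() also depends on interrupt state
and kernel configuration, which is not modelled here.

```c
#include <stdbool.h>

/* Hypothetical switch imitating CONFIG_DEBUG_VM; enabled by default here. */
#ifndef MODEL_DEBUG_VM
#define MODEL_DEBUG_VM 1
#endif

static int model_preempt_count;	/* > 0 imitates "preemption disabled" */
static int model_warnings;	/* counts warnings that were raised */

static bool model_preemptible(void)
{
	return model_preempt_count == 0;
}

#if MODEL_DEBUG_VM
/* Debug build: evaluate the condition and record a warning. */
#define MODEL_VM_WARN_ON(cond) do { if (cond) model_warnings++; } while (0)
#else
/* Non-debug build: type-check the condition without evaluating it. */
#define MODEL_VM_WARN_ON(cond) do { (void)sizeof(cond); } while (0)
#endif

/* Default (no-op) callbacks that only complain about a wrong context. */
static inline void model_enter_lazy_mmu_mode(void)
{
	MODEL_VM_WARN_ON(model_preemptible());
}

static inline void model_leave_lazy_mmu_mode(void)
{
	MODEL_VM_WARN_ON(model_preemptible());
}

static inline void model_flush_lazy_mmu_mode(void)
{
	MODEL_VM_WARN_ON(model_preemptible());
}
```

With model_preempt_count at zero every call records a warning; once the
count is raised, as it would be under a spinlock, the calls stay silent.
Building with -DMODEL_DEBUG_VM=0 turns the check into an expression
that is never evaluated.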
include/linux/pgtable.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d02372..6669f977e368 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -228,9 +228,18 @@ static inline int pmd_dirty(pmd_t pmd)
* it must synchronize the delayed page table writes properly on other CPUs.
*/
#ifndef __HAVE_ARCH_ENTER_LAZY_MMU_MODE
-#define arch_enter_lazy_mmu_mode() do {} while (0)
-#define arch_leave_lazy_mmu_mode() do {} while (0)
-#define arch_flush_lazy_mmu_mode() do {} while (0)
+static inline void arch_enter_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
+static inline void arch_leave_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
+static inline void arch_flush_lazy_mmu_mode(void)
+{
+ VM_WARN_ON(preemptible());
+}
#endif
#ifndef pte_batch_hint
--
2.45.2
* [PATCH 3/4] mm: Cleanup apply_to_pte_range() routine
From: Alexander Gordeev @ 2025-03-28 9:13 UTC
To: Andrey Ryabinin, Andrew Morton
Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge
Reverse 'create' vs 'mm == &init_mm' conditions and move
page table mask modification out of the atomic context.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
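For illustration only (not part of the patch): a minimal userspace
sketch of the "preset the error code, then pick the lookup" shape that
the cleanup gives the function. The slot table and all model_* names
are hypothetical; only the err initialisation mirrors the diff.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_SLOTS 4

static int model_table[MODEL_SLOTS];
static bool model_present[MODEL_SLOTS];

/* Lookup only: fails if the slot was never created. */
static int *model_lookup(size_t idx)
{
	return (idx < MODEL_SLOTS && model_present[idx]) ?
		&model_table[idx] : NULL;
}

/* Lookup or create: fails only if the slot cannot be provided. */
static int *model_alloc(size_t idx)
{
	if (idx >= MODEL_SLOTS)
		return NULL;
	model_present[idx] = true;
	return &model_table[idx];
}

static int model_apply(size_t idx, bool create, int value)
{
	/* The failure code depends only on 'create', so set it up front. */
	int err = create ? -ENOMEM : -EINVAL;
	int *slot;

	slot = create ? model_alloc(idx) : model_lookup(idx);
	if (!slot)
		return err;

	err = 0;
	*slot = value;
	return err;
}
```

Both failure paths now share a single "return err", which is what
allows the patch to branch on the mm first and on 'create' second
without duplicating the error handling.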
mm/memory.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index fb7b8dc75167..00f253404db5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2884,24 +2884,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_fn_t fn, void *data, bool create,
pgtbl_mod_mask *mask)
{
+ int err = create ? -ENOMEM : -EINVAL;
pte_t *pte, *mapped_pte;
- int err = 0;
spinlock_t *ptl;
- if (create) {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_alloc_kernel_track(pmd, addr, mask) :
- pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ if (mm == &init_mm) {
+ if (create)
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ else
+ pte = pte_offset_kernel(pmd, addr);
if (!pte)
- return -ENOMEM;
+ return err;
} else {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_offset_kernel(pmd, addr) :
- pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (create)
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ else
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- return -EINVAL;
+ return err;
+ mapped_pte = pte;
}
+ err = 0;
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2913,12 +2917,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
}
} while (addr += PAGE_SIZE, addr != end);
}
- *mask |= PGTBL_PTE_MODIFIED;
arch_leave_lazy_mmu_mode();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
+
+ *mask |= PGTBL_PTE_MODIFIED;
+
return err;
}
--
2.45.2
* [PATCH 4/4] mm: Protect kernel pgtables in apply_to_pte_range()
From: Alexander Gordeev @ 2025-03-28 9:13 UTC
To: Andrey Ryabinin, Andrew Morton
Cc: linux-kernel, linux-mm, kasan-dev, sparclinux, xen-devel,
linuxppc-dev, linux-s390, Hugh Dickins, Nicholas Piggin,
Guenter Roeck, Juergen Gross, Jeremy Fitzhardinge
Lazy MMU mode can only be entered and left under the protection
of the page table locks for all page tables which may be modified.
Yet, when it comes to kernel mappings, apply_to_pte_range() does not
take any locks. That does not conform to the
arch_enter|leave_lazy_mmu_mode() semantics and could potentially lead
to rescheduling a process while in lazy MMU mode or to racing on
kernel page table updates.

Take init_mm.page_table_lock around the PTE walk for kernel mappings
and drop the now-redundant locking from the KASAN vmalloc callbacks.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
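For illustration only (not part of the patch): a minimal userspace
sketch of moving the lock from the per-entry callback to the walker.
The flag-based lock and all model_* names are hypothetical; a real
spinlock and real page tables are not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

static bool model_lock_held;		/* imitates init_mm.page_table_lock */
static int model_lock_acquisitions;	/* how many times the lock was taken */
static int model_unlocked_updates;	/* entry updates made without the lock */

static void model_lock(void)
{
	model_lock_held = true;
	model_lock_acquisitions++;
}

static void model_unlock(void)
{
	model_lock_held = false;
}

typedef int (*model_pte_fn)(int *pte);

/* Callback: relies on the caller holding the lock, takes none itself. */
static int model_set_pte(int *pte)
{
	if (!model_lock_held)
		model_unlocked_updates++;
	if (*pte == 0)		/* only fill entries that are still empty */
		*pte = 1;
	return 0;
}

/* Walker: one lock/unlock pair covers the whole range. */
static int model_apply_to_range(int *ptes, size_t n, model_pte_fn fn)
{
	int err = 0;

	model_lock();
	for (size_t i = 0; i < n; i++) {
		err = fn(&ptes[i]);
		if (err)
			break;
	}
	model_unlock();

	return err;
}
```

Walking a three-entry array takes the lock once instead of once per
entry, and the callback never sees an unlocked update. The flip side is
that the callback now runs with the lock held for the whole walk, so it
must not sleep.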
mm/kasan/shadow.c | 7 ++-----
mm/memory.c | 5 ++++-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index edfa77959474..6531a7aa8562 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
if (likely(pte_none(ptep_get(ptep)))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
}
- spin_unlock(&init_mm.page_table_lock);
+
if (page)
free_page(page);
+
return 0;
}
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
- spin_lock(&init_mm.page_table_lock);
-
if (likely(!pte_none(ptep_get(ptep)))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 00f253404db5..c000377cad0c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2895,6 +2895,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte = pte_offset_kernel(pmd, addr);
if (!pte)
return err;
+ spin_lock(&init_mm.page_table_lock);
} else {
if (create)
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2920,7 +2921,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
- if (mm != &init_mm)
+ if (mm == &init_mm)
+ spin_unlock(&init_mm.page_table_lock);
+ else
pte_unmap_unlock(mapped_pte, ptl);
*mask |= PGTBL_PTE_MODIFIED;
--
2.45.2