* [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context
2025-04-08 16:07 [PATCH v2 0/3] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
@ 2025-04-08 16:07 ` Alexander Gordeev
2025-04-09 14:10 ` Andrey Ryabinin
2025-04-08 16:07 ` [PATCH v2 2/3] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
2025-04-08 16:07 ` [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range() Alexander Gordeev
2 siblings, 1 reply; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-08 16:07 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
apply_to_page_range() enters lazy MMU mode and then invokes the
kasan_populate_vmalloc_pte() callback on each page table walk
iteration. Lazy MMU mode may only be entered under protection of
the page table lock. However, the callback can sleep when trying
to allocate a single page.
Change __get_free_page() allocation mode from GFP_KERNEL to
GFP_ATOMIC to avoid scheduling out while in atomic context.
Cc: stable@vger.kernel.org
Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/kasan/shadow.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 88d1c9dcb507..edfa77959474 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
if (likely(!pte_none(ptep_get(ptep))))
return 0;
- page = __get_free_page(GFP_KERNEL);
+ page = __get_free_page(GFP_ATOMIC);
if (!page)
return -ENOMEM;
--
2.45.2
* Re: [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context
2025-04-08 16:07 ` [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
@ 2025-04-09 14:10 ` Andrey Ryabinin
2025-04-09 14:25 ` Alexander Gordeev
0 siblings, 1 reply; 10+ messages in thread
From: Andrey Ryabinin @ 2025-04-09 14:10 UTC (permalink / raw)
To: Alexander Gordeev, Andrew Morton
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
On 4/8/25 6:07 PM, Alexander Gordeev wrote:
> apply_to_page_range() enters lazy MMU mode and then invokes the
> kasan_populate_vmalloc_pte() callback on each page table walk
> iteration. Lazy MMU mode may only be entered under protection of
> the page table lock. However, the callback can sleep when trying
> to allocate a single page.
>
> Change __get_free_page() allocation mode from GFP_KERNEL to
> GFP_ATOMIC to avoid scheduling out while in atomic context.
>
> Cc: stable@vger.kernel.org
> Fixes: 3c5c3cfb9ef4 ("kasan: support backing vmalloc space with real shadow memory")
> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
> ---
> mm/kasan/shadow.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
> index 88d1c9dcb507..edfa77959474 100644
> --- a/mm/kasan/shadow.c
> +++ b/mm/kasan/shadow.c
> @@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> if (likely(!pte_none(ptep_get(ptep))))
> return 0;
>
> - page = __get_free_page(GFP_KERNEL);
> + page = __get_free_page(GFP_ATOMIC);
> if (!page)
> return -ENOMEM;
>
I think a better way to fix this would be to move the allocation out of atomic context: allocate the page
prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN
and allocate another one.
* Re: [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context
2025-04-09 14:10 ` Andrey Ryabinin
@ 2025-04-09 14:25 ` Alexander Gordeev
2025-04-09 14:56 ` Andrey Ryabinin
0 siblings, 1 reply; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-09 14:25 UTC (permalink / raw)
To: Andrey Ryabinin
Cc: Andrew Morton, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390,
stable
On Wed, Apr 09, 2025 at 04:10:58PM +0200, Andrey Ryabinin wrote:
Hi Andrey,
> > @@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> > if (likely(!pte_none(ptep_get(ptep))))
> > return 0;
> >
> > - page = __get_free_page(GFP_KERNEL);
> > + page = __get_free_page(GFP_ATOMIC);
> > if (!page)
> > return -ENOMEM;
> >
>
> I think a better way to fix this would be to move the allocation out of atomic context: allocate the page
> prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
> Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN
> and allocate another one.
When would it be needed? kasan_populate_vmalloc_pte() handles just one page.
Thanks!
* Re: [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context
2025-04-09 14:25 ` Alexander Gordeev
@ 2025-04-09 14:56 ` Andrey Ryabinin
2025-04-10 15:18 ` Alexander Gordeev
0 siblings, 1 reply; 10+ messages in thread
From: Andrey Ryabinin @ 2025-04-09 14:56 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Andrew Morton, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390,
stable
On 4/9/25 4:25 PM, Alexander Gordeev wrote:
> On Wed, Apr 09, 2025 at 04:10:58PM +0200, Andrey Ryabinin wrote:
>
> Hi Andrey,
>
>>> @@ -301,7 +301,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
>>> if (likely(!pte_none(ptep_get(ptep))))
>>> return 0;
>>>
>>> - page = __get_free_page(GFP_KERNEL);
>>> + page = __get_free_page(GFP_ATOMIC);
>>> if (!page)
>>> return -ENOMEM;
>>>
>>
>> I think a better way to fix this would be to move the allocation out of atomic context: allocate the page
>> prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
>
> I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
We'll need to pass it as 'struct page **page' or maybe as a pointer to some struct, e.g.:
struct page_data {
struct page *page;
};
So, the kasan_populate_vmalloc_pte() would do something like this:
static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr, void *data)
{
	struct page_data *page_data = data;

	if (likely(!pte_none(ptep_get(ptep))))
		return 0;

	if (!page_data->page)
		return -EAGAIN;

	/* use page_data->page to set the pte */

	/*
	 * NULLify the pointer so that the next kasan_populate_vmalloc_pte()
	 * call bails out and a new page gets allocated.
	 */
	page_data->page = NULL;

	return 0;
}
And it might be a good idea to add 'last_addr' to page_data, so that we know where we stopped
and the next apply_to_page_range() call can continue from there instead of starting from the beginning.
>
>> Whenever kasan_populate_vmalloc_pte() requires an additional page, we could bail out with -EAGAIN
>> and allocate another one.
>
> When would it be needed? kasan_populate_vmalloc_pte() handles just one page.
>
apply_to_page_range() goes over a range of addresses and calls kasan_populate_vmalloc_pte()
multiple times (each time with a different 'addr' but the same '*unused' arg). Things will go wrong
if you use the same page multiple times for different addresses.
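To illustrate, here is a minimal sketch of the caller-side retry loop such a scheme implies. The __kasan_populate_vmalloc() name, the exact page_data layout (an unsigned long to match the existing __get_free_page() usage, plus the suggested last_addr field), and the assumption that the callback records the address it bailed out at in 'last_addr' are all illustrative, not the final implementation:

struct page_data {
	unsigned long page;	/* from __get_free_page(); 0 once consumed */
	unsigned long last_addr;	/* set by the callback before returning -EAGAIN */
};

static int __kasan_populate_vmalloc(unsigned long start, unsigned long end)
{
	struct page_data data = { .page = 0, .last_addr = start };
	int ret;

	do {
		if (!data.page) {
			/* Sleepable allocation, done outside the lazy MMU section. */
			data.page = __get_free_page(GFP_KERNEL);
			if (!data.page)
				return -ENOMEM;
			__memset((void *)data.page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
		}
		/* Resume the walk at the address where the previous call bailed out. */
		ret = apply_to_page_range(&init_mm, data.last_addr,
					  end - data.last_addr,
					  kasan_populate_vmalloc_pte, &data);
	} while (ret == -EAGAIN);

	/* The last allocated page may not have been consumed. */
	if (data.page)
		free_page(data.page);

	return ret;
}

The key point is that __get_free_page(GFP_KERNEL) is only ever called outside the apply_to_page_range() walk, i.e. outside the lazy MMU section.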
> Thanks!
* Re: [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context
2025-04-09 14:56 ` Andrey Ryabinin
@ 2025-04-10 15:18 ` Alexander Gordeev
0 siblings, 0 replies; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-10 15:18 UTC (permalink / raw)
To: Andrey Ryabinin, Andrew Morton
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
On Wed, Apr 09, 2025 at 04:56:29PM +0200, Andrey Ryabinin wrote:
Hi Andrey,
...
> >>> - page = __get_free_page(GFP_KERNEL);
> >>> + page = __get_free_page(GFP_ATOMIC);
> >>> if (!page)
> >> I think a better way to fix this would be to move the allocation out of atomic context: allocate the page
> >> prior to the apply_to_page_range() call and pass it down to kasan_populate_vmalloc_pte().
> > I think the page address could be passed as the parameter to kasan_populate_vmalloc_pte().
>
> We'll need to pass it as 'struct page **page' or maybe as a pointer to some struct, e.g.:
> struct page_data {
> struct page *page;
> };
...
Thanks for the hint! I will try to implement that, but will likely start
in two weeks, after I am back from vacation.
Not sure whether this version needs to be dropped.
Thanks!
* [PATCH v2 2/3] mm: Cleanup apply_to_pte_range() routine
2025-04-08 16:07 [PATCH v2 0/3] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
2025-04-08 16:07 ` [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
@ 2025-04-08 16:07 ` Alexander Gordeev
2025-04-08 16:07 ` [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range() Alexander Gordeev
2 siblings, 0 replies; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-08 16:07 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
Reverse the 'create' vs 'mm == &init_mm' conditions and move
the page table mask modification out of the atomic context.
This is a prerequisite for fixing the missing kernel page
table lock.
Cc: stable@vger.kernel.org
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/memory.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2d8c265fc7d6..f0201c8ec1ce 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2915,24 +2915,28 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte_fn_t fn, void *data, bool create,
pgtbl_mod_mask *mask)
{
+ int err = create ? -ENOMEM : -EINVAL;
pte_t *pte, *mapped_pte;
- int err = 0;
spinlock_t *ptl;
- if (create) {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_alloc_kernel_track(pmd, addr, mask) :
- pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ if (mm == &init_mm) {
+ if (create)
+ pte = pte_alloc_kernel_track(pmd, addr, mask);
+ else
+ pte = pte_offset_kernel(pmd, addr);
if (!pte)
- return -ENOMEM;
+ return err;
} else {
- mapped_pte = pte = (mm == &init_mm) ?
- pte_offset_kernel(pmd, addr) :
- pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (create)
+ pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
+ else
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (!pte)
- return -EINVAL;
+ return err;
+ mapped_pte = pte;
}
+ err = 0;
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2944,12 +2948,14 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
}
} while (addr += PAGE_SIZE, addr != end);
}
- *mask |= PGTBL_PTE_MODIFIED;
arch_leave_lazy_mmu_mode();
if (mm != &init_mm)
pte_unmap_unlock(mapped_pte, ptl);
+
+ *mask |= PGTBL_PTE_MODIFIED;
+
return err;
}
--
2.45.2
* [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range()
2025-04-08 16:07 [PATCH v2 0/3] mm: Fix apply_to_pte_range() vs lazy MMU mode Alexander Gordeev
2025-04-08 16:07 ` [PATCH v2 1/3] kasan: Avoid sleepable page allocation from atomic context Alexander Gordeev
2025-04-08 16:07 ` [PATCH v2 2/3] mm: Cleanup apply_to_pte_range() routine Alexander Gordeev
@ 2025-04-08 16:07 ` Alexander Gordeev
2025-04-10 14:50 ` Alexander Gordeev
2 siblings, 1 reply; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-08 16:07 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
The lazy MMU mode can only be entered and left under the protection
of the page table locks for all page tables which may be modified.
Yet, when it comes to kernel mappings, apply_to_pte_range() does not
take any locks. That does not conform to arch_enter|leave_lazy_mmu_mode()
semantics and could potentially lead to re-scheduling a process while
in lazy MMU mode or to racing on kernel page table updates.
Cc: stable@vger.kernel.org
Fixes: 38e0edb15bd0 ("mm/apply_to_range: call pte function with lazy updates")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
---
mm/kasan/shadow.c | 7 ++-----
mm/memory.c | 5 ++++-
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index edfa77959474..6531a7aa8562 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -308,14 +308,14 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
__memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
- spin_lock(&init_mm.page_table_lock);
if (likely(pte_none(ptep_get(ptep)))) {
set_pte_at(&init_mm, addr, ptep, pte);
page = 0;
}
- spin_unlock(&init_mm.page_table_lock);
+
if (page)
free_page(page);
+
return 0;
}
@@ -401,13 +401,10 @@ static int kasan_depopulate_vmalloc_pte(pte_t *ptep, unsigned long addr,
page = (unsigned long)__va(pte_pfn(ptep_get(ptep)) << PAGE_SHIFT);
- spin_lock(&init_mm.page_table_lock);
-
if (likely(!pte_none(ptep_get(ptep)))) {
pte_clear(&init_mm, addr, ptep);
free_page(page);
}
- spin_unlock(&init_mm.page_table_lock);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index f0201c8ec1ce..1f3727104e99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2926,6 +2926,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
pte = pte_offset_kernel(pmd, addr);
if (!pte)
return err;
+ spin_lock(&init_mm.page_table_lock);
} else {
if (create)
pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
@@ -2951,7 +2952,9 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
- if (mm != &init_mm)
+ if (mm == &init_mm)
+ spin_unlock(&init_mm.page_table_lock);
+ else
pte_unmap_unlock(mapped_pte, ptl);
*mask |= PGTBL_PTE_MODIFIED;
--
2.45.2
* Re: [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range()
2025-04-08 16:07 ` [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range() Alexander Gordeev
@ 2025-04-10 14:50 ` Alexander Gordeev
2025-04-10 22:47 ` Andrew Morton
0 siblings, 1 reply; 10+ messages in thread
From: Alexander Gordeev @ 2025-04-10 14:50 UTC (permalink / raw)
To: Andrew Morton, Andrey Ryabinin
Cc: Hugh Dickins, Nicholas Piggin, Guenter Roeck, Juergen Gross,
Jeremy Fitzhardinge, linux-kernel, linux-mm, kasan-dev,
sparclinux, xen-devel, linuxppc-dev, linux-s390, stable
On Tue, Apr 08, 2025 at 06:07:32PM +0200, Alexander Gordeev wrote:
Hi Andrew,
> The lazy MMU mode can only be entered and left under the protection
> of the page table locks for all page tables which may be modified.
Heiko Carstens noticed that the above claim is no longer valid since
v6.15-rc1 commit 691ee97e1a9d ("mm: fix lazy mmu docs and usage"),
which restates it as:
"In the general case, no lock is guaranteed to be held between entry and exit
of the lazy mode. So the implementation must assume preemption may be enabled"
That effectively invalidates this patch, so it needs to be dropped.
Patch 2 could still be fine, except for the -stable and Fixes tags, and it
does not need to aim at 6.15-rcX. Do you want me to repost it?
Thanks!
* Re: [PATCH v2 3/3] mm: Protect kernel pgtables in apply_to_pte_range()
2025-04-10 14:50 ` Alexander Gordeev
@ 2025-04-10 22:47 ` Andrew Morton
0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2025-04-10 22:47 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Andrey Ryabinin, Hugh Dickins, Nicholas Piggin, Guenter Roeck,
Juergen Gross, Jeremy Fitzhardinge, linux-kernel, linux-mm,
kasan-dev, sparclinux, xen-devel, linuxppc-dev, linux-s390,
stable
On Thu, 10 Apr 2025 16:50:33 +0200 Alexander Gordeev <agordeev@linux.ibm.com> wrote:
> On Tue, Apr 08, 2025 at 06:07:32PM +0200, Alexander Gordeev wrote:
>
> Hi Andrew,
>
> > The lazy MMU mode can only be entered and left under the protection
> > of the page table locks for all page tables which may be modified.
>
> Heiko Carstens noticed that the above claim is no longer valid since
> v6.15-rc1 commit 691ee97e1a9d ("mm: fix lazy mmu docs and usage"),
> which restates it as:
>
> "In the general case, no lock is guaranteed to be held between entry and exit
> of the lazy mode. So the implementation must assume preemption may be enabled"
>
> That effectively invalidates this patch, so it needs to be dropped.
>
> Patch 2 could still be fine, except for the -stable and Fixes tags, and it
> does not need to aim at 6.15-rcX. Do you want me to repost it?
I dropped the whole series - let's start again.