* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
@ 2026-04-08 9:24 Hrushikesh Salunke
2026-04-08 9:47 ` Vlastimil Babka (SUSE)
2026-04-08 11:32 ` [syzbot ci] " syzbot ci
0 siblings, 2 replies; 6+ messages in thread
From: Hrushikesh Salunke @ 2026-04-08 9:24 UTC (permalink / raw)
To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg, hsalunke
When init_on_alloc is enabled, kernel_init_pages() clears every page
one at a time, calling clear_page() per page. This is unnecessarily
slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
real workloads.
On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
overhead and allowing the arch clearing primitive to operate on the full
contiguous range in a single invocation. The batch size is the full
allocation when the preempt model is preemptible (preemption points are
implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
cond_resched() between batches to limit scheduling latency under
cooperative preemption.
The HIGHMEM path is kept as-is since those pages require kmap.
Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
Before: 0.445s
After: 0.166s (-62.7%, 2.68x faster)
Kernel time (sys) reduction per workload with init_on_alloc=1:
Workload           Before     After      Change
Graph500 64C128T   30m 41.8s  15m 14.8s  -50.3%
Graph500 16C32T    15m 56.7s   9m 43.7s  -39.0%
Pagerank 32T        1m 58.5s   1m 12.8s  -38.5%
Pagerank 128T       2m 36.3s   1m 40.4s  -35.7%
Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
---
base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35
mm/page_alloc.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1c5430cad4e..178cbebadd50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
/* s390's use of memset() could override KASAN redzones. */
kasan_disable_current();
- for (i = 0; i < numpages; i++)
- clear_highpage_kasan_tagged(page + i);
+
+ if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+ void *addr = kasan_reset_tag(page_address(page));
+ unsigned int unit = preempt_model_preemptible() ?
+ numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+ int count;
+
+ for (i = 0; i < numpages; i += count) {
+ cond_resched();
+ count = min_t(int, unit, numpages - i);
+ clear_pages(addr + (i << PAGE_SHIFT), count);
+ }
+ } else {
+ for (i = 0; i < numpages; i++)
+ clear_highpage_kasan_tagged(page + i);
+ }
+
kasan_enable_current();
}
--
2.43.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
2026-04-08 9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
@ 2026-04-08 9:47 ` Vlastimil Babka (SUSE)
2026-04-08 10:44 ` Salunke, Hrushikesh
2026-04-08 11:32 ` [syzbot ci] " syzbot ci
1 sibling, 1 reply; 6+ messages in thread
From: Vlastimil Babka (SUSE) @ 2026-04-08 9:47 UTC (permalink / raw)
To: Hrushikesh Salunke, akpm, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg, David Hildenbrand
On 4/8/26 11:24, Hrushikesh Salunke wrote:
> When init_on_alloc is enabled, kernel_init_pages() clears every page
> one at a time, calling clear_page() per page. This is unnecessarily
> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
> real workloads.
>
> On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
> clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
> overhead and allowing the arch clearing primitive to operate on the full
> contiguous range in a single invocation. The batch size is the full
> allocation when the preempt model is preemptible (preemption points are
> implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
> cond_resched() between batches to limit scheduling latency under
> cooperative preemption.
>
> The HIGHMEM path is kept as-is since those pages require kmap.
>
> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>
> Before: 0.445s
> After: 0.166s (-62.7%, 2.68x faster)
>
> Kernel time (sys) reduction per workload with init_on_alloc=1:
>
> Workload Before After Change
> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>
> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
> ---
> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35
Any way to reuse the code added by [1], e.g. clear_user_highpages()?
[1]
https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
>
> mm/page_alloc.c | 19 +++++++++++++++++--
> 1 file changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b1c5430cad4e..178cbebadd50 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>
> /* s390's use of memset() could override KASAN redzones. */
> kasan_disable_current();
> - for (i = 0; i < numpages; i++)
> - clear_highpage_kasan_tagged(page + i);
> +
> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> + void *addr = kasan_reset_tag(page_address(page));
> + unsigned int unit = preempt_model_preemptible() ?
> + numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
> + int count;
> +
> + for (i = 0; i < numpages; i += count) {
> + cond_resched();
> + count = min_t(int, unit, numpages - i);
> + clear_pages(addr + (i << PAGE_SHIFT), count);
> + }
> + } else {
> + for (i = 0; i < numpages; i++)
> + clear_highpage_kasan_tagged(page + i);
> + }
> +
> kasan_enable_current();
> }
>
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
2026-04-08 9:47 ` Vlastimil Babka (SUSE)
@ 2026-04-08 10:44 ` Salunke, Hrushikesh
2026-04-08 10:53 ` David Hildenbrand (Arm)
2026-04-08 11:16 ` Raghavendra K T
0 siblings, 2 replies; 6+ messages in thread
From: Salunke, Hrushikesh @ 2026-04-08 10:44 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes,
ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg, David Hildenbrand, hsalunke
On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>> one at a time, calling clear_page() per page. This is unnecessarily
>> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
>> real workloads.
>>
>> On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
>> clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
>> overhead and allowing the arch clearing primitive to operate on the full
>> contiguous range in a single invocation. The batch size is the full
>> allocation when the preempt model is preemptible (preemption points are
>> implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
>> cond_resched() between batches to limit scheduling latency under
>> cooperative preemption.
>>
>> The HIGHMEM path is kept as-is since those pages require kmap.
>>
>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>
>> Before: 0.445s
>> After: 0.166s (-62.7%, 2.68x faster)
>>
>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>
>> Workload Before After Change
>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>
>> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
>> ---
>> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35
> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>
> [1]
> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
Thanks for the review. Sure, I will check if code reuse is possible.
Meanwhile I found another issue with the current patch.
kernel_init_pages() runs inside the allocator (post_alloc_hook and
__free_pages_prepare), so it inherits whatever context the caller is in.
Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
page allocation happens while the PTE lock and RCU read lock are held,
making the cond_resched() in the clearing loop illegal:
[ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
[ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
[ 1997.353572] preempt_count: 1, expected: 0
[ 1997.353706] RCU nest depth: 1, expected: 0
[ 1997.353837] 3 locks held by bash/19725:
[ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
[ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
[ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
[ 1997.353868] Call Trace:
[ 1997.353870] <TASK>
[ 1997.353873] dump_stack_lvl+0x91/0xb0
[ 1997.353877] __might_resched+0x15f/0x290
[ 1997.353882] kernel_init_pages+0x4b/0xa0
[ 1997.353886] get_page_from_freelist+0x406/0x1e60
[ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
[ 1997.353912] alloc_pages_mpol+0xa4/0x190
[ 1997.353917] alloc_pages_noprof+0x59/0xd0
[ 1997.353919] get_free_pages_noprof+0x11/0x40
[ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
[ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
[ 1997.353931] unmap_vmas+0xd9/0x1d0
[ 1997.353934] exit_mmap+0x10a/0x430
[ 1997.353943] __mmput+0x3d/0x130
[ 1997.353947] do_exit+0x2a7/0xae0
[ 1997.353951] do_group_exit+0x36/0xa0
[ 1997.353953] __x64_sys_exit_group+0x18/0x20
[ 1997.353959] do_syscall_64+0xe1/0x710
[ 1997.353990] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1997.354003] </TASK>
This also means clear_contig_highpages() can't be directly reused here
since it has an unconditional might_sleep() + cond_resched(). I'll look
into this. Any suggestions on the right way to handle cond_resched()
in a context that may or may not be atomic?
Thanks,
Hrushikesh
>> mm/page_alloc.c | 19 +++++++++++++++++--
>> 1 file changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index b1c5430cad4e..178cbebadd50 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>>
>> /* s390's use of memset() could override KASAN redzones. */
>> kasan_disable_current();
>> - for (i = 0; i < numpages; i++)
>> - clear_highpage_kasan_tagged(page + i);
>> +
>> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>> + void *addr = kasan_reset_tag(page_address(page));
>> + unsigned int unit = preempt_model_preemptible() ?
>> + numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>> + int count;
>> +
>> + for (i = 0; i < numpages; i += count) {
>> + cond_resched();
>> + count = min_t(int, unit, numpages - i);
>> + clear_pages(addr + (i << PAGE_SHIFT), count);
>> + }
>> + } else {
>> + for (i = 0; i < numpages; i++)
>> + clear_highpage_kasan_tagged(page + i);
>> + }
>> +
>> kasan_enable_current();
>> }
>>
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
2026-04-08 10:44 ` Salunke, Hrushikesh
@ 2026-04-08 10:53 ` David Hildenbrand (Arm)
2026-04-08 11:16 ` Raghavendra K T
1 sibling, 0 replies; 6+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-08 10:53 UTC (permalink / raw)
To: Salunke, Hrushikesh, Vlastimil Babka (SUSE), akpm, surenb, mhocko,
jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 4/8/26 12:44, Salunke, Hrushikesh wrote:
>
> On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
>
>> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>>> one at a time, calling clear_page() per page. This is unnecessarily
>>> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
>>> real workloads.
>>>
>>> On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
>>> clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
>>> overhead and allowing the arch clearing primitive to operate on the full
>>> contiguous range in a single invocation. The batch size is the full
>>> allocation when the preempt model is preemptible (preemption points are
>>> implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
>>> cond_resched() between batches to limit scheduling latency under
>>> cooperative preemption.
>>>
>>> The HIGHMEM path is kept as-is since those pages require kmap.
>>>
>>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>>
>>> Before: 0.445s
>>> After: 0.166s (-62.7%, 2.68x faster)
>>>
>>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>>
>>> Workload Before After Change
>>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>>
>>> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
>>> ---
>>> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35
>> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
>
> Thanks for the review. Sure, I will check if code reuse is possible.
> Meanwhile I found another issue with the current patch.
>
> kernel_init_pages() runs inside the allocator (post_alloc_hook and
> __free_pages_prepare), so it inherits whatever context the caller is in.
> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
> page allocation happens while the PTE lock and RCU read lock are held,
> making the cond_resched() in the clearing loop illegal:
>
> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
> [ 1997.353572] preempt_count: 1, expected: 0
> [ 1997.353706] RCU nest depth: 1, expected: 0
> [ 1997.353837] 3 locks held by bash/19725:
> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
> [ 1997.353868] Call Trace:
> [ 1997.353870] <TASK>
> [ 1997.353873] dump_stack_lvl+0x91/0xb0
> [ 1997.353877] __might_resched+0x15f/0x290
> [ 1997.353882] kernel_init_pages+0x4b/0xa0
> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
> [ 1997.353919] get_free_pages_noprof+0x11/0x40
> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
> [ 1997.353931] unmap_vmas+0xd9/0x1d0
> [ 1997.353934] exit_mmap+0x10a/0x430
> [ 1997.353943] __mmput+0x3d/0x130
> [ 1997.353947] do_exit+0x2a7/0xae0
> [ 1997.353951] do_group_exit+0x36/0xa0
> [ 1997.353953] __x64_sys_exit_group+0x18/0x20
> [ 1997.353959] do_syscall_64+0xe1/0x710
> [ 1997.353990] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1997.354003] </TASK>
>
> This also means clear_contig_highpages() can't be directly reused here
> since it has an unconditional might_sleep() + cond_resched(). I'll look
> into this. Any suggestions on the right way to handle cond_resched()
> in a context that may or may not be atomic?
clear_contig_highpages() is prepared to handle arbitrary sizes,
including 1 GiB chunks or even larger.
The question is whether you even have to use
PROCESS_PAGES_NON_PREEMPT_BATCH given that we cannot trigger a manual
resched either way (and the assumption is that memory we are clearing is
not that big. Well, on arm64 it can still be 512 MiB).
So I wonder what happens when you just use clear_pages().
Likely you should provide a clear_highpages_kasan_tagged() and a
clear_highpages() ?
So you would be calling clear_highpages_kasan_tagged() here that would
just default to calling clear_highpages() unless kasan applies etc.
--
Cheers,
David
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
2026-04-08 10:44 ` Salunke, Hrushikesh
2026-04-08 10:53 ` David Hildenbrand (Arm)
@ 2026-04-08 11:16 ` Raghavendra K T
1 sibling, 0 replies; 6+ messages in thread
From: Raghavendra K T @ 2026-04-08 11:16 UTC (permalink / raw)
To: Salunke, Hrushikesh, Vlastimil Babka (SUSE), akpm, surenb, mhocko,
jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, bharata, ankur.a.arora, shivankg,
David Hildenbrand
On 4/8/2026 4:14 PM, Salunke, Hrushikesh wrote:
> On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
>
>> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>>> one at a time, calling clear_page() per page. This is unnecessarily
>>> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
>>> real workloads.
>>>
>>> On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
>>> clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
>>> overhead and allowing the arch clearing primitive to operate on the full
>>> contiguous range in a single invocation. The batch size is the full
>>> allocation when the preempt model is preemptible (preemption points are
>>> implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
>>> cond_resched() between batches to limit scheduling latency under
>>> cooperative preemption.
>>>
>>> The HIGHMEM path is kept as-is since those pages require kmap.
>>>
>>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>>
>>> Before: 0.445s
>>> After: 0.166s (-62.7%, 2.68x faster)
>>>
>>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>>
>>> Workload Before After Change
>>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>>
>>> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
>>> ---
>>> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35
>> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>>
>> [1]
>> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
>
> Thanks for the review. Sure, I will check if code reuse is possible.
> Meanwhile I found another issue with the current patch.
>
> kernel_init_pages() runs inside the allocator (post_alloc_hook and
> __free_pages_prepare), so it inherits whatever context the caller is in.
> Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
> hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
> page allocation happens while the PTE lock and RCU read lock are held,
> making the cond_resched() in the clearing loop illegal:
>
> [ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
> [ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
> [ 1997.353572] preempt_count: 1, expected: 0
> [ 1997.353706] RCU nest depth: 1, expected: 0
> [ 1997.353837] 3 locks held by bash/19725:
> [ 1997.353839] #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
> [ 1997.353850] #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
> [ 1997.353855] #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
> [ 1997.353868] Call Trace:
> [ 1997.353870] <TASK>
> [ 1997.353873] dump_stack_lvl+0x91/0xb0
> [ 1997.353877] __might_resched+0x15f/0x290
> [ 1997.353882] kernel_init_pages+0x4b/0xa0
> [ 1997.353886] get_page_from_freelist+0x406/0x1e60
> [ 1997.353895] __alloc_frozen_pages_noprof+0x1d8/0x1730
> [ 1997.353912] alloc_pages_mpol+0xa4/0x190
> [ 1997.353917] alloc_pages_noprof+0x59/0xd0
> [ 1997.353919] get_free_pages_noprof+0x11/0x40
> [ 1997.353921] __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
> [ 1997.353923] __zap_vma_range+0x1bbd/0x1f40
> [ 1997.353931] unmap_vmas+0xd9/0x1d0
> [ 1997.353934] exit_mmap+0x10a/0x430
> [ 1997.353943] __mmput+0x3d/0x130
> [ 1997.353947] do_exit+0x2a7/0xae0
> [ 1997.353951] do_group_exit+0x36/0xa0
> [ 1997.353953] __x64_sys_exit_group+0x18/0x20
> [ 1997.353959] do_syscall_64+0xe1/0x710
> [ 1997.353990] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 1997.354003] </TASK>
>
> This also means clear_contig_highpages() can't be directly reused here
> since it has an unconditional might_sleep() + cond_resched(). I'll look
> into this. Any suggestions on the right way to handle cond_resched()
> in a context that may or may not be atomic?
>
> Thanks,
> Hrushikesh
>
>>> mm/page_alloc.c | 19 +++++++++++++++++--
>>> 1 file changed, 17 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index b1c5430cad4e..178cbebadd50 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
>>>
>>> /* s390's use of memset() could override KASAN redzones. */
>>> kasan_disable_current();
>>> - for (i = 0; i < numpages; i++)
>>> - clear_highpage_kasan_tagged(page + i);
>>> +
>>> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>> + void *addr = kasan_reset_tag(page_address(page));
>>> + unsigned int unit = preempt_model_preemptible() ?
>>> + numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>>> + int count;
>>> +
>>> + for (i = 0; i < numpages; i += count) {
>>> + cond_resched();
Just thinking: for a preemptible kernel (or preempt_auto), preempt_count()
already knows about preemption points and can decide where to preempt; and
for a non-preemptible or voluntary-preemption kernel it is safe to
reschedule at PROCESS_PAGES_NON_PREEMPT_BATCH granularity. Given that,
do we need cond_resched() here?
Let me know if I am missing something.
>>> + count = min_t(int, unit, numpages - i);
>>> + clear_pages(addr + (i << PAGE_SHIFT), count);
>>> + }
>>> + } else {
>>> + for (i = 0; i < numpages; i++)
>>> + clear_highpage_kasan_tagged(page + i);
>>> + }
>>> +
>>> kasan_enable_current();
>>> }
>>>
Regards
- Raghu
* [syzbot ci] Re: mm/page_alloc: use batch page clearing in kernel_init_pages()
2026-04-08 9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
2026-04-08 9:47 ` Vlastimil Babka (SUSE)
@ 2026-04-08 11:32 ` syzbot ci
1 sibling, 0 replies; 6+ messages in thread
From: syzbot ci @ 2026-04-08 11:32 UTC (permalink / raw)
To: akpm, ankur.a.arora, bharata, hannes, hsalunke, jackmanb,
linux-kernel, linux-mm, mhocko, rkodsara, shivankg, surenb,
vbabka, ziy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] mm/page_alloc: use batch page clearing in kernel_init_pages()
https://lore.kernel.org/all/20260408092441.435133-1-hsalunke@amd.com
* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
and found the following issue:
WARNING in preempt_model_full
Full report is available here:
https://ci.syzbot.org/series/be6c0534-641b-42aa-8b73-ab8f592ec267
***
WARNING in preempt_model_full
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 0d90551ea699ef3d1a85cd7a1a7e21e8d4f04db2
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/22cbd293-31c6-46b9-b8d4-2ff590ce406b/config
CPU topo: Max. logical packages: 2
CPU topo: Max. logical nodes: 1
CPU topo: Num. nodes per package: 1
CPU topo: Max. logical dies: 2
CPU topo: Max. dies per package: 1
CPU topo: Max. threads per core: 1
CPU topo: Num. cores per package: 1
CPU topo: Num. threads per package: 1
CPU topo: Allowing 2 present CPUs plus 0 hotplug CPUs
kvm-guest: APIC: eoi() replaced with kvm_guest_apic_eoi_write()
PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x000fffff]
PM: hibernation: Registered nosave memory: [mem 0x7ffdf000-0xffffffff]
[gap 0xc0000000-0xfed1bfff] available for PCI devices
Booting paravirtualized kernel on KVM
clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
Zone ranges:
DMA [mem 0x0000000000001000-0x0000000000ffffff]
DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
Normal [mem 0x0000000100000000-0x000000023fffffff]
Device empty
Movable zone start for each node
Early memory node ranges
node 0: [mem 0x0000000000001000-0x000000000009efff]
node 0: [mem 0x0000000000100000-0x000000007ffdefff]
node 0: [mem 0x0000000100000000-0x0000000160000fff]
node 1: [mem 0x0000000160001000-0x000000023fffffff]
Initmem setup node 0 [mem 0x0000000000001000-0x0000000160000fff]
Initmem setup node 1 [mem 0x0000000160001000-0x000000023fffffff]
On node 0, zone DMA: 1 pages in unavailable ranges
On node 0, zone DMA: 97 pages in unavailable ranges
On node 0, zone Normal: 33 pages in unavailable ranges
setup_percpu: NR_CPUS:8 nr_cpumask_bits:2 nr_cpu_ids:2 nr_node_ids:2
percpu: Embedded 71 pages/cpu s250120 r8192 d32504 u2097152
kvm-guest: PV spinlocks disabled, no host support
Kernel command line: earlyprintk=serial net.ifnames=0 sysctl.kernel.hung_task_all_cpu_backtrace=1 ima_policy=tcb nf-conntrack-ftp.ports=20000 nf-conntrack-tftp.ports=20000 nf-conntrack-sip.ports=20000 nf-conntrack-irc.ports=20000 nf-conntrack-sane.ports=20000 binder.debug_mask=0 rcupdate.rcu_expedited=1 rcupdate.rcu_cpu_stall_cputime=1 no_hash_pointers page_owner=on sysctl.vm.nr_hugepages=4 sysctl.vm.nr_overcommit_hugepages=4 secretmem.enable=1 sysctl.max_rcu_stall_to_panic=1 msr.allow_writes=off coredump_filter=0xffff root=/dev/sda console=ttyS0 vsyscall=native numa=fake=2 kvm-intel.nested=1 spec_store_bypass_disable=prctl nopcid vivid.n_devs=64 vivid.multiplanar=1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2 netrom.nr_ndevs=32 rose.rose_ndevs=32 smp.csd_lock_timeout=100000 watchdog_thresh=55 workqueue.watchdog_thresh=140 sysctl.net.core.netdev_unregister_timeout_secs=140 dummy_hcd.num=32 max_loop=32 nbds_max=32 \
Kernel command line: comedi.comedi_num_legacy_minors=4 panic_on_warn=1 root=/dev/sda console=ttyS0 root=/dev/sda1
Unknown kernel command line parameters "nbds_max=32", will be passed to user space.
printk: log buffer data + meta data: 262144 + 917504 = 1179648 bytes
software IO TLB: area num 2.
Fallback order for Node 0: 0 1
Fallback order for Node 1: 1 0
Built 2 zonelists, mobility grouping on. Total pages: 1834877
Policy zone: Normal
mem auto-init: stack:all(zero), heap alloc:on, heap free:off
stackdepot: allocating hash table via alloc_large_system_hash
stackdepot hash table entries: 1048576 (order: 12, 16777216 bytes, linear)
stackdepot: allocating space for 8192 stack pools via memblock
------------[ cut here ]------------
preempt_dynamic_mode == preempt_dynamic_undefined
WARNING: kernel/sched/core.c:7743 at preempt_model_full+0x1e/0x30, CPU#0: swapper/0
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted syzkaller #0 PREEMPT(undef)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:preempt_model_full+0x1e/0x30
Code: 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 83 3d 35 d9 cb 0c ff 74 10 83 3d 2c d9 cb 0c 02 0f 94 c0 2e e9 93 fa 19 0a 90 <0f> 0b 90 eb ea 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90
RSP: 0000:ffffffff8e407a78 EFLAGS: 00010046
RAX: 1ffffffff1c359d9 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffea0004000000
RBP: 0000000000040100 R08: ffffffff8221aafb R09: 0000000000000000
R10: ffffed1020000000 R11: ffffed1024206961 R12: 0000000000000001
R13: 0000000000000001 R14: ffff888100000000 R15: dffffc0000000000
FS: 0000000000000000(0000) GS:ffff88818de62000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88823ffff000 CR3: 000000000e54c000 CR4: 00000000000000b0
Call Trace:
<TASK>
kernel_init_pages+0x6d/0xe0
post_alloc_hook+0xae/0x1e0
get_page_from_freelist+0x24ba/0x2540
__alloc_frozen_pages_noprof+0x18d/0x380
alloc_pages_mpol+0x235/0x490
alloc_pages_noprof+0xac/0x2a0
__pud_alloc+0x3a/0x460
preallocate_vmalloc_pages+0x386/0x3d0
mm_core_init+0x79/0xb0
start_kernel+0x15a/0x3d0
x86_64_start_reservations+0x24/0x30
x86_64_start_kernel+0x143/0x1c0
common_startup_64+0x13e/0x147
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
end of thread, other threads:[~2026-04-08 11:32 UTC | newest]
Thread overview: 6+ messages
2026-04-08 9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
2026-04-08 9:47 ` Vlastimil Babka (SUSE)
2026-04-08 10:44 ` Salunke, Hrushikesh
2026-04-08 10:53 ` David Hildenbrand (Arm)
2026-04-08 11:16 ` Raghavendra K T
2026-04-08 11:32 ` [syzbot ci] " syzbot ci