* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()
@ 2026-04-08  9:24 Hrushikesh Salunke
  2026-04-08  9:47 ` Vlastimil Babka (SUSE)
  2026-04-08 11:32 ` [syzbot ci] " syzbot ci
  0 siblings, 2 replies; 6+ messages in thread

From: Hrushikesh Salunke @ 2026-04-08 9:24 UTC (permalink / raw)
To: akpm, vbabka, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg, hsalunke

When init_on_alloc is enabled, kernel_init_pages() clears every page
one at a time, calling clear_page() per page. This is unnecessarily
slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
real workloads.

On 64-bit (!HIGHMEM) systems, switch to clearing pages in batch via
clear_pages(), bypassing the per-page kmap_local_page()/kunmap_local()
overhead and allowing the arch clearing primitive to operate on the full
contiguous range in a single invocation. The batch size is the full
allocation when the preempt model is preemptible (preemption points are
implicit), or PROCESS_PAGES_NON_PREEMPT_BATCH otherwise, with
cond_resched() between batches to limit scheduling latency under
cooperative preemption.

The HIGHMEM path is kept as-is since those pages require kmap.

Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:

  Before: 0.445s
  After:  0.166s (-62.7%, 2.68x faster)

Kernel time (sys) reduction per workload with init_on_alloc=1:

  Workload          Before     After      Change
  Graph500 64C128T  30m 41.8s  15m 14.8s  -50.3%
  Graph500 16C32T   15m 56.7s   9m 43.7s  -39.0%
  Pagerank 32T       1m 58.5s   1m 12.8s  -38.5%
  Pagerank 128T      2m 36.3s   1m 40.4s  -35.7%

Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
---
base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35

 mm/page_alloc.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b1c5430cad4e..178cbebadd50 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1224,8 +1224,23 @@ static void kernel_init_pages(struct page *page, int numpages)
 
 	/* s390's use of memset() could override KASAN redzones. */
 	kasan_disable_current();
-	for (i = 0; i < numpages; i++)
-		clear_highpage_kasan_tagged(page + i);
+
+	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+		void *addr = kasan_reset_tag(page_address(page));
+		unsigned int unit = preempt_model_preemptible() ?
+			numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
+		int count;
+
+		for (i = 0; i < numpages; i += count) {
+			cond_resched();
+			count = min_t(int, unit, numpages - i);
+			clear_pages(addr + (i << PAGE_SHIFT), count);
+		}
+	} else {
+		for (i = 0; i < numpages; i++)
+			clear_highpage_kasan_tagged(page + i);
+	}
+
 	kasan_enable_current();
 }
-- 
2.43.0
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: Vlastimil Babka (SUSE) @ 2026-04-08 9:47 UTC (permalink / raw)
To: Hrushikesh Salunke, akpm, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg, David Hildenbrand

On 4/8/26 11:24, Hrushikesh Salunke wrote:
> When init_on_alloc is enabled, kernel_init_pages() clears every page
> one at a time, calling clear_page() per page. This is unnecessarily
> slow for large contiguous allocations (mTHPs, HugeTLB) that dominate
> real workloads.
[snip]
> Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
> ---
> base commit: 1a2fbbe3653f0ebb24af9b306a8a968287344a35

Any way to reuse the code added by [1], e.g. clear_user_highpages()?

[1]
https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: Salunke, Hrushikesh @ 2026-04-08 10:44 UTC (permalink / raw)
To: Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg, David Hildenbrand, hsalunke

On 08-04-2026 15:17, Vlastimil Babka (SUSE) wrote:
> On 4/8/26 11:24, Hrushikesh Salunke wrote:
>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>> one at a time, calling clear_page() per page.
[snip]
> Any way to reuse the code added by [1], e.g. clear_user_highpages()?
>
> [1]
> https://lore.kernel.org/linux-mm/20250917152418.4077386-1-ankur.a.arora@oracle.com/

Thanks for the review. Sure, I will check if code reuse is possible.
Meanwhile I found another issue with the current patch.

kernel_init_pages() runs inside the allocator (post_alloc_hook and
__free_pages_prepare), so it inherits whatever context the caller is in.
Testing with CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_PROVE_LOCKING=y, I
hit this during exit_group() -> exit_mmap() -> __zap_vma_range, where a
page allocation happens while the PTE lock and RCU read lock are held,
making the cond_resched() in the clearing loop illegal:

[ 1997.353228] BUG: sleeping function called from invalid context at mm/page_alloc.c:1235
[ 1997.353433] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 19725, name: bash
[ 1997.353572] preempt_count: 1, expected: 0
[ 1997.353706] RCU nest depth: 1, expected: 0
[ 1997.353837] 3 locks held by bash/19725:
[ 1997.353839]  #0: ff38cd415971e540 (&mm->mmap_lock){++++}-{4:4}, at: exit_mmap+0x6e/0x430
[ 1997.353850]  #1: ffffffffb03d6f60 (rcu_read_lock){....}-{1:3}, at: __pte_offset_map+0x2c/0x220
[ 1997.353855]  #2: ff38cd410deb4618 (ptlock_ptr(ptdesc)#2){+.+.}-{3:3}, at: pte_offset_map_lock+0x92/0x170
[ 1997.353868] Call Trace:
[ 1997.353870]  <TASK>
[ 1997.353873]  dump_stack_lvl+0x91/0xb0
[ 1997.353877]  __might_resched+0x15f/0x290
[ 1997.353882]  kernel_init_pages+0x4b/0xa0
[ 1997.353886]  get_page_from_freelist+0x406/0x1e60
[ 1997.353895]  __alloc_frozen_pages_noprof+0x1d8/0x1730
[ 1997.353912]  alloc_pages_mpol+0xa4/0x190
[ 1997.353917]  alloc_pages_noprof+0x59/0xd0
[ 1997.353919]  get_free_pages_noprof+0x11/0x40
[ 1997.353921]  __tlb_remove_folio_pages_size.isra.0+0x7f/0xe0
[ 1997.353923]  __zap_vma_range+0x1bbd/0x1f40
[ 1997.353931]  unmap_vmas+0xd9/0x1d0
[ 1997.353934]  exit_mmap+0x10a/0x430
[ 1997.353943]  __mmput+0x3d/0x130
[ 1997.353947]  do_exit+0x2a7/0xae0
[ 1997.353951]  do_group_exit+0x36/0xa0
[ 1997.353953]  __x64_sys_exit_group+0x18/0x20
[ 1997.353959]  do_syscall_64+0xe1/0x710
[ 1997.353990]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1997.354003]  </TASK>

This also means clear_contig_highpages() can't be directly reused here
since it has an unconditional might_sleep() + cond_resched(). I'll look
into this. Any suggestions on the right way to handle cond_resched()
in a context that may or may not be atomic?

Thanks,
Hrushikesh
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: David Hildenbrand (Arm) @ 2026-04-08 10:53 UTC (permalink / raw)
To: Salunke, Hrushikesh, Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg

On 4/8/26 12:44, Salunke, Hrushikesh wrote:
[snip]
> This also means clear_contig_highpages() can't be directly reused here
> since it has an unconditional might_sleep() + cond_resched(). I'll look
> into this. Any suggestions on the right way to handle cond_resched()
> in a context that may or may not be atomic?

clear_contig_highpages() is prepared to handle arbitrary sizes,
including 1 GiB chunks or even larger.

The question is whether you even have to use
PROCESS_PAGES_NON_PREEMPT_BATCH, given that we cannot trigger a manual
resched either way (and the assumption is that the memory we are
clearing is not that big. Well, on arm64 it can still be 512 MiB).

So I wonder what happens when you just use clear_pages().

Likely you should provide a clear_highpages_kasan_tagged() and a
clear_highpages()? You would then call clear_highpages_kasan_tagged()
here, which would just default to calling clear_highpages() unless
KASAN applies, etc.

-- 
Cheers,

David
* Re: [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

From: Raghavendra K T @ 2026-04-08 11:16 UTC (permalink / raw)
To: Salunke, Hrushikesh, Vlastimil Babka (SUSE), akpm, surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, bharata, ankur.a.arora, shivankg, David Hildenbrand

On 4/8/2026 4:14 PM, Salunke, Hrushikesh wrote:
[snip]
>>> +	if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>> +		void *addr = kasan_reset_tag(page_address(page));
>>> +		unsigned int unit = preempt_model_preemptible() ?
>>> +			numpages : PROCESS_PAGES_NON_PREEMPT_BATCH;
>>> +		int count;
>>> +
>>> +		for (i = 0; i < numpages; i += count) {
>>> +			cond_resched();

Just thinking: considering that on a preemptible kernel (or with
preempt_auto) preempt_count() already tracks where preemption is safe,
and on non-preemptible and voluntary kernels it is safe to preempt at
PROCESS_PAGES_NON_PREEMPT_BATCH granularity, do we need the
cond_resched() here? Let me know if I am missing something.

>>> +			count = min_t(int, unit, numpages - i);
>>> +			clear_pages(addr + (i << PAGE_SHIFT), count);
>>> +		}
>>> +	} else {
>>> +		for (i = 0; i < numpages; i++)
>>> +			clear_highpage_kasan_tagged(page + i);
>>> +	}
>>> +
>>> 	kasan_enable_current();
>>> }

Regards
- Raghu
* [syzbot ci] Re: mm/page_alloc: use batch page clearing in kernel_init_pages()

From: syzbot ci @ 2026-04-08 11:32 UTC (permalink / raw)
To: akpm, ankur.a.arora, bharata, hannes, hsalunke, jackmanb, linux-kernel, linux-mm, mhocko, rkodsara, shivankg, surenb, vbabka, ziy
Cc: syzbot, syzkaller-bugs

syzbot ci has tested the following series

[v1] mm/page_alloc: use batch page clearing in kernel_init_pages()
https://lore.kernel.org/all/20260408092441.435133-1-hsalunke@amd.com
* [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages()

and found the following issue:
WARNING in preempt_model_full

Full report is available here:
https://ci.syzbot.org/series/be6c0534-641b-42aa-8b73-ab8f592ec267

***

WARNING in preempt_model_full

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      0d90551ea699ef3d1a85cd7a1a7e21e8d4f04db2
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/22cbd293-31c6-46b9-b8d4-2ff590ce406b/config

[... early boot log trimmed ...]

mem auto-init: stack:all(zero), heap alloc:on, heap free:off
------------[ cut here ]------------
preempt_dynamic_mode == preempt_dynamic_undefined
WARNING: kernel/sched/core.c:7743 at preempt_model_full+0x1e/0x30, CPU#0: swapper/0
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted syzkaller #0 PREEMPT(undef)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:preempt_model_full+0x1e/0x30
[... code bytes and register dump trimmed ...]
Call Trace:
 <TASK>
 kernel_init_pages+0x6d/0xe0
 post_alloc_hook+0xae/0x1e0
 get_page_from_freelist+0x24ba/0x2540
 __alloc_frozen_pages_noprof+0x18d/0x380
 alloc_pages_mpol+0x235/0x490
 alloc_pages_noprof+0xac/0x2a0
 __pud_alloc+0x3a/0x460
 preallocate_vmalloc_pages+0x386/0x3d0
 mm_core_init+0x79/0xb0
 start_kernel+0x15a/0x3d0
 x86_64_start_reservations+0x24/0x30
 x86_64_start_kernel+0x143/0x1c0
 common_startup_64+0x13e/0x147
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.

To test a patch for this bug, please reply with `#syz test` (should be
on a separate line). The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
end of thread, other threads: [~2026-04-08 11:32 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-08  9:24 [PATCH] mm/page_alloc: use batch page clearing in kernel_init_pages() Hrushikesh Salunke
2026-04-08  9:47 ` Vlastimil Babka (SUSE)
2026-04-08 10:44   ` Salunke, Hrushikesh
2026-04-08 10:53     ` David Hildenbrand (Arm)
2026-04-08 11:16     ` Raghavendra K T
2026-04-08 11:32 ` [syzbot ci] " syzbot ci