* [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
@ 2026-04-22 10:26 Hrushikesh Salunke
2026-04-22 18:25 ` David Hildenbrand (Arm)
2026-04-23 11:12 ` Andrew Morton
0 siblings, 2 replies; 10+ messages in thread
From: Hrushikesh Salunke @ 2026-04-22 10:26 UTC (permalink / raw)
To: akpm, david, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko,
jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg, hsalunke
When init_on_alloc is enabled, kernel_init_pages() clears every page
one at a time via clear_highpage_kasan_tagged(), which incurs per-page
kmap_local_page()/kunmap_local() overhead and prevents the architecture
clearing primitive from operating on contiguous ranges.
Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
clearing helper that calls clear_pages() for the full contiguous range
on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
a single invocation of the arch clearing primitive across the entire
allocation. The HIGHMEM path falls back to per-page clearing since
those pages require kmap.
Replace kernel_init_pages() with direct calls to the new helper, as it
becomes a trivial wrapper.
Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
Before: 0.445s
After: 0.166s (-62.7%, 2.68x faster)
Kernel time (sys) reduction per workload with init_on_alloc=1:
Workload Before After Change
Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
---
base commit: 2bcc13c29c711381d815c1ba5d5b25737400c71a
v2: https://lore.kernel.org/all/20260421042451.76918-1-hsalunke@amd.com/
v1: https://lore.kernel.org/all/20260408092441.435133-1-hsalunke@amd.com/
Changes since v2:
- Moved kasan_disable_current()/kasan_enable_current() into
clear_highpages_kasan_tagged(), per David and Zi Yan's suggestion.
- Removed kernel_init_pages() and replaced its two call sites with
direct calls to the helper.
Changes since v1:
- Dropped cond_resched() and PROCESS_PAGES_NON_PREEMPT_BATCH as
kernel_init_pages() runs inside the page allocator and can be
called from atomic context, making cond_resched() unsafe. The
original code never had a cond_resched() here, and the
performance gain comes from batching, not rescheduling.
- Moved the !HIGHMEM/HIGHMEM branching into a new
clear_highpages_kasan_tagged() helper in highmem.h, per David's
suggestion.
include/linux/highmem.h | 15 +++++++++++++++
mm/page_alloc.c | 15 ++-------------
2 files changed, 17 insertions(+), 13 deletions(-)
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index af03db851a1d..1178b786b5b0 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -345,6 +345,21 @@ static inline void clear_highpage_kasan_tagged(struct page *page)
kunmap_local(kaddr);
}
+static inline void clear_highpages_kasan_tagged(struct page *page, int numpages)
+{
+ /* s390's use of memset() could override KASAN redzones. */
+ kasan_disable_current();
+ if (!IS_ENABLED(CONFIG_HIGHMEM)) {
+ clear_pages(kasan_reset_tag(page_address(page)), numpages);
+ } else {
+ int i;
+
+ for (i = 0; i < numpages; i++)
+ clear_highpage_kasan_tagged(page + i);
+ }
+ kasan_enable_current();
+}
+
#ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGES
/* Return false to let people know we did not initialize the pages */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 65e205111553..2908d24dd3e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1208,17 +1208,6 @@ static inline bool should_skip_kasan_poison(struct page *page)
return page_kasan_tag(page) == KASAN_TAG_KERNEL;
}
-static void kernel_init_pages(struct page *page, int numpages)
-{
- int i;
-
- /* s390's use of memset() could override KASAN redzones. */
- kasan_disable_current();
- for (i = 0; i < numpages; i++)
- clear_highpage_kasan_tagged(page + i);
- kasan_enable_current();
-}
-
#ifdef CONFIG_MEM_ALLOC_PROFILING
/* Should be called only if mem_alloc_profiling_enabled() */
@@ -1428,7 +1417,7 @@ __always_inline bool __free_pages_prepare(struct page *page,
init = false;
}
if (init)
- kernel_init_pages(page, 1 << order);
+ clear_highpages_kasan_tagged(page, 1 << order);
/*
* arch_free_page() can make the page's contents inaccessible. s390
@@ -1853,7 +1842,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
}
/* If memory is still not initialized, initialize it now. */
if (init)
- kernel_init_pages(page, 1 << order);
+ clear_highpages_kasan_tagged(page, 1 << order);
set_page_owner(page, order, gfp_flags);
page_table_check_alloc(page, order);
--
2.43.0
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-22 10:26 [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing Hrushikesh Salunke
@ 2026-04-22 18:25 ` David Hildenbrand (Arm)
2026-04-23 5:09 ` Salunke, Hrushikesh
2026-04-23 11:12 ` Andrew Morton
1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-22 18:25 UTC (permalink / raw)
To: Hrushikesh Salunke, akpm, ljs, Liam.Howlett, vbabka, rppt, surenb,
mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 4/22/26 12:26, Hrushikesh Salunke wrote:
> When init_on_alloc is enabled, kernel_init_pages() clears every page
> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
> kmap_local_page()/kunmap_local() overhead and prevents the architecture
> clearing primitive from operating on contiguous ranges.
>
> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
> clearing helper that calls clear_pages() for the full contiguous range
> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
> a single invocation of the arch clearing primitive across the entire
> allocation. The HIGHMEM path falls back to per-page clearing since
> those pages require kmap.
>
> Replace kernel_init_pages() with direct calls to the new helper, as it
> becomes a trivial wrapper.
>
> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>
> Before: 0.445s
> After: 0.166s (-62.7%, 2.68x faster)
>
> Kernel time (sys) reduction per workload with init_on_alloc=1:
>
> Workload Before After Change
> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
We do have some elaborate handling in clear_contig_highpages() to chunk it up
(and to call cond_resched()). But that function can get called with much bigger
ranges.
I'm not concerned about the cond_resched() -- we wouldn't do one here before --
but I'm wondering whether we could end up triggering a HW instruction that is
uninterruptible and takes a rather long time.
But clear_contig_highpages() breaks it into 32MiB chunks, and only x86 supports
it so far. So we won't exceed that with the maximum buddy order of 4MiB on x86.
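(For reference, the pattern being described is roughly the following;
a simplified sketch rather than the upstream code -- the helper name and
details are illustrative, only the 32MiB cap comes from the real function.)

	/* Sketch: clear a large contiguous range in bounded chunks. */
	static void clear_pages_chunked(struct page *page, unsigned long npages)
	{
		const unsigned long chunk = SZ_32M >> PAGE_SHIFT;
		unsigned long done, n;

		for (done = 0; done < npages; done += n) {
			n = min(chunk, npages - done);
			clear_pages(page_address(page + done), n);
			cond_resched();		/* process context only */
		}
	}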
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-22 18:25 ` David Hildenbrand (Arm)
@ 2026-04-23 5:09 ` Salunke, Hrushikesh
2026-04-23 10:13 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Salunke, Hrushikesh @ 2026-04-23 5:09 UTC (permalink / raw)
To: David Hildenbrand (Arm), akpm, ljs, Liam.Howlett, vbabka, rppt,
surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 22-04-2026 23:55, David Hildenbrand (Arm) wrote:
> On 4/22/26 12:26, Hrushikesh Salunke wrote:
>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>> clearing primitive from operating on contiguous ranges.
>>
>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>> clearing helper that calls clear_pages() for the full contiguous range
>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>> a single invocation of the arch clearing primitive across the entire
>> allocation. The HIGHMEM path falls back to per-page clearing since
>> those pages require kmap.
>>
>> Replace kernel_init_pages() with direct calls to the new helper, as it
>> becomes a trivial wrapper.
>>
>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>
>> Before: 0.445s
>> After: 0.166s (-62.7%, 2.68x faster)
>>
>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>
>> Workload Before After Change
>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
> We do have some elaborate handling in clear_contig_highpages() to chunk it up
> (and to call cond_resched()). But that function can get called with much bigger
> ranges.
>
> I'm not concerned about the cond_resched() -- we wouldn't do one here before --
> but I'm wondering whether we could end up triggering a HW instruction that is
> uninterruptible and takes a rather long time.
>
> But clear_contig_highpages() breaks it into 32MiB chunks, and only x86 supports
> it so far. So we won't exceed that with the maximum buddy order of 4MiB on x86.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>
> --
> Cheers,
>
> David
Right, on x86 the max buddy order keeps it well within safe limits.
Also, rep stosb/stosq on x86, currently used for clearing, is
interruptible: the CPU can take interrupts between iterations and
resume where it left off. So even for larger ranges it wouldn't be a
single uninterruptible operation. Other architectures still use a
per-page loop for clearing, so the same concern does not arise there.
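To illustrate (a sketch of the primitive, not the actual arch/x86
helper): rep stosb keeps its progress in architectural registers, RDI
for the destination and RCX for the remaining count, so an interrupt
can be delivered mid-operation and the instruction simply resumes from
the saved register state.

	/* Illustrative sketch only -- the real helper lives in arch/x86. */
	static inline void rep_stosb_zero(void *dst, unsigned long len)
	{
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)	/* updated as it runs */
			     : "a" (0)			/* AL: byte to store */
			     : "memory");
	}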
Thanks for the Ack!
Regards,
Hrushikesh
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-23 5:09 ` Salunke, Hrushikesh
@ 2026-04-23 10:13 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-23 10:13 UTC (permalink / raw)
To: Salunke, Hrushikesh, akpm, ljs, Liam.Howlett, vbabka, rppt,
surenb, mhocko, jackmanb, hannes, ziy
Cc: linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 4/23/26 07:09, Salunke, Hrushikesh wrote:
>
> On 22-04-2026 23:55, David Hildenbrand (Arm) wrote:
>> On 4/22/26 12:26, Hrushikesh Salunke wrote:
>>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>>> clearing primitive from operating on contiguous ranges.
>>>
>>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>>> clearing helper that calls clear_pages() for the full contiguous range
>>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>>> a single invocation of the arch clearing primitive across the entire
>>> allocation. The HIGHMEM path falls back to per-page clearing since
>>> those pages require kmap.
>>>
>>> Replace kernel_init_pages() with direct calls to the new helper, as it
>>> becomes a trivial wrapper.
>>>
>>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>>
>>> Before: 0.445s
>>> After: 0.166s (-62.7%, 2.68x faster)
>>>
>>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>>
>>> Workload Before After Change
>>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>> We do have some elaborate handling in clear_contig_highpages() to chunk it up
>> (and to call cond_resched()). But that function can get called with much bigger
>> ranges.
>>
>> I'm not concerned about the cond_resched() -- we wouldn't do one here before --
>> but I'm wondering whether we could end up triggering a HW instruction that is
>> uninterruptible and takes a rather long time.
>>
>> But clear_contig_highpages() breaks it into 32MiB chunks, and only x86 supports
>> it so far. So we won't exceed that with the maximum buddy order of 4MiB on x86.
>>
>> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
>>
>> --
>> Cheers,
>>
>> David
>
> Right, on x86 the max buddy order keeps it well within safe limits.
>
> Also, rep stosb/stosq on x86, currently used for clearing, is
> interruptible: the CPU can take interrupts between iterations and
> resume where it left off.
Ah yes, indeed!
--
Cheers,
David
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-22 10:26 [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing Hrushikesh Salunke
2026-04-22 18:25 ` David Hildenbrand (Arm)
@ 2026-04-23 11:12 ` Andrew Morton
2026-04-24 8:42 ` Salunke, Hrushikesh
1 sibling, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-04-23 11:12 UTC (permalink / raw)
To: Hrushikesh Salunke
Cc: david, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb,
hannes, ziy, linux-mm, linux-kernel, rkodsara, bharata,
ankur.a.arora, shivankg
On Wed, 22 Apr 2026 10:26:58 +0000 Hrushikesh Salunke <hsalunke@amd.com> wrote:
> When init_on_alloc is enabled, kernel_init_pages() clears every page
> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
> kmap_local_page()/kunmap_local() overhead and prevents the architecture
> clearing primitive from operating on contiguous ranges.
>
> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
> clearing helper that calls clear_pages() for the full contiguous range
> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
> a single invocation of the arch clearing primitive across the entire
> allocation. The HIGHMEM path falls back to per-page clearing since
> those pages require kmap.
>
> Replace kernel_init_pages() with direct calls to the new helper, as it
> becomes a trivial wrapper.
>
> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>
> Before: 0.445s
> After: 0.166s (-62.7%, 2.68x faster)
Nice.
> Kernel time (sys) reduction per workload with init_on_alloc=1:
>
> Workload Before After Change
> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>
> ...
>
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -345,6 +345,21 @@ static inline void clear_highpage_kasan_tagged(struct page *page)
> kunmap_local(kaddr);
> }
>
> +static inline void clear_highpages_kasan_tagged(struct page *page, int numpages)
> +{
> + /* s390's use of memset() could override KASAN redzones. */
> + kasan_disable_current();
> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
> + clear_pages(kasan_reset_tag(page_address(page)), numpages);
> + } else {
> + int i;
> +
> + for (i = 0; i < numpages; i++)
> + clear_highpage_kasan_tagged(page + i);
> + }
> + kasan_enable_current();
> +}
Why was it globally published and inlined? Is there any expectation
that this will be used outside of page_alloc.c?
Both of the callsites are themselves inlined. The patch adds 330 bytes
to my arm allmodconfig page_alloc.o - did we gain anything from that?
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-23 11:12 ` Andrew Morton
@ 2026-04-24 8:42 ` Salunke, Hrushikesh
2026-04-24 8:52 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Salunke, Hrushikesh @ 2026-04-24 8:42 UTC (permalink / raw)
To: Andrew Morton
Cc: david, ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb,
hannes, ziy, linux-mm, linux-kernel, rkodsara, bharata,
ankur.a.arora, shivankg, hsalunke
On 23-04-2026 16:42, Andrew Morton wrote:
> On Wed, 22 Apr 2026 10:26:58 +0000 Hrushikesh Salunke <hsalunke@amd.com> wrote:
>
>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>> clearing primitive from operating on contiguous ranges.
>>
>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>> clearing helper that calls clear_pages() for the full contiguous range
>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>> a single invocation of the arch clearing primitive across the entire
>> allocation. The HIGHMEM path falls back to per-page clearing since
>> those pages require kmap.
>>
>> Replace kernel_init_pages() with direct calls to the new helper, as it
>> becomes a trivial wrapper.
>>
>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>
>> Before: 0.445s
>> After: 0.166s (-62.7%, 2.68x faster)
> Nice.
>
>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>
>> Workload Before After Change
>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>
>> ...
>>
>> --- a/include/linux/highmem.h
>> +++ b/include/linux/highmem.h
>> @@ -345,6 +345,21 @@ static inline void clear_highpage_kasan_tagged(struct page *page)
>> kunmap_local(kaddr);
>> }
>>
>> +static inline void clear_highpages_kasan_tagged(struct page *page, int numpages)
>> +{
>> + /* s390's use of memset() could override KASAN redzones. */
>> + kasan_disable_current();
>> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>> + clear_pages(kasan_reset_tag(page_address(page)), numpages);
>> + } else {
>> + int i;
>> +
>> + for (i = 0; i < numpages; i++)
>> + clear_highpage_kasan_tagged(page + i);
>> + }
>> + kasan_enable_current();
>> +}
> Why was it globally published and inlined? Is there any expectation
> that this will be used outside of page_alloc.c?
>
> Both of the callsites are themselves inlined. The patch adds 330 bytes
> to my arm allmodconfig page_alloc.o - did we gain anything from that?
>
Hi Andrew,
The idea was to keep it alongside clear_highpage_kasan_tagged() as its
batch counterpart, but currently it is only used by page_alloc.c.
Your concern about the code size increase is valid. Would you prefer if
I move it to page_alloc.c as a static function and drop the inline
in v4? If an external user comes along later it can always be moved
back to the header.
regards,
Hrushikesh
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-24 8:42 ` Salunke, Hrushikesh
@ 2026-04-24 8:52 ` David Hildenbrand (Arm)
2026-04-28 3:55 ` Salunke, Hrushikesh
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-24 8:52 UTC (permalink / raw)
To: Salunke, Hrushikesh, Andrew Morton
Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb, hannes,
ziy, linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 4/24/26 10:42, Salunke, Hrushikesh wrote:
>
> On 23-04-2026 16:42, Andrew Morton wrote:
>> On Wed, 22 Apr 2026 10:26:58 +0000 Hrushikesh Salunke <hsalunke@amd.com> wrote:
>>
>>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>>> clearing primitive from operating on contiguous ranges.
>>>
>>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>>> clearing helper that calls clear_pages() for the full contiguous range
>>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>>> a single invocation of the arch clearing primitive across the entire
>>> allocation. The HIGHMEM path falls back to per-page clearing since
>>> those pages require kmap.
>>>
>>> Replace kernel_init_pages() with direct calls to the new helper, as it
>>> becomes a trivial wrapper.
>>>
>>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>>
>>> Before: 0.445s
>>> After: 0.166s (-62.7%, 2.68x faster)
>> Nice.
>>
>>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>>
>>> Workload Before After Change
>>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>>
>>> ...
>>>
>>> --- a/include/linux/highmem.h
>>> +++ b/include/linux/highmem.h
>>> @@ -345,6 +345,21 @@ static inline void clear_highpage_kasan_tagged(struct page *page)
>>> kunmap_local(kaddr);
>>> }
>>>
>>> +static inline void clear_highpages_kasan_tagged(struct page *page, int numpages)
>>> +{
>>> + /* s390's use of memset() could override KASAN redzones. */
>>> + kasan_disable_current();
>>> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>> + clear_pages(kasan_reset_tag(page_address(page)), numpages);
>>> + } else {
>>> + int i;
>>> +
>>> + for (i = 0; i < numpages; i++)
>>> + clear_highpage_kasan_tagged(page + i);
>>> + }
>>> + kasan_enable_current();
>>> +}
>> Why was it globally published and inlined? Is there any expectation
>> that this will be used outside of page_alloc.c?
>>
>> Both of the callsites are themselves inlined. The patch adds 330 bytes
>> to my arm allmodconfig page_alloc.o - did we gain anything from that?
>>
> Hi Andrew,
>
> The idea was to keep it alongside clear_highpage_kasan_tagged() as its
> batch counterpart, but currently it is only used by page_alloc.c.
Right.
Looking at init_vmalloc_pages(), I wonder if it could also benefit from batching
if we find that pages are actually contiguous.
That would require looking up multiple pages at once. vmalloc_to_pages() or sth
like that. Surely, doing such an optimized page table walk could be beneficial
by itself.
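A purely hypothetical sketch of the clearing side (no such batch lookup
exists today; this assumes contiguous struct pages correspond to a
contiguous direct-map range, as with SPARSEMEM_VMEMMAP):

	/* Hypothetical: clear a page array in maximal contiguous runs. */
	static void clear_page_array(struct page **pages, unsigned int nr)
	{
		unsigned int i = 0, run;

		while (i < nr) {
			for (run = 1; i + run < nr; run++)
				if (pages[i + run] != pages[i + run - 1] + 1)
					break;
			clear_pages(page_address(pages[i]), run);
			i += run;
		}
	}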
>
> Your concern about the code size increase is valid. Would you prefer if
> I move it to page_alloc.c as a static function and drop the inline
> in v4? If an external user comes along later it can always be moved
> back to the header.
What exactly is responsible for the code increase? The two calls in
clear_highpages_kasan_tagged()?
Surely the compiler would just inline kernel_init_pages() already?
So my best guess is that the 330 bytes are just clear_pages() overhead or some code
layout changes?
--
Cheers,
David
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-24 8:52 ` David Hildenbrand (Arm)
@ 2026-04-28 3:55 ` Salunke, Hrushikesh
2026-04-28 7:06 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 10+ messages in thread
From: Salunke, Hrushikesh @ 2026-04-28 3:55 UTC (permalink / raw)
To: David Hildenbrand (Arm), Andrew Morton
Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb, hannes,
ziy, linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg, hsalunke
On 24-04-2026 14:22, David Hildenbrand (Arm) wrote:
> On 4/24/26 10:42, Salunke, Hrushikesh wrote:
>> On 23-04-2026 16:42, Andrew Morton wrote:
>>> On Wed, 22 Apr 2026 10:26:58 +0000 Hrushikesh Salunke <hsalunke@amd.com> wrote:
>>>
>>>> When init_on_alloc is enabled, kernel_init_pages() clears every page
>>>> one at a time via clear_highpage_kasan_tagged(), which incurs per-page
>>>> kmap_local_page()/kunmap_local() overhead and prevents the architecture
>>>> clearing primitive from operating on contiguous ranges.
>>>>
>>>> Introduce clear_highpages_kasan_tagged() in highmem.h, a batch
>>>> clearing helper that calls clear_pages() for the full contiguous range
>>>> on !HIGHMEM systems, bypassing the per-page kmap overhead and allowing
>>>> a single invocation of the arch clearing primitive across the entire
>>>> allocation. The HIGHMEM path falls back to per-page clearing since
>>>> those pages require kmap.
>>>>
>>>> Replace kernel_init_pages() with direct calls to the new helper, as it
>>>> becomes a trivial wrapper.
>>>>
>>>> Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
>>>>
>>>> Before: 0.445s
>>>> After: 0.166s (-62.7%, 2.68x faster)
>>> Nice.
>>>
>>>> Kernel time (sys) reduction per workload with init_on_alloc=1:
>>>>
>>>> Workload Before After Change
>>>> Graph500 64C128T 30m 41.8s 15m 14.8s -50.3%
>>>> Graph500 16C32T 15m 56.7s 9m 43.7s -39.0%
>>>> Pagerank 32T 1m 58.5s 1m 12.8s -38.5%
>>>> Pagerank 128T 2m 36.3s 1m 40.4s -35.7%
>>>>
>>>> ...
>>>>
>>>> --- a/include/linux/highmem.h
>>>> +++ b/include/linux/highmem.h
>>>> @@ -345,6 +345,21 @@ static inline void clear_highpage_kasan_tagged(struct page *page)
>>>> kunmap_local(kaddr);
>>>> }
>>>>
>>>> +static inline void clear_highpages_kasan_tagged(struct page *page, int numpages)
>>>> +{
>>>> + /* s390's use of memset() could override KASAN redzones. */
>>>> + kasan_disable_current();
>>>> + if (!IS_ENABLED(CONFIG_HIGHMEM)) {
>>>> + clear_pages(kasan_reset_tag(page_address(page)), numpages);
>>>> + } else {
>>>> + int i;
>>>> +
>>>> + for (i = 0; i < numpages; i++)
>>>> + clear_highpage_kasan_tagged(page + i);
>>>> + }
>>>> + kasan_enable_current();
>>>> +}
>>> Why was it globally published and inlined? Is there any expectation
>>> that this will be used outside of page_alloc.c?
>>>
>>> Both of the callsites are themselves inlined. The patch adds 330 bytes
>>> to my arm allmodconfig page_alloc.o - did we gain anything from that?
>>>
>> Hi Andrew,
>>
>> The idea was to keep it alongside clear_highpage_kasan_tagged() as its
>> batch counterpart, but currently it is only used by page_alloc.c.
> Right.
>
> Looking at init_vmalloc_pages(), I wonder if it could also benefit from batching
> if we find that pages are actually contiguous.
>
> That would require looking up multiple pages at once. vmalloc_to_pages() or sth
> like that. Surely, doing such an optimized page table walk could be beneficial
> by itself.
Interesting idea. For the general case where we only have struct page
pointers, we'd need physical contiguity detection and a batched page
table walk as you described. But looking at init_vmalloc_pages()
specifically, it already has the vmalloc virtual address which is
contiguous, so can we just do the following and potentially skip the
vmalloc_to_page() walk entirely:
clear_pages(kasan_reset_tag((void *)start), size >> PAGE_SHIFT);
What do you think? Would this simpler approach work, or am I missing something?
>
>> Your concern about the code size increase is valid. Would you prefer if
>> I move it to page_alloc.c as a static function and drop the inline
>> in v4? If an external user comes along later it can always be moved
>> back to the header.
> What exactly is responsible for the code increase? The two calls in
> clear_highpages_kasan_tagged()?
>
> Surely the compiler would just inline kernel_init_pages() already?
>
> So my best guess is that the 330 bytes are just clear_pages() overhead or some code
> layout changes?
You're right, it's essentially the clear_pages() overhead being
duplicated at each call site. The compiler was actually not inlining
kernel_init_pages(); it was a standalone function. But it was inlining
post_alloc_hook() and free_pages_prepare() into their callers, so with
the patch each of those inlined copies now carries the full
clear_highpages_kasan_tagged() code instead of a small call instruction.
I ran bloat-o-meter on arm allmodconfig and confirmed this. I also
tested moving clear_highpages_kasan_tagged() into page_alloc.c as
a static (non-inline) function, and the bloat disappears entirely.
Since there are currently no other users of this function, I will move
it into page_alloc.c in v4.
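Concretely, the v4 move would look something like this (a sketch, with
the same body the v3 patch adds to highmem.h, just made static in
page_alloc.c so the clear_pages() expansion is emitted only once):

	/* mm/page_alloc.c */
	static void clear_highpages_kasan_tagged(struct page *page, int numpages)
	{
		/* s390's use of memset() could override KASAN redzones. */
		kasan_disable_current();
		if (!IS_ENABLED(CONFIG_HIGHMEM)) {
			clear_pages(kasan_reset_tag(page_address(page)), numpages);
		} else {
			int i;

			for (i = 0; i < numpages; i++)
				clear_highpage_kasan_tagged(page + i);
		}
		kasan_enable_current();
	}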
Regards,
Hrushikesh.
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-28 3:55 ` Salunke, Hrushikesh
@ 2026-04-28 7:06 ` David Hildenbrand (Arm)
2026-04-28 8:31 ` Ankur Arora
0 siblings, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-04-28 7:06 UTC (permalink / raw)
To: Salunke, Hrushikesh, Andrew Morton
Cc: ljs, Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb, hannes,
ziy, linux-mm, linux-kernel, rkodsara, bharata, ankur.a.arora,
shivankg
On 4/28/26 05:55, Salunke, Hrushikesh wrote:
>
> On 24-04-2026 14:22, David Hildenbrand (Arm) wrote:
>> On 4/24/26 10:42, Salunke, Hrushikesh wrote:
>>> Hi Andrew,
>>>
>>> The idea was to keep it alongside clear_highpage_kasan_tagged() as its
>>> batch counterpart, but currently it is only used by page_alloc.c.
>> Right.
>>
>> Looking at init_vmalloc_pages(), I wonder if it could also benefit from batching
>> if we find that pages are actually contiguous.
>>
>> That would require looking up multiple pages at once. vmalloc_to_pages() or sth
>> like that. Surely, doing such an optimized page table walk could be beneficial
>> by itself.
>
> Interesting idea. For the general case where we only have struct page
> pointers, we'd need physical contiguity detection and a batched page
> table walk as you described. But looking at init_vmalloc_pages()
> specifically, it already has the vmalloc virtual address which is
> contiguous, so can we just do the following and potentially skip the
> vmalloc_to_page() walk entirely:
>
> clear_pages(kasan_reset_tag((void *)start), size >> PAGE_SHIFT);
>
> What do you think? Would this simpler approach work, or am I missing something?
Good question. :)
That way you'd be operating on the vmalloc address range, not on the direct map.
Is the vmalloc address range guaranteed to be writable at that point?
--
Cheers,
David
* Re: [PATCH v3] mm/page_alloc: replace kernel_init_pages() with batch page clearing
2026-04-28 7:06 ` David Hildenbrand (Arm)
@ 2026-04-28 8:31 ` Ankur Arora
0 siblings, 0 replies; 10+ messages in thread
From: Ankur Arora @ 2026-04-28 8:31 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Salunke, Hrushikesh, Andrew Morton, ljs, Liam.Howlett, vbabka,
rppt, surenb, mhocko, jackmanb, hannes, ziy, linux-mm,
linux-kernel, rkodsara, bharata, ankur.a.arora, shivankg
David Hildenbrand (Arm) <david@kernel.org> writes:
> On 4/28/26 05:55, Salunke, Hrushikesh wrote:
>>
>> On 24-04-2026 14:22, David Hildenbrand (Arm) wrote:
>>> On 4/24/26 10:42, Salunke, Hrushikesh wrote:
>>>> Hi Andrew,
>>>>
>>>> The idea was to keep it alongside clear_highpage_kasan_tagged() as its
>>>> batch counterpart, but currently it is only used by page_alloc.c.
>>> Right.
>>>
>>> Looking at init_vmalloc_pages(), I wonder if it could also benefit from batching
>>> if we find that pages are actually contiguous.
>>>
>>> That would require looking up multiple pages at once. vmalloc_to_pages() or sth
>>> like that. Surely, doing such an optimized page table walk could be beneficial
>>> by itself.
>>
>> Interesting idea. For the general case where we only have struct page
>> pointers, we'd need physical contiguity detection and a batched page
>> table walk as you described. But looking at init_vmalloc_pages()
>> specifically, it already has the vmalloc virtual address which is
>> contiguous, so can we just do the following and potentially skip the
>> vmalloc_to_page() walk entirely:
>>
>> clear_pages(kasan_reset_tag((void *)start), size >> PAGE_SHIFT);
>>
>> What do you think? Would this simpler approach work, or am I missing something?
>
> Good question. :)
>
> That way you'd be operating on the vmalloc address range, not on the direct map.
From my testing (including when using userspace VAs), most of the
speedup was from CPU prefetch, which we should also get when working
with the vmalloc address range, assuming it is writable and stable.
> Is the vmalloc address range guaranteed to be writable at that point?
What happens if we get preempted and migrated while clearing? Seems like
the vmalloc lazy syncing should be able to handle that?
--
ankur