* [PATCH] arm64: Implement clear_pages()
@ 2026-03-03 10:06 Linus Walleij
From: Linus Walleij @ 2026-03-03 10:06 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Ryan Roberts,
Ankur Arora, David Hildenbrand
Cc: linux-arm-kernel, kvmarm, James Clark, Linus Walleij
A recent patch introduced clear_pages() and made it possible to
provide assembly optimizations for it, as is already done for
clear_page().
This augments the existing clear_page() optimization on arm64
to accept any number of pages, in the following way:
- Make clear_page() a static inline special case of clear_pages().
- Implement clear_pages() as a static inline that just calculates
the total number of bytes in the page set and passes this number
to the assembly routine clear_pages_asm().
- The old clear_page assembly is rewritten as clear_pages_asm,
which takes a start address (at a page boundary) and a number
of bytes to clear from that address.
This is similar to the optimization provided for x86.
Performance improvements:
The baseline is the current v7.0-rc1 which calls the existing
clear_page() assembly optimization in a loop, see <linux/mm.h>.
Any improvements are about avoiding the outer loop, in most cases
the clearing will be linear and the savings will be small and
only noticeable on really big clearing operations.
We boot the kernel with a cmdline like this:
"default_hugepagesz=1G hugepagesz=1G hugepages=32" to make sure
we have ample hugepages. This was then tested with the same
command as the original series:
perf bench mem mmap -p 1GB -f demand -s 32GB -l 5
The first run was discarded as the memory hierarchy is cold on
the first run. I then ran the above command 5 times and averaged
the throughput, which shows a small but consistent improvement:
On QEMU:
Before this patch: After this patch:
2.38 GB/s 2.41 GB/s
On real hardware (a Radxa Orion O6) we see this on *some* cores
and no change on others:
Before this patch: After this patch:
43.3 GB/s 45.3 GB/s
There is a small but consistent improvement in throughput, as
expected.
Tested-by: James Clark <james.clark2@arm.com>
Signed-off-by: Linus Walleij <linusw@kernel.org>
---
arch/arm64/include/asm/page.h | 13 ++++++++++++-
arch/arm64/kernel/image-vars.h | 2 +-
arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
arch/arm64/lib/Makefile | 2 +-
arch/arm64/lib/{clear_page.S => clear_pages.S} | 18 +++++++++---------
5 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index b39cc1127e1f..916a3e7c9a19 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -20,7 +20,18 @@ struct page;
struct vm_area_struct;
extern void copy_page(void *to, const void *from);
-extern void clear_page(void *to);
+extern void clear_pages_asm(void *addr, unsigned int nbytes);
+
+static inline void clear_pages(void *addr, unsigned int npages)
+{
+ clear_pages_asm(addr, npages * PAGE_SIZE);
+}
+#define clear_pages clear_pages
+
+static inline void clear_page(void *addr)
+{
+ clear_pages(addr, 1);
+}
void copy_user_highpage(struct page *to, struct page *from,
unsigned long vaddr, struct vm_area_struct *vma);
diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
index d7b0d12b1015..61232f9e1e68 100644
--- a/arch/arm64/kernel/image-vars.h
+++ b/arch/arm64/kernel/image-vars.h
@@ -117,7 +117,7 @@ KVM_NVHE_ALIAS(__start___kvm_ex_table);
KVM_NVHE_ALIAS(__stop___kvm_ex_table);
/* Position-independent library routines */
-KVM_NVHE_ALIAS_HYP(clear_page, __pi_clear_page);
+KVM_NVHE_ALIAS_HYP(clear_pages, __pi_clear_pages);
KVM_NVHE_ALIAS_HYP(copy_page, __pi_copy_page);
KVM_NVHE_ALIAS_HYP(memcpy, __pi_memcpy);
KVM_NVHE_ALIAS_HYP(memset, __pi_memset);
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index a244ec25f8c5..f857dac82a88 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -17,7 +17,7 @@ ccflags-y += -fno-stack-protector \
hostprogs := gen-hyprel
HOST_EXTRACFLAGS += -I$(objtree)/include
-lib-objs := clear_page.o copy_page.o memcpy.o memset.o
+lib-objs := clear_pages.o copy_page.o memcpy.o memset.o
lib-objs := $(addprefix ../../../lib/, $(lib-objs))
CFLAGS_switch.nvhe.o += -Wno-override-init
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 633e5223d944..86995e2e0807 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
lib-y := clear_user.o delay.o copy_from_user.o \
copy_to_user.o copy_page.o \
- clear_page.o csum.o insn.o memchr.o memcpy.o \
+ clear_pages.o csum.o insn.o memchr.o memcpy.o \
memset.o memcmp.o strcmp.o strncmp.o strlen.o \
strnlen.o strchr.o strrchr.o tishift.o
diff --git a/arch/arm64/lib/clear_page.S b/arch/arm64/lib/clear_pages.S
similarity index 70%
rename from arch/arm64/lib/clear_page.S
rename to arch/arm64/lib/clear_pages.S
index bd6f7d5eb6eb..2d3043c13791 100644
--- a/arch/arm64/lib/clear_page.S
+++ b/arch/arm64/lib/clear_pages.S
@@ -12,22 +12,22 @@
* Clear page @dest
*
* Parameters:
- * x0 - dest
+ * x0 - dest - should be start of a page
+ * x1 - number of bytes to clear, should be a multiple of PAGE_SIZE
*/
-SYM_FUNC_START(__pi_clear_page)
+SYM_FUNC_START(__pi_clear_pages)
#ifdef CONFIG_AS_HAS_MOPS
.arch_extension mops
alternative_if_not ARM64_HAS_MOPS
b .Lno_mops
alternative_else_nop_endif
-
- mov x1, #PAGE_SIZE
setpn [x0]!, x1!, xzr
setmn [x0]!, x1!, xzr
seten [x0]!, x1!, xzr
ret
.Lno_mops:
#endif
+ add x4, x0, x1 /* Find the end */
mrs x1, dczid_el0
tbnz x1, #4, 2f /* Branch if DC ZVA is prohibited */
and w1, w1, #0xf
@@ -36,7 +36,7 @@ alternative_else_nop_endif
1: dc zva, x0
add x0, x0, x1
- tst x0, #(PAGE_SIZE - 1)
+ cmp x0, x4
b.ne 1b
ret
@@ -45,9 +45,9 @@ alternative_else_nop_endif
stnp xzr, xzr, [x0, #32]
stnp xzr, xzr, [x0, #48]
add x0, x0, #64
- tst x0, #(PAGE_SIZE - 1)
+ cmp x0, x4
b.ne 2b
ret
-SYM_FUNC_END(__pi_clear_page)
-SYM_FUNC_ALIAS(clear_page, __pi_clear_page)
-EXPORT_SYMBOL(clear_page)
+SYM_FUNC_END(__pi_clear_pages)
+SYM_FUNC_ALIAS(clear_pages_asm, __pi_clear_pages)
+EXPORT_SYMBOL(clear_pages_asm)
---
base-commit: dbe60c40b86ec4a1168552398b3b64c14c38b2d7
change-id: 20260212-aarch64-clear-pages-a439c2c552bb
Best regards,
--
Linus Walleij <linusw@kernel.org>
* Re: [PATCH] arm64: Implement clear_pages()
From: Will Deacon @ 2026-03-03 14:46 UTC (permalink / raw)
To: Linus Walleij
Cc: Catalin Marinas, Marc Zyngier, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Ryan Roberts, Ankur Arora,
David Hildenbrand, linux-arm-kernel, kvmarm, James Clark
On Tue, Mar 03, 2026 at 11:06:13AM +0100, Linus Walleij wrote:
> A recent patch introduced clear_pages() and made it possible to
> provide assembly optimizations for it, as is already done for
> clear_page().
>
> This augments the existing clear_page() optimization on arm64
> to accept any number of pages, in the following way:
>
> - Make clear_page() a static inline special case of clear_pages().
>
> - Implement clear_pages() as a static inline that just calculates
> the total number of bytes in the page set and passes this number
> to the assembly routine clear_pages_asm().
>
> - The old clear_page assembly is rewritten as clear_pages_asm,
> which takes a start address (at a page boundary) and a number
> of bytes to clear from that address.
>
> This is similar to the optimization provided for x86.
>
> Performance improvements:
>
> The baseline is the current v7.0-rc1 which calls the existing
> clear_page() assembly optimization in a loop, see <linux/mm.h>.
> Any improvements are about avoiding the outer loop, in most cases
> the clearing will be linear and the savings will be small and
> only noticeable on really big clearing operations.
>
> We boot the kernel with a cmdline like this:
> "default_hugepagesz=1G hugepagesz=1G hugepages=32" to make sure
> we have ample hugepages. This was then tested with the same
> command as the original series:
>
> perf bench mem mmap -p 1GB -f demand -s 32GB -l 5
>
> The first run was discarded as the memory hierarchy is cold on
> the first run. I then ran the above command 5 times and averaged
> the throughput, which shows a small but consistent improvement:
>
> On QEMU:
>
> Before this patch: After this patch:
> 2.38 GB/s 2.41 GB/s
I really don't think we should pay attention to performance under QEMU
as it doesn't necessarily have any correlation with real hardware.
> On real hardware (a Radxa Orion O6) we see this on *some* cores
> and no change on others:
>
> Before this patch: After this patch:
> 43.3 GB/s 45.3 GB/s
>
> There is a small but consistent improvement in throughput, as
> expected.
>
> Tested-by: James Clark <james.clark2@arm.com>
> Signed-off-by: Linus Walleij <linusw@kernel.org>
> ---
> arch/arm64/include/asm/page.h | 13 ++++++++++++-
> arch/arm64/kernel/image-vars.h | 2 +-
> arch/arm64/kvm/hyp/nvhe/Makefile | 2 +-
> arch/arm64/lib/Makefile | 2 +-
> arch/arm64/lib/{clear_page.S => clear_pages.S} | 18 +++++++++---------
> 5 files changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index b39cc1127e1f..916a3e7c9a19 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -20,7 +20,18 @@ struct page;
> struct vm_area_struct;
>
> extern void copy_page(void *to, const void *from);
> -extern void clear_page(void *to);
> +extern void clear_pages_asm(void *addr, unsigned int nbytes);
> +
> +static inline void clear_pages(void *addr, unsigned int npages)
> +{
> + clear_pages_asm(addr, npages * PAGE_SIZE);
> +}
> +#define clear_pages clear_pages
Hmm. From what I can tell, this just turns a branch in C code into a
branch in assembly, so it's hard to correlate that meaningfully with
the performance improvement you see.
If we have CPUs that are this sensitive to branches, perhaps we'd be
better off taking the opposite approach and moving more code into C
so that the compiler can optimise the control flow for us?
Will
* Re: [PATCH] arm64: Implement clear_pages()
From: Catalin Marinas @ 2026-03-03 15:45 UTC (permalink / raw)
To: Will Deacon
Cc: Linus Walleij, Marc Zyngier, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Ryan Roberts, Ankur Arora,
David Hildenbrand, linux-arm-kernel, kvmarm, James Clark
On Tue, Mar 03, 2026 at 02:46:34PM +0000, Will Deacon wrote:
> On Tue, Mar 03, 2026 at 11:06:13AM +0100, Linus Walleij wrote:
> > On QEMU:
> >
> > Before this patch: After this patch:
> > 2.38 GB/s 2.41 GB/s
>
> I really don't think we should pay attention to performance under QEMU
> as it doesn't necessarily have any correlation with real hardware.
I agree.
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index b39cc1127e1f..916a3e7c9a19 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -20,7 +20,18 @@ struct page;
> > struct vm_area_struct;
> >
> > extern void copy_page(void *to, const void *from);
> > -extern void clear_page(void *to);
> > +extern void clear_pages_asm(void *addr, unsigned int nbytes);
> > +
> > +static inline void clear_pages(void *addr, unsigned int npages)
> > +{
> > + clear_pages_asm(addr, npages * PAGE_SIZE);
> > +}
> > +#define clear_pages clear_pages
>
> Hmm. From what I can tell, this just turns a branch in C code into a
> branch in assembly, so it's hard to correlate that meaningfully with
> the performance improvement you see.
>
> If we have CPUs that are this sensitive to branches, perhaps we'd be
> better off taking the opposite approach and moving more code into C
> so that the compiler can optimise the control flow for us?
I think it's more than the loop branch - it's also the whole DCZID_EL0
read to decide whether to use DC ZVA or STNP. I wonder why we didn't do
that with an alternative rather than always reading the sysreg.
That said, I wouldn't mind rewriting this in C if the numbers don't get
worse. It is a bit more involved if we keep the DC ZVA use, though with
alternatives maybe not that bad (mte_set_mem_tag_range() is an example
of doing something similar in C but for clear page we don't need to deal
with unaligned boundaries).
--
Catalin
* Re: [PATCH] arm64: Implement clear_pages()
From: Linus Walleij @ 2026-03-04 0:39 UTC (permalink / raw)
To: Will Deacon
Cc: Catalin Marinas, Marc Zyngier, Oliver Upton, Joey Gouly,
Suzuki K Poulose, Zenghui Yu, Ryan Roberts, Ankur Arora,
David Hildenbrand, linux-arm-kernel, kvmarm, James Clark
On Tue, Mar 3, 2026 at 3:46 PM Will Deacon <will@kernel.org> wrote:
> > +extern void clear_pages_asm(void *addr, unsigned int nbytes);
> > +
> > +static inline void clear_pages(void *addr, unsigned int npages)
> > +{
> > + clear_pages_asm(addr, npages * PAGE_SIZE);
> > +}
> > +#define clear_pages clear_pages
>
> Hmm. From what I can tell, this just turns a branch in C code into a
> branch in assembly, so it's hard to correlate that meaningfully with
> the performance improvement you see.
I think what I see is the effect of #define clear_pages clear_pages.
Because without that <linux/mm.h> open codes:
#ifndef clear_pages
(...)
static inline void clear_pages(void *addr, unsigned int npages)
{
do {
clear_page(addr);
addr += PAGE_SIZE;
} while (--npages);
}
#endif
So for clearing anything multi-page we get an outer loop
and an inner loop inside clear_page(), but with clear_pages()
implemented there is no outer loop: the total number of bytes is
computed up front (not one page at a time) and then there is a
single loop.
> If we have CPUs that are this sensitive to branches, perhaps we'd be
> better off taking the opposite approach and moving more code into C
> so that the compiler can optimise the control flow for us?
Hm! That would be to create a default clear_page() in
<linux/mm.h>, simply delete the existing lib/clear_page.S,
and let the default kick in.
Right now every arch implements its own custom version,
maybe for no reason in some cases. I could try it!
I doubt the compiler would emit this part though:
#ifdef CONFIG_AS_HAS_MOPS
(...)
alternative_else_nop_endif
setpn [x0]!, x1!, xzr
setmn [x0]!, x1!, xzr
seten [x0]!, x1!, xzr
ret
Three instructions to clear all pages. But maybe that is not good
if this is a gigabyte: the per-page loop provides a useful preemption
point in that case, and then we just shouldn't touch anything.
Yours,
Linus Walleij
* Re: [PATCH] arm64: Implement clear_pages()
From: Ankur Arora @ 2026-03-04 8:05 UTC (permalink / raw)
To: Linus Walleij
Cc: Will Deacon, Catalin Marinas, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Ryan Roberts,
Ankur Arora, David Hildenbrand, linux-arm-kernel, kvmarm,
James Clark
Linus Walleij <linusw@kernel.org> writes:
> On Tue, Mar 3, 2026 at 3:46 PM Will Deacon <will@kernel.org> wrote:
>
>> > +extern void clear_pages_asm(void *addr, unsigned int nbytes);
>> > +
>> > +static inline void clear_pages(void *addr, unsigned int npages)
>> > +{
>> > + clear_pages_asm(addr, npages * PAGE_SIZE);
>> > +}
>> > +#define clear_pages clear_pages
>>
>> Hmm. From what I can tell, this just turns a branch in C code into a
>> branch in assembly, so it's hard to correlate that meaningfully with
>> the performance improvement you see.
>
> I think what I see is the effect of #define clear_pages clear_pages.
>
> Because without that <linux/mm.h> open codes:
>
> #ifndef clear_pages
> (...)
> static inline void clear_pages(void *addr, unsigned int npages)
> {
> do {
> clear_page(addr);
> addr += PAGE_SIZE;
> } while (--npages);
> }
> #endif
>
> So for clearing anything multi-page we get an outer loop
> and an inner loop inside clear_page(), but with clear_pages()
> implemented there is no outer loop: the total number of bytes is
> computed up front (not one page at a time) and then there is a
> single loop.
So, on x86 (specifically on AMD Zen and Intel Icelake systems)
the extra computation, the branches, and (in an early version)
calling cond_resched() after every single page did not seem to
matter.
This is probably uarch dependent, but it seems to me that the cost
of an extra address computation or an easily predicted branch
would just be noise.
>> If we have CPUs that are this sensitive to branches, perhaps we'd be
>> better off taking the opposite approach and moving more code into C
>> so that the compiler can optimise the control flow for us?
>
> Hm! That would be to create a default clear_page() in
> <linux/mm.h>, simply delete the existing lib/clear_page.S,
> and let the default kick in.
>
> Right now every arch implements its own custom version,
> maybe for no reason in some cases. I could try it!
>
> I doubt the compiler would emit this part though:
>
> #ifdef CONFIG_AS_HAS_MOPS
> (...)
> alternative_else_nop_endif
> setpn [x0]!, x1!, xzr
> setmn [x0]!, x1!, xzr
> seten [x0]!, x1!, xzr
> ret
>
> Three instructions to clear all pages. But maybe that is not good
> if this is a gigabyte: the per-page loop provides a useful preemption
> point in that case, and then we just shouldn't touch anything.
The code in folio_zero_user() (clear_contig_highpages()) takes care
of chunking up the clearing based on the preemption model.
The idea being that if you are running with preempt=none or voluntary
then you might want to call cond_resched(), say every 32MB or so.
If you are running with preempt=full or preempt=lazy, then it would
just clear a full GB page.
That would need the set[mpe]n instructions to be interruptible though.
(Seems to me that that is true but maybe someone could confirm.)
Thanks
--
ankur
* Re: [PATCH] arm64: Implement clear_pages()
From: Catalin Marinas @ 2026-03-04 8:49 UTC (permalink / raw)
To: Ankur Arora
Cc: Linus Walleij, Will Deacon, Marc Zyngier, Oliver Upton,
Joey Gouly, Suzuki K Poulose, Zenghui Yu, Ryan Roberts,
David Hildenbrand, linux-arm-kernel, kvmarm, James Clark
On Wed, Mar 04, 2026 at 12:05:04AM -0800, Ankur Arora wrote:
> Linus Walleij <linusw@kernel.org> writes:
> > On Tue, Mar 3, 2026 at 3:46 PM Will Deacon <will@kernel.org> wrote:
> >> If we have CPUs that are this sensitive to branches, perhaps we'd be
> >> better off taking the opposite approach and moving more code into C
> >> so that the compiler can optimise the control flow for us?
> >
> > Hm! That would be to create a default clear_page() in
> > <linux/mm.h>, simply delete the existing lib/clear_page.S,
> > and let the default kick in.
> >
> > Right now every arch implements its own custom version,
> > maybe for no reason in some cases. I could try it!
> >
> > I doubt the compiler would emit this part though:
> >
> > #ifdef CONFIG_AS_HAS_MOPS
> > (...)
> > alternative_else_nop_endif
> > setpn [x0]!, x1!, xzr
> > setmn [x0]!, x1!, xzr
> > seten [x0]!, x1!, xzr
> > ret
It won't generate it; it wouldn't be pure C but would need some inline asm.
Anyway, I'm fine with the .S file, I just wonder whether the DCZID_EL0
read inside the inner loop is causing problems (not the FEAT_MOPS
variant).
> > Three instructions to clear all pages. But maybe that is not good
> > if this is a gigabyte: the per-page loop provides a useful preemption
> > point in that case, and then we just shouldn't touch anything.
>
> The code in folio_zero_user (clear_contig_highpages()) takes care of
> chunking up the clearing based on preemption model.
> The idea being that if you are running with preempt=none or voluntary
> then you might want to call cond_resched(), say every 32MB or so.
>
> If you are running with preempt=full or preempt=lazy, then it would
> just clear a full GB page.
>
> That would need the set[mpe]n instructions to be interruptible though.
> (Seems to me that that is true but maybe someone could confirm.)
Yes, they are interruptible (the SETM). The x0 and x1 registers are
left in a state such that the SETM can be restarted from where it was
interrupted.
--
Catalin