* [PATCH] arm64/lib: copy_page: s/stnp/stp
@ 2024-06-13 0:18 Jisheng Zhang
2024-06-24 17:56 ` Catalin Marinas
0 siblings, 1 reply; 4+ messages in thread
From: Jisheng Zhang @ 2024-06-13 0:18 UTC (permalink / raw)
To: Catalin Marinas, Will Deacon; +Cc: linux-arm-kernel, linux-kernel
stnp performs non-temporal store, give a hints to the memory system
that caching is not useful for this data. But the scenario where
copy_page() used may not have this implication, although I must admit
there's such case where stnp helps performance(good). In this good
case, we can rely on the HW write streaming mechanism in some
implementations such as cortex-a55 to detect the case and take actions.
testing with https://github.com/apinski-cavium/copy_page_benchmark
this patch can reduce the time by about 3% on cortex-a55 platforms.
Signed-off-by: Jisheng Zhang <jszhang@kernel.org>
---
arch/arm64/lib/copy_page.S | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 6a56d7cf309d..4c74fe2d8bd6 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -32,21 +32,21 @@ SYM_FUNC_START(__pi_copy_page)
1:
tst x0, #(PAGE_SIZE - 1)
- stnp x2, x3, [x0, #-256]
+ stp x2, x3, [x0, #-256]
ldp x2, x3, [x1]
- stnp x4, x5, [x0, #16 - 256]
+ stp x4, x5, [x0, #16 - 256]
ldp x4, x5, [x1, #16]
- stnp x6, x7, [x0, #32 - 256]
+ stp x6, x7, [x0, #32 - 256]
ldp x6, x7, [x1, #32]
- stnp x8, x9, [x0, #48 - 256]
+ stp x8, x9, [x0, #48 - 256]
ldp x8, x9, [x1, #48]
- stnp x10, x11, [x0, #64 - 256]
+ stp x10, x11, [x0, #64 - 256]
ldp x10, x11, [x1, #64]
- stnp x12, x13, [x0, #80 - 256]
+ stp x12, x13, [x0, #80 - 256]
ldp x12, x13, [x1, #80]
- stnp x14, x15, [x0, #96 - 256]
+ stp x14, x15, [x0, #96 - 256]
ldp x14, x15, [x1, #96]
- stnp x16, x17, [x0, #112 - 256]
+ stp x16, x17, [x0, #112 - 256]
ldp x16, x17, [x1, #112]
add x0, x0, #128
@@ -54,14 +54,14 @@ SYM_FUNC_START(__pi_copy_page)
b.ne 1b
- stnp x2, x3, [x0, #-256]
- stnp x4, x5, [x0, #16 - 256]
- stnp x6, x7, [x0, #32 - 256]
- stnp x8, x9, [x0, #48 - 256]
- stnp x10, x11, [x0, #64 - 256]
- stnp x12, x13, [x0, #80 - 256]
- stnp x14, x15, [x0, #96 - 256]
- stnp x16, x17, [x0, #112 - 256]
+ stp x2, x3, [x0, #-256]
+ stp x4, x5, [x0, #16 - 256]
+ stp x6, x7, [x0, #32 - 256]
+ stp x8, x9, [x0, #48 - 256]
+ stp x10, x11, [x0, #64 - 256]
+ stp x12, x13, [x0, #80 - 256]
+ stp x14, x15, [x0, #96 - 256]
+ stp x16, x17, [x0, #112 - 256]
ret
SYM_FUNC_END(__pi_copy_page)
--
2.43.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH] arm64/lib: copy_page: s/stnp/stp
2024-06-13 0:18 [PATCH] arm64/lib: copy_page: s/stnp/stp Jisheng Zhang
@ 2024-06-24 17:56 ` Catalin Marinas
2024-06-26 11:50 ` Jisheng Zhang
0 siblings, 1 reply; 4+ messages in thread
From: Catalin Marinas @ 2024-06-24 17:56 UTC (permalink / raw)
To: Jisheng Zhang; +Cc: Will Deacon, linux-arm-kernel, linux-kernel
On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> stnp performs non-temporal store, give a hints to the memory system
> that caching is not useful for this data. But the scenario where
> copy_page() used may not have this implication, although I must admit
> there's such case where stnp helps performance(good). In this good
> case, we can rely on the HW write streaming mechanism in some
> implementations such as cortex-a55 to detect the case and take actions.
>
> testing with https://github.com/apinski-cavium/copy_page_benchmark
> this patch can reduce the time by about 3% on cortex-a55 platforms.
What about other CPUs? I'm also not convinced by such microbenchmarks.
It looks like it always copies to the same page, the stp may even
benefit from some caching of the data which we wouldn't need in a real
scenario.
So, I'm not merging this unless it's backed by some solid data across
several CPU implementations.
--
Catalin
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] arm64/lib: copy_page: s/stnp/stp
2024-06-24 17:56 ` Catalin Marinas
@ 2024-06-26 11:50 ` Jisheng Zhang
2024-06-26 18:08 ` Catalin Marinas
0 siblings, 1 reply; 4+ messages in thread
From: Jisheng Zhang @ 2024-06-26 11:50 UTC (permalink / raw)
To: Catalin Marinas; +Cc: Will Deacon, linux-arm-kernel, linux-kernel
On Mon, Jun 24, 2024 at 06:56:33PM +0100, Catalin Marinas wrote:
> On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> > stnp performs non-temporal store, give a hints to the memory system
> > that caching is not useful for this data. But the scenario where
> > copy_page() used may not have this implication, although I must admit
> > there's such case where stnp helps performance(good). In this good
> > case, we can rely on the HW write streaming mechanism in some
> > implementations such as cortex-a55 to detect the case and take actions.
> >
> > testing with https://github.com/apinski-cavium/copy_page_benchmark
> > this patch can reduce the time by about 3% on cortex-a55 platforms.
>
> What about other CPUs? I'm also not convinced by such microbenchmarks.
Per my test on CA53 and CA73, CA73 got similar improvements. As for CA53
there's no difference, maybe due to the follwoing commit in the ATF:
54035fc4672aa ("Disable non-temporal hint on Cortex-A53/57")
> It looks like it always copies to the same page, the stp may even
> benefit from some caching of the data which we wouldn't need in a real
> scenario.
Yep this is also my understanding where's the improvement from. And
I must admit there's case where stnp helps performance. we can rely
on the HW write streaming mechanism to detect and take actions.
However, sometimes, the "cache" behavior can benefit the scenario. Then
in this case, the stnp here would double lose.
what do you think?
>
> So, I'm not merging this unless it's backed by some solid data across
> several CPU implementations.
>
> --
> Catalin
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [PATCH] arm64/lib: copy_page: s/stnp/stp
2024-06-26 11:50 ` Jisheng Zhang
@ 2024-06-26 18:08 ` Catalin Marinas
0 siblings, 0 replies; 4+ messages in thread
From: Catalin Marinas @ 2024-06-26 18:08 UTC (permalink / raw)
To: Jisheng Zhang; +Cc: Will Deacon, linux-arm-kernel, linux-kernel
On Wed, Jun 26, 2024 at 07:50:57PM +0800, Jisheng Zhang wrote:
> On Mon, Jun 24, 2024 at 06:56:33PM +0100, Catalin Marinas wrote:
> > On Thu, Jun 13, 2024 at 08:18:12AM +0800, Jisheng Zhang wrote:
> > > stnp performs non-temporal store, give a hints to the memory system
> > > that caching is not useful for this data. But the scenario where
> > > copy_page() used may not have this implication, although I must admit
> > > there's such case where stnp helps performance(good). In this good
> > > case, we can rely on the HW write streaming mechanism in some
> > > implementations such as cortex-a55 to detect the case and take actions.
> > >
> > > testing with https://github.com/apinski-cavium/copy_page_benchmark
> > > this patch can reduce the time by about 3% on cortex-a55 platforms.
[...]
> > It looks like it always copies to the same page, the stp may even
> > benefit from some caching of the data which we wouldn't need in a real
> > scenario.
>
> Yep this is also my understanding where's the improvement from. And
> I must admit there's case where stnp helps performance. we can rely
> on the HW write streaming mechanism to detect and take actions.
Well, is that case realistic? Can you show any improvement with some
real-world uses? Most likely modern CPUs fall back to non-temporal
stores after a series of STPs but it depends on how soon they do it, how
much cache gets polluted. OTOH, page copying could be the result of a
CoW and we'd expect subsequent accesses from the user where some caching
may be beneficial.
So, hard to tell but we should make a decision based on a microbenchmark
that writes over the same page multiple times. If you have some
real-world tests that exercise this path (e.g. CoW, Android app startup)
and show an improvement, I'd be in favour of this. Otherwise, no.
Thanks.
--
Catalin
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2024-06-26 18:09 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-13 0:18 [PATCH] arm64/lib: copy_page: s/stnp/stp Jisheng Zhang
2024-06-24 17:56 ` Catalin Marinas
2024-06-26 11:50 ` Jisheng Zhang
2024-06-26 18:08 ` Catalin Marinas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).