* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
From: Will Deacon @ 2016-01-06 16:31 UTC
To: linux-arm-kernel
In-Reply-To: <CA+=Sn1ku1CQUM8whiMmv_sZY175kt6b1wg_818fyu++N6Sybgg@mail.gmail.com> [not found]

Hi Andrew,

On Tue, Dec 22, 2015 at 03:32:19PM -0800, Andrew Pinski wrote:
> On Tue, Dec 21, 2015 at 5:43 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > On Monday 21 December 2015, Will Deacon wrote:
> >> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
>
> I think it is the prefetching. ThunderX T88 pass 1 and pass 2 do not
> have a hardware prefetcher, so prefetching half a cache line ahead
> does not help at all.
>
> >>
> >> Also, how are you measuring the improvement? If you can share your
> >> test somewhere, I can see how it affects the other systems I have
> >> access to.
>
> You can find my benchmark at
> https://github.com/apinski-cavium/copy_page_benchmark .
> copy_page is my previous patch.
> copy_page128 is just the unrolled version with only 128-byte prefetching.
> copy_page64 is the original code.
> copy_page64unroll is the new patch which I will be sending out soon.

Thanks, this was really helpful to evaluate the different versions on
the Cortex-A* cores I've got on my desk. Doing so showed that, in fact,
having explicit prfm instructions tends to be *harmful* for us -- the
hardware prefetcher is actually doing a much better job on its own.

Now, I still maintain that we don't want lots of different copy_page
implementations, but I'm not averse to patching a nop with a prfm on
cores that benefit from software-driven prefetching. We could hang it
off the alternatives framework that we have already.

> > Are there any possible downsides to using the ThunderX version on other
> > microarchitectures too and skip the check?
>
> Yes, that is a good idea. I will send out a new patch in a little bit
> which just unrolls the loop while keeping the two prefetch
> instructions in there.

copy_page64unroll didn't perform well on all of my systems. The code
below was the best all-rounder I could come up with. Do you reckon you
could try taking it and adding prefetches to see if you can make it fly
on ThunderX?

Cheers,

Will

--->8

ENTRY(copy_page)
        ldp     x2, x3, [x1]
        ldp     x4, x5, [x1, #16]
        ldp     x6, x7, [x1, #32]
        ldp     x8, x9, [x1, #48]
        ldp     x10, x11, [x1, #64]
        ldp     x12, x13, [x1, #80]
        ldp     x14, x15, [x1, #96]
        ldp     x16, x17, [x1, #112]

        mov     x18, #(PAGE_SIZE - 128)
        add     x1, x1, #128
1:
        subs    x18, x18, #128

        stnp    x2, x3, [x0]
        ldp     x2, x3, [x1]
        stnp    x4, x5, [x0, #16]
        ldp     x4, x5, [x1, #16]
        stnp    x6, x7, [x0, #32]
        ldp     x6, x7, [x1, #32]
        stnp    x8, x9, [x0, #48]
        ldp     x8, x9, [x1, #48]
        stnp    x10, x11, [x0, #64]
        ldp     x10, x11, [x1, #64]
        stnp    x12, x13, [x0, #80]
        ldp     x12, x13, [x1, #80]
        stnp    x14, x15, [x0, #96]
        ldp     x14, x15, [x1, #96]
        stnp    x16, x17, [x0, #112]
        ldp     x16, x17, [x1, #112]

        add     x0, x0, #128
        add     x1, x1, #128

        b.gt    1b

        stnp    x2, x3, [x0]
        stnp    x4, x5, [x0, #16]
        stnp    x6, x7, [x0, #32]
        stnp    x8, x9, [x0, #48]
        stnp    x10, x11, [x0, #64]
        stnp    x12, x13, [x0, #80]
        stnp    x14, x15, [x0, #96]
        stnp    x16, x17, [x0, #112]
        ret
ENDPROC(copy_page)

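The benchmark repository linked above is the reference for the numbers quoted in this thread. For readers who want a quick local comparison of page-copy variants, a minimal user-space harness in the same spirit could look roughly like the sketch below; the memcpy-based stand-in and the overall structure are illustrative assumptions, not the actual copy_page_benchmark code, and the assembly variants from the repository would be linked in and added to main() in the same way.

/* Minimal sketch of a harness for timing page-copy routines. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SIZE   4096
#define ITERATIONS  100000

/* Signature of a routine that copies one page from src to dst. */
typedef void (*copy_page_fn)(void *dst, const void *src);

/* Stand-in for the assembly variants under test. */
static void copy_page_memcpy(void *dst, const void *src)
{
        memcpy(dst, src, PAGE_SIZE);
}

/* Return average nanoseconds per page copy for one variant. */
static double bench(copy_page_fn fn, void *dst, const void *src)
{
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < ITERATIONS; i++)
                fn(dst, src);
        clock_gettime(CLOCK_MONOTONIC, &end);

        return ((end.tv_sec - start.tv_sec) * 1e9 +
                (end.tv_nsec - start.tv_nsec)) / ITERATIONS;
}

int main(void)
{
        void *src, *dst;

        /* Page-aligned buffers, matching copy_page()'s contract. */
        if (posix_memalign(&src, PAGE_SIZE, PAGE_SIZE) ||
            posix_memalign(&dst, PAGE_SIZE, PAGE_SIZE))
                return 1;
        memset(src, 0xa5, PAGE_SIZE);

        /* Link in copy_page64, copy_page128, etc. and time them here too. */
        printf("memcpy: %.1f ns/page\n", bench(copy_page_memcpy, dst, src));

        free(src);
        free(dst);
        return 0;
}
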
* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
From: Andrew Pinski @ 2015-12-20  0:11 UTC
To: linux-arm-kernel

Adding a check for the cache line size is not much overhead.
Special-case a 128-byte cache line size.
This improves copy_page by 85% on ThunderX compared to the
original implementation. For LMBench, it improves results by 4-10%.

Signed-off-by: Andrew Pinski <apinski@cavium.com>
---
 arch/arm64/lib/copy_page.S |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/arch/arm64/lib/copy_page.S b/arch/arm64/lib/copy_page.S
index 512b9a7..4c28789 100644
--- a/arch/arm64/lib/copy_page.S
+++ b/arch/arm64/lib/copy_page.S
@@ -18,6 +18,7 @@
 #include <linux/const.h>
 #include <asm/assembler.h>
 #include <asm/page.h>
+#include <asm/cachetype.h>

 /*
  * Copy a page from src to dest (both are page aligned)
@@ -27,8 +28,17 @@
  *	x1 - src
  */
 ENTRY(copy_page)
+	/* Special case 128 byte or more cache lines */
+	mrs	x2, ctr_el0
+	lsr	x2, x2, CTR_CWG_SHIFT
+	and	w2, w2, CTR_CWG_MASK
+	cmp	w2, 5
+	b.ge	2f
+
 	/* Assume cache line size is 64 bytes. */
 	prfm	pldl1strm, [x1, #64]
+	/* Align the loop so it fits in one cache line. */
+	.balign 64
 1:	ldp	x2, x3, [x1]
 	ldp	x4, x5, [x1, #16]
 	ldp	x6, x7, [x1, #32]
@@ -43,4 +53,33 @@ ENTRY(copy_page)
 	tst	x1, #(PAGE_SIZE - 1)
 	b.ne	1b
 	ret
+
+2:
+	/* The cache line size is at least 128 bytes. */
+	prfm	pldl1strm, [x1, #128]
+	/* Align the loop so it fits in one cache line. */
+	.balign 128
+1:	prfm	pldl1strm, [x1, #256]
+	ldp	x2, x3, [x1]
+	ldp	x4, x5, [x1, #16]
+	ldp	x6, x7, [x1, #32]
+	ldp	x8, x9, [x1, #48]
+	stnp	x2, x3, [x0]
+	stnp	x4, x5, [x0, #16]
+	stnp	x6, x7, [x0, #32]
+	stnp	x8, x9, [x0, #48]
+
+	ldp	x2, x3, [x1, #64]
+	ldp	x4, x5, [x1, #80]
+	ldp	x6, x7, [x1, #96]
+	ldp	x8, x9, [x1, #112]
+	add	x1, x1, #128
+	stnp	x2, x3, [x0, #64]
+	stnp	x4, x5, [x0, #80]
+	stnp	x6, x7, [x0, #96]
+	stnp	x8, x9, [x0, #112]
+	add	x0, x0, #128
+	tst	x1, #(PAGE_SIZE - 1)
+	b.ne	1b
+	ret
 ENDPROC(copy_page)
-- 
1.7.2.5

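The cache-line check at the top of the patch reads the CWG (cache writeback granule) field of CTR_EL0: bits [27:24] hold log2 of the granule size in 4-byte words, so a value of 5 or more means a granule of at least 4 << 5 = 128 bytes, which is what the cmp w2, 5 / b.ge 2f test keys off. A rough user-space sketch of the same decoding is below; it is illustrative only, not part of the patch, and it assumes the kernel has enabled EL0 reads of CTR_EL0 (SCTLR_EL1.UCT), as mainline arm64 Linux does.

/* Decode the CWG field of CTR_EL0 the same way the patch does. */
#include <stdint.h>
#include <stdio.h>

#define CTR_CWG_SHIFT   24
#define CTR_CWG_MASK    15

static unsigned int cwg_bytes(void)
{
        uint64_t ctr;

        asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
        /* CWG is log2 of the granule in words; 0 means "not reported". */
        return 4U << ((ctr >> CTR_CWG_SHIFT) & CTR_CWG_MASK);
}

int main(void)
{
        unsigned int cwg = cwg_bytes();

        /* The patch takes the 128-byte path when CWG >= 5, i.e. >= 128 bytes. */
        printf("cache writeback granule: %u bytes -> %s loop\n",
               cwg, cwg >= 128 ? "128-byte" : "64-byte");
        return 0;
}
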
* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
From: Will Deacon @ 2015-12-21 12:46 UTC
To: linux-arm-kernel

On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> Adding a check for the cache line size is not much overhead.
> Special case 128 byte cache line size.
> This improves copy_page by 85% on ThunderX compared to the
> original implementation.

So this patch seems to:

  - Align the loop
  - Increase the prefetch size
  - Unroll the loop once

Do you know where your 85% boost comes from between these? I'd really
like to avoid having multiple versions of copy_page, if possible, but
maybe we could end up with something that works well enough regardless
of cacheline size. Understanding what your bottleneck is would help to
lead us in the right direction.

Also, how are you measuring the improvement? If you can share your
test somewhere, I can see how it affects the other systems I have
access to.

Will

* [PATCH] ARM64: Improve copy_page for 128 cache line sizes.
From: Arnd Bergmann @ 2015-12-21 13:42 UTC
To: linux-arm-kernel

On Monday 21 December 2015, Will Deacon wrote:
> On Sat, Dec 19, 2015 at 04:11:18PM -0800, Andrew Pinski wrote:
> > Adding a check for the cache line size is not much overhead.
> > Special case 128 byte cache line size.
> > This improves copy_page by 85% on ThunderX compared to the
> > original implementation.
>
> So this patch seems to:
>
>   - Align the loop
>   - Increase the prefetch size
>   - Unroll the loop once
>
> Do you know where your 85% boost comes from between these? I'd really
> like to avoid having multiple versions of copy_page, if possible, but
> maybe we could end up with something that works well enough regardless
> of cacheline size. Understanding what your bottleneck is would help to
> lead us in the right direction.
>
> Also, how are you measuring the improvement? If you can share your
> test somewhere, I can see how it affects the other systems I have access
> to.

A related question would be how other CPU cores are affected by the
change. The test for the cache line size is going to take a few cycles,
possibly a lot on certain implementations, e.g. if we ever get one where
'mrs' is microcoded or trapped by a hypervisor.

Are there any possible downsides to using the ThunderX version on other
microarchitectures too and skip the check?

	Arnd

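For a coarse sense of how expensive that extra CTR_EL0 read is on a particular implementation, something like the following sketch can be used. It measures ticks of the generic counter (CNTVCT_EL0, readable from EL0 on Linux) rather than CPU cycles, and it obviously says nothing about implementations where the mrs might be microcoded or trapped; the approach is only an illustration, not anything proposed in the thread.

/* Roughly estimate the amortised cost of reading CTR_EL0. */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t read_cntvct(void)
{
        uint64_t v;

        asm volatile("isb\n\tmrs %0, cntvct_el0" : "=r" (v));
        return v;
}

int main(void)
{
        enum { LOOPS = 1000000 };
        uint64_t ctr, before, after;

        before = read_cntvct();
        for (int i = 0; i < LOOPS; i++)
                asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
        after = read_cntvct();
        (void)ctr;

        /* Counter ticks (not cycles) per CTR_EL0 read, amortised. */
        printf("~%.3f counter ticks per ctr_el0 read (%llu total)\n",
               (double)(after - before) / LOOPS,
               (unsigned long long)(after - before));
        return 0;
}
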