From: Catalin Marinas <catalin.marinas@arm.com>
To: "Havens, Austin" <austin.havens@anritsu.com>
Cc: "will@kernel.org" <will@kernel.org>,
"michal.simek@amd.com" <michal.simek@amd.com>,
"Suresh, Siddarth" <siddarth.suresh@anritsu.com>,
"Lui, Vincent" <vincent.lui@anritsu.com>,
Mark Rutland <mark.rutland@arm.com>,
"linux-arm-kernel@lists.infradead.org"
<linux-arm-kernel@lists.infradead.org>
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15
Date: Thu, 29 Jun 2023 14:36:58 +0100 [thread overview]
Message-ID: <ZJ2I+qJ0o8KQAXkE@arm.com> (raw)
In-Reply-To: <OS3P301MB04215E0BD835F0742EA377D9EE24A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM>
Hi Austin,
On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> <Friday, June 23, 2023 14:30 PM PDT, Austin Havens (Anritsu)>
> >In the process of updating our kernel from 4.19 to 5.15 we noticed a
> >slowdown when copying data. We are using Zynqmp 9EG SoCs and
> >basically following the Xilinx/AMD release branches (though a bit
> >behind). I did some sample based profiling with perf, and it showed
> >that a lot of the time was in __arch_copy_from_user, and since the
> >amount of data getting copied is the same, it seems like it is
> >spending more time in each __arch_copy_from_user call.
Thanks for digging into this. Which CPUs does this SoC have? Cortex-A53?
> >I made a test program to replicate the issue and here is what I see
> >(i used the same binary on both versions to rule out differences from
> >the compiler).
> >
> >root@smudge:/tmp# uname -a
> >Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> > 13202623 instructions # 0.25 insn per cycle
> > 52947780 cycles
> > 37588761 ld_dep_stall
> > 16301 read_alloc
> > 1660 dTLB-load-misses
> >
> > 0.044990363 seconds time elapsed
> >
> > 0.004092000 seconds user
> > 0.040920000 seconds sys
> >
> >root@ahraptor:/tmp# uname -a
> >Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> > 11625888 instructions # 0.14 insn per cycle
> > 83135040 cycles
> > 69833562 ld_dep_stall
> > 27948 read_alloc
> > 3367 dTLB-load-misses
> >
> > 0.070537894 seconds time elapsed
> >
> > 0.004165000 seconds user
> > 0.066643000 seconds sys
It is indeed a significant slowdown but does it show in real world
scenarios or mostly in microbenchmarks?
> After comparing the dissassembly of __arch_copy_from_user on both
> kernels and going through commit logs, I figured out the slowdown was
> mostly due to to the changes from commit
> c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially the changes to
> uao_ldp.
Your commit is missing an 'f' in front, it should be fc703d80130b
("arm64: uaccess: split user/kernel routines"). This is indeed replacing
one LDP with two LDTR instructions. The reason for this is that we
wanted to only use the 'T' variants of the access instructions so that
they are executed with EL0 privileges (better for security).
> I could not directly revert the changes to test since more names
> changed in other commits than I cared to figure out, but I hacked out
> that change, and saw that the performance of the test program was
> basically back to normal.
>
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
> * unprivileged instructions, and USER() only works for single instructions.
> */
> .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888: ldtr \reg1, [\addr];
> -8889: ldtr \reg2, [\addr, #8];
> - add \addr, \addr, \post_inc;
> +8888: ldp \reg1, \reg2, [\addr], \post_inc;
> +8889: nop;
> + nop;
This won't work in all cases. For example on newer CPUs it fails if PAN
is enabled since LDP won't be able to access user space. While we could
disable PAN for these routines (and MTE tag checking), this also
overrides the execute-only user permissions since now an LDP/STP is
allowed to access them.
So, I'm not adding them back for a specific microarchitecture.
> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
> Performance counter stats for '/mnt/usrroot/test_copy':
>
> 11822342 instructions # 0.23 insn per cycle
> 50689594 cycles
> 37627922 ld_dep_stall
> 17933 read_alloc
> 3421 dTLB-load-misses
>
> 0.043440253 seconds time elapsed
>
> 0.004382000 seconds user
> 0.039442000 seconds sys
>
> Unfortunately the hack crashes in other cases so it is not a viable
> solution for us. Also, on our actual workload there is still a small
> difference in performance remaining that I have not tracked down yet
> (I am guessing it has to do with the dTLB-load-misses remaining
> higher).
>
> Note, I think that the slow down is only noticeable in cases like ours
> where the data being copied from is not in cache (for us, because the
> FPGA writes it).
I can see Robin already replied mentioning the usercopy refresh series,
maybe these do improve performance.
Since you mentioned the data is not cached, you could experiment with
some prefetching in copy loop, maybe it boosts the performance a bit.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
next prev parent reply other threads:[~2023-06-29 13:37 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
2023-06-28 21:38 ` Havens, Austin
2023-06-29 13:09 ` Robin Murphy
2023-06-29 13:36 ` Catalin Marinas [this message]
2023-06-29 14:24 ` Mark Rutland
2023-06-29 18:33 ` Robin Murphy
2023-06-29 19:33 ` Havens, Austin
2023-06-30 11:15 ` Mark Rutland
2023-07-06 19:12 ` Havens, Austin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZJ2I+qJ0o8KQAXkE@arm.com \
--to=catalin.marinas@arm.com \
--cc=austin.havens@anritsu.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=mark.rutland@arm.com \
--cc=michal.simek@amd.com \
--cc=siddarth.suresh@anritsu.com \
--cc=vincent.lui@anritsu.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox