Re: Slowdown copying data between kernel versions 4.19 and 5.15

From: Catalin Marinas <catalin.marinas@arm.com>
To: "Havens, Austin" <austin.havens@anritsu.com>
Cc: "will@kernel.org" <will@kernel.org>,
	"michal.simek@amd.com" <michal.simek@amd.com>,
	"Suresh, Siddarth" <siddarth.suresh@anritsu.com>,
	"Lui, Vincent" <vincent.lui@anritsu.com>,
	Mark Rutland <mark.rutland@arm.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15
Date: Thu, 29 Jun 2023 14:36:58 +0100	[thread overview]
Message-ID: <ZJ2I+qJ0o8KQAXkE@arm.com> (raw)
In-Reply-To: <OS3P301MB04215E0BD835F0742EA377D9EE24A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM>

Hi Austin,

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> <Friday, June 23, 2023 14:30 PM PDT, Austin Havens (Anritsu)> 
> >In the process of updating our kernel from 4.19 to 5.15 we noticed a
> >slowdown when copying data.  We are using  Zynqmp 9EG SoCs and
> >basically following the Xilinx/AMD release branches (though a bit
> >behind).  I did some sample based profiling with perf, and it showed
> >that a lot of the time was in __arch_copy_from_user, and since the
> >amount of data getting copied is the same, it seems like it is
> >spending more time in each __arch_copy_from_user call. 

Thanks for digging into this. Which CPUs does this SoC have? Cortex-A53?

> >I made  a test program to replicate the issue and here is what I see
> >(i used the same binary on both versions to rule out differences from
> >the compiler). 
> >
> >root@smudge:/tmp# uname -a
> >Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          13202623      instructions              #    0.25  insn per cycle         
> >          52947780      cycles                                                      
> >          37588761      ld_dep_stall                                                
> >             16301      read_alloc                                                  
> >              1660      dTLB-load-misses                                            
> >
> >       0.044990363 seconds time elapsed
> >
> >       0.004092000 seconds user
> >       0.040920000 seconds sys
> >
> >root@ahraptor:/tmp# uname -a
> >Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          11625888      instructions              #    0.14  insn per cycle         
> >          83135040      cycles                                                      
> >          69833562      ld_dep_stall                                                
> >             27948      read_alloc                                                  
> >              3367      dTLB-load-misses                                            
> >
> >       0.070537894 seconds time elapsed
> >
> >       0.004165000 seconds user
> >       0.066643000 seconds sys

It is indeed a significant slowdown but does it show in real world
scenarios or mostly in microbenchmarks?

> After comparing the dissassembly of __arch_copy_from_user on both
> kernels and going through commit logs, I figured out the slowdown was
> mostly due to to the changes from commit
> c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially the changes to
> uao_ldp. 

Your commit is missing an 'f' in front, it should be fc703d80130b
("arm64: uaccess: split user/kernel routines"). This is indeed replacing
one LDP with two LDTR instructions. The reason for this is that we
wanted to only use the 'T' variants of the access instructions so that
they are executed with EL0 privileges (better for security).

> I could not directly revert the changes to test since more names
> changed in other commits than I cared to figure out, but I hacked out
> that change, and saw that the performance of the test program was
> basically back to normal. 
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>         .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;

This won't work in all cases. For example on newer CPUs it fails if PAN
is enabled since LDP won't be able to access user space. While we could
disable PAN for these routines (and MTE tag checking), this also
overrides the execute-only user permissions since now an LDP/STP is
allowed to access them.

So, I'm not adding them back for a specific microarchitecture.

> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>  Performance counter stats for '/mnt/usrroot/test_copy':
> 
>           11822342      instructions              #    0.23  insn per cycle
>           50689594      cycles
>           37627922      ld_dep_stall
>              17933      read_alloc
>               3421      dTLB-load-misses
> 
>        0.043440253 seconds time elapsed
> 
>        0.004382000 seconds user
>        0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases so it is not a viable
> solution for us. Also, on our actual workload there is still a small
> difference in performance remaining that I have not tracked down yet
> (I am guessing it has to do with the dTLB-load-misses remaining
> higher). 
> 
> Note, I think that the slow down is only noticeable in cases like ours
> where the data being copied from is not in cache (for us, because the
> FPGA writes it).

I can see Robin already replied mentioning the usercopy refresh series,
maybe these do improve performance.

Since you mentioned the data is not cached, you could experiment with
some prefetching in copy loop, maybe it boosts the performance a bit.

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel