public inbox for linux-arm-kernel@lists.infradead.org
 help / color / mirror / Atom feed
From: Mark Rutland <mark.rutland@arm.com>
To: "Havens, Austin" <austin.havens@anritsu.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>,
	"will@kernel.org" <will@kernel.org>,
	"michal.simek@amd.com" <michal.simek@amd.com>,
	"Suresh, Siddarth" <siddarth.suresh@anritsu.com>,
	"Lui, Vincent" <vincent.lui@anritsu.com>,
	"linux-arm-kernel@lists.infradead.org"
	<linux-arm-kernel@lists.infradead.org>
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15
Date: Thu, 29 Jun 2023 15:24:33 +0100	[thread overview]
Message-ID: <ZJ2UIfbCWeLzRMfs@FVFF77S0Q05N> (raw)
In-Reply-To: <OS3P301MB04215E0BD835F0742EA377D9EE24A@OS3P301MB0421.JPNP301.PROD.OUTLOOK.COM>

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >After some investigation I am guessing the issue is either in the iovector
> >iteration changes (around
> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
> >of my depth so it is just speculation. 
> 
> After comparing the dissassembly of __arch_copy_from_user on both kernels and
> going through commit logs, I figured out the slowdown was mostly due to to
> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifially
> the changes to uao_ldp. 

For the benefit of others, that's commit:

  fc703d80130b1c9d ("arm64: uaccess: split user/kernel routine")

> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>  #endif
>  
>  /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>   * This is complicated as there is no post-increment or pair versions of the
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
> -#ifdef CONFIG_ARM64_UAO
>         .macro uao_ldp l, reg1, reg2, addr, post_inc
> -               alternative_if_not ARM64_HAS_UAO
> -8888:                  ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                  nop;
> -                       nop;
> -               alternative_else
> -                       ldtr    \reg1, [\addr];
> -                       ldtr    \reg2, [\addr, #8];
> -                       add     \addr, \addr, \post_inc;
> -               alternative_endif
> +8888:          ldtr    \reg1, [\addr];
> +8889:          ldtr    \reg2, [\addr, #8];
> +               add     \addr, \addr, \post_inc;
>  
>                 _asm_extable    8888b,\l;
>                 _asm_extable    8889b,\l;
>         .endm
> 
> I could not directly revert the changes to test since more names changed in
> other commits than I cared to figure out, but I hacked out that change, and
> saw that the performance of the test program was basically back to normal. 
> 
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>         .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:          ldtr    \reg1, [\addr];
> -8889:          ldtr    \reg2, [\addr, #8];
> -               add     \addr, \addr, \post_inc;
> +8888:          ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:          nop;
> +               nop;

As Catalin noted, we can't make that change generally as it'd be broken for any
system with PAN, and in general we *really* want to use LDTR/STTR for user
accesses to catch any misuse with kernel pointers.

> Profiling with the hacked __arch_copy_from_user 
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> 
>  Performance counter stats for '/mnt/usrroot/test_copy':
> 
>           11822342      instructions              #    0.23  insn per cycle         
>           50689594      cycles                                                      
>           37627922      ld_dep_stall                                                
>              17933      read_alloc                                                  
>               3421      dTLB-load-misses                                            
> 
>        0.043440253 seconds time elapsed
> 
>        0.004382000 seconds user
>        0.039442000 seconds sys
> 
> Unfortunately the hack crashes in other cases so it is not a viable solution
> for us. Also, on our actual workload there is still a small difference in
> performance remaining that I have not tracked down yet (I am guessing it has
> to do with the dTLB-load-misses remaining higher). 
> 
> Note, I think that the slow down is only noticeable in cases like ours where
> the data being copied from is not in cache (for us, because the FPGA writes
> it).

When you say "is not in cache", what exactly do you mean? If this were just the
latency of filling a cache I wouldn't expect the size of the first access to
make a difference, so I'm assuming the source buffer is not mapped with
cacheable memory attributes, which we generally assume.

Which memory attribues are the source and destination buffers mapped with? Is
that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?

I'm assuming this is with some out-of-tree driver; if that's in a public tree
could you please provide a pointer to it?

Thanks,
Mark.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2023-06-29 14:25 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-06-23 21:30 Slowdown copying data between kernel versions 4.19 and 5.15 Havens, Austin
2023-06-28 21:38 ` Havens, Austin
2023-06-29 13:09   ` Robin Murphy
2023-06-29 13:36   ` Catalin Marinas
2023-06-29 14:24   ` Mark Rutland [this message]
2023-06-29 18:33     ` Robin Murphy
2023-06-29 19:33     ` Havens, Austin
2023-06-30 11:15       ` Mark Rutland
2023-07-06 19:12         ` Havens, Austin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZJ2UIfbCWeLzRMfs@FVFF77S0Q05N \
    --to=mark.rutland@arm.com \
    --cc=austin.havens@anritsu.com \
    --cc=catalin.marinas@arm.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=michal.simek@amd.com \
    --cc=siddarth.suresh@anritsu.com \
    --cc=vincent.lui@anritsu.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox