Date: Thu, 29 Jun 2023 14:36:58 +0100
From: Catalin Marinas
To: "Havens, Austin"
Cc: "will@kernel.org", "michal.simek@amd.com", "Suresh, Siddarth",
	"Lui, Vincent", Mark Rutland,
	"linux-arm-kernel@lists.infradead.org"
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15

Hi Austin,

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
>
> >In the process of updating our kernel from 4.19 to 5.15 we noticed a
> >slowdown when copying data. We are using ZynqMP 9EG SoCs and
> >basically following the Xilinx/AMD release branches (though a bit
> >behind). I did some sample-based profiling with perf, and it showed
> >that a lot of the time was in __arch_copy_from_user, and since the
> >amount of data getting copied is the same, it seems like it is
> >spending more time in each __arch_copy_from_user call.

Thanks for digging into this. Which CPUs does this SoC have? Cortex-A53?

> >I made a test program to replicate the issue and here is what I see
> >(I used the same binary on both versions to rule out differences from
> >the compiler).
> >
> >root@smudge:/tmp# uname -a
> >Linux smudge 4.19.0-xilinx-v2019.1 #1 SMP PREEMPT Thu May 18 04:01:27 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@smudge:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          13202623      instructions              #    0.25  insn per cycle
> >          52947780      cycles
> >          37588761      ld_dep_stall
> >             16301      read_alloc
> >              1660      dTLB-load-misses
> >
> >       0.044990363 seconds time elapsed
> >
> >       0.004092000 seconds user
> >       0.040920000 seconds sys
> >
> >root@ahraptor:/tmp# uname -a
> >Linux ahraptor 5.15.36-xilinx-v2022.1 #1 SMP PREEMPT Mon Apr 10 22:46:16 UTC 2023 aarch64 aarch64 aarch64 GNU/Linux
> >root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
> >
> > Performance counter stats for '/mnt/usrroot/test_copy':
> >
> >          11625888      instructions              #    0.14  insn per cycle
> >          83135040      cycles
> >          69833562      ld_dep_stall
> >             27948      read_alloc
> >              3367      dTLB-load-misses
> >
> >       0.070537894 seconds time elapsed
> >
> >       0.004165000 seconds user
> >       0.066643000 seconds sys

It is indeed a significant slowdown, but does it show in real-world
scenarios or mostly in microbenchmarks?

> After comparing the disassembly of __arch_copy_from_user on both
> kernels and going through commit logs, I figured out the slowdown was
> mostly due to the changes from commit
> c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically the changes to
> uao_ldp.

Your commit ID is missing an 'f' in front; it should be fc703d80130b
("arm64: uaccess: split user/kernel routines"). This is indeed replacing
one LDP with two LDTR instructions. The reason for this is that we
wanted to only use the 'T' variants of the access instructions so that
they are executed with EL0 privileges (better for security).

> I could not directly revert the changes to test since more names
> changed in other commits than I cared to figure out, but I hacked out
> that change, and saw that the performance of the test program was
> basically back to normal.
>
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>  	.macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:		ldtr	\reg1, [\addr];
> -8889:		ldtr	\reg2, [\addr, #8];
> -		add	\addr, \addr, \post_inc;
> +8888:		ldp	\reg1, \reg2, [\addr], \post_inc;
> +8889:		nop;
> +		nop;

This won't work in all cases. For example, on newer CPUs it fails if PAN
is enabled since LDP won't be able to access user space. While we could
disable PAN for these routines (and MTE tag checking), this also
overrides the execute-only user permissions since now an LDP/STP is
allowed to access them. So, I'm not adding them back for a specific
microarchitecture.

> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
>  Performance counter stats for '/mnt/usrroot/test_copy':
>
>           11822342      instructions              #    0.23  insn per cycle
>           50689594      cycles
>           37627922      ld_dep_stall
>              17933      read_alloc
>               3421      dTLB-load-misses
>
>        0.043440253 seconds time elapsed
>
>        0.004382000 seconds user
>        0.039442000 seconds sys
>
> Unfortunately the hack crashes in other cases so it is not a viable
> solution for us. Also, on our actual workload there is still a small
> difference in performance remaining that I have not tracked down yet
> (I am guessing it has to do with the dTLB-load-misses remaining
> higher).
>
> Note, I think that the slowdown is only noticeable in cases like ours
> where the data being copied from is not in cache (for us, because the
> FPGA writes it).

I can see Robin already replied mentioning the usercopy refresh series;
maybe it does improve performance. Since you mentioned the data is not
cached, you could experiment with some prefetching in the copy loop;
maybe it boosts the performance a bit.

-- 
Catalin
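
As a cross-check of the perf stat figures quoted in this thread, the
derived ratios can be recomputed with a few lines of Python. This is
only a sketch: the counter values are copied from the three runs above,
while the run labels, the dictionary layout and the helper names (ipc,
stall_fraction, cycle_ratio) are made up for illustration. It also
assumes that ld_dep_stall counts stalled cycles, so dividing it by
cycles gives a meaningful fraction.

```python
# Counter values copied from the three perf stat runs quoted in this thread.
# Labels, layout and helper names are illustrative assumptions.
RUNS = {
    "4.19": {
        "instructions": 13202623,
        "cycles": 52947780,
        "ld_dep_stall": 37588761,
    },
    "5.15": {
        "instructions": 11625888,
        "cycles": 83135040,
        "ld_dep_stall": 69833562,
    },
    "5.15 + ldp hack": {
        "instructions": 11822342,
        "cycles": 50689594,
        "ld_dep_stall": 37627922,
    },
}


def ipc(run):
    """Instructions per cycle, as perf prints in its 'insn per cycle' column."""
    return run["instructions"] / run["cycles"]


def stall_fraction(run):
    """ld_dep_stall relative to cycles (assumes the event counts stall cycles)."""
    return run["ld_dep_stall"] / run["cycles"]


def cycle_ratio(run, baseline):
    """How many times more cycles 'run' needed than 'baseline'."""
    return run["cycles"] / baseline["cycles"]


if __name__ == "__main__":
    baseline = RUNS["4.19"]
    for name, run in RUNS.items():
        print(
            f"{name:16s} ipc={ipc(run):.2f} "
            f"stall_fraction={stall_fraction(run):.2f} "
            f"cycles_vs_4.19={cycle_ratio(run, baseline):.2f}"
        )
```

On these numbers the 5.15 run needs roughly 1.57 times the cycles of the
4.19 run for slightly fewer instructions, and the hacked kernel comes
back to about the 4.19 cycle count.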