Date: Thu, 29 Jun 2023 15:24:33 +0100
From: Mark Rutland
To: "Havens, Austin"
Cc: "catalin.marinas@arm.com", "will@kernel.org", "michal.simek@amd.com",
    "Suresh, Siddarth", "Lui, Vincent",
    "linux-arm-kernel@lists.infradead.org"
Subject: Re: Slowdown copying data between kernel versions 4.19 and 5.15

On Wed, Jun 28, 2023 at 09:38:14PM +0000, Havens, Austin wrote:
> >After some investigation I am guessing the issue is either in the iovector
> >iteration changes (around
> >https://elixir.bootlin.com/linux/v5.15/source/lib/iov_iter.c#L922 ) or the
> >lower level changes in arch/arm64/lib/copy_from_user.S, but I am pretty out
> >of my depth so it is just speculation.
>
> After comparing the disassembly of __arch_copy_from_user on both kernels and
> going through commit logs, I figured out the slowdown was mostly due to
> the changes from commit c703d80130b1c9d6783f4cbb9516fd5fe4a750d, specifically
> the changes to uao_ldp.

For the benefit of others, that's commit:

  fc703d80130b1c9d ("arm64: uaccess: split user/kernel routines")

> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index 2c26ca5b7bb0..2b5454fa0f24 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -59,62 +59,32 @@ alternative_else_nop_endif
>  #endif
>
>  /*
> - * Generate the assembly for UAO alternatives with exception table entries.
> + * Generate the assembly for LDTR/STTR with exception table entries.
>   * This is complicated as there is no post-increment or pair versions of the
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
> -#ifdef CONFIG_ARM64_UAO
>          .macro uao_ldp l, reg1, reg2, addr, post_inc
> -                alternative_if_not ARM64_HAS_UAO
> -8888:                   ldp     \reg1, \reg2, [\addr], \post_inc;
> -8889:                   nop;
> -                        nop;
> -                alternative_else
> -                        ldtr    \reg1, [\addr];
> -                        ldtr    \reg2, [\addr, #8];
> -                        add     \addr, \addr, \post_inc;
> -                alternative_endif
> +8888:           ldtr    \reg1, [\addr];
> +8889:           ldtr    \reg2, [\addr, #8];
> +                add     \addr, \addr, \post_inc;
>
>                  _asm_extable    8888b,\l;
>                  _asm_extable    8889b,\l;
>          .endm
>
> I could not directly revert the changes to test since more names changed in
> other commits than I cared to figure out, but I hacked out that change, and
> saw that the performance of the test program was basically back to normal.
>
> diff --git a/arch/arm64/include/asm/asm-uaccess.h b/arch/arm64/include/asm/asm-uaccess.h
> index ccedf548dac9..2ddf7eba46fd 100644
> --- a/arch/arm64/include/asm/asm-uaccess.h
> +++ b/arch/arm64/include/asm/asm-uaccess.h
> @@ -64,9 +64,9 @@ alternative_else_nop_endif
>   * unprivileged instructions, and USER() only works for single instructions.
>   */
>          .macro user_ldp l, reg1, reg2, addr, post_inc
> -8888:           ldtr    \reg1, [\addr];
> -8889:           ldtr    \reg2, [\addr, #8];
> -                add     \addr, \addr, \post_inc;
> +8888:           ldp     \reg1, \reg2, [\addr], \post_inc;
> +8889:           nop;
> +                nop;

As Catalin noted, we can't make that change generally as it'd be broken for any
system with PAN, and in general we *really* want to use LDTR/STTR for user
accesses to catch any misuse with kernel pointers.

> Profiling with the hacked __arch_copy_from_user
> root@ahraptor:/tmp# perf stat -einstructions -ecycles -e ld_dep_stall -e read_alloc -e dTLB-load-misses /mnt/usrroot/test_copy
>
>  Performance counter stats for '/mnt/usrroot/test_copy':
>
>           11822342      instructions    #    0.23  insn per cycle
>           50689594      cycles
>           37627922      ld_dep_stall
>              17933      read_alloc
>               3421      dTLB-load-misses
>
>        0.043440253 seconds time elapsed
>
>        0.004382000 seconds user
>        0.039442000 seconds sys
>
> Unfortunately the hack crashes in other cases so it is not a viable solution
> for us. Also, on our actual workload there is still a small difference in
> performance remaining that I have not tracked down yet (I am guessing it has
> to do with the dTLB-load-misses remaining higher).
>
> Note, I think that the slow down is only noticeable in cases like ours where
> the data being copied from is not in cache (for us, because the FPGA writes
> it).

When you say "is not in cache", what exactly do you mean? If this were just the
latency of filling a cache I wouldn't expect the size of the first access to
make a difference, so I'm assuming the source buffer is not mapped with
cacheable memory attributes, which we generally assume.

Which memory attributes are the source and destination buffers mapped with? Is
that Normal-WB, Normal-NC, or Device? How exactly has that memory been mapped?

I'm assuming this is with some out-of-tree driver; if that's in a public tree
could you please provide a pointer to it?

Thanks,
Mark.
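
The source of the test_copy program is not included in this thread, so the
following is only a minimal sketch of how the user-to-kernel copy cost could be
timed from userspace. It assumes that write() to a pipe is an acceptable
stand-in for the real workload, since that path copies the caller's buffer into
the kernel. The helper name time_pipe_writes_ns and the 4 KiB chunk size are
hypothetical and do not come from the original messages.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from this thread): measure the time spent in
 * write() while the kernel copies `len` bytes out of `src`, using a pipe
 * as the sink. Returns the elapsed nanoseconds, or -1 on error.
 */
static int64_t time_pipe_writes_ns(const unsigned char *src, size_t len)
{
	enum { CHUNK = 4096 };	/* well below the default pipe capacity */
	unsigned char sink[CHUNK];
	struct timespec t0, t1;
	int64_t total = 0;
	int fds[2];

	if (pipe(fds) != 0)
		return -1;

	for (size_t off = 0; off < len; ) {
		size_t n = (len - off < CHUNK) ? (len - off) : CHUNK;

		/* Only the write() call is timed; it performs the user copy. */
		clock_gettime(CLOCK_MONOTONIC, &t0);
		ssize_t w = write(fds[1], src + off, n);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		if (w <= 0)
			goto fail;

		total += (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL
		       + (int64_t)(t1.tv_nsec - t0.tv_nsec);

		/* Drain the pipe so that the next write() cannot block. */
		for (ssize_t done = 0; done < w; ) {
			ssize_t r = read(fds[0], sink, (size_t)(w - done));
			if (r <= 0)
				goto fail;
			done += r;
		}
		off += (size_t)w;
	}

	close(fds[0]);
	close(fds[1]);
	return total;

fail:
	close(fds[0]);
	close(fds[1]);
	return -1;
}
```

Calling the helper once with an ordinary malloc()ed buffer and once with a
pointer obtained by mmap()ing the driver's buffer would give a rough comparison
between a cacheable source and the FPGA-written one. With ordinary memory alone
it is not expected to reproduce the slowdown described above.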