From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8BD6B35968 for ; Sun, 16 Mar 2025 10:58:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742122721; cv=none; b=F1dMiaNFNNZNFNmPhvsEhMnmbctPJrScm8H72aR64+WjpZrcBdYk0qAV5uLNofc/hnfcsW4FFU8rhVx3iykIBAGuPm6GmSblAXSE7mfVpbra1GdFXgZnko1tiLMi1CiFkYTMvYfJStGULHpB/Uqw/oJGdSDsjNg7q6OreEk+Z/A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1742122721; c=relaxed/simple; bh=/1fqhAq2zyyFAdfwpCeGQhrYIy72WkUu40u3s9Mni5s=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=S08RWP/lc7i/H3JIVs5Ojtkmhp9TKs+LbAkckOwwl8Dn2NSa15x4p/eQQxpXbe2HCm+56hWOpl3IpdZD1Qd/+b/c+rPRfIgf2rORFUgOoX3n9UufoYoVNjtqhbMtGkXi9XF6jvF5zaUf6WZSmNzWHAUxjSV+ulHW+XsVLwfyNWY= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=upn6xrqK; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="upn6xrqK" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 28968C4CEDD; Sun, 16 Mar 2025 10:58:37 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1742122721; bh=/1fqhAq2zyyFAdfwpCeGQhrYIy72WkUu40u3s9Mni5s=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=upn6xrqKJiXGzZ0zNwCK71HUP1cYarUwJw1jmOZsYJRBUFMpyFpy+uVAbND1W47eq Ecyavdrp7FLwZTYsxCbUG8UJxNwZ2+mHUcY1Eu2NdB9kkx1utSGP/FC18WYNNtfqjs fqZYlgNc1WIK6gMFmWKQJNnNZ7IA2JvCbzCs4AtyEjxQeB+13a8Q8XXNtJsOkSoHng oqCeSPoaeEH9QMDa6VvCGMoVx2Ej+MQQDBKGG0eFUQU3UGpj0WPiq+SPSYpgTyYmCB SjnKkzsJhQs4+MO/UHUFmqqJE2fELoQfqxndCmBDYffgSgAabJzH06JUzbJJ0PGKRZ 0rbT7XR10QREQ== Date: Sun, 16 Mar 2025 11:58:35 +0100 From: Ingo Molnar To: Herton Krzesinski Cc: Linus Torvalds , x86@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org, olichtne@redhat.com, atomasov@redhat.com, aokuliar@redhat.com Subject: Re: [PATCH] x86: add back the alignment of the destination to 8 bytes in copy_user_generic() Message-ID: References: <20250314175309.2263997-1-herton@redhat.com> <20250314175309.2263997-2-herton@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: * Herton Krzesinski wrote: > On Fri, Mar 14, 2025 at 4:06 PM Linus Torvalds > wrote: > > > > On Fri, 14 Mar 2025 at 07:53, Herton R. Krzesinski wrote: > > > > > > --- a/arch/x86/include/asm/uaccess_64.h > > > +++ b/arch/x86/include/asm/uaccess_64.h > > > @@ -130,7 +130,7 @@ copy_user_generic(void *to, const void *from, unsigned long len) > > > "2:\n" > > > _ASM_EXTABLE_UA(1b, 2b) > > > :"+c" (len), "+D" (to), "+S" (from), ASM_CALL_CONSTRAINT > > > - : : "memory", "rax"); > > > + : : "memory", "rax", "rdx", "r8"); > > > > Please don't penalize the caller with the extra clobbers. > > > > Maybe it doesn't matter - these functions are marked always_inline, > > but they aren't inlined in very many places and maybe those places > > have registers to spare - but let's not penalize the FSRM case anyway. > > > > And we do call it "rep_movs_alternative", so let's keep it close to > > "rep movs" semantics (yes, we already clobber %rax, but let's not make > > it worse). > > > > As to the actual change to rep_movs - that should be done differently > > too. In particular, I doubt it makes any sense to try to align the > > destination for small writes or for the ERMS case when we use 'rep > > movsb', so I think this should all go into just the ".Llarge_movsq" > > case. > > > > .. and then the patch can be further optimized to just do the first - > > possibly unaligned - destination word unconditionally, and then > > updating the addresses and counts to make the rest be aligned. > > > > Something ENTIRELY UNTESTED like this, in other words. And I wrote it > > so that it doesn't need any new temporary registers, so no need for > > clobbers or for some save/restore code. > > > > NOTE! The patch below is very intentionally whitespace-damaged. > > Anybody who applies this needs to look at it very carefully, because I > > just threw this together with zero testing and only very limited > > thought. > > > > But if it works, and if it actually improves performance, I think it > > might be a fairly minimal approach to "align destination". > > It does look good in my testing here, I built same kernel I > was using for testing the original patch (based on > 6.14-rc6), this is one of the results I got in one of the runs > testing on the same machine: > > CPU RATE SYS TIME sender-receiver > Server bind 19: 20.8Gbits/sec 14.832313000 20.863476111 75.4%-89.2% > Server bind 21: 18.0Gbits/sec 18.705221000 23.996913032 80.8%-89.7% > Server bind 23: 20.1Gbits/sec 15.331761000 21.536657212 75.0%-89.7% > Server bind none: 24.1Gbits/sec 14.164226000 18.043132731 82.3%-87.1% > > There are still some variations between runs, which is > expected as was the same when I tested my patch or in > the not aligned case, but it's consistently better/higher than > the no align case. Looks really it's sufficient to align for the > higher than or equal 64 bytes copy case. Mind sending a v2 patch with a changelog and these benchmark numbers added in, and perhaps a Co-developed-by tag with Linus or so? Thanks, Ingo