[PATCH AUTOSEL 6.6 7/8] x86/csum: clean up `csum_partial' further

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Sasha Levin <sashal@kernel.org>,
	tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
	dave.hansen@linux.intel.com, x86@kernel.org,
	goldstein.w.n@gmail.com, nathan@kernel.org, edumazet@google.com,
	david.laight@aculab.com
Subject: [PATCH AUTOSEL 6.6 7/8] x86/csum: clean up `csum_partial' further
Date: Mon,  8 Jan 2024 07:27:23 -0500	[thread overview]
Message-ID: <20240108122745.2090122-7-sashal@kernel.org> (raw)
In-Reply-To: <20240108122745.2090122-1-sashal@kernel.org>

From: Linus Torvalds <torvalds@linux-foundation.org>

[ Upstream commit a476aae3f1dc78a162a0d2e7945feea7d2b29401 ]

Commit 688eb8191b47 ("x86/csum: Improve performance of `csum_partial`")
ended up improving the code generation for the IP csum calculations, and
in particular special-casing the 40-byte case that is a hot case for
IPv6 headers.

It then had _another_ special case for the 64-byte unrolled loop, which
did two chains of 32-byte blocks, which allows modern CPU's to improve
performance by doing the chains in parallel thanks to renaming the carry
flag.

This just unifies the special cases and combines them into just one
single helper the 40-byte csum case, and replaces the 64-byte case by a
80-byte case that just does that single helper twice.  It avoids having
all these different versions of inline assembly, and actually improved
performance further in my tests.

There was never anything magical about the 64-byte unrolled case, even
though it happens to be a common size (and typically is the cacheline
size).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 arch/x86/lib/csum-partial_64.c | 81 ++++++++++++++++------------------
 1 file changed, 37 insertions(+), 44 deletions(-)

diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index 557e42ede68ec..c9dae65ac01b5 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -16,6 +16,20 @@ static inline __wsum csum_finalize_sum(u64 temp64)
 	return (__force __wsum)((temp64 + ror64(temp64, 32)) >> 32);
 }
 
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+	asm("addq %1,%0\n\t"
+	     "adcq %2,%0\n\t"
+	     "adcq %3,%0\n\t"
+	     "adcq %4,%0\n\t"
+	     "adcq %5,%0\n\t"
+	     "adcq $0,%0"
+		:"+r" (sum)
+		:"m" (m[0]), "m" (m[1]), "m" (m[2]),
+		 "m" (m[3]), "m" (m[4]));
+	return sum;
+}
+
 /*
  * Do a checksum on an arbitrary memory area.
  * Returns a 32bit checksum.
@@ -31,52 +45,31 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
 {
 	u64 temp64 = (__force u64)sum;
 
-	/*
-	 * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
-	 * has noticeable negative affect on codegen for all other cases with
-	 * minimal performance benefit here.
-	 */
-	if (len == 40) {
-		asm("addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcq 4*8(%[src]),%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [res] "+r"(temp64)
-		    : [src] "r"(buff), "m"(*(const char(*)[40])buff));
-		return csum_finalize_sum(temp64);
+	/* Do two 40-byte chunks in parallel to get better ILP */
+	if (likely(len >= 80)) {
+		u64 temp64_2 = 0;
+		do {
+			temp64 = update_csum_40b(temp64, buff);
+			temp64_2 = update_csum_40b(temp64_2, buff + 40);
+			buff += 80;
+			len -= 80;
+		} while (len >= 80);
+
+		asm("addq %1,%0\n\t"
+		    "adcq $0,%0"
+		    :"+r" (temp64): "r" (temp64_2));
 	}
-	if (unlikely(len >= 64)) {
-		/*
-		 * Extra accumulators for better ILP in the loop.
-		 */
-		u64 tmp_accum, tmp_carries;
 
-		asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
-		    "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "1:\n\t"
-		    "addq 0*8(%[src]),%[res]\n\t"
-		    "adcq 1*8(%[src]),%[res]\n\t"
-		    "adcq 2*8(%[src]),%[res]\n\t"
-		    "adcq 3*8(%[src]),%[res]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq 4*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 5*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 6*8(%[src]),%[tmp_accum]\n\t"
-		    "adcq 7*8(%[src]),%[tmp_accum]\n\t"
-		    "adcl $0,%k[tmp_carries]\n\t"
-		    "addq $64, %[src]\n\t"
-		    "subl $64, %[len]\n\t"
-		    "jge 1b\n\t"
-		    "addq %[tmp_accum],%[res]\n\t"
-		    "adcq %[tmp_carries],%[res]\n\t"
-		    "adcq $0,%[res]"
-		    : [tmp_accum] "=&r"(tmp_accum),
-		      [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
-		      [len] "+r"(len), [src] "+r"(buff)
-		    : "m"(*(const char *)buff));
+	/*
+	 * len == 40 is the hot case due to IPv6 headers, so return
+	 * early for that exact case without checking the tail bytes.
+	 */
+	if (len >= 40) {
+		temp64 = update_csum_40b(temp64, buff);
+		len -= 40;
+		if (!len)
+			return csum_finalize_sum(temp64);
+		buff += 40;
 	}
 
 	if (len & 32) {
-- 
2.43.0

next prev parent reply	other threads:[~2024-01-08 12:28 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-08 12:27 [PATCH AUTOSEL 6.6 1/8] virtio_blk: fix snprintf truncation compiler warning Sasha Levin
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 2/8] nfc: Do not send datagram if socket state isn't LLCP_BOUND Sasha Levin
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 3/8] net: qrtr: ns: Return 0 if server port is not present Sasha Levin
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 4/8] connector: Fix proc_event_num_listeners count not cleared Sasha Levin
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 5/8] ARM: sun9i: smp: fix return code check of of_property_match_string Sasha Levin
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 6/8] x86/csum: Remove unnecessary odd handling Sasha Levin
2024-01-08 12:27 ` Sasha Levin [this message]
2024-01-08 12:27 ` [PATCH AUTOSEL 6.6 8/8] nouveau: fix disp disabling with GSP Sasha Levin

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:557e42ede68e dfblob:c9dae65ac01b )
 OR (
bs:"[PATCH AUTOSEL 6.6 7/8] x86/csum: clean up `csum_partial' further" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240108122745.2090122-7-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=david.laight@aculab.com \
    --cc=edumazet@google.com \
    --cc=goldstein.w.n@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=nathan@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox