* Re: x86/csum: Remove unnecessary odd handling
[not found] <20230628020657.957880-1-goldstein.w.n@gmail.com>
@ 2023-06-28 9:12 ` Borislav Petkov
2023-06-28 15:32 ` Noah Goldstein
` (2 more replies)
2023-09-01 22:21 ` Noah Goldstein
` (2 subsequent siblings)
3 siblings, 3 replies; 34+ messages in thread
From: Borislav Petkov @ 2023-06-28 9:12 UTC (permalink / raw)
To: Noah Goldstein, Linus Torvalds
Cc: x86, edumazet, tglx, mingo, dave.hansen, hpa, lkml
+ Linus, who was poking at this yesterday.
+ lkml. Please always CC lkml when sending patches.
On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> The special case for odd-aligned buffers is unnecessary and mostly
> just adds overhead. Aligned buffers are the expectation, and even for
> unaligned buffers, the only case that was helped is if the buffer was
> 1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
> highly unlikely to be worth the extra branch.
>
> It was left in the previous perf improvement patch because I was
> erroneously comparing the exact output of `csum_partial(...)`, but
> really we only need `csum_fold(csum_partial(...))` to match, so it's
> safe to remove.
>
> All csum kunit tests pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> ---
> arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
> 1 file changed, 2 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index cea25ca8b8cf..d06112e98893 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -11,28 +11,6 @@
> #include <asm/checksum.h>
> #include <asm/word-at-a-time.h>
>
> -static inline unsigned short from32to16(unsigned a)
> -{
> - unsigned short b = a >> 16;
> - asm("addw %w2,%w0\n\t"
> - "adcw $0,%w0\n"
> - : "=r" (b)
> - : "0" (b), "r" (a));
> - return b;
> -}
> -
> -static inline __wsum csum_tail(u64 temp64, int odd)
> -{
> - unsigned int result;
> -
> - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> - if (unlikely(odd)) {
> - result = from32to16(result);
> - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> - }
> - return (__force __wsum)result;
> -}
> -
> /*
> * Do a checksum on an arbitrary memory area.
> * Returns a 32bit checksum.
> @@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
> __wsum csum_partial(const void *buff, int len, __wsum sum)
> {
> u64 temp64 = (__force u64)sum;
> - unsigned odd;
> -
> - odd = 1 & (unsigned long) buff;
> - if (unlikely(odd)) {
> - if (unlikely(len == 0))
> - return sum;
> - temp64 = ror32((__force u32)sum, 8);
> - temp64 += (*(unsigned char *)buff << 8);
> - len--;
> - buff++;
> - }
>
> /*
> * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
> @@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> "adcq $0,%[res]"
> : [res] "+r"(temp64)
> : [src] "r"(buff), "m"(*(const char(*)[40])buff));
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> if (unlikely(len >= 64)) {
> /*
> @@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> : [res] "+r"(temp64)
> : [trail] "r"(trail));
> }
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> EXPORT_SYMBOL(csum_partial);
>
> --
> 2.34.1
>
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
* Re: x86/csum: Remove unnecessary odd handling
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
@ 2023-06-28 15:32 ` Noah Goldstein
2023-06-28 17:44 ` Linus Torvalds
2023-06-29 14:04 ` David Laight
2023-06-29 14:27 ` David Laight
2 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-06-28 15:32 UTC (permalink / raw)
To: Borislav Petkov
Cc: Linus Torvalds, x86, edumazet, tglx, mingo, dave.hansen, hpa,
lkml
On Wed, Jun 28, 2023 at 4:12 AM Borislav Petkov <bp@alien8.de> wrote:
>
> + Linus, who was poking at this yesterday.
>
Linus, if you're planning a patch and want to just integrate the code
here, I'm happy to drop this patch.
> + lkml. Please always CC lkml when sending patches.
Ah, will do so in the future.
>
> On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> > The special case for odd-aligned buffers is unnecessary and mostly
> > just adds overhead. Aligned buffers are the expectation, and even for
> > unaligned buffers, the only case that was helped is if the buffer was
> > 1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
> > highly unlikely to be worth the extra branch.
> >
> > It was left in the previous perf improvement patch because I was
> > erroneously comparing the exact output of `csum_partial(...)`, but
> > really we only need `csum_fold(csum_partial(...))` to match, so it's
> > safe to remove.
> >
> > All csum kunit tests pass.
> >
> > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> > ---
> > arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
> > 1 file changed, 2 insertions(+), 35 deletions(-)
> >
> > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> > index cea25ca8b8cf..d06112e98893 100644
> > --- a/arch/x86/lib/csum-partial_64.c
> > +++ b/arch/x86/lib/csum-partial_64.c
> > @@ -11,28 +11,6 @@
> > #include <asm/checksum.h>
> > #include <asm/word-at-a-time.h>
> >
> > -static inline unsigned short from32to16(unsigned a)
> > -{
> > - unsigned short b = a >> 16;
> > - asm("addw %w2,%w0\n\t"
> > - "adcw $0,%w0\n"
> > - : "=r" (b)
> > - : "0" (b), "r" (a));
> > - return b;
> > -}
> > -
> > -static inline __wsum csum_tail(u64 temp64, int odd)
> > -{
> > - unsigned int result;
> > -
> > - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > - if (unlikely(odd)) {
> > - result = from32to16(result);
> > - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> > - }
> > - return (__force __wsum)result;
> > -}
> > -
> > /*
> > * Do a checksum on an arbitrary memory area.
> > * Returns a 32bit checksum.
> > @@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
> > __wsum csum_partial(const void *buff, int len, __wsum sum)
> > {
> > u64 temp64 = (__force u64)sum;
> > - unsigned odd;
> > -
> > - odd = 1 & (unsigned long) buff;
> > - if (unlikely(odd)) {
> > - if (unlikely(len == 0))
> > - return sum;
> > - temp64 = ror32((__force u32)sum, 8);
> > - temp64 += (*(unsigned char *)buff << 8);
> > - len--;
> > - buff++;
> > - }
> >
> > /*
> > * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
> > @@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > "adcq $0,%[res]"
> > : [res] "+r"(temp64)
> > : [src] "r"(buff), "m"(*(const char(*)[40])buff));
> > - return csum_tail(temp64, odd);
> > + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > }
> > if (unlikely(len >= 64)) {
> > /*
> > @@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> > : [res] "+r"(temp64)
> > : [trail] "r"(trail));
> > }
> > - return csum_tail(temp64, odd);
> > + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> > }
> > EXPORT_SYMBOL(csum_partial);
> >
> > --
> > 2.34.1
> >
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
* Re: x86/csum: Remove unnecessary odd handling
2023-06-28 15:32 ` Noah Goldstein
@ 2023-06-28 17:44 ` Linus Torvalds
2023-06-28 18:34 ` Noah Goldstein
0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2023-06-28 17:44 UTC (permalink / raw)
To: Noah Goldstein
Cc: Borislav Petkov, x86, edumazet, tglx, mingo, dave.hansen, hpa,
lkml
On Wed, 28 Jun 2023 at 08:32, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Linus, if you're planning a patch and want to just integrate the code
> here, I'm happy to drop this patch.
No, that patch looks good to me.
In fact, I wasn't planning on integrating my patch at all. I literally
did it as a "I would have done it this way instead" exercise.
And while I am currently running with my patch in the kernel, I don't
even really know if it works and does the right thing. Maybe my use
doesn't even trigger csum_partial() at all. I did not do any testing
that "yes, I get the same checksum as a result".
So
(a) removing the pointless one-byte alignment looks good to me.
(b) I'd actually hope that somebody who _cares_ about this path and
has put some real work into it (as opposed to my "superficial
dabbling") would look at my patch and either go "yeah, not worth it",
or "looks good, I'll take it".
and I'm including that final patch of mine here again in case there
was any confusion with the earlier versions (there were at least two
known-broken versions I posted).
*If* somebody likes it, and verifies that the checksum result is
correct, feel free to do anything with that patch, including adding my
signed-off-by for it (or taking the credit all for yourself -
Mwahahahahaahaa!)
Linus
[-- Attachment #2: 0001-Silly-csum-improvement.-Maybe.patch --]
[-- Type: text/x-patch, Size: 3423 bytes --]
From 24a1d533d96074220927d844a619a54419b69b81 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Tue, 27 Jun 2023 13:55:32 -0700
Subject: [PATCH] Silly csum improvement. Maybe.
---
arch/x86/lib/csum-partial_64.c | 83 ++++++++++++++++------------------
1 file changed, 38 insertions(+), 45 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..d96e1da6604a 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -33,6 +33,20 @@ static inline __wsum csum_tail(u64 temp64, int odd)
return (__force __wsum)result;
}
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+ asm("addq %1,%0\n\t"
+ "adcq %2,%0\n\t"
+ "adcq %3,%0\n\t"
+ "adcq %4,%0\n\t"
+ "adcq %5,%0\n\t"
+ "adcq $0,%0"
+ :"+r" (sum)
+ :"m" (m[0]), "m" (m[1]), "m" (m[2]),
+ "m" (m[3]), "m" (m[4]));
+ return sum;
+}
+
/*
* Do a checksum on an arbitrary memory area.
* Returns a 32bit checksum.
@@ -59,52 +73,31 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
buff++;
}
- /*
- * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
- * has noticeable negative affect on codegen for all other cases with
- * minimal performance benefit here.
- */
- if (len == 40) {
- asm("addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcq 4*8(%[src]),%[res]\n\t"
- "adcq $0,%[res]"
- : [res] "+r"(temp64)
- : [src] "r"(buff), "m"(*(const char(*)[40])buff));
- return csum_tail(temp64, odd);
- }
- if (unlikely(len >= 64)) {
- /*
- * Extra accumulators for better ILP in the loop.
- */
- u64 tmp_accum, tmp_carries;
+ /* Do two 40-byte chunks in parallel to get better ILP */
+ if (likely(len >= 80)) {
+ u64 temp64_2 = 0;
+ do {
+ temp64 = update_csum_40b(temp64, buff);
+ temp64_2 = update_csum_40b(temp64_2, buff + 40);
+ buff += 80;
+ len -= 80;
+ } while (len >= 80);
- asm("xorl %k[tmp_accum],%k[tmp_accum]\n\t"
- "xorl %k[tmp_carries],%k[tmp_carries]\n\t"
- "subl $64, %[len]\n\t"
- "1:\n\t"
- "addq 0*8(%[src]),%[res]\n\t"
- "adcq 1*8(%[src]),%[res]\n\t"
- "adcq 2*8(%[src]),%[res]\n\t"
- "adcq 3*8(%[src]),%[res]\n\t"
- "adcl $0,%k[tmp_carries]\n\t"
- "addq 4*8(%[src]),%[tmp_accum]\n\t"
- "adcq 5*8(%[src]),%[tmp_accum]\n\t"
- "adcq 6*8(%[src]),%[tmp_accum]\n\t"
- "adcq 7*8(%[src]),%[tmp_accum]\n\t"
- "adcl $0,%k[tmp_carries]\n\t"
- "addq $64, %[src]\n\t"
- "subl $64, %[len]\n\t"
- "jge 1b\n\t"
- "addq %[tmp_accum],%[res]\n\t"
- "adcq %[tmp_carries],%[res]\n\t"
- "adcq $0,%[res]"
- : [tmp_accum] "=&r"(tmp_accum),
- [tmp_carries] "=&r"(tmp_carries), [res] "+r"(temp64),
- [len] "+r"(len), [src] "+r"(buff)
- : "m"(*(const char *)buff));
+ asm("addq %1,%0\n\t"
+ "adcq $0,%0"
+ :"+r" (temp64): "r" (temp64_2));
+ }
+
+ /*
+ * len == 40 is the hot case due to IPv6 headers, so return
+ * early for that exact case without checking the tail bytes.
+ */
+ if (len >= 40) {
+ temp64 = update_csum_40b(temp64, buff);
+ len -= 40;
+ if (!len)
+ return csum_tail(temp64, odd);
+ buff += 40;
}
if (len & 32) {
--
2.41.0.203.ga4f2cd32bb.dirty
* Re: x86/csum: Remove unnecessary odd handling
2023-06-28 17:44 ` Linus Torvalds
@ 2023-06-28 18:34 ` Noah Goldstein
2023-06-28 20:02 ` Linus Torvalds
0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-06-28 18:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Borislav Petkov, x86, edumazet, tglx, mingo, dave.hansen, hpa,
lkml
On Wed, Jun 28, 2023 at 12:44 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 28 Jun 2023 at 08:32, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > Linus, if you're planning a patch and want to just integrate the code
> > here, I'm happy to drop this patch.
>
> No, that patch looks good to me.
>
> In fact, I wasn't planning on integrating my patch at all. I literally
> did it as a "I would have done it this way instead" exercise.
>
> And while I am currently running with my patch in the kernel, I don't
> even really know if it works and does the right thing. Maybe my use
> doesn't even trigger csum_partial() at all. I did not do any testing
> that "yes, I get the same checksum as a result".
>
There is a reasonably robust kunit test for csum_partial in
lib/checksum_kunit.c, so if you happened to run the kunit testsuite
with your patch, it's probably correct.
> So
>
> (a) removing the pointless one-byte alignment looks good to me.
>
> (b) I'd actually hope that somebody who _cares_ about this path and
> has put some real work into it (as opposed to my "superficial
> dabbling") would look at my patch and either go "yeah, not worth it",
> or "looks good, I'll take it".
>
> and I'm including that final patch of mine here again in case there
> was any confusion with the earlier versions (there were at least two
> known-broken versions I posted).
>
> *If* somebody likes it, and verifies that the checksum result is
> correct, feel free to do anything with that patch, including adding my
> signed-off-by for it (or taking the credit all for yourself -
> Mwahahahahaahaa!)
>
> Linus
* Re: x86/csum: Remove unnecessary odd handling
2023-06-28 18:34 ` Noah Goldstein
@ 2023-06-28 20:02 ` Linus Torvalds
0 siblings, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2023-06-28 20:02 UTC (permalink / raw)
To: Noah Goldstein
Cc: Borislav Petkov, x86, edumazet, tglx, mingo, dave.hansen, hpa,
lkml
On Wed, 28 Jun 2023 at 11:34, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> There is a reasonably robust kunit test for csum_partial in
> lib/checksum_kunit.c, so if you happened to run the kunit testsuite
> with your patch, it's probably correct.
Testing? TESTING?
Why do you think I do open source? I like the programming part. I even
like the debugging part. But testing - that's what users are for. No?
Linus
* RE: x86/csum: Remove unnecessary odd handling
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
2023-06-28 15:32 ` Noah Goldstein
@ 2023-06-29 14:04 ` David Laight
2023-06-29 14:27 ` David Laight
2 siblings, 0 replies; 34+ messages in thread
From: David Laight @ 2023-06-29 14:04 UTC (permalink / raw)
To: 'Borislav Petkov', Noah Goldstein, Linus Torvalds
Cc: x86@kernel.org, edumazet@google.com, tglx@linutronix.de,
mingo@redhat.com, dave.hansen@linux.intel.com, hpa@zytor.com,
lkml
From: Borislav Petkov
> Sent: 28 June 2023 10:13
>
> + Linus, who was poking at this yesterday.
>
> + lkml. Please always CC lkml when sending patches.
>
> On Tue, Jun 27, 2023 at 09:06:57PM -0500, Noah Goldstein wrote:
> > The special case for odd-aligned buffers is unnecessary and mostly
> > just adds overhead. Aligned buffers are the expectation, and even for
> > unaligned buffers, the only case that was helped is if the buffer was
> > 1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
> > highly unlikely to be worth the extra branch.
> >
> > It was left in the previous perf improvement patch because I was
> > erroneously comparing the exact output of `csum_partial(...)`, but
> > really we only need `csum_fold(csum_partial(...))` to match, so it's
> > safe to remove.
I'm sure I've suggested this before.
The 'odd' check was needed by an earlier implementation.
Misaligned buffers are (just about) measurably slower.
But it is pretty much noise, and the extra code in the
aligned case will cost more.
It is pretty much impossible to find out what the cpu is doing,
but if you do misaligned accesses to a PCIe target you can
(with suitable hardware) look at the generated TLPs.
What that shows is misaligned transfers being done in 8-byte
chunks and being split into two TLPs if they cross a 64-byte
(probably cache line) boundary.
It is likely that the same happens for cached accesses.
Given that the cpu can do two memory reads each clock,
it isn't surprising that the checksum loop (which doesn't
even manage a read every clock) is slower by less than
one clock per cache line.
Someone might also want to use the 'arc' C version of csum_fold()
on pretty much every architecture [1].
It is:
	return (~sum - ror32(sum, 16)) >> 16;
which is significantly better than the x86 asm (even on more recent
cpus that don't take 2 clocks for an 'adc').
[1] arm can do a bit better because of the barrel shifter.
sparc is slower because it has a carry flag but no rotate.
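For reference, a standalone sketch (ordinary C, illustrative names,
not the kernel definitions) checking that fold against a plain
fold-and-invert:

#include <assert.h>
#include <stdint.h>

/* Reference: fold the 32-bit sum to 16 bits with end-around carry,
 * then invert. */
static uint16_t csum_fold_ref(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);	/* may produce a carry */
	sum = (sum & 0xffff) + (sum >> 16);	/* fold the carry back in */
	return (uint16_t)~sum;
}

/* The arc-style version quoted above: one rotate, one subtract. */
static uint16_t csum_fold_arc(uint32_t sum)
{
	uint32_t r = (sum >> 16) | (sum << 16);	/* ror32(sum, 16) */

	return (uint16_t)((~sum - r) >> 16);
}

int main(void)
{
	assert(csum_fold_arc(0x00000000) == csum_fold_ref(0x00000000));
	assert(csum_fold_arc(0x00018001) == csum_fold_ref(0x00018001));
	assert(csum_fold_arc(0xffffffff) == csum_fold_ref(0xffffffff));
	return 0;
}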
David
* RE: x86/csum: Remove unnecessary odd handling
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
2023-06-28 15:32 ` Noah Goldstein
2023-06-29 14:04 ` David Laight
@ 2023-06-29 14:27 ` David Laight
2 siblings, 0 replies; 34+ messages in thread
From: David Laight @ 2023-06-29 14:27 UTC (permalink / raw)
To: 'Borislav Petkov', Noah Goldstein, Linus Torvalds
Cc: x86@kernel.org, edumazet@google.com, tglx@linutronix.de,
mingo@redhat.com, dave.hansen@linux.intel.com, hpa@zytor.com,
lkml
...
> > All csum kunit tests pass.
Last time I looked I couldn't see where generated IPv6
checksums get changed from 0x0000 (from ~csum_fold() using
adc) to 0xffff - which I think the protocol requires.
The trivial way to do this is to initialise the sum to 1
(instead of 0 or 0xffff) and then add 1 after the invert.
It doesn't matter (much) for IPv4 because 0x0000 is 'no checksum'
rather than 'invalid'.
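A standalone sketch of that trick (plain C, illustrative names; the
real change would go wherever the transmit checksum is finalised):

#include <stdint.h>

static uint16_t fold32(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);	/* may produce a carry */
	sum = (sum & 0xffff) + (sum >> 16);	/* fold the carry back in */
	return (uint16_t)sum;
}

/* One's complement sum of 16-bit words, initialised to 1 instead of
 * 0, with 1 added back after the invert.  A payload that would have
 * produced 0x0000 after the invert now comes out as 0xffff instead.
 * Assumes n is small enough that the 32-bit accumulator cannot
 * overflow. */
static uint16_t tx_csum(const uint16_t *words, int n)
{
	uint32_t sum = 1;			/* start at 1, not 0 */

	while (n--)
		sum += *words++;
	return (uint16_t)(~fold32(sum) + 1);	/* add 1 after the invert */
}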
David
* x86/csum: Remove unnecessary odd handling
[not found] <20230628020657.957880-1-goldstein.w.n@gmail.com>
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
@ 2023-09-01 22:21 ` Noah Goldstein
2023-09-06 13:49 ` David Laight
2023-09-06 14:38 ` David Laight
2023-09-20 19:23 ` Noah Goldstein
2023-09-24 14:35 ` Noah Goldstein
3 siblings, 2 replies; 34+ messages in thread
From: Noah Goldstein @ 2023-09-01 22:21 UTC (permalink / raw)
To: x86
Cc: linux-kernel, edumazet, tglx, mingo, torvalds, bp, dave.hansen,
David.Laight, hpa, goldstein.w.n
The special case for odd-aligned buffers is unnecessary and mostly
just adds overhead. Aligned buffers are the expectation, and even for
unaligned buffers, the only case that was helped is if the buffer was
1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
highly unlikely to be worth the extra branch.
It was left in the previous perf improvement patch because I was
erroneously comparing the exact output of `csum_partial(...)`, but
really we only need `csum_fold(csum_partial(...))` to match, so it's
safe to remove.
All csum kunit tests pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
1 file changed, 2 insertions(+), 35 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..d06112e98893 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -11,28 +11,6 @@
#include <asm/checksum.h>
#include <asm/word-at-a-time.h>
-static inline unsigned short from32to16(unsigned a)
-{
- unsigned short b = a >> 16;
- asm("addw %w2,%w0\n\t"
- "adcw $0,%w0\n"
- : "=r" (b)
- : "0" (b), "r" (a));
- return b;
-}
-
-static inline __wsum csum_tail(u64 temp64, int odd)
-{
- unsigned int result;
-
- result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
- if (unlikely(odd)) {
- result = from32to16(result);
- result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
- }
- return (__force __wsum)result;
-}
-
/*
* Do a checksum on an arbitrary memory area.
* Returns a 32bit checksum.
@@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
__wsum csum_partial(const void *buff, int len, __wsum sum)
{
u64 temp64 = (__force u64)sum;
- unsigned odd;
-
- odd = 1 & (unsigned long) buff;
- if (unlikely(odd)) {
- if (unlikely(len == 0))
- return sum;
- temp64 = ror32((__force u32)sum, 8);
- temp64 += (*(unsigned char *)buff << 8);
- len--;
- buff++;
- }
/*
* len == 40 is the hot case due to IPv6 headers, but annotating it likely()
@@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq $0,%[res]"
: [res] "+r"(temp64)
: [src] "r"(buff), "m"(*(const char(*)[40])buff));
- return csum_tail(temp64, odd);
+ return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
}
if (unlikely(len >= 64)) {
/*
@@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
: [res] "+r"(temp64)
: [trail] "r"(trail));
}
- return csum_tail(temp64, odd);
+ return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
}
EXPORT_SYMBOL(csum_partial);
--
2.34.1
* RE: x86/csum: Remove unnecessary odd handling
2023-09-01 22:21 ` Noah Goldstein
@ 2023-09-06 13:49 ` David Laight
2023-09-06 14:38 ` David Laight
1 sibling, 0 replies; 34+ messages in thread
From: David Laight @ 2023-09-06 13:49 UTC (permalink / raw)
To: 'Noah Goldstein', x86@kernel.org
Cc: linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
From: Noah Goldstein
> Sent: 01 September 2023 23:21
>
> The special case for odd-aligned buffers is unnecessary and mostly
> just adds overhead. Aligned buffers are the expectation, and even for
> unaligned buffers, the only case that was helped is if the buffer was
> 1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
> highly unlikely to be worth the extra branch.
>
> It was left in the previous perf improvement patch because I was
> erroneously comparing the exact output of `csum_partial(...)`, but
> really we only need `csum_fold(csum_partial(...))` to match, so it's
> safe to remove.
This is pretty much the same patch I sent in Dec 2021...
Reviewed-by: David Laight <david.laight@aculab.com>
>
> All csum kunit tests pass.
>
> Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
> arch/x86/lib/csum-partial_64.c | 37 ++--------------------------------
> 1 file changed, 2 insertions(+), 35 deletions(-)
>
> diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
> index cea25ca8b8cf..d06112e98893 100644
> --- a/arch/x86/lib/csum-partial_64.c
> +++ b/arch/x86/lib/csum-partial_64.c
> @@ -11,28 +11,6 @@
> #include <asm/checksum.h>
> #include <asm/word-at-a-time.h>
>
> -static inline unsigned short from32to16(unsigned a)
> -{
> - unsigned short b = a >> 16;
> - asm("addw %w2,%w0\n\t"
> - "adcw $0,%w0\n"
> - : "=r" (b)
> - : "0" (b), "r" (a));
> - return b;
> -}
> -
> -static inline __wsum csum_tail(u64 temp64, int odd)
> -{
> - unsigned int result;
> -
> - result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> - if (unlikely(odd)) {
> - result = from32to16(result);
> - result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> - }
> - return (__force __wsum)result;
> -}
> -
> /*
> * Do a checksum on an arbitrary memory area.
> * Returns a 32bit checksum.
> @@ -47,17 +25,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
> __wsum csum_partial(const void *buff, int len, __wsum sum)
> {
> u64 temp64 = (__force u64)sum;
> - unsigned odd;
> -
> - odd = 1 & (unsigned long) buff;
> - if (unlikely(odd)) {
> - if (unlikely(len == 0))
> - return sum;
> - temp64 = ror32((__force u32)sum, 8);
> - temp64 += (*(unsigned char *)buff << 8);
> - len--;
> - buff++;
> - }
>
> /*
> * len == 40 is the hot case due to IPv6 headers, but annotating it likely()
> @@ -73,7 +40,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> "adcq $0,%[res]"
> : [res] "+r"(temp64)
> : [src] "r"(buff), "m"(*(const char(*)[40])buff));
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> if (unlikely(len >= 64)) {
> /*
> @@ -143,7 +110,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
> : [res] "+r"(temp64)
> : [trail] "r"(trail));
> }
> - return csum_tail(temp64, odd);
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
> }
> EXPORT_SYMBOL(csum_partial);
>
> --
> 2.34.1
* RE: x86/csum: Remove unnecessary odd handling
2023-09-01 22:21 ` Noah Goldstein
2023-09-06 13:49 ` David Laight
@ 2023-09-06 14:38 ` David Laight
2023-09-20 19:20 ` Noah Goldstein
1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2023-09-06 14:38 UTC (permalink / raw)
To: 'Noah Goldstein', x86@kernel.org
Cc: linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
From: Noah Goldstein
> Sent: 01 September 2023 23:21
...
> + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
The generic C alternative:
return (temp64 + ror64(temp64, 32)) >> 32;
is the same number of instructions but might get
better scheduling.
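For the curious, a standalone sketch (plain C, illustrative names,
not the kernel helpers) checking that the rotate-based fold computes
the same value as a 32-bit add with end-around carry:

#include <assert.h>
#include <stdint.h>

/* What the x86 helper does: 32-bit add, then add the carry back in. */
static uint32_t add32_with_carry_ref(uint32_t a, uint32_t b)
{
	uint64_t s = (uint64_t)a + b;

	return (uint32_t)(s + (s >> 32));	/* end-around carry */
}

/* The alternative above: swap the halves so the low-half addition
 * pushes its carry into the high half, then keep the high half. */
static uint32_t ror_fold(uint64_t t)
{
	uint64_t r = (t >> 32) | (t << 32);	/* ror64(t, 32) */

	return (uint32_t)((t + r) >> 32);
}

int main(void)
{
	uint64_t t = 0xffffffff00000001ull;

	assert(ror_fold(t) == add32_with_carry_ref(t >> 32, (uint32_t)t));
	assert(ror_fold(0x0000000500000007ull) ==
	       add32_with_carry_ref(5, 7));
	return 0;
}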
The C version of csum_fold() from arch/arc/include/asm/checksum.h
is also better than the x86 asm version.
(And also pretty much all the other architecture-dependent
copies.)
David
* Re: x86/csum: Remove unnecessary odd handling
2023-09-06 14:38 ` David Laight
@ 2023-09-20 19:20 ` Noah Goldstein
0 siblings, 0 replies; 34+ messages in thread
From: Noah Goldstein @ 2023-09-20 19:20 UTC (permalink / raw)
To: David Laight
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Wed, Sep 6, 2023 at 9:38 AM David Laight <David.Laight@aculab.com> wrote:
>
> From: Noah Goldstein
> > Sent: 01 September 2023 23:21
> ...
> > + return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
>
> The generic C alternative:
> return (temp64 + ror64(temp64, 32)) >> 32;
> is the same number of instructions but might get
> better scheduling.
>
Sorry, I missed this.
Bright idea :)
Adding it in a new version, plus your Reviewed-by tag. Then hopefully
this can get in...
> The C version of csum_fold() from arc/include/asm/checksum.h
> is also better than the x86 asm version.
> (And also pretty much all the other architecture dependant
> copies.)
>
> David
>
* x86/csum: Remove unnecessary odd handling
[not found] <20230628020657.957880-1-goldstein.w.n@gmail.com>
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
2023-09-01 22:21 ` Noah Goldstein
@ 2023-09-20 19:23 ` Noah Goldstein
2023-09-23 3:24 ` kernel test robot
2023-09-24 14:35 ` Noah Goldstein
3 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-09-20 19:23 UTC (permalink / raw)
To: x86
Cc: linux-kernel, edumazet, tglx, mingo, torvalds, bp, dave.hansen,
David.Laight, hpa, goldstein.w.n, David Laight
The special case for odd-aligned buffers is unnecessary and mostly
just adds overhead. Aligned buffers are the expectation, and even for
unaligned buffers, the only case that was helped is if the buffer was
1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
highly unlikely to be worth the extra branch.
It was left in the previous perf improvement patch because I was
erroneously comparing the exact output of `csum_partial(...)`, but
really we only need `csum_fold(csum_partial(...))` to match, so it's
safe to remove.
All csum kunit tests pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Laight <david.laight@aculab.com>
---
arch/x86/lib/csum-partial_64.c | 36 ++++------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..b83c8accd756 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -11,26 +11,9 @@
#include <asm/checksum.h>
#include <asm/word-at-a-time.h>
-static inline unsigned short from32to16(unsigned a)
+static inline __wsum csum_finalize_sum(u64 temp64)
{
- unsigned short b = a >> 16;
- asm("addw %w2,%w0\n\t"
- "adcw $0,%w0\n"
- : "=r" (b)
- : "0" (b), "r" (a));
- return b;
-}
-
-static inline __wsum csum_tail(u64 temp64, int odd)
-{
- unsigned int result;
-
- result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
- if (unlikely(odd)) {
- result = from32to16(result);
- result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
- }
- return (__force __wsum)result;
+ return (temp64 + ror64(temp64, 32)) >> 32;
}
/*
@@ -47,17 +30,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
__wsum csum_partial(const void *buff, int len, __wsum sum)
{
u64 temp64 = (__force u64)sum;
- unsigned odd;
-
- odd = 1 & (unsigned long) buff;
- if (unlikely(odd)) {
- if (unlikely(len == 0))
- return sum;
- temp64 = ror32((__force u32)sum, 8);
- temp64 += (*(unsigned char *)buff << 8);
- len--;
- buff++;
- }
/*
* len == 40 is the hot case due to IPv6 headers, but annotating it likely()
@@ -73,7 +45,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq $0,%[res]"
: [res] "+r"(temp64)
: [src] "r"(buff), "m"(*(const char(*)[40])buff));
- return csum_tail(temp64, odd);
+ return csum_finalize_sum(temp64);
}
if (unlikely(len >= 64)) {
/*
@@ -143,7 +115,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
: [res] "+r"(temp64)
: [trail] "r"(trail));
}
- return csum_tail(temp64, odd);
+ return csum_finalize_sum(temp64);
}
EXPORT_SYMBOL(csum_partial);
--
2.34.1
* Re: x86/csum: Remove unnecessary odd handling
2023-09-20 19:23 ` Noah Goldstein
@ 2023-09-23 3:24 ` kernel test robot
2023-09-23 14:05 ` Noah Goldstein
0 siblings, 1 reply; 34+ messages in thread
From: kernel test robot @ 2023-09-23 3:24 UTC (permalink / raw)
To: Noah Goldstein, x86
Cc: oe-kbuild-all, linux-kernel, edumazet, tglx, mingo, torvalds, bp,
dave.hansen, David.Laight, hpa, goldstein.w.n
Hi Noah,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/x86/core]
[also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-odd-handling/20230921-032450
base: tip/x86/core
patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
patch subject: x86/csum: Remove unnecessary odd handling
config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
>> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
vim +16 arch/x86/lib/csum-partial_64.c
13
14 static inline __wsum csum_finalize_sum(u64 temp64)
15 {
> 16 return (temp64 + ror64(temp64, 32)) >> 32;
17 }
18
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: x86/csum: Remove unnecessary odd handling
2023-09-23 3:24 ` kernel test robot
@ 2023-09-23 14:05 ` Noah Goldstein
2023-09-23 21:13 ` David Laight
0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-09-23 14:05 UTC (permalink / raw)
To: kernel test robot
Cc: x86, oe-kbuild-all, linux-kernel, edumazet, tglx, mingo, torvalds,
bp, dave.hansen, David.Laight, hpa
On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Noah,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on tip/x86/core]
> [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-odd-handling/20230921-032450
> base: tip/x86/core
> patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
> patch subject: x86/csum: Remove unnecessary odd handling
> config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
>
> sparse warnings: (new ones prefixed by >>)
> >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
>
> vim +16 arch/x86/lib/csum-partial_64.c
>
> 13
> 14 static inline __wsum csum_finalize_sum(u64 temp64)
> 15 {
> > 16 return (temp64 + ror64(temp64, 32)) >> 32;
> 17 }
> 18
Just needs a `(__wsum)` cast. Should I post a new patch?
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
* RE: x86/csum: Remove unnecessary odd handling
2023-09-23 14:05 ` Noah Goldstein
@ 2023-09-23 21:13 ` David Laight
2023-09-24 14:35 ` Noah Goldstein
0 siblings, 1 reply; 34+ messages in thread
From: David Laight @ 2023-09-23 21:13 UTC (permalink / raw)
To: 'Noah Goldstein', kernel test robot
Cc: x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
From: Noah Goldstein
> Sent: 23 September 2023 15:05
>
> On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote:
> >
> > Hi Noah,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on tip/x86/core]
> > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-
> odd-handling/20230921-032450
> > base: tip/x86/core
> > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
> > patch subject: x86/csum: Remove unnecessary odd handling
> > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-
> ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
> > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > reproduce (this is a W=1 build): (https://download.01.org/0day-
> ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
> >
> > sparse warnings: (new ones prefixed by >>)
> > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> >
> > vim +16 arch/x86/lib/csum-partial_64.c
> >
> > 13
> > 14 static inline __wsum csum_finalize_sum(u64 temp64)
> > 15 {
> > > 16 return (temp64 + ror64(temp64, 32)) >> 32;
> > 17 }
> > 18
>
> Just needs a `(__wsum)` cast. Should I post a new patch?
It'll need to be a (__force __wsum) cast.
I think new patches are expected...
David
* x86/csum: Remove unnecessary odd handling
[not found] <20230628020657.957880-1-goldstein.w.n@gmail.com>
` (2 preceding siblings ...)
2023-09-20 19:23 ` Noah Goldstein
@ 2023-09-24 14:35 ` Noah Goldstein
3 siblings, 0 replies; 34+ messages in thread
From: Noah Goldstein @ 2023-09-24 14:35 UTC (permalink / raw)
To: x86
Cc: linux-kernel, edumazet, tglx, mingo, torvalds, bp, dave.hansen,
David.Laight, hpa, goldstein.w.n, David Laight
The special case for odd-aligned buffers is unnecessary and mostly
just adds overhead. Aligned buffers are the expectation, and even for
unaligned buffers, the only case that was helped is if the buffer was
1 byte from word aligned, which is ~1/7 of the cases. Overall it seems
highly unlikely to be worth the extra branch.
It was left in the previous perf improvement patch because I was
erroneously comparing the exact output of `csum_partial(...)`, but
really we only need `csum_fold(csum_partial(...))` to match, so it's
safe to remove.
All csum kunit tests pass.
Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Laight <david.laight@aculab.com>
---
arch/x86/lib/csum-partial_64.c | 36 ++++------------------------------
1 file changed, 4 insertions(+), 32 deletions(-)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c
index cea25ca8b8cf..557e42ede68e 100644
--- a/arch/x86/lib/csum-partial_64.c
+++ b/arch/x86/lib/csum-partial_64.c
@@ -11,26 +11,9 @@
#include <asm/checksum.h>
#include <asm/word-at-a-time.h>
-static inline unsigned short from32to16(unsigned a)
+static inline __wsum csum_finalize_sum(u64 temp64)
{
- unsigned short b = a >> 16;
- asm("addw %w2,%w0\n\t"
- "adcw $0,%w0\n"
- : "=r" (b)
- : "0" (b), "r" (a));
- return b;
-}
-
-static inline __wsum csum_tail(u64 temp64, int odd)
-{
- unsigned int result;
-
- result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
- if (unlikely(odd)) {
- result = from32to16(result);
- result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
- }
- return (__force __wsum)result;
+ return (__force __wsum)((temp64 + ror64(temp64, 32)) >> 32);
}
/*
@@ -47,17 +30,6 @@ static inline __wsum csum_tail(u64 temp64, int odd)
__wsum csum_partial(const void *buff, int len, __wsum sum)
{
u64 temp64 = (__force u64)sum;
- unsigned odd;
-
- odd = 1 & (unsigned long) buff;
- if (unlikely(odd)) {
- if (unlikely(len == 0))
- return sum;
- temp64 = ror32((__force u32)sum, 8);
- temp64 += (*(unsigned char *)buff << 8);
- len--;
- buff++;
- }
/*
* len == 40 is the hot case due to IPv6 headers, but annotating it likely()
@@ -73,7 +45,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
"adcq $0,%[res]"
: [res] "+r"(temp64)
: [src] "r"(buff), "m"(*(const char(*)[40])buff));
- return csum_tail(temp64, odd);
+ return csum_finalize_sum(temp64);
}
if (unlikely(len >= 64)) {
/*
@@ -143,7 +115,7 @@ __wsum csum_partial(const void *buff, int len, __wsum sum)
: [res] "+r"(temp64)
: [trail] "r"(trail));
}
- return csum_tail(temp64, odd);
+ return csum_finalize_sum(temp64);
}
EXPORT_SYMBOL(csum_partial);
--
2.34.1
* Re: x86/csum: Remove unnecessary odd handling
2023-09-23 21:13 ` David Laight
@ 2023-09-24 14:35 ` Noah Goldstein
2023-12-23 22:18 ` Noah Goldstein
0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-09-24 14:35 UTC (permalink / raw)
To: David Laight
Cc: kernel test robot, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Noah Goldstein
> > Sent: 23 September 2023 15:05
> >
> > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote:
> > >
> > > Hi Noah,
> > >
> > > kernel test robot noticed the following build warnings:
> > >
> > > [auto build test WARNING on tip/x86/core]
> > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
> > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > And when submitting patch, we suggest to use '--base' as documented in
> > > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > >
> > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-
> > odd-handling/20230921-032450
> > > base: tip/x86/core
> > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
> > > patch subject: x86/csum: Remove unnecessary odd handling
> > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-
> > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
> > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > > reproduce (this is a W=1 build): (https://download.01.org/0day-
> > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
> > >
> > > sparse warnings: (new ones prefixed by >>)
> > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > >
> > > vim +16 arch/x86/lib/csum-partial_64.c
> > >
> > > 13
> > > 14 static inline __wsum csum_finalize_sum(u64 temp64)
> > > 15 {
> > > > 16 return (temp64 + ror64(temp64, 32)) >> 32;
> > > 17 }
> > > 18
> >
> > Just needs a `(__wsum)` cast. Should I post a new patch?
>
> It'll need to be a (__force __wsum) cast.
>
> I think new patches are expected...
>
Thank you, posted a V4 that should fix the warning.
> David
>
* Re: x86/csum: Remove unnecessary odd handling
2023-09-24 14:35 ` Noah Goldstein
@ 2023-12-23 22:18 ` Noah Goldstein
2024-01-04 23:28 ` Noah Goldstein
0 siblings, 1 reply; 34+ messages in thread
From: Noah Goldstein @ 2023-12-23 22:18 UTC (permalink / raw)
To: David Laight
Cc: kernel test robot, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Sun, Sep 24, 2023 at 7:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > From: Noah Goldstein
> > > Sent: 23 September 2023 15:05
> > >
> > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote:
> > > >
> > > > Hi Noah,
> > > >
> > > > kernel test robot noticed the following build warnings:
> > > >
> > > > [auto build test WARNING on tip/x86/core]
> > > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
> > > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > > And when submitting patch, we suggest to use '--base' as documented in
> > > > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > > >
> > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-
> > > odd-handling/20230921-032450
> > > > base: tip/x86/core
> > > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
> > > > patch subject: x86/csum: Remove unnecessary odd handling
> > > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-
> > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
> > > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > > > reproduce (this is a W=1 build): (https://download.01.org/0day-
> > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
> > > >
> > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > the same patch/commit), kindly add following tags
> > > > | Reported-by: kernel test robot <lkp@intel.com>
> > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
> > > >
> > > > sparse warnings: (new ones prefixed by >>)
> > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > > >
> > > > vim +16 arch/x86/lib/csum-partial_64.c
> > > >
> > > > 13
> > > > 14 static inline __wsum csum_finalize_sum(u64 temp64)
> > > > 15 {
> > > > > 16 return (temp64 + ror64(temp64, 32)) >> 32;
> > > > 17 }
> > > > 18
> > >
> > > Just needs a `(__wsum)` cast. Should I post a new patch?
> >
> > It'll need to be a (__force __wsum) cast.
> >
> > I think new patches are expected...
> >
> Thank you, posted a V4 that should fix the warning.
> > David
> >
ping.
* Re: x86/csum: Remove unnecessary odd handling
2023-12-23 22:18 ` Noah Goldstein
@ 2024-01-04 23:28 ` Noah Goldstein
2024-01-04 23:34 ` Dave Hansen
2024-01-04 23:36 ` Linus Torvalds
0 siblings, 2 replies; 34+ messages in thread
From: Noah Goldstein @ 2024-01-04 23:28 UTC (permalink / raw)
To: David Laight
Cc: kernel test robot, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Sat, Dec 23, 2023 at 2:18 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Sun, Sep 24, 2023 at 7:35 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Sat, Sep 23, 2023 at 4:13 PM David Laight <David.Laight@aculab.com> wrote:
> > >
> > > From: Noah Goldstein
> > > > Sent: 23 September 2023 15:05
> > > >
> > > > On Fri, Sep 22, 2023 at 10:25 PM kernel test robot <lkp@intel.com> wrote:
> > > > >
> > > > > Hi Noah,
> > > > >
> > > > > kernel test robot noticed the following build warnings:
> > > > >
> > > > > [auto build test WARNING on tip/x86/core]
> > > > > [also build test WARNING on tip/master tip/auto-latest linus/master v6.6-rc2 next-20230921]
> > > > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > > > And when submitting patch, we suggest to use '--base' as documented in
> > > > > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > > > >
> > > > > url: https://github.com/intel-lab-lkp/linux/commits/Noah-Goldstein/x86-csum-Remove-unnecessary-
> > > > odd-handling/20230921-032450
> > > > > base: tip/x86/core
> > > > > patch link: https://lore.kernel.org/r/20230920192300.3772199-1-goldstein.w.n%40gmail.com
> > > > > patch subject: x86/csum: Remove unnecessary odd handling
> > > > > config: x86_64-randconfig-121-20230921 (https://download.01.org/0day-
> > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/config)
> > > > > compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> > > > > reproduce (this is a W=1 build): (https://download.01.org/0day-
> > > > ci/archive/20230923/202309231130.ZI5MdlDc-lkp@intel.com/reproduce)
> > > > >
> > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > > > the same patch/commit), kindly add following tags
> > > > > | Reported-by: kernel test robot <lkp@intel.com>
> > > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202309231130.ZI5MdlDc-lkp@intel.com/
> > > > >
> > > > > sparse warnings: (new ones prefixed by >>)
> > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > > > > >> arch/x86/lib/csum-partial_64.c:16:45: sparse: sparse: incorrect type in return expression
> > > > (different base types) @@ expected restricted __wsum @@ got unsigned long long @@
> > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: expected restricted __wsum
> > > > > arch/x86/lib/csum-partial_64.c:16:45: sparse: got unsigned long long
> > > > >
> > > > > vim +16 arch/x86/lib/csum-partial_64.c
> > > > >
> > > > > 13
> > > > > 14 static inline __wsum csum_finalize_sum(u64 temp64)
> > > > > 15 {
> > > > > > 16 return (temp64 + ror64(temp64, 32)) >> 32;
> > > > > 17 }
> > > > > 18
> > > >
> > > > Just needs a `(__wsum)` cast. Should I post a new patch?
> > >
> > > It'll need to be a (__force __wsum) cast.
> > >
> > > I think new patches are expected...
> > >
> > Thank you, posted a V4 that should fix the warning.
> > > David
> > >
>
> ping.
ping
* Re: x86/csum: Remove unnecessary odd handling
2024-01-04 23:28 ` Noah Goldstein
@ 2024-01-04 23:34 ` Dave Hansen
2024-01-04 23:36 ` Linus Torvalds
1 sibling, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2024-01-04 23:34 UTC (permalink / raw)
To: Noah Goldstein, David Laight
Cc: kernel test robot, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com,
torvalds@linux-foundation.org, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On 1/4/24 15:28, Noah Goldstein wrote:
>> ping.
> ping
Noah, it would be great to elaborate on what you are pinging about.
This thread has a note that you "posted v4", but I don't see a
clearly-tagged v4 in my inbox, just a bunch of identically-titled
non-standard-subject[1] emails without any indication of what changed in
the individual patches.
1. https://docs.kernel.org/process/submitting-patches.html
* Re: x86/csum: Remove unnecessary odd handling
2024-01-04 23:28 ` Noah Goldstein
2024-01-04 23:34 ` Dave Hansen
@ 2024-01-04 23:36 ` Linus Torvalds
2024-01-05 0:33 ` Linus Torvalds
1 sibling, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2024-01-04 23:36 UTC (permalink / raw)
To: Noah Goldstein
Cc: David Laight, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
edumazet@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com
On Thu, 4 Jan 2024 at 15:28, Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> ping
Bah. I think this keeps falling through the cracks because the
networking people consider this an architecture thing, and the x86
people probably consider this a networking thing.
Anyway, since I looked at the thing originally, and feel like I know
the x86 side and understand the strange IP csum too, I just applied it
directly.
Linus
* Re: x86/csum: Remove unnecessary odd handling
2024-01-04 23:36 ` Linus Torvalds
@ 2024-01-05 0:33 ` Linus Torvalds
2024-01-05 10:41 ` David Laight
0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2024-01-05 0:33 UTC (permalink / raw)
To: Noah Goldstein
Cc: David Laight, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
edumazet@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com
On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Anyway, since I looked at the thing originally, and feel like I know
> the x86 side and understand the strange IP csum too, I just applied it
> directly.
I ended up just applying my 40-byte cleanup thing too that I've been
keeping in my own tree since posting it (as the "Silly csum
improvement. Maybe" patch).
I've been running it on my own machine since last June, and I finally
even enabled the csum KUnit test just to double-check it.
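(In case anyone wants to repeat that: assuming a kunit-capable tree,
something like

	./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_CHECKSUM_KUNIT=y checksum

should build and run just the checksum suite. The config symbol and
suite name are from memory, so double-check them against lib/.)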
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-05 0:33 ` Linus Torvalds
@ 2024-01-05 10:41 ` David Laight
2024-01-05 16:12 ` David Laight
2024-01-05 18:05 ` Linus Torvalds
0 siblings, 2 replies; 34+ messages in thread
From: David Laight @ 2024-01-05 10:41 UTC (permalink / raw)
To: 'Linus Torvalds', Noah Goldstein
Cc: kernel test robot, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
From: Linus Torvalds
> Sent: 05 January 2024 00:33
>
> On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Anyway, since I looked at the thing originally, and feel like I know
> > the x86 side and understand the strange IP csum too, I just applied it
> > directly.
>
> I ended up just applying my 40-byte cleanup thing too that I've been
> keeping in my own tree since posting it (as the "Silly csum
> improvement. Maybe" patch).
Interesting, I'm pretty sure trying to get two blocks of
'adc' scheduled in parallel like that doesn't work.
I got an adc every clock from this 'beast':
+ /*
+ * Align the byte count to a multiple of 16 then
+ * add 64 bit words to alternating registers.
+ * Finally reduce to 64 bits.
+ */
+ asm( " bt $4, %[len]\n"
+ " jnc 10f\n"
+ " add (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 16(%[len]), %[len]\n"
+ "10: jecxz 20f\n"
+ " adc (%[buff], %[len]), %[sum_0]\n"
+ " adc 8(%[buff], %[len]), %[sum_1]\n"
+ " lea 32(%[len]), %[len_tmp]\n"
+ " adc 16(%[buff], %[len]), %[sum_0]\n"
+ " adc 24(%[buff], %[len]), %[sum_1]\n"
+ " mov %[len_tmp], %[len]\n"
+ " jmp 10b\n"
+ "20: adc %[sum_0], %[sum]\n"
+ " adc %[sum_1], %[sum]\n"
+ " adc $0, %[sum]\n"
+ : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
+ [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
+ : [buff] "r" (buff)
+ : "memory" );
Followed by code to sort out any trailing 15 bytes.
Intel cpu (from P-II until Broadwell 5th-gen) take two clocks for 'adc'
(probably because it needs 3 inputs).
So 'adc' chains ran a lot slower than you might think.
(Clearly no one ever actually benchmarked the old code!)
The first fix made the carry output available early - so adding
to alternate registers helps. IIRC this is in Ivy/Sandy bridge.
Maybe no one cares about Ivy/Sandy bridge and Haswell any more.
AMD cpu don't have this problem.
I'm pretty sure I measured that loop with a misaligned buffer.
Measurably slower, but less than one clock per cache line.
I guess that the cache-line crossing reads get split, but you
gain most back because the cpu can do two reads/clock.
Maybe I'll sort out another patch...
I did get 15/16 bytes/clock with a similar loop that used adox/adcx
but that needed unrolling again and only works on a few cpu.
IIRC amd have some cpu that support adox - but execute it slowly!
Annoyingly you can't use 'loop' even on cpu that support adox
because it is stupidly slow on intel cpu (ok on amd).
That version is a lot of pain since it needs run-time patching.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-05 10:41 ` David Laight
@ 2024-01-05 16:12 ` David Laight
2024-01-05 18:05 ` Linus Torvalds
1 sibling, 0 replies; 34+ messages in thread
From: David Laight @ 2024-01-05 16:12 UTC (permalink / raw)
To: 'Linus Torvalds', 'Noah Goldstein'
Cc: 'kernel test robot', 'x86@kernel.org',
'oe-kbuild-all@lists.linux.dev',
'linux-kernel@vger.kernel.org',
'edumazet@google.com', 'tglx@linutronix.de',
'mingo@redhat.com', 'bp@alien8.de',
'dave.hansen@linux.intel.com', 'hpa@zytor.com'
From: David Laight
> Sent: 05 January 2024 10:41
>
> From: Linus Torvalds
> > Sent: 05 January 2024 00:33
> >
> > On Thu, 4 Jan 2024 at 15:36, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Anyway, since I looked at the thing originally, and feel like I know
> > > the x86 side and understand the strange IP csum too, I just applied it
> > > directly.
> >
> > I ended up just applying my 40-byte cleanup thing too that I've been
> > keeping in my own tree since posting it (as the "Silly csum
> > improvement. Maybe" patch).
>
> Interesting, I'm pretty sure trying to get two blocks of
> 'adc' scheduled in parallel like that doesn't work.
>
> I got an adc every clock from this 'beast':
> + /*
> + * Align the byte count to a multiple of 16 then
> + * add 64 bit words to alternating registers.
> + * Finally reduce to 64 bits.
> + */
> + asm( " bt $4, %[len]\n"
> + " jnc 10f\n"
> + " add (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 16(%[len]), %[len]\n"
> + "10: jecxz 20f\n"
> + " adc (%[buff], %[len]), %[sum_0]\n"
> + " adc 8(%[buff], %[len]), %[sum_1]\n"
> + " lea 32(%[len]), %[len_tmp]\n"
> + " adc 16(%[buff], %[len]), %[sum_0]\n"
> + " adc 24(%[buff], %[len]), %[sum_1]\n"
> + " mov %[len_tmp], %[len]\n"
> + " jmp 10b\n"
> + "20: adc %[sum_0], %[sum]\n"
> + " adc %[sum_1], %[sum]\n"
> + " adc $0, %[sum]\n"
> + : [sum] "+&r" (sum), [sum_0] "+&r" (sum_0), [sum_1] "+&r" (sum_1),
> + [len] "+&c" (len), [len_tmp] "=&r" (len_tmp)
> + : [buff] "r" (buff)
> + : "memory" );
I've got far too many x86 checksum functions lying around.
Actually you don't need all that.
Anything recent (probably Broadwell on) will execute:
"10: jecxz 20f\n"
" adc (%[buff], %[len]), %[sum]\n"
" adc 8(%[buff], %[len]), %[sum]\n"
" lea 16(%[len]), %[len]\n"
" jmp 10b\n"
"20: adc $0, %[sum]\n"
in two clocks per iteration - 8 bytes/clock.
Since it is trivial to handle 8n+4 buffers (eg as above)
that only leaves the C code to handle the final 0-7 bytes.
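For reference, the final-bytes fragment csum_partial() already uses
for that (it also appears in Linus' benchmark patch later in this
thread):

	if (len & 7) {
		unsigned int shift = (-len << 3) & 63;
		unsigned long trail;

		/* read 8 bytes, shift out the ones past the buffer */
		trail = (load_unaligned_zeropad(buff) << shift) >> shift;

		asm("addq %[trail],%[res]\n\t"
		    "adcq $0,%[res]"
		    : [res] "+r"(temp64)
		    : [trail] "r"(trail));
	}

so the asm loop only ever has to deal with whole words.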
> Maybe I'll sort out another patch...
Probably after the next rc1 is out.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x86/csum: Remove unnecessary odd handling
2024-01-05 10:41 ` David Laight
2024-01-05 16:12 ` David Laight
@ 2024-01-05 18:05 ` Linus Torvalds
2024-01-05 23:52 ` David Laight
2024-01-06 22:08 ` David Laight
1 sibling, 2 replies; 34+ messages in thread
From: Linus Torvalds @ 2024-01-05 18:05 UTC (permalink / raw)
To: David Laight
Cc: Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
edumazet@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com
[-- Attachment #1: Type: text/plain, Size: 450 bytes --]
On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote:
>
> Interesting, I'm pretty sure trying to get two blocks of
> 'adc' scheduled in parallel like that doesn't work.
You should check out the benchmark at
https://github.com/fenrus75/csum_partial
and see if you can improve on it. I'm including the patch (on top of
that code by Arjan) to implement the actual current kernel version as
"New version".
Linus
[-- Attachment #2: p --]
[-- Type: application/octet-stream, Size: 4840 bytes --]
From 6ff7f7a72a4855970b1621ac9724c44c393a6d44 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Fri, 5 Jan 2024 09:46:32 -0800
Subject: [PATCH] Add the current kernel version as "New version"
---
Makefile | 3 --
csum_partial.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 115 insertions(+), 5 deletions(-)
diff --git a/Makefile b/Makefile
index e4b1bb3..4e29f8a 100644
--- a/Makefile
+++ b/Makefile
@@ -17,6 +17,3 @@ chain2.svg: graphs/chain2.dot
chain2a.svg: graphs/chain2a.dot
dot -Tsvg -O graphs/chain2a.dot
mv graphs/chain2a.dot.svg chain2a.svg
-
-
-
\ No newline at end of file
diff --git a/csum_partial.c b/csum_partial.c
index 4db0d97..ddf6acd 100644
--- a/csum_partial.c
+++ b/csum_partial.c
@@ -14,13 +14,28 @@
#include <time.h>
typedef uint32_t __wsum;
+typedef uint32_t __u32;
+typedef uint64_t __u64;
typedef uint64_t u64;
typedef uint32_t u32;
+# define likely(x) __builtin_expect(!!(x), 1)
# define unlikely(x) __builtin_expect(!!(x), 0)
+#define __force
+
#define LOOPCOUNT 102400
#define PACKETSIZE 40
+/**
+ * ror64 - rotate a 64-bit value right
+ * @word: value to rotate
+ * @shift: bits to roll
+ */
+static inline __u64 ror64(__u64 word, unsigned int shift)
+{
+ return (word >> (shift & 63)) | (word << ((-shift) & 63));
+}
+
static inline unsigned long load_unaligned_zeropad(const void *addr)
{
unsigned long ret, dummy;
@@ -484,7 +499,105 @@ static inline __wsum nulltest(const void *buff, int len, __wsum sum)
{
return 2;
}
+static inline __wsum csum_finalize_sum(u64 temp64)
+{
+ return (__force __wsum)((temp64 + ror64(temp64, 32)) >> 32);
+}
+static inline unsigned long update_csum_40b(unsigned long sum, const unsigned long m[5])
+{
+ asm("addq %1,%0\n\t"
+ "adcq %2,%0\n\t"
+ "adcq %3,%0\n\t"
+ "adcq %4,%0\n\t"
+ "adcq %5,%0\n\t"
+ "adcq $0,%0"
+ :"+r" (sum)
+ :"m" (m[0]), "m" (m[1]), "m" (m[2]),
+ "m" (m[3]), "m" (m[4]));
+ return sum;
+}
+
+/*
+ * Do a checksum on an arbitrary memory area.
+ * Returns a 32bit checksum.
+ *
+ * This isn't as time critical as it used to be because many NICs
+ * do hardware checksumming these days.
+ *
+ * Still, with CHECKSUM_COMPLETE this is called to compute
+ * checksums on IPv6 headers (40 bytes) and other small parts.
+ * it's best to have buff aligned on a 64-bit boundary
+ */
+__wsum csum_partial_new(const void *buff, int len, __wsum sum)
+{
+ u64 temp64 = (__force u64)sum;
+
+ /* Do two 40-byte chunks in parallel to get better ILP */
+ if (likely(len >= 80)) {
+ u64 temp64_2 = 0;
+ do {
+ temp64 = update_csum_40b(temp64, buff);
+ temp64_2 = update_csum_40b(temp64_2, buff + 40);
+ buff += 80;
+ len -= 80;
+ } while (len >= 80);
+
+ asm("addq %1,%0\n\t"
+ "adcq $0,%0"
+ :"+r" (temp64): "r" (temp64_2));
+ }
+
+ /*
+ * len == 40 is the hot case due to IPv6 headers, so return
+ * early for that exact case without checking the tail bytes.
+ */
+ if (len >= 40) {
+ temp64 = update_csum_40b(temp64, buff);
+ len -= 40;
+ if (!len)
+ return csum_finalize_sum(temp64);
+ buff += 40;
+ }
+
+ if (len & 32) {
+ asm("addq 0*8(%[src]),%[res]\n\t"
+ "adcq 1*8(%[src]),%[res]\n\t"
+ "adcq 2*8(%[src]),%[res]\n\t"
+ "adcq 3*8(%[src]),%[res]\n\t"
+ "adcq $0,%[res]"
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[32])buff));
+ buff += 32;
+ }
+ if (len & 16) {
+ asm("addq 0*8(%[src]),%[res]\n\t"
+ "adcq 1*8(%[src]),%[res]\n\t"
+ "adcq $0,%[res]"
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[16])buff));
+ buff += 16;
+ }
+ if (len & 8) {
+ asm("addq 0*8(%[src]),%[res]\n\t"
+ "adcq $0,%[res]"
+ : [res] "+r"(temp64)
+ : [src] "r"(buff), "m"(*(const char(*)[8])buff));
+ buff += 8;
+ }
+ if (len & 7) {
+ unsigned int shift = (-len << 3) & 63;
+ unsigned long trail;
+
+ trail = (load_unaligned_zeropad(buff) << shift) >> shift;
+
+ asm("addq %[trail],%[res]\n\t"
+ "adcq $0,%[res]"
+ : [res] "+r"(temp64)
+ : [trail] "r"(trail));
+ }
+ return csum_finalize_sum(temp64);
+}
double cycles[64];
int cyclecount[64];
@@ -612,6 +725,7 @@ int main(int argc, char **argv)
MEASURE(2, csum_partial, "Upcoming linux kernel version");
MEASURE(4, csum_specialized, "Specialized to size 40");
+ MEASURE(6, csum_partial_new, "New version");
MEASURE(22, csum_partial_no_odd, "Odd-alignment handling removed");
MEASURE(24, csum_partial_dead_code, "Dead code elimination ");
MEASURE(28, csum_partial_ACX, "ADX interleaved ");
@@ -619,7 +733,6 @@ int main(int argc, char **argv)
MEASURE(34, csum_partial_32bit, "32 bit train ");
MEASURE(36, csum_partial_zero_sum, "Assume zero input sum");
-
report();
}
-}
\ No newline at end of file
+}
^ permalink raw reply related [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-05 18:05 ` Linus Torvalds
@ 2024-01-05 23:52 ` David Laight
2024-01-06 0:18 ` Linus Torvalds
2024-01-06 22:08 ` David Laight
1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2024-01-05 23:52 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
edumazet@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com
From: Linus Torvalds
> Sent: 05 January 2024 18:06
>
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> > 'adc' scheduled in parallel like that doesn't work.
>
> You should check out the benchmark at
>
> https://github.com/fenrus75/csum_partial
>
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".
I'd have to fix his benchmark code first :-)
You can't use the TSC unless you lock the cpu frequency.
The longer the test runs for the faster the cpu will run.
I've had to use the performance counters to get accurate cycle counts.
I'm also sure I tried two separate 'adc' chains and failed
to get any overlapped instructions.
But I'll try his loop in my framework.
The 'lfence' version also really does matter.
In some sense it doesn't matter if you add 10 clocks to the
IP header checksum - they'll be dwarfed by everything else.
But if you do have to checksum an 8k NFS UDP datagram the
loop count matters.
The OOO engine is very good at just piling up instructions
that are waiting for previous instructions in a big queue
and executing instructions for which the data is available.
So the control flow for the checksum code can finish well
before the checksum is available.
On a related point, do you remember what the 'killer app'
was for doing the checksum in copy_to/from_user?
The change pre-dates git and I can't find the commit message.
The only place it is done in the receive path is for UDP.
At a guess that helped 8k UDP datagrams going into a userspace nfsd?
I bet nothing really depends on RX UDP performance to userspace
on systems that don't do hw checksum?
The tx side is done copying data into an skb, I'm not sure if
that is all copies - you really don't want to do it if the
hardware supports tx checksum offload.
Using 'rep movsb' for copy_from_user() has to be faster than
the I-cache busting 'copy and checksum' code - even if you
end up doing a checksum not much later on.
IIRC (I looked a couple of weeks ago) only x86, sparc32 and m68k
actually have 'copy and checksum' (and the m68k didn't seem to
have the markers for faults!).
(All of them are I-cache killers)
The default iovec[] version checksums every fragment - which
gives grief trying to report a 32bit checksum and EFAULT.
OTOH it could sum the linear kernel buffer after the copy.
That would let it fold the checksum to 17bits and report -1
for EFAULT.
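To spell out the 17-bit trick: one fold of a 32-bit sum is bounded by
0xffff + 0xffff = 0x1fffe, so an all-ones return value can never be a
valid checksum and is free to mean EFAULT. A sketch, with
sum_linear_buffer() a hypothetical helper for the plain sum:

	static unsigned long copy_and_sum(void *dst, const void __user *src,
					  size_t len)
	{
		u32 sum;

		if (copy_from_user(dst, src, len))
			return -1UL;			/* unambiguous EFAULT */
		sum = sum_linear_buffer(dst, len);	/* hypothetical */
		return (sum & 0xffff) + (sum >> 16);	/* <= 0x1fffe */
	}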
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x86/csum: Remove unnecessary odd handling
2024-01-05 23:52 ` David Laight
@ 2024-01-06 0:18 ` Linus Torvalds
2024-01-06 10:26 ` Eric Dumazet
0 siblings, 1 reply; 34+ messages in thread
From: Linus Torvalds @ 2024-01-06 0:18 UTC (permalink / raw)
To: David Laight
Cc: Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
edumazet@google.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com
On Fri, 5 Jan 2024 at 15:53, David Laight <David.Laight@aculab.com> wrote:
>
> I'd have to fix his benchmark code first :-)
> You can't use the TSC unless you lock the cpu frequency.
> The longer the test runs for the faster the cpu will run.
They'll stabilize, it has some cycle result aging code.
But yes, set the CPU policy to 'performance' or do performance
counters if you care deeply.
> On a related point, do you remember what the 'killer app'
> was for doing the checksum in copy_to/from_user?
No. It's a long time ago, and many things have changed since.
It's possible the copy-and-csum isn't worth it any more, simply
because all modern network cards will do the csum for us, and I think
loopback sets a flag saying "no need to checksum" too.
But I do have a strong memory of it being a big deal back when. A
_loong_ time ago.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x86/csum: Remove unnecessary odd handling
2024-01-06 0:18 ` Linus Torvalds
@ 2024-01-06 10:26 ` Eric Dumazet
2024-01-06 19:32 ` Linus Torvalds
2024-01-07 12:11 ` David Laight
0 siblings, 2 replies; 34+ messages in thread
From: Eric Dumazet @ 2024-01-06 10:26 UTC (permalink / raw)
To: Linus Torvalds
Cc: David Laight, Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Sat, Jan 6, 2024 at 1:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, 5 Jan 2024 at 15:53, David Laight <David.Laight@aculab.com> wrote:
> >
> > I'd have to fix his benchmark code first :-)
> > You can't use the TSC unless you lock the cpu frequency.
> > The longer the test runs for the faster the cpu will run.
>
> They'll stabilize, it has some cycle result aging code.
>
> But yes, set the CPU policy to 'performance' or do performance
> counters if you care deeply.
>
> > On a related point, do you remember what the 'killer app'
> > was for doing the checksum in copy_to/from_user?
>
> No. It's a long time ago, and many things have changed since.
>
> It's possible the copy-and-csum isn't worth it any more, simply
> because all modern network cards will do the csum for us, and I think
> loopback sets a flag saying "no need to checksum" too.
>
> But I do have a strong memory of it being a big deal back when. A
> _loong_ time ago.
>
On a related note, at least with clang, I found that csum_ipv6_magic()
is needlessly using temporary on-stack storage,
showing a stall on Cascade Lake unless I am patching add32_with_carry() :
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index 407beebadaf45a748f91a36b78bd1d023449b132..c3d6f47626c70d81f0c2ba401d85050b09a39922 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -171,7 +171,7 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
asm("addl %2,%0\n\t"
"adcl $0,%0"
: "=r" (a)
- : "0" (a), "rm" (b));
+ : "0" (a), "r" (b));
return a;
}
Before patch :
ffffffff81b9b600 <csum_ipv6_magic>:
ffffffff81b9b600: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
ffffffff81b9b601: R_X86_64_NONE __fentry__-0x4
ffffffff81b9b605: 55 push %rbp
ffffffff81b9b606: 48 89 e5 mov %rsp,%rbp
ffffffff81b9b609: 48 83 ec 04 sub $0x4,%rsp // This will be removed after patch
ffffffff81b9b60d: 0f ca bswap %edx
ffffffff81b9b60f: 89 c9 mov %ecx,%ecx
ffffffff81b9b611: 48 c1 e1 08 shl $0x8,%rcx
ffffffff81b9b615: 44 89 c0 mov %r8d,%eax
ffffffff81b9b618: 48 01 d0 add %rdx,%rax
ffffffff81b9b61b: 48 01 c8 add %rcx,%rax
ffffffff81b9b61e: 48 03 07 add (%rdi),%rax
ffffffff81b9b621: 48 13 47 08 adc 0x8(%rdi),%rax
ffffffff81b9b625: 48 13 06 adc (%rsi),%rax
ffffffff81b9b628: 48 13 46 08 adc 0x8(%rsi),%rax
ffffffff81b9b62c: 48 83 d0 00 adc $0x0,%rax
ffffffff81b9b630: 48 89 c1 mov %rax,%rcx
ffffffff81b9b633: 48 c1 e9 20 shr $0x20,%rcx
ffffffff81b9b637: 89 4d fc mov %ecx,-0x4(%rbp) // Store ecx on the stack
ffffffff81b9b63a: 03 45 fc add -0x4(%rbp),%eax // reads the value right away, stalling some Intel cpus.
ffffffff81b9b63d: 83 d0 00 adc $0x0,%eax
ffffffff81b9b640: 89 c1 mov %eax,%ecx // Note that ecx content was scratch anyway
ffffffff81b9b642: c1 e1 10 shl $0x10,%ecx
ffffffff81b9b645: 25 00 00 ff ff and $0xffff0000,%eax
ffffffff81b9b64a: 01 c8 add %ecx,%eax
ffffffff81b9b64c: 15 ff ff 00 00 adc $0xffff,%eax
ffffffff81b9b651: f7 d0 not %eax
ffffffff81b9b653: c1 e8 10 shr $0x10,%eax
ffffffff81b9b656: 48 83 c4 04 add $0x4,%rsp
ffffffff81b9b65a: 5d pop %rbp
ffffffff81b9b65b: c3 ret
ffffffff81b9b65c: cc int3
ffffffff81b9b65d: 00 00 add %al,(%rax)
ffffffff81b9b65f: cc int3
After patch:
00000000000000a0 <csum_ipv6_magic>:
a0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
a1: R_X86_64_NONE __fentry__-0x4
a5: 55 push %rbp
a6: 48 89 e5 mov %rsp,%rbp
a9: 0f ca bswap %edx
ab: 89 c9 mov %ecx,%ecx
ad: 48 c1 e1 08 shl $0x8,%rcx
b1: 44 89 c0 mov %r8d,%eax
b4: 48 01 d0 add %rdx,%rax
b7: 48 01 c8 add %rcx,%rax
ba: 48 03 07 add (%rdi),%rax
bd: 48 13 47 08 adc 0x8(%rdi),%rax
c1: 48 13 06 adc (%rsi),%rax
c4: 48 13 46 08 adc 0x8(%rsi),%rax
c8: 48 83 d0 00 adc $0x0,%rax
cc: 48 89 c1 mov %rax,%rcx
cf: 48 c1 e9 20 shr $0x20,%rcx
d3: 01 c8 add %ecx,%eax // just better !
d5: 83 d0 00 adc $0x0,%eax
d8: 89 c1 mov %eax,%ecx
da: c1 e1 10 shl $0x10,%ecx
dd: 25 00 00 ff ff and $0xffff0000,%eax
e2: 01 c8 add %ecx,%eax
e4: 15 ff ff 00 00 adc $0xffff,%eax
e9: f7 d0 not %eax
eb: c1 e8 10 shr $0x10,%eax
ee: 5d pop %rbp
ef: c3 ret
f0: cc int3
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x86/csum: Remove unnecessary odd handling
2024-01-06 10:26 ` Eric Dumazet
@ 2024-01-06 19:32 ` Linus Torvalds
2024-01-07 12:11 ` David Laight
1 sibling, 0 replies; 34+ messages in thread
From: Linus Torvalds @ 2024-01-06 19:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Laight, Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
On Sat, 6 Jan 2024 at 02:26, Eric Dumazet <edumazet@google.com> wrote:
>
> On a related note, at least with clang, I found that csum_ipv6_magic()
> is needlessly using temporary on-stack storage,
> showing a stall on Cascade Lake unless I am patching add32_with_carry() :
This is a known compiler bug in clang:
https://github.com/llvm/llvm-project/issues/20571
https://github.com/llvm/llvm-project/issues/30873
https://github.com/llvm/llvm-project/issues/34837
and while we could always just use "r" for constraints, it will
(a) possibly run out of registers in some situations
(b) pessimize gcc that does this right
Can you please push the clang people to not do the stupid thing they
do now, which is to say "oh, I can use registers _or_ memory, so I'll
always use memory".
Btw, it's actually much worse than that, and clang is just doing
incredibly broken things. Switching to "r" just hides the garbage.
Because sometimes the source really *is* memory, ie we have
net/ipv4/udp.c:
csum = csum_add(csum, frags->csum);
and then it's criminally stupid to load it into a register when it can
be just used directly.
But clang says "criminally stupid? *I* will show you stupid!" -
because what *clang* does with the above is this thing of beauty:
movl 136(%rax), %edi
movl %edi, 28(%rsp)
addl 28(%rsp), %ecx
adcl $0, %ecx
which has turned from "criminally stupid" to "it's art, I tell you -
you're not supposed to understand it".
Anyway, this is *literally* a clang bug. Complain to clang people for
being *extra* stupid - we told the compiler that it can use a register
or memory, and clang decided "I'll use memory", so then when we gave
it a memory location, it said "no, not *that* memory - I'll just
reload it on stack".
In contrast, gcc gets this right - and for that udp.c case it just generates
addl 136(%rax),%ecx # frags_67->D.58941.D.58869.D.58836.csum, a
adcl $0,%ecx # a
like it should.
And for csum_ipv6_magic, gcc generates this:
addl %edx,%eax # tmp112, a
adcl $0,%eax # a
IOW, the kernel is *right*, and this is purely clang being completely bogus.
I really don't want to penalize a good compiler because a bad one
can't get its act together.
Linus
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-05 18:05 ` Linus Torvalds
2024-01-05 23:52 ` David Laight
@ 2024-01-06 22:08 ` David Laight
2024-01-07 1:09 ` H. Peter Anvin
1 sibling, 1 reply; 34+ messages in thread
From: David Laight @ 2024-01-06 22:08 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: Noah Goldstein, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
[-- Attachment #1: Type: text/plain, Size: 2325 bytes --]
From: Linus Torvalds
> Sent: 05 January 2024 18:06
>
> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote:
> >
> > Interesting, I'm pretty sure trying to get two blocks of
> > 'adc' scheduled in parallel like that doesn't work.
>
> You should check out the benchmark at
>
> https://github.com/fenrus75/csum_partial
>
> and see if you can improve on it. I'm including the patch (on top of
> that code by Arjan) to implement the actual current kernel version as
> "New version".
Annoyingly (for me) you are partially right...
I found where my ip checksum perf code was hiding and revisited it.
Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
did an adc every clock, it isn't happening for me now.
I'm only measuring the inner loop for multiples of 64 bytes.
The code for lengths less than 8 bytes and for partial final words is a
separate problem.
The less unrolled the main loop, the less overhead there'll
be for 'normal' sizes.
So I've changed your '80 byte' block to 64 bytes for consistency.
I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
(adc takes two clocks - although adc to alternate regs is one clock
on sandy bridge).
My test system is an i7-7700, I think anything from broadwell (gen 4)
will be at least as good.
I don't have a modern amd cpu.
The best loop for 256+ bytes is an adxc/adxo one.
However that requires the run-time patching.
Followed by new kernel version (two blocks of 4 adc).
The surprising one is:
xor sum, sum
1: adc (buff), sum
adc 8(buff), sum
lea 16(buff), buff
dec count
jnz 1b
adc $0, sum
For 256 bytes it is only a couple of clocks slower.
Maybe 10% slower for 512+ bytes.
But it needs almost no extra code for 'normal' buffer sizes.
By comparison the adxc/adxo one is 20% faster.
The code is doing:
old = rdpmc
mfence
csum = do_csum(buf, len);
mfence
clocks = rdpmc - old
(That is directly reading the pmc register.)
With 'no-op' function it takes 160 clocks (I-cache resident).
Without the mfence 40 - but pretty much everything can execute
after the 2nd rdpmc.
I've attached my (horrid) test program.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
[-- Attachment #2: ipcsum.c --]
[-- Type: text/plain, Size: 12950 bytes --]
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#ifndef PASSES
#define PASSES 10
#endif
static int init_pmc(void)
{
static struct perf_event_attr perf_attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.pinned = 1,
};
struct perf_event_mmap_page *pc;
int perf_fd;
perf_fd = syscall(__NR_perf_event_open, &perf_attr, 0, -1, -1, 0);
if (perf_fd < 0) {
fprintf(stderr, "perf_event_open failed: errno %d\n", errno);
exit(1);
}
pc = mmap(NULL, 4096, PROT_READ, MAP_SHARED, perf_fd, 0);
if (pc == MAP_FAILED) {
fprintf(stderr, "perf_event mmap() failed: errno %d\n", errno);
exit(1);
}
return pc->index - 1;
}
static inline unsigned int rdpmc(unsigned int counter)
{
unsigned int low, high;
#ifndef NOFENCE
asm volatile("mfence");
#endif
asm volatile("rdpmc" : "=a" (low), "=d" (high) : "c" (counter));
#ifndef NOFENCE
asm volatile("mfence");
#endif
// return low bits, counter might be 32 or 40 bits wide.
return low;
}
__attribute__((noinline))
unsigned long overhead(const unsigned char *buff, unsigned long len)
{
return 0;
}
// Existing kernel loop.
// Simple 'adc' loop, abs max 8 bytes/clock.
// Will be very slow on old (pre-Ivy Bridge) cpus where 'dec'
// has a false dependency against the carry flag.
// And also pre-Broadwell where the adc takes two clocks.
__attribute__((noinline))
unsigned long adc_dec_8(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
long count = len/64;
asm("1: addq 0(%[buff]), %[csum]\n"
" adcq 8(%[buff]), %[csum]\n"
" adcq 16(%[buff]), %[csum]\n"
" adcq 24(%[buff]), %[csum]\n"
" adcq 32(%[buff]), %[csum]\n"
" adcq 40(%[buff]), %[csum]\n"
" adcq 48(%[buff]), %[csum]\n"
" adcq 56(%[buff]), %[csum]\n"
" adcq $0, %[csum]\n"
" lea 64(%[buff]), %[buff]\n"
" dec %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
return csum;
}
// As above but only 4 adc
__attribute__((noinline))
unsigned long adc_dec_4(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
long count = len/32;
asm(" clc\n"
"1: adcq (%[buff]), %[csum]\n"
" adcq 8(%[buff]), %[csum]\n"
" adcq 16(%[buff]), %[csum]\n"
" adcq 24(%[buff]), %[csum]\n"
" lea 32(%[buff]), %[buff]\n"
" dec %[count]\n"
" jnz 1b\n"
" adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
return csum;
}
// As above but only 2 adc
__attribute__((noinline))
unsigned long adc_dec_2(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
long count = len/16;
asm(" clc\n"
"1: adcq (%[buff]), %[csum]\n"
" adcq 8(%[buff]), %[csum]\n"
" lea 16(%[buff]), %[buff]\n"
" dec %[count]\n"
" jnz 1b\n"
" adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
return csum;
}
// Alternate 'adc' loop that avoids using the zero flag.
__attribute__((noinline))
unsigned long adc_jcxz_4(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
buff += len;
len = -len;
asm(" clc\n"
"1: jrcxz 2f\n"
" adcq 0(%[buff], %[len]), %[csum]\n"
" adcq 8(%[buff], %[len]), %[csum]\n"
" adcq 16(%[buff], %[len]), %[csum]\n"
" adcq 24(%[buff], %[len]), %[csum]\n"
" lea 32(%[len]), %[len]\n"
" jmp 1b\n"
"2: adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&c" (len)
: [buff] "r" (buff)
: "memory");
return csum;
}
// Shorter alternate 'adc' loop that avoids using the zero flag.
__attribute__((noinline))
unsigned long adc_jcxz_2(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
buff += len;
len = -len;
asm(" clc\n"
"1: jrcxz 2f\n"
" adcq 0(%[buff], %[len]), %[csum]\n"
" adcq 8(%[buff], %[len]), %[csum]\n"
" lea 16(%[len]), %[len]\n"
" jmp 1b\n"
"2: adcq $0, %[csum]\n"
: [csum] "+&r" (csum), [len] "+&c" (len)
: [buff] "r" (buff)
: "memory");
return csum;
}
// Proposed kernel loop.
// This splits the adc chain in two.
__attribute__((noinline))
unsigned long adc_4_pair(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
long count = len/64;
unsigned long csum1;
asm("1: movq 0(%[buff]), %[csum1]\n"
" adcq 8(%[buff]), %[csum1]\n"
" adcq 16(%[buff]), %[csum1]\n"
" adcq 24(%[buff]), %[csum1]\n"
" adcq 32(%[buff]), %[csum1]\n"
" adcq $0, %[csum1]\n"
" lea 64(%[buff]), %[buff]\n"
" addq 40-64(%[buff]), %[csum]\n"
" adcq 48-64(%[buff]), %[csum]\n"
" adcq 56-64(%[buff]), %[csum]\n"
" adcq %[csum1], %[csum]\n"
" adcq $0, %[csum]\n"
" dec %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
return csum;
}
// Less unrolled version of previous.
__attribute__((noinline))
unsigned long adc_2_pair(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
long count = len;
unsigned long csum1;
asm("1: movq 0(%[buff]), %[csum1]\n"
" adcq 8(%[buff]), %[csum1]\n"
" adcq $0, %[csum1]\n"
" lea 32(%[buff]), %[buff]\n"
" addq 16-32(%[buff]), %[csum]\n"
" adcq 24-32(%[buff]), %[csum]\n"
" adcq %[csum1], %[csum]\n"
" adcq $0, %[csum]\n"
" sub $32, %[count]\n"
" jnz 1b"
: [csum] "+&r" (csum), [csum1] "=&r" (csum1), [len] "+&r" (len), [count] "+&r" (count)
: [buff] "r" (buff)
: "memory");
return csum;
}
// adxc/adxo loop.
// This is the only loop that exceeds 8 bytes/clock for 1k buffers.
__attribute__((noinline))
unsigned long adcx_adox(const unsigned char *buff, unsigned long len)
{
unsigned long csum = 0;
unsigned long csum_odd;
buff += len;
len = -len;
asm( " xor %[csum_odd], %[csum_odd]\n" // Also clears carry and overflow
"10: jrcxz 20f\n"
" adcx (%[buff], %[len]), %[csum]\n"
" adox 8(%[buff], %[len]), %[csum_odd]\n"
" adcx 16(%[buff], %[len]), %[csum]\n"
" adox 24(%[buff], %[len]), %[csum_odd]\n"
" adcx 32(%[buff], %[len]), %[csum]\n"
" adox 40(%[buff], %[len]), %[csum_odd]\n"
" adcx 48(%[buff], %[len]), %[csum]\n"
" adox 56(%[buff], %[len]), %[csum_odd]\n"
" lea 64(%[len]), %[len]\n"
" jmp 10b\n"
"20: adox %[len], %[csum_odd]\n" // [len] is zero
" adcx %[csum_odd], %[csum]\n"
" adcx %[len], %[csum]"
: [csum] "+&r" (csum), [len] "+&c" (len), [csum_odd] "=&r" (csum_odd)
: [buff] "r" (buff)
: "memory");
return csum;
}
#define OPTIMISER_HIDE_VAR(x) asm volatile( "": "+r" (x))
// In principle the cpu can do 2 reads every clock.
// So combining with adds to different registers it should
// be possible to get this C code to run near to 8 bytes/clock.
// But the compiler doesn't want to 'play ball'.
// The OPTIMISER_HIDE_VAR() helps a bit
__attribute__((noinline))
unsigned long add_c_16(const unsigned char *buff, unsigned long len)
{
unsigned long csum_odd = 0;
unsigned long csum = 0;
buff += len;
len = -len;
do {
csum += *(unsigned int *)(buff + len + 0);
csum_odd += *(unsigned int *)(buff + len + 4);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
csum += *(unsigned int *)(buff + len + 8);
csum_odd += *(unsigned int *)(buff + len + 12);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
} while (len += 16);
csum += csum_odd;
return csum;
}
// As above but unrolled further
__attribute__((noinline))
unsigned long add_c_32(const unsigned char *buff, unsigned long len)
{
unsigned long csum_odd = 0;
unsigned long csum = 0;
buff += len;
len = -len;
do {
csum += *(unsigned int *)(buff + len + 0);
csum_odd += *(unsigned int *)(buff + len + 4);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
csum += *(unsigned int *)(buff + len + 8);
csum_odd += *(unsigned int *)(buff + len + 12);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
csum += *(unsigned int *)(buff + len + 16);
csum_odd += *(unsigned int *)(buff + len + 20);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
csum += *(unsigned int *)(buff + len + 24);
csum_odd += *(unsigned int *)(buff + len + 28);
OPTIMISER_HIDE_VAR(csum);
OPTIMISER_HIDE_VAR(csum_odd);
} while (len += 32);
csum += csum_odd;
return csum;
}
// Alternative C loop that sums 'val >> 32' to get the carries.
// Might be better on arm64 than x86
__attribute__((noinline))
unsigned long add_c_high(const unsigned char *buff, unsigned long len)
{
unsigned long sum_hi_odd = 0, sum_lo_odd = 0;
unsigned long sum_hi = 0, sum_lo = 0;
unsigned long val;
unsigned int i;
buff += len;
len = -len;
do {
val = *(unsigned long *)(buff + len + 0);
sum_lo += val;
sum_hi += val >> 32;
val = *(unsigned long *)(buff + len + 8);
sum_lo_odd += val;
sum_hi_odd += val >> 32;
} while (len += 16);
OPTIMISER_HIDE_VAR(sum_lo);
OPTIMISER_HIDE_VAR(sum_hi);
sum_lo += sum_lo_odd;
sum_hi += sum_hi_odd;
sum_hi += (sum_lo >> 32) - (sum_hi & 0xffffffffu);
return sum_hi + (sum_lo & 0xffffffff);
}
unsigned char buffer[8192] __attribute__((aligned(4096))) = {
0x46,0x56,0x20,0x04,0x10,0x02,0x50,0x07,0x72,0x4d,0xc6,0x3d,0x31,0x85,0x2d,0xbd,
0xe2,0xe0,0x9d,0x3e,0x3b,0x7a,0x70,0x3d,0xd2,0xfb,0x8c,0xbf,0x95,0x10,0xa9,0xbe,
0xeb,0xfd,0x29,0x40,0xd5,0x7a,0x61,0x40,0xde,0xcd,0x14,0xbf,0x81,0x1b,0xf6,0x3f,
0xbc,0xff,0x17,0x3f,0x67,0x1c,0x6e,0xbe,0xf4,0xc2,0x05,0x40,0x0b,0x13,0x78,0x3f,
0xfe,0x47,0xa7,0xbd,0x59,0xc2,0x15,0x3f,0x07,0xd0,0xea,0xbf,0x97,0xf1,0x3c,0x3f,
0xcc,0xfa,0x6b,0x40,0x72,0x6a,0x4f,0xbe,0x0b,0xe3,0x75,0x3e,0x3c,0x9b,0x0e,0xbf,
0xa9,0xeb,0xb7,0x3f,0xeb,0x4a,0xec,0x3e,0x33,0x8c,0x0c,0x3f,0x6a,0xf2,0xf3,0x3e,
0x2b,0x45,0x86,0x3f,0x83,0xce,0x8a,0x3f,0xf6,0x01,0x16,0x40,0x9c,0x17,0x47,0x3e,
0x44,0x83,0x61,0x40,0x74,0xc7,0x5c,0x3f,0xec,0xe7,0x95,0x3f,0xee,0x19,0xb5,0xbf,
0xb5,0xf0,0x03,0xbf,0xd1,0x02,0x1c,0x3e,0xa3,0x55,0x90,0xbe,0x1e,0x0b,0xa1,0xbf,
0xa4,0xa8,0xb4,0x3f,0xc6,0x68,0x91,0x3f,0xd1,0xc5,0xab,0x3f,0xb9,0x14,0x62,0x3f,
0x7c,0xe0,0xb9,0xbf,0xc0,0xa4,0xb5,0x3d,0x6f,0xd9,0xa7,0x3f,0x8f,0xc4,0xb0,0x3d,
0x48,0x2c,0x7a,0x3e,0x83,0xb2,0x3c,0x40,0x36,0xd3,0x18,0x40,0xb7,0xa9,0x57,0x40,
0xda,0xd3,0x95,0x3f,0x74,0x95,0xc0,0xbe,0xbb,0xce,0x71,0x3e,0x95,0xec,0x18,0xbf,
0x94,0x17,0xdd,0x3f,0x98,0xa5,0x02,0x3f,0xbb,0xfb,0xbb,0x3e,0xd0,0x5a,0x9c,0x3f,
0xd4,0x70,0x9b,0xbf,0x3b,0x9f,0x20,0xc0,0x84,0x5b,0x0f,0x40,0x5e,0x48,0x2c,0xbf,
};
static __attribute__((noinline)) void
print_csum(const char *msg, unsigned long csum, unsigned int *ticks)
{
static unsigned int offset;
unsigned int i;
csum = (csum & 0xffffffff) + (csum >> 32);
csum = (csum & 0xffffffff) + (csum >> 32);
csum = (csum & 0xffff) + (csum >> 16);
csum = (csum & 0xffff) + (csum >> 16);
printf(" %4lx", csum);
for (i = 0; i < PASSES; i++)
printf(" %5d", ticks[i] - offset);
printf(" %s\n", msg);
if (!offset) {
offset = ticks[0];
for (i = 1; i < PASSES; i++)
if (offset > ticks[i])
offset = ticks[i];
}
}
static inline __attribute__((always_inline)) void
do_csum(int pmc_id, const char *msg, unsigned long (*fn)(const unsigned char *, unsigned long),
const unsigned char *buf, unsigned long len)
{
unsigned int ticks[PASSES];
unsigned long csum;
unsigned int tick;
unsigned int i;
for (i = 0; i < PASSES; i++) {
tick = rdpmc(pmc_id);
csum = fn(buf, len);
ticks[i] = rdpmc(pmc_id) - tick;
}
print_csum(msg, csum, ticks);
}
int main(int argc, char **argv)
{
unsigned long csum;
unsigned int len, off;
unsigned int pmc_id = init_pmc();
len = 128;
off = 0;
for (;;) {
switch (getopt(argc, argv, "l:o:")) {
case -1:
break;
default:
exit(1);
case 'l':
len = atoi(optarg);
continue;
case 'o':
off = atoi(optarg);
continue;
}
break;
}
/* Solving partial blocks is a different problem... */
len = (len + 63) & ~63;
if (!len)
len = 64;
do_csum(pmc_id, "overhead", overhead, buffer + off, len);
do_csum(pmc_id, "adc_dec_2", adc_dec_2, buffer + off, len);
do_csum(pmc_id, "adc_dec_4", adc_dec_4, buffer + off, len);
do_csum(pmc_id, "adc_dec_8", adc_dec_8, buffer + off, len);
do_csum(pmc_id, "adc_jcxz_2", adc_jcxz_2, buffer + off, len);
do_csum(pmc_id, "adc_jcxz_4", adc_jcxz_4, buffer + off, len);
do_csum(pmc_id, "adc_2_pair", adc_2_pair, buffer + off, len);
do_csum(pmc_id, "adc_4_pair", adc_4_pair, buffer + off, len);
do_csum(pmc_id, "adcx_adox", adcx_adox, buffer + off, len);
do_csum(pmc_id, "add_c_16", add_c_16, buffer + off, len);
do_csum(pmc_id, "add_c_32", add_c_32, buffer + off, len);
do_csum(pmc_id, "add_c_high", add_c_high, buffer + off, len);
return 0;
}
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-06 22:08 ` David Laight
@ 2024-01-07 1:09 ` H. Peter Anvin
2024-01-07 11:44 ` David Laight
0 siblings, 1 reply; 34+ messages in thread
From: H. Peter Anvin @ 2024-01-07 1:09 UTC (permalink / raw)
To: David Laight, 'Linus Torvalds'
Cc: Noah Goldstein, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com
On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@ACULAB.COM> wrote:
>From: Linus Torvalds
>> Sent: 05 January 2024 18:06
>>
>> On Fri, 5 Jan 2024 at 02:41, David Laight <David.Laight@aculab.com> wrote:
>> >
>> > Interesting, I'm pretty sure trying to get two blocks of
>> > 'adc' scheduled in parallel like that doesn't work.
>>
>> You should check out the benchmark at
>>
>> https://github.com/fenrus75/csum_partial
>>
>> and see if you can improve on it. I'm including the patch (on top of
>> that code by Arjan) to implement the actual current kernel version as
>> "New version".
>
>Annoyingly (for me) you are partially right...
>
>I found where my ip checksum perf code was hiding and revisited it.
>Although I found comments elsewhere that the 'jecxz, adc, adc, lea, jmp'
>did an adc every clock, it isn't happening for me now.
>
>I'm only measuring the inner loop for multiples of 64 bytes.
>The code for lengths less than 8 bytes and for partial final words is a
>separate problem.
>The less unrolled the main loop, the less overhead there'll
>be for 'normal' sizes.
>So I've changed your '80 byte' block to 64 bytes for consistency.
>
>I'm ignoring pre-sandy bridge cpu (no split flags) and pre-broadwell
>(adc takes two clocks - although adc to alternate regs is one clock
>on sandy bridge).
>My test system is an i7-7700, I think anything from broadwell (gen 4)
>will be at least as good.
>I don't have a modern amd cpu.
>
>The best loop for 256+ bytes is an adxc/adxo one.
>However that requires the run-time patching.
>Followed by new kernel version (two blocks of 4 adc).
>The surprising one is:
> xor sum, sum
> 1: adc (buff), sum
> adc 8(buff), sum
> lea 16(buff), buff
> dec count
> jnz 1b
> adc $0, sum
>For 256 bytes it is only a couple of clocks slower.
>Maybe 10% slower for 512+ bytes.
>But it needs almost no extra code for 'normal' buffer sizes.
>By comparison the adxc/adxo one is 20% faster.
>
>The code is doing:
> old = rdpmc
> mfence
> csum = do_csum(buf, len);
> mfence
> clocks = rdpmc - old
>(That is directly reading the pmc register.)
>With 'no-op' function it takes 160 clocks (I-cache resident).
>Without the mfence 40 - but pretty much everything can execute
>after the 2nd rdpmc.
>
>I've attached my (horrid) test program.
>
> David
>
>-
>Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
>Registration No: 1397386 (Wales)
Rather than runtime patching perhaps separate paths...
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-07 1:09 ` H. Peter Anvin
@ 2024-01-07 11:44 ` David Laight
0 siblings, 0 replies; 34+ messages in thread
From: David Laight @ 2024-01-07 11:44 UTC (permalink / raw)
To: 'H. Peter Anvin', 'Linus Torvalds'
Cc: Noah Goldstein, x86@kernel.org, oe-kbuild-all@lists.linux.dev,
linux-kernel@vger.kernel.org, edumazet@google.com,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com
From: H. Peter Anvin
> Sent: 07 January 2024 01:09
>
> On January 6, 2024 2:08:48 PM PST, David Laight <David.Laight@ACULAB.COM> wrote:
...
> >The best loop for 256+ bytes is an adxc/adxo one.
> >However that requires the run-time patching.
...
> Rather than runtime patching perhaps separate paths...
It will need to detect the cpu type earlier, so a static
branch is probably enough.
Easier than substituting the entire code block.
I think it is silvermont and knight's landing that have
a 4 clock penalty for 64bit adxc (Intel atom family).
That might only be a decode penalty, so doesn't affect
the loop 'that much' (adc is 2 clocks on those cpu).
So probably not actually worth doing a run-time
performance check.
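The static branch shape would be something like this sketch (the two
loop functions being placeholders for the adc and adxc/adxo variants):

	DEFINE_STATIC_KEY_FALSE(csum_use_adx);

	static int __init csum_select_loop(void)
	{
		if (boot_cpu_has(X86_FEATURE_ADX))
			static_branch_enable(&csum_use_adx);
		return 0;
	}
	arch_initcall(csum_select_loop);

	__wsum csum_partial(const void *buff, int len, __wsum sum)
	{
		if (static_branch_likely(&csum_use_adx))
			return csum_partial_adx(buff, len, sum);
		return csum_partial_adc(buff, len, sum);
	}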
I might 'cook up' a full checksum function later.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 34+ messages in thread
* RE: x86/csum: Remove unnecessary odd handling
2024-01-06 10:26 ` Eric Dumazet
2024-01-06 19:32 ` Linus Torvalds
@ 2024-01-07 12:11 ` David Laight
1 sibling, 0 replies; 34+ messages in thread
From: David Laight @ 2024-01-07 12:11 UTC (permalink / raw)
To: 'Eric Dumazet', Linus Torvalds
Cc: Noah Goldstein, kernel test robot, x86@kernel.org,
oe-kbuild-all@lists.linux.dev, linux-kernel@vger.kernel.org,
tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, hpa@zytor.com
From: Eric Dumazet
> Sent: 06 January 2024 10:26
...
> On a related note, at least with clang, I found that csum_ipv6_magic()
> is needlessly using temporary on-stack storage,
> showing a stall on Cascade Lake unless I am patching add32_with_carry() :
>
> diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
> index 407beebadaf45a748f91a36b78bd1d023449b132..c3d6f47626c70d81f0c2ba401d85050b09a39922 100644
> --- a/arch/x86/include/asm/checksum_64.h
> +++ b/arch/x86/include/asm/checksum_64.h
> @@ -171,7 +171,7 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b)
> asm("addl %2,%0\n\t"
> "adcl $0,%0"
> : "=r" (a)
> - : "0" (a), "rm" (b));
> + : "0" (a), "r" (b));
> return a;
> }
Try replacing:
return csum_fold(
(__force __wsum)add32_with_carry(sum64 & 0xffffffff, sum64>>32));
with:
return csum_fold((__force __wsum)((sum64 + ror64(sum64, 32)) >> 32));
Should be less instructions as well.
(shift, add, shift v shift, mov, and, add, add)
Although both might be 3 clocks.
The best C version of csum_fold (from IIRC arc) is also likely to be
better than the x86 asm one - certainly no worse.
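From memory the arc version is essentially (modulo the exact __force
casts sparse wants):

	static inline __sum16 csum_fold(__wsum sum)
	{
		u32 s = (__force u32)sum;
		u32 r = s << 16 | s >> 16;	/* rotate by 16 */

		s = ~s;
		s -= r;
		return (__force __sum16)(s >> 16);
	}

Two shifts, an or, a not and a subtract - no carry-flag dependency, so
the compiler can schedule it however it likes.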
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 34+ messages in thread
* Re: x86/csum: Remove unnecessary odd handling
@ 2025-10-24 22:12 Ywe Cærlyn
0 siblings, 0 replies; 34+ messages in thread
From: Ywe Cærlyn @ 2025-10-24 22:12 UTC (permalink / raw)
To: linux-kernel
Xcuse me, but isn't it time people show skill in high-level languages
running like low-level, instead of the LDA/STA MOV joystick thing.
3 letter abbrev "God" thing. Some people complain much about "God" but
do nothing about it!?
My annoyed input.
^ permalink raw reply [flat|nested] 34+ messages in thread
end of thread, other threads:[~2025-10-24 22:12 UTC | newest]
Thread overview: 34+ messages
[not found] <20230628020657.957880-1-goldstein.w.n@gmail.com>
2023-06-28 9:12 ` x86/csum: Remove unnecessary odd handling Borislav Petkov
2023-06-28 15:32 ` Noah Goldstein
2023-06-28 17:44 ` Linus Torvalds
2023-06-28 18:34 ` Noah Goldstein
2023-06-28 20:02 ` Linus Torvalds
2023-06-29 14:04 ` David Laight
2023-06-29 14:27 ` David Laight
2023-09-01 22:21 ` Noah Goldstein
2023-09-06 13:49 ` David Laight
2023-09-06 14:38 ` David Laight
2023-09-20 19:20 ` Noah Goldstein
2023-09-20 19:23 ` Noah Goldstein
2023-09-23 3:24 ` kernel test robot
2023-09-23 14:05 ` Noah Goldstein
2023-09-23 21:13 ` David Laight
2023-09-24 14:35 ` Noah Goldstein
2023-12-23 22:18 ` Noah Goldstein
2024-01-04 23:28 ` Noah Goldstein
2024-01-04 23:34 ` Dave Hansen
2024-01-04 23:36 ` Linus Torvalds
2024-01-05 0:33 ` Linus Torvalds
2024-01-05 10:41 ` David Laight
2024-01-05 16:12 ` David Laight
2024-01-05 18:05 ` Linus Torvalds
2024-01-05 23:52 ` David Laight
2024-01-06 0:18 ` Linus Torvalds
2024-01-06 10:26 ` Eric Dumazet
2024-01-06 19:32 ` Linus Torvalds
2024-01-07 12:11 ` David Laight
2024-01-06 22:08 ` David Laight
2024-01-07 1:09 ` H. Peter Anvin
2024-01-07 11:44 ` David Laight
2023-09-24 14:35 ` Noah Goldstein
2025-10-24 22:12 Ywe Cærlyn