From mboxrd@z Thu Jan  1 00:00:00 1970
From: w@1wt.eu (Willy Tarreau)
Date: Thu, 12 Dec 2013 18:11:08 +0100
Subject: gcc miscompiles csum_tcpudp_magic() on ARMv5
In-Reply-To: <20131212164748.GS4360@n2100.arm.linux.org.uk>
References: <1386850444.22947.46.camel@sakura.staff.proxad.net>
 <20131212124015.GL4360@n2100.arm.linux.org.uk>
 <1386855390.22947.68.camel@sakura.staff.proxad.net>
 <20131212144853.GO4360@n2100.arm.linux.org.uk>
 <1386860657.25449.3.camel@sakura.staff.proxad.net>
 <20131212154110.GQ4360@n2100.arm.linux.org.uk>
 <20131212160426.GD31816@1wt.eu>
 <20131212164748.GS4360@n2100.arm.linux.org.uk>
Message-ID: <20131212171108.GA2337@1wt.eu>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
> 
> Actually, I think we can do a bit better with this code. We really don't
> need much of this messing around here; we can combine some of these steps.
> 
> We have:
> 
> 	16-bit protocol in host endian
> 	16-bit length in host endian
> 
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16 bits by adding the top and bottom halves.
> 
> Now, what we can do is this:
> 
> 1. Construct a combined 32-bit protocol and length:
> 
> 	unsigned lenprot = len | proto << 16;
> 
> 2. Pass this into the assembly thusly:
> 
> 	__asm__(
> 	"adds	%0, %1, %2	@ csum_tcpudp_nofold	\n\t"
> 	"adcs	%0, %0, %3				\n\t"
> #ifdef __ARMEB__
> 	"adcs	%0, %0, %4				\n\t"
> #else
> 	"adcs	%0, %0, %4, ror #8			\n\t"
> #endif
> 	"adc	%0, %0, #0"
> 	: "=&r" (sum)
> 	: "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
> 	: "cc");
> 
> with no swabbing at this stage. Well, where do we get the endian
> conversion? See that ror #8 - that's a 32-bit rotate by 8 bits.
> As these are two 16-bit quantities, we end up with this:
> 
> original:
> 	31..24	23..16	15..8	7..0
> 	pro_h	pro_l	len_h	len_l
> 
> accumulated:
> 	31..24	23..16	15..8	7..0
> 	len_l	pro_h	pro_l	len_h
> 
> And now when we fold it down to 16-bit:
> 
> 	15..8	7..0
> 	len_l	pro_h
> 	pro_l	len_h

Amusingly, I used the same optimization yesterday when computing a TCP
pseudo-header checksum.

Another thing that can be done to speed up the folding down to a 16-bit
checksum is to add the 32-bit sum to a half-swapped copy of itself and
keep only the upper half, which already contains the carry. At least on
x86 this saves me a few cycles:

	        31:24  23:16  15:8  7:0
	sum32 =   D      C     B     A

To fold this down to 16 bits in one go, I just do this:

	                31:24   23:16   15:8  7:0
	  sum32           D       C      B     A
	+ sum32swapped    B       A      D     C
	  =             B+D+cy  A+C+cy  B+D   A+C

so just take the upper half and you get the final 16-bit word at once,
with the carry from the lower half already added in. In C it reads:

	fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16

When the CPU has a rotate instruction, it's fast :-)

Cheers,
Willy