From mboxrd@z Thu Jan  1 00:00:00 1970
From: w@1wt.eu (Willy Tarreau)
Date: Thu, 12 Dec 2013 18:35:37 +0100
Subject: gcc miscompiles csum_tcpudp_magic() on ARMv5
In-Reply-To: <20131212172049.GU4360@n2100.arm.linux.org.uk>
References: <1386850444.22947.46.camel@sakura.staff.proxad.net>
 <20131212124015.GL4360@n2100.arm.linux.org.uk>
 <1386855390.22947.68.camel@sakura.staff.proxad.net>
 <20131212144853.GO4360@n2100.arm.linux.org.uk>
 <1386860657.25449.3.camel@sakura.staff.proxad.net>
 <20131212154110.GQ4360@n2100.arm.linux.org.uk>
 <20131212160426.GD31816@1wt.eu>
 <20131212164748.GS4360@n2100.arm.linux.org.uk> <20131212171108.GA2337@1wt.eu>
 <20131212172049.GU4360@n2100.arm.linux.org.uk>
Message-ID: <20131212173537.GB2337@1wt.eu>
To: linux-arm-kernel@lists.infradead.org
List-Id: linux-arm-kernel.lists.infradead.org

On Thu, Dec 12, 2013 at 05:20:49PM +0000, Russell King - ARM Linux wrote:
> On Thu, Dec 12, 2013 at 06:11:08PM +0100, Willy Tarreau wrote:
> > Another thing that can be done to improve the folding of the 16-bit
> > checksum is to swap the values to be added, sum them and only keep
> > the high half integer which already contains the carry. At least on
> > x86 I save some cycles doing this :
> > 
> >               31:24  23:16  15:8  7:0
> >      sum32 =    D      C      B    A
> > 
> >      To fold this into 16 bits in one step, I just do this:
> > 
> >                    31:24     23:16          15:8  7:0
> >      sum32           D         C              B    A
> >   +  sum32swapped    B         A              D    C
> >   =                 B+D  C+A+carry(B+D/C+A)  B+D  C+A
> > 
> > so just take the upper result and you get the final 16-bit word at
> > once.
> > 
> > In C this becomes:
> > 
> >        fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16
> > 
> > When the CPU has a rotate instruction, it's fast :-)
> 
> Indeed - and if your CPU can do the rotate and add at the same time,
> it's just a single instruction, and it ends up looking remarkably
> similar to this:
> 
> static inline __sum16 csum_fold(__wsum sum)
> {
>         __asm__(
>         "add    %0, %1, %1, ror #16     @ csum_fold"
>         : "=r" (sum)
>         : "r" (sum)
>         : "cc");
>         return (__force __sum16)(~(__force u32)sum >> 16);
> }

Marvelous :-)
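
For anyone who wants to convince themselves of the equivalence outside
the kernel, here is a minimal userspace sketch in plain C. It compares
the rotate-and-add fold quoted above with the conventional two-step
fold. The names fold16_rot, fold16_ref and fold16_selfcheck are made up
for illustration and are not kernel functions; the sketch assumes
uint32_t arithmetic that wraps modulo 2^32, and it leaves out the final
complement that csum_fold() applies.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate-and-add fold from the quoted formula; relies on uint32_t wraparound. */
static inline uint16_t fold16_rot(uint32_t sum32)
{
    uint32_t swapped = (sum32 >> 16) | (sum32 << 16);  /* rotate by 16 bits */
    return (uint16_t)((swapped + sum32) >> 16);        /* high half = sum + carry */
}

/* Conventional two-step fold, used here only as a reference. */
static inline uint16_t fold16_ref(uint32_t sum32)
{
    uint32_t s = (sum32 & 0xffffu) + (sum32 >> 16);    /* may need 17 bits */
    s = (s & 0xffffu) + (s >> 16);                     /* add the carry back in */
    return (uint16_t)s;
}

/* Hypothetical helper: asserts that both folds agree on a few edge cases. */
static inline void fold16_selfcheck(void)
{
    static const uint32_t samples[] = {
        0u, 0xffffu, 0x0001ffffu, 0x12345678u, 0xffff0001u, 0xffffffffu
    };
    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(fold16_rot(samples[i]) == fold16_ref(samples[i]));
}
```

The interesting inputs are those where the two halves overflow 16 bits
when added, e.g. 0x0001ffff: the carry out of the upper half is lost at
bit 32, but the same carry out of the lower half has already been added
into the upper half, so nothing is missing from the result.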

Willy