All of lore.kernel.org
 help / color / mirror / Atom feed
From: w@1wt.eu (Willy Tarreau)
To: linux-arm-kernel@lists.infradead.org
Subject: gcc miscompiles csum_tcpudp_magic() on ARMv5
Date: Thu, 12 Dec 2013 18:11:08 +0100	[thread overview]
Message-ID: <20131212171108.GA2337@1wt.eu> (raw)
In-Reply-To: <20131212164748.GS4360@n2100.arm.linux.org.uk>

On Thu, Dec 12, 2013 at 04:47:48PM +0000, Russell King - ARM Linux wrote:
> > Then changing the type of the function argument would probably be safer!
> 
> Actually, I think we can do a bit better with this code.  We really don't
> need much of this messing around here, we can combine some of these steps.
> 
> We have:
> 
> 16-bit protocol in host endian
> 16-bit length in host endian
> 
> and we need to combine them into a 32-bit checksum which is then
> subsequently folded down to 16-bits by adding the top and bottom halves.
> 
> Now, what we can do is this:
> 
> 1. Construct a combined 32-bit protocol and length:
> 
> 	unsigned lenproto = len | proto << 16;
> 
> 2. Pass this into the assembly thusly:
> 
>                 __asm__(
>                 "adds   %0, %1, %2      @ csum_tcpudp_nofold    \n\t"
>                 "adcs   %0, %0, %3                              \n\t"
> #ifdef __ARMEB__
>                 "adcs   %0, %0, %4                              \n\t"
> #else
>                 "adcs   %0, %0, %4, ror #8                      \n\t"
> #endif
>                 "adc    %0, %0, #0"
>                 : "=&r"(sum)
>                 : "r" (sum), "r" (daddr), "r" (saddr), "r" (lenprot)
>                 : "cc");
> 
> with no swabbing at this stage.  Well, where do we get the endian
> conversion?  See that ror #8 - that a 32 bit rotate by 8 bits.  As
> these are two 16-bit quantities, we end up with this:
> 
> original:
> 	31..24	23..16	15..8	7..0
> 	len_h	len_l	pro_h	pro_l
> 
> accumulated:
> 	31..24	23..16	15..8	7..0
> 	pro_l	len_h	len_l	pro_h
> 
> And now when we fold it down to 16-bit:
> 
> 			15..8	7..0
> 			len_l	pro_h
> 			pro_l	len_h

Amusing, I've used the same optimization yesterday when computing a
TCP pseudo-header checksum.

Another thing that can be done to improve the folding of the 16-bit
checksum is to swap the values to be added, sum them and only keep
the high half integer which already contains the carry. At least on
x86 I save some cycles doing this :

              31:24  23:16  15:8  7:0
     sum32 =    D      C      B    A

     To fold this into 16-bit at a time, I just do this :

                   31:24     23:16          15:8  7:0
     sum32           D         C              B    A
  +  sum32swapped    B         A              D    C
  =                 A+B  C+A+carry(B+D/C+A)  B+D  C+A

so just take the upper result and you get the final 16-bit word at
once.

In C it does :

       fold16 = (((sum32 >> 16) | (sum32 << 16)) + sum32) >> 16

When the CPU has a rotate instruction, it's fast :-)

Cheers,
Willy

  reply	other threads:[~2013-12-12 17:11 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-12-12 12:14 gcc miscompiles csum_tcpudp_magic() on ARMv5 Maxime Bizon
2013-12-12 12:40 ` Russell King - ARM Linux
2013-12-12 13:36   ` Maxime Bizon
2013-12-12 13:48     ` Måns Rullgård
2013-12-12 14:10       ` Maxime Bizon
2013-12-12 14:19         ` Willy Tarreau
2013-12-12 14:28           ` Maxime Bizon
2013-12-12 14:42             ` Måns Rullgård
2013-12-12 14:52               ` Maxime Bizon
2013-12-12 14:58                 ` Måns Rullgård
2013-12-12 15:00                 ` Russell King - ARM Linux
2013-12-12 15:26                   ` Maxime Bizon
2013-12-12 15:07               ` Willy Tarreau
2013-12-12 15:18                 ` Måns Rullgård
2013-12-12 15:28                   ` Willy Tarreau
2013-12-12 15:43                     ` Russell King - ARM Linux
2013-12-12 15:50                       ` Måns Rullgård
2013-12-12 14:37           ` Måns Rullgård
2013-12-12 14:40             ` Maxime Bizon
2013-12-12 14:47               ` Måns Rullgård
2013-12-12 14:26         ` Måns Rullgård
2013-12-12 14:48     ` Russell King - ARM Linux
2013-12-12 15:00       ` Måns Rullgård
2013-12-12 15:04       ` Maxime Bizon
2013-12-12 15:41         ` Russell King - ARM Linux
2013-12-12 16:04           ` Måns Rullgård
2013-12-12 16:04           ` Willy Tarreau
2013-12-12 16:47             ` Russell King - ARM Linux
2013-12-12 17:11               ` Willy Tarreau [this message]
2013-12-12 17:20                 ` Russell King - ARM Linux
2013-12-12 17:35                   ` Willy Tarreau
2013-12-12 18:07                   ` Nicolas Pitre
2013-12-12 22:30               ` Maxime Bizon
2013-12-12 22:36                 ` Russell King - ARM Linux
2013-12-12 22:44                   ` Maxime Bizon
2013-12-12 22:48                     ` Russell King - ARM Linux
2013-12-12 17:34           ` Maxime Bizon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20131212171108.GA2337@1wt.eu \
    --to=w@1wt.eu \
    --cc=linux-arm-kernel@lists.infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.