* [RFC] csum experts, csum_replace2() is too expensive
@ 2014-03-20 18:49 Eric Dumazet
2014-03-21 0:13 ` Herbert Xu
` (2 more replies)
0 siblings, 3 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-20 18:49 UTC (permalink / raw)
To: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton; +Cc: netdev
csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
way faster (16 cycles).
csum_partial() is not really meant for checksumming just 8 bytes!
Any idea how to make the thing really fast, as intended?
I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, which
is insane...
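For reference, the csum_replace2() path being measured boils down to
roughly this (a sketch of the include/net/checksum.h helpers of that
time; the 8-byte diff[] array is what ends up going through the full
csum_partial() machinery):
static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
{
	__be32 diff[] = { ~from, to };

	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
}
static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
{
	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
}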
The following patch might be the fastest option?
(At this point we have already validated the IP checksum.)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..86c924c16f3c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
- csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+ ip_send_check(&iph);
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-20 18:49 [RFC] csum experts, csum_replace2() is too expensive Eric Dumazet
@ 2014-03-21 0:13 ` Herbert Xu
2014-03-21 0:54 ` Eric Dumazet
2014-03-21 1:56 ` Andi Kleen
2014-03-21 18:07 ` David Miller
2 siblings, 1 reply; 26+ messages in thread
From: Herbert Xu @ 2014-03-21 0:13 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev
On Thu, Mar 20, 2014 at 11:49:01AM -0700, Eric Dumazet wrote:
> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?
Your patch makes sense to me. However does it actually help
your 1% in the GRO profile?
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 0:13 ` Herbert Xu
@ 2014-03-21 0:54 ` Eric Dumazet
2014-03-21 12:50 ` Herbert Xu
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 0:54 UTC (permalink / raw)
To: Herbert Xu; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev
On Fri, 2014-03-21 at 08:13 +0800, Herbert Xu wrote:
> Your patch makes sense to me. However does it actually help
> your 1% in the GRO profile?
Well, I caught this strange csum_partial() presence in a profile,
but my question was more general than the GRO use case.
(BTW the patch didn't even compile ;) )
I guess that in the GRO case, the following would be even faster...
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..b8fe5e8a02dd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1361,6 +1361,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
if (!ops || !ops->callbacks.gro_receive)
goto out_unlock;
+ /* GRO is limited to IPv4 with no options : ihl = 5 */
if (*(u8 *)iph != 0x45)
goto out_unlock;
@@ -1434,8 +1435,10 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
- csum_replace2(&iph->check, iph->tot_len, newlen);
+ /* GRO is limited to IPv4 with no options : ihl = 5 */
iph->tot_len = newlen;
+ iph->check = 0;
+ iph->check = ip_fast_csum((u8 *)iph, 5);
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-20 18:49 [RFC] csum experts, csum_replace2() is too expensive Eric Dumazet
2014-03-21 0:13 ` Herbert Xu
@ 2014-03-21 1:56 ` Andi Kleen
2014-03-21 2:51 ` Eric Dumazet
2014-03-21 12:50 ` Eric Dumazet
2014-03-21 18:07 ` David Miller
2 siblings, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2014-03-21 1:56 UTC (permalink / raw)
To: Eric Dumazet
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
netdev
Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
Couldn't it just be the cache miss?
-Andi
--
ak@linux.intel.com -- Speaking for myself only
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 1:56 ` Andi Kleen
@ 2014-03-21 2:51 ` Eric Dumazet
2014-03-21 12:50 ` Eric Dumazet
1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 2:51 UTC (permalink / raw)
To: Andi Kleen
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
netdev
On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
>
> Couldn't it just be the cache miss?
Well, no, because we aggregate frames that were already fully checked
(by the GRO engine/layers) in the same NAPI run.
In this particular case, it seems we aggregate only a few frames per run
(it's a 40Gb Intel NIC, apparently tuned for low latency).
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 0:54 ` Eric Dumazet
@ 2014-03-21 12:50 ` Herbert Xu
0 siblings, 0 replies; 26+ messages in thread
From: Herbert Xu @ 2014-03-21 12:50 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev
On Thu, Mar 20, 2014 at 05:54:17PM -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 08:13 +0800, Herbert Xu wrote:
>
> > Your patch makes sense to me. However does it actually help
> > your 1% in the GRO profile?
>
> Well, I caught this strange csum_partial() presence in a profile,
> but my question was more general than the GRO use case.
>
> (BTW the patch didn't even compile ;) )
>
> I guess that in GRO case, the following would be even faster...
Agreed.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 1:56 ` Andi Kleen
2014-03-21 2:51 ` Eric Dumazet
@ 2014-03-21 12:50 ` Eric Dumazet
2014-03-21 13:28 ` Andi Kleen
` (3 more replies)
1 sibling, 4 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 12:50 UTC (permalink / raw)
To: Andi Kleen, H. Peter Anvin
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
netdev, linux-kernel@vger.kernel.org
On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
>
> Couldn't it just be the cache miss?
Or the fact that we mix 16-bit stores and 32-bit loads?
iph->tot_len = newlen;
iph->check = 0;
iph->check = ip_fast_csum(iph, 5);
-> see the following perf annotation:
: ffffffff81538e70 <inet_gro_complete>:
0.49 : ffffffff81538e70: callq ffffffff81578c80 <__entry_text_start>
0.49 : ffffffff81538e75: push %rbp
1.46 : ffffffff81538e76: movzwl 0x60(%rdi),%edx
0.00 : ffffffff81538e7a: movslq %esi,%rax
0.00 : ffffffff81538e7d: add 0xc8(%rdi),%rax
1.46 : ffffffff81538e84: mov %rsp,%rbp
0.00 : ffffffff81538e87: sub %esi,%edx
0.00 : ffffffff81538e89: rol $0x8,%dx
0.00 : ffffffff81538e8d: movzbl 0x9(%rax),%ecx
0.98 : ffffffff81538e91: mov %dx,0x2(%rax) iph->tot_len = newlen;
0.49 : ffffffff81538e95: xor %edx,%edx
0.00 : ffffffff81538e97: mov %dx,0xa(%rax) iph->check = 0;
0.00 : ffffffff81538e9b: mov (%rax),%edx // 32bit load -> stall
86.34 : ffffffff81538e9d: add 0x4(%rax),%edx
0.49 : ffffffff81538ea0: adc 0x8(%rax),%edx
0.49 : ffffffff81538ea3: adc 0xc(%rax),%edx
0.98 : ffffffff81538ea6: adc 0x10(%rax),%edx
0.49 : ffffffff81538ea9: adc $0x0,%edx
0.00 : ffffffff81538eac: mov %edx,%r8d
0.49 : ffffffff81538eaf: xor %dx,%dx
0.00 : ffffffff81538eb2: shl $0x10,%r8d
0.98 : ffffffff81538eb6: add %r8d,%edx
0.98 : ffffffff81538eb9: adc $0xffff,%edx
0.98 : ffffffff81538ebf: not %edx
0.00 : ffffffff81538ec1: shr $0x10,%edx
0.49 : ffffffff81538ec4: mov %dx,0xa(%rax)
0.00 : ffffffff81538ec8: movslq %ecx,%rax
0.00 : ffffffff81538ecb: mov -0x7e734f40(,%rax,8),%rax
0.00 : ffffffff81538ed3: test %rax,%rax
0.00 : ffffffff81538ed6: je ffffffff81538ef0 <inet_gro_complete+0x80>
BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint?
It looks like a barrier() would be more appropriate.
The following patch seems to help, but the stall seems to come from
writing iph->check (16 bits) right before doing the checksum with
32-bit loads.
Note that we could share the same ip_fast_csum() implementation between
x86 32-bit and 64-bit...
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..a81cc3a5facc 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,6 +46,24 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
unsigned int sum;
+ /*
+ * Callers might clear iph->check before calling us; make sure
+ * the compiler won't mess things up...
+ */
+ barrier();
+
+ if (__builtin_constant_p(ihl) && ihl == 5) {
+ asm(" movl (%[iph]), %[sum]\n"
+ " addl 4(%[iph]), %[sum]\n"
+ " adcl 8(%[iph]), %[sum]\n"
+ " adcl 12(%[iph]), %[sum]\n"
+ " adcl 16(%[iph]), %[sum]\n"
+ " adcl $0, %[sum]\n"
+ : [sum] "=r" (sum)
+ : [iph] "r" (iph)
+ );
+ return csum_fold(sum);
+ }
asm(" movl (%1), %0\n"
" subl $4, %2\n"
" jbe 2f\n"
@@ -68,7 +86,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
will assume they contain their original values. */
: "=r" (sum), "=r" (iph), "=r" (ihl)
: "1" (iph), "2" (ihl)
- : "memory");
+ );
return (__force __sum16)sum;
}
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..90aa562dfbf5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
- csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+ iph->check = 0;
+ iph->check = ip_fast_csum((u8 *)iph, 5);
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 12:50 ` Eric Dumazet
@ 2014-03-21 13:28 ` Andi Kleen
2014-03-21 13:32 ` Eric Dumazet
` (2 subsequent siblings)
3 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2014-03-21 13:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: Andi Kleen, H. Peter Anvin, Patrick McHardy, Herbert Xu,
H.K. Jerry Chu, Michael Dalton, netdev,
linux-kernel@vger.kernel.org
On Fri, Mar 21, 2014 at 05:50:50AM -0700, Eric Dumazet wrote:
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?
It should cause a small stall from the store-to-load forwarding
failure, but 1% of a serious workload would be surprising.
Are you sure it's not some skid effect?
-Andi
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 12:50 ` Eric Dumazet
2014-03-21 13:28 ` Andi Kleen
@ 2014-03-21 13:32 ` Eric Dumazet
2014-03-21 13:47 ` Eric Dumazet
2014-03-21 14:14 ` David Laight
2014-03-21 18:52 ` David Miller
3 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 13:32 UTC (permalink / raw)
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu,
Michael Dalton, netdev, linux-kernel@vger.kernel.org
On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> iph->tot_len = newlen;
> iph->check = 0;
> iph->check = ip_fast_csum(iph, 5);
Yep, definitely. Using 16-bit loads in ip_fast_csum() totally removes
the stall. I no longer see inet_gro_complete() in perf top...
+ if (__builtin_constant_p(ihl) && ihl == 5) {
+ asm(" movw (%[iph]), %w[sum]\n" /* ihl/version/tos */
+ " addw 2(%[iph]), %w[sum]\n" /* tot_len */
+ " adcw 8(%[iph]), %w[sum]\n" /* ttl/protocol */
+ " adcw 10(%[iph]), %w[sum]\n" /* check */
+ " adcl 4(%[iph]), %[sum]\n" /* id/frag_off */
+ " adcl 12(%[iph]), %[sum]\n" /* saddr */
+ " adcl 16(%[iph]), %[sum]\n" /* daddr */
+ " adcl $0, %[sum]\n"
+ : [sum] "=r" (sum)
+ : [iph] "r" (iph)
+ );
+ return csum_fold(sum);
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 13:32 ` Eric Dumazet
@ 2014-03-21 13:47 ` Eric Dumazet
2014-03-21 13:56 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 13:47 UTC (permalink / raw)
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu,
Michael Dalton, netdev, linux-kernel@vger.kernel.org
On Fri, 2014-03-21 at 06:32 -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
>
> > Or the fact that we mix 16 bit stores and 32bit loads ?
> >
> > iph->tot_len = newlen;
> > iph->check = 0;
> > iph->check = ip_fast_csum(iph, 5);
>
> Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes the
> stall. I no longer see inet_gro_complete() in perf top...
>
> + if (__builtin_constant_p(ihl) && ihl == 5) {
> + asm(" movw (%[iph]), %w[sum]\n" /* ihl/version/tos */
> + " addw 2(%[iph]), %w[sum]\n" /* tot_len */
> + " adcw 8(%[iph]), %w[sum]\n" /* ttl/protocol */
> + " adcw 10(%[iph]), %w[sum]\n" /* check */
> + " adcl 4(%[iph]), %[sum]\n" /* id/frag_off */
> + " adcl 12(%[iph]), %[sum]\n" /* saddr */
> + " adcl 16(%[iph]), %[sum]\n" /* daddr */
> + " adcl $0, %[sum]\n"
> + : [sum] "=r" (sum)
> + : [iph] "r" (iph)
> + );
> + return csum_fold(sum);
>
Another idea would be to move the ip_fast_csum() call to the end of
inet_gro_complete().
I'll try this:
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..0ca8f350a532 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
- csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+ iph->check = 0;
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
@@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
* inet_gro_receive().
*/
err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+ iph->check = ip_fast_csum((u8 *)iph, 5);
out_unlock:
rcu_read_unlock();
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 13:47 ` Eric Dumazet
@ 2014-03-21 13:56 ` Eric Dumazet
0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 13:56 UTC (permalink / raw)
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu,
Michael Dalton, netdev, linux-kernel@vger.kernel.org
On Fri, 2014-03-21 at 06:47 -0700, Eric Dumazet wrote:
> Another idea would be to move the ip_fast_csum() call at the end of
> inet_gro_complete()
>
> I'll try this :
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 8c54870db792..0ca8f350a532 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> int proto = iph->protocol;
> int err = -ENOSYS;
>
> - csum_replace2(&iph->check, iph->tot_len, newlen);
> iph->tot_len = newlen;
> + iph->check = 0;
>
> rcu_read_lock();
> ops = rcu_dereference(inet_offloads[proto]);
> @@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> * inet_gro_receive().
> */
> err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
> + iph->check = ip_fast_csum((u8 *)iph, 5);
>
> out_unlock:
> rcu_read_unlock();
>
This doesn't help... same stall.
Looks like the best way is to use 16-bit loads in ip_fast_csum()
for the ihl=5 case.
* RE: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 12:50 ` Eric Dumazet
2014-03-21 13:28 ` Andi Kleen
2014-03-21 13:32 ` Eric Dumazet
@ 2014-03-21 14:14 ` David Laight
2014-03-21 18:52 ` David Miller
3 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2014-03-21 14:14 UTC (permalink / raw)
To: 'Eric Dumazet', Andi Kleen, H. Peter Anvin
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
netdev, linux-kernel@vger.kernel.org
From: Eric Dumazet
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?
The correct constraint would be one that tells gcc that the asm
accesses the 20 bytes at the source pointer.
Without it, gcc won't necessarily write out the values before
the asm instructions execute.
David
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-20 18:49 [RFC] csum experts, csum_replace2() is too expensive Eric Dumazet
2014-03-21 0:13 ` Herbert Xu
2014-03-21 1:56 ` Andi Kleen
@ 2014-03-21 18:07 ` David Miller
2014-03-21 18:30 ` Eric Dumazet
2014-03-24 2:51 ` [PATCH net-next] net: optimize csum_replace2() Eric Dumazet
2 siblings, 2 replies; 26+ messages in thread
From: David Miller @ 2014-03-21 18:07 UTC (permalink / raw)
To: eric.dumazet; +Cc: kaber, herbert, hkchu, mwdalton, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 Mar 2014 11:49:01 -0700
> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?
>
> (At this point we already have validated IP checksum)
...
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> int proto = iph->protocol;
> int err = -ENOSYS;
>
> - csum_replace2(&iph->check, iph->tot_len, newlen);
> iph->tot_len = newlen;
> + ip_send_check(&iph);
Yeah, the csum_replace*() helpers are extremely suboptimal.
We should be able to cons up something cheap like the trick that
ip_decrease_ttl() uses.
https://tools.ietf.org/html/rfc1624
https://tools.ietf.org/html/rfc1141
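For reference, ip_decrease_ttl() does its incremental fixup along these
lines (a sketch of the include/net/ip.h helper of that time; decrementing
the TTL subtracts 0x0100 from one 16-bit word of the header, so the
checksum can be patched by adding htons(0x0100) with an end-around carry):
static inline int ip_decrease_ttl(struct iphdr *iph)
{
	u32 check = (__force u32)iph->check;

	check += (__force u32)htons(0x0100);
	iph->check = (__force __sum16)(check + (check >= 0xFFFF));
	return --iph->ttl;
}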
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 18:07 ` David Miller
@ 2014-03-21 18:30 ` Eric Dumazet
2014-03-24 2:51 ` [PATCH net-next] net: optimize csum_replace2() Eric Dumazet
1 sibling, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-21 18:30 UTC (permalink / raw)
To: David Miller; +Cc: kaber, herbert, hkchu, mwdalton, netdev
On Fri, 2014-03-21 at 14:07 -0400, David Miller wrote:
> Yeah the csum_replace*() are extremely suboptimal.
>
> We should be able to cons up something cheap like the trick that
> ip_decrease_ttl() uses.
>
> https://tools.ietf.org/html/rfc1624
> https://tools.ietf.org/html/rfc1141
Thanks for the pointers, I'll cook a patch.
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 12:50 ` Eric Dumazet
` (2 preceding siblings ...)
2014-03-21 14:14 ` David Laight
@ 2014-03-21 18:52 ` David Miller
2014-03-24 2:50 ` Eric Dumazet
3 siblings, 1 reply; 26+ messages in thread
From: David Miller @ 2014-03-21 18:52 UTC (permalink / raw)
To: eric.dumazet
Cc: andi, hpa, kaber, herbert, hkchu, mwdalton, netdev, linux-kernel
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Mar 2014 05:50:50 -0700
> It looks like a barrier() would be more appropriate.
barrier() == __asm__ __volatile__(:::"memory")
* Re: [RFC] csum experts, csum_replace2() is too expensive
2014-03-21 18:52 ` David Miller
@ 2014-03-24 2:50 ` Eric Dumazet
2014-03-24 10:30 ` David Laight
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 2:50 UTC (permalink / raw)
To: David Miller
Cc: andi, hpa, kaber, herbert, hkchu, mwdalton, netdev, linux-kernel
On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Mar 2014 05:50:50 -0700
>
> > It looks like a barrier() would be more appropriate.
>
> barrier() == __asm__ __volatile__(:::"memory")
Indeed, but now that you mention it, ip_fast_csum() does not use the
volatile keyword on x86_64, and has no "m" constraint either.
This means that for the following hypothetical networking code:
void foobar(struct iphdr *iph, __be16 newlen, __be16 newid)
{
iph->tot_len = newlen;
iph->check = 0;
iph->check = ip_fast_csum((u8 *)iph, 5);
pr_err("%p\n", iph);
iph->id = newid;
iph->check = 0;
iph->check = ip_fast_csum((u8 *)iph, 5);
}
ip_fast_csum() is done only _once_.
The following patch seems needed. That's yet another call for x86 code factorization...
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..c67778544880 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,7 +46,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
unsigned int sum;
- asm(" movl (%1), %0\n"
+ asm volatile(" movl (%1), %0\n"
" subl $4, %2\n"
" jbe 2f\n"
" addl 4(%1), %0\n"
* [PATCH net-next] net: optimize csum_replace2()
2014-03-21 18:07 ` David Miller
2014-03-21 18:30 ` Eric Dumazet
@ 2014-03-24 2:51 ` Eric Dumazet
2014-03-24 4:20 ` David Miller
2014-03-24 10:22 ` David Laight
1 sibling, 2 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 2:51 UTC (permalink / raw)
To: David Miller; +Cc: herbert, hkchu, mwdalton, netdev
From: Eric Dumazet <edumazet@google.com>
When replacing one 16-bit value with another in an IP header, we can
adjust the IP checksum with the simple operation described in RFC 1624,
as David reminded us.
csum_partial() is a complex function on x86_64, not really suited
to checksumming a small number of bytes.
I spotted csum_partial() among the top 20 most cycle-consuming
functions (more than 1%) in a GRO workload, which was rather
unexpected.
The caller was inet_gro_complete() doing a csum_replace2() when
building the new IP header for the GRO packet.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/net/checksum.h | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 37a0e24adbe7..a28f4e0f6251 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
return csum_add(csum, ~addend);
}
+static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
+{
+ u16 res = (__force u16)csum;
+
+ res += (__force u16)addend;
+ return (__force __sum16)(res + (res < (__force u16)addend));
+}
+
+static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
+{
+ return csum16_add(csum, ~addend);
+}
+
static inline __wsum
csum_block_add(__wsum csum, __wsum csum2, int offset)
{
@@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
}
-static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
+/* Implements RFC 1624 (Incremental Internet Checksum)
+ * 3. Discussion states :
+ * HC' = ~(~HC + ~m + m')
+ * m : old value of a 16bit field
+ * m' : new value of a 16bit field
+ */
+static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
{
- csum_replace4(sum, (__force __be32)from, (__force __be32)to);
+ *sum = ~csum16_add(csum16_sub(~(*sum), old), new);
}
struct sk_buff;
* Re: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 2:51 ` [PATCH net-next] net: optimize csum_replace2() Eric Dumazet
@ 2014-03-24 4:20 ` David Miller
2014-03-24 10:22 ` David Laight
1 sibling, 0 replies; 26+ messages in thread
From: David Miller @ 2014-03-24 4:20 UTC (permalink / raw)
To: eric.dumazet; +Cc: herbert, hkchu, mwdalton, netdev
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 23 Mar 2014 19:51:36 -0700
> From: Eric Dumazet <edumazet@google.com>
>
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
>
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
>
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
>
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Applied, thanks for following through on this Eric.
Would be nice to improve csum_replace4() similarly, since every
NAT packet uses that thing when we change the address in the
protocol header.
* RE: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 2:51 ` [PATCH net-next] net: optimize csum_replace2() Eric Dumazet
2014-03-24 4:20 ` David Miller
@ 2014-03-24 10:22 ` David Laight
2014-03-24 13:25 ` Eric Dumazet
1 sibling, 1 reply; 26+ messages in thread
From: David Laight @ 2014-03-24 10:22 UTC (permalink / raw)
To: 'Eric Dumazet', David Miller
Cc: herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org
From: Eric Dumazet <edumazet@google.com>
>
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
>
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
>
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
>
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/net/checksum.h | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/checksum.h b/include/net/checksum.h
> index 37a0e24adbe7..a28f4e0f6251 100644
> --- a/include/net/checksum.h
> +++ b/include/net/checksum.h
> @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> return csum_add(csum, ~addend);
> }
>
> +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> +{
> + u16 res = (__force u16)csum;
Shouldn't that be u32 ?
> + res += (__force u16)addend;
> + return (__force __sum16)(res + (res < (__force u16)addend));
> +}
> +
> +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> +{
> + return csum16_add(csum, ~addend);
> +}
> +
> static inline __wsum
> csum_block_add(__wsum csum, __wsum csum2, int offset)
> {
> @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> *sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> }
>
> -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> +/* Implements RFC 1624 (Incremental Internet Checksum)
> + * 3. Discussion states :
> + * HC' = ~(~HC + ~m + m')
> + * m : old value of a 16bit field
> + * m' : new value of a 16bit field
> + */
> +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> {
> - csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> + *sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> }
It might be clearer to just say:
*sum = ~csum16_add(csum16_add(~*sum, ~old), new);
or even:
*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new);
which might remove some mask instructions - especially if all the
intermediate values are left larger than 16 bits.
David
* RE: [RFC] csum experts, csum_replace2() is too expensive
2014-03-24 2:50 ` Eric Dumazet
@ 2014-03-24 10:30 ` David Laight
2014-03-24 13:17 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2014-03-24 10:30 UTC (permalink / raw)
To: 'Eric Dumazet', David Miller
Cc: andi@firstfloor.org, hpa@zytor.com, kaber@trash.net,
herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
From: Eric Dumazet
> On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Fri, 21 Mar 2014 05:50:50 -0700
> >
> > > It looks like a barrier() would be more appropriate.
> >
> > barrier() == __asm__ __volatile__(:::"memory")
>
> Indeed, but now you mention it, ip_fast_csum() do not uses volatile
> keyword on x86_64, and has no "m" constraint either.
Adding 'volatile' isn't sufficient to force gcc to write data
into the area being checksummed.
ip_fast_csum() either needs an explicit "m" constraint for the actual
buffer (and target) bytes, or the stronger "memory" constraint.
The 'volatile' is then not needed.
David
* RE: [RFC] csum experts, csum_replace2() is too expensive
2014-03-24 10:30 ` David Laight
@ 2014-03-24 13:17 ` Eric Dumazet
2014-03-24 14:13 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 13:17 UTC (permalink / raw)
To: David Laight
Cc: David Miller, andi@firstfloor.org, hpa@zytor.com, kaber@trash.net,
herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > Date: Fri, 21 Mar 2014 05:50:50 -0700
> > >
> > > > It looks like a barrier() would be more appropriate.
> > >
> > > barrier() == __asm__ __volatile__(:::"memory")
> >
> > Indeed, but now you mention it, ip_fast_csum() do not uses volatile
> > keyword on x86_64, and has no "m" constraint either.
>
> Adding 'volatile' isn't sufficient to force gcc to write data
> into the area being checksummed.
You missed the point. It's not about forcing gcc to write data, because
it does.
The point is: gcc doesn't recompute the checksum a second time.
> ip_fast_csum() either needs an explicit "m" constraint for the actual
> buffer (and target) bytes, or the stronger "memory" constraint.
> The 'volatile' is then not needed.
How about taking a look at the actual code?
The "memory" constraint is already there. And no, it's not enough,
otherwise I wouldn't have sent this mail.
I actually compiled the code and double-checked.
0000000000007010 <foobar>:
7010: e8 00 00 00 00 callq 7015 <foobar+0x5>
7011: R_X86_64_PC32 __fentry__-0x4
7015: 55 push %rbp
7016: 31 c0 xor %eax,%eax
7018: b9 05 00 00 00 mov $0x5,%ecx
701d: 48 89 e5 mov %rsp,%rbp
7020: 48 83 ec 20 sub $0x20,%rsp
7024: 48 89 5d e8 mov %rbx,-0x18(%rbp)
7028: 4c 89 6d f8 mov %r13,-0x8(%rbp)
702c: 48 89 fb mov %rdi,%rbx
702f: 4c 89 65 f0 mov %r12,-0x10(%rbp)
7033: 41 89 d5 mov %edx,%r13d
7036: 66 89 47 0a mov %ax,0xa(%rdi)
703a: 66 89 77 02 mov %si,0x2(%rdi)
703e: 48 89 f8 mov %rdi,%rax
7041: 48 89 fe mov %rdi,%rsi
7044: 44 8b 20 mov (%rax),%r12d
7047: 83 e9 04 sub $0x4,%ecx
704a: 76 2e jbe 707a <foobar+0x6a>
704c: 44 03 60 04 add 0x4(%rax),%r12d
7050: 44 13 60 08 adc 0x8(%rax),%r12d
7054: 44 13 60 0c adc 0xc(%rax),%r12d
7058: 44 13 60 10 adc 0x10(%rax),%r12d
705c: 48 8d 40 04 lea 0x4(%rax),%rax
7060: ff c9 dec %ecx
7062: 75 f4 jne 7058 <foobar+0x48>
7064: 41 83 d4 00 adc $0x0,%r12d
7068: 44 89 e1 mov %r12d,%ecx
706b: 41 c1 ec 10 shr $0x10,%r12d
706f: 66 41 01 cc add %cx,%r12w
7073: 41 83 d4 00 adc $0x0,%r12d
7077: 41 f7 d4 not %r12d
707a: 31 c0 xor %eax,%eax
707c: 66 44 89 67 0a mov %r12w,0xa(%rdi)
7081: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
7084: R_X86_64_32S .rodata.str1.1+0xabd
7088: e8 00 00 00 00 callq 708d <foobar+0x7d>
7089: R_X86_64_PC32 printk-0x4
708d: 66 44 89 6b 02 mov %r13w,0x2(%rbx)
7092: 66 44 89 63 0a mov %r12w,0xa(%rbx)
7097: 4c 8b 6d f8 mov -0x8(%rbp),%r13
709b: 48 8b 5d e8 mov -0x18(%rbp),%rbx
709f: 4c 8b 65 f0 mov -0x10(%rbp),%r12
70a3: c9 leaveq
70a4: c3 retq
* RE: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 10:22 ` David Laight
@ 2014-03-24 13:25 ` Eric Dumazet
2014-03-24 14:38 ` David Laight
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 13:25 UTC (permalink / raw)
To: David Laight
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org
On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> From: Eric Dumazet <edumazet@google.com>
> >
> > When changing one 16bit value by another in IP header, we can adjust the
> > IP checksum by doing a simple operation described in RFC 1624,
> > as reminded by David.
> >
> > csum_partial() is a complex function on x86_64, not really suited
> > for small number of checksummed bytes.
> >
> > I spotted csum_partial() being in the top 20 most consuming
> > functions (more than 1 %) in a GRO workload, which was rather
> > unexpected.
> >
> > The caller was inet_gro_complete() doing a csum_replace2() when
> > building the new IP header for the GRO packet.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> > include/net/checksum.h | 23 +++++++++++++++++++++--
> > 1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > index 37a0e24adbe7..a28f4e0f6251 100644
> > --- a/include/net/checksum.h
> > +++ b/include/net/checksum.h
> > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > return csum_add(csum, ~addend);
> > }
> >
> > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > +{
> > + u16 res = (__force u16)csum;
>
> Shouldn't that be u32 ?
Why? We compute 16-bit checksums here.
>
> > + res += (__force u16)addend;
> > + return (__force __sum16)(res + (res < (__force u16)addend));
Note how the carry is propagated, and how we return 16 bits anyway.
Using u32 would force a fold, which is more expensive.
> > +}
> > +
> > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > +{
> > + return csum16_add(csum, ~addend);
> > +}
> > +
> > static inline __wsum
> > csum_block_add(__wsum csum, __wsum csum2, int offset)
> > {
> > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > *sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > }
> >
> > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > + * 3. Discussion states :
> > + * HC' = ~(~HC + ~m + m')
> > + * m : old value of a 16bit field
> > + * m' : new value of a 16bit field
> > + */
> > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > {
> > - csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > + *sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > }
>
> It might be clearer to just say:
> *sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> or even:
> *sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> which might remove some mask instructions - especially if all the
> intermediate values are left larger than 16 bits.
We have csum_add() and csum_sub(); I added csum16_add() and
csum16_sub() for analogy and completeness.
For Linux folks, it's quite common to use csum_sub(x, y) instead of
csum_add(x, ~y).
You can use whatever code matches your taste.
RFC 1624 mentions ~m, not m ^ 0xffff.
But again, if you prefer m ^ 0xffff, that's up to you ;)
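As a quick sanity check of the carry handling, a tiny userspace sketch
of the same logic (plain stdint types instead of the kernel's
__sum16/__be16) behaves as expected:
#include <stdio.h>
#include <stdint.h>

/* same arithmetic as csum16_add() from the patch, with bare integer types */
static uint16_t csum16_add(uint16_t csum, uint16_t addend)
{
	uint16_t res = csum;

	res += addend;
	/* end-around carry: if the 16-bit addition wrapped, res < addend */
	return res + (res < addend);
}

int main(void)
{
	printf("%#x\n", csum16_add(0x8000, 0x8001));	/* wraps: 0x10001 -> 0x2 */
	printf("%#x\n", csum16_add(0x1234, 0x0001));	/* no wrap: 0x1235 */
	return 0;
}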
* RE: [RFC] csum experts, csum_replace2() is too expensive
2014-03-24 13:17 ` Eric Dumazet
@ 2014-03-24 14:13 ` Eric Dumazet
0 siblings, 0 replies; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 14:13 UTC (permalink / raw)
To: David Laight
Cc: David Miller, andi@firstfloor.org, hpa@zytor.com, kaber@trash.net,
herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
On Mon, 2014-03-24 at 06:17 -0700, Eric Dumazet wrote:
> On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> > ip_fast_csum() either needs an explicit "m" constraint for the actual
> > buffer (and target) bytes, or the stronger "memory" constraint.
> > The 'volatile' is then not needed.
I am testing the following :
diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..89d7fa8837b5 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -45,6 +45,9 @@ static inline __sum16 csum_fold(__wsum sum)
static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
{
unsigned int sum;
+ struct full_ip_header {
+ unsigned int w[ihl];
+ };
asm(" movl (%1), %0\n"
" subl $4, %2\n"
@@ -67,8 +70,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
are modified, we must also specify them as outputs, or gcc
will assume they contain their original values. */
: "=r" (sum), "=r" (iph), "=r" (ihl)
- : "1" (iph), "2" (ihl)
- : "memory");
+ : "1" (iph), "2" (ihl),
+ "m" (*(struct full_ip_header *) iph)
+ );
return (__force __sum16)sum;
}
* RE: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 13:25 ` Eric Dumazet
@ 2014-03-24 14:38 ` David Laight
2014-03-24 15:14 ` Eric Dumazet
0 siblings, 1 reply; 26+ messages in thread
From: David Laight @ 2014-03-24 14:38 UTC (permalink / raw)
To: 'Eric Dumazet'
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org
From: Eric Dumazet
> On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > >
> > > When changing one 16bit value by another in IP header, we can adjust the
> > > IP checksum by doing a simple operation described in RFC 1624,
> > > as reminded by David.
> > >
> > > csum_partial() is a complex function on x86_64, not really suited
> > > for small number of checksummed bytes.
> > >
> > > I spotted csum_partial() being in the top 20 most consuming
> > > functions (more than 1 %) in a GRO workload, which was rather
> > > unexpected.
> > >
> > > The caller was inet_gro_complete() doing a csum_replace2() when
> > > building the new IP header for the GRO packet.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > > include/net/checksum.h | 23 +++++++++++++++++++++--
> > > 1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > > index 37a0e24adbe7..a28f4e0f6251 100644
> > > --- a/include/net/checksum.h
> > > +++ b/include/net/checksum.h
> > > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > > return csum_add(csum, ~addend);
> > > }
> > >
> > > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > > +{
> > > + u16 res = (__force u16)csum;
> >
> > Shouldn't that be u32 ?
>
> Why ? We compute 16bit checksums here.
>
> >
> > > + res += (__force u16)addend;
> > > + return (__force __sum16)(res + (res < (__force u16)addend));
>
> Note how carry is propagated, and how we return 16 bit anyway.
Not enough coffee - my brain didn't parse that correctly at all :-(
> Using u32 would force to use a fold, which is more expensive.
Try compiling for something other than x86/amd64
(i.e. something without 16-bit arithmetic).
The ppc compiler I have (not the latest) generates 3 masks with 0xffff and
a conditional branch for the above function (without the inline).
> > > +}
> > > +
> > > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > > +{
> > > + return csum16_add(csum, ~addend);
> > > +}
> > > +
> > > static inline __wsum
> > > csum_block_add(__wsum csum, __wsum csum2, int offset)
> > > {
> > > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > > *sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > > }
> > >
> > > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > > + * 3. Discussion states :
> > > + * HC' = ~(~HC + ~m + m')
> > > + * m : old value of a 16bit field
> > > + * m' : new value of a 16bit field
> > > + */
> > > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > > {
> > > - csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > > + *sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > > }
> >
> > It might be clearer to just say:
> > *sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> > or even:
> > *sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> > which might remove some mask instructions - especially if all the
> > intermediate values are left larger than 16 bits.
>
> We have csum_add() and csum_sub(), I added csum16_add() and
> csum16_sub(), for analogy and completeness.
>
> For linux guys, its quite common stuff to use csum_sub(x, y) instead of
> csum_add(c, ~y).
>
> You can use whatever code matching your taste.
>
> RFC 1624 mentions ~m, not m ^ 0xffff
But C only has 32-bit maths, so all the 16-bit values keep
needing masking.
> But again if you prefer m ^ 0xffff, thats up to you ;)
On ppc I get much better code for (retyped):
static inline u32 csum_add(u32 c, u32 a)
{
c += a;
return (c + (c >> 16)) & 0xffff;
}
void csum_repl(u16 *c16, u16 o16, u16 n)
{
u32 c = *c16, o = o16;
*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
}
13 instructions and no branches against 21 instructions
and 2 branches.
The same is probably true of arm.
For amd64 (where the compiler can do a 16-bit compare)
my version is one instruction longer.
I've not tried to guess at the actual cycle counts!
David
* RE: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 14:38 ` David Laight
@ 2014-03-24 15:14 ` Eric Dumazet
2014-03-24 15:52 ` David Laight
0 siblings, 1 reply; 26+ messages in thread
From: Eric Dumazet @ 2014-03-24 15:14 UTC (permalink / raw)
To: David Laight
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org
On Mon, 2014-03-24 at 14:38 +0000, David Laight wrote:
> But C only has 32bit maths, so all the 16bit values keep
> need masking.
>
> > But again if you prefer m ^ 0xffff, thats up to you ;)
>
> On ppc I get much better code for (retyped):
> static inline u32 csum_add(u32 c, u32 a)
> {
> c += a'
> return (c + (c >> 16)) & 0xffff;
> }
> void csum_repl(u16 *c16, u16 o16, u16 n)
> {
> u32 c = *c16, o = o16;
> *c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
> }
> 13 instructions and no branches against 21 instructions
> and 2 branches.
> The same is probably true of arm.
> For amd64 (where the compiler can do a 16bit compare)
> my version is one instruction longer.
> I've not tried to guess at the actual cycle counts!
>
> David
The code I provided uses no conditional branch on x86.
It sounds like you could provide a helper for ARM, if you really care
about this path. I find it surprising that you did not comment on my
prior mails on this subject, yet you suddenly seem to care now that the
patch is merged.
We for example have the following helper in x86 :
static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
asm("addl %2,%0\n\t"
"adcl $0,%0"
: "=r" (a)
: "0" (a), "r" (b));
return a;
}
But these days, gcc seems to do a pretty good job without these helpers.
Yes, we could save 4 instructions, but I doubt it is worth the pain of
adding arch helpers. I certainly won't do it.
0000000000000030 <inet_gro_complete>:
30: 55 push %rbp
31: 48 63 c6 movslq %esi,%rax
34: 48 03 87 d0 00 00 00 add 0xd0(%rdi),%rax
3b: 48 89 e5 mov %rsp,%rbp
3e: 8b 57 68 mov 0x68(%rdi),%edx
41: 44 0f b7 40 02 movzwl 0x2(%rax),%r8d
46: 0f b7 48 0a movzwl 0xa(%rax),%ecx
4a: 66 29 f2 sub %si,%dx
4d: 66 c1 c2 08 rol $0x8,%dx
51: 44 0f b6 48 09 movzbl 0x9(%rax),%r9d
56: 66 89 50 02 mov %dx,0x2(%rax)
5a: 41 f7 d0 not %r8d
5d: f7 d1 not %ecx
5f: 66 44 01 c1 add %r8w,%cx
63: 41 0f 92 c0 setb %r8b
67: 45 0f b6 c0 movzbl %r8b,%r8d
6b: 44 01 c1 add %r8d,%ecx
6e: 66 01 d1 add %dx,%cx
71: 41 0f 92 c0 setb %r8b
75: 45 0f b6 c0 movzbl %r8b,%r8d
79: 44 01 c1 add %r8d,%ecx
7c: f7 d1 not %ecx
7e: 66 89 48 0a mov %cx,0xa(%rax)
82: 49 63 c1 movslq %r9d,%rax
85: 48 8b 04 c5 00 00 00 mov 0x0(,%rax,8),%rax
* RE: [PATCH net-next] net: optimize csum_replace2()
2014-03-24 15:14 ` Eric Dumazet
@ 2014-03-24 15:52 ` David Laight
0 siblings, 0 replies; 26+ messages in thread
From: David Laight @ 2014-03-24 15:52 UTC (permalink / raw)
To: 'Eric Dumazet'
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
mwdalton@google.com, netdev@vger.kernel.org
From: Eric Dumazet
...
> Code I provided uses no conditional branch on x86.
All the world isn't x86...
The sparc, ppc and arm people might want to consider optimising
this code further.
> It sounds you could provide helper to arm, if you really care of this
> path. I find surprising you did not comment on my prior mails on this
> subject and you suddenly seem to care now the patch is merged.
I don't remember seeing this particular patch before today.
Last week you were still sorting out the stall caused by
the 16-bit write -> 32-bit load in the existing code.
In any case your change is clearly a significant improvement on
what was there before.
> We for example have the following helper in x86 :
>
> static inline unsigned add32_with_carry(unsigned a, unsigned b)
> {
> asm("addl %2,%0\n\t"
> "adcl $0,%0"
> : "=r" (a)
> : "0" (a), "r" (b));
> return a;
> }
>
> But these days, gcc seems to do a pretty good job without these helpers.
Indeed.
While x86 can do 16-bit maths, most CPUs can't - so the generated
code for 'short' (and 'char') maths must mask with 0xffff (or 0xff)
every time a value is written to a local (i.e. register) variable.
In general you get better code by using local variables that
are the same size as machine registers.
This also applies to function arguments and return types.
I'm not sure how much difference it would make overall.
It rather depends on whether anything appears in a very
hot path. OTOH a lot of small changes can add up.
David