* [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-20 18:49 UTC
To: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton; +Cc: netdev
csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
way faster (16 cycles)
csum_partial() is not really meant for doing checksums over 8 bytes !
Any idea how to make the thing really fast as intended ?
I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
is insane...
Following patch might be the fastest thing ?
(At this point we already have validated IP checksum)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..86c924c16f3c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
int proto = iph->protocol;
int err = -ENOSYS;
- csum_replace2(&iph->check, iph->tot_len, newlen);
iph->tot_len = newlen;
+ ip_send_check(&iph);
rcu_read_lock();
ops = rcu_dereference(inet_offloads[proto]);
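
For context, the csum_replace2() being replaced here funnels an 8-byte
buffer through the generic folding helpers.  A simplified sketch of the
include/net/checksum.h code at this point (not a verbatim copy; the old
lines are also visible in the patch further down the thread):

static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
{
	__be32 diff[] = { ~from, to };

	/* csum_partial() walks an 8 byte buffer just to fold two words */
	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
}

static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
{
	/* a 16 bit replacement is widened and takes the same path */
	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
}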

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Herbert Xu @ 2014-03-21  0:13 UTC
To: Eric Dumazet; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev

On Thu, Mar 20, 2014 at 11:49:01AM -0700, Eric Dumazet wrote:
> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?

Your patch makes sense to me.  However does it actually help
your 1% in the GRO profile?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21  0:54 UTC
To: Herbert Xu; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev

On Fri, 2014-03-21 at 08:13 +0800, Herbert Xu wrote:

> Your patch makes sense to me.  However does it actually help
> your 1% in the GRO profile?

Well, I caught this strange csum_partial() presence in a profile,
but my question was more general than the GRO use case.

(BTW the patch didn't even compile ;) )

I guess that in GRO case, the following would be even faster...

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..b8fe5e8a02dd 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1361,6 +1361,7 @@ static struct sk_buff **inet_gro_receive(struct sk_buff **head,
 	if (!ops || !ops->callbacks.gro_receive)
 		goto out_unlock;

+	/* GRO is limited to IPv4 with no options : ihl = 5 */
 	if (*(u8 *)iph != 0x45)
 		goto out_unlock;

@@ -1434,8 +1435,10 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;

-	csum_replace2(&iph->check, iph->tot_len, newlen);
+	/* GRO is limited to IPv4 with no options : ihl = 5 */
 	iph->tot_len = newlen;
+	iph->check = 0;
+	iph->check = ip_fast_csum((u8 *)iph, 5);

 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Herbert Xu @ 2014-03-21 12:50 UTC
To: Eric Dumazet; +Cc: Patrick McHardy, H.K. Jerry Chu, Michael Dalton, netdev

On Thu, Mar 20, 2014 at 05:54:17PM -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 08:13 +0800, Herbert Xu wrote:
>
> > Your patch makes sense to me.  However does it actually help
> > your 1% in the GRO profile?
>
> Well, I caught this strange csum_partial() presence in a profile,
> but my question was more general than the GRO use case.
>
> (BTW the patch didn't even compile ;) )
>
> I guess that in GRO case, the following would be even faster...

Agreed.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Andi Kleen @ 2014-03-21  1:56 UTC
To: Eric Dumazet
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton, netdev

Eric Dumazet <eric.dumazet@gmail.com> writes:
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...

Couldn't it just be the cache miss?

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21  2:51 UTC
To: Andi Kleen
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton, netdev

On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
> Couldn't it just be the cache miss?

Well, no, because we aggregate frames that were already fully checked
(by GRO engine/layers) in the same NAPI run.

In this particular case, it seems we aggregate few frames per run
(its a 40Gb Intel NIC, it apparently is tuned for low latency)

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21 12:50 UTC
To: Andi Kleen, H. Peter Anvin
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton, netdev,
	linux-kernel@vger.kernel.org

On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> Eric Dumazet <eric.dumazet@gmail.com> writes:
> >
> > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > is insane...
>
> Couldn't it just be the cache miss?

Or the fact that we mix 16 bit stores and 32bit loads ?

	iph->tot_len = newlen;
	iph->check = 0;
	iph->check = ip_fast_csum(iph, 5);

-> following perf annotation :

         : ffffffff81538e70 <inet_gro_complete>:
    0.49 : ffffffff81538e70: callq  ffffffff81578c80 <__entry_text_start>
    0.49 : ffffffff81538e75: push   %rbp
    1.46 : ffffffff81538e76: movzwl 0x60(%rdi),%edx
    0.00 : ffffffff81538e7a: movslq %esi,%rax
    0.00 : ffffffff81538e7d: add    0xc8(%rdi),%rax
    1.46 : ffffffff81538e84: mov    %rsp,%rbp
    0.00 : ffffffff81538e87: sub    %esi,%edx
    0.00 : ffffffff81538e89: rol    $0x8,%dx
    0.00 : ffffffff81538e8d: movzbl 0x9(%rax),%ecx
    0.98 : ffffffff81538e91: mov    %dx,0x2(%rax)      iph->tot_len = newlen;
    0.49 : ffffffff81538e95: xor    %edx,%edx
    0.00 : ffffffff81538e97: mov    %dx,0xa(%rax)      iph->check = 0;
    0.00 : ffffffff81538e9b: mov    (%rax),%edx        // 32bit load -> stall
   86.34 : ffffffff81538e9d: add    0x4(%rax),%edx
    0.49 : ffffffff81538ea0: adc    0x8(%rax),%edx
    0.49 : ffffffff81538ea3: adc    0xc(%rax),%edx
    0.98 : ffffffff81538ea6: adc    0x10(%rax),%edx
    0.49 : ffffffff81538ea9: adc    $0x0,%edx
    0.00 : ffffffff81538eac: mov    %edx,%r8d
    0.49 : ffffffff81538eaf: xor    %dx,%dx
    0.00 : ffffffff81538eb2: shl    $0x10,%r8d
    0.98 : ffffffff81538eb6: add    %r8d,%edx
    0.98 : ffffffff81538eb9: adc    $0xffff,%edx
    0.98 : ffffffff81538ebf: not    %edx
    0.00 : ffffffff81538ec1: shr    $0x10,%edx
    0.49 : ffffffff81538ec4: mov    %dx,0xa(%rax)
    0.00 : ffffffff81538ec8: movslq %ecx,%rax
    0.00 : ffffffff81538ecb: mov    -0x7e734f40(,%rax,8),%rax
    0.00 : ffffffff81538ed3: test   %rax,%rax
    0.00 : ffffffff81538ed6: je     ffffffff81538ef0 <inet_gro_complete+0x80>

BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?
It looks like a barrier() would be more appropriate.

Following patch seems to help, but the stall seems to be the fact that
we write on iph->check (16bits) before doing the checksum using 32bit
loads.

Note that we could share same ip_fast_csum() implementation between
x86 32/64 bits...

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..a81cc3a5facc 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,6 +46,24 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;

+	/*
+	 * Callers might clear iph->check before calling us, make sure
+	 * compiler wont mess things...
+	 */
+	barrier();
+
+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("  movl (%[iph]), %[sum]\n"
+		    "  addl 4(%[iph]), %[sum]\n"
+		    "  adcl 8(%[iph]), %[sum]\n"
+		    "  adcl 12(%[iph]), %[sum]\n"
+		    "  adcl 16(%[iph]), %[sum]\n"
+		    "  adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		);
+		return csum_fold(sum);
+	}
 	asm("  movl (%1), %0\n"
 	    "  subl $4, %2\n"
 	    "  jbe 2f\n"
@@ -68,7 +86,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	   will assume they contain their original values. */
 	: "=r" (sum), "=r" (iph), "=r" (ihl)
 	: "1" (iph), "2" (ihl)
-	: "memory");
+	);

 	return (__force __sum16)sum;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..90aa562dfbf5 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;

-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	iph->check = 0;
+	iph->check = ip_fast_csum((u8 *)iph, 5);

 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);
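
The access pattern behind the stall can be sketched outside the kernel.
The snippet below is purely illustrative user-space code (hypothetical,
not from the thread, and the adds ignore carries since only the memory
accesses matter): two 16-bit stores into a 20-byte header immediately
followed by 32-bit loads that overlap them, the narrow-store / wide-load
sequence that defeats store-to-load forwarding.

#include <stdint.h>

union hdr {				/* 20 byte IPv4-like header */
	struct {
		uint8_t  vihl, tos;
		uint16_t tot_len;	/* bytes 2-3 */
		uint16_t id, frag_off;
		uint8_t  ttl, protocol;
		uint16_t check;		/* bytes 10-11 */
		uint32_t saddr, daddr;
	} f;
	uint32_t w[5];
};

uint32_t patch_and_sum(union hdr *h, uint16_t newlen)
{
	uint32_t sum;

	h->f.tot_len = newlen;		/* 16-bit store to bytes 2-3 */
	h->f.check = 0;			/* 16-bit store to bytes 10-11 */

	sum  = h->w[0] + h->w[1];	/* 32-bit loads overlapping the  */
	sum += h->w[2] + h->w[3];	/* still-pending narrow stores:  */
	sum += h->w[4];			/* no store-to-load forwarding   */
	return sum;
}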

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Andi Kleen @ 2014-03-21 13:28 UTC
To: Eric Dumazet
Cc: Andi Kleen, H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu,
	Michael Dalton, netdev, linux-kernel@vger.kernel.org

On Fri, Mar 21, 2014 at 05:50:50AM -0700, Eric Dumazet wrote:
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?

It should cause a small stall from not doing load-store forwarding,
but 1% of a serious workload would be surprising.

Are you sure it's not some skid effect?

-Andi

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21 13:32 UTC
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
	netdev, linux-kernel@vger.kernel.org

On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:

> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> 	iph->tot_len = newlen;
> 	iph->check = 0;
> 	iph->check = ip_fast_csum(iph, 5);

Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes the
stall. I no longer see inet_gro_complete() in perf top...

+	if (__builtin_constant_p(ihl) && ihl == 5) {
+		asm("  movw (%[iph]), %w[sum]\n"	/* ihl/version/tos */
+		    "  addw 2(%[iph]), %w[sum]\n"	/* tot_len */
+		    "  adcw 8(%[iph]), %w[sum]\n"	/* ttl/protocol */
+		    "  adcw 10(%[iph]), %w[sum]\n"	/* check */
+		    "  adcl 4(%[iph]), %[sum]\n"	/* id/frag_off */
+		    "  adcl 12(%[iph]), %[sum]\n"	/* saddr */
+		    "  adcl 16(%[iph]), %[sum]\n"	/* daddr */
+		    "  adcl $0, %[sum]\n"
+		    : [sum] "=r" (sum)
+		    : [iph] "r" (iph)
+		);
+		return csum_fold(sum);

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21 13:47 UTC
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
	netdev, linux-kernel@vger.kernel.org

On Fri, 2014-03-21 at 06:32 -0700, Eric Dumazet wrote:
> On Fri, 2014-03-21 at 05:50 -0700, Eric Dumazet wrote:
>
> > Or the fact that we mix 16 bit stores and 32bit loads ?
> >
> > 	iph->tot_len = newlen;
> > 	iph->check = 0;
> > 	iph->check = ip_fast_csum(iph, 5);
>
> Yep definitely. Using 16 bit loads in ip_fast_csum() totally removes the
> stall. I no longer see inet_gro_complete() in perf top...
>
> +	if (__builtin_constant_p(ihl) && ihl == 5) {
> +		asm("  movw (%[iph]), %w[sum]\n"	/* ihl/version/tos */
> +		    "  addw 2(%[iph]), %w[sum]\n"	/* tot_len */
> +		    "  adcw 8(%[iph]), %w[sum]\n"	/* ttl/protocol */
> +		    "  adcw 10(%[iph]), %w[sum]\n"	/* check */
> +		    "  adcl 4(%[iph]), %[sum]\n"	/* id/frag_off */
> +		    "  adcl 12(%[iph]), %[sum]\n"	/* saddr */
> +		    "  adcl 16(%[iph]), %[sum]\n"	/* daddr */
> +		    "  adcl $0, %[sum]\n"
> +		    : [sum] "=r" (sum)
> +		    : [iph] "r" (iph)
> +		);
> +		return csum_fold(sum);

Another idea would be to move the ip_fast_csum() call at the end of
inet_gro_complete()

I'll try this :

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 8c54870db792..0ca8f350a532 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;

-	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
+	iph->check = 0;

 	rcu_read_lock();
 	ops = rcu_dereference(inet_offloads[proto]);
@@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	 * inet_gro_receive().
 	 */
 	err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+	iph->check = ip_fast_csum((u8 *)iph, 5);

 out_unlock:
 	rcu_read_unlock();

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21 13:56 UTC
To: Andi Kleen
Cc: H. Peter Anvin, Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton,
	netdev, linux-kernel@vger.kernel.org

On Fri, 2014-03-21 at 06:47 -0700, Eric Dumazet wrote:

> Another idea would be to move the ip_fast_csum() call at the end of
> inet_gro_complete()
>
> I'll try this :
>
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 8c54870db792..0ca8f350a532 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  	int proto = iph->protocol;
>  	int err = -ENOSYS;
>
> -	csum_replace2(&iph->check, iph->tot_len, newlen);
>  	iph->tot_len = newlen;
> +	iph->check = 0;
>
>  	rcu_read_lock();
>  	ops = rcu_dereference(inet_offloads[proto]);
> @@ -1447,6 +1447,7 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  	 * inet_gro_receive().
>  	 */
>  	err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
> +	iph->check = ip_fast_csum((u8 *)iph, 5);
>
>  out_unlock:
>  	rcu_read_unlock();

This doesn't help... same stall.

Looks like the best way is to use some 16 bit loads in ip_fast_csum()
for the ihl=5 case.

* RE: [RFC] csum experts, csum_replace2() is too expensive
From: David Laight @ 2014-03-21 14:14 UTC
To: 'Eric Dumazet', Andi Kleen, H. Peter Anvin
Cc: Patrick McHardy, Herbert Xu, H.K. Jerry Chu, Michael Dalton, netdev,
	linux-kernel@vger.kernel.org

From: Eric Dumazet
> On Thu, 2014-03-20 at 18:56 -0700, Andi Kleen wrote:
> > Eric Dumazet <eric.dumazet@gmail.com> writes:
> > >
> > > I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> > > is insane...
> >
> > Couldn't it just be the cache miss?
>
> Or the fact that we mix 16 bit stores and 32bit loads ?
>
> BTW, any idea why ip_fast_csum() on x86 contains a "memory" constraint ?

The correct constraint would be one that told gcc that it accesses the
20 bytes from the source pointer.
Without it gcc won't necessarily write out the values before the asm
instructions execute.

	David

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: David Miller @ 2014-03-21 18:52 UTC
To: eric.dumazet
Cc: andi, hpa, kaber, herbert, hkchu, mwdalton, netdev, linux-kernel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Mar 2014 05:50:50 -0700

> It looks like a barrier() would be more appropriate.

barrier() == __asm__ __volatile__(:::"memory")

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-24  2:50 UTC
To: David Miller
Cc: andi, hpa, kaber, herbert, hkchu, mwdalton, netdev, linux-kernel

On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Mar 2014 05:50:50 -0700
>
> > It looks like a barrier() would be more appropriate.
>
> barrier() == __asm__ __volatile__(:::"memory")

Indeed, but now you mention it, ip_fast_csum() do not uses volatile
keyword on x86_64, and has no "m" constraint either.

This means that for the following hypothetical networking code :

void foobar(struct iphdr *iph, __be16 newlen, __be16 newid)
{
	iph->tot_len = newlen;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);
	pr_err("%p\n", iph);
	iph->id = newid;
	iph->check = 0;
	iph->check = ip_fast_csum((u8 *)iph, 5);
}

ip_fast_csum() is done _once_ only.

Following patch seems needed.

Thats one another call for x86 code factorization ...

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..c67778544880 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -46,7 +46,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;

-	asm("  movl (%1), %0\n"
+	asm volatile("  movl (%1), %0\n"
 	    "  subl $4, %2\n"
 	    "  jbe 2f\n"
 	    "  addl 4(%1), %0\n"

* RE: [RFC] csum experts, csum_replace2() is too expensive
From: David Laight @ 2014-03-24 10:30 UTC
To: 'Eric Dumazet', David Miller
Cc: andi@firstfloor.org, hpa@zytor.com, kaber@trash.net, herbert@gondor.apana.org.au,
	hkchu@google.com, mwdalton@google.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

From: Eric Dumazet
> On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Fri, 21 Mar 2014 05:50:50 -0700
> >
> > > It looks like a barrier() would be more appropriate.
> >
> > barrier() == __asm__ __volatile__(:::"memory")
>
> Indeed, but now you mention it, ip_fast_csum() do not uses volatile
> keyword on x86_64, and has no "m" constraint either.

Adding 'volatile' isn't sufficient to force gcc to write data
into the area being checksummed.

ip_fast_csum() either needs an explicit "m" constraint for the actual
buffer (and target) bytes, or the stronger "memory" constraint.
The 'volatile' is then not needed.

	David

* RE: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-24 13:17 UTC
To: David Laight
Cc: David Miller, andi@firstfloor.org, hpa@zytor.com, kaber@trash.net,
	herbert@gondor.apana.org.au, hkchu@google.com, mwdalton@google.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:
> From: Eric Dumazet
> > On Fri, 2014-03-21 at 14:52 -0400, David Miller wrote:
> > > From: Eric Dumazet <eric.dumazet@gmail.com>
> > > Date: Fri, 21 Mar 2014 05:50:50 -0700
> > >
> > > > It looks like a barrier() would be more appropriate.
> > >
> > > barrier() == __asm__ __volatile__(:::"memory")
> >
> > Indeed, but now you mention it, ip_fast_csum() do not uses volatile
> > keyword on x86_64, and has no "m" constraint either.
>
> Adding 'volatile' isn't sufficient to force gcc to write data
> into the area being checksummed.

You missed the point.

Its not about forcing gcc to write data, because it does.

Point is : gcc doesn't recompute the checksum a second time.

> ip_fast_csum() either needs an explicit "m" constraint for the actual
> buffer (and target) bytes, or the stronger "memory" constraint.
> The 'volatile' is then not needed.

What about you take a look at the actual code ?

"memory" constraint is already there. And no, its not enough, otherwise
I wouldn't have sent this mail.

I actually compiled the code and double checked.

0000000000007010 <foobar>:
    7010:	e8 00 00 00 00          callq  7015 <foobar+0x5>
			7011: R_X86_64_PC32	__fentry__-0x4
    7015:	55                      push   %rbp
    7016:	31 c0                   xor    %eax,%eax
    7018:	b9 05 00 00 00          mov    $0x5,%ecx
    701d:	48 89 e5                mov    %rsp,%rbp
    7020:	48 83 ec 20             sub    $0x20,%rsp
    7024:	48 89 5d e8             mov    %rbx,-0x18(%rbp)
    7028:	4c 89 6d f8             mov    %r13,-0x8(%rbp)
    702c:	48 89 fb                mov    %rdi,%rbx
    702f:	4c 89 65 f0             mov    %r12,-0x10(%rbp)
    7033:	41 89 d5                mov    %edx,%r13d
    7036:	66 89 47 0a             mov    %ax,0xa(%rdi)
    703a:	66 89 77 02             mov    %si,0x2(%rdi)
    703e:	48 89 f8                mov    %rdi,%rax
    7041:	48 89 fe                mov    %rdi,%rsi
    7044:	44 8b 20                mov    (%rax),%r12d
    7047:	83 e9 04                sub    $0x4,%ecx
    704a:	76 2e                   jbe    707a <foobar+0x6a>
    704c:	44 03 60 04             add    0x4(%rax),%r12d
    7050:	44 13 60 08             adc    0x8(%rax),%r12d
    7054:	44 13 60 0c             adc    0xc(%rax),%r12d
    7058:	44 13 60 10             adc    0x10(%rax),%r12d
    705c:	48 8d 40 04             lea    0x4(%rax),%rax
    7060:	ff c9                   dec    %ecx
    7062:	75 f4                   jne    7058 <foobar+0x48>
    7064:	41 83 d4 00             adc    $0x0,%r12d
    7068:	44 89 e1                mov    %r12d,%ecx
    706b:	41 c1 ec 10             shr    $0x10,%r12d
    706f:	66 41 01 cc             add    %cx,%r12w
    7073:	41 83 d4 00             adc    $0x0,%r12d
    7077:	41 f7 d4                not    %r12d
    707a:	31 c0                   xor    %eax,%eax
    707c:	66 44 89 67 0a          mov    %r12w,0xa(%rdi)
    7081:	48 c7 c7 00 00 00 00    mov    $0x0,%rdi
			7084: R_X86_64_32S	.rodata.str1.1+0xabd
    7088:	e8 00 00 00 00          callq  708d <foobar+0x7d>
			7089: R_X86_64_PC32	printk-0x4
    708d:	66 44 89 6b 02          mov    %r13w,0x2(%rbx)
    7092:	66 44 89 63 0a          mov    %r12w,0xa(%rbx)
    7097:	4c 8b 6d f8             mov    -0x8(%rbp),%r13
    709b:	48 8b 5d e8             mov    -0x18(%rbp),%rbx
    709f:	4c 8b 65 f0             mov    -0x10(%rbp),%r12
    70a3:	c9                      leaveq
    70a4:	c3                      retq

* RE: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-24 14:13 UTC
To: David Laight
Cc: David Miller, andi@firstfloor.org, hpa@zytor.com, kaber@trash.net,
	herbert@gondor.apana.org.au, hkchu@google.com, mwdalton@google.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org

On Mon, 2014-03-24 at 06:17 -0700, Eric Dumazet wrote:
> On Mon, 2014-03-24 at 10:30 +0000, David Laight wrote:

> > ip_fast_csum() either needs an explicit "m" constraint for the actual
> > buffer (and target) bytes, or the stronger "memory" constraint.
> > The 'volatile' is then not needed.

I am testing the following :

diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h
index e6fd8a026c7b..89d7fa8837b5 100644
--- a/arch/x86/include/asm/checksum_64.h
+++ b/arch/x86/include/asm/checksum_64.h
@@ -45,6 +45,9 @@ static inline __sum16 csum_fold(__wsum sum)
 static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	unsigned int sum;
+	struct full_ip_header {
+		unsigned int w[ihl];
+	};

 	asm("  movl (%1), %0\n"
 	    "  subl $4, %2\n"
@@ -67,8 +70,9 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 	   are modified, we must also specify them as outputs, or gcc
 	   will assume they contain their original values. */
 	: "=r" (sum), "=r" (iph), "=r" (ihl)
-	: "1" (iph), "2" (ihl)
-	: "memory");
+	: "1" (iph), "2" (ihl),
+	  "m" (*(struct full_ip_header *) iph)
+	);

 	return (__force __sum16)sum;
 }

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: David Miller @ 2014-03-21 18:07 UTC
To: eric.dumazet; +Cc: kaber, herbert, hkchu, mwdalton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 20 Mar 2014 11:49:01 -0700

> csum_replace2() uses about 29 cycles, while a plain ip_send_check() is
> way faster (16 cycles)
>
> csum_partial() is not really meant for doing checksums over 8 bytes !
>
> Any idea how to make the thing really fast as intended ?
>
> I saw csum_partial() consuming 1% of cpu cycles in a GRO workload, that
> is insane...
>
> Following patch might be the fastest thing ?
>
> (At this point we already have validated IP checksum)
 ...
> @@ -1434,8 +1434,8 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>  	int proto = iph->protocol;
>  	int err = -ENOSYS;
>
> -	csum_replace2(&iph->check, iph->tot_len, newlen);
>  	iph->tot_len = newlen;
> +	ip_send_check(&iph);

Yeah the csum_replace*() are extremely suboptimal.

We should be able to cons up something cheap like the trick that
ip_decrease_ttl() uses.

https://tools.ietf.org/html/rfc1624
https://tools.ietf.org/html/rfc1141
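
For reference, the ip_decrease_ttl() trick mentioned above adjusts the
checksum with two plain additions instead of calling csum_partial();
roughly (a from-memory sketch of the include/net/ip.h helper, not a
verbatim copy):

static inline int ip_decrease_ttl(struct iphdr *iph)
{
	u32 check = (__force u32)iph->check;

	/* subtracting 1 from the ttl byte adds htons(0x0100) to the
	 * ones' complement sum; fold the possible carry back in */
	check += (__force u32)htons(0x0100);
	iph->check = (__force __sum16)(check + (check >= 0xFFFF));
	return --iph->ttl;
}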

* Re: [RFC] csum experts, csum_replace2() is too expensive
From: Eric Dumazet @ 2014-03-21 18:30 UTC
To: David Miller; +Cc: kaber, herbert, hkchu, mwdalton, netdev

On Fri, 2014-03-21 at 14:07 -0400, David Miller wrote:

> Yeah the csum_replace*() are extremely suboptimal.
>
> We should be able to cons up something cheap like the trick that
> ip_decrease_ttl() uses.
>
> https://tools.ietf.org/html/rfc1624
> https://tools.ietf.org/html/rfc1141

Thanks for the pointers, I'll cook a patch.

* [PATCH net-next] net: optimize csum_replace2()
From: Eric Dumazet @ 2014-03-24  2:51 UTC
To: David Miller; +Cc: herbert, hkchu, mwdalton, netdev

From: Eric Dumazet <edumazet@google.com>

When changing one 16bit value by another in IP header, we can adjust
the IP checksum by doing a simple operation described in RFC 1624,
as reminded by David.

csum_partial() is a complex function on x86_64, not really suited
for small number of checksummed bytes.

I spotted csum_partial() being in the top 20 most consuming
functions (more than 1 %) in a GRO workload, which was rather
unexpected.

The caller was inet_gro_complete() doing a csum_replace2() when
building the new IP header for the GRO packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/checksum.h | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/net/checksum.h b/include/net/checksum.h
index 37a0e24adbe7..a28f4e0f6251 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
 	return csum_add(csum, ~addend);
 }

+static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
+{
+	u16 res = (__force u16)csum;
+
+	res += (__force u16)addend;
+	return (__force __sum16)(res + (res < (__force u16)addend));
+}
+
+static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
+{
+	return csum16_add(csum, ~addend);
+}
+
 static inline __wsum
 csum_block_add(__wsum csum, __wsum csum2, int offset)
 {
@@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
 	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
 }

-static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
+/* Implements RFC 1624 (Incremental Internet Checksum)
+ * 3. Discussion states :
+ *     HC'  = ~(~HC + ~m + m')
+ *  m  : old value of a 16bit field
+ *  m' : new value of a 16bit field
+ */
+static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
 {
-	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
+	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
 }

 struct sk_buff;
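
The RFC 1624 identity used above is easy to check in isolation. The
following throwaway user-space program (a hypothetical test harness,
not part of the submission) fills a 20-byte header with random words,
gives it a valid 16-bit ones' complement checksum, patches one field
with the HC' = ~(~HC + ~m + m') update, and verifies the patched header
still passes full checksum verification:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint16_t csum_full(const uint16_t *p, int n)	/* naive reference */
{
	uint32_t sum = 0;

	while (n--)
		sum += *p++;
	sum = (sum & 0xffff) + (sum >> 16);	/* fold carries twice */
	sum = (sum & 0xffff) + (sum >> 16);
	return ~sum & 0xffff;
}

static uint16_t csum16_add(uint16_t csum, uint16_t addend)
{
	uint16_t res = csum + addend;

	return res + (res < addend);	/* propagate end-around carry */
}

int main(void)
{
	uint16_t hdr[10];
	int i, iter;

	for (iter = 0; iter < 100000; iter++) {
		for (i = 0; i < 10; i++)
			hdr[i] = rand();
		hdr[5] = 0;
		hdr[5] = csum_full(hdr, 10);		/* valid checksum */

		uint16_t newlen = rand(), sum = hdr[5];

		/* HC' = ~(~HC + ~m + m')  -- RFC 1624, section 3 */
		sum = ~csum16_add(csum16_add(~sum, ~hdr[1]), newlen);
		hdr[1] = newlen;			/* m -> m' */
		hdr[5] = sum;

		if (csum_full(hdr, 10) != 0) {		/* must still verify */
			printf("mismatch at iteration %d\n", iter);
			return 1;
		}
	}
	printf("RFC 1624 incremental update matches full recompute\n");
	return 0;
}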

* Re: [PATCH net-next] net: optimize csum_replace2()
From: David Miller @ 2014-03-24  4:20 UTC
To: eric.dumazet; +Cc: herbert, hkchu, mwdalton, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 23 Mar 2014 19:51:36 -0700

> From: Eric Dumazet <edumazet@google.com>
>
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
>
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
>
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
>
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks for following through on this Eric.

Would be nice to improve csum_replace4() similarly, since every NAT
packet uses that thing when we change the address in the protocol
header.
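
One possible shape for such a csum_replace4(), sketched here with the
csum_add()/csum_sub()/csum_fold() helpers already in
include/net/checksum.h (an illustration, not a submitted patch):

static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
{
	/* undo the fold, subtract the old word, add the new one, refold */
	__wsum tmp = csum_sub(~csum_unfold(*sum), (__force __wsum)from);

	*sum = csum_fold(csum_add(tmp, (__force __wsum)to));
}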

* RE: [PATCH net-next] net: optimize csum_replace2()
From: David Laight @ 2014-03-24 10:22 UTC
To: 'Eric Dumazet', David Miller
Cc: herbert@gondor.apana.org.au, hkchu@google.com, mwdalton@google.com,
	netdev@vger.kernel.org

From: Eric Dumazet <edumazet@google.com>
>
> When changing one 16bit value by another in IP header, we can adjust the
> IP checksum by doing a simple operation described in RFC 1624,
> as reminded by David.
>
> csum_partial() is a complex function on x86_64, not really suited
> for small number of checksummed bytes.
>
> I spotted csum_partial() being in the top 20 most consuming
> functions (more than 1 %) in a GRO workload, which was rather
> unexpected.
>
> The caller was inet_gro_complete() doing a csum_replace2() when
> building the new IP header for the GRO packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/checksum.h | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/checksum.h b/include/net/checksum.h
> index 37a0e24adbe7..a28f4e0f6251 100644
> --- a/include/net/checksum.h
> +++ b/include/net/checksum.h
> @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
>  	return csum_add(csum, ~addend);
>  }
>
> +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> +{
> +	u16 res = (__force u16)csum;

Shouldn't that be u32 ?

> +	res += (__force u16)addend;
> +	return (__force __sum16)(res + (res < (__force u16)addend));
> +}
> +
> +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> +{
> +	return csum16_add(csum, ~addend);
> +}
> +
>  static inline __wsum
>  csum_block_add(__wsum csum, __wsum csum2, int offset)
>  {
> @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
>  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
>  }
>
> -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> +/* Implements RFC 1624 (Incremental Internet Checksum)
> + * 3. Discussion states :
> + *     HC'  = ~(~HC + ~m + m')
> + *  m  : old value of a 16bit field
> + *  m' : new value of a 16bit field
> + */
> +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
>  {
> -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
>  }

It might be clearer to just say:
	*sum = ~csum16_add(csum16_add(~*sum, ~old), new));
or even:
	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
which might remove some mask instructions - especially if all the
intermediate values are left larger than 16 bits.

	David

* RE: [PATCH net-next] net: optimize csum_replace2()
From: Eric Dumazet @ 2014-03-24 13:25 UTC
To: David Laight
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
	mwdalton@google.com, netdev@vger.kernel.org

On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> From: Eric Dumazet <edumazet@google.com>
> >
> > When changing one 16bit value by another in IP header, we can adjust the
> > IP checksum by doing a simple operation described in RFC 1624,
> > as reminded by David.
> >
> > csum_partial() is a complex function on x86_64, not really suited
> > for small number of checksummed bytes.
> >
> > I spotted csum_partial() being in the top 20 most consuming
> > functions (more than 1 %) in a GRO workload, which was rather
> > unexpected.
> >
> > The caller was inet_gro_complete() doing a csum_replace2() when
> > building the new IP header for the GRO packet.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > ---
> >  include/net/checksum.h | 23 +++++++++++++++++++++--
> >  1 file changed, 21 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > index 37a0e24adbe7..a28f4e0f6251 100644
> > --- a/include/net/checksum.h
> > +++ b/include/net/checksum.h
> > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> >  	return csum_add(csum, ~addend);
> >  }
> >
> > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > +{
> > +	u16 res = (__force u16)csum;
>
> Shouldn't that be u32 ?

Why ? We compute 16bit checksums here.

>
> > +	res += (__force u16)addend;
> > +	return (__force __sum16)(res + (res < (__force u16)addend));

Note how carry is propagated, and how we return 16 bit anyway.
Using u32 would force to use a fold, which is more expensive.

> > +}
> > +
> > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > +{
> > +	return csum16_add(csum, ~addend);
> > +}
> > +
> >  static inline __wsum
> >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> >  {
> > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> >  }
> >
> > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > + * 3. Discussion states :
> > + *     HC'  = ~(~HC + ~m + m')
> > + *  m  : old value of a 16bit field
> > + *  m' : new value of a 16bit field
> > + */
> > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> >  {
> > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> >  }
>
> It might be clearer to just say:
> 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> or even:
> 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> which might remove some mask instructions - especially if all the
> intermediate values are left larger than 16 bits.

We have csum_add() and csum_sub(), I added csum16_add() and
csum16_sub(), for analogy and completeness.

For linux guys, its quite common stuff to use csum_sub(x, y) instead of
csum_add(c, ~y).

You can use whatever code matching your taste.

RFC 1624 mentions ~m, not m ^ 0xffff

But again if you prefer m ^ 0xffff, thats up to you ;)

* RE: [PATCH net-next] net: optimize csum_replace2()
From: David Laight @ 2014-03-24 14:38 UTC
To: 'Eric Dumazet'
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
	mwdalton@google.com, netdev@vger.kernel.org

From: Eric Dumazet
> On Mon, 2014-03-24 at 10:22 +0000, David Laight wrote:
> > From: Eric Dumazet <edumazet@google.com>
> > >
> > > When changing one 16bit value by another in IP header, we can adjust the
> > > IP checksum by doing a simple operation described in RFC 1624,
> > > as reminded by David.
> > >
> > > csum_partial() is a complex function on x86_64, not really suited
> > > for small number of checksummed bytes.
> > >
> > > I spotted csum_partial() being in the top 20 most consuming
> > > functions (more than 1 %) in a GRO workload, which was rather
> > > unexpected.
> > >
> > > The caller was inet_gro_complete() doing a csum_replace2() when
> > > building the new IP header for the GRO packet.
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > ---
> > >  include/net/checksum.h | 23 +++++++++++++++++++++--
> > >  1 file changed, 21 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/checksum.h b/include/net/checksum.h
> > > index 37a0e24adbe7..a28f4e0f6251 100644
> > > --- a/include/net/checksum.h
> > > +++ b/include/net/checksum.h
> > > @@ -69,6 +69,19 @@ static inline __wsum csum_sub(__wsum csum, __wsum addend)
> > >  	return csum_add(csum, ~addend);
> > >  }
> > >
> > > +static inline __sum16 csum16_add(__sum16 csum, __be16 addend)
> > > +{
> > > +	u16 res = (__force u16)csum;
> >
> > Shouldn't that be u32 ?
>
> Why ? We compute 16bit checksums here.
>
> >
> > > +	res += (__force u16)addend;
> > > +	return (__force __sum16)(res + (res < (__force u16)addend));
>
> Note how carry is propagated, and how we return 16 bit anyway.

Not enough coffee - my brain didn't parse that correctly at all :-(

> Using u32 would force to use a fold, which is more expensive.

Try compiling for something other than x86/amd64 (ie something
without 16bit arithmetic).
The ppc compiler I have (not the latest) generates 3 masks with 0xffff
and a conditional branch for the above function (without the inline).

> > > +}
> > > +
> > > +static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
> > > +{
> > > +	return csum16_add(csum, ~addend);
> > > +}
> > > +
> > >  static inline __wsum
> > >  csum_block_add(__wsum csum, __wsum csum2, int offset)
> > >  {
> > > @@ -112,9 +125,15 @@ static inline void csum_replace4(__sum16 *sum, __be32 from, __be32 to)
> > >  	*sum = csum_fold(csum_partial(diff, sizeof(diff), ~csum_unfold(*sum)));
> > >  }
> > >
> > > -static inline void csum_replace2(__sum16 *sum, __be16 from, __be16 to)
> > > +/* Implements RFC 1624 (Incremental Internet Checksum)
> > > + * 3. Discussion states :
> > > + *     HC'  = ~(~HC + ~m + m')
> > > + *  m  : old value of a 16bit field
> > > + *  m' : new value of a 16bit field
> > > + */
> > > +static inline void csum_replace2(__sum16 *sum, __be16 old, __be16 new)
> > >  {
> > > -	csum_replace4(sum, (__force __be32)from, (__force __be32)to);
> > > +	*sum = ~csum16_add(csum16_sub(~(*sum), old), new);
> > >  }
> >
> > It might be clearer to just say:
> > 	*sum = ~csum16_add(csum16_add(~*sum, ~old), new));
> > or even:
> > 	*sum = ~csum16_add(csum16_add(*sum ^ 0xffff, old ^ 0xffff), new));
> > which might remove some mask instructions - especially if all the
> > intermediate values are left larger than 16 bits.
>
> We have csum_add() and csum_sub(), I added csum16_add() and
> csum16_sub(), for analogy and completeness.
>
> For linux guys, its quite common stuff to use csum_sub(x, y) instead of
> csum_add(c, ~y).
>
> You can use whatever code matching your taste.
>
> RFC 1624 mentions ~m, not m ^ 0xffff

But C only has 32bit maths, so all the 16bit values keep
need masking.

> But again if you prefer m ^ 0xffff, thats up to you ;)

On ppc I get much better code for (retyped):
	static inline u32 csum_add(u32 c, u32 a)
	{
		c += a'
		return (c + (c >> 16)) & 0xffff;
	}
	void csum_repl(u16 *c16, u16 o16, u16 n)
	{
		u32 c = *c16, o = o16;
		*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
	}
13 instructions and no branches against 21 instructions
and 2 branches.
The same is probably true of arm.
For amd64 (where the compiler can do a 16bit compare)
my version is one instruction longer.
I've not tried to guess at the actual cycle counts!

	David

* RE: [PATCH net-next] net: optimize csum_replace2()
From: Eric Dumazet @ 2014-03-24 15:14 UTC
To: David Laight
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
	mwdalton@google.com, netdev@vger.kernel.org

On Mon, 2014-03-24 at 14:38 +0000, David Laight wrote:

> But C only has 32bit maths, so all the 16bit values keep
> need masking.
>
> > But again if you prefer m ^ 0xffff, thats up to you ;)
>
> On ppc I get much better code for (retyped):
> 	static inline u32 csum_add(u32 c, u32 a)
> 	{
> 		c += a'
> 		return (c + (c >> 16)) & 0xffff;
> 	}
> 	void csum_repl(u16 *c16, u16 o16, u16 n)
> 	{
> 		u32 c = *c16, o = o16;
> 		*c16 = csum_add(csum_add(c ^ 0xffff, o ^ 0xffff), n);
> 	}
> 13 instructions and no branches against 21 instructions
> and 2 branches.
> The same is probably true of arm.
> For amd64 (where the compiler can do a 16bit compare)
> my version is one instruction longer.
> I've not tried to guess at the actual cycle counts!
>
> 	David

Code I provided uses no conditional branch on x86.

It sounds you could provide helper to arm, if you really care of this
path. I find surprising you did not comment on my prior mails on this
subject and you suddenly seem to care now the patch is merged.

We for example have the following helper in x86 :

static inline unsigned add32_with_carry(unsigned a, unsigned b)
{
	asm("addl %2,%0\n\t"
	    "adcl $0,%0"
	    : "=r" (a)
	    : "0" (a), "r" (b));
	return a;
}

But these days, gcc seems to do a pretty good job without these helpers.

Yes we could save 4 instructions, but I doubt it is worth the pain of
adding arch helpers. I certainly wont do it.

0000000000000030 <inet_gro_complete>:
      30:	55                      push   %rbp
      31:	48 63 c6                movslq %esi,%rax
      34:	48 03 87 d0 00 00 00    add    0xd0(%rdi),%rax
      3b:	48 89 e5                mov    %rsp,%rbp
      3e:	8b 57 68                mov    0x68(%rdi),%edx
      41:	44 0f b7 40 02          movzwl 0x2(%rax),%r8d
      46:	0f b7 48 0a             movzwl 0xa(%rax),%ecx
      4a:	66 29 f2                sub    %si,%dx
      4d:	66 c1 c2 08             rol    $0x8,%dx
      51:	44 0f b6 48 09          movzbl 0x9(%rax),%r9d
      56:	66 89 50 02             mov    %dx,0x2(%rax)
      5a:	41 f7 d0                not    %r8d
      5d:	f7 d1                   not    %ecx
      5f:	66 44 01 c1             add    %r8w,%cx
      63:	41 0f 92 c0             setb   %r8b
      67:	45 0f b6 c0             movzbl %r8b,%r8d
      6b:	44 01 c1                add    %r8d,%ecx
      6e:	66 01 d1                add    %dx,%cx
      71:	41 0f 92 c0             setb   %r8b
      75:	45 0f b6 c0             movzbl %r8b,%r8d
      79:	44 01 c1                add    %r8d,%ecx
      7c:	f7 d1                   not    %ecx
      7e:	66 89 48 0a             mov    %cx,0xa(%rax)
      82:	49 63 c1                movslq %r9d,%rax
      85:	48 8b 04 c5 00 00 00    mov    0x0(,%rax,8),%rax

* RE: [PATCH net-next] net: optimize csum_replace2()
From: David Laight @ 2014-03-24 15:52 UTC
To: 'Eric Dumazet'
Cc: David Miller, herbert@gondor.apana.org.au, hkchu@google.com,
	mwdalton@google.com, netdev@vger.kernel.org

From: Eric Dumazet
...
> Code I provided uses no conditional branch on x86.

All the world isn't x86...
The sparc, ppc and arm people might want to consider optimising
this code further.

> It sounds you could provide helper to arm, if you really care of this
> path. I find surprising you did not comment on my prior mails on this
> subject and you suddenly seem to care now the patch is merged.

I don't remember seeing this particular patch before today.
Last week you were still sorting out the stall caused by the
16bit write -> 32bit load in the existing code.

In any case your change is clearly a significant improvement on what
was there before.

> We for example have the following helper in x86 :
>
> static inline unsigned add32_with_carry(unsigned a, unsigned b)
> {
> 	asm("addl %2,%0\n\t"
> 	    "adcl $0,%0"
> 	    : "=r" (a)
> 	    : "0" (a), "r" (b));
> 	return a;
> }
>
> But these days, gcc seems to do a pretty good job without these helpers.

Indeed.
While x86 can do 16bit maths, most cpus can't - so the generated code
for 'short' (and 'char') maths must mask with 0xffff (or 0xff) every
time a value is written to a local (ie register) variable.
In general you get better code by using local variables that are the
same size as machine registers.
This also applies to function arguments and return types.

I'm not sure how much difference it would make overall.
It rather depends on whether anything appears in a very hot path.
OTOH a lot of small changes can add together.

	David