From: Richard Gobert <richardbgobert@gmail.com>
To: Paolo Abeni <pabeni@redhat.com>
Cc: Eric Dumazet <edumazet@google.com>,
	davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
	alexanderduyck@fb.com, lucien.xin@gmail.com,
	lixiaoyan@google.com, iwienand@redhat.com, leon@kernel.org,
	ye.xingchen@zte.com.cn, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets
Date: Thu, 20 Apr 2023 19:23:13 +0200
Message-ID: <20230420172311.GA38309@debian>
In-Reply-To: <352c24ff0c1b3a9f63062c21bbee0dca1b9ebfff.camel@redhat.com>

> On Wed, 2023-03-22 at 20:33 +0100, Richard Gobert wrote:
> > > On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > > > 
> > > > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > > > Currently the IPv6 extension headers are parsed twice: first in
> > > > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > > > > 
> > > > > By using the new ->transport_proto field, and also storing the size
> > > > > of the network header, we can avoid parsing extension headers a
> > > > > second time in ipv6_gro_complete (which saves multiple memory
> > > > > dereferences and conditional checks inside ipv6_exthdrs_len for a
> > > > > varying number of extension headers in IPv6 packets).
> > > > > 
> > > > > The implementation had to handle both inner and outer layers in
> > > > > case of encapsulation (as they can't use the same field). I've
> > > > > applied a similar optimisation to Ethernet.
> > > > > 
> > > > > Performance tests for TCP stream over IPv6 with a varying number of
> > > > > extension headers demonstrate throughput improvement of ~0.7%.
> > > > 
> > > > I'm surprised that the improvement is measurable: for large
> > > > aggregate packets a single ipv6_exthdrs_len() call is avoided out of
> > > > tens of calls for the individual pkts. Additionally, such a figure
> > > > is comparable to the noise level in my tests.
> > 
> > It's not simple, but I made an effort to set up a quiet environment.
> > A correct configuration allows this kind of measurement to be made,
> > as the test is CPU bound and noise is a variance that can be reduced
> > with enough samples.
> > 
> > Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th gen)
> > 
> >     # power-management and hyperthreading disabled in BIOS
> >     # sysctl preallocate net mem
> >     echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable turboboost
> >     ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE frames
> > 
> >     # Single core performance
> >     for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 >"$x"; done
> > 
> >     ./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null # unload netfilter
> >     tuned-adm profile latency-performance
> >     cpupower frequency-set -f 2200MHz # Set core to specific frequency
> >     systemctl isolate rescue-ssh.target
> >     # and kill all processes besides init
> > 
> > > > This adds a couple of additional branches for the common (no
> > > > extension headers) case.
> > 
> > The additional branch in ipv6_gro_receive would be negligible or even
> > non-existent for a branch predictor in the common case
> > (non-encapsulated packets).
> > I could wrap it with a likely macro if you wish.
> > Inside ipv6_gro_complete a couple of branches are saved for the common
> > case as demonstrated below.
> > 
> > original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):
> > 
> >     // if (skb->encapsulation)
> > 
> >     ffffffff81c4962b:   f6 87 81 00 00 00 20    testb  $0x20,0x81(%rdi)
> >     ffffffff81c49632:   74 2a                   je     ffffffff81c4965e <ipv6_gro_complete+0x3e>
> > 
> >     ...
> > 
> >     // nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
> > 
> >     ffffffff81c4969c:   eb 1b                   jmp    ffffffff81c496b9 <ipv6_gro_complete+0x99>    <-- jump to beginning of for loop
> >     ffffffff81c4968e:   b8 28 00 00 00          mov    $0x28,%eax
> >     ffffffff81c49693:   31 f6                   xor    %esi,%esi
> >     ffffffff81c49695:   48 c7 c7 c0 28 aa 82    mov    $0xffffffff82aa28c0,%rdi
> >     ffffffff81c4969c:   eb 1b                   jmp    ffffffff81c496b9 <ipv6_gro_complete+0x99>
> >     ffffffff81c4969e:   f6 41 18 01             testb  $0x1,0x18(%rcx)
> >     ffffffff81c496a2:   74 34                   je     ffffffff81c496d8 <ipv6_gro_complete+0xb8>    <--- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
> >     ffffffff81c496a4:   48 98                   cltq
> >     ffffffff81c496a6:   48 01 c2                add    %rax,%rdx
> >     ffffffff81c496a9:   0f b6 42 01             movzbl 0x1(%rdx),%eax
> >     ffffffff81c496ad:   0f b6 0a                movzbl (%rdx),%ecx
> >     ffffffff81c496b0:   8d 04 c5 08 00 00 00    lea    0x8(,%rax,8),%eax
> >     ffffffff81c496b7:   01 c6                   add    %eax,%esi
> >     ffffffff81c496b9:   85 c9                   test   %ecx,%ecx    <--- for loop starts here
> >     ffffffff81c496bb:   74 e7                   je     ffffffff81c496a4 <ipv6_gro_complete+0x84>    <--- 1st conditional check: proto != NEXTHDR_HOP
> >     ffffffff81c496bd:   48 8b 0c cf             mov    (%rdi,%rcx,8),%rcx
> >     ffffffff81c496c1:   48 85 c9                test   %rcx,%rcx
> >     ffffffff81c496c4:   75 d8                   jne    ffffffff81c4969e <ipv6_gro_complete+0x7e>    <--- 2nd conditional check: unlikely(!(*opps))
> >     
> >     ... (indirect call ops->callbacks.gro_complete)
> > 
> > ipv6_exthdrs_len contains a loop which has 3 conditional checks.
> > For the common (no extension headers) case, in the new code, *all 3
> > branches are completely avoided*.
> > 
> > patched code ipv6_gro_complete:
> > 
> >     // if (skb->encapsulation)
> >     ffffffff81befe58:   f6 83 81 00 00 00 20    testb  $0x20,0x81(%rbx)
> >     ffffffff81befe5f:   74 78                   je     ffffffff81befed9 <ipv6_gro_complete+0xb9>
> > 
> >     ...
> > 
> >     // else
> >     ffffffff81befed9:   0f b6 43 50             movzbl 0x50(%rbx),%eax
> >     ffffffff81befedd:   0f b7 73 4c             movzwl 0x4c(%rbx),%esi
> >     ffffffff81befee1:   48 8b 0c c5 c0 3f a9    mov    -0x7d56c040(,%rax,8),%rcx
> >     
> >     ... (indirect call ops->callbacks.gro_complete)
> > 
> > Thus, the patch is beneficial for both the common case and the ext hdr
> > case. I would appreciate a second consideration :)
> 
> A problem with the above analysis is that it does not take into
> consideration the places where the new branches are added:
> eth_gro_receive() and ipv6_gro_receive().
> 
> Note that such functions are called for each packet on the wire:
> multiple times for each aggregate packet.
> 
> The above is likely not measurable in terms of pps delta, but the added
> CPU cycles spent for the common case are definitely there. In my
> opinion that outweighs the benefit for the extension headers case.
> 
> Cheers,
> 
> Paolo
> 
> p.s. please refrain from off-list ping. That is ignored by most and
> considered rude by some.

Thanks,
I will re-post the first patch on its own as a new submission.
As for the second patch, I get your point, and you are correct. I didn't
pay enough attention to the overhead accumulated during the receive phase,
as it wasn't showing up in my measurements. I'll look into it further and
check whether I can come up with a better solution.

Sorry for the off-list ping. Is it OK to send a ping via the mailing list?
