From: Richard Gobert <richardbgobert@gmail.com>
To: Eric Dumazet <edumazet@google.com>, pabeni@redhat.com
Cc: Paolo Abeni <pabeni@redhat.com>,
davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
alexanderduyck@fb.com, lucien.xin@gmail.com,
lixiaoyan@google.com, iwienand@redhat.com, leon@kernel.org,
ye.xingchen@zte.com.cn, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets
Date: Wed, 22 Mar 2023 20:33:11 +0100 [thread overview]
Message-ID: <20230322193309.GA32681@debian> (raw)
In-Reply-To: <CANn89iK_bsPoaRVe+FNZ7LF_eLbz2Af6kju4j9TVHtbgkpcn5g@mail.gmail.com>
> On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > Currently the IPv6 extension headers are parsed twice: first in
> > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > >
> > > By using the new ->transport_proto field, and also storing the size of the
> > > network header, we can avoid parsing extension headers a second time in
> > > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > > IPv6 packets).
> > >
> > > The implementation had to handle both inner and outer layers in case of
> > > encapsulation (as they can't use the same field). I've applied a similar
> > > optimisation to Ethernet.
> > >
> > > Performance tests for TCP stream over IPv6 with a varying amount of
> > > extension headers demonstrate throughput improvement of ~0.7%.
> >
> > I'm surprised that the improvement is measurable: for large aggregate
> > packets a single ipv6_exthdrs_len() call is avoided out of tens calls
> > for the individual pkts. Additionally such figure is comparable to
> > noise level in my tests.
It's not simple but I made an effort to make a quiet environment.
Correct configuration allows for this kind of measurements to be made
as the test is CPU bound and noise is a variance that can be reduced with
enough samples.
Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th
gen)
# power-management and hyperthreading disabled in BIOS
# sysctl preallocate net mem
echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable turboboost
ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE frames
# Single core performance
for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 >"$x"; done
./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null # unload netfilter
tuned-adm profile latency-performance
cpupower frequency-set -f 2200MHz # Set core to specific frequency
systemctl isolate rescue-ssh.target
# and kill all processes besides init
> > This adds a couple of additional branches for the common (no extensions
> > header) case.
The additional branch in ipv6_gro_receive would be negligible or even
non-existent for a branch predictor in the common case
(non-encapsulated packets).
I could wrap it with a likely macro if you wish.
Inside ipv6_gro_complete a couple of branches are saved for the common
case as demonstrated below.
original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):
// if (skb->encapsulation)
ffffffff81c4962b: f6 87 81 00 00 00 20 testb $0x20,0x81(%rdi)
ffffffff81c49632: 74 2a je ffffffff81c4965e <ipv6_gro_complete+0x3e>
...
// nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99> <-- jump to beginning of for loop
ffffffff81c4968e: b8 28 00 00 00 mov $0x28,%eax
ffffffff81c49693: 31 f6 xor %esi,%esi
ffffffff81c49695: 48 c7 c7 c0 28 aa 82 mov $0xffffffff82aa28c0,%rdi
ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99>
ffffffff81c4969e: f6 41 18 01 testb $0x1,0x18(%rcx)
ffffffff81c496a2: 74 34 je ffffffff81c496d8 <ipv6_gro_complete+0xb8> <--- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
ffffffff81c496a4: 48 98 cltq
ffffffff81c496a6: 48 01 c2 add %rax,%rdx
ffffffff81c496a9: 0f b6 42 01 movzbl 0x1(%rdx),%eax
ffffffff81c496ad: 0f b6 0a movzbl (%rdx),%ecx
ffffffff81c496b0: 8d 04 c5 08 00 00 00 lea 0x8(,%rax,8),%eax
ffffffff81c496b7: 01 c6 add %eax,%esi
ffffffff81c496b9: 85 c9 test %ecx,%ecx <--- for loop starts here
ffffffff81c496bb: 74 e7 je ffffffff81c496a4 <ipv6_gro_complete+0x84> <--- 1st conditional check: proto != NEXTHDR_HOP
ffffffff81c496bd: 48 8b 0c cf mov (%rdi,%rcx,8),%rcx
ffffffff81c496c1: 48 85 c9 test %rcx,%rcx
ffffffff81c496c4: 75 d8 jne ffffffff81c4969e <ipv6_gro_complete+0x7e> <--- 2nd conditional check: unlikely(!(*opps))
... (indirect call ops->callbacks.gro_complete)
ipv6_exthdrs_len contains a loop which has 3 conditional checks.
For the common (no extensions header) case, in the new code, *all 3
branches are completely avoided*
patched code ipv6_gro_complete:
// if (skb->encapsulation)
ffffffff81befe58: f6 83 81 00 00 00 20 testb $0x20,0x81(%rbx)
ffffffff81befe5f: 74 78 je ffffffff81befed9 <ipv6_gro_complete+0xb9>
...
// else
ffffffff81befed9: 0f b6 43 50 movzbl 0x50(%rbx),%eax
ffffffff81befedd: 0f b7 73 4c movzwl 0x4c(%rbx),%esi
ffffffff81befee1: 48 8b 0c c5 c0 3f a9 mov -0x7d56c040(,%rax,8),%rcx
... (indirect call ops->callbacks.gro_complete)
Thus, the patch is beneficial for both the common case and the ext hdr
case. I would appreciate a second consideration :)
> > while patch 1/2 could be useful, patch 2/2 overall looks not worthy to
> > me.
> >
> > I suggest to re-post for inclusion only patch 1, unless others have
> > strong different opinions.
> >
>
> +2
>
> I have the same feeling/opinion.
>
> > Cheers,
> >
> > Paolo
> >
next prev parent reply other threads:[~2023-03-22 19:33 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-20 16:37 [PATCH v4 0/2] gro: optimise redundant parsing of packets Richard Gobert
2023-03-20 16:46 ` [PATCH v4 1/2] gro: decrease size of CB Richard Gobert
2023-03-20 17:00 ` [PATCH v4 2/2] gro: optimise redundant parsing of packets Richard Gobert
2023-03-22 9:59 ` Paolo Abeni
2023-03-22 10:11 ` Eric Dumazet
2023-03-22 19:33 ` Richard Gobert [this message]
2023-04-03 11:41 ` Paolo Abeni
2023-04-20 17:23 ` Richard Gobert
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230322193309.GA32681@debian \
--to=richardbgobert@gmail.com \
--cc=alexanderduyck@fb.com \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=iwienand@redhat.com \
--cc=kuba@kernel.org \
--cc=leon@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=lixiaoyan@google.com \
--cc=lucien.xin@gmail.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=ye.xingchen@zte.com.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.