From: Kuniyuki Iwashima <kuniyu@amazon.com>
To: <willemdebruijn.kernel@gmail.com>
Cc: <davem@davemloft.net>, <krisman@suse.de>, <kuniyu@amazon.com>,
<lmb@isovalent.com>, <martin.lau@kernel.org>,
<netdev@vger.kernel.org>
Subject: Re: [PATCH v2] udp: Avoid call to compute_score on multiple sites
Date: Wed, 10 Apr 2024 17:08:07 -0700 [thread overview]
Message-ID: <20240411000807.55368-1-kuniyu@amazon.com> (raw)
In-Reply-To: <66171e3698462_2d249d29412@willemb.c.googlers.com.notmuch>
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Wed, 10 Apr 2024 19:18:14 -0400
> Kuniyuki Iwashima wrote:
> > From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> > Date: Wed, 10 Apr 2024 18:51:33 -0400
> > > Kuniyuki Iwashima wrote:
> > > > From: Gabriel Krisman Bertazi <krisman@suse.de>
> > > > Date: Wed, 10 Apr 2024 17:50:47 -0400
> > > > > We've observed a 7-12% performance regression in iperf3 UDP ipv4 and
> > > > > ipv6 tests with multiple sockets on Zen3 cpus, which we traced back to
> > > > > commit f0ea27e7bfe1 ("udp: re-score reuseport groups when connected
> > > > > sockets are present"). The failing tests were those that would spawn
> > > > > UDP sockets per-cpu on systems that have a high number of cpus.
> > > > >
> > > > > Unsurprisingly, it is not caused by the extra re-scoring of the reused
> > > > > socket, but due to the compiler no longer inlining compute_score, once
> > > > > it has the extra call site in udp4_lib_lookup2. This is augmented by
> > > > > the "Safe RET" mitigation for SRSO, needed in our Zen3 cpus.
> > > > >
> > > > > We could just explicitly inline it, but compute_score() is quite a large
> > > > > function, around 300b. Inlining in two sites would almost double
> > > > > udp4_lib_lookup2, which is a silly thing to do just to workaround a
> > > > > mitigation. Instead, this patch shuffles the code a bit to avoid the
> > > > > multiple calls to compute_score. Since it is a static function used in
> > > > > one spot, the compiler can safely fold it in, as it did before, without
> > > > > increasing the text size.
> > > > >
> > > > > With this patch applied I ran my original iperf3 testcases. The failing
> > > > > cases all looked like this (ipv4):
> > > > > iperf3 -c 127.0.0.1 --udp -4 -f K -b $R -l 8920 -t 30 -i 5 -P 64 -O 2
> > > > >
> > > > > where $R is either 1G/10G/0 (max, unlimited). I ran 3 times each.
> > > > > baseline is 6.9.0-rc1-g962490525cff, just a recent checkout of Linus
> > > > > tree. harmean == harmonic mean; CV == coefficient of variation.
> > > > >
> > > > > ipv4:
> > > > > 1G 10G MAX
> > > > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> > > > > baseline 1730488.20(0.0050) 1639269.91(0.0795) 1436340.05(0.0954)
> > > > > patched 1980936.14(0.0020) 1933614.06(0.0866) 1784184.51(0.0961)
> > > > >
> > > > > ipv6:
> > > > > 1G 10G MAX
> > > > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> > > > > baseline 1679016.07(0.0053) 1697504.56(0.0064) 1481432.74(0.0840)
> > > > > patched 1924003.38(0.0153) 1852277.31(0.0457) 1690991.46(0.1848)
> > > > >
> > > > > This restores the performance we had before the change above with this
> > > > > benchmark. We obviously don't expect any real impact when mitigations
> > > > > are disabled, but just to be sure it also doesn't regresses:
> > > > >
> > > > > mitigations=off ipv4:
> > > > > 1G 10G MAX
> > > > > HARMEAN (CV) HARMEAN (CV) HARMEAN (CV)
> > > > > baseline 3230279.97(0.0066) 3229320.91(0.0060) 2605693.19(0.0697)
> > > > > patched 3242802.36(0.0073) 3239310.71(0.0035) 2502427.19(0.0882)
> > > > >
> > > > > Cc: Lorenz Bauer <lmb@isovalent.com>
> > > > > Fixes: f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
> > > > > Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> > > > >
> > > > > ---
> > > > > Changes since v1:
> > > > > (me)
> > > > > - recollected performance data after changes below only for the
> > > > > mitigations enabled case.
> > > > > (suggested by Willem de Bruijn)
> > > > > - Drop __always_inline in compute_score
> > > > > - Simplify logic by replacing third struct sock pointer with bool
> > > > > - Fix typo in commit message
> > > > > - Don't explicitly break out of loop after rescore
> > > > > ---
> > > > > net/ipv4/udp.c | 18 +++++++++++++-----
> > > > > net/ipv6/udp.c | 17 +++++++++++++----
> > > > > 2 files changed, 26 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> > > > > index 661d0e0d273f..a13ef8e06093 100644
> > > > > --- a/net/ipv4/udp.c
> > > > > +++ b/net/ipv4/udp.c
> > > > > @@ -427,12 +427,15 @@ static struct sock *udp4_lib_lookup2(struct net *net,
> > > > > {
> > > > > struct sock *sk, *result;
> > > > > int score, badness;
> > > > > + bool rescore = false;
> > > >
> > > > nit: Keep reverse xmax tree order.
> > > > https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
> > > >
> > > > >
> > > > > result = NULL;
> > > > > badness = 0;
> > > > > udp_portaddr_for_each_entry_rcu(sk, &hslot2->head) {
> > > > > - score = compute_score(sk, net, saddr, sport,
> > > > > - daddr, hnum, dif, sdif);
> > > > > +rescore:
> > > > > + score = compute_score((rescore ? result : sk), net, saddr,
> > > >
> > > > I guess () is not needed around rescore ?
> > > >
> > > > Both same for IPv6.
> > > >
> > > > Otherwise, looks good to me.
> > > >
> > > > Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> > >
> > > Can we avoid using the same name for the label and boolean?
> > >
> > > And since if looping result will have state TCP_ESTABLISHED, can it
> > > just be
> > >
> > > sk = result;
> > > goto rescore;
> >
> > TCP_ESTABLISHED never reaches the rescore jump as it's checked
Sorry I forgot about BPF. I think BPF can select established one,
so this part is not true.
> > before calling inet_lookup_reuseport() and inet_lookup_reuseport()
> > also does not select TCP_ESTABLISHED.
>
> I was thinking of the second part, inet_lookup_reuseport returning
> a connection from the group. I suppose this is assured not to be
> the case due to
>
> /* Falleback to scoring if grnult;p has connections */
> if (!reuseport_has_conns(sk))
> return result;
>
>
> Is that what you mean?
reuseport_has_conns() is for reuseport group mixed with TCP_CLOSE and
TCP_ESTABLISHED, and here sk is usually (I mean without BPF) TCP_CLOSE
so that we don't decide too early and can check TCP_ESTABLISHED socket
placed later in the same hash chain.
Also, reuseport_select_sock_by_hash() returns NULL If the reuseport group
has only TCP_ESTABLISHED sockets when selecting, and then we continue with
result = sk;
>
> There are a lot of hidden assumptions then to make sure this
> control flow is correct.
>
> Should we instead just have
>
> badness = score;
>
> + if (rescore)
> + continue;
The 2nd compute_score() is added recently for a situation where
inet_lookup_reuseport() returns sk with higher score to avoid
calling inet_lookup_reuseport() again for the result.
f0ea27e7bfe1 ("udp: re-score reuseport groups when connected sockets are present")
>
> Also, can the rescore = false in the datapath be avoided. The purpose
> is to only jump once. It would be good if it is obvious that a
> repeated (or infinite) loop is not possible, regardless of what
> the callees return.
>
next prev parent reply other threads:[~2024-04-11 0:08 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-04-10 21:50 [PATCH v2] udp: Avoid call to compute_score on multiple sites Gabriel Krisman Bertazi
2024-04-10 22:13 ` Kuniyuki Iwashima
2024-04-10 22:51 ` Willem de Bruijn
2024-04-10 23:07 ` Kuniyuki Iwashima
2024-04-10 23:18 ` Willem de Bruijn
2024-04-11 0:08 ` Kuniyuki Iwashima [this message]
2024-04-11 1:54 ` Gabriel Krisman Bertazi
2024-04-11 2:21 ` Kuniyuki Iwashima
2024-04-11 19:53 ` Gabriel Krisman Bertazi
2024-04-11 20:13 ` Kuniyuki Iwashima
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240411000807.55368-1-kuniyu@amazon.com \
--to=kuniyu@amazon.com \
--cc=davem@davemloft.net \
--cc=krisman@suse.de \
--cc=lmb@isovalent.com \
--cc=martin.lau@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=willemdebruijn.kernel@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.