* [PATCH] udp: Force compute_score to always inline
@ 2026-04-09 22:15 Gabriel Krisman Bertazi
2026-04-09 22:36 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-04-09 22:15 UTC
To: willemdebruijn.kernel, davem, dsahern, edumazet, kuba, pabeni,
kuniyu
Cc: horms, netdev, Gabriel Krisman Bertazi
Back in 2024 I reported a 7-12% regression on an iperf3 UDP loopback
throughput test that we traced to the extra overhead of calling
compute_score in two places, introduced by commit f0ea27e7bfe1 ("udp:
re-score reuseport groups when connected sockets are present"). At the
time, I pointed out that the overhead was caused by the multiple calls,
combined with cpu-specific mitigations, and merged commit
50aee97d1511 ("udp: Avoid call to compute_score on multiple sites"),
which jumps back explicitly to force the rescore call into a single
place.

Recently though, we got another regression report against a newer distro
version, which a team colleague traced back to the same root cause.
It turns out that once we updated to gcc-13, the compiler got smart
enough to unroll the loop, undoing my previous mitigation. Let's bite
the bullet and mark compute_score __always_inline on both ipv4 and ipv6
to prevent gcc from de-optimizing it again in the future. These
functions are only called in two places each, udpX_lib_lookup1 and
udpX_lib_lookup2, so the extra size shouldn't be a problem, and the
function is hot enough to be very visible in profiles. In fact, with
gcc-13, forcing the inline will prevent gcc from unrolling the fix from
commit 50aee97d1511, so we don't end up increasing udpX_lib_lookup2 at
all.

I haven't re-collected the results myself, as I don't have access to the
machine at the moment. But the same colleague reported a 4.67%
improvement with this patch in the loopback benchmark, which resolves
the reported regression within noise margins.
Fixes: 50aee97d1511 ("udp: Avoid call to compute_score on multiple sites")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
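For readers unfamiliar with the pattern, the following is a minimal
userspace sketch of the two ideas discussed above: a scoring helper
forced inline, and a lookup loop that jumps backwards so the helper has
a single call site. It is not the kernel code. The names toy_sock,
toy_compute_score, toy_lookup and the "redirect" parameter (a stand-in
for a socket picked by reuseport selection) are hypothetical, and the
__always_inline fallback definition assumes GCC or clang.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel macro (assumption: GCC or clang). */
#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

/* Hypothetical, heavily simplified socket; not the kernel's struct sock. */
struct toy_sock {
	unsigned short port;
	int bound_dev;		/* 0 means "not bound to a device" */
	int connected;		/* nonzero stands in for a connected socket */
};

/* Toy scoring: -1 on mismatch, higher means a more specific match. */
static __always_inline int
toy_compute_score(const struct toy_sock *sk, unsigned short hnum, int dif)
{
	int score = 1;

	if (sk->port != hnum)
		return -1;
	if (sk->bound_dev) {
		if (sk->bound_dev != dif)
			return -1;
		score++;
	}
	if (sk->connected)
		score++;
	return score;
}

/*
 * Return the selected socket, or NULL if nothing matches. When a
 * redirect socket is supplied it replaces the current candidate and is
 * scored again by jumping back to the label, so the loop body contains
 * exactly one call to toy_compute_score().
 */
static const struct toy_sock *
toy_lookup(const struct toy_sock *socks, size_t n,
	   const struct toy_sock *redirect, unsigned short hnum, int dif)
{
	const struct toy_sock *sk, *result = NULL;
	int score, badness = 0, need_rescore;
	size_t i;

	assert(socks != NULL || n == 0);

	for (i = 0; i < n; i++) {
		sk = &socks[i];
		need_rescore = 0;
rescore:
		score = toy_compute_score(need_rescore ? result : sk,
					  hnum, dif);
		if (score > badness) {
			badness = score;
			if (need_rescore)
				continue;
			result = sk;
			if (!redirect || redirect == sk)
				continue;
			/* Pretend a reuseport group picked another socket. */
			result = redirect;
			need_rescore = 1;
			goto rescore;
		}
	}
	return result;
}
```

Calling toy_lookup() with a NULL redirect simply returns the
best-scoring entry of the array. Whether a given compiler keeps the
single call site, or duplicates the loop body, is exactly the kind of
decision the patch stops depending on.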
net/ipv4/udp.c | 8 ++++----
net/ipv6/udp.c | 9 +++++----
2 files changed, 9 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6c6b68a66dcd..e591e2ab0d7d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -365,10 +365,10 @@ int udp_v4_get_port(struct sock *sk, unsigned short snum)
 	return udp_lib_get_port(sk, snum, hash2_nulladdr);
 }
 
-static int compute_score(struct sock *sk, const struct net *net,
-			 __be32 saddr, __be16 sport,
-			 __be32 daddr, unsigned short hnum,
-			 int dif, int sdif)
+static __always_inline int
+compute_score(struct sock *sk, const struct net *net,
+	      __be32 saddr, __be16 sport, __be32 daddr,
+	      unsigned short hnum, int dif, int sdif)
 {
 	int score;
 	struct inet_sock *inet;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 010b909275dd..889d229aad61 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -127,10 +127,11 @@ void udp_v6_rehash(struct sock *sk)
 	udp_lib_rehash(sk, new_hash, new_hash4);
 }
 
-static int compute_score(struct sock *sk, const struct net *net,
-			 const struct in6_addr *saddr, __be16 sport,
-			 const struct in6_addr *daddr, unsigned short hnum,
-			 int dif, int sdif)
+static __always_inline int
+compute_score(struct sock *sk, const struct net *net,
+	      const struct in6_addr *saddr, __be16 sport,
+	      const struct in6_addr *daddr, unsigned short hnum,
+	      int dif, int sdif)
 {
 	int bound_dev_if, score;
 	struct inet_sock *inet;
--
2.52.0
* Re: [PATCH] udp: Force compute_score to always inline
2026-04-09 22:15 [PATCH] udp: Force compute_score to always inline Gabriel Krisman Bertazi
@ 2026-04-09 22:36 ` Eric Dumazet
2026-04-09 22:50 ` Gabriel Krisman Bertazi
2026-04-10 13:02 ` Willem de Bruijn
2026-04-10 13:04 ` Willem de Bruijn
2 siblings, 1 reply; 6+ messages in thread
From: Eric Dumazet @ 2026-04-09 22:36 UTC
To: Gabriel Krisman Bertazi
Cc: willemdebruijn.kernel, davem, dsahern, kuba, pabeni, kuniyu,
horms, netdev
You could include scripts/bloat-o-meter results, so that we can sense
the cost of such a change.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 6/1 up/down: 622/-410 (212)
Function                                     old     new   delta
__udp6_lib_lookup                            797    1007    +210
__udp4_lib_lookup                            838     984    +146
udp6_lib_lookup2                             404     536    +132
udp4_lib_lookup2                             396     498    +102
udpv6_rcv                                   3018    3034     +16
udp_init_sock                                244     260     +16
bpf_iter_udp_batch                           953     937     -16
__pfx_compute_score                           32       -     -32
compute_score                                362       -    -362
Total: Before=30269687, After=30269899, chg +0.00%
No change for clang.
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH] udp: Force compute_score to always inline
2026-04-09 22:36 ` Eric Dumazet
@ 2026-04-09 22:50 ` Gabriel Krisman Bertazi
0 siblings, 0 replies; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-04-09 22:50 UTC
To: Eric Dumazet
Cc: willemdebruijn.kernel, davem, dsahern, kuba, pabeni, kuniyu,
horms, netdev
Eric Dumazet <edumazet@google.com> writes:
> You could include scripts/bloat-o-meter results, so that we can sense
> the cost of such a change.
Apologies, I wasn't aware of that tool. I did some calculations by hand
and found something like 200 bytes extra in udp6_lib_lookup2.
For gcc-13:
$ scripts/bloat-o-meter vmlinux vmlinux-inline
add/remove: 0/2 grow/shrink: 4/0 up/down: 616/-416 (200)
Function                                     old     new   delta
udp6_lib_lookup2                             762     949    +187
__udp6_lib_lookup                            810     975    +165
udp4_lib_lookup2                             757     906    +149
__udp4_lib_lookup                            871     986    +115
__pfx_compute_score                           32       -     -32
compute_score                                384       -    -384
Total: Before=35011784, After=35011984, chg +0.00%
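As a cross-check of the "up/down: 616/-416 (200)" summary line, the
per-function deltas can be summed by hand. The small C sketch below does
that arithmetic for the gcc-13 rows above. The struct and function names
(size_row, size_summary, summarize) are made up for illustration and are
not part of bloat-o-meter; a removed symbol is modelled as new size 0.

```c
#include <assert.h>
#include <stddef.h>

/* One row of a size comparison, in bytes (0 if the symbol is absent). */
struct size_row {
	const char *name;
	long old_size;
	long new_size;
};

struct size_summary {
	long up;	/* bytes gained by symbols that grew */
	long down;	/* bytes lost by symbols that shrank or vanished */
};

/* Sum positive and negative deltas separately, like the header line. */
static struct size_summary summarize(const struct size_row *rows, size_t n)
{
	struct size_summary s = { 0, 0 };
	size_t i;

	assert(rows != NULL || n == 0);

	for (i = 0; i < n; i++) {
		long delta = rows[i].new_size - rows[i].old_size;

		if (delta > 0)
			s.up += delta;
		else
			s.down += delta;
	}
	return s;
}

/* Rows copied from the gcc-13 report above. */
static const struct size_row gcc13_rows[] = {
	{ "udp6_lib_lookup2",    762, 949 },
	{ "__udp6_lib_lookup",   810, 975 },
	{ "udp4_lib_lookup2",    757, 906 },
	{ "__udp4_lib_lookup",   871, 986 },
	{ "__pfx_compute_score",  32,   0 },
	{ "compute_score",       384,   0 },
};
```

Summing the rows gives up = 616 and down = -416, a net growth of 200
bytes, which matches both the header line and the difference between
the Before and After totals.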
--
Gabriel Krisman Bertazi
* Re: [PATCH] udp: Force compute_score to always inline
2026-04-09 22:15 [PATCH] udp: Force compute_score to always inline Gabriel Krisman Bertazi
2026-04-09 22:36 ` Eric Dumazet
@ 2026-04-10 13:02 ` Willem de Bruijn
2026-04-10 13:04 ` Willem de Bruijn
2 siblings, 0 replies; 6+ messages in thread
From: Willem de Bruijn @ 2026-04-10 13:02 UTC
To: Gabriel Krisman Bertazi, willemdebruijn.kernel, davem, dsahern,
edumazet, kuba, pabeni, kuniyu
Cc: horms, netdev, Gabriel Krisman Bertazi
Gabriel Krisman Bertazi wrote:
> Fixes: 50aee97d1511 ("udp: Avoid call to compute_score on multiple sites")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Acked-by: Willem de Bruijn <willemb@google.com>
* Re: [PATCH] udp: Force compute_score to always inline
2026-04-09 22:15 [PATCH] udp: Force compute_score to always inline Gabriel Krisman Bertazi
2026-04-09 22:36 ` Eric Dumazet
2026-04-10 13:02 ` Willem de Bruijn
@ 2026-04-10 13:04 ` Willem de Bruijn
2026-04-10 16:01 ` Gabriel Krisman Bertazi
2 siblings, 1 reply; 6+ messages in thread
From: Willem de Bruijn @ 2026-04-10 13:04 UTC
To: Gabriel Krisman Bertazi, willemdebruijn.kernel, davem, dsahern,
edumazet, kuba, pabeni, kuniyu
Cc: horms, netdev, Gabriel Krisman Bertazi
Gabriel Krisman Bertazi wrote:
> Fixes: 50aee97d1511 ("udp: Avoid call to compute_score on multiple sites")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Spotted this a tad late: should the comment in udp4_lib_lookup2 be
updated? It says "compute_score is too long of a function to be inline .."
* Re: [PATCH] udp: Force compute_score to always inline
2026-04-10 13:04 ` Willem de Bruijn
@ 2026-04-10 16:01 ` Gabriel Krisman Bertazi
0 siblings, 0 replies; 6+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-04-10 16:01 UTC
To: Willem de Bruijn
Cc: davem, dsahern, edumazet, kuba, pabeni, kuniyu, horms, netdev
Willem de Bruijn <willemdebruijn.kernel@gmail.com> writes:
> Spotted this a tad late: should the comment in udp4_lib_lookup2 be
> updated? It says "compute_score is too long of a function to be inline .."
Thanks for noticing. I sent a v2 with just this fixed and the
bloat-o-meter data added to the commit message, and preserved your ack.
Please review the updated comment to confirm the ack still applies.
--
Gabriel Krisman Bertazi