From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, horms@kernel.org,
shuah@kernel.org, linux-kselftest@vger.kernel.org,
hawk@kernel.org, ivan@cloudflare.com, kernel-team@cloudflare.com
Subject: [RFC PATCH net-next 0/4] ipv4/ipv6: local address lookup scaling
Date: Tue, 31 Mar 2026 23:07:35 +0200
Message-ID: <20260331210739.3998753-1-hawk@kernel.org>
From: Jesper Dangaard Brouer <hawk@kernel.org>
On servers with many IPv4 addresses (e.g. Cloudflare edge nodes with
~700 addrs), __ip_dev_find() becomes visible in perf profiles on the
unconnected UDP sendmsg path:
  udp_sendmsg
    ip_route_output_flow
      ip_route_output_key_hash_rcu
        __ip_dev_find            <-- source address validation
          inet_lookup_ifaddr_rcu <-- walks inet_addr_lst hash chain
The current inet_addr_lst is a fixed-size hlist hash table with 256
buckets (IN4_ADDR_HSIZE). At 700 addresses this gives ~2.7 entries per
bucket on average. The O(N/256) lookup cost grows linearly as addresses
are added, with no mechanism to resize. IPv6 has the same issue with
inet6_addr_lst (IN6_ADDR_HSIZE = 256).
This series presents two approaches. The first two patches are a
minimal, low-risk fix. The last two patches are a more complete but
higher-complexity solution. Both are included so maintainers can
evaluate the trade-off.
Approach A: compile-time CONFIG option (patches 1-2)
----------------------------------------------------
The simplest fix: add CONFIG_INET_ADDR_HASH_BUCKETS and
CONFIG_INET6_ADDR_HASH_BUCKETS (default 256, range 64-16384, EXPERT)
so operators can size the tables at build time.
Pros:
- Two lines of Kconfig plus a one-line change per address family
- Zero runtime behavior change at default settings
- Immediately deployable for known workloads
Cons:
- Does not help workloads that cannot rebuild their kernel
- Memory cost scales with the chosen bucket count per netns
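For reference, the IPv4 entry might look roughly like this. This is a
sketch, not a quote from the patch: the exact help text, the placement
in net/ipv4/Kconfig, and the wiring into IN4_ADDR_HSIZE are assumptions
(in practice the value would likely be restricted to powers of two,
since the table size feeds a hash mask):

```kconfig
config INET_ADDR_HASH_BUCKETS
	int "Number of IPv4 local address hash buckets" if EXPERT
	range 64 16384
	default 256
	help
	  Size of the inet_addr_lst hash table used for local address
	  lookups such as __ip_dev_find(). Increase this on hosts that
	  carry many local IPv4 addresses.
```

The corresponding one-line source change would then replace the
hard-coded table size with the config value.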
Approach B: rhltable conversion (patches 3-4)
---------------------------------------------
Convert IPv4 inet_addr_lst from a fixed hlist to an rhltable (resizable
hash linked table). The rhltable:
- Maintains O(1) lookup regardless of address count
- Automatically grows and shrinks as addresses are added/removed
- Eliminates the fixed IN4_ADDR_HSIZE compile-time limit
- Uses the rhl (resizable hash list) variant to preserve support for
duplicate keys -- the same IP can exist on multiple interfaces, and
the old hlist already allowed this
The rhashtable_params are tuned for this use case:
- No .hashfn set: with key_len = sizeof(__be32), the default path
calls jhash2(key, 1, seed), which the compiler fully inlines since
jhash2 is static inline and the length is a compile-time constant.
The result is equivalent to jhash_1word().
- .obj_cmpfn provides a direct __be32 comparison, replacing the
generic memcmp() call in the default rhashtable_compare(). The
compiler inlines this to a single cmp instruction in the inner
lookup loop.
- .min_size = 32 reduces initial memory vs the old fixed 256-bucket
hlist (256 x 8 = 2048 bytes per netns). The rhltable starts at
32 x 8 = 256 bytes for buckets plus ~180 bytes of metadata, so
~440 bytes total per netns -- a 4x reduction. Most containers
only have loopback, so 32 buckets is more than sufficient.
With these settings, objdump confirms that inet_lookup_ifaddr_rcu()
contains zero indirect calls and zero function calls to hashfn or
cmpfn -- everything is inlined by the compiler.
The check_lifetime() work function previously iterated all hash buckets
directly. Since rhltable does not expose bucket iteration, this is
converted to iterate via for_each_netdev + in_dev->ifa_list, which is
the natural way to walk all addresses and avoids coupling the lifetime
logic to the hash table internals.
Benchmarks
----------
Performance was measured using bpftrace kprobe/kretprobe on
__ip_dev_find inside a virtme-ng VM (4 vCPUs, veth pair, network
namespaces, CPU-isolated host). A C benchmark tool sends unconnected
UDP packets in a tight loop, cycling through all source addresses to
exercise the lookup path on every sendto().
__ip_dev_find average latency (bpftrace stats, 5 rounds x 3s):
Addrs rhltable (ns) hlist (ns) Improvement
----- ------------- ---------- -----------
20 201 200 0%
100 210 218 +4%
500 206 234 +12%
700 218 237 +8%
1000 214 228 +6%
2000 231 265 +13%
5000 247 335 +26%
At low address counts (20) the rhltable matches the hlist. The
custom obj_cmpfn (a single cmp instead of memcmp) and the
compiler-inlined jhash2 keep the more advanced hash table as fast as
the simple hlist.
At the production use-case of ~700 addresses, the rhltable provides an
8% reduction in __ip_dev_find latency (218 ns vs 237 ns). The
improvement grows with address count, reaching 26% at 5000 addresses
as hlist chains lengthen while rhltable stays flat.
Note: bpftrace kprobe/kretprobe instrumentation adds constant overhead
to each call, so the absolute nanosecond values are inflated. The
relative improvement percentages are the meaningful comparison.
UDP sendmsg throughput (end-to-end, 10 rounds x 3s, unconnected
sendto() cycling all source addresses, iptables DROP on receiver,
virtme-ng VM with CPU isolation, frequency pinned to 1400 MHz):
Addrs rhltable (pkt/s) hlist (pkt/s) Improvement
----- ---------------- ------------- -----------
20 385,696 389,206 -0.9%
100 382,188 382,693 -0.1%
500 378,509 366,216 +3.4%
700 375,295 361,877 +3.7%
1000 365,848 356,039 +2.8%
2000 345,092 333,232 +3.6%
5000 324,478 303,649 +6.9%
10000 330,775 268,786 +23.1%
The absolute throughput is low because CPUs are deliberately
down-clocked for measurement stability (CV < 2%). The relative
improvement percentages are the meaningful comparison.
At the production use-case of ~700 addresses, rhltable provides a
+3.7% throughput improvement, growing to +23% at 10,000 addresses as
hlist O(N/256) chain walks dominate.
Summary
-------
Patches 1-2 (CONFIG option) are trivial, safe, and solve the immediate
problem for operators who can rebuild their kernel. Patches 3-4
(rhltable) are a better long-term solution that works automatically for
all workloads, at the cost of more complex code and a harder review.
If the rhltable approach is acceptable, patches 1-2 become unnecessary
(the rhltable supersedes the fixed hash table entirely). They are
included here to give maintainers the choice.
Future work: a similar rhltable conversion for IPv6 inet6_addr_lst.
Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: kernel-team@cloudflare.com
Jesper Dangaard Brouer (4):
ipv4: make inet_addr_lst hash table size configurable
ipv6: make inet6_addr_lst hash table size configurable
ipv4: convert inet_addr_lst to rhltable for dynamic resizing
selftests: net: add IPv4 address lookup stress test
include/linux/inetdevice.h | 3 +-
include/net/ip.h | 5 -
include/net/netns/ipv4.h | 4 +-
net/ipv4/devinet.c | 149 ++--
net/ipv6/Kconfig | 15 +
net/ipv6/addrconf.c | 2 +-
tools/testing/selftests/net/Makefile | 3 +
.../selftests/net/ipv4_addr_lookup_test.sh | 804 ++++++++++++++++++
.../net/ipv4_addr_lookup_test_virtme.sh | 282 ++++++
.../selftests/net/ipv4_addr_lookup_trace.bt | 178 ++++
.../net/ipv4_addr_lookup_udp_sender.c | 401 +++++++++
11 files changed, 1772 insertions(+), 74 deletions(-)
create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test.sh
create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test_virtme.sh
create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_trace.bt
create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_udp_sender.c
--
2.43.0