* [RFC PATCH net-next 0/4] ipv4/ipv6: local address lookup scaling
From: hawk @ 2026-03-31 21:07 UTC
  To: netdev
  Cc: davem, dsahern, edumazet, kuba, pabeni, horms, shuah,
	linux-kselftest, hawk, ivan, kernel-team

From: Jesper Dangaard Brouer <hawk@kernel.org>

On servers with many IPv4 addresses (e.g. Cloudflare edge nodes with
~700 addrs), __ip_dev_find() becomes visible in perf profiles on the
unconnected UDP sendmsg path:

  udp_sendmsg
    ip_route_output_flow
      ip_route_output_key_hash_rcu
        __ip_dev_find              <-- source address validation
          inet_lookup_ifaddr_rcu   <-- walks inet_addr_lst hash chain

The current inet_addr_lst is a fixed-size hlist hash table with 256
buckets (IN4_ADDR_HSIZE). At 700 addresses this gives ~2.7 entries per
bucket on average. The O(N/256) lookup cost grows linearly as addresses
are added, with no mechanism to resize. IPv6 has the same issue with
inet6_addr_lst (IN6_ADDR_HSIZE = 256).
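
As a rough check of that load figure, here is a minimal userspace
sketch of the arithmetic. The helper name avg_chain_len() is
hypothetical and a uniform hash distribution is assumed; only the
256-bucket size comes from IN4_ADDR_HSIZE.

```c
#include <assert.h>

/*
 * Average number of entries per bucket in a fixed-size hash table.
 * Assumes the hash spreads addresses uniformly, which is an
 * idealisation of real behaviour.
 */
static double avg_chain_len(unsigned int addrs, unsigned int buckets)
{
	assert(buckets != 0);
	return (double)addrs / buckets;
}
```

avg_chain_len(700, 256) evaluates to about 2.7 and
avg_chain_len(5000, 256) to about 19.5, which is the linear growth
described above.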

This series presents two approaches. The first two patches are a
minimal, low-risk fix. The last two patches are a more complete but
higher-complexity solution. Both are included so maintainers can
evaluate the trade-off.

Approach A: compile-time CONFIG option (patches 1-2)
----------------------------------------------------

The simplest fix: add CONFIG_INET_ADDR_HASH_BUCKETS and
CONFIG_INET6_ADDR_HASH_BUCKETS (default 256, range 64-16384, EXPERT)
so operators can size the tables at build time.

 - 2 lines of Kconfig + 1 line change each
 - Zero runtime behavior change at default settings
 - Immediately deployable for known workloads
 - Does not help workloads that cannot rebuild their kernel
 - Memory cost scales with the chosen bucket count per netns
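
A sketch of how a build-time bucket count could be consumed, under
stated assumptions and not taken from patches 1-2: the fallback
#define stands in for the value Kconfig would provide, the
power-of-two check is an assumption about how a hash is mapped to a
bucket, and bucket_bytes() is a hypothetical helper that uses the
8-byte bucket head of a 64-bit build.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the Kconfig-provided value; 256 is the stated default. */
#ifndef CONFIG_INET_ADDR_HASH_BUCKETS
#define CONFIG_INET_ADDR_HASH_BUCKETS 256
#endif

/* Assumption: a mask or shift maps hash -> bucket, so require 2^n. */
_Static_assert((CONFIG_INET_ADDR_HASH_BUCKETS &
		(CONFIG_INET_ADDR_HASH_BUCKETS - 1)) == 0,
	       "bucket count must be a power of two");

/*
 * Hypothetical helper: bytes of bucket heads per netns, at one 8-byte
 * pointer per bucket. The accepted range mirrors the 64-16384 range
 * given for the CONFIG option.
 */
static size_t bucket_bytes(size_t buckets)
{
	assert(buckets >= 64 && buckets <= 16384);
	return buckets * 8;
}
```

With these numbers the default of 256 buckets costs 2048 bytes per
netns, the minimum of 64 costs 512 bytes, and the maximum of 16384
costs 128 KiB.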

Approach B: rhltable conversion (patches 3-4)
---------------------------------------------

Convert IPv4 inet_addr_lst from a fixed hlist to an rhltable (resizable
hash linked table). The rhltable:

 - Maintains expected O(1) lookup regardless of address count
 - Automatically grows and shrinks as addresses are added/removed
 - Eliminates the fixed IN4_ADDR_HSIZE compile-time limit
 - Uses the rhl (resizable hash list) variant to preserve support for
   duplicate keys -- the same IP can exist on multiple interfaces, and
   the old hlist already allowed this
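
To illustrate the duplicate-key requirement, here is a minimal
userspace sketch of a single hash chain in which the same address
appears once per interface. struct addr_entry, chain_find() and
chain_count() are hypothetical stand-ins, not the kernel's
struct in_ifaddr or the rhltable API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one address entry in a hash chain. */
struct addr_entry {
	uint32_t addr;            /* IPv4 address used as the hash key */
	int ifindex;              /* interface owning this copy */
	struct addr_entry *next;  /* next entry in the same chain */
};

/* Return the first entry whose key matches addr, or NULL. */
static struct addr_entry *chain_find(struct addr_entry *head, uint32_t addr)
{
	struct addr_entry *e;

	for (e = head; e; e = e->next)
		if (e->addr == addr)
			return e;
	return NULL;
}

/* Count all entries sharing addr; duplicates are legal, one per interface. */
static size_t chain_count(const struct addr_entry *head, uint32_t addr)
{
	const struct addr_entry *e;
	size_t n = 0;

	for (e = head; e; e = e->next)
		if (e->addr == addr)
			n++;
	return n;
}
```

A table that rejected duplicate keys could not represent the second
entry here, which is why the list-valued variant is needed.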

The rhashtable_params are tuned for this use case:

 - No .hashfn set: with key_len = sizeof(__be32), the default path
   calls jhash2(key, 1, seed), which the compiler fully inlines since
   jhash2 is static inline and the length is a compile-time constant.
   The result is equivalent to jhash_1word().

 - .obj_cmpfn provides a direct __be32 comparison, replacing the
   generic memcmp() call in the default rhashtable_compare(). The
   compiler inlines this to a single cmp instruction in the inner
   lookup loop.

 - .min_size = 32 reduces initial memory vs the old fixed 256-bucket
   hlist (256 x 8 = 2048 bytes per netns). The rhltable starts at
   32 x 8 = 256 bytes for buckets plus ~180 bytes of metadata, so
   ~440 bytes total per netns -- a ~4.7x reduction. Most containers
   only have loopback, so 32 buckets is more than sufficient.
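
The difference between a direct compare and a memcmp()-based compare
can be sketched in userspace. Everything below is a stand-in so the
example builds outside the kernel: be32_t replaces __be32, struct
cmp_arg mirrors only the key field of struct rhashtable_compare_arg,
and struct ifaddr_stub is a hypothetical object holding just the keyed
address. It is not the code from patch 3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t be32_t;                   /* stand-in for __be32 */
struct cmp_arg { const void *key; };       /* stand-in for the compare arg */
struct ifaddr_stub { be32_t ifa_local; };  /* hypothetical keyed object */

/*
 * Direct 32-bit compare. Returns 0 on match and nonzero on mismatch,
 * the convention an rhashtable obj_cmpfn follows.
 */
static int ifa_cmp_direct(const struct cmp_arg *arg, const void *obj)
{
	const struct ifaddr_stub *ifa = obj;

	assert(arg && arg->key && obj);
	return ifa->ifa_local != *(const be32_t *)arg->key;
}

/* Generic equivalent: the same answer via a byte-wise memcmp(). */
static int ifa_cmp_memcmp(const struct cmp_arg *arg, const void *obj)
{
	const struct ifaddr_stub *ifa = obj;

	assert(arg && arg->key && obj);
	return memcmp(&ifa->ifa_local, arg->key, sizeof(be32_t)) != 0;
}
```

Both functions agree on every input; the direct form simply gives the
compiler a fixed-width integer comparison to inline.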

With these settings, objdump confirms that inet_lookup_ifaddr_rcu()
contains zero indirect calls and zero function calls to hashfn or
cmpfn -- everything is inlined by the compiler.

The check_lifetime() work function previously iterated all hash buckets
directly. Since rhltable does not expose bucket iteration, this is
converted to iterate via for_each_netdev + in_dev->ifa_list, which is
the natural way to walk all addresses and avoids coupling the lifetime
logic to the hash table internals.
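
The shape of that iteration change can be sketched in userspace as a
nested walk over devices and their address lists. struct dev_stub and
struct ifa_stub are hypothetical stand-ins for the kernel objects, and
the expiry test (age >= valid_lft) is a deliberate simplification, not
the actual check_lifetime() logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one address on a device. */
struct ifa_stub {
	unsigned long valid_lft;    /* lifetime in seconds */
	unsigned long age;          /* seconds since last update */
	struct ifa_stub *ifa_next;  /* next address on the same device */
};

/* Hypothetical stand-in for a device with its address list. */
struct dev_stub {
	struct ifa_stub *ifa_list;
	struct dev_stub *next;
};

/*
 * Walk every device, then every address on that device, without
 * touching any hash table. Returns how many addresses have reached
 * their valid lifetime under the simplified rule above.
 */
static size_t count_expired(const struct dev_stub *devs)
{
	const struct dev_stub *dev;
	const struct ifa_stub *ifa;
	size_t expired = 0;

	for (dev = devs; dev; dev = dev->next)
		for (ifa = dev->ifa_list; ifa; ifa = ifa->ifa_next)
			if (ifa->age >= ifa->valid_lft)
				expired++;
	return expired;
}
```

The walk depends only on the per-device lists, so the lookup table can
change its internal layout without affecting it.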

Benchmarks
----------

Performance was measured using bpftrace kprobe/kretprobe on
__ip_dev_find inside a virtme-ng VM (4 vCPUs, veth pair, network
namespaces, CPU-isolated host). A C benchmark tool sends unconnected
UDP packets in a tight loop, cycling through all source addresses to
exercise the lookup path on every sendto().

__ip_dev_find average latency (bpftrace stats, 5 rounds x 3s):

  Addrs   rhltable (ns)   hlist (ns)   Improvement
  -----   -------------   ----------   -----------
     20           201          200          0%
    100           210          218         +4%
    500           206          234        +12%
    700           218          237         +8%
   1000           214          228         +6%
   2000           231          265        +13%
   5000           247          335        +26%

At low address counts (20) the rhltable matches the hlist. The
custom obj_cmpfn (single cmp vs memcmp) and the compiler-inlined
jhash2 keep the more advanced hash table as fast as the simple hlist.

At the production use-case of ~700 addresses, the rhltable provides an
8% reduction in __ip_dev_find latency (218 ns vs 237 ns). The
improvement grows with address count, reaching 26% at 5000 addresses
as hlist chains lengthen while rhltable latency grows far more slowly.

Note: bpftrace kprobe/kretprobe instrumentation adds constant overhead
to each call, so the absolute nanosecond values are inflated. The
relative improvement percentages are the meaningful comparison.

UDP sendmsg throughput (end-to-end, 10 rounds x 3s, unconnected
sendto() cycling all source addresses, iptables DROP on receiver,
virtme-ng VM with CPU isolation, frequency pinned to 1400 MHz):

  Addrs   rhltable (pkt/s)   hlist (pkt/s)   Improvement
  -----   ----------------   -------------   -----------
     20          385,696          389,206        -0.9%
    100          382,188          382,693        -0.1%
    500          378,509          366,216        +3.4%
    700          375,295          361,877        +3.7%
   1000          365,848          356,039        +2.8%
   2000          345,092          333,232        +3.6%
   5000          324,478          303,649        +6.9%
  10000          330,775          268,786       +23.1%

The absolute throughput is low because CPUs are deliberately
down-clocked for measurement stability (CV < 2%). The relative
improvement percentages are the meaningful comparison.

At the production use-case of ~700 addresses, rhltable provides a
+3.7% throughput improvement, growing to +23% at 10,000 addresses as
hlist O(N/256) chain walks dominate.
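
The Improvement columns in both tables can be reproduced with the
following sketch. The function names are hypothetical. The hlist
result is taken as the baseline in both cases: for latency a positive
value means rhltable is faster, for throughput a positive value means
rhltable sends more packets per second.

```c
#include <assert.h>

/* Latency reduction of rhltable relative to hlist, in percent. */
static double latency_reduction_pct(double hlist_ns, double rhl_ns)
{
	assert(hlist_ns > 0.0);
	return 100.0 * (hlist_ns - rhl_ns) / hlist_ns;
}

/* Throughput gain of rhltable relative to hlist, in percent. */
static double throughput_gain_pct(double hlist_pps, double rhl_pps)
{
	assert(hlist_pps > 0.0);
	return 100.0 * (rhl_pps - hlist_pps) / hlist_pps;
}
```

For the 700-address rows, latency_reduction_pct(237, 218) is about 8.0
and throughput_gain_pct(361877, 375295) is about 3.7, matching the
tables.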

Summary
-------

Patches 1-2 (CONFIG option) are trivial, safe, and solve the immediate
problem for operators who can rebuild their kernel. Patches 3-4
(rhltable) are a better long-term solution that works automatically for
all workloads, at the cost of more complex code and a harder review.

If the rhltable approach is acceptable, patches 1-2 become unnecessary
(the rhltable supersedes the fixed hash table entirely). They are
included here to give maintainers the choice.

Future work: a similar rhltable conversion for IPv6 inet6_addr_lst.

Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: kernel-team@cloudflare.com

Jesper Dangaard Brouer (4):
  ipv4: make inet_addr_lst hash table size configurable
  ipv6: make inet6_addr_lst hash table size configurable
  ipv4: convert inet_addr_lst to rhltable for dynamic resizing
  selftests: net: add IPv4 address lookup stress test

 include/linux/inetdevice.h                    |   3 +-
 include/net/ip.h                              |   5 -
 include/net/netns/ipv4.h                      |   4 +-
 net/ipv4/devinet.c                            | 149 ++--
 net/ipv6/Kconfig                              |  15 +
 net/ipv6/addrconf.c                           |   2 +-
 tools/testing/selftests/net/Makefile          |   3 +
 .../selftests/net/ipv4_addr_lookup_test.sh    | 804 ++++++++++++++++++
 .../net/ipv4_addr_lookup_test_virtme.sh       | 282 ++++++
 .../selftests/net/ipv4_addr_lookup_trace.bt   | 178 ++++
 .../net/ipv4_addr_lookup_udp_sender.c         | 401 +++++++++
 11 files changed, 1772 insertions(+), 74 deletions(-)
 create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test.sh
 create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test_virtme.sh
 create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_trace.bt
 create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_udp_sender.c

-- 
2.43.0



end of thread, other threads:[~2026-04-03 22:35 UTC | newest]

Thread overview: 6+ messages
2026-03-31 21:07 [RFC PATCH net-next 0/4] ipv4/ipv6: local address lookup scaling hawk
2026-03-31 21:07 ` [RFC PATCH net-next 1/4] ipv4: make inet_addr_lst hash table size configurable hawk
2026-03-31 21:07 ` [RFC PATCH net-next 2/4] ipv6: make inet6_addr_lst " hawk
2026-03-31 21:07 ` [RFC PATCH net-next 3/4] ipv4: convert inet_addr_lst to rhltable for dynamic resizing hawk
2026-03-31 21:07 ` [RFC PATCH net-next 4/4] selftests: net: add IPv4 address lookup stress test hawk
2026-04-03 22:35 ` [RFC PATCH net-next 0/4] ipv4/ipv6: local address lookup scaling David Ahern
