public inbox for netdev@vger.kernel.org
From: hawk@kernel.org
To: netdev@vger.kernel.org
Cc: davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
	kuba@kernel.org, pabeni@redhat.com, horms@kernel.org,
	shuah@kernel.org, linux-kselftest@vger.kernel.org,
	hawk@kernel.org, ivan@cloudflare.com, kernel-team@cloudflare.com
Subject: [RFC PATCH net-next 0/4] ipv4/ipv6: local address lookup scaling
Date: Tue, 31 Mar 2026 23:07:35 +0200	[thread overview]
Message-ID: <20260331210739.3998753-1-hawk@kernel.org> (raw)

From: Jesper Dangaard Brouer <hawk@kernel.org>

On servers with many IPv4 addresses (e.g. Cloudflare edge nodes with
~700 addrs), __ip_dev_find() becomes visible in perf profiles on the
unconnected UDP sendmsg path:

  udp_sendmsg
    ip_route_output_flow
      ip_route_output_key_hash_rcu
        __ip_dev_find              <-- source address validation
          inet_lookup_ifaddr_rcu   <-- walks inet_addr_lst hash chain

The current inet_addr_lst is a fixed-size hlist hash table with 256
buckets (IN4_ADDR_HSIZE). At 700 addresses this gives ~2.7 entries per
bucket on average. The O(N/256) lookup cost grows linearly as addresses
are added, with no mechanism to resize. IPv6 has the same issue with
inet6_addr_lst (IN6_ADDR_HSIZE = 256).

This series presents two approaches. The first two patches are a
minimal, low-risk fix. The last two patches are a more complete but
higher-complexity solution. Both are included so maintainers can
evaluate the trade-off.

Approach A: compile-time CONFIG option (patches 1-2)
----------------------------------------------------

The simplest fix: add CONFIG_INET_ADDR_HASH_BUCKETS and
CONFIG_INET6_ADDR_HASH_BUCKETS (default 256, range 64-16384, EXPERT)
so operators can size the tables at build time.

 - 2 lines of Kconfig + 1 line change each
 - Zero runtime behavior change at default settings
 - Immediately deployable for known workloads
 - Does not help workloads that cannot rebuild their kernel
 - Memory cost scales with the chosen bucket count per netns
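A sketch of what the IPv4 half of Approach A could look like (the symbol
name, range, default, and EXPERT gating come from the description above;
the prompt text, help text, and file placement are assumptions):

```
config INET_ADDR_HASH_BUCKETS
	int "IPv4 address hash table buckets" if EXPERT
	range 64 16384
	default 256
	help
	  Number of hash buckets in inet_addr_lst, used by local
	  address lookups such as __ip_dev_find(). Increase this on
	  hosts configured with many local IPv4 addresses.
```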

Approach B: rhltable conversion (patches 3-4)
---------------------------------------------

Convert IPv4 inet_addr_lst from a fixed hlist to an rhltable (resizable
hash list table). The rhltable:

 - Maintains O(1) lookup regardless of address count
 - Automatically grows and shrinks as addresses are added/removed
 - Eliminates the fixed IN4_ADDR_HSIZE compile-time limit
 - Uses the rhl (resizable hash list) variant to preserve support for
   duplicate keys -- the same IP can exist on multiple interfaces, and
   the old hlist already allowed this

The rhashtable_params are tuned for this use case:

 - No .hashfn set: with key_len = sizeof(__be32), the default path
   calls jhash2(key, 1, seed), which the compiler fully inlines since
   jhash2 is static inline and the length is a compile-time constant.
   The result is equivalent to jhash_1word().

 - .obj_cmpfn provides a direct __be32 comparison, replacing the
   generic memcmp() call in the default rhashtable_compare(). The
   compiler inlines this to a single cmp instruction in the inner
   lookup loop.

 - .min_size = 32 reduces initial memory vs the old fixed 256-bucket
   hlist (256 x 8 = 2048 bytes per netns). The rhltable starts at
   32 x 8 = 256 bytes for buckets plus ~180 bytes of metadata, so
   ~440 bytes total per netns -- a 4x reduction. Most containers
   only have loopback, so 32 buckets is more than sufficient.

With these settings, objdump confirms that inet_lookup_ifaddr_rcu()
contains zero indirect calls and zero function calls to hashfn or
cmpfn -- everything is inlined by the compiler.

The check_lifetime() work function previously iterated all hash buckets
directly. Since rhltable does not expose bucket iteration, this is
converted to iterate via for_each_netdev + in_dev->ifa_list, which is
the natural way to walk all addresses and avoids coupling the lifetime
logic to the hash table internals.

Benchmarks
----------

Performance was measured using bpftrace kprobe/kretprobe on
__ip_dev_find inside a virtme-ng VM (4 vCPUs, veth pair, network
namespaces, CPU-isolated host). A C benchmark tool sends unconnected
UDP packets in a tight loop, cycling through all source addresses to
exercise the lookup path on every sendto().

__ip_dev_find average latency (bpftrace stats, 5 rounds x 3s):

  Addrs   rhltable (ns)   hlist (ns)   Improvement
  -----   -------------   ----------   -----------
     20           201          200          0%
    100           210          218         +4%
    500           206          234        +12%
    700           218          237         +8%
   1000           214          228         +6%
   2000           231          265        +13%
   5000           247          335        +26%

At low address counts (20) the rhltable matches the hlist. The
custom obj_cmpfn (a single cmp vs memcmp) and the compiler-inlined
jhash2 keep the more advanced hash table as fast as the simple hlist.

At the production use-case of ~700 addresses, the rhltable provides an
8% reduction in __ip_dev_find latency (218 ns vs 237 ns). The
improvement grows with address count, reaching 26% at 5000 addresses
as hlist chains lengthen while rhltable stays flat.

Note: bpftrace kprobe/kretprobe instrumentation adds constant overhead
to each call, so the absolute nanosecond values are inflated. The
relative improvement percentages are the meaningful comparison.

UDP sendmsg throughput (end-to-end, 10 rounds x 3s, unconnected
sendto() cycling all source addresses, iptables DROP on receiver,
virtme-ng VM with CPU isolation, frequency pinned to 1400 MHz):

  Addrs   rhltable (pkt/s)   hlist (pkt/s)   Improvement
  -----   ----------------   -------------   -----------
     20          385,696          389,206        -0.9%
    100          382,188          382,693        -0.1%
    500          378,509          366,216        +3.4%
    700          375,295          361,877        +3.7%
   1000          365,848          356,039        +2.8%
   2000          345,092          333,232        +3.6%
   5000          324,478          303,649        +6.9%
  10000          330,775          268,786       +23.1%

The absolute throughput is low because CPUs are deliberately
down-clocked for measurement stability (CV < 2%). The relative
improvement percentages are the meaningful comparison.

At the production use-case of ~700 addresses, rhltable provides a
+3.7% throughput improvement, growing to +23% at 10,000 addresses as
hlist O(N/256) chain walks dominate.

Summary
-------

Patches 1-2 (CONFIG option) are trivial, safe, and solve the immediate
problem for operators who can rebuild their kernel. Patches 3-4
(rhltable) are a better long-term solution that works automatically for
all workloads, at the cost of more complex code and a harder review.

If the rhltable approach is acceptable, patches 1-2 become unnecessary
(the rhltable supersedes the fixed hash table entirely). They are
included here to give maintainers the choice.

Future work: a similar rhltable conversion for IPv6 inet6_addr_lst.

Cc: Ivan Babrou <ivan@cloudflare.com>
Cc: kernel-team@cloudflare.com

Jesper Dangaard Brouer (4):
  ipv4: make inet_addr_lst hash table size configurable
  ipv6: make inet6_addr_lst hash table size configurable
  ipv4: convert inet_addr_lst to rhltable for dynamic resizing
  selftests: net: add IPv4 address lookup stress test

 include/linux/inetdevice.h                    |   3 +-
 include/net/ip.h                              |   5 -
 include/net/netns/ipv4.h                      |   4 +-
 net/ipv4/devinet.c                            | 149 ++--
 net/ipv6/Kconfig                              |  15 +
 net/ipv6/addrconf.c                           |   2 +-
 tools/testing/selftests/net/Makefile          |   3 +
 .../selftests/net/ipv4_addr_lookup_test.sh    | 804 ++++++++++++++++++
 .../net/ipv4_addr_lookup_test_virtme.sh       | 282 ++++++
 .../selftests/net/ipv4_addr_lookup_trace.bt   | 178 ++++
 .../net/ipv4_addr_lookup_udp_sender.c         | 401 +++++++++
 11 files changed, 1772 insertions(+), 74 deletions(-)
 create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test.sh
 create mode 100755 tools/testing/selftests/net/ipv4_addr_lookup_test_virtme.sh
 create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_trace.bt
 create mode 100644 tools/testing/selftests/net/ipv4_addr_lookup_udp_sender.c

-- 
2.43.0


