From: Neil Spring <ntspring@meta.com>
To: netdev@vger.kernel.org
Cc: edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
 davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
 pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
 linux-kselftest@vger.kernel.org, ntspring@meta.com,
 bpf@vger.kernel.org, martin.lau@linux.dev, daniel@iogearbox.net
Subject: [PATCH net-next v11 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
Date: Tue, 2 Jun 2026 11:14:26 -0700
Message-ID: <20260602181428.2318919-1-ntspring@meta.com>
Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
PLB, and spurious-retransmission events, but the new hash is not
propagated into the IPv6 ECMP path selection logic. The cached
route is reused and fib6_select_path() is never re-invoked, so
the connection stays on the same local ECMP path.
This series adds the two missing pieces:
1. Call __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst
   is invalidated and the next transmit triggers a fresh route lookup.
2. Set fl6->mp_hash from sk_txhash before each route lookup so
   fib6_select_path() picks a path based on the (potentially re-rolled)
   hash. This only applies to fib_multipath_hash_policy 0 (the
   default L3 policy). Policy 0 is already asymmetric for TCP: with
   auto_flowlabels enabled (the default), each socket gets a per-socket
   flowlabel, so the hash is already per-connection and unidirectional.
   Policies 1-3 exist for operators who need deterministic symmetric
   hashing (e.g., for stateful middleboxes) and are left unchanged.
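As a rough illustration of why both pieces are needed, here is a small
user-space model. It is only a sketch: every toy_* name is hypothetical,
the modulo path choice stands in for the real nexthop selection, and the
hash is passed through unmodified for brevity. It shows that re-rolling
the hash alone changes nothing while the cached route is still in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct toy_sock {
	uint32_t txhash;       /* models sk->sk_txhash */
	bool     dst_cached;   /* models a valid cached dst */
	unsigned cached_path;  /* nexthop index chosen at the last lookup */
};

/* Models a route lookup (assumption: plain modulo, not the real selection). */
static unsigned toy_select_path(uint32_t hash, unsigned n_paths)
{
	return hash % n_paths;
}

/* Models transmit: the path is only re-selected when no dst is cached. */
static unsigned toy_xmit(struct toy_sock *sk, unsigned n_paths)
{
	if (!sk->dst_cached) {
		sk->cached_path = toy_select_path(sk->txhash, n_paths);
		sk->dst_cached = true;
	}
	return sk->cached_path;
}

/* Models an RTO: re-roll the hash and, optionally, drop the cached dst. */
static void toy_rto(struct toy_sock *sk, uint32_t new_txhash, bool reset_dst)
{
	sk->txhash = new_txhash;
	if (reset_dst)
		sk->dst_cached = false;
}
```

With two paths, calling toy_rto() with reset_dst == false leaves
toy_xmit() on the old path even though txhash changed; with
reset_dst == true the next toy_xmit() performs a new selection.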
Patch 1 is the kernel change; patch 2 adds selftests covering SYN
rehash, SYN/ACK rehash, midstream RTO rehash, midstream ACK rehash
(spurious retransmission), PLB rehash, a policy 1 negative test,
a flowlabel leak regression test, two dst rebuild consistency
tests (normal and syncookie) verifying that natural route
invalidation does not cause unintended path changes, and a
syncookie server path consistency test verifying that the SYN-ACK
and post-cookie ACKs use the same ECMP nexthop.
Changes since v10: https://lore.kernel.org/netdev/20260529160136.1010064-1-ntspring@meta.com/
Patch 1:
- Fix build without CONFIG_SYN_COOKIES
- Leave IPv4 syncookie txhash unmodified (`net_tx_rndhash()`)
- Document the IPv6 TCP policy 0 behavior change in ip-sysctl.rst
Patch 2:
- Correct runtime estimate from ~15s to ~60s
- Build DCTCP as `=y` instead of `=m` to avoid module load races
- Fix false failure of the midstream ACK test by limiting the send
buffer to avoid a closed receive window; window probes do not
cause rehash
Changes since v9: https://lore.kernel.org/netdev/20260526203403.3517607-1-ntspring@meta.com/
Patch 1:
- Split cookie_init_sequence() into pure computation and a new
cookie_record_sent() helper for the side effects; call
cookie_record_sent() after route_req() succeeds so the overflow
timestamp and SYNCOOKIESSENT counter are not bumped when no
SYN-ACK is sent
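The split described above can be pictured with a user-space sketch.
All names, fields, and the cookie mixing function are hypothetical
placeholders, not the kernel's; the point is only the ordering: compute
first, record side effects only once the route lookup has succeeded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; names and fields are illustrative only. */
struct toy_cookie_stats {
	unsigned long cookies_sent;   /* models the SYNCOOKIESSENT counter */
	unsigned long last_overflow;  /* models the overflow timestamp */
};

/* Pure step: compute a value, touch no shared state. */
static uint32_t toy_cookie_compute(uint32_t secret, uint32_t flow_id)
{
	/* Placeholder mix; this is NOT how a real syncookie is built. */
	return (secret ^ flow_id) * 2654435761u;
}

/* Side effects, applied only when a SYN-ACK will actually go out. */
static void toy_cookie_record_sent(struct toy_cookie_stats *st,
				   unsigned long now)
{
	st->cookies_sent++;
	st->last_overflow = now;
}

/* Returns true if a SYN-ACK would be sent; *isn is written only then. */
static bool toy_handle_syn(struct toy_cookie_stats *st, uint32_t secret,
			   uint32_t flow_id, bool route_ok,
			   unsigned long now, uint32_t *isn)
{
	/* Computed before routing so the lookup could already use it. */
	uint32_t cookie = toy_cookie_compute(secret, flow_id);

	if (!route_ok)
		return false;   /* no SYN-ACK, so nothing is recorded */

	toy_cookie_record_sent(st, now);
	*isn = cookie;
	return true;
}
```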
Patch 2:
- Make midstream ACK rehash test more reliable by blocking the unused
path first
- Fix port overlap when ECMP_REBUILD_ROUNDS exceeds the default
Changes since v8: https://lore.kernel.org/netdev/20260522215733.929238-1-ntspring@meta.com/
Patch 1:
- Fix REPFLOW flowlabel reflection for syncookie SYN-ACKs: pass 0 as
tw_isn to route_req() so tcp_v6_init_req() saves ireq->pktopts
Patch 2:
- Give midstream and ACK rehash attempt helpers distinct failure
messages (no TX activity vs no data on alternate path vs counter
not incrementing) instead of a single generic error
- Drop unused ns_server parameter from ecmp_dst_rebuild_check()
- Clean up server socat before break on setup failure in the dst
rebuild loop
Changes since v7: https://lore.kernel.org/netdev/20260520064310.4154268-1-ntspring@meta.com/
Patch 1:
- Remove #if IS_ENABLED(CONFIG_IPV6) guards around __sk_dst_reset()
in tcp_plb.c and tcp_timer.c (Eric Dumazet)
- Guard mp_hash in inet6_csk_route_socket() on sk_protocol == IPPROTO_TCP
instead of txhash != 0, since non-TCP callers like L2TP set sk_txhash
in __ip6_datagram_connect() and should retain flow-key-based ECMP
- Use the syncookie (ISN) as txhash for both the SYN-ACK route lookup
and cookie_v6_check() socket creation, so the server's ECMP selection is
consistent across the stateless SYN-ACK and the subsequent full socket.
Move cookie_init_sequence() before route_req() in tcp_conn_request()
so the SYN-ACK dst is computed with the cookie-derived txhash; derive
txhash from snt_isn in cookie_tcp_reqsk_init() to match
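To illustrate the consistency argument, here is a user-space sketch
(toy_* names are hypothetical, and the modulo path choice is a
simplification). In syncookie mode no request state survives between
the SYN-ACK and the client's ACK, but the ACK number is the cookie plus
one, so a hash derived from the ISN can be reproduced at both points.

```c
#include <stdint.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */

/* Simplified path choice: 31-bit non-zero hash, then modulo. */
static unsigned toy_path(uint32_t txhash, unsigned n_paths)
{
	uint32_t h = txhash >> 1;

	if (!h)
		h = 1;
	return h % n_paths;
}

/* SYN-ACK time: the cookie ISN has just been generated. */
static unsigned toy_synack_path(uint32_t cookie_isn, unsigned n_paths)
{
	return toy_path(cookie_isn, n_paths);
}

/*
 * ACK time: the client acknowledges cookie + 1, so subtracting one
 * (modulo 2^32) recovers the same value used for the SYN-ACK.
 */
static unsigned toy_cookie_ack_path(uint32_t ack_seq, unsigned n_paths)
{
	return toy_path(ack_seq - 1u, n_paths);
}
```

A randomly chosen hash at SYN-ACK time could not be recovered this way,
which is why the two lookups could otherwise disagree.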
Patch 2:
- Invalidate dst via dummy route add/del instead of route replace to
avoid a transient single-nexthop state during multipath replacement
- Add syncookie server path consistency test verifying the SYN-ACK and
post-cookie ACKs use the same ECMP path
- Strengthen policy 1 negative test to wait for multiple rehash attempts
and verify SYNs landed on exactly one interface
Changes since v6: https://lore.kernel.org/netdev/20260517174522.2232057-1-ntspring@meta.com/
- Guard mp_hash assignment so that non-TCP callers of
inet6_csk_route_socket() fall through to rt6_multipath_hash()
(superseded in v8 by sk_protocol == IPPROTO_TCP guard)
- Initialize txhash in bpf_sk_assign_tcp_reqsk() to avoid reading
uninitialized slab memory in inet6_csk_route_req()
- Check post-rebuild busywait return status to avoid silent false pass
Changes since v5: https://lore.kernel.org/netdev/20260513204048.2721843-1-ntspring@meta.com/
- Improve selftest reliability: suppress __dst_negative_advice() via
tcp_retries1=255 in dst rebuild tests so a real RTO cannot trigger
an unintended rehash; add internal retry to midstream and ACK
rehash tests to tolerate probabilistic ECMP path selection; fix
midstream baseline capture to account for packets that bypass tc
filters during the prio qdisc's TCQ_F_CAN_BYPASS window
- Increase ECMP_REBUILD_ROUNDS default to 10 for reliable regression
detection with 2-way ECMP; replace sleep with busywait
- Use tcp_allowed_congestion_control instead of changing the host's
default congestion control for PLB test
- Use (txhash >> 1) ?: 1 to guarantee non-zero mp_hash, since zero
falls back to rt6_multipath_hash()
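For reference, the non-zero guarantee from the last item can be written
in standard C as below. The helper name is hypothetical; the expression
mirrors "(txhash >> 1) ?: 1", where "?:" with an omitted middle operand
is a GNU C extension.

```c
#include <stdint.h>

/* Hypothetical helper name, for illustration only. */
static uint32_t mp_hash_from_txhash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;   /* keep the upper 31 bits */

	return h ? h : 1;           /* zero would mean "no hash supplied" */
}
```

Inputs 0 and 1 both shift down to zero and are mapped to 1; every other
input keeps its shifted value, so the result always fits in 31 bits.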
Changes since v4: https://lore.kernel.org/netdev/20260507171319.1259115-1-ntspring@meta.com/
- Condition fl6->mp_hash on fib_multipath_hash_policy == 0 to preserve
deterministic hash policies 1-3 (e.g., symmetric 5-tuple for policy 1)
- Set fl6->mp_hash in tcp_v6_connect() and cookie_v6_check() for
initial route lookup consistency; move sk_set_txhash() earlier
(Jakub Kicinski)
- Add policy 1 negative test; improve sysctl save/restore
- Add flowlabel leak test confirming mp_hash does not alter the
on-wire IPv6 flow label
- Add dst rebuild consistency tests (normal and syncookie) verifying
that route table changes do not cause unintended ECMP path changes
Changes since v3: https://lore.kernel.org/netdev/20260505193824.2791642-1-ntspring@meta.com/
- Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
is held in all three call sites (Eric Dumazet)
- Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4 ECMP
does not use sk_txhash for path selection
- Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return value
of sk_rethink_txhash()
- Move tcp_rsk(req)->txhash initialization before route_req() in
tcp_conn_request() to avoid reading uninitialized memory
- Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
- Skip PLB test gracefully if DCTCP is not available
- Save and restore original congestion control algorithm in PLB test
- Default get_netstat_counter() to 0 when counter is not found
- Skip all tests if tcp_syn_linear_timeouts is not available
- Replace bash/pipe data sources with socat OPEN:/dev/zero for
cleaner process cleanup
- Fix shellcheck warnings
Changes since v2: https://lore.kernel.org/netdev/20260408070514.1840227-1-ntspring@meta.com/
- Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
(Neal Cardwell)
- Add fl6->mp_hash propagation in inet6_sk_rebuild_header() (af_inet6.c),
covering the dst rebuild path used on established sockets
- Remove incorrect ir_iif update from tcp_check_req() in tcp_minisocks.c;
the SYN/ACK rehash is already handled by tcp_rtx_synack() re-rolling
txhash which feeds into inet6_csk_route_req()'s mp_hash
(Eric Dumazet)
- Add ACK rehash and PLB rehash selftests
- Improve selftest reliability
Changes since v1: https://lore.kernel.org/netdev/20260408002802.2448424-1-ntspring@meta.com/
- Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
for ECMP path selection in inet6_csk_route_req(), making the request
socket path consistent with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
better rehash coverage
Neil Spring (2):
  tcp: rehash onto different local ECMP path on retransmit timeout
  selftests: net: add local ECMP rehash test

 Documentation/networking/ip-sysctl.rst     |    5 +-
 include/net/tcp.h                          |   20 +-
 net/core/filter.c                          |    1 +
 net/ipv4/syncookies.c                      |   11 +-
 net/ipv4/tcp_input.c                       |   15 +-
 net/ipv4/tcp_plb.c                         |    5 +-
 net/ipv4/tcp_timer.c                       |    2 +
 net/ipv6/af_inet6.c                        |    3 +
 net/ipv6/inet6_connection_sock.c           |    8 +
 net/ipv6/syncookies.c                      |    4 +
 net/ipv6/tcp_ipv6.c                        |   13 +-
 tools/testing/selftests/net/Makefile       |    1 +
 tools/testing/selftests/net/config         |    1 +
 tools/testing/selftests/net/ecmp_rehash.sh | 1066 ++++++++++++++++++++
 14 files changed, 1140 insertions(+), 15 deletions(-)
 create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh

base-commit: c1e5127b577c6b88fa48e532616932ae978528d5
--
2.53.0-Meta