From: Neil Spring <ntspring@meta.com>
To: netdev@vger.kernel.org
Cc: edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
linux-kselftest@vger.kernel.org, ntspring@meta.com,
bpf@vger.kernel.org, martin.lau@linux.dev, daniel@iogearbox.net
Subject: [PATCH net-next v8 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
Date: Fri, 22 May 2026 14:57:31 -0700
Message-ID: <20260522215733.929238-1-ntspring@meta.com>

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
PLB, and spurious-retransmission events, but the new hash is not
propagated into the IPv6 ECMP path selection logic. The cached
route is reused and fib6_select_path() is never re-invoked, so
the connection keeps using the same local ECMP decision.
This series adds the two missing pieces:
1. __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst
is invalidated and the next transmit triggers a fresh route lookup.
2. fl6->mp_hash set from sk_txhash before each route lookup so
fib6_select_path() picks a path based on the (potentially re-rolled)
hash. This is conditioned on fib_multipath_hash_policy == 0 (L3)
because policies 1-3 compute a deterministic hash from the flow
keys which must not be overridden.
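For readers who want the interaction of the two pieces spelled out, here
is a minimal userspace sketch. It is not the patch code: every model_*
name is hypothetical, the cached dst is reduced to an integer path index,
and path selection is simplified to a modulo, whereas the real
fib6_select_path() compares the hash against per-nexthop upper bounds.
The only details taken from this series are the (txhash >> 1) ?: 1
folding and the policy == 0 condition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a socket; not a kernel structure. */
struct model_sock {
	uint32_t txhash;	/* stands in for sk->sk_txhash */
	int cached_path;	/* stands in for the cached dst; -1 = none */
};

/* Fold txhash into a non-zero 31-bit value. In this model, as in the
 * series, zero means "no hash supplied, use the flow-key hash".
 */
static uint32_t model_mp_hash(uint32_t txhash)
{
	uint32_t h = txhash >> 1;

	return h ? h : 1;
}

/* Piece 2: under policy 0 the path follows txhash; under any other
 * policy the deterministic flow-key hash is left alone.
 * Assumption: modulo selection, purely for illustration.
 */
static int model_select_path(const struct model_sock *sk, int policy,
			     uint32_t flow_hash, int nr_paths)
{
	uint32_t h;

	assert(nr_paths > 0);
	h = (policy == 0) ? model_mp_hash(sk->txhash) : flow_hash;
	return (int)(h % (uint32_t)nr_paths);
}

/* A transmit only performs a lookup when no path is cached. */
static int model_xmit(struct model_sock *sk, int policy,
		      uint32_t flow_hash, int nr_paths)
{
	if (sk->cached_path < 0)
		sk->cached_path = model_select_path(sk, policy,
						    flow_hash, nr_paths);
	return sk->cached_path;
}

/* Piece 1: re-rolling the hash also drops the cached path, so the next
 * transmit performs a fresh lookup.
 */
static void model_rethink(struct model_sock *sk, uint32_t new_txhash)
{
	sk->txhash = new_txhash;
	sk->cached_path = -1;
}
```

In this model, changing txhash without clearing cached_path leaves
model_xmit() on the old path, which mirrors the behaviour described
above; only model_rethink() followed by a transmit can move the flow.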
Patch 1 is the kernel change; patch 2 adds selftests covering SYN
rehash, SYN/ACK rehash, midstream RTO rehash, midstream ACK rehash
(spurious retransmission), PLB rehash, a policy 1 negative test,
a flowlabel leak regression test, two dst rebuild consistency
tests (normal and syncookie) verifying that natural route
invalidation does not cause unintended path changes, and a
syncookie server path consistency test verifying that the SYN-ACK
and post-cookie ACKs use the same ECMP nexthop.
I'd like guidance on whether to use the ISN as txhash when using
syncookies; it keeps the SYN-ACK and subsequent data path consistent,
but one could argue that this consistency doesn't matter because no
reordering is possible.
Changes since v7: https://lore.kernel.org/netdev/20260520064310.4154268-1-ntspring@meta.com/
Patch 1:
- Remove #if IS_ENABLED(CONFIG_IPV6) guards around __sk_dst_reset()
in tcp_plb.c and tcp_timer.c (Eric Dumazet)
- Guard mp_hash in inet6_csk_route_socket() on sk_protocol == IPPROTO_TCP
instead of txhash != 0, since non-TCP callers like L2TP set sk_txhash
in __ip6_datagram_connect() and should retain flow-key-based ECMP
- Use the syncookie (ISN) as txhash for both the SYN-ACK route lookup
and cookie_v6_check() socket creation, so the server's ECMP selection is
consistent across the stateless SYN-ACK and the subsequent full socket.
Move cookie_init_sequence() before route_req() in tcp_conn_request()
so the SYN-ACK dst is computed with the cookie-derived txhash; derive
txhash from snt_isn in cookie_tcp_reqsk_init() to match
Patch 2:
- Invalidate dst via dummy route add/del instead of route replace to
avoid a transient single-nexthop state during multipath replacement
- Add syncookie server path consistency test verifying the SYN-ACK and
post-cookie ACKs use the same ECMP path
- Strengthen policy 1 negative test to wait for multiple rehash attempts
and verify SYNs landed on exactly one interface
Changes since v6: https://lore.kernel.org/netdev/20260517174522.2232057-1-ntspring@meta.com/
- Guard mp_hash assignment so that non-TCP callers of
inet6_csk_route_socket() fall through to rt6_multipath_hash()
(superseded in v8 by sk_protocol == IPPROTO_TCP guard)
- Initialize txhash in bpf_sk_assign_tcp_reqsk() to avoid reading
uninitialized slab memory in inet6_csk_route_req()
- Check post-rebuild busywait return status to avoid silent false pass
Changes since v5: https://lore.kernel.org/netdev/20260513204048.2721843-1-ntspring@meta.com/
- Improve selftest reliability: suppress __dst_negative_advice() via
tcp_retries1=255 in dst rebuild tests so a real RTO cannot trigger
an unintended rehash; add internal retry to midstream and ACK
rehash tests to tolerate probabilistic ECMP path selection; fix
midstream baseline capture to account for packets that bypass tc
filters during the prio qdisc's TCQ_F_CAN_BYPASS window
- Increase ECMP_REBUILD_ROUNDS default to 10 for reliable regression
detection with 2-way ECMP; replace sleep with busywait
- Use tcp_allowed_congestion_control instead of changing the host's
default congestion control for PLB test
- Use (txhash >> 1) ?: 1 to guarantee non-zero mp_hash, since zero
falls back to rt6_multipath_hash()
Changes since v4: https://lore.kernel.org/netdev/20260507171319.1259115-1-ntspring@meta.com/
- Condition fl6->mp_hash on fib_multipath_hash_policy == 0 to preserve
deterministic hash policies 1-3 (e.g., symmetric 5-tuple for policy 1)
- Set fl6->mp_hash in tcp_v6_connect() and cookie_v6_check() for
initial route lookup consistency; move sk_set_txhash() earlier
(Jakub Kicinski)
- Add policy 1 negative test; improve sysctl save/restore
- Add flowlabel leak test confirming mp_hash does not alter the
on-wire IPv6 flow label
- Add dst rebuild consistency tests (normal and syncookie) verifying
that route table changes do not cause unintended ECMP path changes
Changes since v3: https://lore.kernel.org/netdev/20260505193824.2791642-1-ntspring@meta.com/
- Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
is held in all three call sites (Eric Dumazet)
- Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4 ECMP
does not use sk_txhash for path selection
- Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return value
of sk_rethink_txhash()
- Move tcp_rsk(req)->txhash initialization before route_req() in
tcp_conn_request() to avoid reading uninitialized memory
- Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
- Skip PLB test gracefully if DCTCP is not available
- Save and restore original congestion control algorithm in PLB test
- Default get_netstat_counter() to 0 when counter is not found
- Skip all tests if tcp_syn_linear_timeouts is not available
- Replace bash/pipe data sources with socat OPEN:/dev/zero for
cleaner process cleanup
- Fix shellcheck warnings
Changes since v2: https://lore.kernel.org/netdev/20260408070514.1840227-1-ntspring@meta.com/
- Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
(Neal Cardwell)
- Add fl6->mp_hash propagation in inet6_sk_rebuild_header() (af_inet6.c),
covering the dst rebuild path used on established sockets
- Remove incorrect ir_iif update from tcp_check_req() in tcp_minisocks.c;
the SYN/ACK rehash is already handled by tcp_rtx_synack() re-rolling
txhash which feeds into inet6_csk_route_req()'s mp_hash
(Eric Dumazet)
- Add ACK rehash and PLB rehash selftests
- Improve selftest reliability
Changes since v1: https://lore.kernel.org/netdev/20260408002802.2448424-1-ntspring@meta.com/
- Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
for ECMP path selection in inet6_csk_route_req(), making the request
socket path consistent with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
better rehash coverage
Neil Spring (2):
tcp: rehash onto different local ECMP path on retransmit timeout
selftests: net: add local ECMP rehash test
net/core/filter.c | 1 +
net/ipv4/tcp_input.c | 6 +-
net/ipv4/tcp_plb.c | 5 +-
net/ipv4/tcp_timer.c | 2 +
net/ipv6/af_inet6.c | 3 +
net/ipv6/inet6_connection_sock.c | 7 +
net/ipv6/syncookies.c | 4 +
net/ipv6/tcp_ipv6.c | 13 +-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 1 +
tools/testing/selftests/net/ecmp_rehash.sh | 952 +++++++++++++++++++++
11 files changed, 990 insertions(+), 5 deletions(-)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
--
2.53.0-Meta