From: Neil Spring <ntspring@meta.com>
To: netdev@vger.kernel.org
Cc: edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
	davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
	pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
	linux-kselftest@vger.kernel.org, ntspring@meta.com,
	bpf@vger.kernel.org, martin.lau@linux.dev, daniel@iogearbox.net
Subject: [PATCH net-next v13 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
Date: Thu, 11 Jun 2026 18:00:45 -0700
Message-ID: <20260612010047.1377331-1-ntspring@meta.com>
Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
PLB, and spurious-retransmission events, but the new hash is not
propagated into IPv6 ECMP path selection. The cached route is
reused and fib6_select_path() is never re-invoked, so the
connection stays on the same local ECMP path after the rehash.
This series adds the two missing pieces:
1. __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst
   is invalidated and the next transmit triggers a fresh route lookup.
2. fl6->mp_hash set from sk_txhash before each route lookup so
   fib6_select_path() picks a path from the (potentially re-rolled) hash.
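
As a rough illustration of why both pieces are needed, here is a small
userspace model. It is only a sketch: every name in it (model_sock,
model_select_path(), model_xmit() and the two rehash helpers) is made up
for this example, and the equal-weight modulo choice is a simplification
standing in for fib6_select_path(), not the kernel's actual selection
logic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_sock {
	uint32_t txhash;	/* stands in for sk->sk_txhash */
	int cached_path;	/* stands in for the cached dst; -1 means none */
};

/* Simplified stand-in for fib6_select_path(): equal-weight modulo choice. */
static int model_select_path(uint32_t mp_hash, int n_paths)
{
	assert(n_paths > 0);
	return (int)(mp_hash % (uint32_t)n_paths);
}

/* Transmit: reuse the cached path if there is one, else do a lookup. */
static int model_xmit(struct model_sock *sk, int n_paths)
{
	if (sk->cached_path < 0)
		sk->cached_path = model_select_path(sk->txhash >> 1, n_paths);
	return sk->cached_path;
}

/* Rehash without the series: new hash, cached path left in place. */
static void model_rehash_only(struct model_sock *sk, uint32_t new_hash)
{
	sk->txhash = new_hash;
}

/* Rehash with the series: new hash and the cached path is dropped. */
static void model_rehash_reset(struct model_sock *sk, uint32_t new_hash)
{
	sk->txhash = new_hash;
	sk->cached_path = -1;
}
```

In this model, re-rolling the hash alone leaves model_xmit() returning the
old path, because the lookup is skipped while a cached path exists. Only
the reset variant lets the new hash reach the selection step.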
The override applies only to fib_multipath_hash_policy 0 (the default L3
policy). Its hash includes the flow label, but that is 0 by default
(np->flow_label is unset; auto_flowlabels computes the on-wire label
later, per packet), so flows to the same peer share one local path.
Keying it on sk_txhash makes that local path per-connection and lets a
rehash re-select it; even when a flow label is present (reflected REPFLOW
or explicitly set) only local path selection changes -- the on-wire flow
label is unaffected. Policies 1-3 are left unchanged.
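
The gating described above can be sketched in userspace as follows. This
is an assumption-laden model rather than the patch itself: model_mp_hash()
is a hypothetical name, and the TCP check mirrors the guard that the
changelog below describes for inet6_csk_route_socket(). The shift and the
non-zero fallback follow the "(txhash >> 1) ?: 1" expression mentioned in
the v6 changelog, written out without the GNU "?:" extension.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper modelling the value placed in fl6->mp_hash.
 * A return value of 0 means "leave mp_hash unset", which per the
 * changelog falls back to rt6_multipath_hash().
 */
static uint32_t model_mp_hash(int hash_policy, int is_tcp, uint32_t txhash)
{
	uint32_t h;

	/* fib_multipath_hash_policy takes the values 0 through 3. */
	assert(hash_policy >= 0 && hash_policy <= 3);

	/* Policies 1-3 and non-TCP sockets keep flow-key-based hashing. */
	if (hash_policy != 0 || !is_tcp)
		return 0;

	/* Keep the hash in the 31-bit mp_hash range and never return 0. */
	h = txhash >> 1;
	return h ? h : 1;
}
```

For example, a txhash of 0 or 1 shifts down to 0 and is bumped to 1, so a
policy 0 TCP socket always supplies an explicit hash in this model.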
Patch 1 is the kernel change; patch 2 adds selftests covering SYN
rehash, SYN/ACK rehash, midstream RTO rehash, midstream ACK rehash
(spurious retransmission), PLB rehash, a policy 1 negative test,
a flowlabel leak regression test, two dst rebuild consistency
tests (normal and syncookie) verifying that natural route
invalidation does not cause unintended path changes, and a
syncookie server path consistency test verifying that the SYN-ACK
and post-cookie ACKs use the same ECMP nexthop.
Changes since v12: https://lore.kernel.org/netdev/20260604212246.265079-1-ntspring@meta.com/
Patch 1:
- Factor the repeated policy-0 IPv6 mp_hash assignment into a shared
ip6_ecmp_set_mp_hash() helper (Paolo Abeni)
- Replace the open-coded txhash reroll + dst reset at the three rehash
sites with a __sk_rethink_txhash_reset_dst() helper, kept separate
from sk_rethink_txhash() so dst_negative_advice()'s dst op still runs
(Paolo Abeni)
Patch 2:
- Check the first rehash attempt's exit status directly instead of via
$? (shellcheck SC2181), and drop the redundant fail_reason capture on
the tolerated first attempt (Paolo Abeni)
- Redirect the remaining slowwait stdout to /dev/null so loopy_wait's
counter output cannot leak into the captured failure message
(Sashiko AI review)
Changes since v11: https://lore.kernel.org/netdev/20260602181428.2318919-1-ntspring@meta.com/
Patch 1:
- Fix the IPv6-only rule to exclude IPv4-mapped connections: key the
cookie txhash on skb->protocol, not sk->sk_family (Sashiko AI review)
- Set fl6->mp_hash in tcp_v6_send_response() so RSTs and time-wait
ACKs use the connection's ECMP path (Sashiko AI review)
- Remove the bpf_sk_assign_tcp_reqsk() txhash init added in v7; it is
redundant, as cookie_tcp_reqsk_init() always sets txhash before the
request socket is routed (verified by poisoning txhash and running
the tcp_custom_syncookie BPF selftest: the route lookup never saw
the poison)
- Document that policy 0 IPv6 TCP ECMP selection follows txhash over a
reflected/explicit flow label (on-wire flow label unchanged)
Patch 2:
- Drain TCP teardown between rounds so late FIN/RST packets do not
pollute the next round's tc filter counters (bot+bpf-ci)
- Skip the syncookie tests when CONFIG_SYN_COOKIES is unavailable;
select it in selftests/net/config
Changes since v10: https://lore.kernel.org/netdev/20260529160136.1010064-1-ntspring@meta.com/
Patch 1:
- Fix build without CONFIG_SYN_COOKIES
- Leave IPv4 syncookie txhash unmodified (`net_tx_rndhash()`)
- Document the IPv6 TCP policy 0 behavior change in ip-sysctl.rst
Patch 2:
- Correct runtime estimate from ~15s to ~60s
- Build DCTCP as `=y` instead of `=m` to avoid module load races
- Fix false failure of the midstream ACK test by limiting the send
buffer to avoid a closed receive window; window probes do not
cause rehash
Changes since v9: https://lore.kernel.org/netdev/20260526203403.3517607-1-ntspring@meta.com/
Patch 1:
- Split cookie_init_sequence() into pure computation and a new
cookie_record_sent() helper for the side effects; call
cookie_record_sent() after route_req() succeeds so the overflow
timestamp and SYNCOOKIESSENT counter are not bumped when no
SYN-ACK is sent
Patch 2:
- Make midstream ACK rehash test more reliable by blocking the unused
path first
- Fix port overlap when ECMP_REBUILD_ROUNDS exceeds the default
Changes since v8: https://lore.kernel.org/netdev/20260522215733.929238-1-ntspring@meta.com/
Patch 1:
- Fix REPFLOW flowlabel reflection for syncookie SYN-ACKs: pass 0 as
tw_isn to route_req() so tcp_v6_init_req() saves ireq->pktopts
Patch 2:
- Give midstream and ACK rehash attempt helpers distinct failure
messages (no TX activity vs no data on alternate path vs counter
not incrementing) instead of a single generic error
- Drop unused ns_server parameter from ecmp_dst_rebuild_check()
- Clean up server socat before break on setup failure in the dst
rebuild loop
Changes since v7: https://lore.kernel.org/netdev/20260520064310.4154268-1-ntspring@meta.com/
Patch 1:
- Remove #if IS_ENABLED(CONFIG_IPV6) guards around __sk_dst_reset()
in tcp_plb.c and tcp_timer.c (Eric Dumazet)
- Guard mp_hash in inet6_csk_route_socket() on sk_protocol == IPPROTO_TCP
instead of txhash != 0, since non-TCP callers like L2TP set sk_txhash
in __ip6_datagram_connect() and should retain flow-key-based ECMP
- Use the syncookie (ISN) as txhash for both the SYN-ACK route lookup
and cookie_v6_check() socket creation, so the server's ECMP selection is
consistent across the stateless SYN-ACK and the subsequent full socket.
Move cookie_init_sequence() before route_req() in tcp_conn_request()
so the SYN-ACK dst is computed with the cookie-derived txhash; derive
txhash from snt_isn in cookie_tcp_reqsk_init() to match
Patch 2:
- Invalidate dst via dummy route add/del instead of route replace to
avoid a transient single-nexthop state during multipath replacement
- Add syncookie server path consistency test verifying the SYN-ACK and
post-cookie ACKs use the same ECMP path
- Strengthen policy 1 negative test to wait for multiple rehash attempts
and verify SYNs landed on exactly one interface
Changes since v6: https://lore.kernel.org/netdev/20260517174522.2232057-1-ntspring@meta.com/
- Guard mp_hash assignment so that non-TCP callers of
inet6_csk_route_socket() fall through to rt6_multipath_hash()
(superseded in v8 by sk_protocol == IPPROTO_TCP guard)
- Initialize txhash in bpf_sk_assign_tcp_reqsk() to avoid reading
uninitialized slab memory in inet6_csk_route_req() (reverted in v12
as redundant)
- Check post-rebuild busywait return status to avoid silent false pass
Changes since v5: https://lore.kernel.org/netdev/20260513204048.2721843-1-ntspring@meta.com/
- Improve selftest reliability: suppress __dst_negative_advice() via
tcp_retries1=255 in dst rebuild tests so a real RTO cannot trigger
an unintended rehash; add internal retry to midstream and ACK
rehash tests to tolerate probabilistic ECMP path selection; fix
midstream baseline capture to account for packets that bypass tc
filters during the prio qdisc's TCQ_F_CAN_BYPASS window
- Increase ECMP_REBUILD_ROUNDS default to 10 for reliable regression
detection with 2-way ECMP; replace sleep with busywait
- Use tcp_allowed_congestion_control instead of changing the host's
default congestion control for PLB test
- Use (txhash >> 1) ?: 1 to guarantee non-zero mp_hash, since zero
falls back to rt6_multipath_hash()
Changes since v4: https://lore.kernel.org/netdev/20260507171319.1259115-1-ntspring@meta.com/
- Condition fl6->mp_hash on fib_multipath_hash_policy == 0 to preserve
deterministic hash policies 1-3 (e.g., symmetric 5-tuple for policy 1)
- Set fl6->mp_hash in tcp_v6_connect() and cookie_v6_check() for
initial route lookup consistency; move sk_set_txhash() earlier
(Jakub Kicinski)
- Add policy 1 negative test; improve sysctl save/restore
- Add flowlabel leak test confirming mp_hash does not alter the
on-wire IPv6 flow label
- Add dst rebuild consistency tests (normal and syncookie) verifying
that route table changes do not cause unintended ECMP path changes
Changes since v3: https://lore.kernel.org/netdev/20260505193824.2791642-1-ntspring@meta.com/
- Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
is held in all three call sites (Eric Dumazet)
- Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4 ECMP
does not use sk_txhash for path selection
- Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return value
of sk_rethink_txhash()
- Move tcp_rsk(req)->txhash initialization before route_req() in
tcp_conn_request() to avoid reading uninitialized memory
- Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
- Skip PLB test gracefully if DCTCP is not available
- Save and restore original congestion control algorithm in PLB test
- Default get_netstat_counter() to 0 when counter is not found
- Skip all tests if tcp_syn_linear_timeouts is not available
- Replace bash/pipe data sources with socat OPEN:/dev/zero for
cleaner process cleanup
- Fix shellcheck warnings
Changes since v2: https://lore.kernel.org/netdev/20260408070514.1840227-1-ntspring@meta.com/
- Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
(Neal Cardwell)
- Add fl6->mp_hash propagation in inet6_sk_rebuild_header() (af_inet6.c),
covering the dst rebuild path used on established sockets
- Remove incorrect ir_iif update from tcp_check_req() in tcp_minisocks.c;
the SYN/ACK rehash is already handled by tcp_rtx_synack() re-rolling
txhash which feeds into inet6_csk_route_req()'s mp_hash
(Eric Dumazet)
- Add ACK rehash and PLB rehash selftests
- Improve selftest reliability
Changes since v1: https://lore.kernel.org/netdev/20260408002802.2448424-1-ntspring@meta.com/
- Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
for ECMP path selection in inet6_csk_route_req(), making the request
socket path consistent with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
better rehash coverage
Neil Spring (2):
tcp: rehash onto different local ECMP path on retransmit timeout
selftests: net: add local ECMP rehash test
Documentation/networking/ip-sysctl.rst | 6 +-
include/net/ipv6.h | 11 +
include/net/sock.h | 14 +
include/net/tcp.h | 20 +-
net/ipv4/syncookies.c | 11 +-
net/ipv4/tcp_input.c | 18 +-
net/ipv4/tcp_plb.c | 2 +-
net/ipv4/tcp_timer.c | 2 +-
net/ipv6/af_inet6.c | 2 +
net/ipv6/inet6_connection_sock.c | 5 +
net/ipv6/syncookies.c | 2 +
net/ipv6/tcp_ipv6.c | 19 +-
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/config | 2 +
tools/testing/selftests/net/ecmp_rehash.sh | 1112 ++++++++++++++++++++
15 files changed, 1211 insertions(+), 16 deletions(-)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
--
2.53.0-Meta