From: Neil Spring <ntspring@meta.com>
To: netdev@vger.kernel.org
Cc: edumazet@google.com, ncardwell@google.com, kuniyu@google.com,
	davem@davemloft.net, kuba@kernel.org, dsahern@kernel.org,
	pabeni@redhat.com, horms@kernel.org, shuah@kernel.org,
	linux-kselftest@vger.kernel.org, ntspring@meta.com,
	bpf@vger.kernel.org, martin.lau@linux.dev, daniel@iogearbox.net
Subject: [PATCH net-next v7 0/2] tcp: rehash onto different local ECMP path on retransmit timeout
Date: Tue, 19 May 2026 23:43:08 -0700
Message-ID: <20260520064310.4154268-1-ntspring@meta.com>

Currently sk_rethink_txhash() re-rolls the socket's txhash on RTO,
PLB, and spurious-retransmission events, but the new hash is not
propagated into the IPv6 ECMP path selection logic. The cached
route is reused and fib6_select_path() is never re-invoked, so
the connection stays on the same ECMP path.
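
The effect of the cached route can be illustrated with a small user-space
model. This is only a sketch, not kernel code: struct toy_sock, toy_lookup()
and toy_xmit() are hypothetical names, and the toy lookup consumes the hash
directly, which is an assumption made to keep the model short.

```c
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct toy_sock {
	uint32_t txhash;	/* stands in for sk->sk_txhash */
	int cached_path;	/* stands in for the cached dst; -1 means none */
};

/* Stand-in for a route lookup choosing among n equal-weight paths. */
static int toy_lookup(uint32_t hash, int n)
{
	return (int)(hash % (uint32_t)n);
}

/* Transmit: reuse the cached path if there is one, else look one up. */
static int toy_xmit(struct toy_sock *sk, int n)
{
	if (sk->cached_path < 0)
		sk->cached_path = toy_lookup(sk->txhash, n);
	return sk->cached_path;
}
```

In this model, changing txhash alone leaves toy_xmit() returning the old
path; only clearing cached_path makes the next transmit run the lookup again.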

This series adds the two missing pieces:

1. __sk_dst_reset() alongside sk_rethink_txhash() so the cached dst
   is invalidated and the next transmit triggers a fresh route lookup.

2. fl6->mp_hash set from sk_txhash before each route lookup so
   fib6_select_path() picks a path based on the (potentially re-rolled)
   hash. This is conditioned on fib_multipath_hash_policy == 0 (L3)
   because policies 1-3 compute a deterministic hash from the flow
   keys, which must not be overridden.

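
The mp_hash rule in item 2, together with the guards described in the
changelog of this letter, can be sketched as a stand-alone function. This is
a user-space illustration under stated assumptions, not the patch itself:
toy_mp_hash_from_txhash() is a hypothetical name, and a return value of 0
stands for "leave fl6->mp_hash unset" so that the lookup falls back to
rt6_multipath_hash().

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a kernel function. Returns the value that would
 * be placed in fl6->mp_hash, or 0 to mean "leave it unset".
 */
static uint32_t toy_mp_hash_from_txhash(uint32_t txhash, int hash_policy)
{
	uint32_t h;

	/* Policies 1-3 hash the flow keys deterministically: do not override. */
	if (hash_policy != 0)
		return 0;

	/* No txhash available (e.g. non-TCP callers): keep the default hash. */
	if (txhash == 0)
		return 0;

	/* mp_hash covers a 31-bit range, and zero would read as "unset". */
	h = txhash >> 1;
	return h ? h : 1;
}
```

The portable h ? h : 1 above is the standard C spelling of the GNU
(txhash >> 1) ?: 1 form mentioned in the changelog.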
Patch 1 is the kernel change; patch 2 adds selftests covering SYN
rehash, SYN/ACK rehash, midstream RTO rehash, midstream ACK rehash
(spurious retransmission), PLB rehash, a policy 1 negative test,
a flowlabel leak regression test, and two dst rebuild consistency
tests (normal and syncookie) verifying that natural route
invalidation does not cause unintended path changes.

Changes since v6: https://lore.kernel.org/netdev/20260517174522.2232057-1-ntspring@meta.com/
 - Guard mp_hash assignment with txhash != 0 so that non-TCP callers
   of inet6_csk_route_socket() (e.g., L2TP) fall through to the
   default rt6_multipath_hash() instead of forcing mp_hash to 1
 - Initialize txhash in bpf_sk_assign_tcp_reqsk() to avoid reading
   uninitialized slab memory in inet6_csk_route_req()
 - Check post-rebuild busywait return status to avoid silent false pass

Changes since v5: https://lore.kernel.org/netdev/20260513204048.2721843-1-ntspring@meta.com/
 - Improve selftest reliability: suppress __dst_negative_advice() via
   tcp_retries1=255 in dst rebuild tests so a real RTO cannot trigger
   an unintended rehash; add internal retry to midstream and ACK
   rehash tests to tolerate probabilistic ECMP path selection; fix
   midstream baseline capture to account for packets that bypass tc
   filters during the prio qdisc's TCQ_F_CAN_BYPASS window
 - Increase ECMP_REBUILD_ROUNDS default to 10 for reliable regression
   detection with 2-way ECMP; replace sleep with busywait
 - Use tcp_allowed_congestion_control instead of changing the host's
   default congestion control for PLB test
 - Use (txhash >> 1) ?: 1 to guarantee non-zero mp_hash, since zero
   falls back to rt6_multipath_hash()

Changes since v4: https://lore.kernel.org/netdev/20260507171319.1259115-1-ntspring@meta.com/
 - Condition fl6->mp_hash on fib_multipath_hash_policy == 0 to preserve
   deterministic hash policies 1-3 (e.g., symmetric 5-tuple for policy 1)
 - Set fl6->mp_hash in tcp_v6_connect() and cookie_v6_check() for
   initial route lookup consistency; move sk_set_txhash() earlier
   (Jakub Kicinski)
 - Add policy 1 negative test; improve sysctl save/restore
 - Add flowlabel leak test confirming mp_hash does not alter the
   on-wire IPv6 flow label
 - Add dst rebuild consistency tests (normal and syncookie) verifying
   that route table changes do not cause unintended ECMP path changes

Changes since v3: https://lore.kernel.org/netdev/20260505193824.2791642-1-ntspring@meta.com/
 - Use __sk_dst_reset() instead of sk_dst_reset() since the socket lock
   is held in all three call sites (Eric Dumazet)
 - Guard __sk_dst_reset() with sk->sk_family == AF_INET6 since IPv4 ECMP
   does not use sk_txhash for path selection
 - Guard __sk_dst_reset() in tcp_plb_check_rehash() with the return value
   of sk_rethink_txhash()
 - Move tcp_rsk(req)->txhash initialization before route_req() in
   tcp_conn_request() to avoid reading uninitialized memory
 - Add CONFIG_TCP_CONG_DCTCP=m to selftests/net/config for PLB test
 - Skip PLB test gracefully if DCTCP is not available
 - Save and restore original congestion control algorithm in PLB test
 - Default get_netstat_counter() to 0 when counter is not found
 - Skip all tests if tcp_syn_linear_timeouts is not available
 - Replace bash/pipe data sources with socat OPEN:/dev/zero for
   cleaner process cleanup
 - Fix shellcheck warnings

Changes since v2: https://lore.kernel.org/netdev/20260408070514.1840227-1-ntspring@meta.com/
 - Retitle "ECMP" to "local ECMP" to distinguish from remote ECMP
   (Neal Cardwell)
 - Add fl6->mp_hash propagation in inet6_sk_rebuild_header() (af_inet6.c),
   covering the dst rebuild path used on established sockets
 - Remove incorrect ir_iif update from tcp_check_req() in tcp_minisocks.c;
   the SYN/ACK rehash is already handled by tcp_rtx_synack() re-rolling
   txhash which feeds into inet6_csk_route_req()'s mp_hash
   (Eric Dumazet)
 - Add ACK rehash and PLB rehash selftests
 - Improve selftest reliability

Changes since v1: https://lore.kernel.org/netdev/20260408002802.2448424-1-ntspring@meta.com/
 - Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
   for ECMP path selection in inet6_csk_route_req(), making the request
   socket path consistent with the established socket path (Eric Dumazet)
 - Add comments explaining the >> 1 shift for 31-bit mp_hash range
 - Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
 - Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
   better rehash coverage

Neil Spring (2):
  tcp: rehash onto different local ECMP path on retransmit timeout
  selftests: net: add local ECMP rehash test

 net/core/filter.c                          |   1 +
 net/ipv4/tcp_input.c                       |   6 +-
 net/ipv4/tcp_plb.c                         |   7 +-
 net/ipv4/tcp_timer.c                       |   4 +
 net/ipv6/af_inet6.c                        |   3 +
 net/ipv6/inet6_connection_sock.c           |   7 +
 net/ipv6/syncookies.c                      |   4 +
 net/ipv6/tcp_ipv6.c                        |  13 +-
 tools/testing/selftests/net/Makefile       |   1 +
 tools/testing/selftests/net/config         |   1 +
 tools/testing/selftests/net/ecmp_rehash.sh | 933 +++++++++++++++++++++
 11 files changed, 975 insertions(+), 5 deletions(-)
 create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh

--
2.53.0-Meta