* [PATCH net-next 0/2] tcp: rehash onto different ECMP path on retransmit timeout
@ 2026-04-08 0:28 Neil Spring
2026-04-08 0:28 ` [PATCH net-next 1/2] " Neil Spring
` (2 more replies)
0 siblings, 3 replies; 9+ messages in thread
From: Neil Spring @ 2026-04-08 0:28 UTC
To: netdev
Cc: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
shuah, linux-kselftest
I configured an ECMP route across two local interfaces, expecting
autoflowlabel rehashing to steer traffic onto the other interface when
one of them is blocked from reaching the destination. It did not, for
SYN, SYN/ACK, and established traffic alike.
I was able to make this work as I expected it to by calling
sk_dst_reset() at times when txhash was updated, and propagating
sk_txhash into fl6->mp_hash so fib6_select_path() uses the socket's
current hash for ECMP selection.
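To make the interaction described above concrete, here is a toy bash model, not kernel code. It assumes a socket that caches its chosen path and consults the hash only when the cache is empty. The names txhash, cached_path, last_path, lookup_path, and transmit are invented for illustration, and clearing cached_path merely stands in for sk_dst_reset().

```bash
#!/bin/bash
# Toy model (not kernel code): a "socket" with a transmit hash and a
# cached path. All names here are hypothetical.
txhash=0
cached_path=""
last_path=""

# Map the current hash to one of two paths (stand-in for ECMP selection).
lookup_path()
{
	echo $(( txhash % 2 ))
}

# Send one packet: reuse the cached path if any, otherwise look one up.
transmit()
{
	if [ -z "$cached_path" ]; then
		cached_path=$(lookup_path)
	fi
	last_path=$cached_path
}

transmit                        # first send caches a path
echo "initial:        path $last_path"

txhash=1                        # rehash only; the cache is untouched
transmit
echo "rehash only:    path $last_path"

cached_path=""                  # stand-in for sk_dst_reset()
transmit
echo "rehash + reset: path $last_path"
```

In this model the new hash has no effect until the cached path is dropped, which is the gap the sk_dst_reset() calls are meant to close.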
I expected autoflowlabel to apply to local ECMP routes, and think
it should, but am open to feedback if this isn't the right way to
do it.
Patch 1 has the kernel changes; patch 2 adds a selftest exercising
SYN, SYN/ACK, and established connection rehash over a two-path
ECMP topology. The selftest retries a SYN 26 times, so has a tiny
(~3e-8) probability of false failure if repeatedly unlucky with ECMP
path selection.
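The ~3e-8 figure can be reproduced with a short calculation. The sketch below assumes 26 independent attempts, each choosing one of two paths uniformly at random, and treats the test as failing only when every attempt lands on the same path. The function name all_same_path_probability is made up for this example.

```bash
#!/bin/bash
# Probability that N independent, uniform choices between 2 paths all
# land on the same path: 2 * (1/2)^N = (1/2)^(N-1).
# Assumption: attempts are independent and each path is equally likely.
all_same_path_probability()
{
	local attempts=$1

	awk -v n="$attempts" 'BEGIN { printf "%.2e\n", 2 ^ (1 - n) }'
}

all_same_path_probability 26
```

With 26 attempts this evaluates to 2^-25, about 2.98e-08, in line with the estimate above.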
Neil Spring (2):
tcp: rehash onto different ECMP path on retransmit timeout
selftests: net: add ECMP rehash test
net/ipv4/tcp_input.c | 4 +-
net/ipv4/tcp_minisocks.c | 9 +
net/ipv4/tcp_plb.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/inet6_connection_sock.c | 8 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ecmp_rehash.sh | 354 +++++++++++++++++++++
7 files changed, 377 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
--
2.52.0
* [PATCH net-next 1/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 0:28 [PATCH net-next 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
@ 2026-04-08 0:28 ` Neil Spring
2026-04-08 1:09 ` Eric Dumazet
2026-04-08 0:28 ` [PATCH net-next 2/2] selftests: net: add ECMP rehash test Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2 siblings, 1 reply; 9+ messages in thread
From: Neil Spring @ 2026-04-08 0:28 UTC
To: netdev
Cc: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
shuah, linux-kselftest
Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
and spurious-retrans paths so that the next transmit triggers a fresh
route lookup. Propagate sk_txhash into fl6->mp_hash in
inet6_csk_route_req() and inet6_csk_route_socket() so
fib6_select_path() uses the socket's current hash for ECMP selection.
The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
because it was cleaner than gating on address family; IPv4 is
otherwise unaltered, and not having autoflowlabel in IPv4 means
I wouldn't expect a new path on timeout.
It is possible that PLB does not need this (that there are other
methods of reacting to local congestion); I added the sk_dst_reset
for consistency.
Signed-off-by: Neil Spring <ntspring@meta.com>
---
net/ipv4/tcp_input.c | 4 +++-
net/ipv4/tcp_minisocks.c | 9 +++++++++
net/ipv4/tcp_plb.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/inet6_connection_sock.c | 8 ++++++++
5 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7171442c3ed7..3d42ab45066c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
skb->protocol == htons(ETH_P_IPV6) &&
(tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
- sk_rethink_txhash(sk))
+ sk_rethink_txhash(sk)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
+ sk_dst_reset(sk);
+ }
/* Save last flowlabel after a spurious retrans. */
tcp_save_lrcv_flowlabel(sk, skb);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 199f0b579e89..ef4b3771e9d8 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* Reset timer after retransmitting SYNACK, similar to
* the idea of fast retransmit in recovery.
*/
+
+#if IS_ENABLED(CONFIG_IPV6)
+ if (sk->sk_family == AF_INET6)
+ inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
+ else
+#endif
+ inet_rsk(req)->ir_iif =
+ inet_request_bound_dev_if(sk, skb);
+
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
&tcp_rsk(req)->last_oow_ack_time)) {
diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
index 68ccdb9a5412..d7cc00a58e53 100644
--- a/net/ipv4/tcp_plb.c
+++ b/net/ipv4/tcp_plb.c
@@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
return;
sk_rethink_txhash(sk);
+ sk_dst_reset(sk);
plb->consec_cong_rounds = 0;
tcp_sk(sk)->plb_rehash++;
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ea99988795e7..acc22fc532c2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
if (sk_rethink_txhash(sk)) {
tp->timeout_rehash++;
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
+ sk_dst_reset(sk);
}
return 0;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 37534e116899..2fe753bb38b4 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
fl6->flowi6_uid = sk_uid(sk);
security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
+ if (req->num_retrans)
+ fl6->mp_hash = jhash_1word(req->num_retrans,
+ (__force u32)ireq->ir_rmt_port)
+ >> 1;
+
if (!dst) {
dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
if (IS_ERR(dst))
@@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
fl6->saddr = np->saddr;
fl6->flowlabel = np->flow_label;
IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+
+ if (sk->sk_txhash)
+ fl6->mp_hash = sk->sk_txhash >> 1;
fl6->flowi6_oif = sk->sk_bound_dev_if;
fl6->flowi6_mark = sk->sk_mark;
fl6->fl6_sport = inet->inet_sport;
--
2.52.0
* [PATCH net-next 2/2] selftests: net: add ECMP rehash test
2026-04-08 0:28 [PATCH net-next 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2026-04-08 0:28 ` [PATCH net-next 1/2] " Neil Spring
@ 2026-04-08 0:28 ` Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2 siblings, 0 replies; 9+ messages in thread
From: Neil Spring @ 2026-04-08 0:28 UTC
To: netdev
Cc: edumazet, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
shuah, linux-kselftest
Add ecmp_rehash.sh to exercise TCP ECMP path re-selection on
retransmission timeout. Three tests cover client SYN rehash, server
SYN/ACK rehash, and midstream RTO rehash of an established connection
over a two-path ECMP topology with one leg blocked by tc.
The SYN test retries 26 times, so has a false negative probability
of ~(1/2)^25 ≈ 3e-8.
Signed-off-by: Neil Spring <ntspring@meta.com>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ecmp_rehash.sh | 354 +++++++++++++++++++++
2 files changed, 355 insertions(+)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 6bced3ed798b..acc61a51d7e2 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -25,6 +25,7 @@ TEST_PROGS := \
cmsg_time.sh \
double_udp_encap.sh \
drop_monitor_tests.sh \
+ ecmp_rehash.sh \
fcnal-ipv4.sh \
fcnal-ipv6.sh \
fcnal-other.sh \
diff --git a/tools/testing/selftests/net/ecmp_rehash.sh b/tools/testing/selftests/net/ecmp_rehash.sh
new file mode 100755
index 000000000000..a062c0b51fd6
--- /dev/null
+++ b/tools/testing/selftests/net/ecmp_rehash.sh
@@ -0,0 +1,354 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test ECMP path re-selection on TCP retransmission timeout.
+#
+# Two namespaces connected by two parallel veth pairs with a 2-way ECMP
+# route. When a TCP path is blocked (via tc drop), RTO triggers
+# sk_rethink_txhash() + sk_dst_reset(), causing the next route lookup
+# to select the other ECMP path.
+#
+# False negative: ~(1/2)^25 ≈ 3e-8. With tcp_syn_retries=6 (~127 s
+# timeout) and tcp_syn_linear_timeouts=20 there are roughly 26
+# independent rehash attempts, each choosing one of 2 paths uniformly.
+
+source lib.sh
+
+SUBNETS=(a b)
+PORT=9900
+
+ALL_TESTS="
+ test_ecmp_rto_rehash
+ test_ecmp_synack_rehash
+ test_ecmp_midstream_rehash
+"
+
+link_tx_packets_get()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" cat "/sys/class/net/$dev/statistics/tx_packets"
+}
+
+# Return the number of packets matched by the tc filter action on a device.
+# When tc drops packets via "action drop", the device's tx_packets is not
+# incremented (packet never reaches veth_xmit), but the tc action maintains
+# its own counter.
+tc_filter_pkt_count()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc -s filter show dev "$dev" parent 1: 2>/dev/null |
+ awk '/Sent .* pkt/ { for (i=1;i<=NF;i++) if ($i=="pkt") { print $(i-1); exit } }'
+}
+
+# Read TcpTimeoutRehash counter from /proc/net/netstat in a namespace.
+# This counter increments in tcp_write_timeout() on every RTO that triggers
+# sk_rethink_txhash().
+get_timeout_rehash_count()
+{
+ local ns=$1; shift
+
+ ip netns exec "$ns" awk '
+ /^TcpExt:/ {
+ if (!h) { split($0, n); h=1 }
+ else {
+ split($0, v)
+ for (i in n)
+ if (n[i] == "TcpTimeoutRehash") print v[i]
+ }
+ }
+ ' /proc/net/netstat
+}
+
+# Block TCP (IPv6 next-header = 6) egress, allowing ICMPv6 through.
+block_tcp()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc qdisc add dev "$dev" root handle 1: prio
+ ip netns exec "$ns" tc filter add dev "$dev" parent 1: \
+ protocol ipv6 prio 1 u32 match u8 0x06 0xff at 6 action drop
+}
+
+unblock_tcp()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc qdisc del dev "$dev" root 2>/dev/null
+}
+
+# Return success when both devices have dropped at least one TCP packet.
+both_devs_attempted()
+{
+ local ns=$1; shift
+ local dev0=$1; shift
+ local dev1=$1; shift
+
+ local c0 c1
+ c0=$(tc_filter_pkt_count "$ns" "$dev0")
+ c1=$(tc_filter_pkt_count "$ns" "$dev1")
+ [ "${c0:-0}" -ge 1 ] && [ "${c1:-0}" -ge 1 ]
+}
+
+setup()
+{
+ setup_ns NS1 NS2
+
+ local ns
+ for ns in "$NS1" "$NS2"; do
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.accept_dad=0
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.default.accept_dad=0
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.forwarding=1
+ ip netns exec "$ns" sysctl -qw net.core.txrehash=1
+ done
+
+ local i sub
+ for i in 0 1; do
+ sub=${SUBNETS[$i]}
+ ip link add "veth${i}a" type veth peer name "veth${i}b"
+ ip link set "veth${i}a" netns "$NS1"
+ ip link set "veth${i}b" netns "$NS2"
+ ip -n "$NS1" addr add "fd00:${sub}::1/64" dev "veth${i}a"
+ ip -n "$NS2" addr add "fd00:${sub}::2/64" dev "veth${i}b"
+ ip -n "$NS1" link set "veth${i}a" up
+ ip -n "$NS2" link set "veth${i}b" up
+ done
+
+ ip -n "$NS1" addr add fd00:ff::1/128 dev lo
+ ip -n "$NS2" addr add fd00:ff::2/128 dev lo
+
+ # Allow many SYN retries at 1-second intervals (linear, no
+ # exponential backoff) so the rehash test has enough attempts
+ # to exercise both ECMP paths with high probability.
+ ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_syn_retries=6
+ ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_syn_linear_timeouts=20
+
+ ip -n "$NS1" -6 route add fd00:ff::2/128 \
+ nexthop via fd00:a::2 dev veth0a \
+ nexthop via fd00:b::2 dev veth1a
+
+ ip -n "$NS2" -6 route add fd00:ff::1/128 \
+ nexthop via fd00:a::1 dev veth0b \
+ nexthop via fd00:b::1 dev veth1b
+
+ for i in 0 1; do
+ sub=${SUBNETS[$i]}
+ ip netns exec "$NS1" \
+ ping -6 -c1 -W5 "fd00:${sub}::2" &>/dev/null
+ ip netns exec "$NS2" \
+ ping -6 -c1 -W5 "fd00:${sub}::1" &>/dev/null
+ done
+
+ if ! ip netns exec "$NS1" ping -6 -c1 -W5 fd00:ff::2 &>/dev/null; then
+ echo "Basic connectivity check failed"
+ return $ksft_skip
+ fi
+}
+
+# Block ALL paths, start a connection, wait until SYNs have been dropped
+# on both interfaces (proving rehash steered the SYN to a new path), then
+# unblock so the connection completes.
+test_ecmp_rto_rehash()
+{
+ RET=0
+
+ block_tcp "$NS1" veth0a
+ defer unblock_tcp "$NS1" veth0a
+ block_tcp "$NS1" veth1a
+ defer unblock_tcp "$NS1" veth1a
+
+ ip netns exec "$NS2" socat \
+ "TCP6-LISTEN:$PORT,bind=[fd00:ff::2],reuseaddr,fork" \
+ EXEC:"echo ESTABLISH_OK" &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $PORT tcp
+
+ local rehash_before
+ rehash_before=$(get_timeout_rehash_count "$NS1")
+
+ # Start the connection in the background; it will retry SYNs at
+ # 1-second intervals until an unblocked path is found.
+ ip netns exec "$NS1" bash -c \
+ "echo test | socat - \
+ 'TCP6:[fd00:ff::2]:$PORT,bind=[fd00:ff::1],connect-timeout=60'" \
+ >"/tmp/ecmp_rto_$$" 2>&1 &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ # Wait until both paths have seen at least one dropped SYN.
+ # This proves sk_rethink_txhash() rehashed the connection from
+ # one ECMP path to the other.
+ slowwait 30 both_devs_attempted "$NS1" veth0a veth1a
+ check_err $? "SYNs did not appear on both paths (rehash not working)"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP RTO rehash: establish with blocked paths"
+ return
+ fi
+
+ # Unblock both paths and let the next SYN retransmit succeed.
+ unblock_tcp "$NS1" veth0a
+ unblock_tcp "$NS1" veth1a
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ local result
+ result=$(cat "/tmp/ecmp_rto_$$" 2>/dev/null)
+ rm -f "/tmp/ecmp_rto_$$"
+
+ if [ $rc -ne 0 ] || [[ "$result" != *"ESTABLISH_OK"* ]]; then
+ check_err 1 "connection failed after unblocking: $result"
+ fi
+
+ local rehash_after
+ rehash_after=$(get_timeout_rehash_count "$NS1")
+ if [ "$rehash_after" -le "$rehash_before" ]; then
+ check_err 1 "TcpTimeoutRehash counter did not increment"
+ fi
+
+ log_test "ECMP RTO rehash: establish with blocked paths"
+}
+
+# Block the server's return paths so SYN/ACKs are dropped. The client
+# retransmits SYNs at 1-second intervals; each duplicate SYN arriving at
+# the server updates ir_iif to match the new arrival interface, so the
+# retransmitted SYN/ACK routes back via the interface the SYN arrived on.
+test_ecmp_synack_rehash()
+{
+ RET=0
+ local port=$((PORT + 2))
+
+ block_tcp "$NS2" veth0b
+ defer unblock_tcp "$NS2" veth0b
+ block_tcp "$NS2" veth1b
+ defer unblock_tcp "$NS2" veth1b
+
+ ip netns exec "$NS2" socat \
+ "TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr,fork" \
+ EXEC:"echo SYNACK_OK" &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $port tcp
+
+ # Start the connection; SYNs reach the server (client egress is
+ # open) but SYN/ACKs are dropped on the server's return path.
+ ip netns exec "$NS1" bash -c \
+ "echo test | socat - \
+ 'TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1],connect-timeout=60'" \
+ >"/tmp/ecmp_synack_$$" 2>&1 &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ # Wait until both server-side interfaces have dropped at least
+ # one SYN/ACK, proving the server rehashed its return path.
+ slowwait 30 both_devs_attempted "$NS2" veth0b veth1b
+ check_err $? "SYN/ACKs did not appear on both return paths"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP SYN/ACK rehash: blocked return path"
+ return
+ fi
+
+ # Unblock and let the connection complete.
+ unblock_tcp "$NS2" veth0b
+ unblock_tcp "$NS2" veth1b
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ local result
+ result=$(cat "/tmp/ecmp_synack_$$" 2>/dev/null)
+ rm -f "/tmp/ecmp_synack_$$"
+
+ if [ $rc -ne 0 ] || [[ "$result" != *"SYNACK_OK"* ]]; then
+ check_err 1 "connection failed after unblocking: $result"
+ fi
+
+ log_test "ECMP SYN/ACK rehash: blocked return path"
+}
+
+# Establish a data transfer with both paths open, then block the
+# active path. Verify the transfer continues via rehash and that
+# TcpTimeoutRehash incremented.
+test_ecmp_midstream_rehash()
+{
+ RET=0
+ local port=$((PORT + 1))
+
+ ip netns exec "$NS2" socat -u \
+ "TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr" - >/dev/null &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $port tcp
+
+ local base_tx0 base_tx1
+ base_tx0=$(link_tx_packets_get "$NS1" veth0a)
+ base_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+ ip netns exec "$NS1" bash -c "
+ for i in \$(seq 1 40); do
+ dd if=/dev/zero bs=10k count=1 2>/dev/null
+ sleep 0.25
+ done | timeout 60 socat - 'TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1]'
+ " &>/dev/null &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ busywait $BUSYWAIT_TIMEOUT until_counter_is \
+ ">= $((base_tx0 + base_tx1 + 20))" \
+ link_tx_packets_total "$NS1"
+ check_err $? "no TX activity detected"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP midstream rehash: block active path"
+ return
+ fi
+
+ # Find the active path and block it.
+ local cur0 cur1 active_idx
+ cur0=$(link_tx_packets_get "$NS1" veth0a)
+ cur1=$(link_tx_packets_get "$NS1" veth1a)
+ if [ $((cur0 - base_tx0)) -ge $((cur1 - base_tx1)) ]; then
+ active_idx=0
+ else
+ active_idx=1
+ fi
+
+ local rehash_before
+ rehash_before=$(get_timeout_rehash_count "$NS1")
+
+ block_tcp "$NS1" "veth${active_idx}a"
+ defer unblock_tcp "$NS1" "veth${active_idx}a"
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ check_err $rc "data transfer failed after blocking veth${active_idx}a"
+
+ local rehash_after
+ rehash_after=$(get_timeout_rehash_count "$NS1")
+ if [ "$rehash_after" -le "$rehash_before" ]; then
+ check_err 1 "TcpTimeoutRehash counter did not increment"
+ fi
+
+ log_test "ECMP midstream rehash: block active path"
+}
+
+link_tx_packets_total()
+{
+ local ns=$1; shift
+
+ echo $(( $(link_tx_packets_get "$ns" veth0a) +
+ $(link_tx_packets_get "$ns" veth1a) ))
+}
+
+require_command socat
+
+trap cleanup_all_ns EXIT
+setup || exit $?
+tests_run
+exit $EXIT_STATUS
--
2.52.0
* Re: [PATCH net-next 1/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 0:28 ` [PATCH net-next 1/2] " Neil Spring
@ 2026-04-08 1:09 ` Eric Dumazet
2026-04-08 6:59 ` Neil Spring
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2026-04-08 1:09 UTC
To: Neil Spring
Cc: netdev, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
shuah, linux-kselftest, Ido Schimmel
On Tue, Apr 7, 2026 at 5:28 PM Neil Spring <ntspring@meta.com> wrote:
>
> Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> and spurious-retrans paths so that the next transmit triggers a fresh
> route lookup. Propagate sk_txhash into fl6->mp_hash in
> inet6_csk_route_req() and inet6_csk_route_socket() so
> fib6_select_path() uses the socket's current hash for ECMP selection.
>
> The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> because it was cleaner than gating on address family; IPv4 is
> otherwise unaltered, and not having autoflowlabel in IPv4 means
> I wouldn't expect a new path on timeout.
>
> It is possible that PLB does not need this (that there are other
> methods of reacting to local congestion); I added the sk_dst_reset
> for consistency.
>
> Signed-off-by: Neil Spring <ntspring@meta.com>
> ---
> net/ipv4/tcp_input.c | 4 +++-
> net/ipv4/tcp_minisocks.c | 9 +++++++++
> net/ipv4/tcp_plb.c | 1 +
> net/ipv4/tcp_timer.c | 1 +
> net/ipv6/inet6_connection_sock.c | 8 ++++++++
> 5 files changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 7171442c3ed7..3d42ab45066c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
> skb->protocol == htons(ETH_P_IPV6) &&
> (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
> ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> - sk_rethink_txhash(sk))
> + sk_rethink_txhash(sk)) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
> + sk_dst_reset(sk);
> + }
>
> /* Save last flowlabel after a spurious retrans. */
> tcp_save_lrcv_flowlabel(sk, skb);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 199f0b579e89..ef4b3771e9d8 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> * Reset timer after retransmitting SYNACK, similar to
> * the idea of fast retransmit in recovery.
> */
> +
What is the following part doing?
tcp_v6_init_req() uses something quite different before setting ir_iif
A comment explaining the rationale would be nice.
> +#if IS_ENABLED(CONFIG_IPV6)
> + if (sk->sk_family == AF_INET6)
> + inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
> + else
> +#endif
> + inet_rsk(req)->ir_iif =
> + inet_request_bound_dev_if(sk, skb);
> +
> if (!tcp_oow_rate_limited(sock_net(sk), skb,
> LINUX_MIB_TCPACKSKIPPEDSYNRECV,
> &tcp_rsk(req)->last_oow_ack_time)) {
> diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> index 68ccdb9a5412..d7cc00a58e53 100644
> --- a/net/ipv4/tcp_plb.c
> +++ b/net/ipv4/tcp_plb.c
> @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
> return;
>
> sk_rethink_txhash(sk);
> + sk_dst_reset(sk);
> plb->consec_cong_rounds = 0;
> tcp_sk(sk)->plb_rehash++;
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index ea99988795e7..acc22fc532c2 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> if (sk_rethink_txhash(sk)) {
> tp->timeout_rehash++;
> __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
> + sk_dst_reset(sk);
> }
>
> return 0;
> diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
> index 37534e116899..2fe753bb38b4 100644
> --- a/net/ipv6/inet6_connection_sock.c
> +++ b/net/ipv6/inet6_connection_sock.c
> @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
> fl6->flowi6_uid = sk_uid(sk);
> security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
>
> + if (req->num_retrans)
> + fl6->mp_hash = jhash_1word(req->num_retrans,
> + (__force u32)ireq->ir_rmt_port)
> + >> 1;
Why not setting mp_hash to sk_txhash ?
Why are you using ">> 1" ?
rt6_multipath_hash() seems to be bypassed, it might be time to add a
comment there
explaining that mp_hash needs to be 31-bit only...
Perhaps use rt6_multipath_hash() and expand it to use a socket pointer
to retrieve sk->sk_txhash when/if possible
instead of yet another flow dissection.
> +
> if (!dst) {
> dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
> if (IS_ERR(dst))
> @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
> fl6->saddr = np->saddr;
> fl6->flowlabel = np->flow_label;
> IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> +
> + if (sk->sk_txhash)
> + fl6->mp_hash = sk->sk_txhash >> 1;
Seems inconsistent, and same question about the right shift.
> fl6->flowi6_oif = sk->sk_bound_dev_if;
> fl6->flowi6_mark = sk->sk_mark;
> fl6->fl6_sport = inet->inet_sport;
> --
> 2.52.0
>
* Re: [PATCH net-next 1/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 1:09 ` Eric Dumazet
@ 2026-04-08 6:59 ` Neil Spring
0 siblings, 0 replies; 9+ messages in thread
From: Neil Spring @ 2026-04-08 6:59 UTC
To: Eric Dumazet
Cc: netdev, ncardwell, kuniyu, davem, dsahern, kuba, pabeni, horms,
shuah, linux-kselftest, Ido Schimmel
Thanks, Eric, for looking at this, and sorry for the duplicate email: I
failed to reply-all the first time. Comments below.
On Tue, Apr 7, 2026 at 6:09 PM Eric Dumazet <edumazet@google.com> wrote:
> On Tue, Apr 7, 2026 at 5:28 PM Neil Spring <ntspring@meta.com> wrote:
> >
> > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> > and spurious-retrans paths so that the next transmit triggers a fresh
> > route lookup. Propagate sk_txhash into fl6->mp_hash in
> > inet6_csk_route_req() and inet6_csk_route_socket() so
> > fib6_select_path() uses the socket's current hash for ECMP selection.
> >
> > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> > because it was cleaner than gating on address family; IPv4 is
> > otherwise unaltered, and not having autoflowlabel in IPv4 means
> > I wouldn't expect a new path on timeout.
> >
> > It is possible that PLB does not need this (that there are other
> > methods of reacting to local congestion); I added the sk_dst_reset
> > for consistency.
> >
> > Signed-off-by: Neil Spring <ntspring@meta.com>
> > ---
> > net/ipv4/tcp_input.c | 4 +++-
> > net/ipv4/tcp_minisocks.c | 9 +++++++++
> > net/ipv4/tcp_plb.c | 1 +
> > net/ipv4/tcp_timer.c | 1 +
> > net/ipv6/inet6_connection_sock.c | 8 ++++++++
> > 5 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 7171442c3ed7..3d42ab45066c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
> > skb->protocol == htons(ETH_P_IPV6) &&
> > (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
> > ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> > - sk_rethink_txhash(sk))
> > + sk_rethink_txhash(sk)) {
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
> > + sk_dst_reset(sk);
> > + }
> >
> > /* Save last flowlabel after a spurious retrans. */
> > tcp_save_lrcv_flowlabel(sk, skb);
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index 199f0b579e89..ef4b3771e9d8 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> > * Reset timer after retransmitting SYNACK, similar to
> > * the idea of fast retransmit in recovery.
> > */
> > +
>
> What is the following part doing?
> tcp_v6_init_req() uses something quite different before setting ir_iif
> A comment explaining the rationale would be nice.
Comment added in v2. The behavior I thought I observed was that ir_iif
goes into fl6->flowi6_oif in inet6_csk_route_req(), which limits the
choice to next hops on that interface. So if we receive a retransmitted
SYN on a new interface, we should try to use the new interface for the
response.
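A toy bash model of the behavior described above, under the stated (and unconfirmed) assumption that a non-empty output interface restricts the choice to next hops on that interface. The next-hop table reuses the NS2 addresses and devices from the selftest. The function eligible_nexthops is hypothetical and is not a kernel or iproute2 interface.

```bash
#!/bin/bash
# Toy model (not kernel code). Each entry is "gateway device", mirroring
# the NS2 ECMP route in the selftest.
NEXTHOPS=("fd00:a::1 veth0b" "fd00:b::1 veth1b")

# Print the next hops usable for a reply. An empty oif means "no
# interface constraint"; otherwise only next hops on oif qualify.
eligible_nexthops()
{
	local oif=$1
	local nh

	for nh in "${NEXTHOPS[@]}"; do
		if [ -z "$oif" ] || [ "${nh#* }" = "$oif" ]; then
			echo "$nh"
		fi
	done
}

ir_iif=veth0b                   # first SYN arrived on veth0b
eligible_nexthops "$ir_iif"

ir_iif=veth1b                   # retransmitted SYN arrived on veth1b
eligible_nexthops "$ir_iif"
```

If the stored interface were never refreshed, the reply in this model would stay pinned to the next hop on veth0b no matter how the hash changed.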
>
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + if (sk->sk_family == AF_INET6)
> > + inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
> > + else
> > +#endif
> > + inet_rsk(req)->ir_iif =
> > + inet_request_bound_dev_if(sk, skb);
> > +
> > if (!tcp_oow_rate_limited(sock_net(sk), skb,
> > LINUX_MIB_TCPACKSKIPPEDSYNRECV,
> > &tcp_rsk(req)->last_oow_ack_time)) {
> > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> > index 68ccdb9a5412..d7cc00a58e53 100644
> > --- a/net/ipv4/tcp_plb.c
> > +++ b/net/ipv4/tcp_plb.c
> > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
> > return;
> >
> > sk_rethink_txhash(sk);
> > + sk_dst_reset(sk);
> > plb->consec_cong_rounds = 0;
> > tcp_sk(sk)->plb_rehash++;
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index ea99988795e7..acc22fc532c2 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> > if (sk_rethink_txhash(sk)) {
> > tp->timeout_rehash++;
> > __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
> > + sk_dst_reset(sk);
> > }
> >
> > return 0;
> > diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
> > index 37534e116899..2fe753bb38b4 100644
> > --- a/net/ipv6/inet6_connection_sock.c
> > +++ b/net/ipv6/inet6_connection_sock.c
> > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
> > fl6->flowi6_uid = sk_uid(sk);
> > security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
> >
> > + if (req->num_retrans)
> > + fl6->mp_hash = jhash_1word(req->num_retrans,
> > + (__force u32)ireq->ir_rmt_port)
> > + >> 1;
>
> Why not setting mp_hash to sk_txhash ?
To be honest, I was trying to avoid adding #include <tcp.h> to the top of
inet6_connection_sock.c, which is needed to get tcp_rsk(req)->txhash. It
does make for better code, though, so v2 uses it.
> Why are you using ">> 1" ?
The >> 1 is there because rt6_multipath_hash() returns its computed
value shifted right by one before it goes into mp_hash at the call sites
in route.c. As I understand it, mp_hash cannot have the high bit set and
still work, because fib6_select_path() treats it as a signed integer when
comparing it to fib_nh_upper_bound.
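The effect of the shift can be illustrated with a simplified bash model, which is not the kernel's code. It assumes two equal-weight paths whose upper bounds split a 31-bit hash space, and a hash that is reinterpreted as a signed 32-bit integer before comparison, as described above. UPPER_BOUNDS, to_s32, select_path, and count_paths are names invented for this sketch.

```bash
#!/bin/bash
# Simplified model (not kernel code): two equal-weight paths split a
# 31-bit hash space; path i owns hashes up to UPPER_BOUNDS[i].
UPPER_BOUNDS=($(( (1 << 30) - 1 )) $(( (1 << 31) - 1 )))

# Reinterpret a 32-bit value as a signed 32-bit integer.
to_s32()
{
	local v=$(( $1 & 0xffffffff ))

	if [ "$v" -ge $(( 1 << 31 )) ]; then
		v=$(( v - (1 << 32) ))
	fi
	echo "$v"
}

# Print the index of the first path whose bound covers the signed hash.
select_path()
{
	local hash i

	hash=$(to_s32 "$1")
	for i in "${!UPPER_BOUNDS[@]}"; do
		if [ "$hash" -le "${UPPER_BOUNDS[$i]}" ]; then
			echo "$i"
			return
		fi
	done
}

# Count path choices for eight evenly spaced 32-bit hashes, shifting
# each hash right by $1 bits before selection.
count_paths()
{
	local bits=$1
	local k hash p
	local counts=(0 0)

	for k in 0 1 2 3 4 5 6 7; do
		hash=$(( (k << 29) >> bits ))
		p=$(select_path "$hash")
		counts[p]=$(( counts[p] + 1 ))
	done
	echo "${counts[@]}"
}

echo "unshifted: $(count_paths 0)"
echo "shifted:   $(count_paths 1)"
```

In this model, unshifted hashes with the top bit set all compare as negative and fall to the first path (6 of 8 samples), while shifting right by one restores an even 4/4 split.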
> rt6_multipath_hash() seems to be bypassed, it might be time to add a
> comment there
> explaining that mp_hash needs to be 31-bit only...
>
> Perhaps use rt6_multipath_hash() and expand it to use a socket pointer
> to retrieve sk->sk_txhash when/if possible
> instead of yet another flow dissection.
I think your comment above (grabbing txhash) avoids the dissection.
>
> > +
> > if (!dst) {
> > dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
> > if (IS_ERR(dst))
> > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
> > fl6->saddr = np->saddr;
> > fl6->flowlabel = np->flow_label;
> > IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> > +
> > + if (sk->sk_txhash)
> > + fl6->mp_hash = sk->sk_txhash >> 1;
>
> Seems inconsistent, and same question about the right shift.
Adding a comment and standardizing on:
* tcp_rsk(req)->txhash >> 1 for request sockets,
* sk->sk_txhash >> 1 for established sockets.
>
>
> > fl6->flowi6_oif = sk->sk_bound_dev_if;
> > fl6->flowi6_mark = sk->sk_mark;
> > fl6->fl6_sport = inet->inet_sport;
> > --
> > 2.52.0
> >
On Tue, Apr 7, 2026 at 6:09 PM Eric Dumazet <edumazet@google.com> wrote:
>
> >
> On Tue, Apr 7, 2026 at 5:28 PM Neil Spring <ntspring@meta.com> wrote:
> >
> > Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> > and spurious-retrans paths so that the next transmit triggers a fresh
> > route lookup. Propagate sk_txhash into fl6->mp_hash in
> > inet6_csk_route_req() and inet6_csk_route_socket() so
> > fib6_select_path() uses the socket's current hash for ECMP selection.
> >
> > The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> > because it was cleaner than gating on address family; IPv4 is
> > otherwise unaltered, and not having autoflowlabel in IPv4 means
> > I wouldn't expect a new path on timeout.
> >
> > It is possible that PLB does not need this (that there are other
> > methods of reacting to local congestion); I added the sk_dst_reset
> > for consistency.
> >
> > Signed-off-by: Neil Spring <ntspring@meta.com>
> > ---
> > net/ipv4/tcp_input.c | 4 +++-
> > net/ipv4/tcp_minisocks.c | 9 +++++++++
> > net/ipv4/tcp_plb.c | 1 +
> > net/ipv4/tcp_timer.c | 1 +
> > net/ipv6/inet6_connection_sock.c | 8 ++++++++
> > 5 files changed, 22 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 7171442c3ed7..3d42ab45066c 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
> > skb->protocol == htons(ETH_P_IPV6) &&
> > (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
> > ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> > - sk_rethink_txhash(sk))
> > + sk_rethink_txhash(sk)) {
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
> > + sk_dst_reset(sk);
> > + }
> >
> > /* Save last flowlabel after a spurious retrans. */
> > tcp_save_lrcv_flowlabel(sk, skb);
> > diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> > index 199f0b579e89..ef4b3771e9d8 100644
> > --- a/net/ipv4/tcp_minisocks.c
> > +++ b/net/ipv4/tcp_minisocks.c
> > @@ -750,6 +750,15 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> > * Reset timer after retransmitting SYNACK, similar to
> > * the idea of fast retransmit in recovery.
> > */
> > +
>
> What is the following part doing?
> tcp_v6_init_req() uses something quite different before setting ir_iif.
> A comment explaining the rationale would be nice.
>
> > +#if IS_ENABLED(CONFIG_IPV6)
> > + if (sk->sk_family == AF_INET6)
> > + inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
> > + else
> > +#endif
> > + inet_rsk(req)->ir_iif =
> > + inet_request_bound_dev_if(sk, skb);
> > +
> > if (!tcp_oow_rate_limited(sock_net(sk), skb,
> > LINUX_MIB_TCPACKSKIPPEDSYNRECV,
> > &tcp_rsk(req)->last_oow_ack_time)) {
> > diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> > index 68ccdb9a5412..d7cc00a58e53 100644
> > --- a/net/ipv4/tcp_plb.c
> > +++ b/net/ipv4/tcp_plb.c
> > @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
> > return;
> >
> > sk_rethink_txhash(sk);
> > + sk_dst_reset(sk);
> > plb->consec_cong_rounds = 0;
> > tcp_sk(sk)->plb_rehash++;
> > NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> > index ea99988795e7..acc22fc532c2 100644
> > --- a/net/ipv4/tcp_timer.c
> > +++ b/net/ipv4/tcp_timer.c
> > @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> > if (sk_rethink_txhash(sk)) {
> > tp->timeout_rehash++;
> > __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
> > + sk_dst_reset(sk);
> > }
> >
> > return 0;
> > diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
> > index 37534e116899..2fe753bb38b4 100644
> > --- a/net/ipv6/inet6_connection_sock.c
> > +++ b/net/ipv6/inet6_connection_sock.c
> > @@ -48,6 +48,11 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
> > fl6->flowi6_uid = sk_uid(sk);
> > security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
> >
> > + if (req->num_retrans)
> > + fl6->mp_hash = jhash_1word(req->num_retrans,
> > + (__force u32)ireq->ir_rmt_port)
> > + >> 1;
>
> Why not set mp_hash to sk_txhash?
>
> Why are you using ">> 1"?
>
> rt6_multipath_hash() seems to be bypassed; it might be time to add a
> comment there explaining that mp_hash must fit in 31 bits.
>
> Perhaps use rt6_multipath_hash() and extend it to take a socket pointer
> so it can retrieve sk->sk_txhash when possible, instead of doing yet
> another flow dissection.
>
> > +
> > if (!dst) {
> > dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
> > if (IS_ERR(dst))
> > @@ -70,6 +75,9 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
> > fl6->saddr = np->saddr;
> > fl6->flowlabel = np->flow_label;
> > IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> > +
> > + if (sk->sk_txhash)
> > + fl6->mp_hash = sk->sk_txhash >> 1;
>
> Seems inconsistent, and same question about the right shift.
>
>
> > fl6->flowi6_oif = sk->sk_bound_dev_if;
> > fl6->flowi6_mark = sk->sk_mark;
> > fl6->fl6_sport = inet->inet_sport;
> > --
> > 2.52.0
> >
* [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 0:28 [PATCH net-next 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2026-04-08 0:28 ` [PATCH net-next 1/2] " Neil Spring
2026-04-08 0:28 ` [PATCH net-next 2/2] selftests: net: add ECMP rehash test Neil Spring
@ 2026-04-08 7:05 ` Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 1/2] " Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 2/2] selftests: net: add ECMP rehash test Neil Spring
2 siblings, 2 replies; 9+ messages in thread
From: Neil Spring @ 2026-04-08 7:05 UTC (permalink / raw)
To: netdev; +Cc: edumazet, davem, kuba
Make TCP retransmission timeouts select a different ECMP path for IPv6.
Currently sk_rethink_txhash() changes the socket's txhash on RTO, but the
cached route is reused and the new hash is not propagated into the ECMP
path selection logic. This series adds sk_dst_reset() alongside
sk_rethink_txhash() to force a fresh route lookup, and sets fl6->mp_hash
from sk_txhash so fib6_select_path() picks a path based on the new hash.
Three selftest scenarios verify the behavior: SYN retransmission, SYN/ACK
retransmission (server-side), and midstream RTO on an established
connection.
Changes since v1:
- Use tcp_rsk(req)->txhash instead of jhash_1word(req->num_retrans, ...)
for ECMP path selection in inet6_csk_route_req(), making the request
socket path consistent with the established socket path (Eric Dumazet)
- Add comments explaining the >> 1 shift for 31-bit mp_hash range
- Add comment explaining the ir_iif update rationale in tcp_check_req()
- Use socat -u (unidirectional) in selftest to avoid SIGPIPE race
- Increase tcp_syn_retries and tcp_syn_linear_timeouts to 25 for
better rehash coverage; add tcp_synack_retries=10 on the server
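To illustrate the 31-bit mp_hash range mentioned in the changes above, here
is a minimal bash sketch of how a shifted txhash could map onto one of two
equal-weight nexthops. It assumes a simplified model in which each nexthop
owns a contiguous slice of [0, 0x7fffffff] and the first nexthop whose upper
bound is >= mp_hash is chosen. The helper names (mp_hash_from_txhash,
pick_nexthop) and the two bounds are hypothetical and are not kernel code.

```shell
#!/bin/bash
# Illustrative model only; not the kernel implementation.

# Shift a 32-bit txhash into the 31-bit mp_hash range.
mp_hash_from_txhash()
{
	local txhash=$1

	echo $(( (txhash & 0xffffffff) >> 1 ))
}

# Pick a nexthop index for a 2-way, equal-weight ECMP route.
# Assumed upper bounds: nexthop 0 covers [0, 0x3fffffff],
# nexthop 1 covers [0x40000000, 0x7fffffff].
pick_nexthop()
{
	local mp_hash=$1
	local idx=0
	local bound

	for bound in 0x3fffffff 0x7fffffff; do
		if [ "$mp_hash" -le "$(( bound ))" ]; then
			echo "$idx"
			return 0
		fi
		idx=$(( idx + 1 ))
	done
	return 1
}

for txhash in 0x12345678 0xdeadbeef; do
	h=$(mp_hash_from_txhash "$txhash")
	printf 'txhash=%s mp_hash=0x%08x nexthop=%s\n' \
		"$txhash" "$h" "$(pick_nexthop "$h")"
done
```

In this model a re-rolled txhash changes the selected nexthop only when the
new value lands in the other half of the range, which is why a single rehash
moves the flow with probability 1/2 on a two-path route.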
Neil Spring (2):
tcp: rehash onto different ECMP path on retransmit timeout
selftests: net: add ECMP rehash test
net/ipv4/tcp_input.c | 4 +-
net/ipv4/tcp_minisocks.c | 13 +
net/ipv4/tcp_plb.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/inet6_connection_sock.c | 11 +
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ecmp_rehash.sh | 361 +++++++++++++++++++++
7 files changed, 391 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
--
2.52.0
* [PATCH net-next v2 1/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 7:05 ` [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
@ 2026-04-08 7:05 ` Neil Spring
2026-04-08 7:12 ` Eric Dumazet
2026-04-08 7:05 ` [PATCH net-next v2 2/2] selftests: net: add ECMP rehash test Neil Spring
1 sibling, 1 reply; 9+ messages in thread
From: Neil Spring @ 2026-04-08 7:05 UTC (permalink / raw)
To: netdev; +Cc: edumazet, davem, kuba
Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
and spurious-retrans paths so that the next transmit triggers a fresh
route lookup. Propagate sk_txhash into fl6->mp_hash in
inet6_csk_route_req() and inet6_csk_route_socket() so
fib6_select_path() uses the socket's current hash for ECMP selection.
The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
because it was cleaner than gating on address family; IPv4 is
otherwise unaltered, and not having autoflowlabel in IPv4 means
I wouldn't expect a new path on timeout.
It is possible that PLB does not need this (that there are other
methods of reacting to local congestion); I added the sk_dst_reset
for consistency.
Signed-off-by: Neil Spring <ntspring@meta.com>
---
net/ipv4/tcp_input.c | 4 +++-
net/ipv4/tcp_minisocks.c | 13 +++++++++++++
net/ipv4/tcp_plb.c | 1 +
net/ipv4/tcp_timer.c | 1 +
net/ipv6/inet6_connection_sock.c | 11 +++++++++++
5 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 7171442c3ed7..3d42ab45066c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
skb->protocol == htons(ETH_P_IPV6) &&
(tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
- sk_rethink_txhash(sk))
+ sk_rethink_txhash(sk)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
+ sk_dst_reset(sk);
+ }
/* Save last flowlabel after a spurious retrans. */
tcp_save_lrcv_flowlabel(sk, skb);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 199f0b579e89..27edf71effc2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -750,6 +750,19 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* Reset timer after retransmitting SYNACK, similar to
* the idea of fast retransmit in recovery.
*/
+
+ /* Update ir_iif to match the interface the retransmitted
+ * SYN arrived on; inet6_csk_route_req() uses this as
+ * flowi6_oif, constraining ECMP path for the SYN/ACK.
+ */
+#if IS_ENABLED(CONFIG_IPV6)
+ if (sk->sk_family == AF_INET6)
+ inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
+ else
+#endif
+ inet_rsk(req)->ir_iif =
+ inet_request_bound_dev_if(sk, skb);
+
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
&tcp_rsk(req)->last_oow_ack_time)) {
diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
index 68ccdb9a5412..d7cc00a58e53 100644
--- a/net/ipv4/tcp_plb.c
+++ b/net/ipv4/tcp_plb.c
@@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
return;
sk_rethink_txhash(sk);
+ sk_dst_reset(sk);
plb->consec_cong_rounds = 0;
tcp_sk(sk)->plb_rehash++;
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ea99988795e7..acc22fc532c2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
if (sk_rethink_txhash(sk)) {
tp->timeout_rehash++;
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
+ sk_dst_reset(sk);
}
return 0;
diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
index 37534e116899..3fd7acbe2c49 100644
--- a/net/ipv6/inet6_connection_sock.c
+++ b/net/ipv6/inet6_connection_sock.c
@@ -14,6 +14,7 @@
#include <linux/ipv6.h>
#include <linux/jhash.h>
#include <linux/slab.h>
+#include <linux/tcp.h>
#include <net/addrconf.h>
#include <net/inet_connection_sock.h>
@@ -48,6 +49,12 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
fl6->flowi6_uid = sk_uid(sk);
security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
+ /* Use the request socket's txhash (re-rolled by tcp_rtx_synack())
+ * for ECMP path selection; >> 1 for 31-bit mp_hash range.
+ */
+ if (tcp_rsk(req)->txhash)
+ fl6->mp_hash = tcp_rsk(req)->txhash >> 1;
+
if (!dst) {
dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
if (IS_ERR(dst))
@@ -70,6 +77,10 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
fl6->saddr = np->saddr;
fl6->flowlabel = np->flow_label;
IP6_ECN_flow_xmit(sk, fl6->flowlabel);
+
+ /* >> 1 for 31-bit mp_hash range matching nhc_upper_bound. */
+ if (sk->sk_txhash)
+ fl6->mp_hash = sk->sk_txhash >> 1;
fl6->flowi6_oif = sk->sk_bound_dev_if;
fl6->flowi6_mark = sk->sk_mark;
fl6->fl6_sport = inet->inet_sport;
--
2.52.0
* [PATCH net-next v2 2/2] selftests: net: add ECMP rehash test
2026-04-08 7:05 ` [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 1/2] " Neil Spring
@ 2026-04-08 7:05 ` Neil Spring
1 sibling, 0 replies; 9+ messages in thread
From: Neil Spring @ 2026-04-08 7:05 UTC (permalink / raw)
To: netdev; +Cc: edumazet, davem, kuba
Add ecmp_rehash.sh to exercise TCP ECMP path re-selection on
retransmission timeout. Three tests cover client SYN rehash, server
SYN/ACK rehash, and midstream RTO rehash of an established connection
over a two-path ECMP topology with one leg blocked by tc.
The SYN test makes 26 attempts (the initial SYN plus 25 retries), so the
probability of a false failure, where every attempt happens to pick the same
path, is ~(1/2)^25 ≈ 3e-8.
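As a sanity check on that figure, the arithmetic can be reproduced with a
short bash sketch. It assumes each SYN attempt independently picks one of the
two paths with probability 1/2, so the test fails spuriously only when all 25
retransmits land on the same path as the first SYN. The function name
false_failure_one_in is hypothetical.

```shell
#!/bin/bash
# Odds that every one of N SYN attempts hashes to the same one of two
# equally likely ECMP paths: (1/2)^(N-1), printed as the X in "1 in X".
false_failure_one_in()
{
	local attempts=$1

	echo $(( 1 << (attempts - 1) ))
}

# tcp_syn_retries=25 gives 1 initial SYN + 25 retransmits = 26 attempts.
attempts=26
printf 'attempts=%d false failure ~ 1 in %d\n' \
	"$attempts" "$(false_failure_one_in "$attempts")"

# The same value as a decimal fraction.
awk -v n="$attempts" 'BEGIN { printf "%.2e\n", 0.5 ^ (n - 1) }'
```

For 26 attempts this works out to 1 in 33554432, or about 2.98e-08.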
Signed-off-by: Neil Spring <ntspring@meta.com>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/ecmp_rehash.sh | 361 +++++++++++++++++++++
2 files changed, 362 insertions(+)
create mode 100755 tools/testing/selftests/net/ecmp_rehash.sh
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 6bced3ed798b..acc61a51d7e2 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -25,6 +25,7 @@ TEST_PROGS := \
cmsg_time.sh \
double_udp_encap.sh \
drop_monitor_tests.sh \
+ ecmp_rehash.sh \
fcnal-ipv4.sh \
fcnal-ipv6.sh \
fcnal-other.sh \
diff --git a/tools/testing/selftests/net/ecmp_rehash.sh b/tools/testing/selftests/net/ecmp_rehash.sh
new file mode 100755
index 000000000000..a468ccf22d4f
--- /dev/null
+++ b/tools/testing/selftests/net/ecmp_rehash.sh
@@ -0,0 +1,361 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test ECMP path re-selection on TCP retransmission timeout.
+#
+# Two namespaces connected by two parallel veth pairs with a 2-way ECMP
+# route. When a TCP path is blocked (via tc drop), RTO triggers
+# sk_rethink_txhash() + sk_dst_reset(), causing the next route lookup
+# to select the other ECMP path.
+#
+# False negative: ~(1/2)^25 ≈ 3e-8. With tcp_syn_retries=25 and
+# tcp_syn_linear_timeouts=25 there are 26 SYN attempts at 1-second
+# intervals, each choosing one of 2 paths uniformly.
+
+source lib.sh
+
+SUBNETS=(a b)
+PORT=9900
+
+ALL_TESTS="
+ test_ecmp_rto_rehash
+ test_ecmp_synack_rehash
+ test_ecmp_midstream_rehash
+"
+
+link_tx_packets_get()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" cat "/sys/class/net/$dev/statistics/tx_packets"
+}
+
+# Return the number of packets matched by the tc filter action on a device.
+# When tc drops packets via "action drop", the device's tx_packets is not
+# incremented (packet never reaches veth_xmit), but the tc action maintains
+# its own counter.
+tc_filter_pkt_count()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc -s filter show dev "$dev" parent 1: 2>/dev/null |
+ awk '/Sent .* pkt/ { for (i=1;i<=NF;i++) if ($i=="pkt") { print $(i-1); exit } }'
+}
+
+# Read TcpTimeoutRehash counter from /proc/net/netstat in a namespace.
+# This counter increments in tcp_write_timeout() on every RTO that triggers
+# sk_rethink_txhash().
+get_timeout_rehash_count()
+{
+ local ns=$1; shift
+
+ ip netns exec "$ns" awk '
+ /^TcpExt:/ {
+ if (!h) { split($0, n); h=1 }
+ else {
+ split($0, v)
+ for (i in n)
+ if (n[i] == "TcpTimeoutRehash") print v[i]
+ }
+ }
+ ' /proc/net/netstat
+}
+
+# Block TCP (IPv6 next-header = 6) egress, allowing ICMPv6 through.
+block_tcp()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc qdisc add dev "$dev" root handle 1: prio
+ ip netns exec "$ns" tc filter add dev "$dev" parent 1: \
+ protocol ipv6 prio 1 u32 match u8 0x06 0xff at 6 action drop
+}
+
+unblock_tcp()
+{
+ local ns=$1; shift
+ local dev=$1; shift
+
+ ip netns exec "$ns" tc qdisc del dev "$dev" root 2>/dev/null
+}
+
+# Return success when both devices have dropped at least one TCP packet.
+both_devs_attempted()
+{
+ local ns=$1; shift
+ local dev0=$1; shift
+ local dev1=$1; shift
+
+ local c0 c1
+ c0=$(tc_filter_pkt_count "$ns" "$dev0")
+ c1=$(tc_filter_pkt_count "$ns" "$dev1")
+ [ "${c0:-0}" -ge 1 ] && [ "${c1:-0}" -ge 1 ]
+}
+
+setup()
+{
+ setup_ns NS1 NS2
+
+ local ns
+ for ns in "$NS1" "$NS2"; do
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.accept_dad=0
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.default.accept_dad=0
+ ip netns exec "$ns" sysctl -qw net.ipv6.conf.all.forwarding=1
+ ip netns exec "$ns" sysctl -qw net.core.txrehash=1
+ done
+
+ local i sub
+ for i in 0 1; do
+ sub=${SUBNETS[$i]}
+ ip link add "veth${i}a" type veth peer name "veth${i}b"
+ ip link set "veth${i}a" netns "$NS1"
+ ip link set "veth${i}b" netns "$NS2"
+ ip -n "$NS1" addr add "fd00:${sub}::1/64" dev "veth${i}a"
+ ip -n "$NS2" addr add "fd00:${sub}::2/64" dev "veth${i}b"
+ ip -n "$NS1" link set "veth${i}a" up
+ ip -n "$NS2" link set "veth${i}b" up
+ done
+
+ ip -n "$NS1" addr add fd00:ff::1/128 dev lo
+ ip -n "$NS2" addr add fd00:ff::2/128 dev lo
+
+ # Allow many SYN retries at 1-second intervals (linear, no
+ # exponential backoff) so the rehash test has enough attempts
+ # to exercise both ECMP paths. tcp_syn_retries is a pure
+ # retransmission count (not time-based for SYN_SENT), so with
+ # linear 1-second intervals it also sets the total lifetime.
+ ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_syn_retries=25
+ ip netns exec "$NS1" sysctl -qw net.ipv4.tcp_syn_linear_timeouts=25
+
+ # Keep the server's request socket alive during the blocking
+ # period so SYN/ACK retransmits continue.
+ ip netns exec "$NS2" sysctl -qw net.ipv4.tcp_synack_retries=10
+
+ ip -n "$NS1" -6 route add fd00:ff::2/128 \
+ nexthop via fd00:a::2 dev veth0a \
+ nexthop via fd00:b::2 dev veth1a
+
+ ip -n "$NS2" -6 route add fd00:ff::1/128 \
+ nexthop via fd00:a::1 dev veth0b \
+ nexthop via fd00:b::1 dev veth1b
+
+ for i in 0 1; do
+ sub=${SUBNETS[$i]}
+ ip netns exec "$NS1" \
+ ping -6 -c1 -W5 "fd00:${sub}::2" &>/dev/null
+ ip netns exec "$NS2" \
+ ping -6 -c1 -W5 "fd00:${sub}::1" &>/dev/null
+ done
+
+ if ! ip netns exec "$NS1" ping -6 -c1 -W5 fd00:ff::2 &>/dev/null; then
+ echo "Basic connectivity check failed"
+ return $ksft_skip
+ fi
+}
+
+# Block ALL paths, start a connection, wait until SYNs have been dropped
+# on both interfaces (proving rehash steered the SYN to a new path), then
+# unblock so the connection completes.
+test_ecmp_rto_rehash()
+{
+ RET=0
+
+ block_tcp "$NS1" veth0a
+ defer unblock_tcp "$NS1" veth0a
+ block_tcp "$NS1" veth1a
+ defer unblock_tcp "$NS1" veth1a
+
+ ip netns exec "$NS2" socat \
+ "TCP6-LISTEN:$PORT,bind=[fd00:ff::2],reuseaddr,fork" \
+ EXEC:"echo ESTABLISH_OK" &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $PORT tcp
+
+ local rehash_before
+ rehash_before=$(get_timeout_rehash_count "$NS1")
+
+ # Start the connection in the background; it will retry SYNs at
+ # 1-second intervals until an unblocked path is found.
+ # Use -u (unidirectional) to only receive from the server;
+ # sending data back would risk SIGPIPE if the server's EXEC
+ # child has already exited.
+ ip netns exec "$NS1" socat -u \
+ "TCP6:[fd00:ff::2]:$PORT,bind=[fd00:ff::1],connect-timeout=60" \
+ STDOUT >"/tmp/ecmp_rto_$$" 2>&1 &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ # Wait until both paths have seen at least one dropped SYN.
+ # This proves sk_rethink_txhash() rehashed the connection from
+ # one ECMP path to the other.
+ slowwait 30 both_devs_attempted "$NS1" veth0a veth1a
+ check_err $? "SYNs did not appear on both paths (rehash not working)"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP RTO rehash: establish with blocked paths"
+ return
+ fi
+
+ # Unblock both paths and let the next SYN retransmit succeed.
+ unblock_tcp "$NS1" veth0a
+ unblock_tcp "$NS1" veth1a
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ local result
+ result=$(cat "/tmp/ecmp_rto_$$" 2>/dev/null)
+ rm -f "/tmp/ecmp_rto_$$"
+
+ if [[ "$result" != *"ESTABLISH_OK"* ]]; then
+ check_err 1 "connection failed after unblocking (rc=$rc): $result"
+ fi
+
+ local rehash_after
+ rehash_after=$(get_timeout_rehash_count "$NS1")
+ if [ "$rehash_after" -le "$rehash_before" ]; then
+ check_err 1 "TcpTimeoutRehash counter did not increment"
+ fi
+
+ log_test "ECMP RTO rehash: establish with blocked paths"
+}
+
+# Block the server's return paths so SYN/ACKs are dropped. The client
+# retransmits SYNs at 1-second intervals; each duplicate SYN arriving at
+# the server updates ir_iif to match the new arrival interface, so the
+# retransmitted SYN/ACK routes back via the interface the SYN arrived on.
+test_ecmp_synack_rehash()
+{
+ RET=0
+ local port=$((PORT + 2))
+
+ block_tcp "$NS2" veth0b
+ defer unblock_tcp "$NS2" veth0b
+ block_tcp "$NS2" veth1b
+ defer unblock_tcp "$NS2" veth1b
+
+ ip netns exec "$NS2" socat \
+ "TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr,fork" \
+ EXEC:"echo SYNACK_OK" &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $port tcp
+
+ # Start the connection; SYNs reach the server (client egress is
+ # open) but SYN/ACKs are dropped on the server's return path.
+ ip netns exec "$NS1" socat -u \
+ "TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1],connect-timeout=60" \
+ STDOUT >"/tmp/ecmp_synack_$$" 2>&1 &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ # Wait until both server-side interfaces have dropped at least
+ # one SYN/ACK, proving the server rehashed its return path.
+ slowwait 30 both_devs_attempted "$NS2" veth0b veth1b
+ check_err $? "SYN/ACKs did not appear on both return paths"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP SYN/ACK rehash: blocked return path"
+ return
+ fi
+
+ # Unblock and let the connection complete.
+ unblock_tcp "$NS2" veth0b
+ unblock_tcp "$NS2" veth1b
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ local result
+ result=$(cat "/tmp/ecmp_synack_$$" 2>/dev/null)
+ rm -f "/tmp/ecmp_synack_$$"
+
+ if [[ "$result" != *"SYNACK_OK"* ]]; then
+ check_err 1 "connection failed after unblocking (rc=$rc): $result"
+ fi
+
+ log_test "ECMP SYN/ACK rehash: blocked return path"
+}
+
+# Establish a data transfer with both paths open, then block the
+# active path. Verify the transfer continues via rehash and that
+# TcpTimeoutRehash incremented.
+test_ecmp_midstream_rehash()
+{
+ RET=0
+ local port=$((PORT + 1))
+
+ ip netns exec "$NS2" socat -u \
+ "TCP6-LISTEN:$port,bind=[fd00:ff::2],reuseaddr" - >/dev/null &
+ defer kill_process $!
+
+ wait_local_port_listen "$NS2" $port tcp
+
+ local base_tx0 base_tx1
+ base_tx0=$(link_tx_packets_get "$NS1" veth0a)
+ base_tx1=$(link_tx_packets_get "$NS1" veth1a)
+
+ ip netns exec "$NS1" bash -c "
+ for i in \$(seq 1 40); do
+ dd if=/dev/zero bs=10k count=1 2>/dev/null
+ sleep 0.25
+ done | timeout 60 socat - 'TCP6:[fd00:ff::2]:$port,bind=[fd00:ff::1]'
+ " &>/dev/null &
+ local client_pid=$!
+ defer kill_process $client_pid
+
+ busywait $BUSYWAIT_TIMEOUT until_counter_is \
+ ">= $((base_tx0 + base_tx1 + 20))" \
+ link_tx_packets_total "$NS1"
+ check_err $? "no TX activity detected"
+ if [ $RET -ne 0 ]; then
+ log_test "ECMP midstream rehash: block active path"
+ return
+ fi
+
+ # Find the active path and block it.
+ local cur0 cur1 active_idx
+ cur0=$(link_tx_packets_get "$NS1" veth0a)
+ cur1=$(link_tx_packets_get "$NS1" veth1a)
+ if [ $((cur0 - base_tx0)) -ge $((cur1 - base_tx1)) ]; then
+ active_idx=0
+ else
+ active_idx=1
+ fi
+
+ local rehash_before
+ rehash_before=$(get_timeout_rehash_count "$NS1")
+
+ block_tcp "$NS1" "veth${active_idx}a"
+ defer unblock_tcp "$NS1" "veth${active_idx}a"
+
+ local rc=0
+ wait $client_pid || rc=$?
+
+ check_err $rc "data transfer failed after blocking veth${active_idx}a"
+
+ local rehash_after
+ rehash_after=$(get_timeout_rehash_count "$NS1")
+ if [ "$rehash_after" -le "$rehash_before" ]; then
+ check_err 1 "TcpTimeoutRehash counter did not increment"
+ fi
+
+ log_test "ECMP midstream rehash: block active path"
+}
+
+link_tx_packets_total()
+{
+ local ns=$1; shift
+
+ echo $(( $(link_tx_packets_get "$ns" veth0a) +
+ $(link_tx_packets_get "$ns" veth1a) ))
+}
+
+require_command socat
+
+trap cleanup_all_ns EXIT
+setup || exit $?
+tests_run
+exit $EXIT_STATUS
--
2.52.0
* Re: [PATCH net-next v2 1/2] tcp: rehash onto different ECMP path on retransmit timeout
2026-04-08 7:05 ` [PATCH net-next v2 1/2] " Neil Spring
@ 2026-04-08 7:12 ` Eric Dumazet
0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2026-04-08 7:12 UTC (permalink / raw)
To: Neil Spring; +Cc: netdev, davem, kuba
On Wed, Apr 8, 2026 at 12:05 AM Neil Spring <ntspring@meta.com> wrote:
>
> Add sk_dst_reset() alongside sk_rethink_txhash() in the RTO, PLB,
> and spurious-retrans paths so that the next transmit triggers a fresh
> route lookup. Propagate sk_txhash into fl6->mp_hash in
> inet6_csk_route_req() and inet6_csk_route_socket() so
> fib6_select_path() uses the socket's current hash for ECMP selection.
>
> The ir_iif update in tcp_check_req() covers both IPv4 and IPv6
> because it was cleaner than gating on address family; IPv4 is
> otherwise unaltered, and not having autoflowlabel in IPv4 means
> I wouldn't expect a new path on timeout.
>
> It is possible that PLB does not need this (that there are other
> methods of reacting to local congestion); I added the sk_dst_reset
> for consistency.
>
> Signed-off-by: Neil Spring <ntspring@meta.com>
Please wait ~24 hours before sending a new version; see
Documentation/process/maintainer-netdev.rst
> ---
> net/ipv4/tcp_input.c | 4 +++-
> net/ipv4/tcp_minisocks.c | 13 +++++++++++++
> net/ipv4/tcp_plb.c | 1 +
> net/ipv4/tcp_timer.c | 1 +
> net/ipv6/inet6_connection_sock.c | 11 +++++++++++
> 5 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 7171442c3ed7..3d42ab45066c 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5014,8 +5014,10 @@ static void tcp_rcv_spurious_retrans(struct sock *sk,
> skb->protocol == htons(ETH_P_IPV6) &&
> (tcp_sk(sk)->inet_conn.icsk_ack.lrcv_flowlabel !=
> ntohl(ip6_flowlabel(ipv6_hdr(skb)))) &&
> - sk_rethink_txhash(sk))
> + sk_rethink_txhash(sk)) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDUPLICATEDATAREHASH);
> + sk_dst_reset(sk);
> + }
>
> /* Save last flowlabel after a spurious retrans. */
> tcp_save_lrcv_flowlabel(sk, skb);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 199f0b579e89..27edf71effc2 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -750,6 +750,19 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
> * Reset timer after retransmitting SYNACK, similar to
> * the idea of fast retransmit in recovery.
> */
> +
> + /* Update ir_iif to match the interface the retransmitted
> + * SYN arrived on; inet6_csk_route_req() uses this as
> + * flowi6_oif, constraining ECMP path for the SYN/ACK.
> + */
Please put this part in a separate patch.
Explain why it is important that the SYN/ACK is sent on the same interface.
What happens if it is not?
Why are we keeping the tcp_v6_init_req() part? Would it become dead code?
/* So that link locals have meaning */
if ((!sk_listener->sk_bound_dev_if || l3_slave) &&
ipv6_addr_type(&ireq->ir_v6_rmt_addr) & IPV6_ADDR_LINKLOCAL)
ireq->ir_iif = tcp_v6_iif(skb);
> +#if IS_ENABLED(CONFIG_IPV6)
> + if (sk->sk_family == AF_INET6)
> + inet_rsk(req)->ir_iif = tcp_v6_iif(skb);
> + else
> +#endif
> + inet_rsk(req)->ir_iif =
> + inet_request_bound_dev_if(sk, skb);
> +
> if (!tcp_oow_rate_limited(sock_net(sk), skb,
> LINUX_MIB_TCPACKSKIPPEDSYNRECV,
> &tcp_rsk(req)->last_oow_ack_time)) {
> diff --git a/net/ipv4/tcp_plb.c b/net/ipv4/tcp_plb.c
> index 68ccdb9a5412..d7cc00a58e53 100644
> --- a/net/ipv4/tcp_plb.c
> +++ b/net/ipv4/tcp_plb.c
> @@ -79,6 +79,7 @@ void tcp_plb_check_rehash(struct sock *sk, struct tcp_plb_state *plb)
> return;
>
> sk_rethink_txhash(sk);
> + sk_dst_reset(sk);
> plb->consec_cong_rounds = 0;
> tcp_sk(sk)->plb_rehash++;
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPPLBREHASH);
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index ea99988795e7..acc22fc532c2 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -299,6 +299,7 @@ static int tcp_write_timeout(struct sock *sk)
> if (sk_rethink_txhash(sk)) {
> tp->timeout_rehash++;
> __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPTIMEOUTREHASH);
> + sk_dst_reset(sk);
> }
>
> return 0;
> diff --git a/net/ipv6/inet6_connection_sock.c b/net/ipv6/inet6_connection_sock.c
> index 37534e116899..3fd7acbe2c49 100644
> --- a/net/ipv6/inet6_connection_sock.c
> +++ b/net/ipv6/inet6_connection_sock.c
> @@ -14,6 +14,7 @@
> #include <linux/ipv6.h>
> #include <linux/jhash.h>
> #include <linux/slab.h>
> +#include <linux/tcp.h>
>
> #include <net/addrconf.h>
> #include <net/inet_connection_sock.h>
> @@ -48,6 +49,12 @@ struct dst_entry *inet6_csk_route_req(const struct sock *sk,
> fl6->flowi6_uid = sk_uid(sk);
> security_req_classify_flow(req, flowi6_to_flowi_common(fl6));
>
> + /* Use the request socket's txhash (re-rolled by tcp_rtx_synack())
> + * for ECMP path selection; >> 1 for 31-bit mp_hash range.
> + */
> + if (tcp_rsk(req)->txhash)
> + fl6->mp_hash = tcp_rsk(req)->txhash >> 1;
> +
> if (!dst) {
> dst = ip6_dst_lookup_flow(sock_net(sk), sk, fl6, final_p);
> if (IS_ERR(dst))
> @@ -70,6 +77,10 @@ struct dst_entry *inet6_csk_route_socket(struct sock *sk,
> fl6->saddr = np->saddr;
> fl6->flowlabel = np->flow_label;
> IP6_ECN_flow_xmit(sk, fl6->flowlabel);
> +
> + /* >> 1 for 31-bit mp_hash range matching nhc_upper_bound. */
> + if (sk->sk_txhash)
Why is the test needed?
Note that because of the right shift, sk_txhash == 1 would become 0 anyway.
> + fl6->mp_hash = sk->sk_txhash >> 1;
> fl6->flowi6_oif = sk->sk_bound_dev_if;
> fl6->flowi6_mark = sk->sk_mark;
> fl6->fl6_sport = inet->inet_sport;
> --
> 2.52.0
>
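The right-shift remark above can be checked with a small bash sketch. It
models the guarded assignment from the patch on plain integers;
guarded_mp_hash is a hypothetical helper, and treating a zero mp_hash as
"not set" is an assumption about how the later route lookup behaves, not
something confirmed in this thread.

```shell
#!/bin/bash
# Model of:  if (sk->sk_txhash) fl6->mp_hash = sk->sk_txhash >> 1;
# Prints the resulting mp_hash, with 0 standing for "left unset".
guarded_mp_hash()
{
	local txhash=$1
	local mp_hash=0

	if [ "$(( txhash ))" -ne 0 ]; then
		mp_hash=$(( (txhash & 0xffffffff) >> 1 ))
	fi
	echo "$mp_hash"
}

for txhash in 0 1 2 0xffffffff; do
	printf 'txhash=%-10s mp_hash=%s\n' \
		"$txhash" "$(guarded_mp_hash "$txhash")"
done
```

Both txhash=0 and txhash=1 end up with mp_hash=0 here, so the non-zero test
on sk_txhash does not by itself guarantee a non-zero mp_hash.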
end of thread (~2026-04-08 7:13 UTC)
Thread overview: 9+ messages
2026-04-08 0:28 [PATCH net-next 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2026-04-08 0:28 ` [PATCH net-next 1/2] " Neil Spring
2026-04-08 1:09 ` Eric Dumazet
2026-04-08 6:59 ` Neil Spring
2026-04-08 0:28 ` [PATCH net-next 2/2] selftests: net: add ECMP rehash test Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 0/2] tcp: rehash onto different ECMP path on retransmit timeout Neil Spring
2026-04-08 7:05 ` [PATCH net-next v2 1/2] " Neil Spring
2026-04-08 7:12 ` Eric Dumazet
2026-04-08 7:05 ` [PATCH net-next v2 2/2] selftests: net: add ECMP rehash test Neil Spring