[PATCH net 0/2] Fix race condition between TCP

Netdev List
 help / color / mirror / Atom feed

* [PATCH net 0/2] Fix race condition between TCP_REPAIR dump and data receive
@ 2026-05-17 18:41 Stefano Brivio
  2026-05-17 18:41 ` [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode Stefano Brivio
  2026-05-17 18:41 ` [PATCH net 2/2] selftests: Add data path tests for TCP_REPAIR mode Stefano Brivio
  0 siblings, 2 replies; 12+ messages in thread
From: Stefano Brivio @ 2026-05-17 18:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Kuniyuki Iwashima, Simon Horman, Shuah Khan

If we receive data on a socket that's in repair mode, the sequence and
contents of the receive queue we dump depend on the timing. We need to
freeze the input queue, otherwise the connection parameters restored
later might not match the actual state of the connection.

Patch 1/2 has the full details and the fix, patch 2/2 introduces
selftests to illustrate the problem and verify the solution.

Stefano Brivio (2):
  tcp: Don't accept data when socket is in repair mode
  selftests: Add data path tests for TCP_REPAIR mode

 include/net/dropreason-core.h                 |   3 +
 net/ipv4/tcp_input.c                          |  14 +-
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/net/tcp_repair/.gitignore       |   3 +
 .../testing/selftests/net/tcp_repair/Makefile |  23 ++
 .../testing/selftests/net/tcp_repair/client.c | 273 ++++++++++++++++++
 .../testing/selftests/net/tcp_repair/inner.sh |  32 ++
 .../testing/selftests/net/tcp_repair/outer.sh |  44 +++
 tools/testing/selftests/net/tcp_repair/run.sh |  12 +
 .../testing/selftests/net/tcp_repair/server.c | 155 ++++++++++
 tools/testing/selftests/net/tcp_repair/talk.h |  26 ++
 11 files changed, 585 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/tcp_repair/.gitignore
 create mode 100644 tools/testing/selftests/net/tcp_repair/Makefile
 create mode 100644 tools/testing/selftests/net/tcp_repair/client.c
 create mode 100755 tools/testing/selftests/net/tcp_repair/inner.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/outer.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/run.sh
 create mode 100644 tools/testing/selftests/net/tcp_repair/server.c
 create mode 100644 tools/testing/selftests/net/tcp_repair/talk.h

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 18:41 [PATCH net 0/2] Fix race condition between TCP_REPAIR dump and data receive Stefano Brivio
@ 2026-05-17 18:41 ` Stefano Brivio
  2026-05-17 19:05   ` Kuniyuki Iwashima
  2026-05-17 18:41 ` [PATCH net 2/2] selftests: Add data path tests for TCP_REPAIR mode Stefano Brivio
  1 sibling, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2026-05-17 18:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Kuniyuki Iwashima, Simon Horman, Shuah Khan

Once a socket enters repair mode (TCP_REPAIR socket option with
TCP_REPAIR_ON value), it's possible to dump the receive sequence
number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
(using TCP_REPAIR_QUEUE to select it).

If we receive data after the application fetched the sequence number
or saved the contents of the queue, though, the application will now
have outdated information, which defeats the whole functionality,
because this leads to gaps in sequence and data once they're restored
by the target instance of the application, resulting in a hanging or
otherwise non-functional TCP connection.

This type of race condition was discovered in the KubeVirt integration
of passt(1), using a remote iperf3 client connected to an iperf3
server running in the guest which is being migrated. The setup allows
traffic to reach the origin node hosting the guest during the
migration.

If passt dumps sequence number and contents of the queue *before*
further data is received and acknowledged to the peer by the kernel,
once the TCP data connection is migrated to the target node, the
remote client becomes unable to continue sending, because a portion
of the data it sent *and received an acknowledgement for* is now lost.

Schematically:

1. client --seq 1:100--> origin host --> passt --> guest --> server

2. client <--ACK: 100-- origin host

3. migration starts, passt enables repair mode, dumps the sequence
   number (101) and sends it to the target node of the guest migration

4. client --seq 101:201--> origin host (passt not receiving anymore)

5. client <--ACK: 201-- origin host

6. migration completes, and passt restores sequence number 101 on the
   migrated socket

7. client --seq 201:301--> target host (now seeing a sequence jump)

8. client <--ACK: 100-- target host

...and the connection can't recover anymore, because the client can't
resend data that was already (erroneously) acknowledged. We need to
avoid step 5. above.

This would equally affect CRIU (the other known user of TCP_REPAIR),
should data be received while the original container is frozen: the
sequence dumped and the contents of the saved incoming queue would
then depend on the timing.

The race condition is also illustrated in the kselftests introduced
by the next patch.

To prevent this issue, discard data received for a socket in repair
mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.

Fixes: ee9952831cfd ("tcp: Initial repair mode")
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 include/net/dropreason-core.h |  3 +++
 net/ipv4/tcp_input.c          | 14 +++++++++++++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
index 2f312d1f67d6..19ab9e6ffc33 100644
--- a/include/net/dropreason-core.h
+++ b/include/net/dropreason-core.h
@@ -9,6 +9,7 @@
 	FN(SOCKET_CLOSE)		\
 	FN(SOCKET_FILTER)		\
 	FN(SOCKET_RCVBUFF)		\
+	FN(SOCKET_REPAIR)		\
 	FN(UNIX_DISCONNECT)		\
 	FN(UNIX_SKIP_OOB)		\
 	FN(PKT_TOO_SMALL)		\
@@ -158,6 +159,8 @@ enum skb_drop_reason {
 	SKB_DROP_REASON_SOCKET_FILTER,
 	/** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
 	SKB_DROP_REASON_SOCKET_RCVBUFF,
+	/** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
+	SKB_DROP_REASON_SOCKET_REPAIR,
 	/**
 	 * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
 	 * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d5c9e65d9760..6eca34274f97 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  *	  or pure receivers (this means either the sequence number or the ack
  *	  value must stay constant)
  *	- Unexpected TCP option.
+ *	- Socket is in repair mode.
  *
  *	When these conditions are not satisfied it drops into a standard
  *	receive procedure patterned after RFC793 to handle all cases.
@@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)

 	if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
 	    TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
-	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
+	    !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
+	    !tp->repair) {
 		int tcp_header_len = tp->tcp_header_len;
 		s32 delta = 0;
 		int flag = 0;
@@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 		goto discard;
 	}

+	if (tp->repair) {
+		reason = SKB_DROP_REASON_SOCKET_REPAIR;
+		goto discard;
+	}
+
 	/*
 	 *	Standard slow path.
 	 */
@@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 	int queued = 0;
 	SKB_DR(reason);

+	if (tp->repair) {
+		SKB_DR_SET(reason, SOCKET_REPAIR);
+		goto discard;
+	}
+
 	switch (sk->sk_state) {
 	case TCP_CLOSE:
 		SKB_DR_SET(reason, TCP_CLOSE);
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH net 2/2] selftests: Add data path tests for TCP_REPAIR mode
  2026-05-17 18:41 [PATCH net 0/2] Fix race condition between TCP_REPAIR dump and data receive Stefano Brivio
  2026-05-17 18:41 ` [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode Stefano Brivio
@ 2026-05-17 18:41 ` Stefano Brivio
  1 sibling, 0 replies; 12+ messages in thread
From: Stefano Brivio @ 2026-05-17 18:41 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Kuniyuki Iwashima, Simon Horman, Shuah Khan

The new tests check that, once TCP_REPAIR is enabled on a socket:

- additional incoming data queued to it don't increase the sequence
  number dumped via TCP_QUEUE_SEQ socket option

- a connected endpoint will not receive ACK segments after sending
  more data

These tests are implemented by a client, sending data and commands
(accept connection, enter repair mode, dump sequences, receive data,
and exit) describing the test sequence, and a server receiving data
and implementing those commands.

This way, the client can accurately synchronise repair modes with
data exchanges.

In order to avoid using loopback connections, where data would be
immediately acknowledged by the server side without packets being
actually sent or received, the client resides in a separate network
namespace ("inner") compared to the server, and a veth interface pair
connects them.

The tests can be run unprivileged as the outer network and user
namespaces are also detached as a first step, so that the veth
interfaces can be created in outer and inner namespaces without
capabilities.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
 tools/testing/selftests/Makefile              |   1 +
 .../selftests/net/tcp_repair/.gitignore       |   3 +
 .../testing/selftests/net/tcp_repair/Makefile |  23 ++
 .../testing/selftests/net/tcp_repair/client.c | 273 ++++++++++++++++++
 .../testing/selftests/net/tcp_repair/inner.sh |  32 ++
 .../testing/selftests/net/tcp_repair/outer.sh |  44 +++
 tools/testing/selftests/net/tcp_repair/run.sh |  12 +
 .../testing/selftests/net/tcp_repair/server.c | 155 ++++++++++
 tools/testing/selftests/net/tcp_repair/talk.h |  26 ++
 9 files changed, 569 insertions(+)
 create mode 100644 tools/testing/selftests/net/tcp_repair/.gitignore
 create mode 100644 tools/testing/selftests/net/tcp_repair/Makefile
 create mode 100644 tools/testing/selftests/net/tcp_repair/client.c
 create mode 100755 tools/testing/selftests/net/tcp_repair/inner.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/outer.sh
 create mode 100755 tools/testing/selftests/net/tcp_repair/run.sh
 create mode 100644 tools/testing/selftests/net/tcp_repair/server.c
 create mode 100644 tools/testing/selftests/net/tcp_repair/talk.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 6e59b8f63e41..e119abe5c78e 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -84,6 +84,7 @@ TARGETS += net/packetdrill
 TARGETS += net/ppp
 TARGETS += net/rds
 TARGETS += net/tcp_ao
+TARGETS += net/tcp_repair
 TARGETS += nolibc
 TARGETS += nsfs
 TARGETS += pci_endpoint
diff --git a/tools/testing/selftests/net/tcp_repair/.gitignore b/tools/testing/selftests/net/tcp_repair/.gitignore
new file mode 100644
index 000000000000..9756c86770b9
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/.gitignore
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+client
+server
diff --git a/tools/testing/selftests/net/tcp_repair/Makefile b/tools/testing/selftests/net/tcp_repair/Makefile
new file mode 100644
index 000000000000..d01d0a20b6df
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# selftests/net/tcp_repair: TCP_REPAIR connection tests
+#
+# Makefile - Build test client and server, declare run.sh entrypoint
+#
+# Copyright (c) 2026 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+top_srcdir = ../../../../..
+
+CFLAGS += -Wall -Wextra -pedantic
+
+TEST_PROGS := \
+	run.sh
+
+TEST_GEN_FILES := \
+	client \
+	server \
+# end of TEST_GEN_FILES
+
+include ../../lib.mk
diff --git a/tools/testing/selftests/net/tcp_repair/client.c b/tools/testing/selftests/net/tcp_repair/client.c
new file mode 100644
index 000000000000..b5106bf922b1
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/client.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* selftests/net/tcp_repair: TCP_REPAIR connection tests
+ *
+ * client.c - Run list of tests, send commands and data to server
+ *
+ * Copyright (c) 2026 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+
+#include <linux/tcp.h>		/* latest and greatest struct tcp_info, but */
+#define SOL_TCP		6	/* we can't include netinet/tcp.h as a result */
+
+#include "talk.h"
+
+/**
+ * srv() - Send command to server, return received value (not for ACCEPT)
+ * @ctl:	Control socket
+ * @op:		Command type
+ * @arg:	Optional argument (always sent, might be zero)
+ *
+ * Return: integer value received by client as response
+ */
+int srv(int ctl, enum op op, int arg)
+{
+	int cmd[2] = { op, arg }, ret;
+
+	send(ctl, cmd, sizeof(cmd), 0);
+	if (op != ACCEPT)
+		recv(ctl, &ret, sizeof(ret), 0);
+
+	return ret;
+}
+
+/**
+ * test_seq_slow_path() - Sequence doesn't change after sending one byte
+ * @ctl:	Control socket
+ * @data:	Data socket
+ *
+ * Return: 0 if the test passes, -1 if it fails
+ */
+int test_seq_slow_path(int ctl, int data)
+{
+	uint32_t seq1, seq2;
+
+	(void)ctl;
+	(void)data;
+
+	srv(ctl, REPAIR, TCP_REPAIR_ON);
+	seq1 = (uint32_t)srv(ctl, DUMP_RECV_SEQ, 0);
+
+	send(data, (char *)("a"), 1, 0);
+
+	seq2 = (uint32_t)srv(ctl, DUMP_RECV_SEQ, 0);
+
+	if (seq1 != seq2) {
+		fprintf(stderr, "Sequence changed in repair mode, %u -> %u\n",
+			seq1, seq2);
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * test_seq_fast_path() - Sequence doesn't change after a large transfer
+ * @ctl:	Control socket
+ * @data:	Data socket
+ *
+ * Return: 0 if the test passes, -1 if it fails
+ */
+int test_seq_fast_path(int ctl, int data)
+{
+	char buf[1000] = { 0 };
+	uint32_t seq1, seq2;
+	int i;
+
+	(void)ctl;
+	(void)data;
+
+	for (i = 0; i < 50; i++) {
+		send(data, buf, sizeof(buf), 0);
+		srv(ctl, RECV, sizeof(buf));
+	}
+
+	srv(ctl, REPAIR, TCP_REPAIR_ON);
+	seq1 = (uint32_t)srv(ctl, DUMP_RECV_SEQ, 0);
+
+	fcntl(data, F_SETFL, O_NONBLOCK);
+	for (i = 0; i < 50; i++)
+		send(data, buf, sizeof(buf), 0);
+
+	seq2 = (uint32_t)srv(ctl, DUMP_RECV_SEQ, 0);
+
+	if (seq1 != seq2) {
+		fprintf(stderr, "Sequence changed in repair mode, %u -> %u\n",
+			seq1, seq2);
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * test_acked_slow_path() - Our ACK sequence doesn't change after sending byte
+ * @ctl:	Control socket
+ * @data:	Data socket
+ *
+ * Return: 0 if the test passes, -1 if it fails
+ */
+int test_acked_slow_path(int ctl, int data)
+{
+	unsigned long acked1, acked2;
+	struct tcp_info tinfo;
+	socklen_t sl;
+
+	(void)ctl;
+	(void)data;
+
+	srv(ctl, REPAIR, TCP_REPAIR_ON);
+
+	sl = sizeof(tinfo);
+	getsockopt(data, SOL_TCP, TCP_INFO, &tinfo, &sl);
+	acked1 = tinfo.tcpi_bytes_acked;
+
+	send(data, (char *)("a"), 1, 0);
+
+	getsockopt(data, SOL_TCP, TCP_INFO, &tinfo, &sl);
+	acked2 = tinfo.tcpi_bytes_acked;
+
+	if (acked1 != acked2) {
+		fprintf(stderr, "ACK received in repair mode, %lu -> %lu\n",
+			acked1, acked2);
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * test_acked_fast_path() - Our ACK sequence doesn't change after large transfer
+ * @ctl:	Control socket
+ * @data:	Data socket
+ *
+ * Return: 0 if the test passes, -1 if it fails
+ */
+int test_acked_fast_path(int ctl, int data)
+{
+	unsigned long acked1, acked2;
+	char buf[1000] = { 0 };
+	struct tcp_info tinfo;
+	socklen_t sl;
+	int i;
+
+	(void)ctl;
+	(void)data;
+
+	for (i = 0; i < 50; i++) {
+		send(data, buf, sizeof(buf), 0);
+		srv(ctl, RECV, sizeof(buf));
+	}
+
+	srv(ctl, REPAIR, TCP_REPAIR_ON);
+
+	sl = sizeof(tinfo);
+	getsockopt(data, SOL_TCP, TCP_INFO, &tinfo, &sl);
+	acked1 = tinfo.tcpi_bytes_acked;
+
+	fcntl(data, F_SETFL, O_NONBLOCK);
+	for (i = 0; i < 50; i++)
+		send(data, buf, sizeof(buf), 0);
+
+	getsockopt(data, SOL_TCP, TCP_INFO, &tinfo, &sl);
+	acked2 = tinfo.tcpi_bytes_acked;
+
+	if (acked1 != acked2) {
+		fprintf(stderr, "ACK received in repair mode, %lu -> %lu\n",
+			acked1, acked2);
+		return -1;
+	}
+
+	return 0;
+}
+
+/**
+ * struct test - List of tests
+ * @desc:	Test description
+ * @f:		Function executing the test
+ */
+struct {
+	char *desc;
+	int (*f)(int ctl, int data);
+} test[] = {
+	{
+		"Sequence freezes in repair mode, slow path TCP input",
+		test_seq_slow_path,
+	},
+	{
+		"Sequence freezes in repair mode, fast path TCP input",
+		test_seq_fast_path,
+	},
+	{
+		"No ACKs in repair mode, slow path TCP input",
+		test_acked_slow_path,
+	},
+	{
+		"No ACKs in repair mode, fast path TCP input",
+		test_acked_fast_path,
+	},
+};
+
+/**
+ * main() - Entry point, connect control socket to server and run list of tests
+ * @argc:	Argument count, must be 3 (two options)
+ * @argv:	Options: server address and port
+ *
+ * Return: -1 on bad usage, 0 on success, 1 if at least one test fails
+ */
+int main(int argc, char **argv)
+{
+	struct addrinfo hints = { 0, AF_UNSPEC, SOCK_STREAM, 0, 0,
+				  NULL, NULL, NULL };
+	int ctl, data, ret = 0;
+	struct addrinfo *r;
+	unsigned i;
+
+	if (argc != 3) {
+		fprintf(stderr, "%s DST_ADDR DST_PORT\n", argv[0]);
+		return -1;
+	}
+
+	getaddrinfo(argv[1], argv[2], &hints, &r);
+
+	ctl = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
+	setsockopt(ctl, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
+	connect(ctl, r->ai_addr, r->ai_addrlen);
+
+	for (i = 0; i < sizeof(test) / sizeof(test[0]); i++) {
+		int rc;
+
+		data = socket(r->ai_family, SOCK_STREAM, IPPROTO_TCP);
+		setsockopt(data, SOL_SOCKET, SO_REUSEADDR,
+			   &((int){ 1 }), sizeof(int));
+		srv(ctl, ACCEPT, 0);
+		connect(data, r->ai_addr, r->ai_addrlen);
+
+		rc = test[i].f(ctl, data);
+
+		close(data);
+
+		if (rc) {
+			fprintf(stdout, "TEST: %-60s  [FAIL]\n", test[i].desc);
+			ret = 1;
+		} else {
+			fprintf(stdout, "TEST: %-60s  [ OK ]\n", test[i].desc);
+		}
+	}
+
+	return ret;
+}
diff --git a/tools/testing/selftests/net/tcp_repair/inner.sh b/tools/testing/selftests/net/tcp_repair/inner.sh
new file mode 100755
index 000000000000..3987dc0514a8
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/inner.sh
@@ -0,0 +1,32 @@
+#!/bin/sh -euf
+# SPDX-License-Identifier: GPL-2.0
+#
+# selftests/net/tcp_repair: TCP_REPAIR connection tests
+#
+# inner.sh - Set up link to outer namespace, run test client in inner namespace
+#
+# Copyright (c) 2026 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+ns_inner_dir="${1}"
+
+# Tell the parent shell about our PID
+echo "${$}" > "${ns_inner_dir}/pid"
+mkdir "${ns_inner_dir}/pid_ready"
+
+# Wait for veth to appear
+while [ -z "$(sed -n '4p' /proc/net/dev)" ]; do
+	sleep 0.1 || sleep 1
+done
+
+# Set up link to outer namespace
+ip link set dev veth0 up
+ip addr add 169.254.2.2 dev veth0
+ip ro add default dev veth0
+
+# Finally run tests
+set +e
+./client 169.254.2.1 1024
+echo "${?}" > "${ns_inner_dir}/result"
+mkdir "${ns_inner_dir}/result_ready"
diff --git a/tools/testing/selftests/net/tcp_repair/outer.sh b/tools/testing/selftests/net/tcp_repair/outer.sh
new file mode 100755
index 000000000000..17ae1230f0e5
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/outer.sh
@@ -0,0 +1,44 @@
+#!/bin/sh -euf
+# SPDX-License-Identifier: GPL-2.0
+#
+# selftests/net/tcp_repair: TCP_REPAIR connection tests
+#
+# outer.sh - Set up outer namespace, run test server there
+#
+# Copyright (c) 2026 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+ns_inner_dir="$(mktemp -d)"
+
+cleanup() {
+	rm -rf "${ns_inner_dir}"
+}
+
+trap cleanup EXIT
+
+# Detach inner namespace in a subshell, tests start from there
+unshare -rUn -- ./inner.sh "${ns_inner_dir}" &
+
+# Wait for inner namespace
+while [ ! -d "${ns_inner_dir}/pid_ready" ]; do
+	sleep 0.1 || sleep 1
+done
+
+# Set up link to inner namespace
+ip link add veth0 type veth peer name veth0 netns "$(cat "${ns_inner_dir}/pid")"
+ip link set dev veth0 up
+ip addr add 169.254.2.1 dev veth0
+ip ro add default dev veth0
+
+# Run test server
+./server 1024
+
+# Wait for test results
+while [ ! -d "${ns_inner_dir}/result_ready" ]; do
+	sleep 0.1 || sleep 1
+done
+
+# Clean up and return results
+ret="$(cat "${ns_inner_dir}/result")"
+exit "${ret}"
diff --git a/tools/testing/selftests/net/tcp_repair/run.sh b/tools/testing/selftests/net/tcp_repair/run.sh
new file mode 100755
index 000000000000..f87ff0a8a6f6
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/run.sh
@@ -0,0 +1,12 @@
+#!/bin/sh -euf
+# SPDX-License-Identifier: GPL-2.0
+#
+# selftests/net/tcp_repair: TCP_REPAIR connection tests
+#
+# run.sh - Test entry point: detach outer namespace and run outer.sh in it
+#
+# Copyright (c) 2026 Red Hat GmbH
+#
+# Author: Stefano Brivio <sbrivio@redhat.com>
+
+unshare -rUn -- ./outer.sh
diff --git a/tools/testing/selftests/net/tcp_repair/server.c b/tools/testing/selftests/net/tcp_repair/server.c
new file mode 100644
index 000000000000..256c321320b7
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/server.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* selftests/net/tcp_repair: TCP_REPAIR connection tests
+ *
+ * server.c - Receive commands and data, set TCP_REPAIR options on data socket
+ *
+ * Copyright (c) 2026 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+#include <errno.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <arpa/inet.h>
+
+#include <linux/tcp.h>		/* needed for TCP_REPAIR constants but */
+#define SOL_TCP		6	/* we can't include netinet/tcp.h as a result */
+
+#include "talk.h"
+
+/**
+ * cmd_accept() - Accept data connection (must be first command in test)
+ * @unused:	Not used
+ * @listen:	Listening socket
+ * @data:	Return value from accept(), set on return
+ *
+ * Return: 0
+ */
+int cmd_accept(int unused, int listen, int *data)
+{
+	(void)unused;
+
+	if (*data != -1)
+		close(*data);
+
+	*data = accept(listen, NULL, NULL);
+
+	return 0;
+}
+
+/**
+ * cmd_dump_recv_seq() - Dump receive sequence of data socket
+ * @unused:	Not used
+ * @unused2:	Not used
+ * @data:	File descriptor for data socket
+ *
+ * Return: receive sequence of data socket
+ */
+int cmd_dump_recv_seq(int unused, int unused2, int *data)
+{
+	socklen_t sl;
+	int v;
+
+	(void)unused;
+	(void)unused2;
+
+	v = TCP_RECV_QUEUE;
+	setsockopt(*data, SOL_TCP, TCP_REPAIR_QUEUE, &v, sizeof(v));
+
+	sl = sizeof(v);
+	getsockopt(*data, SOL_TCP, TCP_QUEUE_SEQ, &v, &sl);
+	return v;
+}
+
+/**
+ * cmd_exit() - Exit successfully
+ * @unused:	Not used
+ * @unused2:	Not used
+ * @unused3:	Not used
+ *
+ * Return: this function doesn't actually return
+ */
+int cmd_exit(int unused, int unused2, int *unused3)
+{
+	(void)unused;
+	(void)unused2;
+	(void)unused3;
+
+	exit(EXIT_SUCCESS);
+	return 0;
+}
+
+/**
+ * cmd_recv() - Receive (discard) a given amount of bytes
+ * @len:	Amount of bytes the client wants us to receive
+ * @unused:	Not used
+ * @data:	File descriptor for data socket
+ *
+ * Return: return code from recv()
+ */
+int cmd_recv(int len, int unused, int *data)
+{
+	(void)unused;
+
+	return recv(*data, NULL, len, MSG_TRUNC);
+}
+
+/**
+ * cmd_repair() - Set repair mode to mode supplied by client
+ * @mode:	Value for socket option provided by the client
+ * @unused:	Not used
+ * @data:	File descriptor for data socket
+ *
+ * Return: return code from setsockopt()
+ */
+int cmd_repair(int mode, int unused, int *data)
+{
+	(void)unused;
+
+	return setsockopt(*data, SOL_TCP, TCP_REPAIR, &mode, sizeof(mode));
+}
+
+/* List of commands and their handlers */
+int (*fn[])(int arg, int listen, int *data) = {
+	[ACCEPT]	= cmd_accept,
+	[DUMP_RECV_SEQ]	= cmd_dump_recv_seq,
+	[EXIT]		= cmd_exit,
+	[RECV]		= cmd_recv,
+	[REPAIR]	= cmd_repair,
+};
+
+/**
+ * main() - Entry point, accept control connection and dispatch commands
+ * @argc:	Argument count, must be 2 (one option)
+ * @argv:	Options: server port
+ *
+ * Return: 0 on success, exit on failure
+ */
+int main(int argc, char **argv)
+{
+	struct sockaddr_in a = { AF_INET, htons(atoi(argv[1])), { 0 }, { 0 } };
+	int s, ctl, data = -1, cmd[2];
+
+	s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
+	setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &((int){ 1 }), sizeof(int));
+
+	if (argc != 2) {
+		fprintf(stderr, "%s PORT\n", argv[0]);
+		exit(EXIT_FAILURE);
+	}
+
+	bind(s, (struct sockaddr *)&a, sizeof(a));
+	listen(s, 0);
+	ctl = accept(s, NULL, NULL);
+
+	while (recv(ctl, cmd, sizeof(cmd), 0) == sizeof(cmd)) {
+		int ret = fn[cmd[0]](cmd[1], s, &data);
+		if (cmd[0] != ACCEPT)
+			send(ctl, &ret, sizeof(ret), 0);
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/net/tcp_repair/talk.h b/tools/testing/selftests/net/tcp_repair/talk.h
new file mode 100644
index 000000000000..e2fbad7fae07
--- /dev/null
+++ b/tools/testing/selftests/net/tcp_repair/talk.h
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* selftests/net/tcp_repair: TCP_REPAIR connection tests
+ *
+ * talk.h - Communication protocol for client and server
+ *
+ * Copyright (c) 2026 Red Hat GmbH
+ *
+ * Author: Stefano Brivio <sbrivio@redhat.com>
+ */
+
+/**
+ * enum op - Server command type (taking optional int argument, returning int)
+ * @ACCEPT		Accept connection on data socket (doesn't return int)
+ * @DUMP_RECV_SEQ	Dump receive sequence, return it to the client
+ * @EXIT		Exit, return 0 to the client
+ * @RECV		Try receiving given amount of bytes, return received
+ * @REPAIR		Set repair mode to argument, return setsockopt() value
+ */
+enum op {
+	ACCEPT,
+	DUMP_RECV_SEQ,
+	EXIT,
+	RECV,
+	REPAIR,
+};
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 18:41 ` [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode Stefano Brivio
@ 2026-05-17 19:05   ` Kuniyuki Iwashima
  2026-05-17 19:53     ` Stefano Brivio
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-17 19:05 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan

On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> Once a socket enters repair mode (TCP_REPAIR socket option with
> TCP_REPAIR_ON value), it's possible to dump the receive sequence
> number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> (using TCP_REPAIR_QUEUE to select it).
>
> If we receive data after the application fetched the sequence number
> or saved the contents of the queue, though, the application will now
> have outdated information, which defeats the whole functionality,
> because this leads to gaps in sequence and data once they're restored
> by the target instance of the application, resulting in a hanging or
> otherwise non-functional TCP connection.
>
> This type of race condition was discovered in the KubeVirt integration
> of passt(1), using a remote iperf3 client connected to an iperf3
> server running in the guest which is being migrated. The setup allows
> traffic to reach the origin node hosting the guest during the
> migration.
>
> If passt dumps sequence number and contents of the queue *before*
> further data is received and acknowledged to the peer by the kernel,
> once the TCP data connection is migrated to the target node, the
> remote client becomes unable to continue sending, because a portion
> of the data it sent *and received an acknowledgement for* is now lost.
>
> Schematically:
>
> 1. client --seq 1:100--> origin host --> passt --> guest --> server
>
> 2. client <--ACK: 100-- origin host
>
> 3. migration starts,

Here, a netfilter rule or bpf prog must be installed to
drop packets temporarily until migration completes.

We do not want unlikely tests in the fast path.

You can find a similar issue:
https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/

> passt enables repair mode, dumps the sequence
>    number (101) and sends it to the target node of the guest migration
>
> 4. client --seq 101:201--> origin host (passt not receiving anymore)
>
> 5. client <--ACK: 201-- origin host
>
> 6. migration completes, and passt restores sequence number 101 on the
>    migrated socket
>
> 7. client --seq 201:301--> target host (now seeing a sequence jump)
>
> 8. client <--ACK: 100-- target host
>
> ...and the connection can't recover anymore, because the client can't
> resend data that was already (erroneously) acknowledged. We need to
> avoid step 5. above.
>
> This would equally affect CRIU (the other known user of TCP_REPAIR),
> should data be received while the original container is frozen: the
> sequence dumped and the contents of the saved incoming queue would
> then depend on the timing.
>
> The race condition is also illustrated in the kselftests introduced
> by the next patch.
>
> To prevent this issue, discard data received for a socket in repair
> mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
>
> Fixes: ee9952831cfd ("tcp: Initial repair mode")
> Tested-by: Laurent Vivier <lvivier@redhat.com>
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
>  include/net/dropreason-core.h |  3 +++
>  net/ipv4/tcp_input.c          | 14 +++++++++++++-
>  2 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> index 2f312d1f67d6..19ab9e6ffc33 100644
> --- a/include/net/dropreason-core.h
> +++ b/include/net/dropreason-core.h
> @@ -9,6 +9,7 @@
>         FN(SOCKET_CLOSE)                \
>         FN(SOCKET_FILTER)               \
>         FN(SOCKET_RCVBUFF)              \
> +       FN(SOCKET_REPAIR)               \
>         FN(UNIX_DISCONNECT)             \
>         FN(UNIX_SKIP_OOB)               \
>         FN(PKT_TOO_SMALL)               \
> @@ -158,6 +159,8 @@ enum skb_drop_reason {
>         SKB_DROP_REASON_SOCKET_FILTER,
>         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
>         SKB_DROP_REASON_SOCKET_RCVBUFF,
> +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> +       SKB_DROP_REASON_SOCKET_REPAIR,
>         /**
>          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
>          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d5c9e65d9760..6eca34274f97 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>   *       or pure receivers (this means either the sequence number or the ack
>   *       value must stay constant)
>   *     - Unexpected TCP option.
> + *     - Socket is in repair mode.
>   *
>   *     When these conditions are not satisfied it drops into a standard
>   *     receive procedure patterned after RFC793 to handle all cases.
> @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>
>         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
>             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> +           !tp->repair) {
>                 int tcp_header_len = tp->tcp_header_len;
>                 s32 delta = 0;
>                 int flag = 0;
> @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
>                 goto discard;
>         }
>
> +       if (tp->repair) {
> +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> +               goto discard;
> +       }
> +
>         /*
>          *      Standard slow path.
>          */
> @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>         int queued = 0;
>         SKB_DR(reason);
>
> +       if (tp->repair) {
> +               SKB_DR_SET(reason, SOCKET_REPAIR);
> +               goto discard;
> +       }
> +
>         switch (sk->sk_state) {
>         case TCP_CLOSE:
>                 SKB_DR_SET(reason, TCP_CLOSE);
> --
> 2.43.0
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 19:05   ` Kuniyuki Iwashima
@ 2026-05-17 19:53     ` Stefano Brivio
  2026-05-17 21:01       ` Kuniyuki Iwashima
  2026-05-18  7:57       ` Eric Dumazet
  0 siblings, 2 replies; 12+ messages in thread
From: Stefano Brivio @ 2026-05-17 19:53 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Sun, 17 May 2026 12:05:45 -0700
Kuniyuki Iwashima <kuniyu@google.com> wrote:

> On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > Once a socket enters repair mode (TCP_REPAIR socket option with
> > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > (using TCP_REPAIR_QUEUE to select it).
> >
> > If we receive data after the application fetched the sequence number
> > or saved the contents of the queue, though, the application will now
> > have outdated information, which defeats the whole functionality,
> > because this leads to gaps in sequence and data once they're restored
> > by the target instance of the application, resulting in a hanging or
> > otherwise non-functional TCP connection.
> >
> > This type of race condition was discovered in the KubeVirt integration
> > of passt(1), using a remote iperf3 client connected to an iperf3
> > server running in the guest which is being migrated. The setup allows
> > traffic to reach the origin node hosting the guest during the
> > migration.
> >
> > If passt dumps sequence number and contents of the queue *before*
> > further data is received and acknowledged to the peer by the kernel,
> > once the TCP data connection is migrated to the target node, the
> > remote client becomes unable to continue sending, because a portion
> > of the data it sent *and received an acknowledgement for* is now lost.
> >
> > Schematically:
> >
> > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> >
> > 2. client <--ACK: 100-- origin host
> >
> > 3. migration starts,  
> 
> Here, a netfilter rule or bpf prog must be installed to
> drop packets temporarily until migration completes.

Thanks for the review.

I have to say it's rather unexpected to have to work around obvious
kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
and this is handled correctly on the sending side (see
tcp_write_xmit()), but was clearly forgotten on the receiving side.

TCP_REPAIR also allows to dump queues, not just sequence numbers, so
this really is a bad race condition making the whole functionality
unreliable.

But anyway, even looking for a practical workaround of the kind you
suggested, I see two issues with it:

1. we would still have a race condition, because userspace doesn't have
   a way to synchronise application of nftables rules (or even a BPF
   program) with the effects of TCP_REPAIR. We could apply nftables
   rules "a while before" just to be sure, but this is severely going to
   impact migration downtime

2. passt(1) runs unprivileged and uses a very simple helper to set
   TCP_REPAIR on the socket. Expanding helpers of this kind to directly
   manipulate nftables rules, or installing BPF programs, looks like a
   substantial security drawback

> We do not want unlikely tests in the fast path.

I understand, but note that this doesn't really add a branch to the
fast path: there's already a list of (more expensive) conditions under
which we need to fall back to slow path, with 'tp' definitely
prefetched at that point, so I don't expect any fast path cost from
doing this. Everything else is handled in the slow path.

To be sure, I also checked throughput with delivery to local sockets as
that's the only case affected (veth setup similar to the one from the
selftests from patch 2/2) and there's no visible difference.

> You can find a similar issue:
> https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/

That one is not a kernel issue: in that case, the socket is closed, so
it's actually expected that the kernel will reset the connection. As
Jakub pointed out, that patch introduces a race condition on its own,
and it's a hack rather than a fix.

We happened to have that kind of issue in passt as well (the
implementation is inspired by CRIU), but that's something entirely
different which userspace clearly needs to take care of, so we fixed it
here:

  https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351

> > passt enables repair mode, dumps the sequence
> >    number (101) and sends it to the target node of the guest migration
> >
> > 4. client --seq 101:201--> origin host (passt not receiving anymore)
> >
> > 5. client <--ACK: 201-- origin host
> >
> > 6. migration completes, and passt restores sequence number 101 on the
> >    migrated socket
> >
> > 7. client --seq 201:301--> target host (now seeing a sequence jump)
> >
> > 8. client <--ACK: 100-- target host
> >
> > ...and the connection can't recover anymore, because the client can't
> > resend data that was already (erroneously) acknowledged. We need to
> > avoid step 5. above.
> >
> > This would equally affect CRIU (the other known user of TCP_REPAIR),
> > should data be received while the original container is frozen: the
> > sequence dumped and the contents of the saved incoming queue would
> > then depend on the timing.
> >
> > The race condition is also illustrated in the kselftests introduced
> > by the next patch.
> >
> > To prevent this issue, discard data received for a socket in repair
> > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
> >
> > Fixes: ee9952831cfd ("tcp: Initial repair mode")
> > Tested-by: Laurent Vivier <lvivier@redhat.com>
> > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > ---
> >  include/net/dropreason-core.h |  3 +++
> >  net/ipv4/tcp_input.c          | 14 +++++++++++++-
> >  2 files changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > index 2f312d1f67d6..19ab9e6ffc33 100644
> > --- a/include/net/dropreason-core.h
> > +++ b/include/net/dropreason-core.h
> > @@ -9,6 +9,7 @@
> >         FN(SOCKET_CLOSE)                \
> >         FN(SOCKET_FILTER)               \
> >         FN(SOCKET_RCVBUFF)              \
> > +       FN(SOCKET_REPAIR)               \
> >         FN(UNIX_DISCONNECT)             \
> >         FN(UNIX_SKIP_OOB)               \
> >         FN(PKT_TOO_SMALL)               \
> > @@ -158,6 +159,8 @@ enum skb_drop_reason {
> >         SKB_DROP_REASON_SOCKET_FILTER,
> >         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
> >         SKB_DROP_REASON_SOCKET_RCVBUFF,
> > +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> > +       SKB_DROP_REASON_SOCKET_REPAIR,
> >         /**
> >          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
> >          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index d5c9e65d9760..6eca34274f97 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >   *       or pure receivers (this means either the sequence number or the ack
> >   *       value must stay constant)
> >   *     - Unexpected TCP option.
> > + *     - Socket is in repair mode.
> >   *
> >   *     When these conditions are not satisfied it drops into a standard
> >   *     receive procedure patterned after RFC793 to handle all cases.
> > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> >
> >         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
> >             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> > -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> > +           !tp->repair) {
> >                 int tcp_header_len = tp->tcp_header_len;
> >                 s32 delta = 0;
> >                 int flag = 0;
> > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> >                 goto discard;
> >         }
> >
> > +       if (tp->repair) {
> > +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> > +               goto discard;
> > +       }
> > +
> >         /*
> >          *      Standard slow path.
> >          */
> > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >         int queued = 0;
> >         SKB_DR(reason);
> >
> > +       if (tp->repair) {
> > +               SKB_DR_SET(reason, SOCKET_REPAIR);
> > +               goto discard;
> > +       }
> > +
> >         switch (sk->sk_state) {
> >         case TCP_CLOSE:
> >                 SKB_DR_SET(reason, TCP_CLOSE);
> > --
> > 2.43.0

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 19:53     ` Stefano Brivio
@ 2026-05-17 21:01       ` Kuniyuki Iwashima
  2026-05-18  5:32         ` David Gibson
  2026-05-18  7:57       ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2026-05-17 21:01 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Sun, 17 May 2026 12:05:45 -0700
> Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > (using TCP_REPAIR_QUEUE to select it).
> > >
> > > If we receive data after the application fetched the sequence number
> > > or saved the contents of the queue, though, the application will now
> > > have outdated information, which defeats the whole functionality,
> > > because this leads to gaps in sequence and data once they're restored
> > > by the target instance of the application, resulting in a hanging or
> > > otherwise non-functional TCP connection.
> > >
> > > This type of race condition was discovered in the KubeVirt integration
> > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > server running in the guest which is being migrated. The setup allows
> > > traffic to reach the origin node hosting the guest during the
> > > migration.
> > >
> > > If passt dumps sequence number and contents of the queue *before*
> > > further data is received and acknowledged to the peer by the kernel,
> > > once the TCP data connection is migrated to the target node, the
> > > remote client becomes unable to continue sending, because a portion
> > > of the data it sent *and received an acknowledgement for* is now lost.
> > >
> > > Schematically:
> > >
> > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > >
> > > 2. client <--ACK: 100-- origin host
> > >
> > > 3. migration starts,
> >
> > Here, a netfilter rule or bpf prog must be installed to
> > drop packets temporarily until migration completes.
>
> Thanks for the review.
>
> I have to say it's rather unexpected to have to work around obvious
> kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> and this is handled correctly on the sending side (see
> tcp_write_xmit()), but was clearly forgotten on the receiving side.
>
> TCP_REPAIR also allows to dump queues, not just sequence numbers, so
> this really is a bad race condition making the whole functionality
> unreliable.
>
> But anyway, even looking for a practical workaround of the kind you
> suggested, I see two issues with it:
>
> 1. we would still have a race condition, because userspace doesn't have
>    a way to synchronise application of nftables rules (or even a BPF
>    program) with the effects of TCP_REPAIR.

Note that setsockopt(TCP_REPAIR) is under lock_sock(), so the
backlog is always cleared before returning to userspace, and the
following getsockopt() will have stable view.


> We could apply nftables
>    rules "a while before" just to be sure, but this is severely going to
>    impact migration downtime

So it's "just before", not "a while before".


>
> 2. passt(1) runs unprivileged and uses a very simple helper to set
>    TCP_REPAIR on the socket.

I guess it uses userns ?

setsockopt(TCP_REPAIR) requires CAP_NET_ADMIN in the
userns tied to the socket's netns (in tcp_can_repair_sock()),
so you should be able to use netfilter anyway.


> Expanding helpers of this kind to directly
>    manipulate nftables rules, or installing BPF programs, looks like a
>    substantial security drawback
>
> > We do not want unlikely tests in the fast path.
>
> I understand, but note that this doesn't really add a branch to the
> fast path: there's already a list of (more expensive) conditions under
> which we need to fall back to slow path, with 'tp' definitely
> prefetched at that point, so I don't expect any fast path cost from
> doing this. Everything else is handled in the slow path.
>
> To be sure, I also checked throughput with delivery to local sockets as
> that's the only case affected (veth setup similar to the one from the
> selftests from patch 2/2) and there's no visible difference.
>
> > You can find a similar issue:
> > https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/
>
> That one is not a kernel issue: in that case, the socket is closed, so
> it's actually expected that the kernel will reset the connection. As
> Jakub pointed out, that patch introduces a race condition on its own,
> and it's a hack rather than a fix.
>
> We happened to have that kind of issue in passt as well (the
> implementation is inspired by CRIU), but that's something entirely
> different which userspace clearly needs to take care of, so we fixed it
> here:
>
>   https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351
>
> > > passt enables repair mode, dumps the sequence
> > >    number (101) and sends it to the target node of the guest migration
> > >
> > > 4. client --seq 101:201--> origin host (passt not receiving anymore)
> > >
> > > 5. client <--ACK: 201-- origin host
> > >
> > > 6. migration completes, and passt restores sequence number 101 on the
> > >    migrated socket
> > >
> > > 7. client --seq 201:301--> target host (now seeing a sequence jump)
> > >
> > > 8. client <--ACK: 100-- target host
> > >
> > > ...and the connection can't recover anymore, because the client can't
> > > resend data that was already (erroneously) acknowledged. We need to
> > > avoid step 5. above.
> > >
> > > This would equally affect CRIU (the other known user of TCP_REPAIR),
> > > should data be received while the original container is frozen: the
> > > sequence dumped and the contents of the saved incoming queue would
> > > then depend on the timing.
> > >
> > > The race condition is also illustrated in the kselftests introduced
> > > by the next patch.
> > >
> > > To prevent this issue, discard data received for a socket in repair
> > > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
> > >
> > > Fixes: ee9952831cfd ("tcp: Initial repair mode")
> > > Tested-by: Laurent Vivier <lvivier@redhat.com>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  include/net/dropreason-core.h |  3 +++
> > >  net/ipv4/tcp_input.c          | 14 +++++++++++++-
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > > index 2f312d1f67d6..19ab9e6ffc33 100644
> > > --- a/include/net/dropreason-core.h
> > > +++ b/include/net/dropreason-core.h
> > > @@ -9,6 +9,7 @@
> > >         FN(SOCKET_CLOSE)                \
> > >         FN(SOCKET_FILTER)               \
> > >         FN(SOCKET_RCVBUFF)              \
> > > +       FN(SOCKET_REPAIR)               \
> > >         FN(UNIX_DISCONNECT)             \
> > >         FN(UNIX_SKIP_OOB)               \
> > >         FN(PKT_TOO_SMALL)               \
> > > @@ -158,6 +159,8 @@ enum skb_drop_reason {
> > >         SKB_DROP_REASON_SOCKET_FILTER,
> > >         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
> > >         SKB_DROP_REASON_SOCKET_RCVBUFF,
> > > +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> > > +       SKB_DROP_REASON_SOCKET_REPAIR,
> > >         /**
> > >          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
> > >          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index d5c9e65d9760..6eca34274f97 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> > >   *       or pure receivers (this means either the sequence number or the ack
> > >   *       value must stay constant)
> > >   *     - Unexpected TCP option.
> > > + *     - Socket is in repair mode.
> > >   *
> > >   *     When these conditions are not satisfied it drops into a standard
> > >   *     receive procedure patterned after RFC793 to handle all cases.
> > > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > >
> > >         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
> > >             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> > > -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> > > +           !tp->repair) {
> > >                 int tcp_header_len = tp->tcp_header_len;
> > >                 s32 delta = 0;
> > >                 int flag = 0;
> > > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > >                 goto discard;
> > >         }
> > >
> > > +       if (tp->repair) {
> > > +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> > > +               goto discard;
> > > +       }
> > > +
> > >         /*
> > >          *      Standard slow path.
> > >          */
> > > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> > >         int queued = 0;
> > >         SKB_DR(reason);
> > >
> > > +       if (tp->repair) {
> > > +               SKB_DR_SET(reason, SOCKET_REPAIR);
> > > +               goto discard;
> > > +       }
> > > +
> > >         switch (sk->sk_state) {
> > >         case TCP_CLOSE:
> > >                 SKB_DR_SET(reason, TCP_CLOSE);
> > > --
> > > 2.43.0
>
> --
> Stefano
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 21:01       ` Kuniyuki Iwashima
@ 2026-05-18  5:32         ` David Gibson
  0 siblings, 0 replies; 12+ messages in thread
From: David Gibson @ 2026-05-18  5:32 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: Stefano Brivio, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Pavel Emelyanov, Laurent Vivier, Jon Maloy,
	Dmitry Safonov, Andrei Vagin, netdev, linux-kselftest,
	linux-kernel, Neal Cardwell, Simon Horman, Shuah Khan

[-- Attachment #1: Type: text/plain, Size: 4205 bytes --]

On Sun, May 17, 2026 at 02:01:29PM -0700, Kuniyuki Iwashima wrote:
> On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Sun, 17 May 2026 12:05:45 -0700
> > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >
> > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > >
> > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > (using TCP_REPAIR_QUEUE to select it).
> > > >
> > > > If we receive data after the application fetched the sequence number
> > > > or saved the contents of the queue, though, the application will now
> > > > have outdated information, which defeats the whole functionality,
> > > > because this leads to gaps in sequence and data once they're restored
> > > > by the target instance of the application, resulting in a hanging or
> > > > otherwise non-functional TCP connection.
> > > >
> > > > This type of race condition was discovered in the KubeVirt integration
> > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > server running in the guest which is being migrated. The setup allows
> > > > traffic to reach the origin node hosting the guest during the
> > > > migration.
> > > >
> > > > If passt dumps sequence number and contents of the queue *before*
> > > > further data is received and acknowledged to the peer by the kernel,
> > > > once the TCP data connection is migrated to the target node, the
> > > > remote client becomes unable to continue sending, because a portion
> > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > >
> > > > Schematically:
> > > >
> > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > >
> > > > 2. client <--ACK: 100-- origin host
> > > >
> > > > 3. migration starts,
> > >
> > > Here, a netfilter rule or bpf prog must be installed to
> > > drop packets temporarily until migration completes.
> >
> > Thanks for the review.
> >
> > I have to say it's rather unexpected to have to work around obvious
> > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > and this is handled correctly on the sending side (see
> > tcp_write_xmit()), but was clearly forgotten on the receiving side.
> >
> > TCP_REPAIR also allows to dump queues, not just sequence numbers, so
> > this really is a bad race condition making the whole functionality
> > unreliable.
> >
> > But anyway, even looking for a practical workaround of the kind you
> > suggested, I see two issues with it:
> >
> > 1. we would still have a race condition, because userspace doesn't have
> >    a way to synchronise application of nftables rules (or even a BPF
> >    program) with the effects of TCP_REPAIR.
> 
> Note that setsockopt(TCP_REPAIR) is under lock_sock(), so the
> backlog is always cleared before returning to userspace, and the
> following getsockopt() will have stable view.
> 
> 
> > We could apply nftables
> >    rules "a while before" just to be sure, but this is severely going to
> >    impact migration downtime
> 
> So it's "just before", not "a while before".
> 
> 
> >
> > 2. passt(1) runs unprivileged and uses a very simple helper to set
> >    TCP_REPAIR on the socket.
> 
> I guess it uses userns ?

It does, but it doesn't help.

> setsockopt(TCP_REPAIR) requires CAP_NET_ADMIN in the
> userns tied to the socket's netns (in tcp_can_repair_sock()),
> so you should be able to use netfilter anyway.

The sockets we use TCP_REPAIR on are *outside* the netns/userns - the
whole point is that we're bridging between the outside world and the
ns.  So we have no privileges there to use netfilter (and even if we
did, we couldn't safely use it, since we don't know how else netfilter
is being used on the host system).

-- 
David Gibson (he or they)	| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you, not the other way
				| around.
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-17 19:53     ` Stefano Brivio
  2026-05-17 21:01       ` Kuniyuki Iwashima
@ 2026-05-18  7:57       ` Eric Dumazet
  2026-05-18 11:28         ` Stefano Brivio
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2026-05-18  7:57 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Kuniyuki Iwashima, David S. Miller, Jakub Kicinski, Paolo Abeni,
	Pavel Emelyanov, Laurent Vivier, Jon Maloy, Dmitry Safonov,
	Andrei Vagin, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Sun, 17 May 2026 12:05:45 -0700
> Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > (using TCP_REPAIR_QUEUE to select it).
> > >
> > > If we receive data after the application fetched the sequence number
> > > or saved the contents of the queue, though, the application will now
> > > have outdated information, which defeats the whole functionality,
> > > because this leads to gaps in sequence and data once they're restored
> > > by the target instance of the application, resulting in a hanging or
> > > otherwise non-functional TCP connection.
> > >
> > > This type of race condition was discovered in the KubeVirt integration
> > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > server running in the guest which is being migrated. The setup allows
> > > traffic to reach the origin node hosting the guest during the
> > > migration.
> > >
> > > If passt dumps sequence number and contents of the queue *before*
> > > further data is received and acknowledged to the peer by the kernel,
> > > once the TCP data connection is migrated to the target node, the
> > > remote client becomes unable to continue sending, because a portion
> > > of the data it sent *and received an acknowledgement for* is now lost.
> > >
> > > Schematically:
> > >
> > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > >
> > > 2. client <--ACK: 100-- origin host
> > >
> > > 3. migration starts,
> >
> > Here, a netfilter rule or bpf prog must be installed to
> > drop packets temporarily until migration completes.
>
> Thanks for the review.
>
> I have to say it's rather unexpected to have to work around obvious
> kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> and this is handled correctly on the sending side (see
> tcp_write_xmit()), but was clearly forgotten on the receiving side.

Have you contacted TCP repair authors for best practices?

Please do not add code to TCP fast path.

Instead make sure TCP repair inserts in ehash table sockets ready to be used.

Yes, it is more complex, but I am sure it can be done properly.

>
> TCP_REPAIR also allows to dump queues, not just sequence numbers, so
> this really is a bad race condition making the whole functionality
> unreliable.
>
> But anyway, even looking for a practical workaround of the kind you
> suggested, I see two issues with it:
>
> 1. we would still have a race condition, because userspace doesn't have
>    a way to synchronise application of nftables rules (or even a BPF
>    program) with the effects of TCP_REPAIR. We could apply nftables
>    rules "a while before" just to be sure, but this is severely going to
>    impact migration downtime
>
> 2. passt(1) runs unprivileged and uses a very simple helper to set
>    TCP_REPAIR on the socket. Expanding helpers of this kind to directly
>    manipulate nftables rules, or installing BPF programs, looks like a
>    substantial security drawback
>
> > We do not want unlikely tests in the fast path.
>
> I understand, but note that this doesn't really add a branch to the
> fast path: there's already a list of (more expensive) conditions under
> which we need to fall back to slow path, with 'tp' definitely
> prefetched at that point, so I don't expect any fast path cost from
> doing this. Everything else is handled in the slow path.
>
> To be sure, I also checked throughput with delivery to local sockets as
> that's the only case affected (veth setup similar to the one from the
> selftests from patch 2/2) and there's no visible difference.
>
> > You can find a similar issue:
> > https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/
>
> That one is not a kernel issue: in that case, the socket is closed, so
> it's actually expected that the kernel will reset the connection. As
> Jakub pointed out, that patch introduces a race condition on its own,
> and it's a hack rather than a fix.
>
> We happened to have that kind of issue in passt as well (the
> implementation is inspired by CRIU), but that's something entirely
> different which userspace clearly needs to take care of, so we fixed it
> here:
>
>   https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351
>
> > > passt enables repair mode, dumps the sequence
> > >    number (101) and sends it to the target node of the guest migration
> > >
> > > 4. client --seq 101:201--> origin host (passt not receiving anymore)
> > >
> > > 5. client <--ACK: 201-- origin host
> > >
> > > 6. migration completes, and passt restores sequence number 101 on the
> > >    migrated socket
> > >
> > > 7. client --seq 201:301--> target host (now seeing a sequence jump)
> > >
> > > 8. client <--ACK: 100-- target host
> > >
> > > ...and the connection can't recover anymore, because the client can't
> > > resend data that was already (erroneously) acknowledged. We need to
> > > avoid step 5. above.
> > >
> > > This would equally affect CRIU (the other known user of TCP_REPAIR),
> > > should data be received while the original container is frozen: the
> > > sequence dumped and the contents of the saved incoming queue would
> > > then depend on the timing.
> > >
> > > The race condition is also illustrated in the kselftests introduced
> > > by the next patch.
> > >
> > > To prevent this issue, discard data received for a socket in repair
> > > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
> > >
> > > Fixes: ee9952831cfd ("tcp: Initial repair mode")
> > > Tested-by: Laurent Vivier <lvivier@redhat.com>
> > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > ---
> > >  include/net/dropreason-core.h |  3 +++
> > >  net/ipv4/tcp_input.c          | 14 +++++++++++++-
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > > index 2f312d1f67d6..19ab9e6ffc33 100644
> > > --- a/include/net/dropreason-core.h
> > > +++ b/include/net/dropreason-core.h
> > > @@ -9,6 +9,7 @@
> > >         FN(SOCKET_CLOSE)                \
> > >         FN(SOCKET_FILTER)               \
> > >         FN(SOCKET_RCVBUFF)              \
> > > +       FN(SOCKET_REPAIR)               \
> > >         FN(UNIX_DISCONNECT)             \
> > >         FN(UNIX_SKIP_OOB)               \
> > >         FN(PKT_TOO_SMALL)               \
> > > @@ -158,6 +159,8 @@ enum skb_drop_reason {
> > >         SKB_DROP_REASON_SOCKET_FILTER,
> > >         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
> > >         SKB_DROP_REASON_SOCKET_RCVBUFF,
> > > +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> > > +       SKB_DROP_REASON_SOCKET_REPAIR,
> > >         /**
> > >          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
> > >          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index d5c9e65d9760..6eca34274f97 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> > >   *       or pure receivers (this means either the sequence number or the ack
> > >   *       value must stay constant)
> > >   *     - Unexpected TCP option.
> > > + *     - Socket is in repair mode.
> > >   *
> > >   *     When these conditions are not satisfied it drops into a standard
> > >   *     receive procedure patterned after RFC793 to handle all cases.
> > > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > >
> > >         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
> > >             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> > > -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> > > +           !tp->repair) {
> > >                 int tcp_header_len = tp->tcp_header_len;
> > >                 s32 delta = 0;
> > >                 int flag = 0;
> > > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > >                 goto discard;
> > >         }
> > >
> > > +       if (tp->repair) {
> > > +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> > > +               goto discard;
> > > +       }
> > > +
> > >         /*
> > >          *      Standard slow path.
> > >          */
> > > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> > >         int queued = 0;
> > >         SKB_DR(reason);
> > >
> > > +       if (tp->repair) {
> > > +               SKB_DR_SET(reason, SOCKET_REPAIR);
> > > +               goto discard;
> > > +       }
> > > +
> > >         switch (sk->sk_state) {
> > >         case TCP_CLOSE:
> > >                 SKB_DR_SET(reason, TCP_CLOSE);
> > > --
> > > 2.43.0
>
> --
> Stefano
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-18  7:57       ` Eric Dumazet
@ 2026-05-18 11:28         ` Stefano Brivio
  2026-05-19 15:36           ` Andrei Vagin
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2026-05-18 11:28 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, Pavel Emelyanov,
	Laurent Vivier, Jon Maloy, Dmitry Safonov, Andrei Vagin, netdev,
	linux-kselftest, linux-kernel, Neal Cardwell, Simon Horman,
	Shuah Khan, David Gibson

On Mon, 18 May 2026 00:57:16 -0700
Eric Dumazet <edumazet@google.com> wrote:

> On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Sun, 17 May 2026 12:05:45 -0700
> > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> >  
> > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > >
> > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > (using TCP_REPAIR_QUEUE to select it).
> > > >
> > > > If we receive data after the application fetched the sequence number
> > > > or saved the contents of the queue, though, the application will now
> > > > have outdated information, which defeats the whole functionality,
> > > > because this leads to gaps in sequence and data once they're restored
> > > > by the target instance of the application, resulting in a hanging or
> > > > otherwise non-functional TCP connection.
> > > >
> > > > This type of race condition was discovered in the KubeVirt integration
> > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > server running in the guest which is being migrated. The setup allows
> > > > traffic to reach the origin node hosting the guest during the
> > > > migration.
> > > >
> > > > If passt dumps sequence number and contents of the queue *before*
> > > > further data is received and acknowledged to the peer by the kernel,
> > > > once the TCP data connection is migrated to the target node, the
> > > > remote client becomes unable to continue sending, because a portion
> > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > >
> > > > Schematically:
> > > >
> > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > >
> > > > 2. client <--ACK: 100-- origin host
> > > >
> > > > 3. migration starts,  
> > >
> > > Here, a netfilter rule or bpf prog must be installed to
> > > drop packets temporarily until migration completes.  
> >
> > Thanks for the review.
> >
> > I have to say it's rather unexpected to have to work around obvious
> > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > and this is handled correctly on the sending side (see
> > tcp_write_xmit()), but was clearly forgotten on the receiving side.  
> 
> Have you contacted TCP repair authors for best practices?

I Cc'ed them here, Pavel is the author (but as far as I understand not
active in kernel development anymore), and I know that Andrei did some
substantial work on it in the past, so he's Cc'ed here as well.

But we are following what CRIU (the userspace reference implementation)
does, and CRIU would be affected by this issue as well (depending on
usage).

> Please do not add code to TCP fast path.

My understanding was that we should avoid adding *computational
overhead* to it, which this patch doesn't _seem_ to add. The actual
overhead it adds is in the slow path.

But, regardless of that, I hope I found a solution that would address
your potential concern as well.

There seems to be a common way to disable fast path for a given socket
(and to re-enable it later), courtesy of those tp->pred_flags that are
checked in the existing condition in tcp_rcv_established().

The idea looking at other functions seems to be:

- set pred_flags to 0 to disable the fast path (which I would do once
  TCP_REPAIR_ON is set)

- call __tcp_fast_path_on(), which recalculates it, to re-enable fast
  path (which I would do for TCP_REPAIR_OFF / TCP_REPAIR_OFF_NO_WP)

That's what other implementations dealing with rare / corner cases, such
as tcp_data_queue_ofo(), or tcp_check_urg(), seem to be handling
this. It looks rather idiomatic. I wasn't aware of that.

I plan to post a v2 doing this instead, in a bit.

> Instead make sure TCP repair inserts in ehash table sockets ready to be used.

I guess you're referring to the other issue / hack Kuniyuki linked,
where the kernel is sending RST segments for sockets that have been
closed and (not yet) reopened?

For the specific race condition I'm dealing with here I don't think
that would help. Here it's about incoming data being queued when it
shouldn't. The sockets are already in the ehash table ready to be used
(...that's actually the problem, in some sense).

> Yes, it is more complex, but I am sure it can be done properly.
> 
> > TCP_REPAIR also allows to dump queues, not just sequence numbers, so
> > this really is a bad race condition making the whole functionality
> > unreliable.
> >
> > But anyway, even looking for a practical workaround of the kind you
> > suggested, I see two issues with it:
> >
> > 1. we would still have a race condition, because userspace doesn't have
> >    a way to synchronise application of nftables rules (or even a BPF
> >    program) with the effects of TCP_REPAIR. We could apply nftables
> >    rules "a while before" just to be sure, but this is severely going to
> >    impact migration downtime
> >
> > 2. passt(1) runs unprivileged and uses a very simple helper to set
> >    TCP_REPAIR on the socket. Expanding helpers of this kind to directly
> >    manipulate nftables rules, or installing BPF programs, looks like a
> >    substantial security drawback
> >  
> > > We do not want unlikely tests in the fast path.  
> >
> > I understand, but note that this doesn't really add a branch to the
> > fast path: there's already a list of (more expensive) conditions under
> > which we need to fall back to slow path, with 'tp' definitely
> > prefetched at that point, so I don't expect any fast path cost from
> > doing this. Everything else is handled in the slow path.
> >
> > To be sure, I also checked throughput with delivery to local sockets as
> > that's the only case affected (veth setup similar to the one from the
> > selftests from patch 2/2) and there's no visible difference.
> >  
> > > You can find a similar issue:
> > > https://lore.kernel.org/netdev/20260130145122.368748-1-me@linux.beauty/  
> >
> > That one is not a kernel issue: in that case, the socket is closed, so
> > it's actually expected that the kernel will reset the connection. As
> > Jakub pointed out, that patch introduces a race condition on its own,
> > and it's a hack rather than a fix.
> >
> > We happened to have that kind of issue in passt as well (the
> > implementation is inspired by CRIU), but that's something entirely
> > different which userspace clearly needs to take care of, so we fixed it
> > here:
> >
> >   https://passt.top/passt/commit/?id=a8782865c342eb2682cca292d5bf92b567344351
> >  
> > > > passt enables repair mode, dumps the sequence
> > > >    number (101) and sends it to the target node of the guest migration
> > > >
> > > > 4. client --seq 101:201--> origin host (passt not receiving anymore)
> > > >
> > > > 5. client <--ACK: 201-- origin host
> > > >
> > > > 6. migration completes, and passt restores sequence number 101 on the
> > > >    migrated socket
> > > >
> > > > 7. client --seq 201:301--> target host (now seeing a sequence jump)
> > > >
> > > > 8. client <--ACK: 100-- target host
> > > >
> > > > ...and the connection can't recover anymore, because the client can't
> > > > resend data that was already (erroneously) acknowledged. We need to
> > > > avoid step 5. above.
> > > >
> > > > This would equally affect CRIU (the other known user of TCP_REPAIR),
> > > > should data be received while the original container is frozen: the
> > > > sequence dumped and the contents of the saved incoming queue would
> > > > then depend on the timing.
> > > >
> > > > The race condition is also illustrated in the kselftests introduced
> > > > by the next patch.
> > > >
> > > > To prevent this issue, discard data received for a socket in repair
> > > > mode, with a new reason, SKB_DROP_REASON_SOCKET_REPAIR.
> > > >
> > > > Fixes: ee9952831cfd ("tcp: Initial repair mode")
> > > > Tested-by: Laurent Vivier <lvivier@redhat.com>
> > > > Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> > > > ---
> > > >  include/net/dropreason-core.h |  3 +++
> > > >  net/ipv4/tcp_input.c          | 14 +++++++++++++-
> > > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
> > > > index 2f312d1f67d6..19ab9e6ffc33 100644
> > > > --- a/include/net/dropreason-core.h
> > > > +++ b/include/net/dropreason-core.h
> > > > @@ -9,6 +9,7 @@
> > > >         FN(SOCKET_CLOSE)                \
> > > >         FN(SOCKET_FILTER)               \
> > > >         FN(SOCKET_RCVBUFF)              \
> > > > +       FN(SOCKET_REPAIR)               \
> > > >         FN(UNIX_DISCONNECT)             \
> > > >         FN(UNIX_SKIP_OOB)               \
> > > >         FN(PKT_TOO_SMALL)               \
> > > > @@ -158,6 +159,8 @@ enum skb_drop_reason {
> > > >         SKB_DROP_REASON_SOCKET_FILTER,
> > > >         /** @SKB_DROP_REASON_SOCKET_RCVBUFF: socket receive buff is full */
> > > >         SKB_DROP_REASON_SOCKET_RCVBUFF,
> > > > +       /** @SKB_DROP_REASON_SOCKET_REPAIR: socket is in repair mode */
> > > > +       SKB_DROP_REASON_SOCKET_REPAIR,
> > > >         /**
> > > >          * @SKB_DROP_REASON_UNIX_DISCONNECT: recv queue is purged when SOCK_DGRAM
> > > >          * or SOCK_SEQPACKET socket re-connect()s to another socket or notices
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index d5c9e65d9760..6eca34274f97 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -6457,6 +6457,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> > > >   *       or pure receivers (this means either the sequence number or the ack
> > > >   *       value must stay constant)
> > > >   *     - Unexpected TCP option.
> > > > + *     - Socket is in repair mode.
> > > >   *
> > > >   *     When these conditions are not satisfied it drops into a standard
> > > >   *     receive procedure patterned after RFC793 to handle all cases.
> > > > @@ -6506,7 +6507,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > > >
> > > >         if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
> > > >             TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
> > > > -           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
> > > > +           !after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt) &&
> > > > +           !tp->repair) {
> > > >                 int tcp_header_len = tp->tcp_header_len;
> > > >                 s32 delta = 0;
> > > >                 int flag = 0;
> > > > @@ -6632,6 +6634,11 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
> > > >                 goto discard;
> > > >         }
> > > >
> > > > +       if (tp->repair) {
> > > > +               reason = SKB_DROP_REASON_SOCKET_REPAIR;
> > > > +               goto discard;
> > > > +       }
> > > > +
> > > >         /*
> > > >          *      Standard slow path.
> > > >          */
> > > > @@ -7125,6 +7132,11 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> > > >         int queued = 0;
> > > >         SKB_DR(reason);
> > > >
> > > > +       if (tp->repair) {
> > > > +               SKB_DR_SET(reason, SOCKET_REPAIR);
> > > > +               goto discard;
> > > > +       }
> > > > +
> > > >         switch (sk->sk_state) {
> > > >         case TCP_CLOSE:
> > > >                 SKB_DR_SET(reason, TCP_CLOSE);
> > > > --
> > > > 2.43.0  
> >
> > --
> > Stefano

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-18 11:28         ` Stefano Brivio
@ 2026-05-19 15:36           ` Andrei Vagin
  2026-05-19 16:11             ` Stefano Brivio
  0 siblings, 1 reply; 12+ messages in thread
From: Andrei Vagin @ 2026-05-19 15:36 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Eric Dumazet, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Pavel Emelyanov, Laurent Vivier, Jon Maloy,
	Dmitry Safonov, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Mon, 18 May 2026 00:57:16 -0700
> Eric Dumazet <edumazet@google.com> wrote:
>
> > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Sun, 17 May 2026 12:05:45 -0700
> > > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > >
> > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > > (using TCP_REPAIR_QUEUE to select it).
> > > > >
> > > > > If we receive data after the application fetched the sequence number
> > > > > or saved the contents of the queue, though, the application will now
> > > > > have outdated information, which defeats the whole functionality,
> > > > > because this leads to gaps in sequence and data once they're restored
> > > > > by the target instance of the application, resulting in a hanging or
> > > > > otherwise non-functional TCP connection.
> > > > >
> > > > > This type of race condition was discovered in the KubeVirt integration
> > > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > > server running in the guest which is being migrated. The setup allows
> > > > > traffic to reach the origin node hosting the guest during the
> > > > > migration.
> > > > >
> > > > > If passt dumps sequence number and contents of the queue *before*
> > > > > further data is received and acknowledged to the peer by the kernel,
> > > > > once the TCP data connection is migrated to the target node, the
> > > > > remote client becomes unable to continue sending, because a portion
> > > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > > >
> > > > > Schematically:
> > > > >
> > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > > >
> > > > > 2. client <--ACK: 100-- origin host
> > > > >
> > > > > 3. migration starts,
> > > >
> > > > Here, a netfilter rule or bpf prog must be installed to
> > > > drop packets temporarily until migration completes.
> > >
> > > Thanks for the review.
> > >
> > > I have to say it's rather unexpected to have to work around obvious
> > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > > and this is handled correctly on the sending side (see
> > > tcp_write_xmit()), but was clearly forgotten on the receiving side.
> >
> > Have you contacted TCP repair authors for best practices?
>
> I Cc'ed them here, Pavel is the author (but as far as I understand not
> active in kernel development anymore), and I know that Andrei did some
> substantial work on it in the past, so he's Cc'ed here as well.
>
> But we are following what CRIU (the userspace reference implementation)
> does, and CRIU would be affected by this issue as well (depending on
> usage).

Before extracting the socket state, CRIU uses netfilter (iptables or
nftables) to block all traffic associated with the specific TCP
connection or, in the case of a container, the entire network namespace.

This approach provides two main benefits:

During the dump, we don't need to leave all sockets in repair mode for
the entire duration of the dump.  We enable and disable repair mode just
to grab the state. It's simplify a roll back if something goes wrong.

During restoration, it ensures that the destination kernel will not
process any packets until the socket is fully reconstructed. This
prevents the kernel from sending a Reset (RST) or getting out of sync
before the connection is ready to be resumed.

I haven't looked at the patch yet, but I have no objections to the idea
itself.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-19 15:36           ` Andrei Vagin
@ 2026-05-19 16:11             ` Stefano Brivio
  2026-05-19 18:42               ` Andrei Vagin
  0 siblings, 1 reply; 12+ messages in thread
From: Stefano Brivio @ 2026-05-19 16:11 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Eric Dumazet, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Pavel Emelyanov, Laurent Vivier, Jon Maloy,
	Dmitry Safonov, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Tue, 19 May 2026 08:36:27 -0700
Andrei Vagin <avagin@google.com> wrote:

> On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > On Mon, 18 May 2026 00:57:16 -0700
> > Eric Dumazet <edumazet@google.com> wrote:
> >  
> > > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > >
> > > > On Sun, 17 May 2026 12:05:45 -0700
> > > > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > > >  
> > > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:  
> > > > > >
> > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > > > (using TCP_REPAIR_QUEUE to select it).
> > > > > >
> > > > > > If we receive data after the application fetched the sequence number
> > > > > > or saved the contents of the queue, though, the application will now
> > > > > > have outdated information, which defeats the whole functionality,
> > > > > > because this leads to gaps in sequence and data once they're restored
> > > > > > by the target instance of the application, resulting in a hanging or
> > > > > > otherwise non-functional TCP connection.
> > > > > >
> > > > > > This type of race condition was discovered in the KubeVirt integration
> > > > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > > > server running in the guest which is being migrated. The setup allows
> > > > > > traffic to reach the origin node hosting the guest during the
> > > > > > migration.
> > > > > >
> > > > > > If passt dumps sequence number and contents of the queue *before*
> > > > > > further data is received and acknowledged to the peer by the kernel,
> > > > > > once the TCP data connection is migrated to the target node, the
> > > > > > remote client becomes unable to continue sending, because a portion
> > > > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > > > >
> > > > > > Schematically:
> > > > > >
> > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > > > >
> > > > > > 2. client <--ACK: 100-- origin host
> > > > > >
> > > > > > 3. migration starts,  
> > > > >
> > > > > Here, a netfilter rule or bpf prog must be installed to
> > > > > drop packets temporarily until migration completes.  
> > > >
> > > > Thanks for the review.
> > > >
> > > > I have to say it's rather unexpected to have to work around obvious
> > > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > > > and this is handled correctly on the sending side (see
> > > > tcp_write_xmit()), but was clearly forgotten on the receiving side.  
> > >
> > > Have you contacted TCP repair authors for best practices?  
> >
> > I Cc'ed them here, Pavel is the author (but as far as I understand not
> > active in kernel development anymore), and I know that Andrei did some
> > substantial work on it in the past, so he's Cc'ed here as well.
> >
> > But we are following what CRIU (the userspace reference implementation)
> > does, and CRIU would be affected by this issue as well (depending on
> > usage).  
> 
> Before extracting the socket state, CRIU uses netfilter (iptables or
> nftables) to block all traffic associated with the specific TCP
> connection or, in the case of a container, the entire network namespace.

...by default, yes, by "depending on usage" I actually meant
'--network-lock skip', but I have to admit I'm not sure how commonly
that is used.

> This approach provides two main benefits:
> 
> During the dump, we don't need to leave all sockets in repair mode for
> the entire duration of the dump.  We enable and disable repair mode just
> to grab the state. It's simplify a roll back if something goes wrong.

Thanks for clarifying this aspect, I wasn't sure whether it was done to
make rollback easier.

That wouldn't really simplify things with passt in case of rollback
because, as we anyway keep track of flows internally (we implement
them, after all), we want to keep those sockets (frozen) in repair mode
until we're able to process traffic again (something CRIU doesn't need
to do), and explicitly take them out of repair mode at a rather precise
point in time.

> During restoration, it ensures that the destination kernel will not
> process any packets until the socket is fully reconstructed. This
> prevents the kernel from sending a Reset (RST) or getting out of sync
> before the connection is ready to be resumed.

Right, we don't have that issue in passt (anymore) because we keep
sockets open on the origin node anyway, and we block migration until
we're done restoring connection sockets and listening sockets on the
target.

> I haven't looked at the patch yet, but I have no objections to the idea
> itself.

Thanks for the explanation and for having a look! For passt (and
KubeVirt) that would save a lot of complexity, plus the security
aspects myself and David raised.

And, besides that and the usage CRIU and passt make of it, I still think
that allowing data to be queued while somebody can dump the queue is a
fundamental race condition.

-- 
Stefano


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode
  2026-05-19 16:11             ` Stefano Brivio
@ 2026-05-19 18:42               ` Andrei Vagin
  0 siblings, 0 replies; 12+ messages in thread
From: Andrei Vagin @ 2026-05-19 18:42 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: Eric Dumazet, Kuniyuki Iwashima, David S. Miller, Jakub Kicinski,
	Paolo Abeni, Pavel Emelyanov, Laurent Vivier, Jon Maloy,
	Dmitry Safonov, netdev, linux-kselftest, linux-kernel,
	Neal Cardwell, Simon Horman, Shuah Khan, David Gibson

On Tue, May 19, 2026 at 9:12 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> On Tue, 19 May 2026 08:36:27 -0700
> Andrei Vagin <avagin@google.com> wrote:
>
> > On Mon, May 18, 2026 at 4:34 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > >
> > > On Mon, 18 May 2026 00:57:16 -0700
> > > Eric Dumazet <edumazet@google.com> wrote:
> > >
> > > > On Sun, May 17, 2026 at 12:53 PM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > >
> > > > > On Sun, 17 May 2026 12:05:45 -0700
> > > > > Kuniyuki Iwashima <kuniyu@google.com> wrote:
> > > > >
> > > > > > On Sun, May 17, 2026 at 11:41 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> > > > > > >
> > > > > > > Once a socket enters repair mode (TCP_REPAIR socket option with
> > > > > > > TCP_REPAIR_ON value), it's possible to dump the receive sequence
> > > > > > > number (TCP_QUEUE_SEQ) and the contents of the receive queue itself
> > > > > > > (using TCP_REPAIR_QUEUE to select it).
> > > > > > >
> > > > > > > If we receive data after the application fetched the sequence number
> > > > > > > or saved the contents of the queue, though, the application will now
> > > > > > > have outdated information, which defeats the whole functionality,
> > > > > > > because this leads to gaps in sequence and data once they're restored
> > > > > > > by the target instance of the application, resulting in a hanging or
> > > > > > > otherwise non-functional TCP connection.
> > > > > > >
> > > > > > > This type of race condition was discovered in the KubeVirt integration
> > > > > > > of passt(1), using a remote iperf3 client connected to an iperf3
> > > > > > > server running in the guest which is being migrated. The setup allows
> > > > > > > traffic to reach the origin node hosting the guest during the
> > > > > > > migration.
> > > > > > >
> > > > > > > If passt dumps sequence number and contents of the queue *before*
> > > > > > > further data is received and acknowledged to the peer by the kernel,
> > > > > > > once the TCP data connection is migrated to the target node, the
> > > > > > > remote client becomes unable to continue sending, because a portion
> > > > > > > of the data it sent *and received an acknowledgement for* is now lost.
> > > > > > >
> > > > > > > Schematically:
> > > > > > >
> > > > > > > 1. client --seq 1:100--> origin host --> passt --> guest --> server
> > > > > > >
> > > > > > > 2. client <--ACK: 100-- origin host
> > > > > > >
> > > > > > > 3. migration starts,
> > > > > >
> > > > > > Here, a netfilter rule or bpf prog must be installed to
> > > > > > drop packets temporarily until migration completes.
> > > > >
> > > > > Thanks for the review.
> > > > >
> > > > > I have to say it's rather unexpected to have to work around obvious
> > > > > kernel issues in userspace: TCP_REPAIR implies that queues are frozen,
> > > > > and this is handled correctly on the sending side (see
> > > > > tcp_write_xmit()), but was clearly forgotten on the receiving side.
> > > >
> > > > Have you contacted TCP repair authors for best practices?
> > >
> > > I Cc'ed them here, Pavel is the author (but as far as I understand not
> > > active in kernel development anymore), and I know that Andrei did some
> > > substantial work on it in the past, so he's Cc'ed here as well.
> > >
> > > But we are following what CRIU (the userspace reference implementation)
> > > does, and CRIU would be affected by this issue as well (depending on
> > > usage).
> >
> > Before extracting the socket state, CRIU uses netfilter (iptables or
> > nftables) to block all traffic associated with the specific TCP
> > connection or, in the case of a container, the entire network namespace.
>
> ...by default, yes, by "depending on usage" I actually meant
> '--network-lock skip', but I have to admit I'm not sure how commonly
> that is used.
>
> > This approach provides two main benefits:
> >
> > During the dump, we don't need to leave all sockets in repair mode for
> > the entire duration of the dump.  We enable and disable repair mode just
> > to grab the state. It's simplify a roll back if something goes wrong.
>
> Thanks for clarifying this aspect, I wasn't sure whether it was done to
> make rollback easier.

I would rephrase that: it likely wasn't implemented originally because
Pavel may not have considered it necessary. When checkpointing a
container with a separate network namespace, blocking all traffic within
that namespace is straightforward. If CRIU crashes or is killed during a
dump, we simply unblock the traffic, and everything continues to
function without needing to manage individual connections.

As I mentioned before, I have no objections to the idea itself, moreover
it looks quite rational and will make the code safer regarding undesired
side effects/race conditions. Thanks for working on that.

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2026-05-19 18:43 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-17 18:41 [PATCH net 0/2] Fix race condition between TCP_REPAIR dump and data receive Stefano Brivio
2026-05-17 18:41 ` [PATCH net 1/2] tcp: Don't accept data when socket is in repair mode Stefano Brivio
2026-05-17 19:05   ` Kuniyuki Iwashima
2026-05-17 19:53     ` Stefano Brivio
2026-05-17 21:01       ` Kuniyuki Iwashima
2026-05-18  5:32         ` David Gibson
2026-05-18  7:57       ` Eric Dumazet
2026-05-18 11:28         ` Stefano Brivio
2026-05-19 15:36           ` Andrei Vagin
2026-05-19 16:11             ` Stefano Brivio
2026-05-19 18:42               ` Andrei Vagin
2026-05-17 18:41 ` [PATCH net 2/2] selftests: Add data path tests for TCP_REPAIR mode Stefano Brivio

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox