[PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race
@ 2025-06-27 14:27 Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

We go a crash report pointing at __nf_ct_delete_from_lists.
While I've been unable to reproduce this, there appears to be a race,
IPS_CONFIRMED bit is set too early and can cause datapath or gc worker
to unlink an entry that hasn't been fully initialised.

The last patch is the actual fix, the first 3 patches extend and add
a few more conntrack tests to exercise clash resolution for udp.

Florian Westphal (4):
  selftests: netfilter: conntrack_resize.sh: extend resize test
  selftests: netfilter: add conntrack clash resolution test case
  selftests: netfilter: conntrack_resize.sh: also use udpclash tool
  netfilter: nf_conntrack: fix crash due to removal of uninitialised
    entry

 include/net/netfilter/nf_conntrack.h          |  15 +-
 net/netfilter/nf_conntrack_core.c             |  18 +-
 .../selftests/net/netfilter/.gitignore        |   1 +
 .../testing/selftests/net/netfilter/Makefile  |   3 +
 .../net/netfilter/conntrack_clash.sh          | 175 ++++++++++++++++++
 .../net/netfilter/conntrack_resize.sh         |  97 +++++++++-
 .../selftests/net/netfilter/udpclash.c        | 158 ++++++++++++++++
 7 files changed, 454 insertions(+), 13 deletions(-)
 create mode 100755 tools/testing/selftests/net/netfilter/conntrack_clash.sh
 create mode 100644 tools/testing/selftests/net/netfilter/udpclash.c

-- 
2.49.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test
  2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case Florian Westphal
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Extend the resize test:
 - continuously dump table both via /proc and ctnetlink interfaces while
   table is resized in a loop.
 - if socat is available, send udp packets in additon to ping requests.
 - increase/decrease the icmp and udp timeouts while resizes are happening.
   This makes sure we also exercise the 'ct has expired' check that happens
   on conntrack lookup.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../net/netfilter/conntrack_resize.sh         | 80 +++++++++++++++++--
 1 file changed, 75 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/net/netfilter/conntrack_resize.sh b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
index 9e033e80219e..aa1ba07eaf50 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_resize.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
@@ -12,6 +12,9 @@ tmpfile=""
 tmpfile_proc=""
 tmpfile_uniq=""
 ret=0
+have_socat=0
+
+socat -h > /dev/null && have_socat=1
 
 insert_count=2000
 [ "$KSFT_MACHINE_SLOW" = "yes" ] && insert_count=400
@@ -123,7 +126,7 @@ ctflush() {
         done
 }
 
-ctflood()
+ct_pingflood()
 {
 	local ns="$1"
 	local duration="$2"
@@ -152,6 +155,28 @@ ctflood()
 	wait
 }
 
+ct_udpflood()
+{
+	local ns="$1"
+	local duration="$2"
+	local now=$(date +%s)
+	local end=$((now + duration))
+
+	[ $have_socat -ne "1" ] && return
+
+        while [ $now -lt $end ]; do
+ip netns exec "$ns" bash<<"EOF"
+	for i in $(seq 1 100);do
+		dport=$(((RANDOM%65536)+1))
+
+		echo bar | socat -u STDIN UDP:"127.0.0.1:$dport" &
+	done > /dev/null 2>&1
+	wait
+EOF
+		now=$(date +%s)
+	done
+}
+
 # dump to /dev/null.  We don't want dumps to cause infinite loops
 # or use-after-free even when conntrack table is altered while dumps
 # are in progress.
@@ -169,6 +194,48 @@ ct_nulldump()
 	wait
 }
 
+ct_nulldump_loop()
+{
+	local ns="$1"
+	local duration="$2"
+	local now=$(date +%s)
+	local end=$((now + duration))
+
+        while [ $now -lt $end ]; do
+		ct_nulldump "$ns"
+		sleep $((RANDOM%2))
+		now=$(date +%s)
+	done
+}
+
+change_timeouts()
+{
+	local ns="$1"
+	local r1=$((RANDOM%2))
+	local r2=$((RANDOM%2))
+
+	[ "$r1" -eq 1 ] && ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=$((RANDOM%5))
+	[ "$r2" -eq 1 ] && ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_udp_timeout=$((RANDOM%5))
+}
+
+ct_change_timeouts_loop()
+{
+	local ns="$1"
+	local duration="$2"
+	local now=$(date +%s)
+	local end=$((now + duration))
+
+        while [ $now -lt $end ]; do
+		change_timeouts "$ns"
+		sleep $((RANDOM%2))
+		now=$(date +%s)
+	done
+
+	# restore defaults
+	ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=30
+	ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_udp_timeout=30
+}
+
 check_taint()
 {
 	local tainted_then="$1"
@@ -198,10 +265,13 @@ insert_flood()
 
 	r=$((RANDOM%$insert_count))
 
-	ctflood "$n" "$timeout" "floodresize" &
+	ct_pingflood "$n" "$timeout" "floodresize" &
+	ct_udpflood "$n" "$timeout" &
+
 	insert_ctnetlink "$n" "$r" &
 	ctflush "$n" "$timeout" &
-	ct_nulldump "$n" &
+	ct_nulldump_loop "$n" "$timeout" &
+	ct_change_timeouts_loop "$n" "$timeout" &
 
 	wait
 }
@@ -306,7 +376,7 @@ test_dump_all()
 
 	ip netns exec "$nsclient1" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=3600
 
-	ctflood "$nsclient1" $timeout "dumpall" &
+	ct_pingflood "$nsclient1" $timeout "dumpall" &
 	insert_ctnetlink "$nsclient2" $insert_count
 
 	wait
@@ -368,7 +438,7 @@ test_conntrack_disable()
 	ct_flush_once "$nsclient1"
 	ct_flush_once "$nsclient2"
 
-	ctflood "$nsclient1" "$timeout" "conntrack disable"
+	ct_pingflood "$nsclient1" "$timeout" "conntrack disable"
 	ip netns exec "$nsclient2" ping -q -c 1 127.0.0.1 >/dev/null 2>&1
 
 	# Disabled, should not have picked up any connection.
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case
  2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Add a dedicated test to exercise conntrack clash resolution path.
Test program emits 128 identical udp packets in parallel, then reads
back replies from socat echo server.

Also check (via conntrack -S) that the clash path was hit at least once.
Due to the racy nature of the test its possible that despite the
threaded program all packets were processed in-order or on same cpu,
emit a SKIP warning in this case.

Two tests are added:
 - one to test the simpler, non-nat case
 - one to exercise clash resolution where packets
   might have different nat transformations attached to them.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/netfilter/.gitignore        |   1 +
 .../testing/selftests/net/netfilter/Makefile  |   3 +
 .../net/netfilter/conntrack_clash.sh          | 175 ++++++++++++++++++
 .../selftests/net/netfilter/udpclash.c        | 158 ++++++++++++++++
 4 files changed, 337 insertions(+)
 create mode 100755 tools/testing/selftests/net/netfilter/conntrack_clash.sh
 create mode 100644 tools/testing/selftests/net/netfilter/udpclash.c

diff --git a/tools/testing/selftests/net/netfilter/.gitignore b/tools/testing/selftests/net/netfilter/.gitignore
index 64c4f8d9aa6c..5d2be9a00627 100644
--- a/tools/testing/selftests/net/netfilter/.gitignore
+++ b/tools/testing/selftests/net/netfilter/.gitignore
@@ -5,3 +5,4 @@ conntrack_dump_flush
 conntrack_reverse_clash
 sctp_collision
 nf_queue
+udpclash
diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index e9b2f553588d..a98ed892f55f 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -15,6 +15,7 @@ TEST_PROGS += conntrack_tcp_unreplied.sh
 TEST_PROGS += conntrack_resize.sh
 TEST_PROGS += conntrack_sctp_collision.sh
 TEST_PROGS += conntrack_vrf.sh
+TEST_PROGS += conntrack_clash.sh
 TEST_PROGS += conntrack_reverse_clash.sh
 TEST_PROGS += ipvs.sh
 TEST_PROGS += nf_conntrack_packetdrill.sh
@@ -44,6 +45,7 @@ TEST_GEN_FILES += connect_close nf_queue
 TEST_GEN_FILES += conntrack_dump_flush
 TEST_GEN_FILES += conntrack_reverse_clash
 TEST_GEN_FILES += sctp_collision
+TEST_GEN_FILES += udpclash
 
 include ../../lib.mk
 
@@ -52,6 +54,7 @@ $(OUTPUT)/nf_queue: LDLIBS += $(MNL_LDLIBS)
 
 $(OUTPUT)/conntrack_dump_flush: CFLAGS += $(MNL_CFLAGS)
 $(OUTPUT)/conntrack_dump_flush: LDLIBS += $(MNL_LDLIBS)
+$(OUTPUT)/udpclash: LDLIBS += -lpthread
 
 TEST_FILES := lib.sh
 TEST_FILES += packetdrill
diff --git a/tools/testing/selftests/net/netfilter/conntrack_clash.sh b/tools/testing/selftests/net/netfilter/conntrack_clash.sh
new file mode 100755
index 000000000000..3712c1b9b38b
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/conntrack_clash.sh
@@ -0,0 +1,175 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source lib.sh
+
+clash_resolution_active=0
+dport=22111
+ret=0
+
+cleanup()
+{
+	# netns cleanup also zaps any remaining socat echo server.
+	cleanup_all_ns
+}
+
+checktool "nft --version" "run test without nft"
+checktool "conntrack --version" "run test without conntrack"
+checktool "socat -h" "run test without socat"
+
+trap cleanup EXIT
+
+setup_ns nsclient1 nsclient2 nsrouter
+
+ip netns exec "$nsrouter" nft -f -<<EOF
+table ip t {
+	chain lb {
+		meta l4proto udp dnat to numgen random mod 3 map { 0 : 10.0.2.1 . 9000, 1 : 10.0.2.1 . 9001, 2 : 10.0.2.1 . 9002 }
+	}
+
+	chain prerouting {
+		type nat hook prerouting priority dstnat
+
+		udp dport $dport counter jump lb
+	}
+
+	chain output {
+		type nat hook output priority dstnat
+
+		udp dport $dport counter jump lb
+	}
+}
+EOF
+
+load_simple_ruleset()
+{
+ip netns exec "$1" nft -f -<<EOF
+table ip t {
+	chain forward {
+		type filter hook forward priority 0
+
+		ct state new counter
+	}
+}
+EOF
+}
+
+spawn_servers()
+{
+	local ns="$1"
+	local ports="9000 9001 9002"
+
+	for port in $ports; do
+		ip netns exec "$ns" socat UDP-RECVFROM:$port,fork PIPE 2>/dev/null &
+	done
+
+	for port in $ports; do
+		wait_local_port_listen "$ns" $port udp
+	done
+}
+
+add_addr()
+{
+	local ns="$1"
+	local dev="$2"
+	local i="$3"
+	local j="$4"
+
+	ip -net "$ns" link set "$dev" up
+	ip -net "$ns" addr add "10.0.$i.$j/24" dev "$dev"
+}
+
+ping_test()
+{
+	local ns="$1"
+	local daddr="$2"
+
+	if ! ip netns exec "$ns" ping -q -c 1 $daddr > /dev/null;then
+		echo "FAIL: ping from $ns to $daddr"
+		exit 1
+	fi
+}
+
+run_one_clash_test()
+{
+	local ns="$1"
+	local daddr="$2"
+	local dport="$3"
+	local entries
+	local cre
+
+	if ! ip netns exec "$ns" ./udpclash $daddr $dport;then
+		echo "FAIL: did not receive expected number of replies for $daddr:$dport"
+		ret=1
+		return 1
+	fi
+
+	entries=$(conntrack -S | wc -l)
+	cre=$(conntrack -S | grep -v "clash_resolve=0" | wc -l)
+
+	if [ "$cre" -ne "$entries" ] ;then
+		clash_resolution_active=1
+		return 0
+	fi
+
+	# 1 cpu -> parallel insertion impossible
+	if [ "$entries" -eq 1 ]; then
+		return 0
+	fi
+
+	# not a failure: clash resolution logic did not trigger, but all replies
+	# were received.  With right timing, xmit completed sequentially and
+	# no parallel insertion occurs.
+	return $ksft_skip
+}
+
+run_clash_test()
+{
+	local ns="$1"
+	local daddr="$2"
+	local dport="$3"
+
+	for i in $(seq 1 10);do
+		run_one_clash_test "$ns" "$daddr" "$dport"
+		local rv=$?
+		if [ $rv -eq 0 ];then
+			echo "PASS: clash resolution test for $daddr:$dport on attempt $i"
+			return 0
+		elif [ $rv -eq 1 ];then
+			echo "FAIL: clash resolution test for $daddr:$dport on attempt $i"
+			return 1
+		fi
+	done
+}
+
+ip link add veth0 netns "$nsclient1" type veth peer name veth0 netns "$nsrouter"
+ip link add veth0 netns "$nsclient2" type veth peer name veth1 netns "$nsrouter"
+add_addr "$nsclient1" veth0 1 1
+add_addr "$nsclient2" veth0 2 1
+add_addr "$nsrouter" veth0 1 99
+add_addr "$nsrouter" veth1 2 99
+
+ip -net "$nsclient1" route add default via 10.0.1.99
+ip -net "$nsclient2" route add default via 10.0.2.99
+ip netns exec "$nsrouter" sysctl -q net.ipv4.ip_forward=1
+
+ping_test "$nsclient1" 10.0.1.99
+ping_test "$nsclient1" 10.0.2.1
+ping_test "$nsclient2" 10.0.1.1
+
+spawn_servers "$nsclient2"
+
+# exercise clash resolution with nat:
+# nsrouter is supposed to dnat to 10.0.2.1:900{0,1,2,3}.
+run_clash_test "$nsclient1" 10.0.1.99 "$dport"
+
+# exercise clash resolution without nat.
+load_simple_ruleset "$nsclient2"
+run_clash_test "$nsclient2" 127.0.0.1 9001
+
+if [ $clash_resolution_active -eq 0 ];then
+	[ "$ret" -eq 0 ] && ret=$ksft_skip
+	echo "SKIP: Clash resolution did not trigger"
+fi
+
+exit $ret
diff --git a/tools/testing/selftests/net/netfilter/udpclash.c b/tools/testing/selftests/net/netfilter/udpclash.c
new file mode 100644
index 000000000000..85c7b906ad08
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/udpclash.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Usage: ./udpclash <IP> <PORT>
+ *
+ * Emit THREAD_COUNT UDP packets sharing the same saddr:daddr pair.
+ *
+ * This mimics DNS resolver libraries that emit A and AAAA requests
+ * in parallel.
+ *
+ * This exercises conntrack clash resolution logic added and later
+ * refined in
+ *
+ *  71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
+ *  ed07d9a021df ("netfilter: nf_conntrack: resolve clash for matching conntracks")
+ *  6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries")
+ */
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <sys/socket.h>
+#include <pthread.h>
+
+#define THREAD_COUNT 128
+
+struct thread_args {
+	const struct sockaddr_in *si_remote;
+	int sockfd;
+};
+
+static int wait = 1;
+
+static void *thread_main(void *varg)
+{
+	const struct sockaddr_in *si_remote;
+	const struct thread_args *args = varg;
+	static const char msg[] = "foo";
+
+	si_remote = args->si_remote;
+
+	while (wait == 1)
+		;
+
+	if (sendto(args->sockfd, msg, strlen(msg), MSG_NOSIGNAL,
+		   (struct sockaddr *)si_remote, sizeof(*si_remote)) < 0)
+		exit(111);
+
+	return varg;
+}
+
+static int run_test(int fd, const struct sockaddr_in *si_remote)
+{
+	struct thread_args thread_args = {
+		.si_remote = si_remote,
+		.sockfd = fd,
+	};
+	pthread_t *tid = calloc(THREAD_COUNT, sizeof(pthread_t));
+	unsigned int repl_count = 0, timeout = 0;
+	int i;
+
+	if (!tid) {
+		perror("calloc");
+		return 1;
+	}
+
+	for (i = 0; i < THREAD_COUNT; i++) {
+		int err = pthread_create(&tid[i], NULL, &thread_main, &thread_args);
+
+		if (err != 0) {
+			perror("pthread_create");
+			exit(1);
+		}
+	}
+
+	wait = 0;
+
+	for (i = 0; i < THREAD_COUNT; i++)
+		pthread_join(tid[i], NULL);
+
+	while (repl_count < THREAD_COUNT) {
+		struct sockaddr_in si_repl;
+		socklen_t si_repl_len = sizeof(si_repl);
+		char repl[512];
+		ssize_t ret;
+
+		ret = recvfrom(fd, repl, sizeof(repl), MSG_NOSIGNAL,
+			       (struct sockaddr *) &si_repl, &si_repl_len);
+		if (ret < 0) {
+			if (timeout++ > 5000) {
+				fputs("timed out while waiting for reply from thread\n", stderr);
+				break;
+			}
+
+			/* give reply time to pass though the stack */
+			usleep(1000);
+			continue;
+		}
+
+		if (si_repl_len != sizeof(*si_remote)) {
+			fprintf(stderr, "warning: reply has unexpected repl_len %d vs %d\n",
+				(int)si_repl_len, (int)sizeof(si_repl));
+		} else if (si_remote->sin_addr.s_addr != si_repl.sin_addr.s_addr ||
+			si_remote->sin_port != si_repl.sin_port) {
+			char a[64], b[64];
+
+			inet_ntop(AF_INET, &si_remote->sin_addr, a, sizeof(a));
+			inet_ntop(AF_INET, &si_repl.sin_addr, b, sizeof(b));
+
+			fprintf(stderr, "reply from wrong source: want %s:%d got %s:%d\n",
+				a, ntohs(si_remote->sin_port), b, ntohs(si_repl.sin_port));
+		}
+
+		repl_count++;
+	}
+
+	printf("got %d of %d replies\n", repl_count, THREAD_COUNT);
+
+	free(tid);
+
+	return repl_count == THREAD_COUNT ? 0 : 1;
+}
+
+int main(int argc, char *argv[])
+{
+	struct sockaddr_in si_local = {
+		.sin_family = AF_INET,
+	};
+	struct sockaddr_in si_remote = {
+		.sin_family = AF_INET,
+	};
+	int fd, ret;
+
+	if (argc < 3) {
+		fputs("Usage: send_udp <daddr> <dport>\n", stderr);
+		return 1;
+	}
+
+	si_remote.sin_port = htons(atoi(argv[2]));
+	si_remote.sin_addr.s_addr = inet_addr(argv[1]);
+
+	fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_UDP);
+	if (fd < 0) {
+		perror("socket");
+		return 1;
+	}
+
+	if (bind(fd, (struct sockaddr *)&si_local, sizeof(si_local)) < 0) {
+		perror("bind");
+		return 1;
+	}
+
+	ret = run_test(fd, &si_remote);
+
+	close(fd);
+
+	return ret;
+}
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool
  2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
  2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
  3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Previous patch added a new clash resolution test case.
Also use this during conntrack resize stress test in addition
to icmp ping flood.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 .../selftests/net/netfilter/conntrack_resize.sh | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/net/netfilter/conntrack_resize.sh b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
index aa1ba07eaf50..788cd56ea4a0 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_resize.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
@@ -177,6 +177,22 @@ EOF
 	done
 }
 
+ct_udpclash()
+{
+	local ns="$1"
+	local duration="$2"
+	local now=$(date +%s)
+	local end=$((now + duration))
+
+	[ -x udpclash ] || return
+
+        while [ $now -lt $end ]; do
+		ip netns exec "$ns" ./udpclash 127.0.0.1 $((RANDOM%65536)) > /dev/null 2>&1
+
+		now=$(date +%s)
+	done
+}
+
 # dump to /dev/null.  We don't want dumps to cause infinite loops
 # or use-after-free even when conntrack table is altered while dumps
 # are in progress.
@@ -267,6 +283,7 @@ insert_flood()
 
 	ct_pingflood "$n" "$timeout" "floodresize" &
 	ct_udpflood "$n" "$timeout" &
+	ct_udpclash "$n" "$timeout" &
 
 	insert_ctnetlink "$n" "$r" &
 	ctflush "$n" "$timeout" &
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
                   ` (2 preceding siblings ...)
  2025-06-27 14:27 ` [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
  2025-07-03 13:56   ` Pablo Neira Ayuso
  3 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
 #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
 #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
 #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:

 ct hlist pointer is garbage; looks like the ct hash value
 (hence crash).
 ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
 ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like normal udp conntrack entry.  If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
  - ct hlist pointers are overloaded and store/cache the raw tuple hash
  - ct->timeout matches the relative time expected for a new udp flow
    rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.

Theory is that we did hit following race:

cpu x 			cpu y			cpu z
 found entry E		found entry E
 E is expired		<preemption>
 nf_ct_delete()
 return E to rcu slab
					init_conntrack
					E is re-inited,
					ct->status set to 0
					reply tuplehash hnnode.pprev
					stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z.  cpu y was preempted before
checking for expiry and/or confirm bit.

					->refcnt set to 1
					E now owned by skb
					->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

					nf_conntrack_confirm gets called
					sets: ct->status |= CONFIRMED
					This is wrong: E is not yet added
					to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:
			<resumes>
			nf_ct_expired()
			 -> yes (ct->timeout is 30s)
			confirmed bit set.

cpu y will try to delete E from the hashtable:
			nf_ct_delete() -> set DYING bit
			__nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

			wait for spinlock held by z

					CONFIRMED is set but there is no
					guarantee ct will be added to hash:
					"chaintoolong" or "clash resolution"
					logic both skip the insert step.
					reply hnnode.pprev still stores the
					hash value.

					unlocks spinlock
					return NF_DROP
			<unblocks, then
			 crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.

It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:

Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:
 1. Check if entry has expired, if not skip to next entry
 2. Obtain a reference to the expired entry.
 3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time.  Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

 1. Check if entry has expired.
 2. Obtain a reference.
 3. Call nf_ct_should_gc() to double-check step 1:
    4 - entry is still observed as expired
    5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
      and confirm bit gets set
    6 - confirm bit is seen
    7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.

This change cannot be backported to releases before 5.19. Without
commit 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.

Fixes: 1397af5bfd7d ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/netfilter/nf_conntrack.h | 15 +++++++++++++--
 net/netfilter/nf_conntrack_core.c    | 18 ++++++++++++------
 2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3f02a45773e8..ca26274196b9 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -306,8 +306,19 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 /* use after obtaining a reference count */
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
-	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
-	       !nf_ct_is_dying(ct);
+	if (!nf_ct_is_confirmed(ct))
+		return false;
+
+	/* load ct->timeout after is_confirmed() test.
+	 * Pairs with __nf_conntrack_confirm() which:
+	 * 1. Increases ct->timeout value
+	 * 2. Inserts ct into rcu hlist
+	 * 3. Sets the confirmed bit
+	 * 4. Unlocks the hlist lock
+	 */
+	smp_acquire__after_ctrl_dep();
+
+	return nf_ct_is_expired(ct) && !nf_ct_is_dying(ct);
 }
 
 #define	NF_CT_DAY	(86400 * HZ)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 201d3c4ec623..442a972a03d7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1124,6 +1124,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx)
 
 	hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
 				 &nf_conntrack_hash[repl_idx]);
+	loser_ct->status |= IPS_CONFIRMED;
 
 	NF_CT_STAT_INC(net, clash_resolve);
 	return NF_ACCEPT;
@@ -1260,8 +1261,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	 * user context, else we insert an already 'dead' hash, blocking
 	 * further use of that particular connection -JM.
 	 */
-	ct->status |= IPS_CONFIRMED;
-
 	if (unlikely(nf_ct_is_dying(ct))) {
 		NF_CT_STAT_INC(net, insert_failed);
 		goto dying;
@@ -1293,7 +1292,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 		}
 	}
 
-	/* Timer relative to confirmation time, not original
+	/* Timeout is relative to confirmation time, not original
 	   setting time, otherwise we'd get timer wrap in
 	   weird delay cases. */
 	ct->timeout += nfct_time_stamp;
@@ -1301,11 +1300,18 @@ __nf_conntrack_confirm(struct sk_buff *skb)
 	__nf_conntrack_insert_prepare(ct);
 
 	/* Since the lookup is lockless, hash insertion must be done after
-	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
-	 * guarantee that no other CPU can find the conntrack before the above
-	 * stores are visible.
+	 * setting ct->timeout. The RCU barriers guarantee that no other CPU
+	 * can find the conntrack before the above stores are visible.
 	 */
 	__nf_conntrack_hash_insert(ct, hash, reply_hash);
+
+	/* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
+	 * skip entries that lack this bit.  This happens when a CPU is looking
+	 * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
+	 * or when another CPU encounters this entry right after the insertion
+	 * but before the set-confirm-bit below.
+	 */
+	ct->status |= IPS_CONFIRMED;
 	nf_conntrack_double_unlock(hash, reply_hash);
 	local_bh_enable();
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
@ 2025-07-03 13:56   ` Pablo Neira Ayuso
  2025-07-03 14:21     ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-03 13:56 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi FLorian,

Thanks for the description, this scenario is esoteric.

Is this bug fully reproducible?

More questions below.

On Fri, Jun 27, 2025 at 04:27:53PM +0200, Florian Westphal wrote:
> A crash in conntrack was reported while trying to unlink the conntrack
> entry from the hash bucket list:
>     [exception RIP: __nf_ct_delete_from_lists+172]
>     [..]
>  #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
>  #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
>  #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
>     [..]
> 
> The nf_conn struct is marked as allocated from slab but appears to be in
> a partially initialised state:
> 
>  ct hlist pointer is garbage; looks like the ct hash value
>  (hence crash).
>  ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
>  ct->timeout is 30000 (=30s), which is unexpected.
> 
> Everything else looks like normal udp conntrack entry.  If we ignore
> ct->status and pretend its 0, the entry matches those that are newly
> allocated but not yet inserted into the hash:
>   - ct hlist pointers are overloaded and store/cache the raw tuple hash
>   - ct->timeout matches the relative time expected for a new udp flow
>     rather than the absolute 'jiffies' value.
> 
> If it were not for the presence of IPS_CONFIRMED,
> __nf_conntrack_find_get() would have skipped the entry.
> 
> Theory is that we did hit following race:
> 
> cpu x 			cpu y			cpu z
>  found entry E		found entry E
>  E is expired		<preemption>
>  nf_ct_delete()
>  return E to rcu slab
> 					init_conntrack
> 					E is re-inited,
> 					ct->status set to 0
> 					reply tuplehash hnnode.pprev
> 					stores hash value.
> 
> cpu y found E right before it was deleted on cpu x.
> E is now re-inited on cpu z.  cpu y was preempted before
> checking for expiry and/or confirm bit.
> 
> 					->refcnt set to 1
> 					E now owned by skb
> 					->timeout set to 30000
> 
> If cpu y were to resume now, it would observe E as
> expired but would skip E due to missing CONFIRMED bit.
> 
> 					nf_conntrack_confirm gets called
> 					sets: ct->status |= CONFIRMED
> 					This is wrong: E is not yet added
> 					to hashtable.
> 
> cpu y resumes, it observes E as expired but CONFIRMED:
> 			<resumes>
> 			nf_ct_expired()
> 			 -> yes (ct->timeout is 30s)
> 			confirmed bit set.
> 
> cpu y will try to delete E from the hashtable:
> 			nf_ct_delete() -> set DYING bit
> 			__nf_ct_delete_from_lists
> 
> Even this scenario doesn't guarantee a crash:
> cpu z still holds the table bucket lock(s) so y blocks:
> 
> 			wait for spinlock held by z
> 
> 					CONFIRMED is set but there is no
> 					guarantee ct will be added to hash:
> 					"chaintoolong" or "clash resolution"
> 					logic both skip the insert step.
> 					reply hnnode.pprev still stores the
> 					hash value.
> 
> 					unlocks spinlock
> 					return NF_DROP
> 			<unblocks, then
> 			 crashes on hlist_nulls_del_rcu pprev>
> 
> In case CPU z does insert the entry into the hashtable, cpu y will unlink
> E again right away but no crash occurs.
> 
> Without 'cpu y' race, 'garbage' hlist is of no consequence:
> ct refcnt remains at 1, eventually skb will be free'd and E gets
> destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
> 
> To resolve this, move the IPS_CONFIRMED assignment after the table
> insertion but before the unlock.
> 
> It doesn't matter if other CPUs can observe a newly inserted entry right
> before the CONFIRMED bit was set:
> 
> Such event cannot be distinguished from above "E is the old incarnation"
> case: the entry will be skipped.
> 
> Also change nf_ct_should_gc() to first check the confirmed bit.
> 
> The gc sequence is:
>  1. Check if entry has expired, if not skip to next entry
>  2. Obtain a reference to the expired entry.
>  3. Call nf_ct_should_gc() to double-check step 1.
> 
> nf_ct_should_gc() is thus called only for entries that already failed an
> expiry check. After this patch, once the confirmed bit check passes
> ct->timeout has been altered to reflect the absolute 'best before' date
> instead of a relative time.  Step 3 will therefore not remove the entry.
> 
> Without this change to nf_ct_should_gc() we could still get this sequence:
> 
>  1. Check if entry has expired.
>  2. Obtain a reference.
>  3. Call nf_ct_should_gc() to double-check step 1:
>     4 - entry is still observed as expired
>     5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
>       and confirm bit gets set
>     6 - confirm bit is seen
>     7 - valid entry is removed again
> 
> First do check 6), then 4) so the gc expiry check always picks up either
> confirmed bit unset (entry gets skipped) or expiry re-check failure for
> re-inited conntrack objects.
> 
> This change cannot be backported to releases before 5.19. Without
> commit 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")
> |= IPS_CONFIRMED line cannot be moved without further changes.
> 
> Fixes: 1397af5bfd7d ("netfilter: conntrack: remove the percpu dying list")
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  include/net/netfilter/nf_conntrack.h | 15 +++++++++++++--
>  net/netfilter/nf_conntrack_core.c    | 18 ++++++++++++------
>  2 files changed, 25 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index 3f02a45773e8..ca26274196b9 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -306,8 +306,19 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
>  /* use after obtaining a reference count */
>  static inline bool nf_ct_should_gc(const struct nf_conn *ct)
>  {
> -	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
> -	       !nf_ct_is_dying(ct);
> +	if (!nf_ct_is_confirmed(ct))
> +		return false;
> +
> +	/* load ct->timeout after is_confirmed() test.
> +	 * Pairs with __nf_conntrack_confirm() which:
> +	 * 1. Increases ct->timeout value
> +	 * 2. Inserts ct into rcu hlist
> +	 * 3. Sets the confirmed bit
> +	 * 4. Unlocks the hlist lock
> +	 */
> +	smp_acquire__after_ctrl_dep();
> +
> +	return nf_ct_is_expired(ct) && !nf_ct_is_dying(ct);
>  }
>  
>  #define	NF_CT_DAY	(86400 * HZ)
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 201d3c4ec623..442a972a03d7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1124,6 +1124,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx)
>  
>  	hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
>  				 &nf_conntrack_hash[repl_idx]);
> +	loser_ct->status |= IPS_CONFIRMED;
>  
>  	NF_CT_STAT_INC(net, clash_resolve);
>  	return NF_ACCEPT;
> @@ -1260,8 +1261,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>  	 * user context, else we insert an already 'dead' hash, blocking
>  	 * further use of that particular connection -JM.
>  	 */
> -	ct->status |= IPS_CONFIRMED;
> -
>  	if (unlikely(nf_ct_is_dying(ct))) {
>  		NF_CT_STAT_INC(net, insert_failed);
>  		goto dying;
> @@ -1293,7 +1292,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>  		}
>  	}
>  
> -	/* Timer relative to confirmation time, not original
> +	/* Timeout is relative to confirmation time, not original
>  	   setting time, otherwise we'd get timer wrap in
>  	   weird delay cases. */
>  	ct->timeout += nfct_time_stamp;
> @@ -1301,11 +1300,18 @@ __nf_conntrack_confirm(struct sk_buff *skb)
>  	__nf_conntrack_insert_prepare(ct);
>  
>  	/* Since the lookup is lockless, hash insertion must be done after
> -	 * starting the timer and setting the CONFIRMED bit. The RCU barriers
> -	 * guarantee that no other CPU can find the conntrack before the above
> -	 * stores are visible.
> +	 * setting ct->timeout. The RCU barriers guarantee that no other CPU
> +	 * can find the conntrack before the above stores are visible.
>  	 */
>  	__nf_conntrack_hash_insert(ct, hash, reply_hash);
> +
> +	/* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> +	 * skip entries that lack this bit.  This happens when a CPU is looking
> +	 * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> +	 * or when another CPU encounters this entry right after the insertion
> +	 * but before the set-confirm-bit below.
> +	 */
> +	ct->status |= IPS_CONFIRMED;

My understanding is that this bit setting can still be reordered.

>  	nf_conntrack_double_unlock(hash, reply_hash);
>  	local_bh_enable();
>  
> -- 
> 2.49.0
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-03 13:56   ` Pablo Neira Ayuso
@ 2025-07-03 14:21     ` Florian Westphal
  2025-07-14 13:51       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-03 14:21 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Thanks for the description, this scenario is esoteric.
> 
> Is this bug fully reproducible?

No.  Unicorn.  Only happened once.
Everything is based off reading the backtrace and vmcore.

> > +	/* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> > +	 * skip entries that lack this bit.  This happens when a CPU is looking
> > +	 * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> > +	 * or when another CPU encounters this entry right after the insertion
> > +	 * but before the set-confirm-bit below.
> > +	 */
> > +	ct->status |= IPS_CONFIRMED;
> 
> My understanding is that this bit setting can still be reordered.

You mean it could be moved before hlist_add? So this is needed?

- ct->status |= IPS_CONFIRMED;
+ smp_mb__before_atomic();
+ set_bit(IPS_CONFIRMED_BIT, &ct->status) ?

I can send a v2 with this change.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-03 14:21     ` Florian Westphal
@ 2025-07-14 13:51       ` Pablo Neira Ayuso
  2025-07-14 14:36         ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-14 13:51 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Thanks for the description, this scenario is esoteric.
> > 
> > Is this bug fully reproducible?
> 
> No.  Unicorn.  Only happened once.
> Everything is based off reading the backtrace and vmcore.

I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?

I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.

> > > +	/* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> > > +	 * skip entries that lack this bit.  This happens when a CPU is looking
> > > +	 * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> > > +	 * or when another CPU encounters this entry right after the insertion
> > > +	 * but before the set-confirm-bit below.
> > > +	 */
> > > +	ct->status |= IPS_CONFIRMED;
> > 
> > My understanding is that this bit setting can still be reordered.
> 
> You mean it could be moved before hlist_add? So this is needed?
> 
> - ct->status |= IPS_CONFIRMED;
> + smp_mb__before_atomic();
> + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
> 
> I can send a v2 with this change.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-14 13:51       ` Pablo Neira Ayuso
@ 2025-07-14 14:36         ` Florian Westphal
  2025-07-15 22:09           ` Pablo Neira Ayuso
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-14 14:36 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Thanks for the description, this scenario is esoteric.
> > > 
> > > Is this bug fully reproducible?
> > 
> > No.  Unicorn.  Only happened once.
> > Everything is based off reading the backtrace and vmcore.
> 
> I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?

I would not hold my breath.  But I don't see anything that prevents the
race described in 4/4, and all the things match in the vmcore, including
increment of clash resolution counter.  If you think its too perfect
then ok, we can keep 4/4 back until someone else reports this problem
again.

> I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.

3/4 needs 2/4 present as well.  I can then resend 4/4 then with the

> > - ct->status |= IPS_CONFIRMED;
> > + smp_mb__before_atomic();
> > + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?

change.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-14 14:36         ` Florian Westphal
@ 2025-07-15 22:09           ` Pablo Neira Ayuso
  2025-07-16 15:59             ` Florian Westphal
  0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-15 22:09 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Thanks for the description, this scenario is esoteric.
> > > > 
> > > > Is this bug fully reproducible?
> > > 
> > > No.  Unicorn.  Only happened once.
> > > Everything is based off reading the backtrace and vmcore.
> > 
> > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
> 
> I would not hold my breath.  But I don't see anything that prevents the
> race described in 4/4, and all the things match in the vmcore, including
> increment of clash resolution counter.  If you think its too perfect
> then ok, we can keep 4/4 back until someone else reports this problem
> again.

Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
that allows for this to occur.

Could this rare sequence still happen?

cpu x                   cpu y                   cpu z
 found entry E          found entry E
 E is expired           <preemption>
 nf_ct_delete()
 return E to rcu slab
                                        init_conntrack
                                        <preemption>     NOTE: ct->status not yet set to zero

cpu y resumes, it observes E as expired but CONFIRMED:
                        <resumes>
                        nf_ct_expired()
                         -> yes (ct->timeout is 30s)
                        confirmed bit set.

> > I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.
> 
> 3/4 needs 2/4 present as well.  I can then resend 4/4 then with the

Right, I accidentally skipped that test, it should be also included.

> > > - ct->status |= IPS_CONFIRMED;
> > > + smp_mb__before_atomic();
> > > + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
> 
> change.

If the status bit is used to synchronize the different threads,
I agree this needs to be set_bit(). But I am not sure yet this is
sufficient yet.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-15 22:09           ` Pablo Neira Ayuso
@ 2025-07-16 15:59             ` Florian Westphal
  2025-07-16 17:00               ` Pablo Neira Ayuso
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-16 15:59 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > > Thanks for the description, this scenario is esoteric.
> > > > > 
> > > > > Is this bug fully reproducible?
> > > > 
> > > > No.  Unicorn.  Only happened once.
> > > > Everything is based off reading the backtrace and vmcore.
> > > 
> > > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
> > 
> > I would not hold my breath.  But I don't see anything that prevents the
> > race described in 4/4, and all the things match in the vmcore, including
> > increment of clash resolution counter.  If you think its too perfect
> > then ok, we can keep 4/4 back until someone else reports this problem
> > again.
> 
> Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
> that allows for this to occur.
> 
> Could this rare sequence still happen?
> 
> cpu x                   cpu y                   cpu z
>  found entry E          found entry E
>  E is expired           <preemption>
>  nf_ct_delete()
>  return E to rcu slab
>                                         init_conntrack
>                                         <preemption>     NOTE: ct->status not yet set to zero
> 
> cpu y resumes, it observes E as expired but CONFIRMED:
>                         <resumes>
>                         nf_ct_expired()
>                          -> yes (ct->timeout is 30s)
>                         confirmed bit set.

Yes, that can happen, but then the refcount can't be incremented
as its 0 (-> entry is skipped). If its nonzero but the object was returned
by the kmem cache we have a different kind of bug (free with refcount > 0),
or use-after-free.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
  2025-07-16 15:59             ` Florian Westphal
@ 2025-07-16 17:00               ` Pablo Neira Ayuso
  0 siblings, 0 replies; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-16 17:00 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Wed, Jul 16, 2025 at 05:59:41PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > > > Thanks for the description, this scenario is esoteric.
> > > > > > 
> > > > > > Is this bug fully reproducible?
> > > > > 
> > > > > No.  Unicorn.  Only happened once.
> > > > > Everything is based off reading the backtrace and vmcore.
> > > > 
> > > > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
> > > 
> > > I would not hold my breath.  But I don't see anything that prevents the
> > > race described in 4/4, and all the things match in the vmcore, including
> > > increment of clash resolution counter.  If you think its too perfect
> > > then ok, we can keep 4/4 back until someone else reports this problem
> > > again.
> > 
> > Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
> > that allows for this to occur.
> > 
> > Could this rare sequence still happen?
> > 
> > cpu x                   cpu y                   cpu z
> >  found entry E          found entry E
> >  E is expired           <preemption>
> >  nf_ct_delete()
> >  return E to rcu slab
> >                                         init_conntrack
> >                                         <preemption>     NOTE: ct->status not yet set to zero
> > 
> > cpu y resumes, it observes E as expired but CONFIRMED:
> >                         <resumes>
> >                         nf_ct_expired()
> >                          -> yes (ct->timeout is 30s)
> >                         confirmed bit set.
> 
> Yes, that can happen, but then the refcount can't be incremented
> as its 0 (-> entry is skipped).

Right, refcount zero prevents it.

static void nf_ct_gc_expired(struct nf_conn *ct)
{
        if (!refcount_inc_not_zero(&ct->ct_general.use))
                return;


> If its nonzero but the object was returned
> by the kmem cache we have a different kind of bug (free with refcount > 0),
> or use-after-free.

OK, thanks for explaining, use set_bit() and post v2.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-07-16 17:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
2025-06-27 14:27 ` [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case Florian Westphal
2025-06-27 14:27 ` [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool Florian Westphal
2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
2025-07-03 13:56   ` Pablo Neira Ayuso
2025-07-03 14:21     ` Florian Westphal
2025-07-14 13:51       ` Pablo Neira Ayuso
2025-07-14 14:36         ` Florian Westphal
2025-07-15 22:09           ` Pablo Neira Ayuso
2025-07-16 15:59             ` Florian Westphal
2025-07-16 17:00               ` Pablo Neira Ayuso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).