* [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test
2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
2025-06-27 14:27 ` [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case Florian Westphal
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Extend the resize test:
- continuously dump table both via /proc and ctnetlink interfaces while
table is resized in a loop.
- if socat is available, send udp packets in additon to ping requests.
- increase/decrease the icmp and udp timeouts while resizes are happening.
This makes sure we also exercise the 'ct has expired' check that happens
on conntrack lookup.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
.../net/netfilter/conntrack_resize.sh | 80 +++++++++++++++++--
1 file changed, 75 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/net/netfilter/conntrack_resize.sh b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
index 9e033e80219e..aa1ba07eaf50 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_resize.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
@@ -12,6 +12,9 @@ tmpfile=""
tmpfile_proc=""
tmpfile_uniq=""
ret=0
+have_socat=0
+
+socat -h > /dev/null && have_socat=1
insert_count=2000
[ "$KSFT_MACHINE_SLOW" = "yes" ] && insert_count=400
@@ -123,7 +126,7 @@ ctflush() {
done
}
-ctflood()
+ct_pingflood()
{
local ns="$1"
local duration="$2"
@@ -152,6 +155,28 @@ ctflood()
wait
}
+ct_udpflood()
+{
+ local ns="$1"
+ local duration="$2"
+ local now=$(date +%s)
+ local end=$((now + duration))
+
+ [ $have_socat -ne "1" ] && return
+
+ while [ $now -lt $end ]; do
+ip netns exec "$ns" bash<<"EOF"
+ for i in $(seq 1 100);do
+ dport=$(((RANDOM%65536)+1))
+
+ echo bar | socat -u STDIN UDP:"127.0.0.1:$dport" &
+ done > /dev/null 2>&1
+ wait
+EOF
+ now=$(date +%s)
+ done
+}
+
# dump to /dev/null. We don't want dumps to cause infinite loops
# or use-after-free even when conntrack table is altered while dumps
# are in progress.
@@ -169,6 +194,48 @@ ct_nulldump()
wait
}
+ct_nulldump_loop()
+{
+ local ns="$1"
+ local duration="$2"
+ local now=$(date +%s)
+ local end=$((now + duration))
+
+ while [ $now -lt $end ]; do
+ ct_nulldump "$ns"
+ sleep $((RANDOM%2))
+ now=$(date +%s)
+ done
+}
+
+change_timeouts()
+{
+ local ns="$1"
+ local r1=$((RANDOM%2))
+ local r2=$((RANDOM%2))
+
+ [ "$r1" -eq 1 ] && ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=$((RANDOM%5))
+ [ "$r2" -eq 1 ] && ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_udp_timeout=$((RANDOM%5))
+}
+
+ct_change_timeouts_loop()
+{
+ local ns="$1"
+ local duration="$2"
+ local now=$(date +%s)
+ local end=$((now + duration))
+
+ while [ $now -lt $end ]; do
+ change_timeouts "$ns"
+ sleep $((RANDOM%2))
+ now=$(date +%s)
+ done
+
+ # restore defaults
+ ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=30
+ ip netns exec "$ns" sysctl -q net.netfilter.nf_conntrack_udp_timeout=30
+}
+
check_taint()
{
local tainted_then="$1"
@@ -198,10 +265,13 @@ insert_flood()
r=$((RANDOM%$insert_count))
- ctflood "$n" "$timeout" "floodresize" &
+ ct_pingflood "$n" "$timeout" "floodresize" &
+ ct_udpflood "$n" "$timeout" &
+
insert_ctnetlink "$n" "$r" &
ctflush "$n" "$timeout" &
- ct_nulldump "$n" &
+ ct_nulldump_loop "$n" "$timeout" &
+ ct_change_timeouts_loop "$n" "$timeout" &
wait
}
@@ -306,7 +376,7 @@ test_dump_all()
ip netns exec "$nsclient1" sysctl -q net.netfilter.nf_conntrack_icmp_timeout=3600
- ctflood "$nsclient1" $timeout "dumpall" &
+ ct_pingflood "$nsclient1" $timeout "dumpall" &
insert_ctnetlink "$nsclient2" $insert_count
wait
@@ -368,7 +438,7 @@ test_conntrack_disable()
ct_flush_once "$nsclient1"
ct_flush_once "$nsclient2"
- ctflood "$nsclient1" "$timeout" "conntrack disable"
+ ct_pingflood "$nsclient1" "$timeout" "conntrack disable"
ip netns exec "$nsclient2" ping -q -c 1 127.0.0.1 >/dev/null 2>&1
# Disabled, should not have picked up any connection.
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case
2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
2025-06-27 14:27 ` [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool Florian Westphal
2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Add a dedicated test to exercise conntrack clash resolution path.
Test program emits 128 identical udp packets in parallel, then reads
back replies from socat echo server.
Also check (via conntrack -S) that the clash path was hit at least once.
Due to the racy nature of the test its possible that despite the
threaded program all packets were processed in-order or on same cpu,
emit a SKIP warning in this case.
Two tests are added:
- one to test the simpler, non-nat case
- one to exercise clash resolution where packets
might have different nat transformations attached to them.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
.../selftests/net/netfilter/.gitignore | 1 +
.../testing/selftests/net/netfilter/Makefile | 3 +
.../net/netfilter/conntrack_clash.sh | 175 ++++++++++++++++++
.../selftests/net/netfilter/udpclash.c | 158 ++++++++++++++++
4 files changed, 337 insertions(+)
create mode 100755 tools/testing/selftests/net/netfilter/conntrack_clash.sh
create mode 100644 tools/testing/selftests/net/netfilter/udpclash.c
diff --git a/tools/testing/selftests/net/netfilter/.gitignore b/tools/testing/selftests/net/netfilter/.gitignore
index 64c4f8d9aa6c..5d2be9a00627 100644
--- a/tools/testing/selftests/net/netfilter/.gitignore
+++ b/tools/testing/selftests/net/netfilter/.gitignore
@@ -5,3 +5,4 @@ conntrack_dump_flush
conntrack_reverse_clash
sctp_collision
nf_queue
+udpclash
diff --git a/tools/testing/selftests/net/netfilter/Makefile b/tools/testing/selftests/net/netfilter/Makefile
index e9b2f553588d..a98ed892f55f 100644
--- a/tools/testing/selftests/net/netfilter/Makefile
+++ b/tools/testing/selftests/net/netfilter/Makefile
@@ -15,6 +15,7 @@ TEST_PROGS += conntrack_tcp_unreplied.sh
TEST_PROGS += conntrack_resize.sh
TEST_PROGS += conntrack_sctp_collision.sh
TEST_PROGS += conntrack_vrf.sh
+TEST_PROGS += conntrack_clash.sh
TEST_PROGS += conntrack_reverse_clash.sh
TEST_PROGS += ipvs.sh
TEST_PROGS += nf_conntrack_packetdrill.sh
@@ -44,6 +45,7 @@ TEST_GEN_FILES += connect_close nf_queue
TEST_GEN_FILES += conntrack_dump_flush
TEST_GEN_FILES += conntrack_reverse_clash
TEST_GEN_FILES += sctp_collision
+TEST_GEN_FILES += udpclash
include ../../lib.mk
@@ -52,6 +54,7 @@ $(OUTPUT)/nf_queue: LDLIBS += $(MNL_LDLIBS)
$(OUTPUT)/conntrack_dump_flush: CFLAGS += $(MNL_CFLAGS)
$(OUTPUT)/conntrack_dump_flush: LDLIBS += $(MNL_LDLIBS)
+$(OUTPUT)/udpclash: LDLIBS += -lpthread
TEST_FILES := lib.sh
TEST_FILES += packetdrill
diff --git a/tools/testing/selftests/net/netfilter/conntrack_clash.sh b/tools/testing/selftests/net/netfilter/conntrack_clash.sh
new file mode 100755
index 000000000000..3712c1b9b38b
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/conntrack_clash.sh
@@ -0,0 +1,175 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+source lib.sh
+
+clash_resolution_active=0
+dport=22111
+ret=0
+
+cleanup()
+{
+ # netns cleanup also zaps any remaining socat echo server.
+ cleanup_all_ns
+}
+
+checktool "nft --version" "run test without nft"
+checktool "conntrack --version" "run test without conntrack"
+checktool "socat -h" "run test without socat"
+
+trap cleanup EXIT
+
+setup_ns nsclient1 nsclient2 nsrouter
+
+ip netns exec "$nsrouter" nft -f -<<EOF
+table ip t {
+ chain lb {
+ meta l4proto udp dnat to numgen random mod 3 map { 0 : 10.0.2.1 . 9000, 1 : 10.0.2.1 . 9001, 2 : 10.0.2.1 . 9002 }
+ }
+
+ chain prerouting {
+ type nat hook prerouting priority dstnat
+
+ udp dport $dport counter jump lb
+ }
+
+ chain output {
+ type nat hook output priority dstnat
+
+ udp dport $dport counter jump lb
+ }
+}
+EOF
+
+load_simple_ruleset()
+{
+ip netns exec "$1" nft -f -<<EOF
+table ip t {
+ chain forward {
+ type filter hook forward priority 0
+
+ ct state new counter
+ }
+}
+EOF
+}
+
+spawn_servers()
+{
+ local ns="$1"
+ local ports="9000 9001 9002"
+
+ for port in $ports; do
+ ip netns exec "$ns" socat UDP-RECVFROM:$port,fork PIPE 2>/dev/null &
+ done
+
+ for port in $ports; do
+ wait_local_port_listen "$ns" $port udp
+ done
+}
+
+add_addr()
+{
+ local ns="$1"
+ local dev="$2"
+ local i="$3"
+ local j="$4"
+
+ ip -net "$ns" link set "$dev" up
+ ip -net "$ns" addr add "10.0.$i.$j/24" dev "$dev"
+}
+
+ping_test()
+{
+ local ns="$1"
+ local daddr="$2"
+
+ if ! ip netns exec "$ns" ping -q -c 1 $daddr > /dev/null;then
+ echo "FAIL: ping from $ns to $daddr"
+ exit 1
+ fi
+}
+
+run_one_clash_test()
+{
+ local ns="$1"
+ local daddr="$2"
+ local dport="$3"
+ local entries
+ local cre
+
+ if ! ip netns exec "$ns" ./udpclash $daddr $dport;then
+ echo "FAIL: did not receive expected number of replies for $daddr:$dport"
+ ret=1
+ return 1
+ fi
+
+ entries=$(conntrack -S | wc -l)
+ cre=$(conntrack -S | grep -v "clash_resolve=0" | wc -l)
+
+ if [ "$cre" -ne "$entries" ] ;then
+ clash_resolution_active=1
+ return 0
+ fi
+
+ # 1 cpu -> parallel insertion impossible
+ if [ "$entries" -eq 1 ]; then
+ return 0
+ fi
+
+ # not a failure: clash resolution logic did not trigger, but all replies
+ # were received. With right timing, xmit completed sequentially and
+ # no parallel insertion occurs.
+ return $ksft_skip
+}
+
+run_clash_test()
+{
+ local ns="$1"
+ local daddr="$2"
+ local dport="$3"
+
+ for i in $(seq 1 10);do
+ run_one_clash_test "$ns" "$daddr" "$dport"
+ local rv=$?
+ if [ $rv -eq 0 ];then
+ echo "PASS: clash resolution test for $daddr:$dport on attempt $i"
+ return 0
+ elif [ $rv -eq 1 ];then
+ echo "FAIL: clash resolution test for $daddr:$dport on attempt $i"
+ return 1
+ fi
+ done
+}
+
+ip link add veth0 netns "$nsclient1" type veth peer name veth0 netns "$nsrouter"
+ip link add veth0 netns "$nsclient2" type veth peer name veth1 netns "$nsrouter"
+add_addr "$nsclient1" veth0 1 1
+add_addr "$nsclient2" veth0 2 1
+add_addr "$nsrouter" veth0 1 99
+add_addr "$nsrouter" veth1 2 99
+
+ip -net "$nsclient1" route add default via 10.0.1.99
+ip -net "$nsclient2" route add default via 10.0.2.99
+ip netns exec "$nsrouter" sysctl -q net.ipv4.ip_forward=1
+
+ping_test "$nsclient1" 10.0.1.99
+ping_test "$nsclient1" 10.0.2.1
+ping_test "$nsclient2" 10.0.1.1
+
+spawn_servers "$nsclient2"
+
+# exercise clash resolution with nat:
+# nsrouter is supposed to dnat to 10.0.2.1:900{0,1,2,3}.
+run_clash_test "$nsclient1" 10.0.1.99 "$dport"
+
+# exercise clash resolution without nat.
+load_simple_ruleset "$nsclient2"
+run_clash_test "$nsclient2" 127.0.0.1 9001
+
+if [ $clash_resolution_active -eq 0 ];then
+ [ "$ret" -eq 0 ] && ret=$ksft_skip
+ echo "SKIP: Clash resolution did not trigger"
+fi
+
+exit $ret
diff --git a/tools/testing/selftests/net/netfilter/udpclash.c b/tools/testing/selftests/net/netfilter/udpclash.c
new file mode 100644
index 000000000000..85c7b906ad08
--- /dev/null
+++ b/tools/testing/selftests/net/netfilter/udpclash.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Usage: ./udpclash <IP> <PORT>
+ *
+ * Emit THREAD_COUNT UDP packets sharing the same saddr:daddr pair.
+ *
+ * This mimics DNS resolver libraries that emit A and AAAA requests
+ * in parallel.
+ *
+ * This exercises conntrack clash resolution logic added and later
+ * refined in
+ *
+ * 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
+ * ed07d9a021df ("netfilter: nf_conntrack: resolve clash for matching conntracks")
+ * 6a757c07e51f ("netfilter: conntrack: allow insertion of clashing entries")
+ */
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <sys/socket.h>
+#include <pthread.h>
+
+#define THREAD_COUNT 128
+
+struct thread_args {
+ const struct sockaddr_in *si_remote;
+ int sockfd;
+};
+
+static int wait = 1;
+
+static void *thread_main(void *varg)
+{
+ const struct sockaddr_in *si_remote;
+ const struct thread_args *args = varg;
+ static const char msg[] = "foo";
+
+ si_remote = args->si_remote;
+
+ while (wait == 1)
+ ;
+
+ if (sendto(args->sockfd, msg, strlen(msg), MSG_NOSIGNAL,
+ (struct sockaddr *)si_remote, sizeof(*si_remote)) < 0)
+ exit(111);
+
+ return varg;
+}
+
+static int run_test(int fd, const struct sockaddr_in *si_remote)
+{
+ struct thread_args thread_args = {
+ .si_remote = si_remote,
+ .sockfd = fd,
+ };
+ pthread_t *tid = calloc(THREAD_COUNT, sizeof(pthread_t));
+ unsigned int repl_count = 0, timeout = 0;
+ int i;
+
+ if (!tid) {
+ perror("calloc");
+ return 1;
+ }
+
+ for (i = 0; i < THREAD_COUNT; i++) {
+ int err = pthread_create(&tid[i], NULL, &thread_main, &thread_args);
+
+ if (err != 0) {
+ perror("pthread_create");
+ exit(1);
+ }
+ }
+
+ wait = 0;
+
+ for (i = 0; i < THREAD_COUNT; i++)
+ pthread_join(tid[i], NULL);
+
+ while (repl_count < THREAD_COUNT) {
+ struct sockaddr_in si_repl;
+ socklen_t si_repl_len = sizeof(si_repl);
+ char repl[512];
+ ssize_t ret;
+
+ ret = recvfrom(fd, repl, sizeof(repl), MSG_NOSIGNAL,
+ (struct sockaddr *) &si_repl, &si_repl_len);
+ if (ret < 0) {
+ if (timeout++ > 5000) {
+ fputs("timed out while waiting for reply from thread\n", stderr);
+ break;
+ }
+
+ /* give reply time to pass though the stack */
+ usleep(1000);
+ continue;
+ }
+
+ if (si_repl_len != sizeof(*si_remote)) {
+ fprintf(stderr, "warning: reply has unexpected repl_len %d vs %d\n",
+ (int)si_repl_len, (int)sizeof(si_repl));
+ } else if (si_remote->sin_addr.s_addr != si_repl.sin_addr.s_addr ||
+ si_remote->sin_port != si_repl.sin_port) {
+ char a[64], b[64];
+
+ inet_ntop(AF_INET, &si_remote->sin_addr, a, sizeof(a));
+ inet_ntop(AF_INET, &si_repl.sin_addr, b, sizeof(b));
+
+ fprintf(stderr, "reply from wrong source: want %s:%d got %s:%d\n",
+ a, ntohs(si_remote->sin_port), b, ntohs(si_repl.sin_port));
+ }
+
+ repl_count++;
+ }
+
+ printf("got %d of %d replies\n", repl_count, THREAD_COUNT);
+
+ free(tid);
+
+ return repl_count == THREAD_COUNT ? 0 : 1;
+}
+
+int main(int argc, char *argv[])
+{
+ struct sockaddr_in si_local = {
+ .sin_family = AF_INET,
+ };
+ struct sockaddr_in si_remote = {
+ .sin_family = AF_INET,
+ };
+ int fd, ret;
+
+ if (argc < 3) {
+ fputs("Usage: send_udp <daddr> <dport>\n", stderr);
+ return 1;
+ }
+
+ si_remote.sin_port = htons(atoi(argv[2]));
+ si_remote.sin_addr.s_addr = inet_addr(argv[1]);
+
+ fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_UDP);
+ if (fd < 0) {
+ perror("socket");
+ return 1;
+ }
+
+ if (bind(fd, (struct sockaddr *)&si_local, sizeof(si_local)) < 0) {
+ perror("bind");
+ return 1;
+ }
+
+ ret = run_test(fd, &si_remote);
+
+ close(fd);
+
+ return ret;
+}
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool
2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
2025-06-27 14:27 ` [PATCH nf 1/4] selftests: netfilter: conntrack_resize.sh: extend resize test Florian Westphal
2025-06-27 14:27 ` [PATCH nf 2/4] selftests: netfilter: add conntrack clash resolution test case Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
3 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
Previous patch added a new clash resolution test case.
Also use this during conntrack resize stress test in addition
to icmp ping flood.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
.../selftests/net/netfilter/conntrack_resize.sh | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/tools/testing/selftests/net/netfilter/conntrack_resize.sh b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
index aa1ba07eaf50..788cd56ea4a0 100755
--- a/tools/testing/selftests/net/netfilter/conntrack_resize.sh
+++ b/tools/testing/selftests/net/netfilter/conntrack_resize.sh
@@ -177,6 +177,22 @@ EOF
done
}
+ct_udpclash()
+{
+ local ns="$1"
+ local duration="$2"
+ local now=$(date +%s)
+ local end=$((now + duration))
+
+ [ -x udpclash ] || return
+
+ while [ $now -lt $end ]; do
+ ip netns exec "$ns" ./udpclash 127.0.0.1 $((RANDOM%65536)) > /dev/null 2>&1
+
+ now=$(date +%s)
+ done
+}
+
# dump to /dev/null. We don't want dumps to cause infinite loops
# or use-after-free even when conntrack table is altered while dumps
# are in progress.
@@ -267,6 +283,7 @@ insert_flood()
ct_pingflood "$n" "$timeout" "floodresize" &
ct_udpflood "$n" "$timeout" &
+ ct_udpclash "$n" "$timeout" &
insert_ctnetlink "$n" "$r" &
ctflush "$n" "$timeout" &
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-06-27 14:27 [PATCH nf 0/4] netfilter: conntrack: fix obscure confirmed race Florian Westphal
` (2 preceding siblings ...)
2025-06-27 14:27 ` [PATCH nf 3/4] selftests: netfilter: conntrack_resize.sh: also use udpclash tool Florian Westphal
@ 2025-06-27 14:27 ` Florian Westphal
2025-07-03 13:56 ` Pablo Neira Ayuso
3 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-06-27 14:27 UTC (permalink / raw)
To: netfilter-devel; +Cc: Florian Westphal
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
[exception RIP: __nf_ct_delete_from_lists+172]
[..]
#7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
#8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
#9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
[..]
The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:
ct hlist pointer is garbage; looks like the ct hash value
(hence crash).
ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
ct->timeout is 30000 (=30s), which is unexpected.
Everything else looks like normal udp conntrack entry. If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
- ct hlist pointers are overloaded and store/cache the raw tuple hash
- ct->timeout matches the relative time expected for a new udp flow
rather than the absolute 'jiffies' value.
If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.
Theory is that we did hit following race:
cpu x cpu y cpu z
found entry E found entry E
E is expired <preemption>
nf_ct_delete()
return E to rcu slab
init_conntrack
E is re-inited,
ct->status set to 0
reply tuplehash hnnode.pprev
stores hash value.
cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z. cpu y was preempted before
checking for expiry and/or confirm bit.
->refcnt set to 1
E now owned by skb
->timeout set to 30000
If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.
nf_conntrack_confirm gets called
sets: ct->status |= CONFIRMED
This is wrong: E is not yet added
to hashtable.
cpu y resumes, it observes E as expired but CONFIRMED:
<resumes>
nf_ct_expired()
-> yes (ct->timeout is 30s)
confirmed bit set.
cpu y will try to delete E from the hashtable:
nf_ct_delete() -> set DYING bit
__nf_ct_delete_from_lists
Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:
wait for spinlock held by z
CONFIRMED is set but there is no
guarantee ct will be added to hash:
"chaintoolong" or "clash resolution"
logic both skip the insert step.
reply hnnode.pprev still stores the
hash value.
unlocks spinlock
return NF_DROP
<unblocks, then
crashes on hlist_nulls_del_rcu pprev>
In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.
Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.
It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:
Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.
Also change nf_ct_should_gc() to first check the confirmed bit.
The gc sequence is:
1. Check if entry has expired, if not skip to next entry
2. Obtain a reference to the expired entry.
3. Call nf_ct_should_gc() to double-check step 1.
nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time. Step 3 will therefore not remove the entry.
Without this change to nf_ct_should_gc() we could still get this sequence:
1. Check if entry has expired.
2. Obtain a reference.
3. Call nf_ct_should_gc() to double-check step 1:
4 - entry is still observed as expired
5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
and confirm bit gets set
6 - confirm bit is seen
7 - valid entry is removed again
First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.
This change cannot be backported to releases before 5.19. Without
commit 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.
Fixes: 1397af5bfd7d ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/netfilter/nf_conntrack.h | 15 +++++++++++++--
net/netfilter/nf_conntrack_core.c | 18 ++++++++++++------
2 files changed, 25 insertions(+), 8 deletions(-)
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3f02a45773e8..ca26274196b9 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -306,8 +306,19 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
/* use after obtaining a reference count */
static inline bool nf_ct_should_gc(const struct nf_conn *ct)
{
- return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
- !nf_ct_is_dying(ct);
+ if (!nf_ct_is_confirmed(ct))
+ return false;
+
+ /* load ct->timeout after is_confirmed() test.
+ * Pairs with __nf_conntrack_confirm() which:
+ * 1. Increases ct->timeout value
+ * 2. Inserts ct into rcu hlist
+ * 3. Sets the confirmed bit
+ * 4. Unlocks the hlist lock
+ */
+ smp_acquire__after_ctrl_dep();
+
+ return nf_ct_is_expired(ct) && !nf_ct_is_dying(ct);
}
#define NF_CT_DAY (86400 * HZ)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 201d3c4ec623..442a972a03d7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -1124,6 +1124,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx)
hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
&nf_conntrack_hash[repl_idx]);
+ loser_ct->status |= IPS_CONFIRMED;
NF_CT_STAT_INC(net, clash_resolve);
return NF_ACCEPT;
@@ -1260,8 +1261,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
* user context, else we insert an already 'dead' hash, blocking
* further use of that particular connection -JM.
*/
- ct->status |= IPS_CONFIRMED;
-
if (unlikely(nf_ct_is_dying(ct))) {
NF_CT_STAT_INC(net, insert_failed);
goto dying;
@@ -1293,7 +1292,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
}
}
- /* Timer relative to confirmation time, not original
+ /* Timeout is relative to confirmation time, not original
setting time, otherwise we'd get timer wrap in
weird delay cases. */
ct->timeout += nfct_time_stamp;
@@ -1301,11 +1300,18 @@ __nf_conntrack_confirm(struct sk_buff *skb)
__nf_conntrack_insert_prepare(ct);
/* Since the lookup is lockless, hash insertion must be done after
- * starting the timer and setting the CONFIRMED bit. The RCU barriers
- * guarantee that no other CPU can find the conntrack before the above
- * stores are visible.
+ * setting ct->timeout. The RCU barriers guarantee that no other CPU
+ * can find the conntrack before the above stores are visible.
*/
__nf_conntrack_hash_insert(ct, hash, reply_hash);
+
+ /* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
+ * skip entries that lack this bit. This happens when a CPU is looking
+ * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
+ * or when another CPU encounters this entry right after the insertion
+ * but before the set-confirm-bit below.
+ */
+ ct->status |= IPS_CONFIRMED;
nf_conntrack_double_unlock(hash, reply_hash);
local_bh_enable();
--
2.49.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-06-27 14:27 ` [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry Florian Westphal
@ 2025-07-03 13:56 ` Pablo Neira Ayuso
2025-07-03 14:21 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-03 13:56 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
Hi FLorian,
Thanks for the description, this scenario is esoteric.
Is this bug fully reproducible?
More questions below.
On Fri, Jun 27, 2025 at 04:27:53PM +0200, Florian Westphal wrote:
> A crash in conntrack was reported while trying to unlink the conntrack
> entry from the hash bucket list:
> [exception RIP: __nf_ct_delete_from_lists+172]
> [..]
> #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
> #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
> #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
> [..]
>
> The nf_conn struct is marked as allocated from slab but appears to be in
> a partially initialised state:
>
> ct hlist pointer is garbage; looks like the ct hash value
> (hence crash).
> ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
> ct->timeout is 30000 (=30s), which is unexpected.
>
> Everything else looks like normal udp conntrack entry. If we ignore
> ct->status and pretend its 0, the entry matches those that are newly
> allocated but not yet inserted into the hash:
> - ct hlist pointers are overloaded and store/cache the raw tuple hash
> - ct->timeout matches the relative time expected for a new udp flow
> rather than the absolute 'jiffies' value.
>
> If it were not for the presence of IPS_CONFIRMED,
> __nf_conntrack_find_get() would have skipped the entry.
>
> Theory is that we did hit following race:
>
> cpu x cpu y cpu z
> found entry E found entry E
> E is expired <preemption>
> nf_ct_delete()
> return E to rcu slab
> init_conntrack
> E is re-inited,
> ct->status set to 0
> reply tuplehash hnnode.pprev
> stores hash value.
>
> cpu y found E right before it was deleted on cpu x.
> E is now re-inited on cpu z. cpu y was preempted before
> checking for expiry and/or confirm bit.
>
> ->refcnt set to 1
> E now owned by skb
> ->timeout set to 30000
>
> If cpu y were to resume now, it would observe E as
> expired but would skip E due to missing CONFIRMED bit.
>
> nf_conntrack_confirm gets called
> sets: ct->status |= CONFIRMED
> This is wrong: E is not yet added
> to hashtable.
>
> cpu y resumes, it observes E as expired but CONFIRMED:
> <resumes>
> nf_ct_expired()
> -> yes (ct->timeout is 30s)
> confirmed bit set.
>
> cpu y will try to delete E from the hashtable:
> nf_ct_delete() -> set DYING bit
> __nf_ct_delete_from_lists
>
> Even this scenario doesn't guarantee a crash:
> cpu z still holds the table bucket lock(s) so y blocks:
>
> wait for spinlock held by z
>
> CONFIRMED is set but there is no
> guarantee ct will be added to hash:
> "chaintoolong" or "clash resolution"
> logic both skip the insert step.
> reply hnnode.pprev still stores the
> hash value.
>
> unlocks spinlock
> return NF_DROP
> <unblocks, then
> crashes on hlist_nulls_del_rcu pprev>
>
> In case CPU z does insert the entry into the hashtable, cpu y will unlink
> E again right away but no crash occurs.
>
> Without 'cpu y' race, 'garbage' hlist is of no consequence:
> ct refcnt remains at 1, eventually skb will be free'd and E gets
> destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.
>
> To resolve this, move the IPS_CONFIRMED assignment after the table
> insertion but before the unlock.
>
> It doesn't matter if other CPUs can observe a newly inserted entry right
> before the CONFIRMED bit was set:
>
> Such event cannot be distinguished from above "E is the old incarnation"
> case: the entry will be skipped.
>
> Also change nf_ct_should_gc() to first check the confirmed bit.
>
> The gc sequence is:
> 1. Check if entry has expired, if not skip to next entry
> 2. Obtain a reference to the expired entry.
> 3. Call nf_ct_should_gc() to double-check step 1.
>
> nf_ct_should_gc() is thus called only for entries that already failed an
> expiry check. After this patch, once the confirmed bit check passes
> ct->timeout has been altered to reflect the absolute 'best before' date
> instead of a relative time. Step 3 will therefore not remove the entry.
>
> Without this change to nf_ct_should_gc() we could still get this sequence:
>
> 1. Check if entry has expired.
> 2. Obtain a reference.
> 3. Call nf_ct_should_gc() to double-check step 1:
> 4 - entry is still observed as expired
> 5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
> and confirm bit gets set
> 6 - confirm bit is seen
> 7 - valid entry is removed again
>
> First do check 6), then 4) so the gc expiry check always picks up either
> confirmed bit unset (entry gets skipped) or expiry re-check failure for
> re-inited conntrack objects.
>
> This change cannot be backported to releases before 5.19. Without
> commit 8a75a2c17410 ("netfilter: conntrack: remove unconfirmed list")
> |= IPS_CONFIRMED line cannot be moved without further changes.
>
> Fixes: 1397af5bfd7d ("netfilter: conntrack: remove the percpu dying list")
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> include/net/netfilter/nf_conntrack.h | 15 +++++++++++++--
> net/netfilter/nf_conntrack_core.c | 18 ++++++++++++------
> 2 files changed, 25 insertions(+), 8 deletions(-)
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index 3f02a45773e8..ca26274196b9 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -306,8 +306,19 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
> /* use after obtaining a reference count */
> static inline bool nf_ct_should_gc(const struct nf_conn *ct)
> {
> - return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
> - !nf_ct_is_dying(ct);
> + if (!nf_ct_is_confirmed(ct))
> + return false;
> +
> + /* load ct->timeout after is_confirmed() test.
> + * Pairs with __nf_conntrack_confirm() which:
> + * 1. Increases ct->timeout value
> + * 2. Inserts ct into rcu hlist
> + * 3. Sets the confirmed bit
> + * 4. Unlocks the hlist lock
> + */
> + smp_acquire__after_ctrl_dep();
> +
> + return nf_ct_is_expired(ct) && !nf_ct_is_dying(ct);
> }
>
> #define NF_CT_DAY (86400 * HZ)
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 201d3c4ec623..442a972a03d7 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -1124,6 +1124,7 @@ static int nf_ct_resolve_clash_harder(struct sk_buff *skb, u32 repl_idx)
>
> hlist_nulls_add_head_rcu(&loser_ct->tuplehash[IP_CT_DIR_REPLY].hnnode,
> &nf_conntrack_hash[repl_idx]);
> + loser_ct->status |= IPS_CONFIRMED;
>
> NF_CT_STAT_INC(net, clash_resolve);
> return NF_ACCEPT;
> @@ -1260,8 +1261,6 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> * user context, else we insert an already 'dead' hash, blocking
> * further use of that particular connection -JM.
> */
> - ct->status |= IPS_CONFIRMED;
> -
> if (unlikely(nf_ct_is_dying(ct))) {
> NF_CT_STAT_INC(net, insert_failed);
> goto dying;
> @@ -1293,7 +1292,7 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> }
> }
>
> - /* Timer relative to confirmation time, not original
> + /* Timeout is relative to confirmation time, not original
> setting time, otherwise we'd get timer wrap in
> weird delay cases. */
> ct->timeout += nfct_time_stamp;
> @@ -1301,11 +1300,18 @@ __nf_conntrack_confirm(struct sk_buff *skb)
> __nf_conntrack_insert_prepare(ct);
>
> /* Since the lookup is lockless, hash insertion must be done after
> - * starting the timer and setting the CONFIRMED bit. The RCU barriers
> - * guarantee that no other CPU can find the conntrack before the above
> - * stores are visible.
> + * setting ct->timeout. The RCU barriers guarantee that no other CPU
> + * can find the conntrack before the above stores are visible.
> */
> __nf_conntrack_hash_insert(ct, hash, reply_hash);
> +
> + /* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> + * skip entries that lack this bit. This happens when a CPU is looking
> + * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> + * or when another CPU encounters this entry right after the insertion
> + * but before the set-confirm-bit below.
> + */
> + ct->status |= IPS_CONFIRMED;
My understanding is that this bit setting can still be reordered.
> nf_conntrack_double_unlock(hash, reply_hash);
> local_bh_enable();
>
> --
> 2.49.0
>
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-03 13:56 ` Pablo Neira Ayuso
@ 2025-07-03 14:21 ` Florian Westphal
2025-07-14 13:51 ` Pablo Neira Ayuso
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-03 14:21 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> Thanks for the description, this scenario is esoteric.
>
> Is this bug fully reproducible?
No. Unicorn. Only happened once.
Everything is based off reading the backtrace and vmcore.
> > + /* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> > + * skip entries that lack this bit. This happens when a CPU is looking
> > + * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> > + * or when another CPU encounters this entry right after the insertion
> > + * but before the set-confirm-bit below.
> > + */
> > + ct->status |= IPS_CONFIRMED;
>
> My understanding is that this bit setting can still be reordered.
You mean it could be moved before hlist_add? So this is needed?
- ct->status |= IPS_CONFIRMED;
+ smp_mb__before_atomic();
+ set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
I can send a v2 with this change.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-03 14:21 ` Florian Westphal
@ 2025-07-14 13:51 ` Pablo Neira Ayuso
2025-07-14 14:36 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-14 13:51 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > Thanks for the description, this scenario is esoteric.
> >
> > Is this bug fully reproducible?
>
> No. Unicorn. Only happened once.
> Everything is based off reading the backtrace and vmcore.
I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.
> > > + /* IPS_CONFIRMED unset means 'ct not (yet) in hash', conntrack lookups
> > > + * skip entries that lack this bit. This happens when a CPU is looking
> > > + * at a stale entry that is being recycled due to SLAB_TYPESAFE_BY_RCU
> > > + * or when another CPU encounters this entry right after the insertion
> > > + * but before the set-confirm-bit below.
> > > + */
> > > + ct->status |= IPS_CONFIRMED;
> >
> > My understanding is that this bit setting can still be reordered.
>
> You mean it could be moved before hlist_add? So this is needed?
>
> - ct->status |= IPS_CONFIRMED;
> + smp_mb__before_atomic();
> + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
>
> I can send a v2 with this change.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-14 13:51 ` Pablo Neira Ayuso
@ 2025-07-14 14:36 ` Florian Westphal
2025-07-15 22:09 ` Pablo Neira Ayuso
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-14 14:36 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > Thanks for the description, this scenario is esoteric.
> > >
> > > Is this bug fully reproducible?
> >
> > No. Unicorn. Only happened once.
> > Everything is based off reading the backtrace and vmcore.
>
> I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
I would not hold my breath. But I don't see anything that prevents the
race described in 4/4, and all the things match in the vmcore, including
increment of clash resolution counter. If you think its too perfect
then ok, we can keep 4/4 back until someone else reports this problem
again.
> I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.
3/4 needs 2/4 present as well. I can then resend 4/4 then with the
> > - ct->status |= IPS_CONFIRMED;
> > + smp_mb__before_atomic();
> > + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
change.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-14 14:36 ` Florian Westphal
@ 2025-07-15 22:09 ` Pablo Neira Ayuso
2025-07-16 15:59 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-15 22:09 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > Thanks for the description, this scenario is esoteric.
> > > >
> > > > Is this bug fully reproducible?
> > >
> > > No. Unicorn. Only happened once.
> > > Everything is based off reading the backtrace and vmcore.
> >
> > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
>
> I would not hold my breath. But I don't see anything that prevents the
> race described in 4/4, and all the things match in the vmcore, including
> increment of clash resolution counter. If you think its too perfect
> then ok, we can keep 4/4 back until someone else reports this problem
> again.
Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
that allows for this to occur.
Could this rare sequence still happen?
cpu x cpu y cpu z
found entry E found entry E
E is expired <preemption>
nf_ct_delete()
return E to rcu slab
init_conntrack
<preemption> NOTE: ct->status not yet set to zero
cpu y resumes, it observes E as expired but CONFIRMED:
<resumes>
nf_ct_expired()
-> yes (ct->timeout is 30s)
confirmed bit set.
> > I would push 1/4 and 3/4 to nf.git to start with. Unless you are 100% sure this fix is needed.
>
> 3/4 needs 2/4 present as well. I can then resend 4/4 then with the
Right, I accidentally skipped that test, it should be also included.
> > > - ct->status |= IPS_CONFIRMED;
> > > + smp_mb__before_atomic();
> > > + set_bit(IPS_CONFIRMED_BIT, &ct->status) ?
>
> change.
If the status bit is used to synchronize the different threads,
I agree this needs to be set_bit(). But I am not sure yet this is
sufficient yet.
Thanks.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-15 22:09 ` Pablo Neira Ayuso
@ 2025-07-16 15:59 ` Florian Westphal
2025-07-16 17:00 ` Pablo Neira Ayuso
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-07-16 15:59 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netfilter-devel
Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > > Thanks for the description, this scenario is esoteric.
> > > > >
> > > > > Is this bug fully reproducible?
> > > >
> > > > No. Unicorn. Only happened once.
> > > > Everything is based off reading the backtrace and vmcore.
> > >
> > > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
> >
> > I would not hold my breath. But I don't see anything that prevents the
> > race described in 4/4, and all the things match in the vmcore, including
> > increment of clash resolution counter. If you think its too perfect
> > then ok, we can keep 4/4 back until someone else reports this problem
> > again.
>
> Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
> that allows for this to occur.
>
> Could this rare sequence still happen?
>
> cpu x cpu y cpu z
> found entry E found entry E
> E is expired <preemption>
> nf_ct_delete()
> return E to rcu slab
> init_conntrack
> <preemption> NOTE: ct->status not yet set to zero
>
> cpu y resumes, it observes E as expired but CONFIRMED:
> <resumes>
> nf_ct_expired()
> -> yes (ct->timeout is 30s)
> confirmed bit set.
Yes, that can happen, but then the refcount can't be incremented
as its 0 (-> entry is skipped). If its nonzero but the object was returned
by the kmem cache we have a different kind of bug (free with refcount > 0),
or use-after-free.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH nf 4/4] netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
2025-07-16 15:59 ` Florian Westphal
@ 2025-07-16 17:00 ` Pablo Neira Ayuso
0 siblings, 0 replies; 12+ messages in thread
From: Pablo Neira Ayuso @ 2025-07-16 17:00 UTC (permalink / raw)
To: Florian Westphal; +Cc: netfilter-devel
On Wed, Jul 16, 2025 at 05:59:41PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > On Mon, Jul 14, 2025 at 04:36:35PM +0200, Florian Westphal wrote:
> > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > On Thu, Jul 03, 2025 at 04:21:51PM +0200, Florian Westphal wrote:
> > > > > Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > > > > > Thanks for the description, this scenario is esoteric.
> > > > > >
> > > > > > Is this bug fully reproducible?
> > > > >
> > > > > No. Unicorn. Only happened once.
> > > > > Everything is based off reading the backtrace and vmcore.
> > > >
> > > > I guess this needs a chaos money to trigger this bug. Else, can we try to catch this unicorn again?
> > >
> > > I would not hold my breath. But I don't see anything that prevents the
> > > race described in 4/4, and all the things match in the vmcore, including
> > > increment of clash resolution counter. If you think its too perfect
> > > then ok, we can keep 4/4 back until someone else reports this problem
> > > again.
> >
> > Hm, I think your sequence is possible, it is the SLAB_TYPESAFE_BY_RCU rule
> > that allows for this to occur.
> >
> > Could this rare sequence still happen?
> >
> > cpu x cpu y cpu z
> > found entry E found entry E
> > E is expired <preemption>
> > nf_ct_delete()
> > return E to rcu slab
> > init_conntrack
> > <preemption> NOTE: ct->status not yet set to zero
> >
> > cpu y resumes, it observes E as expired but CONFIRMED:
> > <resumes>
> > nf_ct_expired()
> > -> yes (ct->timeout is 30s)
> > confirmed bit set.
>
> Yes, that can happen, but then the refcount can't be incremented
> as its 0 (-> entry is skipped).
Right, refcount zero prevents it.
static void nf_ct_gc_expired(struct nf_conn *ct)
{
if (!refcount_inc_not_zero(&ct->ct_general.use))
return;
> If its nonzero but the object was returned
> by the kmem cache we have a different kind of bug (free with refcount > 0),
> or use-after-free.
OK, thanks for explaining, use set_bit() and post v2.
^ permalink raw reply [flat|nested] 12+ messages in thread