nft_queues.sh failures

netdev.vger.kernel.org archive mirror

* nft_queues.sh failures
@ 2025-05-22 10:09 Paolo Abeni
  2025-05-22 13:53 ` Jakub Kicinski
  2025-05-22 14:46 ` [PATCH net] selftests: netfilter: nft_queue.sh: double sctp test timeout Florian Westphal
  0 siblings, 2 replies; 5+ messages in thread
From: Paolo Abeni @ 2025-05-22 10:09 UTC
  To: Florian Westphal, Pablo Neira Ayuso, Jozsef Kadlecsik
  Cc: netfilter-devel@vger.kernel.org, netdev@vger.kernel.org

Hi,

Recently the nipa CI infra went through some tuning, and the mentioned
self-test now often fails.

As I could not find any applied or pending relevant change, I have a
vague suspicion that the timeout applied to the server command now
triggers due to different timing. Could you please have a look?

Thanks

Paolo


* Re: nft_queues.sh failures
  2025-05-22 10:09 nft_queues.sh failures Paolo Abeni
@ 2025-05-22 13:53 ` Jakub Kicinski
  2025-05-22 14:10   ` Paolo Abeni
  2025-05-22 14:46 ` [PATCH net] selftests: netfilter: nft_queue.sh: double sctp test timeout Florian Westphal
  1 sibling, 1 reply; 5+ messages in thread
From: Jakub Kicinski @ 2025-05-22 13:53 UTC
  To: Paolo Abeni
  Cc: Florian Westphal, Pablo Neira Ayuso, Jozsef Kadlecsik,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org

On Thu, 22 May 2025 12:09:01 +0200 Paolo Abeni wrote:
> Recently the nipa CI infra went through some tuning, and the mentioned
> self-test now often fails.
> 
> As I could not find any applied or pending relevant change, I have a
> vague suspect that the timeout applied to the server command now
> triggers due to different timing. Could you please have a look?

Oh, I was just staring at:
https://lore.kernel.org/all/20250522031835.4395-1-shiming.cheng@mediatek.com/
do you think it's not that?

I'll hide both that patch and Florian's fix from the queue for now, 
for a test.


* Re: nft_queues.sh failures
  2025-05-22 13:53 ` Jakub Kicinski
@ 2025-05-22 14:10   ` Paolo Abeni
  2025-05-22 14:35     ` Florian Westphal
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Abeni @ 2025-05-22 14:10 UTC
  To: Jakub Kicinski, Florian Westphal
  Cc: Pablo Neira Ayuso, Jozsef Kadlecsik,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org

On 5/22/25 3:53 PM, Jakub Kicinski wrote:
> On Thu, 22 May 2025 12:09:01 +0200 Paolo Abeni wrote:
>> Recently the nipa CI infra went through some tuning, and the mentioned
>> self-test now often fails.
>>
>> As I could not find any applied or pending relevant change, I have a
>> vague suspect that the timeout applied to the server command now
>> triggers due to different timing. Could you please have a look?
> 
> Oh, I was just staring at:
> https://lore.kernel.org/all/20250522031835.4395-1-shiming.cheng@mediatek.com/
> do you think it's not that?

It's not obvious to me. The failing test case is:

tcp via loopback and re-queueing

There should be no S/W segmentation there, as the loopback interface
exposes TSO.

@Florian, I'm sorry I should have mentioned explicitly the failing test
before. Sample failures:

https://netdev-3.bots.linux.dev/vmksft-nf/results/131921/2-nft-queue-sh/stdout
https://netdev-3.bots.linux.dev/vmksft-nf/results/131741/2-nft-queue-sh/stdout

I was wondering about this timeout specifically:

https://elixir.bootlin.com/linux/v6.15-rc7/source/tools/testing/selftests/net/netfilter/nft_queue.sh#L329

> I'll hide both that patch and Florian's fix from the queue for now, 
> for a test.

Fine by me.

Thanks,

Paolo



* Re: nft_queues.sh failures
  2025-05-22 14:10   ` Paolo Abeni
@ 2025-05-22 14:35     ` Florian Westphal
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Westphal @ 2025-05-22 14:35 UTC
  To: Paolo Abeni
  Cc: Jakub Kicinski, Pablo Neira Ayuso, Jozsef Kadlecsik,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org

Paolo Abeni <pabeni@redhat.com> wrote:
> On 5/22/25 3:53 PM, Jakub Kicinski wrote:
> > On Thu, 22 May 2025 12:09:01 +0200 Paolo Abeni wrote:
> >> Recently the nipa CI infra went through some tuning, and the mentioned
> >> self-test now often fails.
> >>
> >> As I could not find any applied or pending relevant change, I have a
> >> vague suspect that the timeout applied to the server command now
> >> triggers due to different timing. Could you please have a look?
> > 
> > Oh, I was just staring at:
> > https://lore.kernel.org/all/20250522031835.4395-1-shiming.cheng@mediatek.com/
> > do you think it's not that?

It is, thanks Jakub!

With my updated test case, it does pass, but see for yourself:
# PASS: sctp and nfqueue in forward chain (duration: 118s)
# PASS: sctp and nfqueue in output chain with GSO (duration: 56s)

(the old timeout was 60s, so this would FAIL without the updated test).

plain net-next/main:
# PASS: sctp and nfqueue in forward chain (duration: 42s)
# PASS: sctp and nfqueue in output chain with GSO (duration: 21s)

I haven't debugged this yet, but I'd guess that some packets get corrupted
when nfqueue segments GSO skbs, thus forcing retransmits.
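
For a rough sense of scale, the durations quoted above can be turned into
slowdown factors. This is only a minimal sketch: the helper name
`slowdown_pct` is hypothetical, it is not part of the selftest, and it uses
integer shell arithmetic, so results are truncated.

```shell
#!/bin/bash
# Hypothetical helper (not part of nft_queue.sh): express an observed test
# duration as a percentage of a baseline duration, integer arithmetic only.
slowdown_pct()
{
	local baseline="$1"	# seconds on plain net-next/main
	local observed="$2"	# seconds with the pending patch applied

	echo $((observed * 100 / baseline))
}

# Durations taken from the PASS lines quoted in this mail.
echo "forward chain: $(slowdown_pct 42 118)% of baseline"
echo "output chain:  $(slowdown_pct 21 56)% of baseline"
```

Both SCTP cases take roughly 2.7 to 2.8 times as long, which is consistent
with the retransmit guess but does not confirm it.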

> It's not obvious to me. The failing test case is:
> 
> tcp via loopback and re-queueing
> 
> There should be no S/W segmentation there, as the loopback interface
> exposes TSO.

The nfqueue test also forces software segmentation, even for lo, so that
the userspace listener gets non-aggregated packets. (It's possible to
disable this so that 'large packets' get queued to userspace; this selftest
also covers that case for tcp.)

> @Florian, I'm sorry I should have mentioned explicitly the failing test
> before. Sample failures:
> 
> https://netdev-3.bots.linux.dev/vmksft-nf/results/131921/2-nft-queue-sh/stdout
> https://netdev-3.bots.linux.dev/vmksft-nf/results/131741/2-nft-queue-sh/stdout

both show sctp failing:

# PASS: tcp via loopback and re-queueing

---> tcp loopback passes

# 2025/05/22 05:11:46 socat[32441] E write(7, 0x55ca6b34e000, 8192): Connection reset by peer
# cmp: EOF on /tmp/tmp.1LVNFztWUK after byte 50208768, in line 1
# FAIL: sctp forward: input and output file differ
#  Input file-rw------- 1 root root 209715200 May 22 05:10 /tmp/tmp.teqIUO7Jfh
# Output file-rw------- 1 root root 50208768 May 22 05:11 /tmp/tmp.1LVNFztWUK
# 2025/05/22 05:12:46 socat[32459] E write(7, 0x561110e23000, 8192): Connection reset by peer
# cmp: EOF on /tmp/tmp.1LVNFztWUK after byte 36528128, in line 1
# FAIL: sctp output: input and output file differ

So it's sctp+nfqueue that's failing, and it does seem to be related to
the pending patch pointed out by Jakub.

> > I'll hide both that patch and Florian's fix from the queue for now, 
> > for a test.
> 
> Fine by me.

I'll resend the update tomorrow, keeping the OLD timeout of 60s. I think
keeping track of the 'transmit time' in the test log archives could be
useful in the future.
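
The idea of logging the transmit time can be sketched as below. This is an
illustration under assumptions, not the selftest's actual helper:
`report_duration` is a hypothetical name, and `sleep` stands in for the
socat listener. Unlike the helper in the posted patch, this variant samples
the end timestamp after `wait` returns.

```shell
#!/bin/bash
# Sketch only: wait for a background job and print PASS/FAIL together with
# the elapsed wall-clock time since a caller-supplied start timestamp.
report_duration()
{
	local pid="$1"
	local msg="$2"
	local tthen="$3"
	local result="PASS"

	# Must run in the shell that started the job, not in a subshell.
	wait "$pid" || result="FAIL"

	# Sample the end time after wait returns, so time spent blocked counts.
	local tnow
	tnow=$(date +%s)

	printf "%s: %s (duration: %ds)\n" "$result" "$msg" $((tnow - tthen))
}

tthen=$(date +%s)
timeout 5 sleep 1 &	# stand-in for the timeout-guarded receiver
report_duration "$!" "stand-in transfer" "$tthen"
```

Because `date +%s` has one-second resolution, short transfers may be
reported as a second shorter or longer than they really were.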

> I was wondering about this timeout specifically:
> 
> https://elixir.bootlin.com/linux/v6.15-rc7/source/tools/testing/selftests/net/netfilter/nft_queue.sh#L329

5s isn't that short; lo is supposed to be fast. The userspace program
asks for GSO packets, so no s/w segmentation should happen, but even
with GSO segmentation I would not expect it to fail.

I would prefer to keep the 5s for tcp; I don't recall this being a problem
in the past.


* [PATCH net] selftests: netfilter: nft_queue.sh: double sctp test timeout
  2025-05-22 10:09 nft_queues.sh failures Paolo Abeni
  2025-05-22 13:53 ` Jakub Kicinski
@ 2025-05-22 14:46 ` Florian Westphal
  1 sibling, 0 replies; 5+ messages in thread
From: Florian Westphal @ 2025-05-22 14:46 UTC
  To: netdev; +Cc: Florian Westphal, Paolo Abeni

Paolo Abeni says:
 Recently the nipa CI infra went through some tuning, and the mentioned
 self-test now often fails.

 As I could not find any applied or pending relevant change, I have a
 vague suspect that the timeout applied to the server command now
 triggers due to different timing. Could you please have a look?

Double the timeout for sctp, even for standard kernel builds.
For KSFT_MACHINE_SLOW, keep reducing the file transfer size (no change),
but also increase the timeout.

Because SCTP nfqueue tests had timeout related issues before (esp. on debug
kernels) also print the file transfer duration in the PASS/FAIL message.
This would also allow us to see if there is/was an unexpected slowdown
(NIPA keeps logs around).

Output of altered lines now looks like this:
  PASS: tcp and nfqueue in forward chain (duration: 2s)
  PASS: tcp via loopback (duration: 2s)
  PASS: sctp and nfqueue in forward chain (duration: 42s)
  PASS: sctp and nfqueue in output chain with GSO (duration: 21s)

No fixes tag, there were no changes in nfqueue in quite some time.
As the test isn't failing for me even without this change I have no
reason to suspect a breaking change on sctp side either.

Reported-by: Paolo Abeni <pabeni@redhat.com>
Closes: https://lore.kernel.org/netdev/584524ef-9fd7-4326-9f1b-693ca62c5692@redhat.com/
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Also applies to net-next.

 .../selftests/net/netfilter/nft_queue.sh      | 43 ++++++++++++++++---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/netfilter/nft_queue.sh b/tools/testing/selftests/net/netfilter/nft_queue.sh
index 784d1b46912b..eceb443f0eb0 100755
--- a/tools/testing/selftests/net/netfilter/nft_queue.sh
+++ b/tools/testing/selftests/net/netfilter/nft_queue.sh
@@ -10,6 +10,8 @@ source lib.sh
 ret=0
 timeout=5
 
+SCTP_TEST_TIMEOUT=120
+
 cleanup()
 {
 	ip netns pids "$ns1" | xargs kill 2>/dev/null
@@ -40,7 +42,12 @@ TMPFILE3=$(mktemp)
 
 TMPINPUT=$(mktemp)
 COUNT=200
-[ "$KSFT_MACHINE_SLOW" = "yes" ] && COUNT=25
+
+if [ "$KSFT_MACHINE_SLOW" = "yes" ];then
+	COUNT=$((COUNT/4))
+	SCTP_TEST_TIMEOUT=$((SCTP_TEST_TIMEOUT*4))
+fi
+
 dd conv=sparse status=none if=/dev/zero bs=1M count=$COUNT of="$TMPINPUT"
 
 if ! ip link add veth0 netns "$nsrouter" type veth peer name eth0 netns "$ns1" > /dev/null 2>&1; then
@@ -275,9 +282,11 @@ test_tcp_forward()
 	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$ns2"
 	busywait "$BUSYWAIT_TIMEOUT" nf_queue_wait "$nsrouter" 2
 
+	local tthen=$(date +%s)
+
 	ip netns exec "$ns1" socat -u STDIN TCP:10.0.2.99:12345 <"$TMPINPUT" >/dev/null
 
-	wait "$rpid" && echo "PASS: tcp and nfqueue in forward chain"
+	wait_and_check_retval "$rpid" "tcp and nfqueue in forward chain" "$tthen"
 	kill "$nfqpid"
 }
 
@@ -288,13 +297,14 @@ test_tcp_localhost()
 
 	ip netns exec "$nsrouter" ./nf_queue -q 3 &
 	local nfqpid=$!
+	local tthen=$(date +%s)
 
 	busywait "$BUSYWAIT_TIMEOUT" listener_ready "$nsrouter"
 	busywait "$BUSYWAIT_TIMEOUT" nf_queue_wait "$nsrouter" 3
 
 	ip netns exec "$nsrouter" socat -u STDIN TCP:127.0.0.1:12345 <"$TMPINPUT" >/dev/null
 
-	wait "$rpid" && echo "PASS: tcp via loopback"
+	wait_and_check_retval "$rpid" "tcp via loopback" "$tthen"
 	kill "$nfqpid"
 }
 
@@ -417,6 +427,23 @@ check_output_files()
 	fi
 }
 
+wait_and_check_retval()
+{
+	local rpid="$1"
+	local msg="$2"
+	local tthen="$3"
+	local tnow=$(date +%s)
+
+	if wait "$rpid";then
+		echo -n "PASS: "
+	else
+		echo -n "FAIL: "
+		ret=1
+	fi
+
+	printf "%s (duration: %ds)\n" "$msg" $((tnow-tthen))
+}
+
 test_sctp_forward()
 {
 	ip netns exec "$nsrouter" nft -f /dev/stdin <<EOF
@@ -428,13 +455,14 @@ table inet sctpq {
         }
 }
 EOF
-	timeout 60 ip netns exec "$ns2" socat -u SCTP-LISTEN:12345 STDOUT > "$TMPFILE1" &
+	timeout "$SCTP_TEST_TIMEOUT" ip netns exec "$ns2" socat -u SCTP-LISTEN:12345 STDOUT > "$TMPFILE1" &
 	local rpid=$!
 
 	busywait "$BUSYWAIT_TIMEOUT" sctp_listener_ready "$ns2"
 
 	ip netns exec "$nsrouter" ./nf_queue -q 10 -G &
 	local nfqpid=$!
+	local tthen=$(date +%s)
 
 	ip netns exec "$ns1" socat -u STDIN SCTP:10.0.2.99:12345 <"$TMPINPUT" >/dev/null
 
@@ -443,7 +471,7 @@ EOF
 		exit 1
 	fi
 
-	wait "$rpid" && echo "PASS: sctp and nfqueue in forward chain"
+	wait_and_check_retval "$rpid" "sctp and nfqueue in forward chain" "$tthen"
 	kill "$nfqpid"
 
 	check_output_files "$TMPINPUT" "$TMPFILE1" "sctp forward"
@@ -462,13 +490,14 @@ EOF
 	# reduce test file size, software segmentation causes sk wmem increase.
 	dd conv=sparse status=none if=/dev/zero bs=1M count=$((COUNT/2)) of="$TMPINPUT"
 
-	timeout 60 ip netns exec "$ns2" socat -u SCTP-LISTEN:12345 STDOUT > "$TMPFILE1" &
+	timeout "$SCTP_TEST_TIMEOUT" ip netns exec "$ns2" socat -u SCTP-LISTEN:12345 STDOUT > "$TMPFILE1" &
 	local rpid=$!
 
 	busywait "$BUSYWAIT_TIMEOUT" sctp_listener_ready "$ns2"
 
 	ip netns exec "$ns1" ./nf_queue -q 11 &
 	local nfqpid=$!
+	local tthen=$(date +%s)
 
 	ip netns exec "$ns1" socat -u STDIN SCTP:10.0.2.99:12345 <"$TMPINPUT" >/dev/null
 
@@ -478,7 +507,7 @@ EOF
 	fi
 
 	# must wait before checking completeness of output file.
-	wait "$rpid" && echo "PASS: sctp and nfqueue in output chain with GSO"
+	wait_and_check_retval "$rpid" "sctp and nfqueue in output chain with GSO" "$tthen"
 	kill "$nfqpid"
 
 	check_output_files "$TMPINPUT" "$TMPFILE1" "sctp output"
-- 
2.49.0
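
The effect of the KSFT_MACHINE_SLOW scaling in the patch above can be checked
with a small standalone sketch. The function name `sctp_test_params` and its
local variable names are hypothetical; only the starting values (200 and 120)
and the divide-by-4 / multiply-by-4 steps are taken from the diff.

```shell
#!/bin/bash
# Sketch of the scaling logic from the patch, wrapped in a hypothetical
# function so the resulting values can be printed for both cases.
sctp_test_params()
{
	local slow="$1"
	local count=200		# transfer size in MiB (COUNT in the script)
	local sctp_timeout=120	# seconds (SCTP_TEST_TIMEOUT in the script)

	if [ "$slow" = "yes" ]; then
		count=$((count / 4))
		sctp_timeout=$((sctp_timeout * 4))
	fi

	echo "COUNT=$count SCTP_TEST_TIMEOUT=$sctp_timeout"
}

sctp_test_params no	# prints COUNT=200 SCTP_TEST_TIMEOUT=120
sctp_test_params yes	# prints COUNT=50 SCTP_TEST_TIMEOUT=480
```

Note that COUNT/4 yields 50 on slow machines, whereas the line removed by the
patch set COUNT=25.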


