netdev.vger.kernel.org archive mirror
* [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance
@ 2025-07-10 14:53 Jakub Kicinski
  2025-07-11  2:14 ` Hangbin Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Jakub Kicinski @ 2025-07-10 14:53 UTC
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
	liuhangbin, shuah, linux-kselftest

The rtnetlink test for preferred lifetime of an address is quite flaky.
Problems started around the 6.16 merge window in May. The test fails
with:

   FAIL: preferred_lft addresses remaining

and unlike most of our flakes this one fails on the "normal" kernel
builds, not the builds with kernel/configs/debug.config. I suspect
the flakes may be related to power saving, since the expirations
run from a "power efficient" workqueue. Adding a short sleep seems
to decrease the flakes by 8x but they still happen. With this
patch in place we get a flake every couple of weeks, not every
couple of days. Better ideas welcome..

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: liuhangbin@gmail.com
CC: shuah@kernel.org
CC: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/net/rtnetlink.sh | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index 2e8243a65b50..b9e1497ea27a 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -299,6 +299,11 @@ kci_test_addrlft()
 	done
 
 	sleep 5
+	# Schedule out for a bit, address GC runs from the power efficient WQ
+	# if the long sleep above has put the whole system into sleep state
+	# the WQ may have not had a chance to run.
+	sleep 0.1
+
 	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
 	if [ $? -eq 0 ]; then
 		check_err 1
-- 
2.50.0



* Re: [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance
  2025-07-10 14:53 [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance Jakub Kicinski
@ 2025-07-11  2:14 ` Hangbin Liu
  2025-07-11 14:17   ` Jakub Kicinski
  0 siblings, 1 reply; 5+ messages in thread
From: Hangbin Liu @ 2025-07-11  2:14 UTC
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, shuah,
	linux-kselftest

On Thu, Jul 10, 2025 at 07:53:12AM -0700, Jakub Kicinski wrote:
> The rtnetlink test for preferred lifetime of an address is quite flaky.
> Problems started around the 6.16 merge window in May. The test fails
> with:
> 
>    FAIL: preferred_lft addresses remaining
> 
> and unlike most of our flakes this one fails on the "normal" kernel
> builds, not the builds with kernel/configs/debug.config. I suspect
> the flakes may be related to power saving, since the expirations
> run from a "power efficient" workqueue. Adding a short sleep seems
> to decrease the flakes by 8x but they still happen. With this
> patch in place we get a flake every couple of weeks, not every
> couple of days. Better ideas welcome..
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: liuhangbin@gmail.com
> CC: shuah@kernel.org
> CC: linux-kselftest@vger.kernel.org
> ---
>  tools/testing/selftests/net/rtnetlink.sh | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
> index 2e8243a65b50..b9e1497ea27a 100755
> --- a/tools/testing/selftests/net/rtnetlink.sh
> +++ b/tools/testing/selftests/net/rtnetlink.sh
> @@ -299,6 +299,11 @@ kci_test_addrlft()
>  	done
>  
>  	sleep 5
> +	# Schedule out for a bit, address GC runs from the power efficient WQ
> +	# if the long sleep above has put the whole system into sleep state
> +	# the WQ may have not had a chance to run.
> +	sleep 0.1
> +

How about using slowwait to check whether the address still exists? E.g.:

check_addr_not_exist()
{
	local dev=$1
	local addr=$2

	# Return 0 (success) once no address matching $addr is left on $dev.
	if ip addr show dev "$dev" | grep -q "$addr"; then
		return 1
	else
		return 0
	fi
}

	slowwait 5 check_addr_not_exist "$devdummy" "10.23.11."
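
For readers unfamiliar with this style of wait, the idea is to re-run a check
until it passes or a timeout expires, instead of sleeping for a fixed time.
Below is a minimal sketch of such a polling loop. The name poll_until and the
100 ms retry interval are assumptions made for illustration; this is not the
selftest library's slowwait implementation.

```shell
# Hypothetical stand-in for a slowwait-style helper (illustration only):
# re-run a check command every 100 ms until it succeeds or until the
# timeout, given in whole seconds, has expired.
poll_until()
{
	local timeout_sec=$1; shift
	local deadline=$(( $(date +%s) + timeout_sec ))

	while true; do
		# The condition holds: stop waiting early and report success.
		if "$@"; then
			return 0
		fi
		# Give up once the deadline has passed.
		if [ "$(date +%s)" -ge "$deadline" ]; then
			return 1
		fi
		sleep 0.1
	done
}
```

With a helper like this, the call above would read
poll_until 5 check_addr_not_exist "$devdummy" "10.23.11." and would return as
soon as the addresses are gone, failing only if they are still present when
the timeout expires.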

>  	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
>  	if [ $? -eq 0 ]; then
>  		check_err 1
> -- 
> 2.50.0
> 


* Re: [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance
  2025-07-11  2:14 ` Hangbin Liu
@ 2025-07-11 14:17   ` Jakub Kicinski
  2025-07-14  7:19     ` Hangbin Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Jakub Kicinski @ 2025-07-11 14:17 UTC
  To: Hangbin Liu
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, shuah,
	linux-kselftest

On Fri, 11 Jul 2025 02:14:03 +0000 Hangbin Liu wrote:
> >  	sleep 5
> > +	# Schedule out for a bit, address GC runs from the power efficient WQ
> > +	# if the long sleep above has put the whole system into sleep state
> > +	# the WQ may have not had a chance to run.
> > +	sleep 0.1
> > +  
> 
> How about using slowwait to check whether the address still exists?

Weirdly, if we read the addresses twice they disappear. I haven't looked
into the code to find out why, but it seemed like using slowwait could
potentially mask the addresses sticking around when nobody runs
the Netlink handlers for a while? Dunno..

I queued this debug patch a couple of months ago:

 	sleep 5
-	run_cmd_grep_fail "10.23.11." ip addr show dev "$devdummy"
+	ip addr show dev "$devdummy" > /tmp/a
+	run_cmd_grep_fail "10.23.11." cat /tmp/a
 	if [ $? -eq 0 ]; then
-		check_err 1
-		end_test "FAIL: preferred_lft addresses remaining"
+	    check_err 1
+	    cat /tmp/a
+	    echo "==="
+		ip addr show dev "$devdummy"
+		end_test "FAIL: preferred_lft addresses remaining ($lft)"
 		return
 	fi

And when it flakes the output looks like this:

# 7.23 [+7.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
# 7.23 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
# 7.23 [+0.00]     inet 10.23.11.81/32 scope global deprecated dynamic test-dummy0
# 7.23 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.23 [+0.00]     inet 10.23.11.84/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.93/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.94/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.97/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet 10.23.11.99/32 scope global deprecated dynamic test-dummy0
# 7.24 [+0.00]        valid_lft 0sec preferred_lft 0sec
# 7.24 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll 
# 7.24 [+0.00]        valid_lft forever preferred_lft forever
# 7.24 [+0.00] ===
# 7.25 [+0.00] 297: test-dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
# 7.25 [+0.00]     link/ether 9e:a6:c4:c2:1b:16 brd ff:ff:ff:ff:ff:ff
# 7.25 [+0.00]     inet6 fe80::9ca6:c4ff:fec2:1b16/64 scope link proto kernel_ll 
# 7.25 [+0.00]        valid_lft forever preferred_lft forever
# 7.25 [+0.00] FAIL: preferred_lft addresses remaining (1)


* Re: [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance
  2025-07-11 14:17   ` Jakub Kicinski
@ 2025-07-14  7:19     ` Hangbin Liu
  2025-07-14 22:30       ` Jakub Kicinski
  0 siblings, 1 reply; 5+ messages in thread
From: Hangbin Liu @ 2025-07-14  7:19 UTC
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, shuah,
	linux-kselftest

On Fri, Jul 11, 2025 at 07:17:29AM -0700, Jakub Kicinski wrote:
> On Fri, 11 Jul 2025 02:14:03 +0000 Hangbin Liu wrote:
> > >  	sleep 5
> > > +	# Schedule out for a bit, address GC runs from the power efficient WQ
> > > +	# if the long sleep above has put the whole system into sleep state
> > > +	# the WQ may have not had a chance to run.
> > > +	sleep 0.1
> > > +  
> > 
> > How about using slowwait to check whether the address still exists?
> 
> Weirdly, if we read the addresses twice they disappear. I haven't looked
> into the code to find out why, but it seemed like using slowwait could
> potentially mask the addresses sticking around when nobody runs
> the Netlink handlers for a while? Dunno..

Not sure if I understand correctly. Do you mean the addresses will stay there
if we use slowwait?

Thanks
Hangbin



* Re: [PATCH net] selftests: rtnetlink: try double sleep to give WQ a chance
  2025-07-14  7:19     ` Hangbin Liu
@ 2025-07-14 22:30       ` Jakub Kicinski
  0 siblings, 0 replies; 5+ messages in thread
From: Jakub Kicinski @ 2025-07-14 22:30 UTC
  To: Hangbin Liu
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, shuah,
	linux-kselftest

On Mon, 14 Jul 2025 07:19:09 +0000 Hangbin Liu wrote:
> > > How about using slowwait to check whether the address still exists?
> > 
> > Weirdly, if we read the addresses twice they disappear. I haven't looked
> > into the code to find out why, but it seemed like using slowwait could
> > potentially mask the addresses sticking around when nobody runs
> > the Netlink handlers for a while? Dunno..
> 
> Not sure if I understand correctly. Do you mean the addresses will stay there
> if we use slowwait?

No, I mean there may be false negatives, not false positives.
But maybe it's fine, it will definitely prevent flakes.
Could you post the slowwait patch officially?
-- 
pw-bot: cr

